Vdoc - Pub The Method of Maximum Entropy
Vdoc - Pub The Method of Maximum Entropy
T h e M e t h o d of M a x i m u m E n t r o p y
H. Gzyl
Even though to err is human, some "misprints" are plain dum, if not worse.
Here are some of the worst t h a t I caught a bit too late.
[Hfi/h)dQ
4. P. 82, line 6 from below, I should have written
qi = q2 = q3 = q4 = 1/4.
THE METHOD OF
MAXIMUM
ENTROPY
This page is intentionally left blank
Series on Advances in Mathematics for Applied Sciences - Vol. 29
T H E M E T H O D OF
MAXIMUM
ENTROPY
Henryk Gzyl
Facultad de Ciencias
Universidad Central de Venezuela
World Scientific
Singapore >New Jersey London* Hong Kong
Published by
World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 9128
USA office: Suite IB, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
For photocopying of material in this volume, please pay a copying fee through
the Copyright Clearance Center, Inc., 27 Congress Street, Salem, M A 01970, US A.
This book is an outgrowth of a set of lecture notes on the maximum entropy method delivered in
1988 at the 1st Venezuelan School of Mathematics. This yearly event aims at acquainting graduate
students, and university teachers with trends, techniques and open problems of current interest. It
takes place during the month of September at the Universidad de los Andes, in the city of Merida.
At the same time I was being invited to give lectures, Didier Dacunha-Castelle passed by
and reported on his work on the subject. This happened not long after some astronomers friends
of mine from the CEDA (also in Merida) had asked me to go with them over some methods for
reconstructing images based on a maximum entropy procedure. So what else was left for me to do
but collect material for that course?
The more I looked around, the more applications of the method I found. My original goal
was to organize the material in such a way that the underlying philosophy of the method became
transparent and to try to understand myself why it works. I hope to convey some of that to you,
even though some of the whys are still a mystery (at least to me).
v
This page is intentionally left blank
Table of Contents
PREFACE v
CHAPTER 0
Introduction 1
CHAPTER 1
Basic Concepts from Probability Theory 7
CHAPTER 2
Equilibrium Distributions in Statistical Mechanics 17
CHAPTER 3
Some Heuristics 23
CHAPTER 4
Entropy Functionals 27
1. Basics 27
2. Entropy inequalities 35
3. Axiomatic characterization of entropies 39
CHAPTER 5
The Method of Maximum Entropy 41
1. Kullback's and Jaynes' reconstruction methods 41
2. Czizar's results 44
3. Borwein and Lewis' extensions 49
4. Dacunha-Castelle and Gamboa's approach to level 2 M.E.M. 49
CHAPTER 6
Applications and Extensions 61
1. Entropy maximization under quadratic constraints, or constraint relaxation 61
2. Failure of maximum entropy methods for reconstruction in infinite systems 63
3. Some finite dimensional, linear reconstruction problems 67
4. Maxentropic approach to linear programming 72
5. Entropy as Lyapunov functional. Further comments 75
6. Solving matrix equations 78
7. Estimation of transition probabilities 80
8. Maxentropic reconstruction of velocity profiles 83
vii
viii Contents
INTRODUCTION
The Method of Maximum Entropy is an offspring of the Maximum Entropy Principle introduced
in 1957 in statistical physics by E. Jaynes. That principle has the esthetic appeal of all variational
principles in physics and its basic role is to characterize equilibrium states. It works as follows: a
functional which is a Lyapunov function for the dynamics of the system is defined on the set of
states. It is postulated that the equilibrium states are those yielding a maximum value of the
functional, compatible with a set of given values of some extensive variables.
In chapter 2 we shall explain these things further, here we direct the reader to [0.1] where
the original papers by Jaynes are reprinted together with many other interesting ones.
Actually, the possibility of characterizing probabiUty densities by variational methods was
already noticed by statisticians and information theorists well before 1957. Take a look at [0.2],
especially chapter 3. By now, the list of probabiUty distributions derived via the maximum entropy
method is pretty long. We can go back even further. The germ of the idea is already presented in
Boltzmann's writings. See [0.24] in the commemorative volume dedicated to his life and work.
The germ of the idea is also present in Gibb's work. See the paper on the Camot's principle by
Jaynes in [0.7].
See the volume [0.3] by Kapur which devotes several chapters to characterization of
standard probability densities by maximum entropy methods. Besides, a large variety of
appUcations in which the notion of entropy enters is presented. And, speaking of appUcations, we
bring the collection [0.4]-[0.11] to the reader's attention, where not only many appUcations of the
MEM. are collected, but a lot of space is dedicated to foundational matters, and to explain the
word "Bayesian" in the title of many of the volumes.
To explain the general philosophy of the M.E.M., and the underlying common approach of
the long list of successful applications, let us begin by saying that many inverse and/or direct
problems, aU of which we shall call reconstruction problems, lead to searching for solutions to
equations like
(0.1) Ax=y
where A: V, -> V2 is a linear transformation between two appropriate vector spaces V, and V2 ,
and we may be looking for solutions in some cone (or convex set) C,cV, while the data Ues in
/
2 The Method ofMaximum Entropy
(0.2) S: C,->5R
which will be the entropy functional and instead of solving (0.1) one sets up the following
maximization problem
thus, we see that if x* is such that S(x*) reaches a maximum value, and the constraint is satisfied,
one automatically has a solution to (0.1). The beauty about (0.3) is that many times solving (0.3)
is equivalent tofindinga minimum of a convex functional
(0.4) H(X)=lnZ(X)-KX,y)
where Z(X) will make its appearance below. In general H(V) is defined on a convex D c V\, and
when we are lucky D= V2 H(X) is some sort of dual to S(x) although not quite. Physicists
have nice interpretations for it. In (0.4) the X are the Lagrange multipliers for (0.3).
We shall write (x,y) for the scalar product of vectors x, and y , and when x e V and
A,eV, (X,x)s X(x) as usual.
The value A.* of the X that makes x(A.) a solution x* to (0.3) is obtained by minimizing the
convex functional H(X), which depends on as many variables as there are equations in (0.1).
We have thus transformed solving linear problem with more unknowns than equations into
solving a smaller minimization problem, hopefully without constraints.
Many of the initial applications consisted in looking for positive probability densities
yielding prescribed mean values for a finite collection of functions, taking S as the
Gibbs-Boltzmann entropy associated to a density was natural.
In 1967 Burg proposed another entropy functional which has proven very useful for
reconstructing densities when information about time series is given by a few correlations. We
shall come back to this below.
At what we call a level 2 reconstruction problem, the M.E.M. enters the following way:
On some appropriate measurable space (£1,3), see chapter 1, we consider a class P of probability
Introduction 3
measures, possibly absolutely continuous with respect to some fixed, preassigned a priori
measure, and a family of random variables X: fi->V,. Thus if PeP, the expected value of X
with respect to P, Ep (X), is an element i s V , . Instead of considering equation (0.1) we shall think
of random variables AX with expected value
E/>AX=AEPX=y
Instead of solving (0.1) we will search for measures satisfying EpAX=y. Now this becomes
a level 1 problem on a different space namely, we want to find
The rest of the comments made when describing the level 1 version of the MEM. apply
here as well.
I have been able to trace this approach to reconstruction problems at least to the work of
Rietsch [0.12], where it was applied to reconstruct the earth's density given its mass and moment
of inertia.
After recalling some basic notations and definitions from probability theory, in chapter 1,
we devote chapter 2 to a watered down presentation of equilibrium statistical mechanics. Chapter
3 consists of some heuristic arguments backing up the M.E.M..
In chapter 4 we introduce the most used entropy functionals and examine some of their
properties.
Finally, it is in chapter 5 where the MEM. is explained. There we borrow from the
important work by Cszizar, Dacunha and Gamboa, and make a few comments about the work by
Bowrein and Lewis.
Surely the appeal of the MEM. has to do with its success in a large variety of
applications, in some of which an entropy like concept is natural. But in many cases it is just
something you pull out of your hat and it solves a problem for you. It may be this fact what
prompts much of the work of explaining what the MEM. is about. Besides that, there is the
appeal of the concept of entropy that comes through the second law of thermodynamics in
understanding irreversibility; see [0.13]-[0.14]. To see how entropies help to understand issues
related to self-organization and/or chaotic behavior as explained in [0.14]-[0.16]. Some uses of
the notion of entropy in biology and economics have generated strong and sarcastic criticism. See
[0.17]-[0.18] and the reviews in [0.19]-[0.20].
An interesting collection in which the thermodynamic notion of entropy plays a role is
compiled in [0.21] and, of course, we should not fail to list at least one reference on the use of the
4 The Method ofMaximum Entropy
concept of entropy in information theory, [0.22] and in the theory of dynamical systems [0.23],
connections between entropy, complexity and several quantum issues are reviewed in [0.25],
It will be up to the reader to decide whether there is or there is not a common thread in
this list of references, many not directly related to the M.E.M., which explains its appeal beyond
the mere: it just works.
To conclude, it must be clear that we are citing references by square brackets, numbered
almost always by order of appearance, listed by chapter. Also, formulae, definitions and results
will be cited sequentially in each chapter within round brackets.
I would like to thank my colleague Aldo Tagliani for writing sections (6.19)-(6.21).
And last, but not least, my thanks go to Ms. Leda Calderon, who typed the manuscript
and went along nicely with my changing of mind now and then about a paragraph here and there.
The editorial staff at WSP did a fabulous job weeding out uncountable mispirints
To finish I want to acknowledge the support of the Facultad de Ciencias, U.C.V., and of
CONICIT forfinancialsupport during the preparation of the book.
Two references which I obtained during a brief visit to the CWI in Amsterdam, and added
at the last minute are [0.25]-[0.26].
REFERENCES
[0.1] E. T. Jaynes: Papers on Probability, "Statistics and Statistical Physics". Ed. Rosenkrantz,
E.D. Kluwer Acad. Publi., Dordrecht, 1983.
[0.2] Kullback, S. "Information Theory and Statistics". Dover Publi., New York, 1968.
[0.3] Kapur, J. N. "Maximum Entropy Models in a Science and Engineering' John Wiley,
New York, 1989.
[0.4] Justice, J. H. (Eds) "Maximum Entropy and Bayesian Methods in Applied Statistics"
Cambridge Univ. Press, 1986.
[0.5] Ray Smith C. and Grandy, W. T., Jr. (Eds) "Maximum Entropy and Bayesian Methods in
Inverse Problems". D. Reidel Publi. Co., Dordrecht, 1987.
[0.6] Ray Smith C. and Erickson, G. J. (Eds) "Maximum Entropy and Bayesian Spectral
Analysis and Estimation Problems" D. Reidel Publi. Co., Dordrecht, 1987.
[0.7] Erickson, G. and Ray Smith C. (Eds) "Maximum Entropy Methods and Bayesian Methods
in Science and Engineering": Vol I-Foundations. Kluwer Acad. Publi., Dordrecht, 1988.
[0.8] Erickson, G. and Ray Smith C. (Eds) "Maximum Entropy Methods and Bayesian Methods
in Science and Engineering". Vol II-Applications. Kluwer Acad. Publi., Dordrecht, 1988.
[0.9] Skilling, J. (Eds) "Maximum Entropy and Bayesian Methods" Kluwer Acad. Publi.
Co.,Dordrecht, 1989.
Introduction 5
[0.10] Fougere, P. F. (Eds) "Maximum Entropy and Bayesiam Methods" Kluwer Acad. Publi.
Co., Dordrecht, 1990.
[0.11] Grandy, W. T., Jr. and Schick, L. H. (Eds) "Maximum Entropy and Bayesian Methods"
Kluwer Acad. Publi., Dordrecht, 1991.
[0.12] Rietsch, E. "A Maximum Entropy approach to inverse problems". Joum. of Geophysics.
42, pp. 489-506, 1977.
[0.13] Atkins, P. W. "77K; Second Law" W. H. Freeman. New York, 1984.
[0.14] Prigogine, I. and Stengers, I. "Order out of Chaos". Bantham Books, New York, 1984.
[0.15] Klimontovich, Yu. L. "Turbulent Motion and The Structure of Chaos" Kluwer Acad.
Publi., Dordrecht, 1991.
[0.16] Mackey, M. C. "The Origin of Thermodynamic Behaviour" Springer Verlag, Berlin,
1992.
[0.17] Rifkin, J. "Entropy: Into the Greenhouse World'. Bantham Books, New York, 1989.
[0.18] Brooks, D. R. and Wiley, E. O. "Evolution as Entropy".Univ. of Chicago Press, Chicago,
1988.
[0.19] Morowitz, H. "Entropy Anyone"., in "Mayonnaise and the Origin of Life" Berkeley
Books, New York, 1985.
[0.20] Rothman, T. "Science a la Mode: Physical Fashions and Fictions". Princeton Univ. Press,
Princeton, 1989.
[0.21] Leff, H. S. and Rex, A. F. "Maxwell's Demon. Entropy, Information, Computing"
Princeton Univ. Press, Princeton, 1990.
[0.22] McEliece, R. J. " The Theory of Information and Coding" Vol 3, Encyclop. Math.,
Addison - Wesley, Reading, 1981.
[0.23] Martin, N. F. G. "Mathematical Theory of Entropy". Vol 12, Encyclop. Math.,
Addison-Wesley, Reading, 1981.
[0.24] Klein, M. J. "The development of Boltzmann's statistical ideas". ActaPhys. Aust. Suppl.
X, pp 53-106, 1973.
[0.25] Gelfand, I. M. and Yaglom, A. M. "Calculation of the amount of information about a
random function contained in another such function". Amer. Math. Soc. Transl. Series 2 ,
A.M.S., Providence, 1959, pp. 199-246.
[0.26] "Maximum Entropy and Bayesian Methods" 3 volumes edited by Grandy, W. T. and
Schick, L. H.; Ray-Smith, C. et al; Mohamed-Djafary, A. and Demoment, G. Published by
Kluwer Acad. Publishers respectively in 1991, 1992 and 1993 in Dordrecht, Holland.
Chapter 1
We will recall some basic concepts from measure theory and from probability theory. The
purpose of this chapter is to provide applied and other scientists with some standard vocabulary.
Most of the concepts and results are intuitive and obvious at times, even though the proper names
are not widely known.
A measurable space (E,$) consist of a set E and a a -algebra ^ of subsets of E. In ?are
the sets to which we will assign a measure (or, later on, a probability). It is a collection of subsets
of E closed with respect to:
i) taking complements: if As W then E-A=AC e %
ii) forming denumerable unions: if {A„s ^, n > 1} then uA„e $!
These sets operations when viewed abstractly, correspond with the logical operations not
and or This is what makes a-algebras a convenient realizations of events to which we want to
assign probabilities.
A measure m on a measurable space (E,?) is a function m: % ->[0,oo) satisfying
(1.1) v
m(uAn)=T.m(A„)
I ' i=i
where {A^: n> 1} is any countable collection in ? such that A, nAj=0. We add here that instead of
[0,oo) the range of values of m can be taken to be any space X on which an additive operation and
a notion of convergence are defined such that the right hand side of(1.1) exists. In such cases one
says that m is an X-valued measure.
Consider two measurable spaces (E,W) and (F,<^. We shall say that a function X: E-»F is
^/immeasurable if
X'(A)={x: X(x)eA}={XeA}e? for any A e s F
The a-algebra 5(9?) = B generated by the open (or the closed, or the compact) subsets of
9l=(-oo,oo) is called the Borel a-algebra, since real valued functions appear all the time, one
usually writes Xe ^instead of Xe W/B.
We shall say that f=g a.e(m) (almost everywhere with respect to m) whenever {f >g or
f<g}={x: f(x)>g(x) or f(x)<g(x)} is such that m({f>g or f<g})=0.
It is left for the reader to verify that the basic arithmetic properties, performed on Borel
measurable functions yield Borel measurable functions, that is linear combinations, products,
quotients (whenever defined), infima, suprema of measurable functions are measurable.
7
8 The Method ofMaximum Entropy
For this to make sense it is required that the right hand side converges. This is easily
achieved when all exceptfinitelymany of the A„ are empty.
The second step is to realize that any positive Xe f can be approximated by an increasing
sequence of simple functions, that is X=lim X„ where X,, is simple,
Now define
is finite and write XeL, (E,?,m) (or XeL, whenever E,% and m are understood from the
context). Also we say that XeL p whenever j(X)p dm is finite.
Let m and n be two measures on (E,?). We shall say that m is absolutely continuous with
respect to n, and write m « n , if whenever n(A)=0 for Ae % then m(A)=0. In this case there exists
pe% such that
and note that since X"1: #"—»<?' preserves set operations, n is a well defined measure on (F,?).
Also, using (1.6) one can prove (going stagewise from simple functions onwards) that for any
positive measurable Y: F—»5R
\Y(y)n(dy)= \Y(X{x))m(dx).
Let us proceed to rewrite some of the former in terms of probabilistic language. We shall
say that (fl.d^P) is a probability space if (il,d&) is a measurable space and P is a measure with
range [0,1], such that P(fi)=l.
The points w in Q. are called elementary events, they may or may not be such that {m} is in
d^"The usual interpretation of oo is that it represents an experiment (sequence of measurements in
continuous or discrete times) performed on some system. The elements of d^~are called events
and, questions about experiments are described by the set operations of union, intersection and
complementation.
A (real valued) random variable X is a (Borel) measurable function defined on (Cl,d^.
From now on we shall consider a given (Q.d^P). Let X be a (real valued) random
variable. The distribution function of X is defined to be the function F x : SR ->[0,1]
Fx(x) = P(X<x)
and when X : Cl -» 9?" is such that each component function X^ is measurable, we define
Fx(X) = P(X1<xl,...,Xn<xn).
From now on we shall follow the standard notational convention and write
where X : (£l,d&) -> (E a ,? a ) are (Ea -valued) random variables, I is some set of indices and Aa
sE^for all I. (Warning: the set described above may not be in <#when I is not countable.)
Assume that X is an integrable random variable. We introduce the symbols
10 The Method ofMaximum Entropy
EY^{Y)= lY((o)dP((o)
When m is a measure on 9?", we shall say that then 9?"-valued X has a density p with
respect to X„ if the measure induced on M" by F x is absolutely continuous respect to m and
dFx/dm=p. Then, for any bounded measurable G: 91" -> 9?
£G(X) = jG(x)p(x)rf/n.
(1.9) P(A1r^A1)=P(Ai)P(A2)
for any A(€<rf^, A^ed^ .
This notion extends trivially to any countable family of a-algebras. When X,e^/^„ X;S
d^/<?2 we say that they are independent if
W,)gC*2) = W i ) W»)
for any bounded measurable functions defined on E, ,E2. We leave for the reader to verify that if
X, and Xj are uncorrelated, i.e.
£TiJf2 = £^,£^2
Basics ConceptsfromProbability Theory 11
they may not be necessarily independent. Also, for independence, it suffices to verify that
The proof of the following lemma, asserting some basic properties of conditional
expectations is left for the reader.
Lemma 1.12. The notations are as above.
a)IfX>0,E[X|G]>0.
b) If X, is bounded, G-measurable, E[X,X|G]=X,E[X|G].
c) Let g be a bounded function on SR2 and X, e G, then
ElgiXX^G] = E[G{X,z)\G]^
where z is a constant.
d) E[E[X|G]]=EX.
e) For a,, a, in « , Efa^+ajXJG] = a^fXJGl+aJEfXjIG].
f) If {XJ is either a monotonic sequence of positive functions or a uniformly bounded,
pointwise convergence sequence, E[lim XJG]=lim E[XJG].
g) If the a-algebras a(X) and G are independent, then
E[X\G] = E[X].
E[£L¥lG2]lGi].
is an orthogonal projection onto L2(Q.,G,P). This result is important when dealing with Gaussian
processes and computing predictors optimal in quadratic distances.
Let us now recall some important factorization results.
Lemma 1.13. Let Y: (E,?)->(E',W ') be a W '-measurable and then X: E->91 is
measurable a(Y)/B if and only if there exists g: E'-> 9J, ^'/5-measurable such that X=g(Y).
Therefore, when G in definition (1.10) is o(Y), then there is hx: E'—> 9? such that
E[X\o(Y)] = hx(Y).
Ef/W\y\= \Ax)N(dx,Y)
Comment. Usually life is good with us and there is a measure m(dx) on E with respect to
N(.,y) which is absolutely continuous with a jointly measurable density n(x,y) then
E[f(X)\Y\ = \Ax)n(x\y)m(dx)
E[f{X)\Y=y] = lfix)n(x\y)m(dx)
Basics ConceptsfromProbability Theory 13
is usually employed. See [1.3] for nuances about constructing kernels and [1.1] or [1.4] for the
necessary measure theoretic results.
We shall need in chapter 4 a slight extension of the conditional expectation operator to the
case when (£2,<^jj.) is such that n is a cr-finite measure with a(£2)=+°o.
Let G c«P"be a sub-o-algebra. Let feL,(u,), and assume that (Q,G,n) is a-finite. Let us
state
Definition 1.16. We shall denote by EJf]G] the unique (up to appropriate sets of measure
zero) element of L,(n) such that
\gfd\x. = lgE»[f\G]d
Proof: (i) jfg du = j(g/h,) fdP, = J(g/h,)E,>, [f|G]dP, = jgE/., [f|G]du. Part (ii) follows
from (i) by taking g=(E/>, [f]G]-E/>2 [f]G]) and using the fact that Jg2d|j.=0 implies g=0 a. e. \x..
'Lei us consider some simple examples. To begin with note that if G={0,n} is the trivial
a-algebra, then
E[X\G] = E[X\
for any integrable or positive X. Assume now that G is the a-algebra generated by a partition
{A,.:k>l} of £2. That is, its elements are countable unions of sets of {\: k>l} and any
G-measurable function is of the type Z akhk for appropriate constants c^. In this case E[X|G]
must be something like
E[X\G] = Z.akIAk
and we have to determine the <\. For that, multiply both sides of the identity by I^for some j and
use (1.11) to obtain
/4 The Method ofMaximum Entropy
ak E[X;Aj]
and correspondingly
that is, the function on the right-hand side is constant on the sets {Y^eJ taking the value
E[X|Y=e,] as specified above.
The other very common case is the following. Let X and Y be respectively 5R" and SRm
valued random variables such that the distribution of (X Y) has density p(x,y) with respect to the
product Lebesgue measure dx dy on 9?"*m. From the factorization lemma above, we know that the
bounded random variables, measurable with respect to cr(Y) are of the form g(Y) for g:9?" -> 9J.
Therefore computing both sides of (1.11) we see that for bounded f: 9?" —»9?"
EWftgWl = lAx)g(y)p(x,y)dxdy
where we denoted by hx(y) the function introduced right after Lemma (1.13). Since both sides are
equal for any g(y) we conclude that
(1.18) p(X\Y)=($(,x,Y)dxylp(x,Y).
For the sake of future reference, we shall now present two variations on the theme of
Bayes formula, which are at the basics of both applications and interpretations of the maximum
entropy method.
In what follows, we shall denote the intersection C, oC 2 of two sets C, and C2 by C,C2.
Let {AJ be a countable partition of tl, i.e., a denumerable exhaustive collection of mutually
exclusive events. Then for any events B, C
P(B\Q= Z,P(At\BQP(B\C)
and also
Basics ConceptsfromProbability Theory 15
since
P(A,\BC) = P{B\AtC)^
which is known as Bayes identity. Substituting P(B|C) by the second summation displayed above
we have
In this identity P(AJC) is interpreted as the "a priori" probability of Aj given C, taken to
describe the knowledge we have about the event Aj given the preliminary information in event C.
The left-hand side tells us how much does our knowledge about A change when we collect the
information contained in event B. Theright-handside is the recipe for computing the change. The
left-hand side is called the " a posteriori" probability of \
REFERENCES
[1.1] Bauer, H. "Probability Theory and Elements of Measure Theory". Holt, Rinehart and
Wilson, Inc., New York, 1972.
[1.2] Gihman, I. I. and Skorohod A. V. "The Theory of Stochastic Processes F
Springer-Verlag, New York, 1974.
[1.3] Getoor, R. K. "On the Construction of Kernels" Led. Notes in Math. N°465, pp.
443-463, Springer-Verlag, Berlin, 1977
[1.4] Rudin, W. "Real and Complex Analysis". McGraw-Hill, New York, 1966.
Chapter 2
Statistical physics owes its birth to the inconvenience and impossibility of describing
systems having very large numbers of particles by specifying the behavior of each individual.
Loosely speaking, the aim of statistical physics is to describe the "collective" or ''macroscopic"
properties of a system of particles in terms of appropriate averages of its "microscopic" motions.
The words in quotation marks are the key ones. Macroscopic refers to the ''properties of
the system as a whole", the properties "visible by the naked eye". Microscopic means to the
description based on the exact evolutions laws (classical or quantum) describing the motions of
the particles.
Even though today's supercomputers can follow the individual motions of large numbers
of independent particles, there is yet nothing able to handle 1023 individuals.
To begin with, we will only consider systems to which equilibrium thermodynamics
applies, i.e., systems whose external, macroscopic changes are very slow compared to the
microscopic, internal motions and can be considered at every instant to be in equilibrium.
One of the cornerstones in physics, important for the outlook at the world that it provides,
is the second law of thermodynamics. According to this law, whenever an isolated system evolves
its entropy can only increase and, when an equilibrium state is reached its entropy attains the
highest possible value compatible with the values of the macroscopic extensive parameters of the
system.
I want to emphasize that this is not the standard formulation, see [2.1] or [2.2], for I have
made the characterization of the equilibrium state part of the statement of the second law.
Actually, when the entropy functional happens to be a Lyapunov functional for the evolution law
of the system, the characterization of equilibrium states as maxima of the entropy functional is
obvious. See section (6.5) for some more on these issues.
The connection between the macroscopic and microscopic descriptions can be traced
down to ideas of Maxwell, Boltzmann and Gibbs. The ingredients involved depend on the kind of
system under study: how do we describe its microscopies states and the changes of state, i.e., its
evolution on one hand and, on the other, on the method we choose to average over the
microscopical states. To be more specific, we may consider either classical or quantum
description of the particles making up a system and in the latter case, we may consider the
17
18 The Method ofMaximum Entropy
particles to be distinguishable or not. But regardless of all these subtleties, the basic philosophy is
always the same.
The presentation that follows is essentially due to Jaynes [2.3] see also [2.4].
Consider, to simplify as much as possible, a system whose states can be described by a
countable set E, which are unaccessible to observation. The microscopic observables describing
the properties of the system are described by real valued functions
F: £->5R.
The basic assumption is that the macroscopic values of these observables are obtained as
average values
<f>= \F(t)Pu
ieB
where the Pi are the probabilities offindingthe system in state i e E, or the fraction of time the
system spends at state i when it is in equilibrium. With all these ingredients we state the
The state of equilibrium is characterized by the assignment of probabilities {P,: isE} such
that
attains its maximum possible value among all distributions {Pt -. ie E} such that
(23) P,*=^exp41^,0),
Observe that S({P;}) defined by (2.1) is a concave function defined on the convex set
P={ {P,: ie E, ZPpl}} of all probability measures on E. Also, ZA(X) is only defined on a subset of
5RM specified by
Equilibrium Distributions in Statistical Mechanics 19
DA = {XeW": ZA(X)<oo}.
When E isfinite,DA= 5RM but when E is notfiniteDA is only a convex subset of SRM One
should always know how big the set DA is.
Anyway, when {P^} given by (2.3) is substituted in (2.1) we obtain
which is convex on DA(X). We shall verify these assertions below, in a more general setting. We
have set
M
(A,a) = Z X,a,
That is, the value X' at which H(X) reaches its minimum is that of the Lagrange multiplier
which insures that {P;*|ieE} given by (2.3) satisfy the constraints (2.2). Note that the number M
may be much smaller than the cardinality of E, and H(X) is convex. These are the two facts that lie
behind the appeal of the M.E.M..
We shall remark as well that the condition for X' to be a minimum, namely that
d^H (X' )/dX,dXj be a positive definite form has strong consequences in thermodynamics. See
[2.2] and [2.4]. In terms of covariances of the observables A,, it looks like this: the matrix
is (obviously) positive definite. Actually, in some cases the function H(X) may be very flat near X'
making the numerical search for X* hard (especially for large M).
A similar procedure can be followed to obtain the probability distribution in the (simplest)
quantum case. For that we assume we are studying a system of noninteracting particles, each of
which may be found in any of the states of a denumerable set. To distinguish between the two
classes of identical particles we introduce two possible sets of states for the system.
We shall assume that the bosons have as (denumerable) set of states the set
20 The Method ofMaximum Entropy
In each case we shall assume there are two observables whose average values are
accessible to us, namely
E:H^>% £(v) = E E(i)y(i)
teS
(2.5)
N.H^m, JV(\|/) =E \|/(0
where E: S-» SFt is to be interpreted as a microscopic energy of the individual quantum states.
Now (2.1) looks like
Z(X)= £ exp(-^(v|/)-^(H/))
Z(X) = E exp-i E ^ O M O + S ^ v K O
= I, r i e x p - i p ^ O + ^ M O
K
ye.HbieH
=E £ « P ~ 1(^(0+Xa)
= n ( l - C e x p -(40)/*?))- 1
where in the last line we set C =exp-(A_/k) and A., = 1/T to conform with standard notation in the
physical literature.
Equilibrium Distributions in Statistical Mechanics 21
To compute the partition function for the fermionic case, we proceed in the same way,
except that now instead of summing over all n as in step 3, we sum only over n=0 and n=l
obtaining
exp--iffl/kT)).
Z(X) = n (1 + qc, cxp-£(0/M)).
Z(X).
ieff
I'e//
The point of these exercises was to show how the specification of the set of states and the
choice of the observables determine the final result.
To complete this mock approach to equilibrium statistical mechanics we shall verify that
the functional (2.1) is a Lyapunov functional. We shall assume that the probability distributions
{Pj(t): ieE} evolve in time according to
(2.7) j.iptP(i)=
W = -LPjPj,-P,P,j
XPJPJ,-P,PiJ
i
starting from some initial distribution P,(0). The P,j are given in advance and we assume them to
satisfy the microscopic reversibility condition P^P,,. If we compute the entropy of {Pt(t)}
according to (2.1) then its rate of change satisfies
f =
= -/tz -kXiKW-kZJf,
=. - * I -kI.P IJ{P
-kI.P 1-P,)\nP
IJ{P 1-P,)\nPl
V
V
=
-n
= --1^P.jiPj-P,)
T.PtffJPd \nP,
InP, +
+ ff ZP,j(P,
ZP(,(P, -- Pj) \aPj
InP;
=
-ffvt -^P-^Pj(Pj-P
j(Pj-P)la(P
i i )la(P/P
ii /P )
ii j
where we used the symmetry of Pj and the fact that SP,(t)=l to go from the first line to the
second, a simple symmetrization to go from the second to the third line. It is clear that the last line
is always positive. (Verify that (l-s)lns is always negative for s > 0.)
The interesting fact here is that in equilibrium, a distribution Pe such that the left-hand side
of (2.7) vanishes, provides a local maximum for S({P,}). In his book [2.5] Gibbs somehow
postulates a distribution like (2.3) as an equilibrium distribution for the evolution provided by the
Hamilton (Newton) equations of motion in phase space.
It was Boltzmann who used (2.1) as a Lyapunov function for his equation describing the
time evolution of particle a density function. See the reprint [2.6], And, as we said in the
22 The Method ofMaximum Entropy
REFERENCES
[2.1] Atkins, P. W. "The Second Law" W. H. Freeman and Co., New York, 1984.
[2.2] Callen, H. "Thermodynamics and Introduction to Thermostatistics" John Wiley, New
York, 1985.
[2.3] Jaynes, E. T. "Information theory and statistical mechanics" Phys. Rev. 106. pp. 620 -
630, 1957. See [0.1] for more along related lines.
[2.4] Tribus, M. "Micro and macro-thermodynamics" Am. Scientist 54, No.2, pp. 201-211,
1966.
[2.5] Gibbs J. W. "Elementary Principles in Statistical Mechanics". Dover Books, New York,
1960. Reprint of the 1902 U. of Yale Edition.
[2.6] Boltzmann, L. "Lectures on Gas Theory". California Univ. Press, Berkeley, 1964.
[2.7] Wehrl, A. "General properties of entropy" Rev. Mod. Phys. 50, No.2, pp. 221-260,
1978.
[2.8] Lindblad, G. "Non-Equilibrium Entropy and Irreversibility" D. Reidel Publishing Co.,
Dordrecht, 1983.
[2.9] Gzyl, H. "A unified presentation of equilibrium distributions in classical and quantum
mechanics". Ann Inst. Henri Poincare. 32, 1980.
[2.10] Jaynes, E. T. "The minimum entropy production principle" Ann. Rev. Phys. Chem. 3_1,
pp. 579-601, 1980 (Reprinted in [0.1]).
[2.11] Garcia-Colin, L. S. "Entropy and irreversibility macroscopics issues". Rev. Mex. Fisica.
Supl. 1, pp. 198-201, 1992.
Chapter 3
SOME HEURISTICS
Here we follow Jaynes [3.1] and Papoulis [3.2] in developing some heuristics that sheds
light on the concept of entropy and on the method of maximum entropy.
Let X be a discrete random variable taking n values x, ,Xj .....x,, and let p, be the
probabilities of observing the events A,= {X = x j . We shall write
N, being the number of times the value x, appears in a run of N independent observations
ofX.
A different way of looking at (3.2) is the following. Each of the possible results of
observing X N consecutive times is an element of EN=Ex...xE, E={x„...,x0}. If teE N then
where X,,...,XN are independent copies of X and N, denotes the number of times the value x( is
repeated in the listt,,...,!^.
Now, if N is large enough so that Pj~N,/N then
P(t) =p"i"../„>" = expAr S pi \npi = exp -NH(X).
If we say that the configuration t is typical whenever N,~NP,, it follows that the number
of typical configurations W(typ) given by
23
24 The Method ofMaximum Entropy
H(X)^^\nW(typ).
Below, and in the next chapter, we shall discuss these relations in some detail. For the time
being let us compare the number of typical configurations for two distributions corresponding to a
random variable having six possible outcomes, that is a die.
The first distribution {p,,...,pn} is the distribution that maximizes
6
H(pu...,p6) = -T. pi \np,
subject to the constraint
Z iP, = 4.5
i S ' = 3.5.
and with the aid of a computer one obtains the X such that Sip; = 4.5. It is X= -0.37105 from
which it follows that
also satisfies Sip =4.5 for p=0.7. But this distribution has entropy
(3.7) #=1.4136.
Some Heuristics 25
The quotient of the two numbers of typical configuration corresponding to (3.5) and (3.7)
is given by
eNHm
= eM
e
A H ) r e ee2 0 0 >>1 0 8l u4
^m ~
IVW~ ~gNH ~
^ = 38220.
That is, the number of microscopical configurations (i.e., chains of 50 throws) for which
thefrequenciesN/N correspond to the maximum entropy distribution (3.5) is 38220 times more
frequent than the microscopic configurations corresponding to the distribution (3.6).
This is a statement about how different assignments of probabilities are reflected in the
outcome of an experiment.
In a statistical physics, in systems with 1020 particles the use of asymptotic methods is
quite justified. This is the reason why, in almost the first line of any book in statistical physics, the
following typical phrase appears: the number of microscopical states accessible to the system,
compatible with its macroscopic constraints is '
W=expSlk
which is equivalent to assert how disordered a system is, or equivalently, the higher the entropy,
the higher the more states it can occupy. By the way, according to the third law of
thermodynamics, which asserts that at 0°K the entropy vanishes i.e. S=0. Therefore at 0°K there is
only 1 configuration available to the system.
To read about Boltzmann's ideas about these subjects check with [0.24] and with Jayne's
essay in [0.7].
Later on we shall see how does the entropy concept appear with the theory of large
deviations. The issue is to find the bridge between the two aspects of probability assignments.
To finish, I cannot but help directing the reader to chase back from [3.3] where an
application of the maximum entropy method to find "dishonest" dice is explained.
REFERENCES
[3.1] Jaynes, E. T. "On the rationale of maximum entropy methods". Proc. IEEE 79, No.2, pp.
939-952, 1982.
[3.2] Papoulis, A. "Maximum entropy and spectral estimation". IEEE Trans. Acoust. Speech
and Signal Processes. ASSP- 29, No.6, pp. 1176-1186, 1981.
[3.3] Fougere, P. F. "Maximum entropy calculations on a discrete probability space". See [0.9].
Chapter 4
ENTROPY FUNCTIONALS
1. Basics.
We shall take a look at some properties of several functionals defined on the set of all
positive, a-finite measures and on the set of all probability measures on a measurable space
The definitions and results are variations on the themes developed in [4.1]-[4.2]. We direct
the reader to [0.2] where besides the results, many references to original work and applications to
statistics are developed. We shall also describe briefly some of the results from the review on
entropy inequalities compiled by Dembo, Cover and Thomas, [4.3].
For any measurable space (Cl,d?j, M(fl) and P(fl) will denote the sets
M(f2)={positive a-finite measures on (fl,F)}
P(fl)={probability measures on (Q.,d&)}.
Definitional. Let y eM(fi) and PeP(fi). We define
5 Y (P)=-!flnfrfy
W = -\p<y) in/>(v)rfn(y)
27
28 The Method ofMaximum Entropy
The following two cases are the most frequent. When M is a countable set, ^ m ^ - l (i.e. u
is the counting measure) and P(X = m,) = p,. Then
SI1(X) = -Z/>0)lnp(/).
The other case being M = 9T and u(dy) = dy being the Lebesgue measure. In this case we
shall just write
S(X)= -\p(y)\np(y)dy.
Definition 4.3. Let y,u,v eM(fl) be such that u « y , v « y and u, v are finite. Define
*0(u,v)=j£ln(^>a + v(n)-u(n)
/o(P,0=jflnfe)rfa
again when ln((dP/da)/(dQ/da)) is in L,(dP) and +00 otherwise.
The proof of the following obvious lemma is left for the reader.
Lemma. When u. « v (and P « Q) K(u,,v) (and I(P,Q)) are independent of a and
denoted by K(|x,v) (and I(P,Q) resp).
The functionals K and/or I have many names: Kulback Leibler information number,
information for discrimination, information distance, information gain or entropy gain of u (or P)
with respect to v (or Q). The functionals SV(P) or SV(X) are called u-entropy of P (or X).
Lemma 4.5. With the notation introduced above we have
i) SV(P) is concave in P.
ii) K(u,v) is convex in u. When u(il) = v(fi), K(u,v)>0, the identity holds true when
dyJdo - dv/do.
Proof:
i) The function -x lnx being concave on [0,oo) yields Sv(aP,+bP2)>bSv(P1)+bSv(P2).
ii) The convexity of K(u,v) can be obtained similarly. When |i(fi)=v(ft) and setting
c={dP/da>0} we have
Entropy Functionals 29
-JCQioO- J£h^*Sta{f(^)ifaSlnj42sO.
When K(u,v)=0
0=lnl£fe>^ln 1 S* <0
and the result follows from the strict concavity of lnx and the following lemma. (Nevertheless, see
the simpler proof in Theorem 3.1 of [0.1].)
Lemma 4.6. Let g be a positive function defined on (Ild^PY Then \n\gdP>\\ngdP
with the identity holds and only if g is constant a.s-P.
Proof: The inequality is the obvious concavity of lnx. Recall that lnx is strictly concave,
i.e., lnfSa^HCa; lnxj whenSa=l, if and only if x,=x2=...=x11.
For any a such that P{g<a}>0 and P(g>a)>0 we have, when the identity lnjg dP=Jlng dP
holds, then
r
ln\P(g>a) | gdP/P{gZa) + P{g<a) J gdP/P{g<a)
The assumption implies that the middle term equals the first term and therefore, the strict
concavity of the logarithm function implies that
J gdPIP{g>a}= J gdP/P(g<a)
5¥(P)-Sp(P) = 4 l n S ] = 4 l n ^ J .
where X =au+bv and 0<a,b, a+b=l. When u,v are finite
We leave these as exercises for the reader. We only add that when v = Q is a probability
measure
SV(P)= -X»(P,v).
The next lemma contains the basic behavior relative to changes of variables.
Lemma 4.9. a) Let O: (Q,,d&}->(Q,\d&~) be a measurable mapping. Let a, n, veM(fl), P,
QeP{a) and c', u', v'eM(n') and P'.Q'ePCn) be related by o' = o(*'' ), etc... Then
SW(P') * S»(P); KjQif, vO < K0(ix, v),/ 0 ,(/", Q<) < Ia(P, Q).
Comment. These results are a variation on the theme of sertinn 4 chapter ? r>f [0 9]
fiwg£ •' It is easy to verify that when P is restricted to <D~'(fi^J (see lemma (1.13))
dP/du=(dP7dn>4>. Thus
- I f . n f e ) ^ u + J f l n ^ = ^ l „ ^ _ , 0
Similarly
which is positive
Next we shall introduce some variations on the theme of conditional entropies.
Definition 4.10. In the same setting as in (4.1) assume further that v restricted to the sub
a-algebra G is a-finite. Then we shall define
similarly.
Definition 4.11. Let u.v.o be in M(il) with u,v finite and absolutely continuous with
respect to a. Let G be a sub-a-algebra of #"such that s is a-finite on G. Set
^lv)=lMg-(fi)-,(n).
When u,=P, v=Q are probability measures we obtain
f j p I (dPlda)/E<,[(dPld<s)\G]
K'°{P\Q)-- ' {dQlda)IEc[{(lPld<5)\a\
%(P\Q)=l$(P\Q) + i%°(P\Q)
Also r^G(P|Q) > 0, the identity sign occurs whenever dP/dQ=EQ [dP/dQ|G].
Comment. Similarly, when G, cG, are such that s restricted to G isfinite,then
Sv[X\Y](y) = -jP(x\y)\nP(x\y)\i(dc)
Lemma 4.16. Also Sv (X|Y) < Sv (X). The identity holds when X and Y are independent.
Proof:
= \P(x,y) I n j ^ j V , m v 2 { d y ) > 0
where the last step follows from lemma (4.5). Here P,(x)=fP(x,y)v2(dy). Again, applying lemma
(4.5) to the last term we conclude that the identity Sv (X/Y)=S(X) holds whenever
P(x,y)=P,(x)P2(y), i.e., when X and Y are independent.
A repeated application of this lemma yields.
Lemma 4.17. Let {X, |l<i<n} be random variables with values in (£2, ,M, ,v,) respectively,
etc. Then
for t s 5R* when the integral exists. Whenever (u.O) remain fixed, we will not mention them
explicitly. We shall introduce the notation
The O - Hellinger arc of m is the family of measures, absolutely continuous with respect
to du,, defined by
where
Lemma 4.25. When D(n,<t) has a nonempty interior D°((i,4>), the mapping
D°(n, <!>)-» W given by/-»J<J>rfu< is defined and \®\d\x, = 3 In Z(t)/dti. When the co variance
matrix exists and is of full rank, the mapping is 1:1.
Proof: Let h be any fixed vector in 91* and c > 0 such that t + sh e Z)°(u,0) for
/ s D° (n, <t>) for 0 < |s| < c. Since Z(t+sh) is finite
■sj(/i,0)4i, = {lne*A-'1Vn,<lnJe*<I,>4i,<oo
because of the concavity of lnx. Since the sign of s is arbitrary we conclude that f (h,Q)d\it <<*>
for any h.
It is not hard to see that Z(t+sh) is twice differentiable at s=0 and
£ja Z(t+sh) I = fa, O)2 dp, - (f (h, 4>) ./u,) 2 = (h, Ch),
where C is the covariance matrix of <1>, which happens to be the Jacobian of the map t -> \<bd\x.t.
Even though D° (jo., $ ) is convex, the range of the map described above is not necessarily
convex. For an example take a look at the third page of [4.4] where some properties of
exponential families are investigated. For some other metric properties associated to the Kullback
I-divergence see [4.5] and [4.6]. Below we shall be quoting extensively from the last one.
2. Entropy inequalities.
Let us now describe, somewhat scantily, a few results from [4.3], We commented at the
end of chapter 3 that setting
(which will be called the entropy power of P relative to u) gives us an intuitive way of
understanding how big is the support of P. For example, when Q is a finite set fi={l,2,...,n} and
36 The Method of Maximum Entropy
H is the counting measure and P{i}=l/n then S=lgn and, when a=b=l then N„(P)=n, the number
of states that can be occupied.
When (fJ,(#J=(<R",B), u(dx) is the Lebesgue measure and P has Gaussian density with
covariance Kij=EpXiXj and zero mean, setting a=2/n, b=l/27te, we obtain N(l(P)=|K|"°
When Q is a product space f2, ®fi2 and P is a probability on Q, absolutely continuous
with respect to a product v = Vi ® v2, then Lemma (4.17) asserts that
5 v (P)<S5v/(P,)
therefore, from (4.26) we have
with the identity holds whenever P is actually a product of its projections P, and P2.
Notice that (4.27) does not depend on the nature of the measures involved. In the papers
by Shanon and by Stam quoted in [4.3] the following is proved: let X,Y be two independent 91"
-valued variables having density with respect to Lebesgue measure, such that S(X) and S(Y) exist.
Then, the Shanon's entropy power inequality asserts that
But, notice that when X,Y takefinitelymany values the opposite inequality seems to hold.
For example, let X,Y be independent such that P(X = ±1) = P(Y = ±1) = 1/2. In this case S(X) =
S(Y) = ln2 and S(X+Y) = 3/2 ln2. Using (4.26) with a = b = 1 we obtain
One way to understand this is to consider finite sets £1, and £2j on which probabilities are
defined and a map 0: fi -> Q' such that F is into and |Q'| = |Q, |+|Q 21-1 = number of diagonals
(or antidiagonals). Thus the conjecture here is that if we set P = (P, ® Pj)°0' then
N(.P)<N(Pi)+N{P2) as suggested by (4.29).
When the density of the distribution of the 91" valued random variable is continuously
differentiable and together with its first partial derivatives decays sufficiently rapidly at infinity,
then between the Fisher information of X defined by
(4-30) J{X)=\(%)dx
Theorem 4.31. ( D e Bniijn's Identity) Let X he. defined as shnve and 7 he a TVT(0 1) (R»
valued random variable. Then
(4.32)
(4.32) ±S(X+J£2)=\J{X)
■ysz) == f4*)
2W«
and furthermore, the isopermetric inequality for entropies states that
(4.33) \J(X)N(X)>\
(4-34) J(^=jfe)fefe
Let vj/(x), <I>(y) be conjugate elements in L2 (9t"), i.e. *(y) is the Fourier transform of
V|/(x). Define X, Y to be random variables with densities
D (x) _ l<**>l2
(where 0 for matrices means positive definite). The Cramer-Rao inequality asserts that
(4.36)
m-■*2 >o
JOO -Kr>00
JGO -Ky>
and the combination of these two yields the analogue of the Heisenberg - Weyl uncertainty
relations in quantum mechanics in four possible equivalent statements
167t 2 ^j--A^?>0
16K2Kx -K?>0
38 The Method ofMaximum Entropy
(4.37)
\6%2KfKyKf -I7t0
IttfKfKxKf -I>0.
eNHm
= eMAH)ree200>1084
IV ~ gNH
There are several proofs of this inequality. See [5.5] for references. Here we present
KuUback's version. It appears as a set of exercises at the end of chapter 3 of [0.2] in which
references to the original papers can be found. Let us set f, = dP/du, and fj = dQ/du.. They are
positive, integrable functions (with respect to du.). An easy application of Cauchy-Schwartz
inequality yields
JOV2)" 2 rfu^l
the identity holds only when f, = £,. Rewriting the integrand above as f, (fj /f, )1/2 and using the
concavity of the logarithm functions we obtain
-21n|(/i/2) 1 / 2 rfu</i:u(P,0
and since for x< 1 logx<x-l we obtain, using the normalization \f\ d\i = \fid\x=l , that
K,(P,Q)>2{x -{0V2)1/2rfu) =f ( M 2 ) V
substitute f, = u, ^ = v, take square roots and use Cauchy-Schwartz inequality to obtain the
inequality.
This was just a sample of an interesting class of inequalities. I hope to have wetted your
appetite enough.
Certainly the entropy functionals we introduced in section 1 do not seem to be the obvious
convex (or concave) functionals, having metric-like properties, to be defined on P(f2) or M(Q).
This has prompted many people, to postulate "natural" assumptions on the functionals to be
studied, which would lead to functional equations to be satisfied by the desired functionals. Then
they would prove that either the entropy functional or the Kullback-like directed divergence were
the unique functionals having these properties.
But then again, one is always left wondering why the chosen postulates are natural.
Anyway, a line of research traceable to Shannon's work of 1948 was summarized in the
book by Aczel and Daroczy [4.7]. The results there concern entropy functionals on probability
spaces having finitely many atoms. eNHm
= eMAH)ree200>1084
To obtain S ^ ) or yP,Q) IV ~from
gNH axioms on functionals defined on {Pe P (9?"): P«fi} or
2
{Pe P(9J"): P « n } see the work of Forte and Sastri [4.8]-[4.9] and that of Johnson and/or
Shore.
Assume that a concave functional
F : { F e P ( 9 ? " ) : F « p . } - > 91
i) Is in subadditive with respect to projections, i.e. when P, and P2 are restrictions to 91"' and to
91 "2 (identifying 9J" = 91 "■ ® 9t"») then
eNHm
= eMAH)ree200>1084
IV ~ gNH
F(F(r'))=F(F)
If finiteness is insisted upon, then c = 0 whenever n(5R") = °o. The constant b can be
different from zero in applications where n can vary. For example, in statistical physics the
entropy is an extensive quantity depending on the number of particles.
In [4.10] the postulates of uniqueness, invariance, system independence and subset
independence are manipulated to obtain not only the functional forms of S ^ ) or IM(P,Q) but
Jaynes principle of maximum entropy and KuUback's principle of minimum cross-entropy, as the
uniquely correct methods for inductive inference when information is given in the form of
expected values.
REFERENCES
[4.1] Kullback, S., Keegel, J. and Kullback, J. "Topics in Statistical Information Theory". Led.
Notes in Stat. No.42. Springer - Verlag, Berlin, 1987.
[4.2] Dacunha, D. and Gamboa, F. "Maximum d"entropie et probleme des moments" Ann.
Inst. Poincare. _26, No.9, pp. 576-596, 1990.
[4.3] Dembo, A., Cover, T. and Thomas, J. "Information theoretic inequalities". IEEE
Transac-Info. Theory. 37, No.6, pp. 1501-1518, 1991.
[4.4] Efrom, B. "The geometry of exponential families" Ann. of Statistics. 6, No.2, pp.
362-376, 1978.
[4.5] Rodriguez, C. C. "The metrics induced by the Kullback number". In [0.9].
[4.6] Cizar, I. "I-Divergence geometry of probability distributions and minimization
problems". Ann. of Probability. 3, No.l, pp. 146-158, 1975.
[4.7] Aczel, J. and Daroczy, Z. "On measures of information and their characterization".
Acad. Press, New York, 1975.
[4.8] Forte, B. and Sastri L. "Is there something missing in the Boltzmann entropy". Jour -
Math. Phys. 16, No.7, pp. 1453-1456, 1975.
[4.9] Forte, B. and Sastri L. "Representation of the entropy functional for a grand canonical
ensemble in classical statistical mechanics" J. Math. Phys. 18, No.7, pp. 1299-1302,
1975.
[4.10] Johnson, R. W. "Axiomatic characterization of the directed divergences and their linear
combinations". IEEE Trans. Info. Theory. IT-25, No.6, pp. 709-716, 1979.
[4.11] Shore, J. E. and Johnson, R. W. "Axiomatic derivation of the principle of maximum
entropy...". IEEE Trans. Info. Theory. IT-26, No.l, pp. 26-37, 1980.
[4.12] Borwein, J. M. and Lewis, A. S. "Convergence of Best entropy estimates" SLAM Jour.
Optimizat. I, No.2, pp. 191-205, 1991.
Chapter 5
In this chapter we carry out the program described in the introduction for the
reconstruction problems at levels 1 and 2 of the M.E.M.. For the sake of a presentation that more
or less follows the chronological development of the results, we will be somewhat repetitive.
To begin at the beginning, one of our basic reconstruction problems was to find the
measure P « u, realizing
where $ : fl-»SRk is a given measurable, finite valued, function and cs 9?k, Q«u, is a fixed
measure. Our first result dates back to the fifties. The following theorems or variations on the
themes of theorems 2.1 and 2.2 of chapter 4 of [0.2].
Theorem 5.1. With the notations, and assumptions introduced above, assume that
P |1 (c,*)={PeP(n)|P«n, Ep(4>)=c} is not empty, that ceint(D(c,4>))=int{te 9tk|ZQ(„(t)<oo}.
Then, i)
^ = cxp -(t,©)f/Z^(f)
c = VtlnZe,<p(r)lr
41
42 The Method of Maximum Entropy
where we dropped the variable co and we shall be dropping as many super and subscripts as
possible. L(s) has, for each given a, a maximum at s0()=exp-{(t,O)+t0+l} at which L(s0)=-s0.
Since L"(s)=l/s, there is S,(<D), lying between s0(co) and s such that
/ 2 2
L(s) = L(s
L(s)--= L(s00)) ++(s(s-So)L
- s0)L'(so) (s--s0)s0/2si
(s0) ++(s- ) /2si
2 2
=-s-s 0+(s-so)Jo)
0+(s- /2s/2^i
l ..
Actually, this identity defines %. Set dP/du=f„ dQ/du=fj and substitute f/f, for s in the
identity above. After integrating with respect to dQ=£,dp., we obtain
where we used the fact that J#fj<lu=c, In the lemmas below we prove that
#(t) ==(t,c)
H(t)-- (t,c) + lnZ(t)
is negative, analytic and, whenever the hypothesis of the theorem are met, (ii) is fulfilled.
Before doing the lemmas, we extend Theorem 5.1 as
Theorem 5.4. With the same notations as above, but now we consider P « u and Q « u
with respect to which <S is integrable and
£/.(<£>,
E P(<&, t )) :^
S(t,0)
(t,0)
Lemma 5.5. For teint D(Q,<t>), the complex valued extension of Z(t) obtained by
replacing t by t+iu=Z is analytic in Z and the following hold
the first identity drops out. To obtain (b) start from (a) and invoke differentiability through
analyticity.
Lemma 5.6. When the inequality in (5.5)(b) holds, the function 9(t)= -V,lnZ(t) defined on
intD(Q,<I>) is one-to-one.
Comment. The range of the mapping thus defined may not be convex. See example at the
end of section 2 of [4.4]. It is nevertheless an open set.
Proof: For the reader. It is based on (5.5b).
We shall denote by t(9) the value oft e intD(Q,S>) for which 0= -VlnZ(t). It is also easy
to see that.
Lemma 5.7. £el(T-9)2=Hess(lnZ(t))=J(9). Also J(0)J(t)=J, where J(0) and J(t) denote the
Jacobian matrices of the mappings t->9(t) and 9->t(0).
Lemma 5.8. Assuming that the inequality in (5.5-b) holds, let t(9) be the inverse function
to that defined in Lemma 5.6. Then H(9) = H(t(9)) is negative and has a strict maximum at 9(0)=
JcfcdQ.
Proof: Note that
Ktl(Q?,Q) = -H(t)>0.
Then so is -H(0). Using H(t)= -(9,t)-lnZ(t) with 9= -V,lnZ(t). Solving for t(0), differentiating,
using (5.6) and the chain rule one obtains V0H(9)=t(9). Using Lemma 5.7 and the assumptions, it
follows that the quadratic form associated to Hess H(9) is strictly positive. Therefore H(9(0))=0
is the maximum value of H(9).
To completely finish the proof of theorem 5.1, notice that when the datum c is in the
image of int(D(Q,0)) by - V,lnZ(t), then
44 The Method ofMaximum Entropy
where the meaning of the symbols is as above. Also, now Z(t) stands for Z^ 0 (t). The solution to
(P2) is contained in
Theorem 5.9. For a given c in the image of intD(u.,«I>) by -V,lnZ(t), let t* in intD(n,<f>) be
the point at which H(t)=(c,t)+lnZ(t) reaches its maximum value. Then (P2) is solved by
d\x\- = exp-(f, <)>)rfu/Z(t).
Proof: Flip a few pages back, and check (4.23). It asserts that if t e intD(|j.,<l>) and P « u ,
Ep (<t>)=c, then
KviP,v.t)=mfy-sv.{p),
2. Czizar's results.
Here we present a summary of [5.1], with some obvious changes. In section II of chapter
4 we obtained the lower bound
(5io) §f|g-fU*(W0)i
The Method of Maximum Entropy 45
K/J(P/Q)=inf{K)jiP',Q) /P'z E)
eNHm
(5.12) eMAH)ree200>1084
IV ~ gNHKv.<P,R)-Kli(P,Q)=EPln^
=
E»{in(M)) =K,(Q,R).
Proof: Let pa=a(dP/du.)+(l-a)(dQ/du.) denote the density of P with respect to u,. From
the convexity of plnp it follows that
/a = s ( p » l n P a - ^ l n ^ J
decreases to
eNHm
= eMAH)ree200>1084
IV ~ gNH
from which it follows that if EI,(ln(dQ/dn)/(dR/dn))<Kll(Q,R) then there is 0<a<l such that
The converse is easier. The hypothesis implies that K(l(P,R)>K(l(Q,R) by (5.12), and
therefore K ^ J R ^ K ^ Q . R ) because of the convexity of KM(P,R) in P.
The second half is left for the reader. It is also left for the reader to verify that the lemma
and (5.12) imply the next proposition.
Proposition 5.14. A probability P is the K^-projection of Q on the convex set of
probabilities £ if and only if every P'e £oSM(Q,oo) satisfies
(515) V.fija^P'^+V.g).
If the K -projection P is an algebraic inner point of £ (i.e. if for every P's £, there exists P"
e £ with P=aP'+(l-a)P", 0<a<l) then fcS^Q.oo) and Ep.(ln(dP/dn)/(dQ/du))=K^(P,Q) and (5.15)
holds with the equal sign.
Before we consider the first of the results we want, observe that if £ is any set of
measures, and if there exists Pe£ with u,-density cexpg(x)(dQ/du) where JgdP, =JgdP2forany
P„ P2 in £, then K/P|Q) = inf{K(l(P',Q)|P'e£}. More exactly, in this case
The particular case we are interested in can be rephrased as: Let £={PeP(£2)| P « u ,
Ep(0)=c}, where <T>: fl-> 5Rk is given measurable mapping and ce SRk, then if a Pe £ exist and is
of the form dP/du=c exp-(t,<D)(dQ/du), then it is the K^-projection of Q and (5.16) holds. We are
now ready for
Theorem 5.17. Let {fjae A} be an arbitrary set of real valued, measurable functions
defined on Cl and {cJaeA} a set of real constants. Let £={PeP(fi)|Ep(f1)=ca, aeA}. Then, if a
probability Q « ji has K^-projection P on £, its u,-density is of the form
where N has P'(N)=0 for all P'sEr.S^Q.oo) and g() belongs to the closed subspace of L,(Q)
spanned by the f,'s. Conversely, if a Pe£ has Q-density of the form (5.18) with g belonging to the
linear space spanned by the f,'s, then P is the K^-projection of Q on £ and (5.16) holds.
Proof: It follows from Proposition 5.14 that P is the K-projection of Q on £, then for N=
{dP/du=0} it is necessary to have P'(N) = 0 for any P' e £nS (Q,°o).
Let £t£, the class of P'e£ with dP'/dP<2. If P'e£', there is P"e£' with dP7du=2-dP'/dn
and P=(P'+P")/2. (Given P's £, define P"=2P-P' and verify it is in £■.) Thus P is an algebraic inner
point of ^.Applying Proposition 5.14 to £■ instead of £ we obtain Ej,[ln(dP/du)/(dQ/du)] =
K,(P,Q)or
eNHm
= eMAH)ree200>1084
IV ~ gNH
for all such h, and therefore for all heL(dP) satisfying (5.20).
Therefore, ln((dP/du)/(dQ/du)) belongs to the (closed) subspace of L,(P) spanned by 1
and the fa's. For, were this not the case, the Hann-Banach theorem ([1.4]) would imply the
existence of a bounded linear functional on L,(P) vanishing on the said subspace but not at
ln((dP/du)/(dQ/dn)). Since the dual of L,(P) is L„(P), this is a contradiction.
To prove the second part, suppose that (dP/du.) is of said form. Since g is a finite linear
combination of f,'s, Jg dP is constant on £ and
But for P'e£ both Kpi(P,Q)<oo (by hypothesis) and KJ¥\P)<CD. Therefore
KI1(P',Q)=KJP'F)+KJ?,Q) as desired.
So far we know that if a solution to (PI) exists it has the desired form. Let us see how
Czizar settles the existence question.
48 The Method of Maximum Entropy
Theorem 5.21. Let P^/cO) be as in statement of Theorem 5.1 and A=fcs SR'IPeP^c,*),
K(l(P,Q)<oo}. Assume that D(Q,<X>) is open. Then, the KM-projection of Q on PM(c,<D) exists for
every c in the interior of A.
Comment. The obvious question is: what is the relationship between the interior of A and
the image of D(Q,<J>) by -V,lnZQ4(t)?
During the proof we shall need the following lemma, the proof of which is carried out in
[5.1].
Lemma 5.22. For anv measurable function O such that e(t-<>) is Q-integrable for small |t|,
K|1(P„,P)^-0 implies J<X>dP„ -> J<D dP.
Proof of Theorem 5.21. Since K^fP) is convex in P, the set A is convex, and
Let us verify that teD(c,0). Let Pn eP^(c,$) be such that KM(Pn,Q)->F(a). Then by
(5.10), P„ converges in variation to some P. Set <J>Jn) = */ if -t, <I>l <¥^ and <E>, = 0 elsewhere.
Here K„t oo. Let Pn be probability distribution with
dP„
(5.24) 4>
=JV«fe=A-.«p-(t,*«)fi
Since, 0sD(Q,O), the components of O are Q-integrable, (see Lemma (5.5a)) and Pn is
integrable as well. Therefore, for n large enough, JO(n) dPn is arbitrary close to J«X>dPn = b° say.
Choosing the K,, property, get the JO'-'dP',, close to J<J>dP'n=a. Compare (5.25), (5.12) to
(5.23) with b" in the role of b and obtain K ^ . Q J - ^ O . Thus, on account of (5.10), the Pn with
densities (5.24) converge to P with density N exp-(t,<J>)(dQ/du) with respect to u. And also,
teD(Q,<I>). Setting c=Ep*, similarly to (5.25) we have
^(P,0=^(ln^)-(t,c-a)
The Method of Maximum Entropy 49
from which, using (5.12) and (5.23) we obtain ^ ( P ^ P ) - ^ . Since we are assuming that D(c,*) is
open, lemma stated above implies that J G> dP=lim JdP'n=c. This completes the proof.
The problems we have been dealing with are actually particular cases of linear
programming problems which can roughly be described as follows:
where *P is an appropriate convex function defined on 9?, K is a convex set of functions p defined
on a measure space (Cl,3,\i), <X>: Cl -> SRk is a measurable mapping and ce 9?k.
Even though I am hardly describing the results of this interesting line of work, the reader
should at least take a look at [5.2] and at some of the references there, in particular to the
pioneering work by Rockafellar in [5.3]-[5.4].
As an appetizer, I will only mention a few of the examples described by them.
Let Cl, be [0,1] and 9 its Borel sets. Let ^ u ) ^ " (p>l) or T(u)=l/u for u>0 or 0<u<l.
Or, let xP(u)=u lnu or 4'(u)=-lnu for u>0. Supply with appropriate linear constraints to obtain a
problem as above.
Anyway, their treatment relies heavily on convex analysis, particularly on duality theory.
The general idea is always to go from the original problem (on an infinitely dimensional space) to
a dual (finitely dimensional) problem, and to verify that there are enough conditions under which
both lead to the same solution. Heavy stuff, but quite general, and useful!
Even though the original idea behind the present approach seems to date back to Rietsch's
paper in 1977, see [0.12], the development of the method in its full french generality is due to
Dacunha-Castelle and Gamboa, see [4.2] and, for further extensions and generalizations by
Gamboa and Gassiat see references [5.5]-[5.7]. Before presenting the main results in [4.2], and to
motivate further, consider the following problem.
Suppose you want to solve
(5.26-6) x, e { 0 , l } .
The level 2 MEM. way of solving this problem is the following. On fi={0,l}N with the
obviousCT-algebra. For to s Si, P(to) is the probability of configuration to . We define X,:
Q. ->{0,1} by the obvious thing Xj(co)=to(i). On {0,1} we define a probability m(0)=mo ,m(l)=m,
and on £2 we define 1^= m®... m in the obvious way. This m is some "a priori" measure on fi.
We shall define <J>, on fit by *,=5:^ljX1 and we shall look for P'e P(fi) such that
(5.26 - c ) EP[<b,]=y,
where P(to)=p(to)|x(to) for every to s Q.. Having established the notation, there is no problem in
verifying that
Z(f)=Z e-w"<<'')u(to)=n (m0+m1e-^'",i').
assumes its minimum value there, then p*(co)=exp-(t*,<J>)/Z(t*) and the X; we want are given by
Xj = EpnXj =Z ArJ(to)p*(co)u(a))
for l<j<M. If all goes well, the numbers t*„ l<i<M will be such that Xj is very near zero or very
near one, (in practice it is somewhat hard to go beyond such statements).
The Method of Maximum Entropy 51
In its most general form, the reconstruction problem via the M.E.M. goes like this: Let B
be a locally compact, topological vector space and B* its dual. Let |x be a reference measure on B
and X a B-valued random variable and P its distribution.
Let <t>s(B*)k and cs 9?k The M.E.M. reconstruction consists of finding P that maximizes
S ^ ) subject to Ep(<4>,X>)=c, PeC.
But we shall not aim at such generality here. For us B will be C([0,1]) the class of all
continuous, real valued functions defined on [0.1]. To be specific we shall consider the cases
C2 = {geC([0,l])\lg2dx<l}
I
Also, let O: [0,1 ]-»SRk be continuous, and set (O, g) ={ <f>{x)g{x)ax.
o
Since dealing with measures on infinitely dimensional spaces is hard, the thing to do is to
discretize and verify that a solution to our problem exists in the limit . This explains the reason
behind our regularity assumptions.
We want to solve the problems: For i=l or 2
To discretize, we shall consider the discretization I„={i/n| i=0,...,n-l} of size n of [0,1] and
for hsC([0,l]) the trace h„ of h on L, is defined to be {h(i/n)| i=0,..,n-l}.
Let C(i)„=(a,b)" and for any measure m on (a,b) we set u,n=m®... m. And, when dealing
with C2, C(2)n will be the unit ball Bn in SRn and we shall take u.n to be the uniform measure on Bn.
In any case, Xu=(X(")1,...,X(n,11) will denote the obvious coordinate vector and we will
search for measures Pn on C(i)n such that P„«u n and
(5.27-6) i£ / .„[Z*(r/n)^" ) ] = c
eNHm
= eMAH)ree200>1084
IV ~ gNH
c) \<P{x)gk(x)dx = c.
0
The following lemma asserts that discretization yields feasible solution at every stage. The
proof is in [4.2].
Lemma 5.29. Let C be an open convex of C([0,1]). Assume that the constraint is
realizable, i.e., there exists g such that <0,g> = c. Furthermore, assume * is of rank k, i.e., for
any a e 5Rk such that (a.O(x)) = 0 for all x in [0,1], we must have a = 0. Then, the constraint is
realizable in C„(i) that is, there exists a„ s Cn(i) such that £ S 9(i/n)&„ = c .
The following lemma asserts the existence of solutions to the M E M . problems of size n.
Lemma 5.30. If the hypotheses of Lemma 5.29 hold, and if the convex envelope of
the support of n„ contains Cn, and if
is a non empty open set, the ME-problem of size n admits a unique solution defined by
eNHm
*5(*.)IV= JEJ5«P[-4
~ J («,*(*))*i]dMc)
MAH) 200
gNH
= 84e
■
ree >10
Proof. According to Lemma 5.30, there exists c n such that (l/n)S4>(i/n)ov' = c. Thus we
only have to quote Theorem 5.21 above to obtain a measure P*n such that
eNHm
= eMAH)ree200>1084
IV ~ gNH
Let us now concentrate on problem (5.27-a) for C,. Under the following assumptions, we
shall prove that the sequence of ME-problems yields a solution to our (5.27-a).
Assumptions on the measure m on (a,b)
Al) (a,b) is contained in the convex envelope of the support of m.
A2) The set D(m) = { te 9? | Jexp-ty m(dy) < QO } is a non empty, open set on which we define
C(t)=Jexp-ty m(dy) and >P(t)=lnC(t).
A3) The set V= {u e 5Rk| (u ,4>(x)) eD(m), Vx s [0,1]} is non empty and coincides with V={u
e Mk| *F((u,<P(x)) e L,([0,l]),dx)}.
Denoting the u in Lemma 5.30 by un and setting An = un /n we can restate Lemma 5.30
as
The Method of Maximum Entropy 47
where N has P'(N)=0 for all P'sEr.S^Q.oo) and g() belongs to the closed subspace of L,(Q)
spanned by the f,'s. Conversely, if a Pe£ has Q-density of the form (5.18) with g belonging to the
linear space spanned by the f,'s, then P is the K^-projection of Q on £ and (5.16) holds.
Proof: It follows from Proposition 5.14 that P is the K-projection of Q on £, then for N=
{dP/du=0} it is necessary to have P'(N) = 0 for any P' e £nS (Q,°o).
Let £t£, the class of P'e£ with dP'/dP<2. If P'e£', there is P"e£' with dP7du=2-dP'/dn
and P=(P'+P")/2. (Given P's £, define P"=2P-P' and verify it is in £■.) Thus P is an algebraic inner
point of ^.Applying Proposition 5.14 to £■ instead of £ we obtain Ej,[ln(dP/du)/(dQ/du)] =
K,(P,Q)or
eNHm
= eMAH)ree200>1084
IV ~ gNH
for all such h, and therefore for all heL(dP) satisfying (5.20).
Therefore, ln((dP/du)/(dQ/du)) belongs to the (closed) subspace of L,(P) spanned by 1
and the fa's. For, were this not the case, the Hann-Banach theorem ([1.4]) would imply the
existence of a bounded linear functional on L,(P) vanishing on the said subspace but not at
ln((dP/du)/(dQ/dn)). Since the dual of L,(P) is L„(P), this is a contradiction.
To prove the second part, suppose that (dP/du.) is of said form. Since g is a finite linear
combination of f,'s, Jg dP is constant on £ and
But for P'e£ both Kpi(P,Q)<oo (by hypothesis) and KJ¥\P)<CD. Therefore
KI1(P',Q)=KJP'F)+KJ?,Q) as desired.
So far we know that if a solution to (PI) exists it has the desired form. Let us see how
Czizar settles the existence question.
54 The Method ofMaximum Entropy
Proof: Let i s W . Since W is open, there is x" and 0 <A. <1 such that x=A.x"+(l-X,)x"
From the concavity of K we obtain
Now use assumption (iii), and the fact that x* is fixed to obtain the desired result.
Proof ofLemma 5.32: Set
Certainly Ffn(A,c)->H(A,c) uniformly on compact sets. Thus if we show that H(A,c) has a
minimum at A„ e V, we will be through. For that, let us begin by verifying that H(A,c) satisfies
the assumptions of Lemma 5.33.
For A'sV set c* =J ®(x)x¥'(AmMx))dx.
o
Then A* is the minimum of H(A,c*) and (i) of the lemma holds. Let geC, be such that
c=<*,g>. Set
eNHm
= eMAH)ree200>1084
IV ~ gNH
eNHm
= eMAH)ree200>1084
IV ~ gNH
and also
X(P;)=JffB(A„,C„).
The Method of Maximum Entropy 55
ft e x p f - x i C I " ) - 1 ^ ) ) - « P o C P 0 - , ( * ( i ) ) )
rm(g) = -\ ym(g(x))dx<H(A,c)
o
for any A e SR k. Since Tm(g) isfiniteon ge C„ due to the continuity of ym on (a,b). Therefore, (ii)
holds as well. Fatou's lemma yields (iii), and therefore Lemma 5.32 can be invoked to conclude
that the minimum of H(A,c) cannot be reached at U*
All these scattered lemmas amount to proving
Theorem 5.34. Let O:[0,l]—» 5Sk be a continuous mapping. Let cs SRk and consider
problem (5.27-a) for C,. The following are equivalent
1) (5.27-a) has a solution.
2) These exists ^(t), such that expY is the Laplace transform of a measure m on (a,b)
satisfying Al, A2 and A3 such that (5.27-a) has <P'(A„,<I)(x)) as solution, where A„ verifies
c=\g(x)V>(K„,g(x))dx.
o
56 The Method of Maximum Entropy
3) For any *F such that exp*F is the Laplace transform of a positive measure satisfying Al,
A2 and A3, (5.27-a) has a solution y(A«.g( x )) with A„ satisfying
)g(x)y>(A„,g(x))ax=c
o
And to finish we have
Theorem 5.35. Define T: C, -> 9? by
r(h) = -]ylh(y)]dy
o
where Y(y)=yOP )" (y)-[ F°OI")- ](y). Then g*(x)=,P,(<A».*>) is the unique element at which
1 1 t 1
J exp(t,x")A"=l ex.p[\\t\\y](l-y2)^v^dy
-l
Z„(u) = £/„(±||(u,<I>)||)
from which we obtain that
«sW] = dku». Ml)"1 ( »», *(;) )G.(U.)
where un has to satisfy
The Method of Maximum Entropy 57
id(u„,<b„)\\)-LG„MX ( u . , * ( i ) ) « ( i ) =c
Set now A„= UnG^uJ/IKu,,,^)!!"1 and the n-th maxentropic approximant to the desired g*k
is given by
g;(r) = z(A„,<i»(i))x([^>i])(x)
where we set x(A)(x) equal 1 or 0 depending on whether x is in A or not (i.e. the indicator
function of A in the parlance of measure theorists but not in that of convex theorists).
Anyway, above, An is the unique minimum of the convex functional
tf„(A,c) = £j:(A,<&()))2+(A,c)
\ a>(x)®+(x)dx
o
g*(x) = -(MilcMx))
APPENDIX.
55 The Method ofMaximum Entropy
Before starting to quote results from [5.8], it is convenient to translate the scheme of
section 1 to 5R*. By means of <& : Cl -»SR^ we can associate with each measure P, U, |i on fi a
corresponding measure P°<b~l = <S>(P), etc - o n 91* We shall assume that the range S of 4> is a
Borel set in 9?* and we shall denote by C the closure of the convex set generate by S. And we
shall write 71(x) = x for the identity mapping on SR*
Instead of considering the translates of p(u) = [ P e p(G)\P « u ] by * we shall
consider only the translates of the Hellinger arc
4i2> = ^exp-(A,0>)4i
and if, for short, we denote by u the measure <£>(u,) on 5R* we have the exponential family
H(\i) = {u, : X s .DJwhere
eNHm
= eMAH)ree200>1084
IV ~ gNH
infC< doms<C.
Theorem 5.37. k(X) is steep if and only if d(D°) = infC, where 3: D° -> SR* is given by
3{X)=VxKX) = VxlnZ(A.).
Theorem 5.38. Let t be a boundary point of C. If there is a hyperplane H supporting C at t
and satisfying u(H)=0, then tg dom s in particular, whenever u, is absolutely continuous with
respect to the Lebesgue measure on SK*, we have that dom s= infC.
To explain a bit some of the words, we mention that a closed convex function is just a
convex lower semicontinuous function. That 5(c) is essentially smooth means that
doms= {c e 5Rl.s(c) < oo} is open and s is differentiable on int(dom s). In this case, s(c) comes
out being steep, that is J^(c' + X(c - c')) tends to infinity as Xio, where c' is in the boundary of
dom s and ce int(dom s).
The Method of Maximum Entropy 59
For more about these facts, the reader is directed to Chapter 5 of [5.8] where appropriate
references to the treatise on convexity by Rockafellar are given.
In our situation K(0) is the natural logarithm of a Laplace transform, and therefore it will
be differentiable on the interior of D whenever it is not empty. According to Theorem 5.37 this
will happen whenever int C is not empty. Then, the first thing to examine is the support of p. in
SRl. If it has a nonempty interior we proceed to find D. If S is a finite or countable set, then we
proceed according to
Theorem 5.39. Let S be a finite or countable set. Then conv Seldom S. In particular dom
S=C if Sis finite.
REFERENCES
This chapter is made up of bits and pieces. It is a collection of sections, not related in any
logical order, the contents of which can be considered as either comments on the material of the
preceding chapters, extensions or variations on the theme of some of the topics, or applications
mostly taken from the literature, and presented in no particular order at all, hopefully to break the
monotony.
Since this chapter is very long, results and formulae will be numbered by section.
Many reconstruction problems have inexact data, and instead of wanting to solve for x in
Ax=y one decides to look for x's such that
where
61
62 The Method ofMaximum Entropy
D(M)={Xe<Hk: Z^(X)<oo}.
u) Otherwise dPy.- = - ,,
Then to find the sup {SJP): PsBM(y,y)} it suffices to find sup{Sli(Pxl(y„) ): T| e VM(0,y) ) . The
final step consist in applying the min-max exchange theorem in [6.3].
Applications and Extensions 63
Comments. Actually, instead of VM (y,y) we could have considered any convex set K. The
issue would then be to find the analogue of HIX,y).
There is one very important sense in which relaxation is of real help. Notice that when y
gR(A) the range of A, then there will be no hope of finding a minimum of lnZx(|i)+(X,y). We
have to consider finding x such that Axe VM(y,y) and VM(y,y) oR(A) is not empty. There will be
a critical value of y below which no solution will exist.
Here we present some examples, borrowed from [6.2], in which the maximum entropy
solution to a linear reconstruction problem does not satisfy the associated dual problem. We also
present, without proof, the way around this difficulty proposed by Bowrein and Lewis.
Consider a measure space (£2,^u) and a vector subspace X of Lp(fi,n) 1 <p < °o , on
which a functional S^ is defined by
where (p : 5R -> [-oo, oo] is a closed concave function. The maximum entropy problem consists of
finding
where A: X-»Y is some continuous linear operator. Some examples of <p are
a) Burg entropy
cp(x) = -lnx
b) Boltzmann entropy
(p(x) = -xlnx
c) Fermi-Dirac entropy
cp(x) = -x\nx-(l -x) ln(l -*)
d) Lp norm
(p(x) = -xp/P
64 The Method ofMaximum Entropy
e) Lp entropy
, . f xpIP x>0
<p(*) = 1
[ -00 X < 00 .
V © = I<P'(^))«
where q>*(£)=/«/" {(p(x)-(4,x)} is the Fenchel conjugate of (p. Here we use (4,x) to mean !;(x) for
5eX',xeX.
The conjugates of the functions listed above are respectively
a) cp*(0=l+lnK)
b) <p*(x) = e^-'
c) (p*© = ln(l+e-5)
eNHm
= eMAH)ree200>1084
IV ~ gNH
e) (p^^max^}'/?.
Ax = y with inf dom <p < infx < sup x < sup dom (p.
(6.2.2/ inf{V04*A.) + ( X , y ) : X s r } .
(62.4) x .=_^(i4.r)
Applications and Extensions 65
Let us verify that the range of AA* can be larger than the range of A. Notice that x*n=l/2°
satisfies our problem, but no X" such that A (A*X")=y can be found. Since (A A'X )v =X J42v
and y„=l/8" we would have Xn=2" or X not in L2(Q,n). In other words the maximum entropy
problem cannot be solved using duality theory, a real handicap.
One may consider solving the finite dimensional problems (Ax)0=yn for (Kn<N and then
attempt taking limits. But observe that in this case xN=(l,l/2,...,l/2N,0,...,0...) and
X'N=(1,2,...,2N,0,...,0). Even though the solution to the full primal (6.2.2) is the limit of the x*N,
the Xn cannot converge. This is related to the fact that
is strictly convex, has a unique minimum at zero, but notice that if er is the n-th basis vector,
H(ne„,y)->0 and ||neJ|-» °o , that is H(X,y) is not coercive.
A different, but similar example, is the following: let fl, and Q2 be compact metric spaces
endowed with Borel measures ix and v respectively. We shall assume that fi2 is separable as well.
Let X=C(£2,), Y=C(£22) denote the continuous functions on Q, and fij respectively considered as
vector subspaces of L,(fl, ,u.) and L,(n 2 ,v).
Define A:X->Y by
(Ax)(m2) = \ a(a>2,coi)u(rfcoi).
eNHm
= eMAH)ree200>1084
IV ~ gNH
would have to satisfy Ax*=y, but again, A((d<p*)/(dt)(R(A*))) may be smaller than R(A) and
duality will fail. To use Theorem 6.2.3 and verify that the situation is not circumvented by going
first to the finite dimensional case, take X=L,(D,n) and Y=L,(fi2 ,v) but the rest as above. Let
{(B°2: n> 1} be a dense subset in Qj and consider the problems of size N, i.e. find
If we set
eNHm
N t
= eMAH)ree200>1084
IV ~ gNH
A,// = Z Xsr 5 k
eNHm eNHm
= e M A H ) r e e=2 0e 0
M> A
1 0H
8 4) r e e 2 0 0 > 1 0 8 4
IV ~ gNH IV ~ gNH
x„ = -^x»)^£(,«*r) = x-.
That is, if the xN's are a bounded sequence in L,(Q,|j.), x' would be as in Theorem 6.2.3.
But when A(d/dt q>'(R(A*) )) is smaller than R(A), that could not happen and X^, cannot be
bounded.
Applications and Extensions 67
Borwein proposed in [6.2] two ways of going around these difficulties. We shall cite one
of them and urge the reader to see [6.2] for details and for the method of penalization.
Theorem 6.2.6. (Relaxation). Assume q>* is everywhere finite and differentiable. Take
X=L,(a>, ,u) on a complete measure space and (Y, || ||) to be some normed space. Then the
supremum in (6.2.2) is attained (when finite). In this case consider, for e > 0, the relaxed problem
xt=-&iA'ks)
where \ is any solution to (DE)6. Moreover as 6 ->0, xek converges in mean to the unique
solution x* of (ME)E and
Sv(xt)^S9(x').
Not too long ago I attempted to show algebraists how to use standard maximum entropy
methods to solve linear equations. Much to my surprise, since many years ago a journal like
Linear Algebra and Applications has been publishing papers on the subject. Besides passing down
some more references, missed in [0.3], given to me via R. Brualdi, it will be the gist of this
section to compare the level 1 and level 2 approaches.
Consider for example the problem offinding{Ps: i=l,...,n} such that
We shall generalize shortly. For some applications see chapter 13 of [0.3] for example.
You could think of (6.3.1) as a problem of resource allocation, the Pj being the fraction of the
total resource allocated to mode i, or think of (6.3.1) as the problem of determining how loaded a
68 The Method ofMaximum Entropy
die may be from the knowledge of the mean earnings of a player that has bet (enough) on each
possibility.
Anyway, the standard approach consists of finding the {P,} maximizing
As we have seen many times so far, a candidate for P{ is exp-XCt / Z(X) with
Z(X)=lLexp-XC.. The Lagrange multiplier X is to be determined by minimizing the Hamiltonian
(the name physicists employ for the dual of the Lagrangian) H(X)=lnZ(X)+Xc.
Again, in chapter 13 of [0.3] the analysis on conditions for c to be in the range of
-5Z(X)/8\ is carried out. Certainly, since ECP;=C is a convex combination, we have to have
minC, < C < maxC,. If c is generated by experimental data, as in the second motivational situation
above, that will certainly be the case.
But when the consistency condition is not satisfied, you either drop your towel, or relax
your constraints and look for {P,} such that
£ P , = 1, |2P,-c|<s
■SV(P) = -Jptolnp0c)4iCc)
D
where we set u=Xj+l. Once the values X*, and X*2 that minimize
When n=2, a simple but lengthy computation shows that for c,<c<2, the direct solution or, any of
the two maxentropic methods yields
Actually the same is true for any square, invertible, reconstruction problem.
It may happen that minimizing H (A., ,XJ becomes too difficult for it may be too flat near
the minimum. In this case it may be convenient to use a genetic algorithm to minimize something
like
y ! i \w+ \y —- Aw'
r r
I J ' ' I I I ' ' I
S v=
eNHm eNHm
2 0e0M> 1A0 8H4) r e e 2 0 0 > 1 0 8 4
eMAH)ree=
Pi-~ gNH
IV IV ~ gNH
which is what you'd get solving (6.3.1) by proposing the solution PpX^+Xj and finding the right
eNHm
= eMAH)ree200>1084
IV ~ gNH
Before returning to the mainstream of this section, let us recap what we have done in the
following.
Comments.
i) Level 1 and level 2 approaches to reconstruction problems may yield the same answer
to a reconstruction problem.
ii) When using level 1 approach the choice of SM(P) is arbitrary whereas, when using a
level 2 approach, one agrees up on the Boltzmann-Gibbs-Shannon functional and plays with the a
priori knowledge one has, or can assume, about the range and distribution of the Xj. But, of
course, the choice of the entropy functional and the reference measures is totally arbitrary.
The following problem: find {x,: 0<Xj<l, i=l,...,n} such that
(6.3.3) A\ = \>
where the nxm matrix A and the vector be 95m are given. For a review about work in this
problem see [6.4], and for conditions for the existence of a minimum of the dual problem, i.e., a
minimum of H(X) associated with the standard maximum entropy problem see [6.5].
The way the second level maximum entropy method applies to (6.3.3) consists of
assuming an a priori measure, say du(x)=dx on f2=[0,l]°. By ep . : fi -»SR we denote the
coordinate map <pi(x)=xj. We look for measures P on O (equipped with the obvious Borel
c-algebra <&) having density p(x), that maximize S(l(P)=-Jp(x)lnp(x)dx subject to the constraints (
SAj,<p1(x))p(x)dx=bj, j=l,...,m.
The usual arguments provide us with
m
where C,=Z^ i=l,2,...,n, of course
7=1
Z(X)=\e-lc-*h*=fi(t£L)
(6.3.5) H(X)=£ln(^-)+(X,b).
Whenever life is nice to us and the minimum in (6.3.5) is reached, we know from chapter 5
that
PME(X) = ^ e x p - ( C * , <p)(x)
is the distribution on the set of all images that maximizes S (P). The maxentropic reconstruction is
To finish this section with a calculation that we shall make use of in the next one, assume
that we know or we have to impose the condition that the solution to (6.3.3) belongs to the set
{0,1}". In this case the measure du,(x) on fi=9T is
4i(x)=n{±(5„(<£c) + 5,(<fe))}
(6.3.7) xt = (e? + l)
which suggests one should look for values of X that make the absolute value of C^S \fo very
large when searching for X's that minimize H(X).
72 The Method ofMaximum Entropy
When we put out [6.6] and [6.7] we neglected to look through the published literature for
related work. To patch this up a bit, here we mention some maxentropic approaches at the linear
programming problem. Consider references [6.8]-[6.11] for example. Our approach is
nevertheless quite different. We consider the problem of finding
where D,={xe 91" ttSx^l, i=l,...,n), A0 is a fixed vector in SR", A is an nxk-matrix and c is a
fixed vector in 9?.
We shall assume that the nx(k+l)-matrix
eNHm
= eMAH)ree200>1084
IV ~ gNH
obtained by adding A0 to A as first row is of rank k+1. We shall also assume that A has at least
one row, say the first one, with all entries positive.
0<Xj<Ci/Av
*"(*)■ Ht)
The way we go about solving (6.4.1) is to find a c„ in 9? for which the maxentropic
solution to Ax=c fails to exist. The first such c0 will be the one we need.
Applications and Extensions 73
(6.4.2) Ax = c, x,e{0,l}
and on {0,1} put the measure m=l/2{80+8,}, which induces an obvious Q on fi={0,l}°. The
elements co of {0,1}° are configurations with probabilities P(co). As always, we denote by X,(co)
the i-th element of co and think of (6.4.2) as
where A1 is the vector (X0,A., ,..,XkH) and A, is the vector corresponding to the i-th column of A.
Again, A is to be found by minimizing
o
2) If c e K-K, then there exists a solution xJ; to Ax=c of the form
x* = 1 ieP
x\ = 0 ieN
Xj = Xi 0<Xi<\,iePuN
74 The Method ofMaximum Entropy
and depending on the vectors A;, we may have (A„ ,Aj) converging to a finite limit despite
||A"|| -> oo.These comprise the third case.
If c0 is such that ce K, then H*(c)= -oo. This is due to the fact that H(A,c) is convex on
5R*+I, and if
where cs 5R*, D, is the unit cube in SR" and BM(8)={ye 5R*|(y,My)<8} for a positive definite,
symmetric kxk matrix M.
Here instead of minimizing H„(A,c)=lnZ(A)+(A,c) to find the vector A of Lagrange
multipliers, standard procedure leads us to minimize
This section provides more substance to the way the second law of thermodynamics was
phrased in chapter 3. To make things simple we shall assume that the states of our system are
discrete and that microscopic dynamics are described by an infinitesimal transition matrix (or rate
matrix) W^. We shall denote the set of microscopic states by S.
Thus, if PXO denotes the probability offindingthe system in state i at time t, when at time
t=0 the distribution was known to be P,(0), then
Z.Wj, = -Wj,
and setting A(=W^ (this is the mean holding time at state i) we have Qjj=Wjj/A/ for jump
distribution (see [6.12] for more on this and other stuff).
Although we do not need to assume symmetry Wj=W» we do need to assume the
existence of a measure |iL with respect to which our dynamics satisfies the detailed balance
condition
(6.5.3) £n,»V = 0.
H,0)=EM^(r)
/
76 The Method ofMaximum Entropy
(6.5.4-a) Zrv0)W)=M
j
(6.5.4-6) EWV(/) = 0
and hence the name harmonic. We assume we have at least a few invariant functions. If we set for
any probability distribution {P,} and a given invariant distribution {\i-}
= -XE { £ u ^ - £ u , ^ } m i V u , -
Thus, for any initial value p,(0), S (p) increases until Pj=m for all i. From the point of view
of physical applications, we need a supply of invariant measures u.
Let f,,...,fN be N invariant functions, let m be any invariant measure on the set of states
and, as above, let
Z(X) =E mt exp-(X, fj)
i
where F is an element in W with components F ; , A and f(i) are in SKN with components A;, fj(i),
j=l,...,N. Also, we set <P,f > for the vector with components 2P,f](i)for j=l,2,...,N. If we think of
P as a row vector, then P(t)= PP(t) is also a row vector.
Applications and Extensions 77
Notice that for PsP(F), P(t)eP(F) for <P(t),f>=<P,P(t)f>=<P,fc> since the fj are
invariant.
Consider Sm(P) restricted to P(F). From what we know from before, there is a unique P'
in P(F) such that Sro(P*)=sup{Sm(P)|PsP(F)}. Also
Assume, which is reasonable for physical applications, that for PeP(f) lim Pft^P,^ exists,
and denote by P*(t)=P*P(t) the time evolved of P" Since Sm(P*(t)) is increasing on P(F) and its
smallest value Sm(P") is already the largest value of Sm(°) on P(F) it follows; from the uniqueness
of P*, that
P'(f) = P'F(t) = P*
Theorem 6.5.6. The measure P* yielding a maximum for the entropy Sm(P) over P(F) is an
equilibrium measure for the microscopic dynamics given by P(t).
It is not hard to conceive all sort of extensions of these results.
Let us say a few things about the use of the entropy as a Lyapunov functional.
Assume that {u,,} is an invariant distribution and {P:} is any distribution. We saw in Lem
ma 4.5 that -S (P) > 0 (here we let counting measure on S to play the role of what we denote by
u, there). Above we saw that dSydfeO, when we let P(t)=PP(t). Notice that when P happens to be
invariant, then S (P) is constant in time. We shall set
and we shall call it the attractor of u.. Notice that we exclude n from it.
Theorem 6.5.8. If Pe A(u) then P(t) tends to u as t tends to infinity whenever S ^ f ^ t O .
Proof: Consider first t = inf{t >0\ dS/dt = 0). Note from the computation of dS/dt given
above that the right hand side vanishes if and only if P.(t„) = n,.Therefore if t„<°o, P^t) = u- for
all t>0.
Consider now the case dS^/dtX) for all t>0. From (4.38) we obtain that
i(z|P,(0-n,|) 2 ^-W0).
Since the right hand side goes to zero, by passing to a subsequence if required, we obtain
P^t)-)^ for all i.
78 The Method ofMaximum Entropy
Comment. Note that assuming that P(t)->Peq is not enough, it may happen that S^CPJ^O,
and (S |Pi(t)-M.J)2 may only oscillate in the interval (Q,SJ[P^)).
Let us state a few problems leading to search for solution of the matrix equation
(6.6.1) AX=C
where A, X, C are respectively nxm, mxk and mk matrices. Here A and C are given and we shall
require the unknown matrix X to have its components in a preassiened convex set. We direct the
reader to [6.4] and [6.16] for more on related issues, namely, different problems leading to matrix
equations like (6.6.1) and their solution via the level 1 maximum entropy method.
Example 1. Let Ay denote the intensity of spectral band i, l<i<M of a substance j , l<j<N.
Assume that the intensity Cj in the i-th band for mixture is known and we want to know the
concentration xi of substance j in the mixture. Certainly the normalization 0<x.<l, for l<j<N is
natural in this case.
Example 2. Consider the problem offindingthe generalized inverse X of a matrix A. The
whole thing here is that A may not be a square matrix. The matrix equation defining X is
(6.6.2) AXA=A
For a very fast review and analysis of best solutions in norms other than 12 see [6.16],
Example 3. Consider the extension of (6.6.2) to either of
Example 4. Relating stimuli to responses by means of linear maps. Suppose you encode
stimuli by vectors in certain Wand have m of them, described by {S^ l<i<n, l^j<m}. Assume
that the system under scrutiny responds linearly to the stimuli to produce k different responses
encoded by vectors in 9?m You want to know the mechanism, or transfer matrix such that
(6.6.4) SX=R.
Applications and Extensions 79
You may need different k and m because, say, the independent or different stimuli may
yield common or related responses. Besides that you may know in advance, or need to assume
that
the x^ are to take values in some preassigned set, {-1,1} say.
For the fun of it, we shall look at the problem of finding a lxn-matrix X , the inverse of
the nxl-matrix A, such that (6.6.2) holds and -||A||<X<||A||. On ft=[-l,l]° equipped with the
Borel a-algebra we shall define the measure m(dx) with density 2° with respect to dx=dx,,...,dxn.
Denote by £f the coordinate maps ^i(o))=coi and by 3>((<B) the map
a , IZafcjim)
a, afabo)
i
where ^ denotes the j-th component of A. We shall look for measures P on ft such that
==aja,
Ep$>j --
Ep®j
dP = (Z(X)r'exp-(>., <b)m{dx)
which satisfies (X; ,i$= 1 so that AXA=A (So, we can go to sleep with a feeling of being
consistent. Some may think that this is a dum way of writing x^A/UAH2)!. This would be true if
||X||=1/||A|| which is not apparent from the result found above. Also, try to find X using singular
values decomposition.
80 The Method ofMaximum Entropy
The following is a variation on the theme of a nice paper by Bard, [6-18], in which state
probabilities are estimated, using the MEM., from the knowledge of probabilities of a collection
of sets.
Suppose S is a finite set, with atoms J„..,J n and the a-algebra S consists of the collection
of subsets of S. Any probability P on S is then determined by the P({SJ).
Also, if (XJ denotes a time homogeneous Markov chain, having S as state space, then the
transition matrix
(6.7.2) Pv=P(XleC,j\Xa=:Jl)
where the C,j are a collection of not necessarily exclusive nor exhaustive events. In terms of the
Pj,, the P,j can be rewritten as
(6.7.4) SP,j = l
is a condition satisfied if the chain is conservative (When it does not hold, we throw in a cemetery
state to enforce it).
So our problem becomes that of determining for each i, a collection P. satisfying (6.7.4)
when all that is known is (6.7.3). Dropping any reference to the index i, we are in the situation
discussed by Bard. So, we will follow him.
Applications and Extensions 81
Given sets Cj, j=l,2,...,K; we denote by DJ; l<j<M the partition of S induced by the Cj(
that is, Dj is a mutually exclusive, exhaustive collection such that any set in the a-algebra
generated by {C^} is a uniori of sets from {D^. In particular
eNHm iA
(6.7.6 -a) = eMAH)ree200>1084
IV ~ gNH
From the solution to this problem, the original problem drops out, for the procedure can
be carried out for each starting point i.
Observe as well that once the Q, are known, we can setup the problem of finding Pj;
j=l,...,N such that
(6.7.7) XP, = l
So Bard's technique., applied twice in succession, provides us with the complete collection
of (transition) probabilities. Let us apply the MEM. to solve the set (6.7.6). Again, denoting by
X the Lagrange multiplier oorresponding to (6.7.6-b) we would obtain, after an application of the
level 1 routine
where we set x(A)=l or 0 depending on A being empty or not. Finding the X* such that (6.7.6-b)
are met provides us with the Q*, that maximize the entropy and satisfy (6.7.6-a).
If you compare (6.7.8) with Bard's results you will notice some differences, stemming
from the fact that he does not have condition (6.7.6-a). As a simple minded application, that can
be worked out by hand consider the problem of figuring out the probabilities of the different
outcomes of a die throw when you only know
P,=P(2,3,4) = i ? 2 = (3,4,5) = i
The sets C,={2,3,4} and C2={3,4,5} determine the partition D,={1,6}, D2={2}, D3={3,4} and
D4={5} of the sample space.
According to (6.7.8) the outcomes of the throw fall in these sets with frequencies
e X
~ '=7T e~^ =P2/{\-P2)
Z=l/(l-P,)(l-P2)
9i = i 92 = \, qi = f, q* = j
Pi = i P* = i P^\, P* = i P5=i PS = 1
A larger scale application of this technique could be the following random search
algorithm. Let 1=1,...,N label the points of a grid and let P, denote the probability offindingthe
particle, individual, oil, or water at i.
Assume you have a way of assigning areas to detection procedures and you determine
P , = P ( C , ) = E Pk
keC,
by some experimental procedure. For example, P, is the fraction of successful detections in region
Ck. The C|, i=l,2,...,k are some not necessarily disjoint nor necessarily covering of the whole
domain. The procedure outlined above would yield the P,.
Even though we shall be following [6.19] we urge the reader to take a look at [6.20] on
which it is based. Especially the section devoted to the choice of the a priori profile. Instead of
directly applying the results in section 4 of chapter 5 I shall repeat myself a bit and, restate the
results of [6.19]
At given instants t,,...,^ during an interval [0,T] the following mean square averages are
some how determined
Also, to avoid ridiculous complications we assume that the tj happen to coincide with
points of the partition. And we will want to think of the xk as E ^ and, as usual of (6.8.2) as
Even though somewhat unrealistic from the physical point of view, we shall assume that
the random variables X,^ take values in the interval [L,<») and on f ^ r L , * ) N we shall define an a
priori reference measure dQ(x) with density
with respect to the density dx=dx, ...dxN on rL,°o)N. The x„(i) are chosen so that x„(i)>L and
therefore
5
e(^) = - f I p(x)lnp(x)V(x)ax
where <p(i) is the n-vector with components pit and X is in SRW The maxentropic P will have
density PN(x) given by
wherea
' = ^b+(^(p('))-
Applications and Extensions 85
where d is the n-vector with components dt, j=l,2,...,n. We are through. Setting xjxjr^fjii), L=v,2
and letting N tend to infinity we would have
where of course (p(t) is the n-vector with components Pj(t). To simplify, as in [6.19], we set
V^O, V02(t)=Vu2 constant, which is to be added to Xa.
It is easy to verify that
d} =1 —f = dH + U - trll[vl+l Xk) ) .
i
This is a part of a project, once started with L. Dohnert, based on [6.21]. The problem
studied there is to understand thefragmentationof a heavy nucleus by a fast light nucleus.
The everyday language description of the process consists in supposing that the large
nucleus gets "hot" when it absorbs the kinetic energy of the smaller nucleus. Upon cooling down,
it condenses in globules that fly away. The problem is to find the distribution of the fragments.
To be precise, we specify the outcome of a reaction by giving {n(ij): 0<i<j, ij integers},
where n(ij) is the number of fragments of mass j (measured by the number of nucleons) and
charge i (measured by the number of protons).
The macroscopic constraints on n(ij) are
eNHm
= eMAH)ree200>1084
IV ~ gNH
pi
(69.1) e({«(',y)})=2'»(»j)
pi
M{{n{i,j)})=lMi,j)
pi
The meaning being: F({n}), Q({n}) and M({n}) stand for the number of fragments, the
charge, the mass of the distribution {n} respectively. We want to find the measure P({n}) defined
on the set of all possible configurations, and such that
EP({«}) =1
TP({n}mW) = N0
(6.9.2)
ZP({n})Q«n}) = Z0
ZPan})M({n})=A0
{«)
where the numbers on the right hand side denote the average number of fragments, charge and
mass respectively. The maxentropic procedure would yield a probability
where, as usual
Z(X) = Z e x p - { A . , ^ { « » + X2Q{{n}) + X s M W ) }
00 00
which is what any decent physicist would do. By differentiating we obtain the integral analogues
of (6.9.5) and the X can be found by minimizing
over the set D={X|Z(X)<°o}, which has to be precisely determined. This set seems to be the
positive orthant in 5K3 Can you enlarge it?
88 The Method ofMaximum Entropy
Suppose you want to recover a continuous function fix) defined on [0,co) such that either
f(t) tends to zero as t goes to infinity or, that its growth rate is such that for some a 0 >0
f(t)exp-a0t tends to 0 as t goes to infinity. Suppose that you know
where 0<a0<a, ,...,0^,, and you want to recover f(t). Note that the change of variables t=-ln s
transforms that problem into recovering x(s)=f(-ln s) from
l
(6.10.1) x(a,)=\x(s)sa'-lds i=\,...,M-
o
and B(s): il -»SR, B(s)(<n)=co(s) denotes the standard brownian motion on [0,1]. Here,
I
\x0(s)dB(s)
o
denotes the standard It0 integral of x„(s) with respect to B(s). All these probabilistic constructions
are described in [6.22].
Again, standard reasoning yields
Applications and Extensions 89
(6.10.2) EPo[B(t)]=lx0(s)ds
and it follows that under P0, and if we denote by <D(s) the SR" -valued function on [0,1] with
components ^(s) = s", note that
where x0 is the vector whose components are the Laplace transform of (the initial guess) x/s).
Ours maxentropic problem now is to find a law P on (Cl,dF) such that Sro(P) achieves the
maximum value over the class of measures Q on (£l,d&) such that Q « P0 and
(6.10.4) EQ\^(s)dB(s) =i
where Z(k) can be explicitly computed (fly me in and I'll tell you how, but it is really simple)
from which it follows that to obtain X' that makes P the measure that satisfies (6.10.4) we have to
minimize
HQ,) = f 1 {<X,«I»(s)>V -2(\Ms))xo(.s)}ds + (\,i)
o
= ±{(k,a.)2-2(k,z0)}+(h*y
(6.10.5) r=C-'(x0-x).
Note that when the Laplace transform of xje'1) coincides with that of x, then X' is 0 and P0
is already the maxentropic measure on (£!,<#"). The maxentropic reconstruction of x(s) is
jx(s)<Hs) = x.\
o
The first approach is a variation on the theme developed in the previous section, the second
evolves according to a discretization procedure as in section (5.4). We direct the reader to [6.23]
for a level-1 like approach.
The (inessential) difference with the setup of (6.10) is that we allow for the initial point
B(0) of the brownian motion on [0,1] to be started with a distribution such that the Wiener
measure, W on (fi,<^) satisfies W,'(B(0)s A)=u(A) and j£u(d!;)=0.
The measure P,/1 is similarly defined, i.e., for any measurable functional H
E»[H) =E^[HM0]
Applications and Extensions 91
1 1
with Wo = exp J x0(t)dB(t) - \ j xQ(t)dt
z
lo o
Again, x0(s) is the a priori knowledge we have of x(s). Instead of (6.10.2) we now have
(6.11.4) £?Q*(s)dB(i)] =b
where b is the vector in (6.11.1). Again, mutatis mutandum, everything is as before, with
eNHm
= eMAH)ree200>1084
IV ~ gNH
tf(X) = }(\,CX.)-(A.,b-b°)
achieves its minimum. Now things are simpler due to the orthogonality properties of the
{ek(s):-M<k<M}. Actually C=C"' is a rather simple matrix, to wit.
0 /
1
/ 0
Again, as in the previous section, the maxentropic reconstruction x*(s) of the function x(s)
92 The Method ofMaximum Entropy
where x^Q/N) and ek(i)=ek0/N)- We shall assume that each x; is the mean value of the j-th
element of a collection (X|„...,XNI) defined on Q=[-1,1]N as Xj(x)=xp and the a priori reference
measure Q(dx)=dx/2N is defined on the Borel sets of fi.
We want to find a probability P(dx) having density p(x) with respect to Q(dx), yielding a
maximum value for
^ = -^Jp(x)lnp(x)A
subject to the constraints
where X is the 2M+1 vector with components X_M ,..., X0,..., \ , and e(j) is, for each j=0,...,N-l,
the2M+l vector with components e.M(j),...,e0(j),...,eM(j).
The value of X that makes the usual maxentropic p(x) satisfy (6.11.7) is obtained by
minimizing
H„(X) = jj\nZ(X) + (X,bk)
eNHm
= eMAH)ree200>1084
IV ~ gNH
Applications and Extensions 93
* ; = [<A,e(,)>-l-tanh(X,e(0>]
Here we present a few basic results on the problem of reconstructing a time series, or to
be more precise of reconstructing a second order stationary process, from the observation of this
values at afiniteset of times. In almost any of [0.3]-[0.11] there is at least one paper on this issue,
but what follows is lifted mainly from [4.1] and [6.25]-[6.27]. For even more references and
applications see [6.28].
To establish notation, let us recall a few basic facts, the proof of which appears in chapter
9 of [6.29] and chapter 9 of [1.2].
Let {X^nsZ} be a sequence of random variables. We shall say that it is a weakly
stationary, centered process if for any n
We shall say that {Xn} is Gaussian (or {XJ is a Gaussian process) whenever for any finite
collection {n,,^,...,^} of integers and anyfinitecollection {a,,...,^} of real numbers, the random
variable a,X(nl)+a2X(n2)+...+amX(nra) has a Gaussian distribution. Some authors phrase it like:
{Xn} is a Gaussian process if and only if the vector (X(n,),...^(n.J) is a Gaussian K"1 valued
random variable.
Anyway, in terms of the Fourier transform of the distribution of (X(n,),...,X(nnl)), the
Gaussian property is stated as
£[exp/£a)fcAT/u)] = exp-j(a*,Ca)
The following two basic results show why the correlation function is important for the
reconstruction of the process {X,,}.
Theorem. (Bochner-Herglotz). There exists a positive bounded measure u. on I=r0,27t)
such that
To state the next result, we need the concept of random measure or random kernel. Let
(fi,<#",P) be a probability space, let B(7) denote the Borel a-algebra of subsets of I=[0,27c). Then
Definition. Z: B(7)xQ-»rO.°o) is the random kernel associate with the measure u on I if
and only if
a) A—>Z(A) is afinitelyadditive function and £Z(AT1) converges to Z(A) in L2(fi,dP).
b) EZ(A)=0, E[Z(A)Z(B)]=EZ(AoB)2=u(AoB).
Bochner-Herglotz theorem is to be complemented with the following.
Theorem. Let ufdV) be the spectral measure associated to RfnV Then
(6.12.3-6) Xn=\emaZ(da).
i
(6.12.4) £akX„-k = z„
fc=0
Applications and Extensions 95
where e„ is as in (I). We leave for the reader to verify that n(da)=g(a)da with
where
q(x)=I, akxk
(612.6) gia.)=±Z.R(.n)e^
Then the maxentropic distribution, compatible with the knowledge provided by (6.12.7) is
given by
where of course
Z(X) = iexp-(X,A(x))A, (X,A)=|x*A t .
and the left-hand side of (6.12.7) can be completed from the data as
~ N-k
(6.12.9-6)
5.12.9-6) Ak = j^Nfx^.
A-k=Ak, A-k=Ak
[ \H \i-j\ < M
A# =
1 0 \i-j\ > At.
g(Z)=ZX i Z*
-M
then for N » M , the eigenvalue § tends to g(Z) with Zj=exp(27tj/N+l)i and therefore
The first identity follows from computing the entropy of P,XX) given by (6.12.10), and the
limit is an exercise in Riemann integration theory. The important fact is that gj -> g(Zj), which can
be found in [6.30].
The nonlinear relationship between the R(k) and the \ is contained in
2*
R
(6.12.11) ^ = -W^2{\) = \-^rdQ ~M<k<M
0 «(.«'
The left-hand side of (6.12.11) can be computed from the data, and once the \ are known
for |k|<M, the right-hand side of (6.12.11) determines R(k) for |k|>M. Therefore, the procedure
outlined above can be applied.
Let us now consider two more, equivalent, inductive ways of getting the R(n). The first
approach consists of computing the entropy of the joint distribution of the first N+2 variables
^.....X^,., of a Gaussian process as
Remember that our problem is to find the R(k) for |k|>M. Take N>M to begin with and
assume that R(0),...,R(M) are known. It is clear that the value of R(M+1) that maximizes det A
(M+l) is the same that maximizes S(M+1). We denote also that
Since det A (M+l) is a quadratic function of R(M+1) with a negative derivative, then it
has a unique maximum. The allowed values of R(M+1) fall between the values of y(N+l) that
make the det A (M+l) zero.
Choosing the R(M+1) that maximizes det(A(M+l)) we maximize the entropy S(M+1). It
is clear then that this procedure yields the R(k) for |k|>M.
The other procedure that produces the same result is to assume that our process is an
AR(M), autoregressive process of order M, satisfying
Now, if we know R(0),...,R(M) we could use (6.12.14) to solve for R(M+1). But a simple
computation shows that if £(0)=R(0), «(l)=R(l),...rR(M)=R(M) are known, then
Applications and Extensions 99
R{\) . . R(M-l)
det l<fdelA(MH)| _ -
2 dR(N+l) 'R(M+l) ~ U
VR(M+1) . . R(l)
which, as we saw above, is the condition determining the R(M+1) that maximizes S(N+1).
Observe that the values b„...,bM can be obtained from the first M equations above. Thus if
covanances are our only information, and AR(M) process is the candidate from process having
the given covariances.
We could arrive at the same conclusion by yet another way. To wit, consider the
(differential) entropy rate denned by
S=lim S « p = i b C t « ) + J L ? ln(27rg(a))</a
g(a) = ± S^ R(n)exp-(ina.)
and we have
Theorem. The random process {XJ which maximizes the differential entropy rate S,
subject to the constraints
SN(Y1,...,YN) = -lp(yu...,yN)lnp(yi,...,yN)dy,,~,dyff
<5(Zi,...,Z w )+ Z 5,(ZtIZt-i,...,Zt.w)
Aft-i
eNHm eNHm
= e M A H =) r eeeM
2 0A0H> 1) 0r e8e42 0 0 > 1 0 8 4
IV ~ gNH IV ~ gNH
100 The Method ofMaximum Entropy
= S(XU...,XM)+ S S(Xk\Xk.u...,Xx)
M+l
= SM(X\,...,XN).
Proof: We know that for any positive f,g on 91", Jf ln(f/g)dx >0. To verify the first
inequality let pN(x) stand for fi» and put g(x)=[(27t)Ndet Cfexp-'/2(x,Cx) where C is the
correlation matrix of Y„...,YN computed with their joint density pN(x). The next one is an
application of Lemma 4.15 and the one right after follows from Lemma 4.16 (actually, a simple
variation on the theme thereof).
The following identity is obtained when we exchange the Gaussian families. The next to
the last identity follows from the Markov property and the last step is justified by the same
reasoning that implied the second step. Therefore
which almost completes the proof. It only remains to show that {XJ satisfying (6.12.13) is a
Gauss-Markov process having the correct correlations and its spectral density is obtained
inverting (6.12.3-a) for the appropriate R(n)'s.
(6.13.1) x(t)=flt)-\K(t,s)x{s)ds
a
where, to make things easy we assume a<s, t<b and the regularity assumptions needed on f and K
will become clear below. This set up can be extended in several obvious ways.
Let {M„(t): n> 1} be a collection of linearly independent functions. Multiply both sides of
(6.13.1) by M^t), integrate over (a,b) with respect to dx (or any appropriate m(dt)) and obtain
(613.2) a„=jx(s)G„(.s)ds
Applications and Extensions 101
where
(6.13.3-a) a„=\Mn(t)fls)dt
a
Here, we see what the minimal assumptions on f and K are. The functions f, K, M„ have to
be such that the integrals above exist, that all exchanges of integrals make sense. What else?.
Under the assumption of positivity on x(t), Mead replaced the problem of solving (6.13.1)
by the problem offindinga maxentropic solution x^t) maximizing
- j pit) Inptfdl
a
where the \ , i=l,...,M are such that (6.13.3-a) holds. A few examples are provided in [6.31] as
well.
Here we mix approaches a bit to further illustrate maxentropic reconstruction techniques.
To begin with consider the discretized version of (6.13.2)
(6.13.5) n = A E «(i)x(i-l)
/=i
where A = (b-a)/N, <bji) = G„(i-1), n = 1,...,M. We shall consider on fi-»SRw the Gaussian
density p0(£) = exp-(5°/2)/[27t]W2 which makes the coordinate maps X, : Cl -> 91, X ^ ) = £,
independent, centered Gaussian random variables with covariance EjfXjXj] = A8a.
Given an initial guess x„(i) , i = 0,...,N-1, of x(i) we introduce a new auxiliary measure
P,(d£) on £1 such that
with respect to P„ we have E/Xj) = x„(j-l)A. We now ask for a measure dP(£), having density
p(i;) with respect to dP„(4), yielding a maximum for SPa(P) over the set of P' such that
»(z ©(0*,) = i
And when N tends to infinity and i/N tends to t via an appropriate sequence, we obtain
x(t)=xo(t)-(\>Ht))
b
C„m =| G„(s)Gm(s)ds.
Applications and Extensions 103
Maxentropic Image Reconstruction methods have made it to movies and may have,
perhaps, contributed a lot to popularize maximum entropy. The references [6.32]-[6.37] are to
serve as starting or guide to literature. Below we present a variation on the theme, in which the
set up is taken from the literature, but we apply to it a level 2 reconstruction technique.
The standard formulation of the problem consists of assuming the (compact) domain
containing the picture to be divided into N cells and imagining the intensity Cn in the n-th cell to
be superposition of the impinging unknown intensities x(j) according to a blurring function b„_j. To
make things worse, there is noise contaminating the background in an additive way. Thus Cn is
actually
where the vn describe the noise measured in the n-th cell. The stochastic nature of the vn is part of
the data, or of the assumed a priori knowledge. Here we shall assume the vn to be centered,
Gaussian random variables with variance ak.
We will assume the x, to be the mean values of random variables Xi with respect to a
distribution dP(l;) on 9?w, and we shall consider an a priori distribution dP0(O on SR^ with respect
to which the X, are independent and gamma distributed as
To deal with the random nature of the constraint, notice that (6.14.1) implies that
Now, this is the set up dealt with in section (6.1) there we proved that the maxentropic
dP*(4) was such that
i P ® = (Z(J))- l «iH>.BQ* , o©
where of course n; = SBj, X = Sb^, X.j. Notice that we let the parameters a and p depend on i to
allow for different "illuminations" of the picture. (By the way, perhaps more physically reasonable
candidates for the a priori distribution could be used. Different situations may merit doing so.)
The value of X that makes dP(£) satisfy the constraints can be found by minimizing
2
(6.14.4) H(X) = lnZ(X) + <X,C) + Xo.95(s Xja/j
eNHm
= eMAH)ree200>1084
IV ~ gNH
Just for the fun of it, had we assumed that each X, is uniformly distributed on [0,M], M for
maximum, then, in this case
eNHm
= eMAH)ree200>1084
IV ~ gNH
and, once the corresponding version of (6.14.4) had been minimized for X, the corresponding
maxentropic image is
For any set c we denote by |c| the cardinality of c. Let X(t) denote the state of the system
at time t, we shall define the transition matrix by
(6.15.1)
N
k=\
*)\ + Vk\A(t,*)l) y=*
Q*y = vk y*x, yeB(k,x)
uk y*x, y e A(k,x)
0 otherwise
where the rate function Vk is assumed positive and, it is interpreted as the mean service or
discharge time and Uk is the mean arrival time.
106 The Method ofMaximum Entropy
for all xs S.
To guess a candidate for q(x) we invoke the following
Lemma 6.15.3. The probability distribution on S that maximizes
S(q) = -£q(x)]nq(x)
xeS
subject to the constraint Syk(x)q(x)=mk (the mean number of individuals of type k) is given by
q(x)=n/rtZ(y).
k=l
Proof: Do as usual but denote exp-Xj by y; where X^ is the usual Lagrange multiplier.
If we substitute the q(x) given by the lemma in (6.15.2) we see that the candidate for yk is
(V/UJ thus we arrive at
Lemma 6.15.4. The equilibrium distribution on S satisfying (6.15.2) is
eNHm
= eMAH)ree200>1084
IV ~ gNH
where
z(u,v) = s nf^-)1
For a bunch of nice applications of these ideas the reader is directed to [6.40].
Apparently it was E. Schrodinger in 1931 who first set up the problem offindingthe p(ij)
= P(X = i,Y = j) as "similar or close" to a given P0(i j) such that the marginals
are known before hand. Take a look at [6.41]-[6.42] for some history on this problem and for its
analysis without using max-ent procedures.
Here the closeness between p(ij) and p0(ij) will be measured in terms of
Actually, instead of I, we should put al„ where a; is the (known) fraction of income of
group i spend on goods.
Certainly (l-ajlj is saved or invested and we assume it does not determine the
consumption pattern.
The following two examples were reviewed in [0.3]. The first one consists of assuming P^
to be the number of trips between origin i and destination j , i<M, j<N. The number of trips
originating from i is known to be 0> and the number of trips coming into destination j is known to
108 The Method ofMaximum Entropy
be Dj. The trip pattern is useful for urban planners when deciding where to build roads, gas
stations or whatever.
The second similar situation we describe consists of the problem of determining an
international trade pattern P^ measuring the amount of commerce between country i and country j ,
when the only assumed informations are the total exports of country i and the total imports of
country j .
We direct the reader to [0.3] for original references and for a description of how to
convert the reconstruction problem into the problem of finding a density given its marginals.
Below we will do it as an application of the level 2 procedure.
If we introduce the constraints F_i(n,m) = δ_in, G_j(n,m) = δ_jm, then E_P F_i = f_i, E_P G_j = g_j, and the candidate for maximizing S_{P_0}(P), or minimizing the K(P,P_0) given by (6.17.1), is
P(n,m) = Z(λ,u)^(-1) exp(−λ_n − u_m) P_0(n,m),
where λ_i, i = 1,...,M, are the Lagrange multipliers corresponding to the constraints E_P(F_i) = f_i and the u_j are similarly defined. Also,
Z(λ, u) = Σ_{n,m} e^{−λ_n} e^{−u_m} P_0(n,m).
If we set Φ_n = exp(−λ_n) and ψ_m = exp(−u_m), then the Φ's and ψ's are to be determined by solving
(6.16.3)  Φ_n Σ_m ψ_m P_0(n,m) = f_n Z(Φ,ψ),   ψ_m Σ_n Φ_n P_0(n,m) = g_m Z(Φ,ψ),
or equivalently, by minimizing
H(Φ,ψ) = ln Σ_{n,m} Φ_n ψ_m P_0(n,m) − Σ_n f_n ln Φ_n − Σ_m g_m ln ψ_m.
When P_0(i,j) = P_1(i) P_2(j) this problem has the obvious solution Φ_i = f_i/P_1(i), ψ_j = g_j/P_2(j), which yields the obvious P(i,j) = f_i g_j as an answer. The set (6.16.3) can be "simplified" a bit by setting up a max-ent problem to determine P(i|j) = P(i,j)/g(j) or P(j|i) = P(i,j)/f(i).
To begin with, note that P(m|n) satisfies the constraints Σ_m P(m|n) = 1 and Σ_n f_n P(m|n) = g_m, and the corresponding maxentropic candidate is
P(m|n) = Z_n^(-1) P_0(n,m) exp(−u_m).
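A minimal computational sketch (not from the book) of solving the marginal-matching system (6.16.3) for the scaling factors Φ_n, ψ_m by alternately enforcing the row and column marginals, an iterative proportional fitting / Sinkhorn-type scheme; the prior P_0 and the marginals f, g below are illustrative assumptions:

```python
import numpy as np

P0 = np.array([[0.2, 0.1, 0.1],
               [0.1, 0.3, 0.2]])       # prior guess P_0(n, m) (assumed)
f = np.array([0.5, 0.5])               # target row marginals f_n (assumed)
g = np.array([0.3, 0.4, 0.3])          # target column marginals g_m (assumed)

phi = np.ones(P0.shape[0])
psi = np.ones(P0.shape[1])
for _ in range(200):
    phi = f / (P0 @ psi)               # enforce sum_m P(n,m) = f_n
    psi = g / (phi @ P0)               # enforce sum_n P(n,m) = g_m
P = phi[:, None] * P0 * psi[None, :]   # the normalizing factor Z(Phi, psi) is absorbed here
print("row marginals:", P.sum(axis=1), "column marginals:", P.sum(axis=0))
```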
To treat the problem as a reconstruction, assume that the X in (6.17.2) are positive and let B_ij = min(f_i, g_j). Thus 0 ≤ P_ij ≤ B_ij. We shall consider a collection of random variables X_ij, each taking values in [0, B_ij], uniformly distributed there, and mutually independent relative to a law P_0.
The entropy to be maximized is
S_{P_0}(P) = −∫ (dP/dP_0) ln(dP/dP_0) dP_0,
subject to the constraints
Σ_j E_P X_ij = f_i,   Σ_i E_P X_ij = g_j.
The corresponding partition function factors as
Z(λ, μ) = Π_{i,j} (1 − e^{−(λ_i+μ_j)B_ij}) / ((λ_i+μ_j)B_ij),
and the multipliers are obtained by minimizing the dual function
H(λ, μ) = Σ_{i,j} ln[(1 − e^{−(λ_i+μ_j)B_ij}) / ((λ_i+μ_j)B_ij)] + Σ_i λ_i f_i + Σ_j μ_j g_j.
The problem of reconstructing a measure defined on an interval (a,b) with −∞ < a < b < ∞ has a long history and a lot of mathematics has been devoted to it. See the nice review in [6.45].
Here we will lift some results from [6.47] and [6.48] on the convergence of maxentropic
estimates, and we direct the readers' attention to [6.49]-[6.50] for related results using more
functional analytic techniques and to [5.5]-[5.7] for a more probabilistic approach.
We shall denote by P the set of probability densities on [0,1]. It is known that
Theorem 6.17.1. Given a sequence {μ_n} of positive numbers such that μ_0 = 1, there exists a bounded f ∈ P such that
∫_0^1 f(x) x^n dx = μ_n,  n ≥ 0,
if and only if {μ_n} is completely strictly monotonic and there is a constant M such that
When the strict inequalities are replaced by ≥ 0 we obtain a completely monotonic sequence, and we know that a measure μ(dx) exists on B[0,1] such that
∫_0^1 x^n dμ(x) = μ_n,  n ≥ 0.
For all about this see Widder's [6.51].
For a given f ∈ P having moments {μ_n} we set P_n(f) = {g ∈ P : ∫_0^1 x^k g(x) dx = μ_k, k = 0,1,...,n}, and if we put S(f) = −∫ f(x) ln f(x) dx for f ∈ P, then a sequence f_n ∈ P_n(f) maximizing S over P_n(f) is called a sequence of maximum entropy estimators for the moment problem.
Each f_n(x) is of the form
f_n(x) = exp(−Σ_{k=0}^n λ_k x^k),
where λ_0 = ln Z_n(λ),
Z_n(λ) = ∫_0^1 exp(−Σ_{k=1}^n λ_k x^k) dx,
and the multipliers are obtained by minimizing
H_n(λ) = ln Z_n(λ) + Σ_{k=1}^n λ_k μ_k.
Comment. To be consistent we should have written λ^(n) = (λ_1^(n), ..., λ_n^(n)), but what the heck.
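As a minimal numerical sketch of the construction above (not from [6.47]; the quadrature grid, the optimizer and the test density are assumptions), one can recover f_n by minimizing the dual H_n(λ) directly:

```python
import numpy as np
from scipy.optimize import minimize

grid = np.linspace(0.0, 1.0, 2001)
dx = grid[1] - grid[0]
integrate = lambda y: float(np.sum(y) * dx)          # crude quadrature on [0,1]

def maxent_density(mu):
    """mu = [mu_1, ..., mu_n]: moments of a density on [0,1] (mu_0 = 1 understood)."""
    powers = np.vstack([grid**k for k in range(1, len(mu) + 1)])
    dual = lambda lam: np.log(integrate(np.exp(-lam @ powers))) + lam @ np.asarray(mu)
    lam = minimize(dual, np.zeros(len(mu)), method="Nelder-Mead",
                   options={"maxiter": 5000}).x
    f = np.exp(-lam @ powers)
    return f / integrate(f)

# Reconstruct f(x) = 2x from its first three moments 2/3, 1/2, 2/5 (test case, assumed).
f3 = maxent_density([2/3, 1/2, 2/5])
print("moments of f_3:", [round(integrate(f3 * grid**k), 4) for k in (1, 2, 3)])
```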
The following first Lemma is proved in [6.47]
Lemma 6.17.2. A necessary and sufficient condition for H_n(λ) to have an absolute minimum is that {μ_n} is completely strictly monotonic.
The idea of the proof is to write λ = ρu with ‖u‖ = 1, ρ ≥ 0, and rewrite H_n(λ) as
H_n(λ) = ln ∫_0^1 dx exp(ρ F_n(x))
with
F_n(x) = Σ_{k=1}^n u_k (μ_k − x^k).
Comment. This proof has a drawback: it needs the existence and complete strict
monotonicity of the whole sequence of moments. The reader should go to the references of [6.19]
to see how to proceed when only μ_0, ..., μ_n are known. The next nice result in [6.47] is
Theorem 6.17.3. Let f(x) be a non-negative integrable function on [0,1] having moments μ_0, μ_1, .... (Dividing f(x) by μ_0 we obtain a density.) Now let f_n(x) be the maxentropic densities described above. Then for any bounded function F(x) on [0,1]
lim_{n→∞} ∫_0^1 f_n(x) F(x) dx = ∫_0^1 f(x) F(x) dx.
The proof rests on the functions
ψ_n(x) = ∫_0^x (f_n(t) − f(t)) dt.
Since the (with respect to dx) absolutely continuous measures of finite total variation lie in the dual of the bounded functions on [0,1], and the unit sphere there is weak-*-compact, given the ψ_n(x) there exists a subsequence, denoted by ψ_n(x) again, and a function of finite total variation ψ(x) such that ψ_n(x) → ψ(x). Since for all k
0 = ∫_0^1 x^k dψ_n(x) → ∫_0^1 x^k dψ(x),
and since the right-hand side is therefore zero for all k, the uniqueness of the moment problem asserts that ψ(x) = 0 on [0,1], which amounts to what we want.
Note that, given the special form of f_n(x) and the fact that f_n(x) and f(x) have the same moments up to order n, we have S(f_n) ≥ S(f). Actually, when n_1 > n, we also have S(f_{n_1}) ≤ S(f_n) (which can also be guessed from the fact that P_{n_1}(f) ⊂ P_n(f)). The gist of [6.48] is to prove
Theorem 6.17.4. Let f be a bounded density. Then the maxentropic sequence fn introduced
above satisfies
In view of the results of section 6.2 and of Lemma 6.17.2, if we attempted to reconstruct f(x) by a max-ent procedure we would have to prove that (for example)
H(λ) = ln ∫_0^1 dx exp(−Σ_{n=1}^∞ λ_n x^n) + Σ_{n=1}^∞ λ_n μ_n
achieves a minimum over some appropriate (candidate?) set of infinite dimensional λ's.
Suppose group i consists of n_i individuals with income I_i, each paying the fraction f_i of it as tax, so that the total revenue is
(6.18.1)  T = Σ_i f_i n_i I_i,
with
a_1 I_1 < a_2 I_2 < ... < a_n I_n and a_i I_i < I_i.
In this fashion a maximum tax 1 − a_i is preassigned which, if applied, does not make the richer poorer.
The question is how to raise T within these constraints, that is, how to find f_i such that (6.18.1) holds and the bound 0 ≤ f_i ≤ 1 − a_i is satisfied. This is a level 2 reconstruction problem as described in section 6.3, but let us redo it
from scratch here.
Define on Ω = [0,1−a_1]×...×[0,1−a_n] the reference measure dQ = Π_i dx_i/(1−a_i). The coordinate maps X_i(x) = x_i are independent and each is uniformly distributed in the corresponding interval. The partition function corresponding to the constraint
E_P Σ_i n_i X_i I_i = T
is given by
Z(λ) = Π_i (1 − e^{−λ M_i})/(λ M_i),
where we set M_i = n_i I_i (1 − a_i) for brevity. Correspondingly, since −d ln Z(λ)/dλ = T we conclude that the maxentropic tax fractions are
f_i = E_P[X_i] = 1/(λ n_i I_i) − (1 − a_i)/(e^{λ M_i} − 1).
For such a λ to exist, your representatives will have to find a_i's such that
T < Σ_i n_i I_i (1 − a_i).
Note that for every x > 0, 0 < 1/x − 1/(e^x − 1) < 1; thus 0 ≤ f_i/(1 − a_i) ≤ 1 and the constraints are satisfied.
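A minimal numerical sketch (not from the book): solve −d ln Z(λ)/dλ = T for the single multiplier λ and recover the maxentropic tax fractions f_i = E_P[X_i]. The populations n_i, incomes I_i, fractions a_i and the target T below are assumptions.

```python
import numpy as np
from scipy.optimize import brentq

n = np.array([1000.0, 500.0, 100.0])    # people per income group (assumed)
I = np.array([10.0, 30.0, 100.0])       # income per person (assumed)
a = np.array([0.9, 0.7, 0.5])           # fraction of income spent on goods (assumed)
T = 5000.0                              # desired revenue; must be < sum n_i I_i (1-a_i)
M = n * I * (1.0 - a)

def mean_tax_fraction(lam):
    # E_P[X_i] for the exponentially tilted uniform law on [0, 1-a_i].
    if abs(lam) < 1e-12:
        return (1.0 - a) / 2.0
    t = np.clip(lam * M, -700.0, 700.0)                  # avoid overflow in expm1
    return 1.0 / (lam * n * I) - (1.0 - a) / np.expm1(t)

revenue = lambda lam: float(np.sum(n * I * mean_tax_fraction(lam)))
lam_star = brentq(lambda l: revenue(l) - T, -1.0, 1.0)   # revenue is decreasing in lambda
f = mean_tax_fraction(lam_star)
print("lambda* =", lam_star, " f_i =", f, " revenue =", revenue(lam_star))
```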
Comment. The next three sections were written by Aldo Tagliani, and contain a summary
of a line of research bearing on an important practical issue: are there a priori restrictions to be
satisfied by the moments of a distribution when only a finite number of them are used in the
reconstruction problem?
The moment problem on the semi-infinite interval [0,+∞) consists of producing a positive density p(x) such that
(6.19.1)  ∫_0^∞ x^n p(x) dx = μ_n,  n ≥ 0.
It is known that such a p(x) exists if and only if the Hankel determinants
(6.19.2)  μ_0,  det[μ_0 μ_1; μ_1 μ_2],  det[μ_0 μ_1 μ_2; μ_1 μ_2 μ_3; μ_2 μ_3 μ_4], ...,
          μ_1,  det[μ_1 μ_2; μ_2 μ_3],  det[μ_1 μ_2 μ_3; μ_2 μ_3 μ_4; μ_3 μ_4 μ_5], ...
are positive.
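A minimal sketch (not from the book) of checking the positivity of the Hankel determinants (6.19.2) for a finite moment sequence; the test moments below are those of e^{-x} on [0,+∞) and are only an illustration.

```python
import numpy as np

def hankel_positive(mu):
    """True if all determinants det(mu_{i+j+shift}), shift = 0 and 1, are positive."""
    ok = True
    for shift in (0, 1):                       # the two families displayed in (6.19.2)
        n_max = (len(mu) - 1 - shift) // 2
        for n in range(n_max + 1):
            H = [[mu[i + j + shift] for j in range(n + 1)] for i in range(n + 1)]
            ok = ok and np.linalg.det(np.array(H)) > 0.0
    return ok

mu = [1.0, 1.0, 2.0, 6.0, 24.0, 120.0]        # mu_n = n! for the exponential density
print(hankel_positive(mu))                    # expected: True
```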
If we consider the problem of finding a positive p(x) on [0,+∞) such that (6.19.1) holds only for n = 0,1,...,N, then the standard maxentropic reconstruction procedure suggests that we look for λ_0, λ_1, ..., λ_N such that
P_N(x) = exp(−Σ_{j=0}^N λ_j x^j)
satisfies
∫_0^∞ x^n P_N(x) dx = μ_n,  n = 0,1,...,N,
the μ_n being normalized moments. Introducing the standard adimensional statistical parameters (variation, skewness, kurtosis, ...) γ, ν, κ, ... we can rewrite the μ_n in terms of them, which can be used to obtain the moments μ_j for j > N as functions of the first N moments and the Lagrange multipliers λ.
We leave it to the reader to work out the details for the case N=1. The case N=2 was dealt with by Dowson and Wragg in [6.53]-[6.54], based on previous work by Barrow and Cohen [6.55]. They introduced the Mill's function B(x), which satisfies B′ = B(B − x) and B″ = B′(2B − x) − 1, and they prove that there is an x for which the corresponding ratio of the moments, expressed through B(x) and (B − x)², has to satisfy the condition (6.19.7).
In other words, when N=2 the positivity of the Hankel determinants (6.19.2) is only a
necessary condition for the existence of PN(x). The condition (6.19.7) must be imposed to obtain
the existence of PN(x) satisfying (6.19.3)' and (6.19.4)'.
The cases N=3 and N=4 were discussed by Tagliani in [6.56] extending previous work in
[6.38]-[6.39]. The computations and arguments are intricate. The case N=3 is briefly
summarized and the results for N=4 have just been presented.
The basic philosophy consists of transforming (6.19.4)' or (6.19.5) into a system of differential equations by varying continuously one of the moments, say μ_N, and keeping the others constant. The dependence of λ_0, λ_1, ..., λ_N on μ_N is studied, and of particular interest is the range of μ_N making λ_N positive.
From now on we shall write κ_i, i ≥ 1, for the standard statistical coefficients γ, ν, κ, etc., and we shall denote by D(κ_i, N) the domain of acceptable values of the coefficient κ_i when N moments are preassigned. It was proved in [6.56] that
From these we see that if we let κ_2, κ_3, ..., κ_{N−1} become arbitrarily large, then so does the admissible range of κ_N; this is obviously interpreted as saying that if for a particular value of N, say N*, none of the coefficients κ_1, ..., κ_{N*} admits an upper bound, then for any N > N* the coefficients κ_1, ..., κ_N are unbounded as well.
In other words, if for given μ_0, ..., μ_{N*} a P_{N*}(x) satisfying (6.19.3)' and (6.19.4)' exists, then P_N(x) exists for N > N*, and (6.19.8) represents a necessary and sufficient condition for the existence of a maximum entropy reconstruction of a density with the first N moments preassigned.
Let us now look at the details for N=3. In this case γ and ν are preassigned and we want to determine D(γ,3) and D(ν,3).
It can be seen that D(γ,3) = (0,∞) but D(ν,3) depends on the value of γ. From Schwarz's inequality, written as
μ_j² ≤ μ_{j−1} μ_{j+1},  j ≥ 2,
one obtains γ − 1/γ < ν, so that
D(γ,3) = (0,+∞),
with the final comment that the positivity of the Hankel determinants (6.19.2) represents a necessary and sufficient condition for the existence of P_3(x) if γ > 1, but only a necessary condition for γ < 1.
Let us take a brief look at the case N=4, where γ, ν, κ are preassigned. From (6.19.8) we obtain that
(6.19.11)  D(γ,4) = (0,∞),   D(ν,4) = (γ − 1/γ, +∞)  when γ > 1,
but D(ν,4) for γ < 1 and D(κ,4) are yet to be determined. After a quite cumbersome analysis one arrives at
(6.19.12)  D(ν,4) = (γ − 1/γ, +∞),  γ < 1,   D(κ,4) = (1 + ν², +∞),
where the quantity 1 + ν² is related to the positivity of the Hankel determinant
det[μ_0 μ_1 μ_2; μ_1 μ_2 μ_3; μ_2 μ_3 μ_4].
These results solve completely the case N=4. No upper bound exists for the coefficients γ, ν, κ, and the positivity of the Hankel determinants (6.19.2) represents a necessary and sufficient condition for the existence of P_4(x).
By taking into account (6.19.8) and (6.19.9) the same is valid for N>4. We summarize the
discourse of this section in
Theorem. Given a sequence μ_0, μ_1, ..., μ_N, N ≥ 4, of positive numbers, a necessary and sufficient condition for the existence of P_N(x) is the positivity of the Hankel determinants (6.19.2).
For N=2 or N=3, the positivity of the determinants is only a necessary condition and auxiliary
constraints have to be introduced for the existence of PN(x).
Here we present a brief description of the results in [6.57], and we will be concerned with finding a density P_N(x) on (−∞,∞) whose first moments μ_0, ..., μ_N are assigned. That is, we want
P_N(x) = exp(−Σ_{j=0}^N λ_j x^j)
such that
(6.20.1)  ∫_{−∞}^{∞} x^n P_N(x) dx = μ_n,  n = 0,1,...,N.
To begin with, let us recall the result in [6.51] asserting the existence of a solution to the moment problem.
Theorem. Given a sequence {μ_n : n ≥ 0} of numbers such that μ_0 = 1, there will exist a positive measurable density p(x) on (−∞, ∞) with ∫ x^n p(x) dx = μ_n, n ≥ 0, whenever the Hankel determinants
(6.20.2)  μ_0,  det[μ_0 μ_1; μ_1 μ_2],  det[μ_0 μ_1 μ_2; μ_1 μ_2 μ_3; μ_2 μ_3 μ_4], ...
are positive.
As in the Stieltjes case, the even moments can be expressed in terms of adimensional coefficients γ, ν, κ, ... (also labeled κ_i below). And as above we want to determine what restrictions, besides the positivity of (6.20.2), the finite size of the moment problem imposes.
Begin with N=2. This case was analyzed by Powles and Carranza in [6.58]. After transforming (6.20.2) into a differential equation solved with the aid of Weber functions, they obtain that
1 < μ_2/μ_1² < 3,
or equivalently
(6.20.3)  0 < γ < √2.
(6.20.5)  D(κ_j, N) ⊂ D(κ_j, N+1),
and as before, when κ_1, ..., κ_j are unbounded, then so are κ_{j+1}, κ_{j+2}, and so on.
The results for N=3 and N=4 for the symmetric case are similar to the corresponding
results for the Stieltjes case and are obtained by Tagliani in [6.39].
Let us now look at the general, non-symmetric, case. As usual begin with the case N=2.
Here we have
and hence
Now consider N=4 when γ, ν, κ are preassigned. Again (6.21.1), or its equivalent (6.20.5), is transformed into a system of differential equations. Examination of this solution yields
(6.20.8)  D(ν,4) = (−∞,+∞),   D(κ,4) = (1 + ν², +∞).
In many cases when applying statistical modeling in applied sciences, the analytical
representation of probability distributions is essentially empirical.
Experience suggests that whenever a particular mathematical distribution gives a good fit
to experimental data under limited information, it is also reasonable to base the estimation of
probabilities on the maximum entropy method. See for example the work by Siddall and Diab
[6.59]. From their work one would conclude that almost all well known analytical distributions
can be accurately reconstructed from the knowledge of their first four or five moments.
In other words, the probabilistic nature of the random variable can be reasonably well captured by
these moments. Or, the question whether the first few moments are a good representation of the information supplied by the sample data appears to have a positive answer.
In general, however, the result of the testing program is a set of N measured values {x_1,...,x_N} rather than a set of population moments. From these one could compute N independent sample moments by
μ̂_k = (1/N) Σ_{i=1}^N x_i^k,  k = 1,...,N.
And in applications it is frequently assumed that the unknown population moments μ_k can be replaced by the known sample moments μ̂_k.
By replacing μ_k by μ̂_k it would appear that the entropy, which is presumably a measure of
information, does not depend on the number of tests used to compute the sample moments.
Besides that, it is not clear how many moments should be included as constraints in the maximum
entropy formalism when the available information is a sample of N measured values.
Such questions have been raised by Baker in [6.60] in a vivid paper, and his approach is applied to the case of a random variable taking values in D ⊂ ℝ (typically D = [0,1], as in the Hausdorff moment case).
Making use of Kullback's relative information, we would obtain p*(x) as the corresponding maxentropic density. With this,
a) one solves for the {λ_0,...,λ_M} for different M as usual;
b) the "best" number of moments corresponds to that value of M making (6.21.2) smallest.
REFERENCES
[6.19] Ulrych, T., Bassrei, A. and Lane, M. "Minimum relative entropy inversion of 1D data with applications". Geoph. Prosp. 38, pp. 465-487, 1990.
[6.20] Rietsch, E. "The maximum entropy approach to the inversion of 1D seismograms". Geoph.
Prosp. 36, pp. 365-382, 1988.
[6.21] Aichelin, J. and Huefner, J. "Fragmentation reactions on nuclei: condensation of vapor,
shattering of glass". Phys. Lett. 136 B. pp. 15-17, 1984.
[6.22] Varadhan, S.R.S. "Diffusion problems and partial differential equations". Tata Lect. Notes No. 64, Springer-Verlag, Berlin, 1980.
[6.23] Gassiat, E. "Problème sommatoire par maximum d'entropie". C.R.A.S. Paris, t. 303, Série I, pp. 675-680, 1986.
[6.24] Landau, H. J. "Maximum entropy and the moment problem". Bull. Am. Math. Soc. 16, pp. 47-77, 1987.
[6.25] Choi, B.S and Cover, T. M. "An information theoretic proof of Burg's maximum entropy
spectrum". Proc. IEEE. 72, pp. 1094-1096, 1984.
[6.26] Grandell, J., Hamrud, H. and Toll, P. "A remark on the correspondence between the max-entropy method and the autoregressive model". IEEE Trans. Inf. Th. IT-26, pp. 750-751, 1980.
[6.27] Van den Bos, A. "Alternative interpretation of maximum entropy spectral analysis". IEEE.
Trans. Inf. Th. IT-17. pp. 493-494, 1971.
[6.28] Lin, Dh. and Wong, E.K. "A survey on the maximum entropy method and parameter
spectral estimation". Phys. Reports. North Holland, 193. pp. 41-135, 1990.
[6.29] Karlin, S. and Taylor, H.M. "A first course in stochastic processes". 2nd Ed., Acad. Press, New York, 1975.
[6.30] Grenander, U. and Szegő, G. "Toeplitz forms and their applications". Univ. Calif. Press, Berkeley, 1958.
[6.31] Mead, L. R. "Approximate solution of Fredholm integral equations by the maximum entropy method". Jour. Math. Phys. 27, pp. 2903-2907, 1986.
[6.32] Bryan, L. K. and Skilling, J. "Deconvolution by maximum entropy, as illustrated by application to the jet of M87". Mon. Not. R. Astr. Soc. 191, pp. 69-79, 1980.
[6.33] Birch, S. F., Gull, S. F. and Skilling, J. "Image restoration by a powerful maximum
entropy method" Comp. Vis. Graph and Im. Proc. 23, pp. 113-128, 1983.
[6.34] Wernecke, S. J. and D'Addario, L. R. "Maximum entropy image reconstruction". IEEE Trans. Comp. C-26, pp. 351-369, 1977.
[6.35] Geman, D. and Geman, S. "Bayesian image analysis". NATO ASI Series F20, Disord. Syst. and Biol. Organiz., Springer-Verlag, Berlin, 1986.
[6.36] Zuang, X., Ostelvold, E. and Haralick, R. M. "A differential equation approach to maximum entropy image reconstruction". IEEE ASSP 35, pp. 208-218, 1987.
[6.37] Elfving, T. "An algorithm for maximum entropy image reconstruction from noisy data".
Math. Comp. Modeling, 12, pp. 729-745, 1989.
[6.38] Rosenblueth, E., Karmesh and Hong, H. P. "Maximum entropy and discretization of probability distributions". Probab. Engin. Mech. 2, pp. 58-63, 1987.
[6.39] Tagliani, A. "On the existence of maximum entropy distributions with four or more
assigned moments". Probab. Engin. Mech.
[6.40] Ferdinand, A. E. "A statistical mechanical approach to systems analysis". I.B.M. Jour. Res.
Dev. Sept., pp. 539-547, 1970.
[6.41] Jamison, B. "A Martin boundary interpretation of the maximum entropy method". Zeit. f.
Warsch. 30, pp. 265-272, 1974.
[6.42] Jamison, B. "Reciprocal processes". Zeit f Warsch. 30, pp. 65-86, 1974.
[6.43] Aebi, R. and Nagasawa, M. "Large deviations and the propagation of chaos for
Schroedinger processes". Zeit. f. Warsch. 94, pp. 53-68, 1992.
[6.44] Arnold, G. S. and Kinsey, J. L. "Information theory for marginal distributions applications
to energy disposal in an exothermic reaction". Jour. Chem. Phys. 67, pp. 3530-3532,
1977.
[6.45] Rebick, C., Levine, R. D. and Bernstein, R. B. "Energy requirements and energy disposal".
Jour. Chem. Phys. 60, pp. 4977-4989, 1974.
[6.46] Landau, H. J. "Maximum entropy and the moment problem". Bull. Am. Math. Soc. 16, pp.
47-71, 1987.
[6.47] Mead, L. R. and Papanicolau, N. "Maximum entropy in the problem of moments". Jour.
Math. Phys. 25, pp. 2404-2417, 1984.
[6.48] Forte, B., Hughes, N. and Pales, Z. "Maximum entropy and the problem of moments"
Rendiconti di Matematica, Serie VII, 9, pp. 689-699, 1989.
[6.49] Borwein, J. M. and Lewis, A. S. "Convergence of best entropy estimates". SIAM Jour. Optim. 1, pp. 191-205, 1991.
[6.50] Lewis, A. S. "The convergence of entropic estimates for moment problems". Workshop
on Functional Analysis/Optimization. Fitzpatrick S. and Giles, J. Eds, Centre for Mathem.
Analysis, Australian Nat. Univ. Canberra, pp. 100-115, 1988.
[6.51] Widder, D. V. "The Laplace transform". Princeton Univ. Press, Princeton, 1946.
[6.52] Theil, H. "Economics and information theory". North Holland, Amsterdam, 1967.
[6.53] Dowson, D. C. and Wragg, A. "Maximum entropy distributions having prescribed first and second moments". IEEE IT-19, pp. 689-693, 1973.
[6.54] Wragg, A. and Dowson, D.C. "Fitting continuous probability density functions over [0,∞) using information theory ideas". IEEE IT-16, pp. 220-230, 1970.
[6.55] Barrow, D. F. and Cohen, A. C. "On some functions involving Mill's ratio". Ann. Math.
Statistics, 25, pp. 405-408, 1954.
[6.56] Tagliani, A. "On the application of maximum entropy to the problem of moments". Jour.
Math. Phys. 34, pp. 326-337, 1993.
[6.57] Tagliani, A. "Maximum entropy in the Hamburger moments problem". Submitted to Jour.
Math. Phys. 1993.
[6.58] Powles, J. G. and Carranza, B. "An information theory of nuclear magnetic resonance". In Magnetic Resonance, Coogan, C. K., Ed., pp. 133-161, 1970.
[6.59] Siddall, J. N. and Diab, Y. "The use in probabilistic design of probability curves generated
by maximizing the Shannon entropy function constrained by moments". Jour. of Engin. for Industry, A.S.M.E. 97, pp. 843-852, 1975.
[6.60] Baker, R. "Probability estimation and information principles" Structural Stability. 9, pp.
97-116, 1990.
Chapter 7
The following is taken almost literally from [7.1]. It comprises the very basic results in the
theory of large deviations. In that reference you will find quite a lot about the subject and its
applications to statistical mechanics. Also, check with [7.2] for more.
We shall consider a probability space (Ω, F, P) on which we have defined a family of independent, identically distributed random variables {X_n : n ≥ 1} taking values on a finite set S = {x_1,...,x_N}.
It is clear that any measure ρ on S (equipped with the σ-algebra P(S), the class of all subsets of S) can be written as
ρ(A) = Σ_i p_i δ_{x_i}(A).
Define the empirical frequencies
(7.1)  L_{n,i}(ω) = (1/n) Σ_{j=1}^n δ_{X_j(ω)}({x_i}),
where the ω ∈ Ω is written to emphasize that the L_{n,i} are random variables which count, for each realization X_j(ω) of the process, the frequency with which the sequence X_j(ω) takes the value x_i, and set
L_n(A) = Σ_i L_{n,i} δ_{x_i}(A).
Note that
E X_1 = Σ_i x_i p_i = m_ρ,
and the summands in L_n are independent, identically distributed random variables taking values in the set of all probability measures on S.
According to the law of large numbers, for any ε > 0 the following limits hold true (with S_n = X_1 + ... + X_n):
P(|S_n/n − m_ρ| ≥ ε) → 0   and   P(max_i |L_{n,i} − p_i| ≥ ε) → 0   as n → ∞,
where the vector (p_1,...,p_N) is the limit of the random vector (L_{n,1},...,L_{n,N}).
To begin with we shall consider the fluctuations of L_{n,i} about their means when S = {0,1}, or if you will, for the head and tail game. All the basic results and techniques already appear in this case, in which counting is simpler.
For the time being set S = {0,1}, ρ = (1/2)(δ_0 + δ_1), p_0 = p_1 = 1/2, L_{n,0} = 1 − S_n/n, L_{n,1} = S_n/n.
Therefore |L_{n,0} − p_0| = |S_n/n − m_ρ|. From this
Let Q_n^(1) denote the distribution of S_n/n as an ℝ-valued variable and set
A = {t ∈ ℝ : |t − m_ρ| ≥ ε},  with 0 < ε < 1/2.
Certainly, A ∩ [0,1] ≠ ∅ and Q_n^(1)(A) = P{|S_n/n − m_ρ| ≥ ε} is positive for large enough n. Since m_ρ ∉ A, Q_n^(1)(A) → 0 as n → ∞.
Let us define
I^(1)(z) = z ln z + (1 − z) ln(1 − z) + ln 2,  z ∈ [0,1],
with 0 ln 0 = 0 as usual. Note that I^(1)(z) is symmetric about 1/2 and has its minimum there. The following result relates the decay of Q_n^(1)(A) to I^(1)(z).
Theorem 7.4. With the notations introduced above,
lim_{n→∞} (1/n) ln Q_n^(1)(A) = −min_{z∈A} I^(1)(z).
Comment. Since A is a closed set and m_ρ ∉ A, min_{z∈A} I^(1)(z) is strictly larger than I^(1)(m_ρ) = 0. Therefore Q_n^(1)(A) tends to zero exponentially fast as n → ∞.
Proof: S_n ranges over the set {0,1,...,n} and
P(S_n = k) = C(n,k)/2^n.
Writing A_n = {k : 0 ≤ k ≤ n, k/n ∈ A}, we have
Q_n^(1)(A) = Σ_{k∈A_n} C(n,k)/2^n,
and, since the sum contains at most n+1 terms,
max_{k∈A_n} C(n,k)/2^n ≤ Q_n^(1)(A) ≤ (n+1) max_{k∈A_n} C(n,k)/2^n.
To conclude the proof we need the following.
Lemma 7.5. The following estimate is uniform in k ≤ n:
(1/n) ln C(n,k) = −[(k/n) ln(k/n) + (1 − k/n) ln(1 − k/n)] + O(ln n / n)  as n → ∞.
Proof: For k=0 or k=n it is obviously true. From Stirling's theorem, ln n! = n ln n − n + O(ln n). Therefore
(1/n) ln C(n,k) = ln n − (k/n) ln k − (1 − k/n) ln(n − k) + O(ln n / n)
= −(k/n) ln(k/n) − (1 − k/n) ln(1 − k/n) + O(ln n / n),
so that
(1/n) ln [C(n,k)/2^n] = −I^(1)(k/n) + O(ln n / n).
Back to the theorem. As both ln n/n and ln(n+1)/n are O(ln n/n), we have
lim_{n→∞} (1/n) ln Q_n^(1)(A) = −lim_{n→∞} min_{k∈A_n} I^(1)(k/n) = −min_{z∈A} I^(1)(z).
Let us now consider the general case: S = {x_1,...,x_N} (and let the x_i be real numbers such that x_1 < ... < x_N). As mentioned above, the set P(S) of probability measures on S
is a compact subset of ℝ^N. Recall that the entropy of ν = Σ ν_i δ_{x_i} relative to ρ = Σ p_i δ_{x_i} is
S_ρ(ν) = −Σ_i ν_i ln(ν_i/p_i).
Assume that on (Ω, F) we have a probability P_ρ such that P_ρ(X_n = x_i) = p_i for all n. Let us denote by Q_n^(1) and Q_n^(2) the distributions of S_n/n and L_n with respect to P_ρ.
Let A_1 and A_2 be the Borel sets defined by
A_1 = {t ∈ ℝ : |t − m_ρ| ≥ ε},   A_2 = {ν ∈ P(S) : max_{i=1,...,N} |ν_i − p_i| ≥ ε},   0 < ε < min_i {p_i, 1 − p_i},
and define
(7.8)  S(ρ, z) = max{ S_ρ(ν) : ν ∈ P(S), Σ_i ν_i x_i = z }  for z ∈ [x_1, x_N],  and  S(ρ, z) = −∞  for z ∉ [x_1, x_N].
With these notations, the analogue of Theorem 7.4 reads:
(i)  lim_{n→∞} (1/n) ln Q_n^(1)(A_1) = max_{z∈A_1} S(ρ, z),
(ii) lim_{n→∞} (1/n) ln Q_n^(2)(A_2) = max_{ν∈A_2} S_ρ(ν).
Proof: Let us take care of (ii) to begin with. For each n and ω ∈ Ω fixed, let 1 ≤ i ≤ N and k_i = #{j ≤ n : X_j(ω) = x_i}. Then L_{n,i} = k_i/n, and L_n(·) is in A_2 if and only if k = (k_1,...,k_N) is in the set of integer vectors with Σ_i k_i = n and (k_1/n,...,k_N/n) ∈ A_2.
The multinomial analogue of Lemma 7.5 yields
(1/n) ln C(n; k) = −Σ_i (k_i/n) ln(k_i/n) + o(1),
(1/n) ln [C(n; k) Π_i p_i^{k_i}] = −Σ_i (k_i/n) ln((k_i/n)/p_i) + o(1),
where C(n; k) denotes the multinomial coefficient.
Noticing now that in the sum defining Q_n^(2)(A_2) there are fewer than (n+1)^N terms, we can proceed as in the proof of Theorem 7.4 to obtain
(1/n) ln Q_n^(2)(A_2) = max{ S_ρ(k/n) : k/n ∈ A_2 } + o(1),
and therefore
lim_{n→∞} (1/n) ln Q_n^(2)(A_2) = max_{ν∈A_2} S_ρ(ν).
To get the result we want for (i) we need to compute the right-hand side. Note to begin with that
max{ S_ρ(ν) : ν ∈ P(S), Σ_i ν_i x_i ∈ A_1 } = max_{z ∈ A_1 ∩ [x_1, x_N]} S(ρ, z);
note now that we defined S(ρ, z) = −∞ whenever z is not in [x_1, x_N]. Therefore
lim_{n→∞} (1/n) ln Q_n^(1)(A_1) = max_{z∈A_1} S(ρ, z).
Comments. Since max{S(ρ,z) : z ∈ A_1} is negative, this theorem asserts that for n large, the probability that a microscopic configuration is such that the empirical mean ∫ x L_n(dx) differs a little bit from the actual mean of X_1 is exponentially small. See the corresponding chapter of [7.1] for more on this and [7.3] for a variation on the theme.
REFERENCES
[7.1] Ellis, R. S. "Entropy, large deviations and statistical mechanics". Springer Verlag,
Berlin, 1985.
[7.2] Bucklew, J. A. "Large deviation techniques in decision, simulation, and estimation"
John Wiley & Sons, New York, 1990.
[7.3] Robert, C. "An entropy concentration theorem: applications in artificial intelligence and descriptive statistics". Jour. Appl. Prob. 27, pp. 303-313, 1990.
Chapter 8
Given a positive martingale {M_n} relative to a filtration {F_n}, define a new expectation by
(8.1)  E_M[H] = E[H M_n]
for every bounded H in F_n. The martingale property provides us with consistency for (8.1). To have (8.1) for any bounded H in F, just approximate H by an appropriate sequence H_n.
Consider now a collection {X_n : n ≥ 1} of independent, identically distributed random variables such that
D(λ) = {λ ∈ ℝ : E[exp(−λX_1)] < ∞}
has a nonempty interior. A standard convexity argument implies that D(λ) is an interval containing 0, and the comments in the appendix to chapter 5 tell us that a(λ) = −d ln Z(λ)/dλ is a differentiable bijection between int(D(λ)) and int(conv(range X_1)). That is, for each a ∈ int(conv(range X_1)) there is a λ_a ∈ int(D(λ)) such that a = −d ln Z/dλ(λ_a).
Notice now that M_n = exp(−λS_n)/Z(λ)^n is a positive martingale, where S_n = Σ X_k, and that
E_M(X_1) = E[X_1 e^{−λX_1}]/Z(λ).
In the probabilistic literature M_n is called Wald's martingale and the change of measure in (8.1) is
the discrete time analogue of the Cameron-Martin-Girsanov transformation employed in sections
6.10 and 6.11. This setup can be greatly generalized, but let us not do it here. Let us just prove
Lemma 8.2. Let G be an integrable function, let M_n be any positive martingale and let P_M be defined as in (8.1). Then, for G in F_n,
E_M[G|S_n] = E[G M_n|S_n]/M_n.
In particular, for Wald's martingale M_n = exp(−λS_n)/Z(λ)^n, which is a function of S_n,
E[G|S_n] = E_M[G|S_n].
Since P_M(|S_n/n − a| ≥ ε) behaves as exp(−nK(ε)) for an appropriate K(ε) (see Theorem 7.4), the result we want follows.
This result is extended in
Theorem 8.5. Let {X_n : n ≥ 1} denote a sequence of real valued, independent, identically distributed random variables. Let U: ℝ → ℝ be a bounded measurable function and h: ℝ → ℝ be such that D(λ) = {λ ∈ ℝ : Z(λ) = E[exp(−λh(X_1))] < ∞} has a nonempty interior. Let C denote the closure of the convex set generated by the range of h(X_1). Choose λ ∈ int D(λ) such that E_M[h(X_1)] = a for a ∈ int C, and put S_n = Σ h(X_k). Then
lim_{n→∞} E_M[U(X_1) g(S_n/n)] = E_M[U(X_1)] g(a)
for suitable bounded continuous g. Writing g through its Fourier representation ĝ,
E_M[U(X_1) g(S_n/n)] = ∫ ĝ(ξ) E_M[U(X_1) exp(−iξ S_n/n)] dξ.
Now, the h(X_n) are still independent relative to P_M and the expectation under the integral sign can be further computed as
E_M[U(X_1) exp(−iξ h(X_1)/n)] exp{(n−1)[K(λ + iξ/n) − K(λ)]},
where we are again using K(λ) = ln Z(λ). The way we choose λ provides us with the approximation exp{(n−1)[K(λ + iξ/n) − K(λ)]} ≈ exp(−iξ a), so that, letting n → ∞ and integrating against ĝ, the above
= E_M[U(X_1)] g(a).
To conclude the proof, all you need to know is that when regular conditional probabilities exist, then
E_M[U(X_1)|S_n/n = a] = lim_k E_M[U(X_1) g_k(S_n/n)]/E_M[g_k(S_n/n)]
for an appropriate sequence of functions g_k concentrating at a.
Nice, huh? The natural, and obvious, interpretation is that conditioning with respect to {S_n/n = a} concentrates the probability on the (tilted) distribution giving the value a to the mean of h(X_1).
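A minimal sketch (not from the book) of the exponential change of measure behind Wald's martingale: choose λ so that under the tilted law P_M the mean of h(X_1) equals a. The law of X_1, the choice h(t) = t and the target a below are assumptions.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)     # samples of X_1 (mean 1 under P, assumed)
a = 2.0                                          # target value for E_M[h(X_1)] (assumed)

def tilted_mean(lam):
    w = np.exp(-lam * x)                         # proportional to dP_M/dP = e^{-lam h}/Z(lam)
    return float(np.sum(x * w) / np.sum(w))

lam = brentq(lambda l: tilted_mean(l) - a, -0.9, 0.5)    # here lam is about -0.5
w = np.exp(-lam * x); w /= w.sum()
print("lambda =", round(lam, 4))
print("E_M[h(X_1)] =", round(float(np.sum(x * w)), 4))          # should be close to a
print("E_M[U(X_1)] for U = sin:", round(float(np.sum(np.sin(x) * w)), 4))
```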
REFERENCES
Chapter 9
This chapter consists of sections not bearing a relation to each other, but all related to statistics.
Let Φ_i: ℝ → ℝ, i = 1,...,K, be measurable functions and F_0(x) be the guess we make about the unknown distribution function F(x) of a random variable X. Assume that
Z(λ) = ∫ e^{−⟨λ,Φ(x)⟩} dF_0(x)
is finite on ℝ^K. The density dF/dF_0(x) minimizing K(F,F_0) = −S_{F_0}(F) over the set of distribution functions {F : dF ≪ dF_0, E_F(Φ) = c} is given by
dF/dF_0(x) = exp(−⟨λ, Φ(x)⟩)/Z(λ).
That much is old news. In [9.1], from which we are quoting, Campbell observed the following. Assume there is a measure m(dx) on ℝ with respect to which dF and dF_0 have densities f(x,c) and f_0(x) respectively.
If we put Φ_0(x) = 1 and define λ_0 = ln Z(λ), then, given n measurements x_1,...,x_n of a random variable X distributed according to (9.1.2), i.e. according to
f(x,c) = f_0(x) exp(−[⟨λ(c), Φ(x)⟩ + λ_0]),
we have
Lemma 9.1.4. With the notations introduced above, the maximum likelihood estimators of the c_k are
(9.1.5)  ĉ_k = (1/n) Σ_{i=1}^n Φ_k(x_i).
Proof: Just form ln Π_i f(x_i, c), differentiate (9.1.3) with respect to c_k, equate the derivative to zero, and use (9.1.2) to obtain (9.1.5).
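A minimal sketch (not from the book): for an exponential family f(x,c) = f_0(x) exp(−⟨λ,Φ(x)⟩ − λ_0) with f_0 uniform on [0,1] and Φ(x) = (x, x²) (both assumptions), maximum likelihood reduces to matching the model moments with the sample averages (1/n) Σ Φ_k(x_i), exactly as Lemma 9.1.4 states.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
data = rng.beta(2.0, 5.0, size=5000)                  # observed sample (assumed)
grid = np.linspace(0.0, 1.0, 2001)
dx = grid[1] - grid[0]
Phi = np.vstack([grid, grid**2])
c_hat = np.array([data.mean(), (data**2).mean()])     # (9.1.5): sample means of Phi_k

# Minus the per-sample log-likelihood equals <lambda, c_hat> + lambda_0 up to a constant.
neg_loglik = lambda lam: float(lam @ c_hat + np.log(np.sum(np.exp(-lam @ Phi)) * dx))
lam = minimize(neg_loglik, np.zeros(2), method="Nelder-Mead").x
dens = np.exp(-lam @ Phi); dens /= np.sum(dens) * dx
print("sample moments:", c_hat)
print("model moments at the MLE:", [float(np.sum(dens * grid**k) * dx) for k in (1, 2)])
```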
Before going to the converse, we shall recall (the generalization of) Gauss' method. Again let us assume that the unknown distribution (with respect to a measure m(dx)) of a real valued random variable is f(x,c). The experimenter knows a sample x_1,...,x_n and f_0(x) = f(x,c_0) for some c_0. Gauss' method is based on assuming that
i) the right f(x,c) corresponds to the value of c maximizing the likelihood
(9.1.6)  ln Π_i f(x_i, c);
ii) the estimators of the c_k are given by
(9.1.8)  c_k = (1/n) Σ_{i=1}^n Φ_k(x_i),  k = 1,2,...,K,
as in (9.1.5). If we think of x = (x_1,...,x_n) as the coordinates of a point in ℝ^n, then both (9.1.8) and (9.1.5), rewritten as
determine the same value of c; this is the key point: they determine the same surface in ℝ^n. This means that each normal to (9.1.5) is a linear combination of the normals to (9.1.9). That is, for each 1 ≤ i ≤ n,
(∂/∂x_i) ln f(x_i, c) = Σ_k a_k(c) (∂/∂x_i) (Φ_k(x_i) − c_k).
Since the functions involved depend only on one coordinate x_i, we can give it any generic name, let it be x. Now, integrating both sides of the last identity with respect to x, and noticing that (9.1.8) and (9.1.9) have to hold, it is clear that the integration constants have to be such that
(9.1.10)  ln f(x,c) = Σ_k a_k(c)(Φ_k(x) − c_k).
Differentiating (9.1.10) with respect to the c_i and bringing in the assumption of the independence of the set 1, Φ_1(x), ..., Φ_K(x), we obtain two conditions on the coefficients. The first implies that for each 1 ≤ i ≤ K there is a λ_i(c) such that a_i = −λ_i(c), and the second implies that there is a function V(c) such that λ_i = ∂V/∂c_i. Now, we can rewrite (9.1.10) as
ln f(x,c) = −Σ_i λ_i(c)(Φ_i(x) − c_i) + V.
Integrating both sides along any curve joining c_0 to c, and since f_0(x) = f(x,c_0), we obtain
f(x,c) = f_0(x) exp(−[⟨λ(c), Φ(x)⟩ + λ_0]),
where the identification of λ_0 with all remaining constants in the exponent is clear. Since the left hand side is to be normalized, it is clear that exp λ_0 = Z(λ), as above. This concludes Campbell's proof.
Notice one interesting consequence of the extension of Gauss' method. When c is given according to (9.1.5), when f(x,c) is the right distribution, and when n tends to infinity, then
(1/n) Σ_{i=1}^n Φ_k(x_i) → ∫ Φ_k(x) dF(x).
Thus, maximum likelihood makes the entropy functional easy to accept, at least for
statisticians. And for them, the maximum entropy method for looking for distribution functions
must also be natural. The gist, or the crux, of the problem of characterizing distributions becomes identical with the issue of choosing a family {Φ_k(x), k=1,...,K} to characterize the parameters of
the distribution.
In the next section we shall see how the notion of sufficiency anticipated Campbell's results from a different point of view.
9.2 Sufficiency.
Here we shall refer to the second chapter of [4.1] and to the second and third chapters of [0.2]. Recall that if the probability P on (Ω, F) has a density f with respect to a measure μ on (Ω, F), then the restriction of P to a sub-σ-algebra G has a density E_μ[f|G] with respect to the restriction of μ to G.
We defined, for P, Q ∈ P(Ω) and μ ∈ M(Ω), G to be sufficient whenever
(dP/dμ)/E_μ[dP/dμ|G] = (dQ/dμ)/E_μ[dQ/dμ|G]
holds a.e. μ.
In the setup of Lemma 4.9, G = σ(Φ) for Φ: (Ω, F) → (Ω', F'), P' = P∘Φ^(-1) and μ' = μ∘Φ^(-1); we note that E_μ[dP/dμ|Φ] = (dP'/dμ')∘Φ, etc., therefore the sufficiency condition becomes
Consider now the following setup. Let X: Ω → (E, ℰ) be an E-valued random variable, let Q be a probability on (Ω, F) and m be a measure on ℝ such that
Q(X ∈ A) = ∫_A ρ(ξ) m(dξ),
and assume that Z(λ) = E_Q[exp(−⟨λ, Φ(X)⟩)] is defined for λ in some convex set D ⊂ ℝ^K. Denote by P(λ) the measure on Ω with density dP(λ)/dQ = exp(−⟨λ, Φ(X)⟩)/Z(λ), and let θ(λ) be given by
(9.2.3)  θ(λ) = E_{P(λ)}[Φ(X)] = −∇_λ ln Z(λ).
Then
θ̂ = (1/N) Σ_{i=1}^N Φ(X_i)
is a sufficient statistic for K(P_N, Q_N) = N K(P(λ), Q).
Thus, supplementing this result with the uniqueness of the correspondence λ → θ(λ) contained in (9.2.3) neatly rounds up a bunch of ideas.
We shall present some very elementary results about the Bayesian approach to density
estimation. For more the reader should take a look at [9.2] or [9.3].
When considering a parametric family p(x,θ) it proves convenient to think of θ as the values of a random variable about which we know an a priori distribution G(θ). We want to estimate θ given the values X_1,...,X_n of a random variable X having distribution function p(x|θ).
To produce the Bayesian estimator of θ the following procedure is applied:
i) The posterior distribution of θ given the observations x_1,...,x_n is
h(θ|x_1,...,x_n) = p(x_1,...,x_n|θ) g(θ) / ∫ p(x_1,...,x_n|θ′) g(θ′) ν(dθ′).
ii) Given a loss function L(θ̂,θ), the risk of an estimator θ̂ = θ̂(x_1,...,x_n) is
R(θ̂,θ) = ∫ L(θ̂,θ) p(x_1,...,x_n|θ) dx_1...dx_n,
and the Bayes risk is
R(θ̂) = ∫ R(θ̂,θ) g(θ) ν(dθ).
The Bayes estimator θ_B is the estimator θ̂ which minimizes R(θ̂). When L(θ̂,θ) is convex in the first variable, then R(θ̂) is a convex functional of θ̂ and we can do variational analysis. For full details see [9.2].
It is reasonably easy to verify that θ_B is the estimator at which the a posteriori risk
R(θ̂|x_1,...,x_n) = ∫ L(θ̂,θ) h(θ|x_1,...,x_n) ν(dθ)
reaches its minimum.
The problem is then how to choose g(θ). In [9.2] there are a few examples in which g(θ) is chosen as the density (with respect to ν(dθ)) which maximizes
S_ν(g) = −∫ g(θ) ln g(θ) ν(dθ)
subject to
∫ g(θ) ν(dθ) = 1.
This is a level 1 maxentropic reconstruction problem about which you have heard
enough by now. We present instead an example missing in [9.2] which is a (minor) variation on
the theme of section 9 of [9.3].
Assume that we somehow know that P(θ) is a member of a family of densities p(θ,a) on T×A, A being some countable set, on which we have another a priori density w(θ,a), chosen perhaps according to some invariance principle (or God-given, to cut short the regression). Assume that on the basis of an observation of a we decide to look for the distribution P(θ,a) concentrated on an a maximizing the relative entropy (or Kullback distance)
REFERENCES