Entropy and thinning of discrete random variables
Oliver Johnson∗
June 7, 2016
Abstract
We describe five types of results concerning information and concentration of discrete
random variables, and relationships between them, motivated by their counterparts in
the continuous case. The results we consider are information theoretic approaches to
Poisson approximation, the maximum entropy property of the Poisson distribution,
discrete concentration (Poincaré and logarithmic Sobolev) inequalities, monotonicity
of entropy and concavity of entropy in the Shepp–Olkin regime.
We briefly summarize the five types of results we study in this paper; their discrete analogues are
described in more detail in Sections 3 to 7 respectively.
∗School of Mathematics, University of Bristol, University Walk, Bristol, BS8 1TW, UK. Email:
[email protected]
1. [Normal approximation] A paper by Linnik [70] was the first to attempt to use
ideas from information theory to prove the Central Limit Theorem, with later contributions
by Derriennic [37] and Shimizu [87]. This idea was developed considerably
by Barron [14], building on results of Brown [25]. Barron considered independent and
identically distributed X_i with mean 0 and variance σ². He proved [14] that the relative
entropy D((X_1 + . . . + X_n)/√n ‖ Z_{0,σ²}) converges to zero if and only if it ever becomes
finite. Carlen and Soffer [29] proved similar results for non-identical variables under
a Lindeberg-type condition.
While none of these papers gave an explicit rate of convergence, later work [7, 11, 58]
proved that relative entropy converges at an essentially optimal O(1/n) rate, under
the (too strong) assumption of finite Poincaré constant (see Definition 1.1 below).
Typically (see [54] for a review), these papers did not manipulate entropy directly,
but rather used properties of projection operators to control the behaviour of Fisher
information on convolution. The use of Fisher information is based on the de Bruijn
identity (see [14, 17, 88]), which relates entropy to Fisher information, using the fact
that (see for example [88, Equation (5.1)]) for a fixed random variable X, the density
p_t of X + Z_{0,t} satisfies a heat equation of the form
\[
\frac{\partial p_t}{\partial t}(z) = \frac{1}{2}\frac{\partial^2 p_t}{\partial z^2}(z)
 = \frac{1}{2}\frac{\partial}{\partial z}\bigl( p_t(z)\,\rho_t(z) \bigr), \qquad (2)
\]
where ρ_t(z) := (∂p_t/∂z)(z)/p_t(z) is the score function with respect to the location parameter,
in the Fisher sense.
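As a quick numerical sanity check (not part of the original argument), the following minimal sketch takes X to be standard normal, so that p_t is simply the N(0, 1 + t) density, and compares a finite-difference time derivative with half the second spatial derivative, as in (2); the grid sizes and the choice of X are illustrative assumptions.

```python
import numpy as np

def p(z, t):
    # With X ~ N(0, 1), the density p_t of X + Z_{0,t} is the N(0, 1 + t) density.
    var = 1.0 + t
    return np.exp(-z**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

z = np.linspace(-5, 5, 2001)
dz = z[1] - z[0]
t, dt = 0.5, 1e-4

lhs = (p(z, t + dt) - p(z, t - dt)) / (2 * dt)                    # time derivative of p_t
rhs = 0.5 * (p(z + dz, t) - 2 * p(z, t) + p(z - dz, t)) / dz**2   # (1/2) * second z-derivative

print(np.max(np.abs(lhs - rhs)))    # small, of the order of the discretisation error
```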
More recent work of Bobkov, Chistyakov and Götze [19, 20] used properties of char-
acteristic functions to remove the assumption of finite Poincaré constant, and even
extended this theory to include convergence to other stable laws [18, 21].
2. [Maximum entropy] In [84, Section 20], Shannon described maximum entropy results
for continuous random variables in exponential family form. That is, given a
test function ψ, positivity of the relative entropy shows that the entropy h(p) is maximised
for fixed values of ∫_{−∞}^{∞} p(x)ψ(x) dx by densities proportional to exp(−cψ(x))
for some value of c.
This is a very useful result in many cases, and gives us well-known facts such as
that entropy is maximised under a variance constraint by the Gaussian density. This
motivates the information theoretic approaches to normal approximation described
above, as well as the monotonicity of entropy result of [8] described later.
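As a small illustration of the variance-constrained maximum entropy property (using standard closed-form entropies rather than anything from the paper), the sketch below compares the differential entropies of three unit-variance densities; the Gaussian comes out largest, as expected.

```python
import math

# Differential entropies (in nats) of three unit-variance densities.
h_gauss   = 0.5 * math.log(2 * math.pi * math.e)    # N(0, 1)
h_laplace = 1 + math.log(2 / math.sqrt(2))          # Laplace with scale b = 1/sqrt(2), variance 2b^2 = 1
h_uniform = math.log(2 * math.sqrt(3))              # Uniform on [-sqrt(3), sqrt(3)], variance 1

print(h_gauss, h_laplace, h_uniform)                # about 1.419 > 1.347 > 1.242
assert h_gauss > h_laplace > h_uniform
```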
However, many interesting densities cannot be expressed in exponential family form
for a natural choice of ψ, and we would like to extend maximum entropy theory to
cover these. As remarked in [40, p.215], “immediate candidates to be subjected to
such analysis are, of course, stable laws”. For example, consider the Cauchy density,
for which it is not at all natural to consider the expectation of ψ(x) = log(1 + x2 )
in order to define a maximum entropy class. In the discrete context of Section 4
we discuss the Poisson mass function, again believing that E log X! is not a natural
object of study.
3. [Concentration inequalities] Next we consider concentration inequalities; that is,
relationships between senses in which a function is approximately constant. We focus
on Poincaré and logarithmic Sobolev inequalities.
Definition 1.1. We say that a probability density p has Poincaré constant R_p if, for
all sufficiently well-behaved functions g,
\[
\mathrm{Var}_p(g) \leq R_p \int_{-\infty}^{\infty} p(x)\, g'(x)^2 \, dx. \qquad (3)
\]
Papers vary in the exact sense of ‘well-behaved’ in this definition, and for the sake
of brevity we will not focus on this issue; we will simply suppose that both sides of
(3) are well-defined.
Example 1.2. Chernoff [32] (see also [23]) used an expansion in Hermite polynomials
(see [91]) to show that the Gaussian densities φ_{μ,σ²} have Poincaré constant equal
to σ². By considering g(t) = t, it is clear that for any random variable X with finite
variance, R_p ≥ Var_p. Chernoff's result [32] played a central role in the information
theoretic proofs of the Central Limit Theorem of Brown [25] and Barron [14].
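A minimal Monte Carlo sketch of Chernoff's inequality, under the illustrative assumptions that X is N(0, σ²) with σ² = 2 and that g is one particular smooth test function (both choices are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 2.0
x = rng.normal(0.0, np.sqrt(sigma2), size=10**6)

g  = np.sin(x) + 0.3 * x**2        # an arbitrary smooth test function g
gp = np.cos(x) + 0.6 * x           # its derivative g'

lhs = g.var()                      # Var_p(g)
rhs = sigma2 * np.mean(gp**2)      # sigma^2 * E[g'(X)^2]
print(lhs, rhs)                    # lhs should not exceed rhs, up to Monte Carlo error
```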
In a similar way, we define the log-Sobolev constant, in the form discussed in [10]:
Definition 1.3. We say that a probability density p satisfies the logarithmic Sobolev
inequality with constant C if, for any function f with positive values,
\[
\mathrm{Ent}_p(f) \leq \frac{C}{2} \int_{-\infty}^{\infty} \frac{\Gamma_1(f,f)(x)}{f(x)}\, p(x)\, dx, \qquad (4)
\]
where we write Γ_1(f, g) = f'g', and view the RHS as a Fisher-type expression.
Similarly [9, Theorem 1] (see also [10, Proposition 5.7.1]) gives that:
Theorem 1.5. If the Bakry-Émery condition holds with constant c then the logarith-
mic Sobolev inequality holds with constant 1/c.
If p is Gaussian with variance σ², then since (see [43, Example 4.18]) the Bakry-Émery
condition holds with constant c = 1/σ², we recover the original Poincaré inequality
of Chernoff [32] and the log-Sobolev inequality of Gross [42]. Indeed, if the ratio
p/φ_{μ,σ²} is a log-concave function, then this approach shows that the Poincaré and
log-Sobolev inequalities hold with constant σ².
4. [Monotonicity of entropy] While Brown [25] and Barron [14] exploited ideas of
Stam [88] to deduce that the entropy of (X_1 + . . . + X_n)/√n increases in the Central
Limit Theorem regime along the 'powers of 2' subsequence n = 2^k, it remained an
open problem for many years to show that the entropy is monotonically increasing
in all n. This conjecture may well date back to Shannon, and was mentioned by
Lieb [68]. It was resolved by Artstein, Ball, Barthe and Naor [8], who proved a more
general result using a variational characterization of Fisher information (introduced
in [11]):
Theorem 1.6 ([8]). Given independent continuous X_i with finite variance, for any
positive α_i such that Σ_{i=1}^{n+1} α_i = 1, writing α^{(j)} = 1 − α_j, then
\[
n\, h\!\left( \sum_{i=1}^{n+1} \sqrt{\alpha_i}\, X_i \right)
 \geq \sum_{j=1}^{n+1} \alpha^{(j)}\, h\!\left( \sum_{i \neq j} \sqrt{\alpha_i/\alpha^{(j)}}\, X_i \right). \qquad (5)
\]
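In the special case where the X_i are independent Gaussians, h(N(0, v)) = ½ log(2πev) and (5) reduces to an instance of the concavity of the logarithm. The hedged sketch below checks this Gaussian case numerically for randomly chosen α_i and variances; the specific values are arbitrary.

```python
import math, random

random.seed(1)
n = 4
sigma2 = [random.uniform(0.5, 3.0) for _ in range(n + 1)]   # variances of X_1, ..., X_{n+1}
a = [random.random() for _ in range(n + 1)]
a = [ai / sum(a) for ai in a]                               # alpha_i, summing to 1

def h_gauss(v):
    # Differential entropy of N(0, v).
    return 0.5 * math.log(2 * math.pi * math.e * v)

# LHS of (5): the sum of sqrt(alpha_i) X_i is Gaussian with variance sum_i alpha_i sigma_i^2.
lhs = n * h_gauss(sum(ai * vi for ai, vi in zip(a, sigma2)))

# RHS of (5): leave-one-out terms.
rhs = 0.0
for j in range(n + 1):
    aj = 1 - a[j]                                           # alpha^{(j)}
    var_j = sum(a[i] * sigma2[i] for i in range(n + 1) if i != j) / aj
    rhs += aj * h_gauss(var_j)

print(lhs, rhs)
assert lhs >= rhs - 1e-12
```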
5. [Concavity of entropy] In the setting of probability measures taking values on the
real line R, there has been considerable interest in optimal transportation of mea-
sures; that is, given densities f0 and f1 , to find a sequence of densities ft smoothly
interpolating between them in a way which minimises some appropriate cost func-
tion. For example, using the natural quadratic cost function induces the Wasserstein
distance W2 between f0 and f1 . This theory is extensively reviewed in the two books
by Villani [95, 96], even for probability measures taking values on a Riemannian man-
ifold. In this setting, work by authors including Lott, Sturm and Villani [71, 89, 90]
relates the concavity of the entropy h(ft ) as a function of t to the Ricci curvature of
the underlying manifold.
Key works in this setting include those by Benamou and Brenier [15, 16] (see also
[30]), who gave a variational characterization of the Wasserstein distance between
f0 and f1 . Authors such as Caffarelli [27] and Cordero-Erausquin [33] used these
ideas to give new proofs of results such as log-Sobolev and transport inequalities.
Concavity of entropy also plays a key role in the field of information geometry (see
[2, 3, 78]), where the Hessian of entropy induces a metric on the space of probability
measures.
Note that if f = Q/P, where Q is a probability mass function, then
\[
\mathrm{Ent}_P(f) = \sum_{x=0}^{\infty} Q(x) \log \frac{Q(x)}{P(x)} = D(Q \| P), \qquad (7)
\]
the relative entropy from Q to P. We define the operators Δ and Δ* (which are adjoint with
respect to counting measure) by Δf(x) = f(x + 1) − f(x) and Δ*f(x) = f(x − 1) − f(x),
with the convention that f(−1) = 0.
We now make two further key definitions, in Condition 1 we define the ultra-log-concave
(ULC) random variables (as introduced by Pemantle [77] and Liggett [69]) and in Definition
2.2 we define the thinning operation introduced by Rényi [79] for point processes.
Condition 1 (ULC). For any λ, define the class of ultra-log-concave (ULC) mass functions
\[
\mathrm{ULC}(\lambda) = \{ P : \lambda_P = \lambda \text{ and } P(v)/\Pi_\lambda(v) \text{ is log-concave} \}.
\]
Equivalently, we require that v P(v)² ≥ (v + 1) P(v + 1) P(v − 1) for all v.
The ULC property is preserved on convolution; that is, if P ∈ ULC(λ_P) and Q ∈
ULC(λ_Q) then P ⋆ Q ∈ ULC(λ_P + λ_Q) (see [97, Theorem 1] or [69, Theorem 2]). The ULC
class contains a variety of parametric families of random variables, including the Poisson
distribution and sums of independent Bernoulli variables (Poisson–Binomial), in particular
the binomial distribution.
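The ULC inequality of Condition 1 is easy to test on a finite support. The sketch below (helper names such as is_ulc are ours, not from the paper) checks it for two binomial mass functions and for their convolution, illustrating closure under convolution; the parameters are arbitrary.

```python
import math

def binom_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def convolve(P, Q):
    R = [0.0] * (len(P) + len(Q) - 1)
    for i, pi in enumerate(P):
        for j, qj in enumerate(Q):
            R[i + j] += pi * qj
    return R

def is_ulc(P, tol=1e-12):
    # Check v*P(v)^2 >= (v+1)*P(v+1)*P(v-1) for all interior v of the finite support.
    return all(v * P[v]**2 >= (v + 1) * P[v + 1] * P[v - 1] - tol
               for v in range(1, len(P) - 1))

B1 = binom_pmf(5, 0.3)
B2 = binom_pmf(7, 0.6)
print(is_ulc(B1), is_ulc(B2), is_ulc(convolve(B1, B2)))   # all True
```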
A weaker condition than ULC (Condition 1) is given in [57], which corresponds to
Assumption A of [28]:
Definition 2.1. Given a probability mass function P, write
\[
E_P(x) := \frac{P(x)^2 - P(x-1)P(x+1)}{P(x)P(x+1)}
 = \frac{P(x)}{P(x+1)} - \frac{P(x-1)}{P(x)}. \qquad (8)
\]
Condition 2 (c-log-concavity). If EP (x) ≥ c for all x ∈ Z+ , we say that P is c-log-concave.
[28, Section 3.2] showed that if P is ULC then it is c-log-concave, with c = P (0)/P (1).
In [57, Proposition 4.2] we show that DBE(c), a discrete form of the Bakry-Émery condi-
tion, is implied by c-log-concavity.
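A small numerical illustration of this implication (not from the paper): for a binomial mass function, which is ULC, the sketch below computes E_P(x) from (8) and checks that its minimum equals P(0)/P(1). The parameter choices are arbitrary.

```python
import math

def binom_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

P = binom_pmf(10, 0.3)      # a ULC mass function

def E(P, x):
    # E_P(x) = P(x)/P(x+1) - P(x-1)/P(x), as in (8), with the convention P(-1) = 0.
    return P[x] / P[x + 1] - (P[x - 1] / P[x] if x >= 1 else 0.0)

vals = [E(P, x) for x in range(len(P) - 1)]
c = P[0] / P[1]
print(min(vals), c)          # the minimum is attained at x = 0 and equals P(0)/P(1)
assert min(vals) >= c - 1e-12
```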
Another condition weaker than the ULC property (Condition 1) was discussed in [35],
related to the definition of a size-biased distribution. For a mass function P , we define
its size-biased version P ∗ by P ∗ (x) = (x + 1)P (x + 1)/λP (note that some authors define
this to be xP (x)/λP ; the difference is simply an offset). Now P ∈ ULC(λP ) means that
P ∗ is dominated by P in likelihood ratio ordering (write P ∗ ≤lr P ). This is a stronger
assumption than that P ∗ is stochastically dominated by P (write P ∗ ≤st P ) – see [82] for a
review of stochastic ordering results. As described in [34, 35, 75], such orderings naturally
arise in many contexts where P is the mass function of a sum of negatively dependent
random variables.
The other construction at the heart of this paper is the thinning operation introduced
by Rényi [79]:
Definition 2.2. Given a probability mass function P, define the α-thinned version T_α P to
be
\[
T_\alpha P(x) = \sum_{y=x}^{\infty} \binom{y}{x} \alpha^x (1-\alpha)^{y-x} P(y). \qquad (9)
\]
Johnstone and MacGibbon showed that this quantity has a number of desirable features:
first, a Cramér–Rao lower bound I(P) ≥ 1/Var_P (see [61, Lemma 1.2]), with equality if and
only if P is Poisson; second, a projection identity which implies the subadditivity result
I(P ⋆ Q) ≤ (I(P) + I(Q))/4 (see [61, Proposition 2.2]). In theory, these two results should
allow us to deduce Poisson convergence in the required manner. However, there is a serious
problem. Expanding (10) we obtain
\[
I(P) = \left( \sum_{x=0}^{\infty} \frac{P(x-1)^2}{P(x)} \right) - 1. \qquad (11)
\]
For example, if P is a Bernoulli(p) distribution then the bracketed sum in (11) becomes
P (0)2 /P (1) + P (1)2 /P (2) = (1 − p)2 /p + p2 /0 = ∞. Indeed, for any mass function P
with finite support I(P ) = ∞. To counter this problem, a new definition was made by
Kontoyiannis, Harremoës and Johnson in [65].
Definition 3.2 ([65]). For a probability mass function P with mean λ_P, define the scaled
score function ρ_P and scaled Fisher information K by
\[
\rho_P(x) := \frac{(x+1) P(x+1)}{\lambda_P P(x)} - 1, \qquad (12)
\]
\[
K(P) := \lambda_P \sum_{x=0}^{\infty} P(x) \rho_P(x)^2. \qquad (13)
\]
Note that this definition does not suffer from the problems of Definition 3.1 described
above.
Example 3.3. For P Bernoulli(p), the scaled score function can be evaluated as ρ_P(0) =
P(1)/(λ_P P(0)) − 1 = p/(1 − p) and ρ_P(1) = 2P(2)/(λ_P P(1)) − 1 = −1, and the scaled Fisher
information as K(P) = p²/(1 − p).
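A direct numerical check of Example 3.3 (the helper name scaled_K is ours): the sketch computes ρ_P and K(P) from (12)–(13) for a Bernoulli mass function and compares with the closed form p²/(1 − p).

```python
def scaled_K(P):
    # Scaled score (12) and scaled Fisher information (13) for a mass function on {0, ..., len(P)-1}.
    lam = sum(x * px for x, px in enumerate(P))
    K = 0.0
    for x, px in enumerate(P):
        if px == 0:
            continue
        nxt = P[x + 1] if x + 1 < len(P) else 0.0
        rho = (x + 1) * nxt / (lam * px) - 1.0
        K += px * rho**2
    return lam * K

p = 0.2
P = [1 - p, p]                         # Bernoulli(p) mass function
print(scaled_K(P), p**2 / (1 - p))     # both approximately 0.05
```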
Further, K retains the desirable features of Definition 3.1; first positivity of the perfect
square ensures that K(P ) ≥ 0 with equality if and only if P is Poisson and second a
projection identity [65, Lemma, P.471] ensures that the following subadditivity result holds:
Theorem 3.4 ([65], Proposition 3). If S = Σ_{i=1}^{n} X_i is the sum of n independent random
variables X_i with means λ_i, then writing λ_S = Σ_{i=1}^{n} λ_i,
\[
K(P_S) \leq \frac{1}{\lambda_S} \sum_{i=1}^{n} \lambda_i K(P_{X_i}). \qquad (14)
\]
As discussed in [65], this result implies Poisson convergence at rates of an optimal order,
in a variety of senses including Hellinger distance and total variation. However, perhaps
the most interesting consequence comes via the following modified logarithmic Sobolev
inequality of Bobkov and Ledoux [22], proved via a tensorization argument:
Theorem 3.5 ([22], Corollary 4). For any function f taking positive values,
\[
\mathrm{Ent}_{\Pi_\lambda}(f) \leq \lambda \sum_{x=0}^{\infty} \Pi_\lambda(x)
 \frac{(f(x+1) - f(x))^2}{f(x)}. \qquad (15)
\]
Observe that (see [65, Proposition 2]) taking f = P/Π_λ for some probability mass
function P, the RHS of (15) can be written as
\[
\lambda \sum_{x=0}^{\infty} \Pi_\lambda(x) f(x) \left( \frac{f(x+1)}{f(x)} - 1 \right)^2
 = \lambda \sum_{x=0}^{\infty} P(x) \left( \frac{(x+1) P(x+1)}{\lambda_P P(x)} - 1 \right)^2, \qquad (16)
\]
and so (15) becomes D(P ‖ Π_λ) ≤ K(P). Hence, combining Theorems 3.4 and 3.5, we
deduce convergence in the strong sense of relative entropy in a general setting.
Example 3.6. (See [65, Example 1].) Combining Example 3.3 with Theorems 3.4 and 3.5 we deduce
that
\[
D(B_{n,\lambda/n} \| \Pi_\lambda) \leq \frac{\lambda^2}{n(n-\lambda)}, \qquad (17)
\]
so via Pinsker's inequality [66] we deduce that
\[
\| B_{n,\lambda/n} - \Pi_\lambda \|_{TV} \leq \lambda \sqrt{\frac{2}{n(n-\lambda)}}
 = \left( \sqrt{2} + o(1) \right) \frac{\lambda}{n}, \qquad (18)
\]
giving the same order of convergence (if not the optimal constant) as results stated in [12].
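The chain of bounds behind Example 3.6 can be checked numerically: the hedged sketch below computes D(B_{n,λ/n} ‖ Π_λ) and K(B_{n,λ/n}) exactly on the finite support and compares them with λ²/(n(n − λ)). The values n = 50, λ = 4 and the helper names are our own illustrative choices.

```python
import math

def binom_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def poisson_pmf(lam, N):
    return [math.exp(-lam) * lam**x / math.factorial(x) for x in range(N + 1)]

def rel_entropy(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def scaled_K(P):
    # Scaled Fisher information K(P) of (13), computed on a finite support.
    lam = sum(x * px for x, px in enumerate(P))
    K = 0.0
    for x, px in enumerate(P):
        if px > 0:
            nxt = P[x + 1] if x + 1 < len(P) else 0.0
            K += px * ((x + 1) * nxt / (lam * px) - 1.0)**2
    return lam * K

n, lam = 50, 4.0
B  = binom_pmf(n, lam / n)
Pi = poisson_pmf(lam, n)
D     = rel_entropy(B, Pi)        # D(B_{n,lam/n} || Pi_lam); exact, since B is supported on {0,...,n}
K     = scaled_K(B)
bound = lam**2 / (n * (n - lam))
print(D, K, bound)                # expect D <= K <= bound
```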
We make the following brief remarks, the first of which will be useful in the proof of
the Poisson maximum entropy property in Section 4:
Remark 3.7. An equivalent formulation of the ULC property of Condition 1 is that P ∈
ULC(λP ) if and only if the scaled score function ρP (x) is a decreasing function of x.
Remark 3.8. In [13], definitions of score functions and subadditive inequalities are ex-
tended to the compound Poisson setting. The resulting compound Poisson approximation
bounds are close to the strongest known results, due to authors such as Roos [81].
Remark 3.9. The fact that two competing definitions of the score function and Fisher
information are possible (see Definition 3.1 and Definition 3.2) corresponds to the two
definitions of the Stein operator in [67, Section 1.2].
The gradient form of the RHS of this partial derivative is key for us, and is reminiscent
of the continuous probability models studied in [4]. Such representations will lie at the
heart of the analysis of entropy in the Shepp–Olkin setting [50, 51, 86] in Section 7.
Theorem 4.2. If the random variable X has mass function P ∈ ULC(λ), then H(X) ≤ H(Π_λ).
Now, the assumption that P ∈ ULC(λ) means that P_α ∈ ULC(λ), so by Remark 3.7 the
score function ρ_α(z) is decreasing in z. Hence (21) is the covariance of an increasing and
a decreasing function, so by 'Chebyshev's other inequality' (see [63, Equation (1.7)]) it is
negative.
In other words, X ∈ ULC(λ) makes Λ′(α) ≤ 0, so Λ(α) is a decreasing function of α.
Since P_0 = Π_λ and P_1 = P, we deduce that
\[
H(X) \leq - \sum_{x=0}^{\infty} P(x) \log \Pi_\lambda(x) \qquad (22)
\]
\[
 = \Lambda(1) \leq \Lambda(0) = - \sum_{x=0}^{\infty} \Pi_\lambda(x) \log \Pi_\lambda(x) = H(\Pi_\lambda),
\]
and the result is proved. Here (22) follows by the positivity of relative entropy.
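As a numerical illustration of Theorem 4.2 (not part of the proof), the sketch below compares H(B_{n,λ/n}) with H(Π_λ) on a truncated support; binomials with mean λ lie in ULC(λ), and the parameters chosen here are arbitrary.

```python
import math

def entropy(P):
    # Shannon entropy (in nats) of a mass function given as a list.
    return -sum(p * math.log(p) for p in P if p > 0)

def binom_pmf(n, p):
    return [math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def poisson_pmf(lam, N):
    return [math.exp(-lam) * lam**x / math.factorial(x) for x in range(N + 1)]

lam = 3.0
H_poisson = entropy(poisson_pmf(lam, 80))      # truncation error is negligible here
for n in (5, 10, 50, 200):
    print(n, entropy(binom_pmf(n, lam / n)), H_poisson)
    # each H(B_{n,lam/n}) should be at most H(Pi_lam)
```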
Remark 4.3. These ideas were developed by Yu [103], who proved that for any n, λ the
binomial mass function Bn,λ/n maximises entropy in the class of mass functions P such
that P/Bn,λ/n is log-concave (this class is referred to as the ULC(n) class by [77] and [69]).
Remark 4.4. A related extension was given in the compound Poisson case in [59], one
assumption of which was removed by Yu [105]. Yu’s result [105, Theorem 3] showed that
(assuming it is log-concave) the compound Poisson distribution CP (λ, Q) is maximum
entropy among all distributions with ‘claim number distribution’ in ULC(λ) and given
‘claim size distribution’ Q. Yu [105, Theorem 2] also proved a corresponding result for
compound binomial distributions.
Remark 4.5. The assumption in Theorem 4.2 that P ∈ ULC(λP ) was weakened in recent
work of Daly [34, Corollary 2.3], who proved the same result assuming that P ∗ ≤st P .
5 Poincaré and log-Sobolev inequalities
We give a definition which mimics Definition 1.1 in the discrete case:
Definition 5.1. We say that a probability mass function P has Poincaré constant RP , if
for all sufficiently well-behaved functions g,
\[
\mathrm{Var}_P(g) \leq R_P \sum_{x=0}^{\infty} P(x) (\Delta g(x))^2. \qquad (23)
\]
Example 5.2. Klaassen’s paper [64] (see also [26]) showed that the Poisson random vari-
able Πλ has Poincaré constant λ. As in Example 1.2, taking g(t) = t, we deduce that
RP ≥ VarP for all P with finite variance. The result of Klaassen can also be proved by
expanding in Poisson–Charlier polynomials (see [91]), to mimic the Hermite polynomial
expansion of Example 1.2.
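A short numerical check of Klaassen's bound (an illustration of ours, not from the paper), using a truncated Poisson(λ) mass function and one arbitrarily chosen test function g:

```python
import math

lam, N = 3.0, 80
P = [math.exp(-lam) * lam**x / math.factorial(x) for x in range(N + 1)]   # truncated Poisson(lam)

g  = [math.sin(x) + 0.1 * x**2 for x in range(N + 1)]     # an arbitrary test function
Dg = [g[x + 1] - g[x] for x in range(N)]                  # forward difference (Delta g)(x)

mean_g = sum(p * gx for p, gx in zip(P, g))
var_g  = sum(p * (gx - mean_g)**2 for p, gx in zip(P, g))
rhs    = lam * sum(P[x] * Dg[x]**2 for x in range(N))

print(var_g, rhs)    # Klaassen's bound: Var_P(g) <= lam * E_P[(Delta g)^2]
```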
However, we would like to generalize and extend Klaassen's work, in the spirit of Theorems
1.4 and 1.5 (originally due to Bakry and Émery [9]). Some results in this direction
were given by Chafaï [31]; however, we briefly summarize two more recent approaches, using
ideas described in two separate papers.
In [35, Theorem 1.1], using a construction based on Klaassen's original paper [64], it
was proved that R_P ≤ λ_P, assuming that P* ≤_st P. Recall (see [82]) that this is implied by
P* ≤_lr P, which is a restatement of the ULC property (Condition 1). We deduce the
following result [35, Corollary 2.4]:
Theorem 5.3. For P ∈ ULC(λ): Var_P ≤ R_P ≤ λ_P.
The second approach, appearing in [57], is based on a birth-and-death Markov chain
construction introduced by Caputo, Dai Pra and Posta [28], generalizing the thinning-based
interpolation of Lemma 4.1. In [57], ideas based on the Bakry-Émery calculus are used to
prove the following two main results, taken from [57, Theorem 1.5] and [57, Theorem 1.3],
which correspond to Theorems 1.4 and 1.5 respectively.
Theorem 5.4 (Poincaré inequality). Any probability mass function P whose support is the
whole of the positive integers Z+ and which satisfies the c-log-concavity condition (Condi-
tion 2) has Poincaré constant 1/c.
Note that (see the discussion at the end of [57]), Theorem 5.4 typically gives a weaker
result than Theorem 5.3, under weaker conditions. Next we describe a modified form of
log-Sobolev inequality proved in [57]:
Theorem 5.5 (New modified log-Sobolev inequality). Fix a probability mass function P
whose support is the whole of the positive integers Z+ and which satisfies the c-log-concavity
condition (Condition 2). For any function f with positive values,
\[
\mathrm{Ent}_P(f) \leq \frac{1}{c} \sum_{x=0}^{\infty} P(x) f(x+1)
 \left( \log \frac{f(x+1)}{f(x)} - 1 + \frac{f(x)}{f(x+1)} \right). \qquad (24)
\]
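For P = Π_λ one can take c = 1/λ, since E_Π(x) = (x + 1)/λ − x/λ = 1/λ for all x. The sketch below checks (24) numerically in this Poisson case, on a truncated support and for one arbitrarily chosen positive test function f; it is an illustration of ours, not part of the proof in [57].

```python
import math

lam, N = 3.0, 80
P = [math.exp(-lam) * lam**x / math.factorial(x) for x in range(N + 1)]   # truncated Poisson(lam)
f = [1.0 + 0.5 * math.cos(x) + 0.02 * x for x in range(N + 2)]            # a positive test function

Ef  = sum(p * fx for p, fx in zip(P, f))
ent = sum(p * fx * math.log(fx / Ef) for p, fx in zip(P, f))   # Ent_P(f) = E[f log f] - E[f] log E[f]

c = 1.0 / lam     # the Poisson(lam) mass function is c-log-concave with c = 1/lam
rhs = (1.0 / c) * sum(P[x] * f[x + 1] * (math.log(f[x + 1] / f[x]) - 1 + f[x] / f[x + 1])
                      for x in range(N + 1))
print(ent, rhs)   # expect ent <= rhs, as in (24)
```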
Remark 5.6. As described in [57], Theorem 5.5 generalizes a modified log-Sobolev inequal-
ity of Wu [100], which holds in the case where P = Πλ (see [104] for a more direct proof of
this result). Wu’s result in turn strengthens Theorem 3.5, the modified logarithmic Sobolev
inequality of Bobkov and Ledoux [22]. It can be seen directly that Theorem 5.5 tightens
Theorem 3.5 using the bound log u ≤ u − 1 with u = f (x + 1)/f (x).
Note that since Theorems 5.4 and 5.5 are proved under the c-log-concavity condition
(Condition 2), they hold under the stronger assumption that P ∈ ULC(λ) (Condition 1),
taking c = P (0)/P (1).
Remark 5.7. Note that (for reasons similar to those affecting the Johnstone–MacGibbon
Fisher information quantity I(P ) of Definition 3.1), Theorems 5.4 and 5.5 are restricted
to mass functions supported on the whole of Z+ . In order to consider mass functions such
as the Binomial Bn,p , it would be useful to remove this assumption.
However, one possibility in this case is to modify the form of the derivative used in
Definition 5.1. In the paper [52], for mass functions P supported on the
finite set {0, 1, . . . , n}, a 'mixed derivative'
\[
\nabla_n f(x) = \left( 1 - \frac{x}{n} \right) \bigl( f(x+1) - f(x) \bigr)
 + \frac{x}{n} \bigl( f(x) - f(x-1) \bigr) \qquad (25)
\]
was introduced. This interpolates between left and right derivatives, according to the
position where the derivative is taken. In [52] it was shown that the Binomial B_{n,p} mass
function satisfies a Poincaré inequality with respect to ∇_n. The proof was based on an
expansion in Krawtchouk polynomials (see [91]), and exactly parallels the results of Chernoff
(Example 1.2) and Klaassen (Example 5.2). It would be of interest to prove results that
mimic Theorems 5.4 and 5.5 for the derivative ∇_n of (25).
6 Monotonicity of entropy
Next we describe results concerning the monotonicity of entropy on summation and thin-
ning, in a regime described in [47] as the ‘law of thin numbers’. The first interesting feature
is that (unlike the Gaussian case of Theorem 1.6) monotonicity of entropy and of relative
entropy are not equivalent. By convex ordering arguments, Yu [104] showed that:
Theorem 6.1. Given a random variable X with probability mass function P_X and mean
λ, we write D(X) for D(P_X ‖ Π_λ). For independent and identically distributed X_i, with
mass function P and mean λ:
1. the relative entropy D(Σ_{i=1}^{n} T_{1/n} X_i) is monotone decreasing in n;
2. if P ∈ ULC(λ), the entropy H(Σ_{i=1}^{n} T_{1/n} X_i) is monotone increasing in n.
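The first part of Theorem 6.1 is easy to explore numerically: for i.i.d. X_i with mass function P, the mass function of Σ_{i=1}^{n} T_{1/n} X_i is the n-fold convolution of T_{1/n}P. In the hedged sketch below, the mass function P on {0, 1, 2} and the values of n are arbitrary illustrative choices.

```python
import math

def thin(P, alpha):
    # Thinning (9) on a truncated support.
    N = len(P) - 1
    return [sum(math.comb(y, x) * alpha**x * (1 - alpha)**(y - x) * P[y]
                for y in range(x, N + 1)) for x in range(N + 1)]

def convolve(P, Q):
    R = [0.0] * (len(P) + len(Q) - 1)
    for i, pi in enumerate(P):
        for j, qj in enumerate(Q):
            R[i + j] += pi * qj
    return R

def rel_entropy_to_poisson(P):
    # D(P || Pi_lam) with lam the mean of P.
    lam = sum(x * px for x, px in enumerate(P))
    return sum(px * math.log(px / (math.exp(-lam) * lam**x / math.factorial(x)))
               for x, px in enumerate(P) if px > 0)

P = [0.3, 0.4, 0.3]                    # a mass function on {0, 1, 2}, mean 1.0
for n in (1, 2, 4, 8):
    Tn = thin(P, 1.0 / n)
    S = Tn
    for _ in range(n - 1):
        S = convolve(S, Tn)            # mass function of T_{1/n}X_1 + ... + T_{1/n}X_n
    print(n, rel_entropy_to_poisson(S))   # decreasing in n, as in Theorem 6.1, part 1
```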
Theorem 6.2. Given positive α_i such that Σ_{i=1}^{n+1} α_i = 1, and writing α^{(j)} = 1 − α_j, then
for any independent X_i,
\[
n\, D\!\left( \sum_{i=1}^{n+1} T_{\alpha_i} X_i \right)
 \leq \sum_{j=1}^{n+1} \alpha^{(j)}\, D\!\left( \sum_{i \neq j} T_{\alpha_i/\alpha^{(j)}} X_i \right).
\]
Comparison with (5) shows that this is a direct analogue of the result of Artstein et
al. [8], replacing differential entropies h with discrete entropy H, and replacing scalings
by √α with thinnings by α. Since the result of Artstein et al. is a strong one, and implies
strengthened forms of the Entropy Power Inequality, this equivalence is good news for us.
However, there are two serious drawbacks.
Remark 6.4. First, the proof of Theorem 6.3, which is based on an analysis of ‘free
energy’ terms similar to the Λ(α) of Section 4, does not lend itself to easy generalization or
extension. It is possible that a different proof (perhaps using a variational characterization
similar to that of [7, 11]) may lend more insight.
Remark 6.5. Second, Theorem 6.3 does not lead to a single universally accepted discrete
Entropy Power Inequality in the way that Theorem 1.6 did. As discussed in [60], several
natural reformulations of the Entropy Power Inequality fail, and while [60, Theorem 2.5]
does prove a particular discrete Entropy Power Inequality, it is by no means the only
possible result of this kind. Indeed, a growing literature continues to discuss the right
formulation of this inequality.
For example, [83, 99, 101, 102] consider replacing integer addition by binary addition,
with the paper [102] introducing the so-called Mrs. Gerber’s Lemma, extended to a more
general group context by Anantharam and Jog [5, 53]. Harremoës and Vignat [48] (see also
[85]) consider conditions under which the original functional form of the Entropy Power
Inequality continues to hold. Other authors use ideas from additive combinatorics and
sumset theory, with Haghighatshoar, Abbe and Telatar [45] adding an explicit error term
and Wang, Woo and Madiman [98] giving a reformulation based on rearrangements. It
appears that there is considerable scope for extensions and unifications of all this theory.
We write P_p for the probability mass function of the random variable S := X_1 + . . . + X_m, where
the X_i are independent Bernoulli(p_i) variables. Shepp and Olkin conjectured that the entropy
H(P_p) is a concave function of the parameters. We can simplify this conjecture by
considering the affine case, where each p_i(t) = (1 − t) p_i(0) + t p_i(1).
This result is plausible since Shepp and Olkin [86] (see also [74]) proved that for any m
the entropy H(B_{m,p}) of a binomial random variable is concave in p, and also claimed in
their paper that it is true for the sum of m independent Bernoulli random variables when
m = 2, 3. Since then, progress has been limited. In [106, Theorem 2] it was proved that the
entropy is concave when for each i either p_i(0) = 0 or p_i(1) = 0. (In fact, this follows from
the case n = 2 of Theorem 6.3 above.) Hillion [49] proved the case of the Shepp–Olkin
conjecture where all p_i(t) are either constant or equal to t.
However, in recent work of Hillion and Johnson [50, 51], the Shepp–Olkin conjecture
was proved, in a result stated as [51, Theorem 1.2]:
Theorem 7.1. For any affine Shepp–Olkin path p(t), the entropy H(P_{p(t)}) is a concave
function of t ∈ [0, 1].
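The concavity statement is easy to test numerically: along an affine path, P_{p(t)} is a Poisson–binomial mass function, computable by convolving Bernoulli factors. The hedged sketch below picks random endpoints p(0), p(1) and checks that the second finite differences of t ↦ H(P_{p(t)}) are non-positive; all parameter choices and helper names are our own illustrative assumptions.

```python
import math, random

def poisson_binomial_pmf(ps):
    # Mass function of a sum of independent Bernoulli(p_i) variables, built by convolution.
    pmf = [1.0]
    for p in ps:
        new = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            new[k]     += q * (1 - p)
            new[k + 1] += q * p
        pmf = new
    return pmf

def entropy(P):
    return -sum(q * math.log(q) for q in P if q > 0)

random.seed(0)
m = 5
p0 = [random.random() for _ in range(m)]     # endpoint p(0) of the affine path
p1 = [random.random() for _ in range(m)]     # endpoint p(1)

ts = [i / 100 for i in range(101)]
H = [entropy(poisson_binomial_pmf([(1 - t) * a + t * b for a, b in zip(p0, p1)])) for t in ts]

# For a concave function of t, all second finite differences are <= 0.
second_diffs = [H[i - 1] - 2 * H[i] + H[i + 1] for i in range(1, len(H) - 1)]
print(max(second_diffs))    # expected to be <= 0, up to rounding error
```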
We briefly summarise the strategy of the proof, and the ideas involved. In [50], Theorem
7.1 was proved in the monotone case where each p′_i := p′_i(t) has the same sign. In [51],
this was extended to the general case; in fact, the monotone case of [50] is shown to be
the worst case. In order to bound the second derivative of entropy ∂²H(P_{p(t)})/∂t², we need to
control the first and second derivatives of the probability mass function P_{p(t)}, in the spirit
of Lemma 4.1. Indeed, we write (see [51, Equations (8),(9)]) the derivatives of the mass
function in gradient form (see (2) and (19) for comparison):
\[
\frac{\partial P_{p(t)}}{\partial t}(k) = g(k-1) - g(k), \qquad (26)
\]
\[
\frac{\partial^2 P_{p(t)}}{\partial t^2}(k) = h(k-2) - 2h(k-1) + h(k), \qquad (27)
\]
for certain functions g(k) and h(k), about which we can be completely explicit. The key
property (referred to as Condition 4 in [50]) is that for all k:
The proof of (28) is based on a cubic inequality [50, Property(m)] satisfied by Poisson-
binomial mass functions, which relates to an iterated log-concavity result of Brändén [24].
The argument in [50] was based around the idea of optimal transport of probability
measures in the discrete setting, where it was shown that if all the p0i have the same sign
then the Shepp–Olkin path provides an optimal interpolation, in the sense of a discrete
formula of Benamou–Brenier type [15, 16] (see also [38, 41] which also consider optimal
transport of discrete measures, including analysis of discrete curvature in the sense of [28]).
8 Open problems
We briefly list some open problems, corresponding to each research direction described in
Sections 3 to 7. This is not a complete list of open problems in the area, but gives an
indication of some possible future research directions.
2. [Maximum entropy] While not strictly a question to do with discrete random vari-
ables, as described in Section 1 it remains an interesting problem to find classes within
which the stable laws are maximum entropy. As with the ULC class of Condition 1,
or the fixed variance class within which the Gaussian maximises entropy, we want
such a class to be well-behaved on convolution and scaling. Preliminary work in this
direction, including the fact that stable laws are not in general maximum entropy
in their own domain of normal attraction (in the sense of [39]), is described in [56],
and an alternative approach based on non-linear diffusion equations and fractional
calculus is given by Toscani [92].
Note that while they are monotone functions of each other, the q-Rényi and q-Tsallis
entropies need not be concave for the same q. We quote the following generalized
Shepp-Olkin conjecture from [51, Conjecture 4.2]:
(a) There is a critical q_R^* such that the q-Rényi entropy of all Bernoulli sums is
concave for q ≤ q_R^*, and the entropy of some interpolation is convex for q > q_R^*.
(b) There is a critical q_T^* such that the q-Tsallis entropy of all Bernoulli sums is
concave for q ≤ q_T^*, and the entropy of some interpolation is convex for q > q_T^*.
Indeed we conjecture that q_R^* = 2 and q_T^* = 3.65986 . . ., the root of 2 − 4q + 2^q = 0. We
remark that the form of discrete Entropy Power Inequality proposed by [98] (based
on the theory of rearrangements) also holds for Rényi entropies.
In addition, Shepp and Olkin [86] made another conjecture regarding Bernoulli sums,
that the entropy H(P_p) is a monotone increasing function of p if all p_i ≤ 1/2; this
remains open, with essentially no published results.
Acknowledgments
The author thanks the Institute for Mathematics and Its Applications for the invitation
and funding to speak at the workshop ‘Information Theory and Concentration Phenomena’
in Minneapolis in April 2015. In addition, he would like to thank the organisers and
participants of this workshop for stimulating discussions. The author thanks Fraser Daly,
Mokshay Madiman and an anonymous referee for helpful comments on earlier drafts of
this paper.
References
[1] J. A. Adell, A. Lekuona, and Y. Yu. Sharp bounds on the entropy of the Poisson law
and related quantities. IEEE Trans. Inform. Theory, 56(5):2299–2306, May 2010.
[3] S.-i. Amari and H. Nagaoka. Methods of information geometry, volume 191 of Trans-
lations of Mathematical Monographs. American Mathematical Society, Providence,
RI, 2000.
[4] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows in metric spaces and in the
space of probability measures. Lectures in Mathematics ETH Zürich. Birkhäuser
Verlag, Basel, second edition, 2008.
[5] V. Anantharam. Counterexamples to a proposed Stam inequality on finite groups.
IEEE Trans. Inform. Theory, 56(4):1825–1827, 2010.
[7] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. On the rate of convergence in the
entropic central limit theorem. Probab. Theory Related Fields, 129(3):381–390, 2004.
[10] D. Bakry, I. Gentil, and M. Ledoux. Analysis and geometry of Markov diffusion
operators, volume 348 of Grundlehren der mathematischen Wissenschaften. Springer,
2014.
[11] K. Ball, F. Barthe, and A. Naor. Entropy jumps in the presence of a spectral gap.
Duke Math. J., 119(1):41–63, 2003.
[14] A. R. Barron. Entropy and the Central Limit Theorem. Ann. Probab., 14(1):336–342,
1986.
[15] J.-D. Benamou and Y. Brenier. A numerical method for the optimal time-continuous
mass transport problem and related problems. In Monge Ampère equation: appli-
cations to geometry and optimization (Deerfield Beach, FL, 1997), volume 226 of
Contemp. Math., pages 1–11. Amer. Math. Soc., Providence, RI, 1999.
[16] J.-D. Benamou and Y. Brenier. A computational fluid mechanics solution to the
Monge-Kantorovich mass transfer problem. Numer. Math., 84(3):375–393, 2000.
[17] N. M. Blachman. The convolution inequality for entropy powers. IEEE Trans.
Information Theory, 11:267–271, 1965.
[19] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Rate of convergence and Edgeworth-
type expansion in the entropic central limit theorem. Ann. Probab., 41(4):2479–2512,
2013.
[20] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Berry–Esseen bounds in the entropic
central limit theorem. Probability Theory and Related Fields, 159(3-4):435–478, 2014.
[21] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Fisher information and convergence
to stable laws. Bernoulli, 20(3):1620–1646, 2014.
[22] S. G. Bobkov and M. Ledoux. On modified logarithmic Sobolev inequalities for
Bernoulli and Poisson measures. J. Funct. Anal., 156(2):347–365, 1998.
[23] A. Borovkov and S. Utev. On an inequality and a related characterisation of the
normal distribution. Theory Probab. Appl., 28(2):219–228, 1984.
[24] P. Brändén. Iterated sequences and the geometry of zeros. J. Reine Angew. Math.,
658:115–131, 2011.
[25] L. D. Brown. A proof of the Central Limit Theorem motivated by the Cramér-Rao
inequality. In G. Kallianpur, P. R. Krishnaiah, and J. K. Ghosh, editors, Statistics
and Probability: Essays in Honour of C.R. Rao, pages 141–148. North-Holland, New
York, 1982.
[26] T. Cacoullos. On upper and lower bounds for the variance of a function of a random
variable. Ann. Probab., 10(3):799–809, 1982.
[27] L. A. Caffarelli. Monotonicity properties of optimal transportation and the FKG
and related inequalities. Communications in Mathematical Physics, 214(3):547–563,
2000.
[28] P. Caputo, P. Dai Pra, and G. Posta. Convex entropy decay via the Bochner-Bakry-
Emery approach. Ann. Inst. Henri Poincaré Probab. Stat., 45(3):734–753, 2009.
[29] E. Carlen and A. Soffer. Entropy production by block variable summation and
Central Limit Theorems. Comm. Math. Phys., 140(2):339–371, 1991.
[30] E. A. Carlen and W. Gangbo. Constrained steepest descent in the 2-Wasserstein
metric. Ann. of Math. (2), 157(3):807–846, 2003.
D. Chafaï. Binomial-Poisson entropic inequalities and the M/M/∞ queue. ESAIM
Probability and Statistics, 10:317–339, 2006.
[32] H. Chernoff. A note on an inequality involving the normal distribution. Ann. Probab.,
9(3):533–535, 1981.
[33] D. Cordero-Erausquin. Some applications of mass transport to Gaussian-type in-
equalities. Arch. Ration. Mech. Anal., 161(3):257–269, 2002.
[34] F. Daly. Negative dependence and stochastic orderings. arxiv:1504.06493, 2015.
[35] F. Daly and O. T. Johnson. Bounds on the Poincaré constant under negative depen-
dence. Statistics and Probability Letters, 83:511–518, 2013.
[38] M. Erbar and J. Maas. Ricci curvature of finite Markov chains via convexity of the
entropy. Archive for Rational Mechanics and Analysis, 206:997–1038, 2012.
[44] D. Guo, S. Shamai, and S. Verdú. Mutual information and minimum mean-square
error in Gaussian channels. IEEE Trans. Inform. Theory, 51(4):1261–1282, 2005.
[47] P. Harremoës, O. T. Johnson, and I. Kontoyiannis. Thinning, entropy and the law
of thin numbers. IEEE Trans. Inform. Theory, 56(9):4228–4244, 2010.
[48] P. Harremoës and C. Vignat. An Entropy Power Inequality for the binomial fam-
ily. JIPAM. J. Inequal. Pure Appl. Math., 4, 2003. Issue 5, Article 93; see also
http://jipam.vu.edu.au/.
[49] E. Hillion. Concavity of entropy along binomial convolutions. Electron. Commun.
Probab., 17(4):1–9, 2012.
[50] E. Hillion and O. T. Johnson. Discrete versions of the transport equation and the
Shepp-Olkin conjecture. Annals of Probability, 44(1):276–306, 2016.
[51] E. Hillion and O. T. Johnson. A proof of the Shepp-Olkin entropy concavity conjec-
ture. Bernoulli (to appear), 2016. See also arxiv:1503.01570.
[52] E. Hillion, O. T. Johnson, and Y. Yu. A natural derivative on [0, n] and a binomial
Poincaré inequality. ESAIM Probability and Statistics, 16:703–712, 2014.
[53] V. Jog and V. Anantharam. The entropy power inequality and Mrs. Gerber’s Lemma
for groups of order 2^n. IEEE Trans. Inform. Theory, 60(7):3773–3786,
2014.
[54] O. T. Johnson. Information theory and the Central Limit Theorem. Imperial College
Press, London, 2004.
[55] O. T. Johnson. Log-concavity and the maximum entropy property of the Poisson
distribution. Stoch. Proc. Appl., 117(6):791–802, 2007.
[56] O. T. Johnson. A de Bruijn identity for symmetric stable laws. In submission, see
arXiv:1310.2045, 2013.
[58] O. T. Johnson and A. R. Barron. Fisher information inequalities and the Central
Limit Theorem. Probability Theory and Related Fields, 129(3):391–409, 2004.
[60] O. T. Johnson and Y. Yu. Monotonicity, thinning and discrete versions of the Entropy
Power Inequality. IEEE Trans. Inform. Theory, 56(11):5387–5395, 2010.
[62] A. Kagan. A discrete version of the Stam inequality and a characterization of the
Poisson distribution. J. Statist. Plann. Inference, 92(1-2):7–12, 2001.
[65] I. Kontoyiannis, P. Harremoës, and O. T. Johnson. Entropy and the law of small
numbers. IEEE Trans. Inform. Theory, 51(2):466–472, 2005.
[67] C. Ley and Y. Swan. Stein’s density approach for discrete distributions and infor-
mation inequalities. See arxiv:1211.3668, 2012.
[68] E. Lieb. Proof of an entropy conjecture of Wehrl. Comm. Math. Phys., 62:35–41,
1978.
[70] Y. Linnik. An information-theoretic proof of the Central Limit Theorem with the
Lindeberg Condition. Theory Probab. Appl., 4:288–299, 1959.
[71] J. Lott and C. Villani. Ricci curvature for metric-measure spaces via optimal trans-
port. Ann. of Math. (2), 169(3):903–991, 2009.
[72] M. Madiman and A. Barron. Generalized entropy power inequalities and mono-
tonicity properties of information. IEEE Trans. Inform. Theory, 53(7):2317–2329,
2007.
[73] M. Madiman, J. Melbourne, and P. Xu. Forward and reverse Entropy Power Inequal-
ities in convex geometry. See: arxiv:1604.04225, 2016.
[74] P. Mateev. The entropy of the multinomial distribution. Teor. Verojatnost. i Prime-
nen., 23(1):196–198, 1978.
[78] C. R. Rao. On the distance between two populations. Sankhya, 9:246–248, 1948.
[79] A. Rényi. A characterization of Poisson processes. Magyar Tud. Akad. Mat. Kutató
Int. Közl., 1:519–527, 1956.
[81] B. Roos. Kerstan’s method for compound Poisson approximation. Ann. Probab.,
31(4):1754–1771, 2003.
[83] S. Shamai and A. Wyner. A binary analog to the entropy-power inequality. IEEE
Trans. Inform. Theory, 36(6):1428–1430, Nov 1990.
[85] N. Sharma, S. Das, and S. Muthukrishnan. Entropy power inequality for a family of
discrete random variables. In 2011 IEEE International Symposium on Information
Theory Proceedings (ISIT), pages 1945–1949. IEEE, 2011.
[86] L. A. Shepp and I. Olkin. Entropy of the sum of independent Bernoulli random
variables and of the multinomial distribution. In Contributions to probability, pages
201–206. Academic Press, New York, 1981.
[88] A. J. Stam. Some inequalities satisfied by the quantities of information of Fisher and
Shannon. Information and Control, 2:101–112, 1959.
[89] K.-T. Sturm. On the geometry of metric measure spaces. I. Acta Math., 196(1):65–
131, 2006.
[90] K.-T. Sturm. On the geometry of metric measure spaces. II. Acta Math., 196(1):133–
177, 2006.
[92] G. Toscani. The fractional Fisher information and the central limit theorem for stable
laws. Ricerche di Matematica (online first), doi:10.1007/s11587-015-0253-9, 1-21,
2015.
[94] A. Tulino and S. Verdú. Monotonic decrease of the non-Gaussianness of the sum
of independent random variables: a simple proof. IEEE Trans. Inform. Theory,
52(9):4295–4297, 2006.
[95] C. Villani. Topics in optimal transportation, volume 58 of Graduate Studies in Math-
ematics. American Mathematical Society, Providence, RI, 2003.
[96] C. Villani. Optimal transport: Old and New, volume 338 of Grundlehren der Mathe-
matischen Wissenschaften. Springer-Verlag, Berlin, 2009.
[97] D. W. Walkup. Pólya sequences, binomial convolution and the union of random sets.
J. Appl. Probability, 13(1):76–85, 1976.
[98] L. Wang, J. O. Woo, and M. Madiman. A lower bound on the Rényi entropy of
convolutions in the integers. In 2014 IEEE International Symposium on Information
Theory (ISIT), pages 2829–2833. IEEE, 2014.
[100] L. Wu. A new modified logarithmic Sobolev inequality for Poisson point processes
and several applications. Probab. Theory Related Fields, 118(3):427–438, 2000.
[101] A. D. Wyner. A theorem on the entropy of certain binary sequences and applications.
II. IEEE Trans. Information Theory, 19(6):772–777, 1973.
[102] A. D. Wyner and J. Ziv. A theorem on the entropy of certain binary sequences and
applications. I. IEEE Trans. Information Theory, 19(6):769–772, 1973.
[103] Y. Yu. On the maximum entropy properties of the binomial distribution. IEEE
Trans. Inform. Theory, 54(7):3351–3353, July 2008.