
Johnson, O. T. (2017). Entropy and thinning of discrete random variables. In E. Carlen, M. Madiman, & E. Werner (Eds.), Convexity and concentration: proceedings of the Spring 2015 Semester of the Theme Year in Discrete Structures, IMA Minneapolis (pp. 33-53). (The IMA Volumes in Mathematics and its Applications; Vol. 161). Springer. https://doi.org/10.1007/978-1-4939-7005-6_2

Peer reviewed version

Link to published version (if available):


10.1007/978-1-4939-7005-6_2

Link to publication record on the Bristol Research Portal


PDF-document

This is the author accepted manuscript (AAM). The final published version (version of record) is available online
via Springer at https://link.springer.com/chapter/10.1007%2F978-1-4939-7005-6_2. Please refer to any
applicable terms of use of the publisher.

University of Bristol – Bristol Research Portal


General rights

This document is made available in accordance with publisher policies. Please cite only the
published version using the reference above. Full terms of use are available:
http://www.bristol.ac.uk/red/research-policy/pure/user-guides/brp-terms/
Entropy and thinning of discrete random variables
Oliver Johnson∗
June 7, 2016

Abstract
We describe five types of results concerning information and concentration of discrete
random variables, and relationships between them, motivated by their counterparts in
the continuous case. The results we consider are information theoretic approaches to
Poisson approximation, the maximum entropy property of the Poisson distribution,
discrete concentration (Poincaré and logarithmic Sobolev) inequalities, monotonicity
of entropy and concavity of entropy in the Shepp–Olkin regime.

1 Results in the continuous case


This paper gives a personal review of a number of results concerning the entropy and
concentration properties of discrete random variables. For simplicity, we only consider
independent random variables (though it is an extremely interesting open problem to
extend many of the results to the dependent case). These results are generally motivated
by their counterparts in the continuous case, which we will briefly review, using notation
which holds only for Section 1.
For simplicity we restrict our attention in this section to random variables taking values
in R. For any probability density p, write $\lambda_p = \int_{-\infty}^{\infty} x p(x)\,dx$ for its mean and
$\mathrm{Var}_p = \int_{-\infty}^{\infty} (x-\lambda_p)^2 p(x)\,dx$ for its variance. We write h(p) for the differential entropy of p, and
interchangeably write h(X) for X ∼ p. Similarly we write D(p‖q) or D(X‖Y) for relative
entropy. We write φ_{µ,σ²}(x) for the density of the Gaussian Z_{µ,σ²} ∼ N(µ, σ²).

Given a function f, we wish to measure its concentration properties with respect to
probability density p; we write $\lambda_p(f) = \int_{-\infty}^{\infty} f(x) p(x)\,dx$ for the expectation of f with
respect to p, write $\mathrm{Var}_p(f) = \int_{-\infty}^{\infty} p(x)(f(x)-\lambda_p(f))^2\,dx$ for the variance, and define
$$\mathrm{Ent}_p(f) = \int_{-\infty}^{\infty} p(x) f(x) \log f(x)\,dx - \lambda_p(f)\log\lambda_p(f). \qquad (1)$$
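To make these quantities concrete, here is a small numerical sketch (my own, not from the paper; it assumes NumPy and SciPy are available) which evaluates $\lambda_p(f)$, $\mathrm{Var}_p(f)$ and $\mathrm{Ent}_p(f)$ for a standard Gaussian density p and an arbitrary positive test function f.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Illustrative evaluation of the functionals around (1) for p = N(0, 1).
p = norm(0.0, 1.0).pdf
f = lambda x: 1.0 + 0.5 * np.sin(x) ** 2      # arbitrary positive test function

lam = quad(lambda x: f(x) * p(x), -np.inf, np.inf)[0]                # lambda_p(f)
var = quad(lambda x: p(x) * (f(x) - lam) ** 2, -np.inf, np.inf)[0]   # Var_p(f)
ent = quad(lambda x: p(x) * f(x) * np.log(f(x)), -np.inf, np.inf)[0] - lam * np.log(lam)

print(lam, var, ent)   # Ent_p(f) >= 0, by convexity of t*log(t) and Jensen's inequality
```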

We briefly summarize the five types of results we study in this paper; their discrete analogues
are described in more detail in Sections 3 to 7 respectively.

∗School of Mathematics, University of Bristol, University Walk, Bristol, BS8 1TW, UK. Email:
[email protected]

1. [Normal approximation] A paper by Linnik [70] was the first to attempt to use
ideas from information theory to prove the Central Limit Theorem, with later con-
tributions by Derriennic [37] and Shimizu [87]. This idea was developed considerably
by Barron [14], building on results of Brown [25]. Barron considered independent and
identically distributed $X_i$ with mean 0 and variance $\sigma^2$. He proved [14] that the relative
entropy $D((X_1 + \cdots + X_n)/\sqrt{n}\,\|\,Z_{0,\sigma^2})$ converges to zero if and only if it ever becomes
finite. Carlen and Soffer [29] proved similar results for non-identical variables under
a Lindeberg-type condition.
While none of these papers gave an explicit rate of convergence, later work [7, 11, 58]
proved that relative entropy converges at an essentially optimal O(1/n) rate, under
the (too strong) assumption of finite Poincaré constant (see Definition 1.1 below).
Typically (see [54] for a review), these papers did not manipulate entropy directly,
but rather used properties of projection operators to control the behaviour of Fisher
information on convolution. The use of Fisher information is based on the de Bruijn
identity (see [14, 17, 88]), which relates entropy to Fisher information, using the fact
that (see for example [88, Equation (5.1)] for fixed random variable X, the density
pt of X + Z0,t satisfies a heat equation of the form

∂pt 1 ∂ 2 pt 1 ∂
(z) = 2
(z) = (pt (z)ρt (z)) , (2)
∂t 2 ∂z 2 ∂z
where ρt (z) := ∂p
∂z
t
(z)/pt (z) is the score function with respect to location parameter,
in the Fisher sense.
More recent work of Bobkov, Chistyakov and Götze [19, 20] used properties of char-
acteristic functions to remove the assumption of finite Poincaré constant, and even
extended this theory to include convergence to other stable laws [18, 21].

2. [Maximum entropy] In [84, Section 20], Shannon described maximum entropy re-
sults for continuous random variables in exponential family form. That is, given a
test function ψ, positivity of the relative entropy shows that the entropy h(p) is
maximised, for fixed values of $\int_{-\infty}^{\infty} p(x)\psi(x)\,dx$, by densities proportional to $\exp(-c\psi(x))$
for some value of c.
This is a very useful result in many cases, and gives us well-known facts such as
that entropy is maximised under a variance constraint by the Gaussian density. This
motivates the information theoretic approaches to normal approximation described
above, as well as the monotonicity of entropy result of [8] described later.
However, many interesting densities cannot be expressed in exponential family form
for a natural choice of ψ, and we would like to extend maximum entropy theory to
cover these. As remarked in [40, p.215], “immediate candidates to be subjected to
such analysis are, of course, stable laws”. For example, consider the Cauchy density,
for which it is not at all natural to consider the expectation of $\psi(x) = \log(1 + x^2)$
in order to define a maximum entropy class. In the discrete context of Section 4
we discuss the Poisson mass function, again believing that $\mathbb{E}\log X!$ is not a natural
object of study.
3. [Concentration inequalities] Next we consider concentration inequalities; that is,
relationships between senses in which a function is approximately constant. We focus
on Poincaré and logarithmic Sobolev inequalities.
Definition 1.1. We say that a probability density p has Poincaré constant R_p if, for
all sufficiently well-behaved functions g,
$$\mathrm{Var}_p(g) \le R_p \int_{-\infty}^{\infty} p(x)\,g'(x)^2\,dx. \qquad (3)$$

Papers vary in the exact sense of ‘well-behaved’ in this definition, and for the sake
of brevity we will not focus on this issue; we will simply suppose that both sides of
(3) are well-defined.
Example 1.2. Chernoff [32] (see also [23]) used an expansion in Hermite polynomials
(see [91]) to show that the Gaussian densities φ_{µ,σ²} have Poincaré constant equal
to σ². By considering g(t) = t, it is clear that for any random variable X with finite
variance, R_p ≥ Var_p. Chernoff's result [32] played a central role in the information
theoretic proofs of the Central Limit Theorem of Brown [25] and Barron [14]. (A
numerical illustration of the Gaussian case is sketched at the end of this section.)

In a similar way, we define the log-Sobolev constant, in the form discussed in [10]:
Definition 1.3. We say that a probability density p satisfies the logarithmic Sobolev
inequality with constant C if for any function f with positive values:
$$\mathrm{Ent}_p(f) \le \frac{C}{2}\int_{-\infty}^{\infty} \frac{\Gamma_1(f,f)(x)}{f(x)}\,p(x)\,dx, \qquad (4)$$
where we write $\Gamma_1(f,g) = f'g'$, and view the RHS as a Fisher-type expression.

This is an equivalent formulation of the log-Sobolev inequality first stated by Gross
[42], who proved that the Gaussian density φ_{µ,σ²} has log-Sobolev constant σ². In fact,
Gross’s Gaussian log-Sobolev inequality [42] follows from Shannon’s Entropy Power
Inequality [84, Theorem 15], as proved by the Stam–Blachman approach [17, 88].
However, we would like to find results that strengthen Chernoff and Gross’s results,
to provide inequalities of similar functional form, under assumptions that p belongs
to particular classes of densities. Gross’s Gaussian log-Sobolev inequality can be
considerably generalized by the Bakry–Émery calculus [9] (see [6, 10, 43] for reviews
of this theory). Assuming that the so-called Bakry-Émery condition is satisfied,
taking $U(t) = t^2$ in [9, Proposition 5] we deduce (see also [10, Proposition 4.8.1]):
Theorem 1.4. If the Bakry-Émery condition holds with constant c then the Poincaré
inequality holds with constant 1/c.

Similarly [9, Theorem 1] (see also [10, Proposition 5.7.1]) gives that:

Theorem 1.5. If the Bakry-Émery condition holds with constant c then the logarith-
mic Sobolev inequality holds with constant 1/c.

If p is Gaussian with variance σ², then since (see [43, Example 4.18]) the Bakry-Émery
condition holds with constant c = 1/σ², we recover the original Poincaré inequality
of Chernoff [32] and the log-Sobolev inequality of Gross [42]. Indeed, if the ratio
p/φ_{µ,σ²} is a log-concave function, then this approach shows that the Poincaré and
log-Sobolev inequalities hold with constant σ².

4. [Monotonicity of entropy] While Brown [25] and Barron [14] exploited ideas of
Stam [88] to deduce that the entropy of $(X_1 + \cdots + X_n)/\sqrt{n}$ increases in the Central
Limit Theorem regime along the ‘powers of 2 subsequence’ n = 2^k, it remained an
open problem for many years to show that the entropy is monotonically increasing
in all n. This conjecture may well date back to Shannon, and was mentioned by
Lieb [68]. It was resolved by Artstein, Ball, Barthe and Naor [8], who proved a more
general result using a variational characterization of Fisher information (introduced
in [11]):

Theorem 1.6 ([8]). Given independent continuous $X_i$ with finite variance, for any
positive $\alpha_i$ such that $\sum_{i=1}^{n+1}\alpha_i = 1$, writing $\alpha^{(j)} = 1 - \alpha_j$, then
$$n\,h\!\left(\sum_{i=1}^{n+1}\sqrt{\alpha_i}\,X_i\right) \ge \sum_{j=1}^{n+1}\alpha^{(j)}\,h\!\left(\sum_{i\neq j}\sqrt{\alpha_i/\alpha^{(j)}}\,X_i\right). \qquad (5)$$

This result is referred to as monotonicity because, choosing $\alpha_i = 1/(n+1)$ for
independent and identically distributed (IID) $X_i$, (5) shows that $h\!\left(\sum_{i=1}^{n} X_i/\sqrt{n}\right)$ is
monotonically increasing in n. Equivalently, the relative entropy $D\!\left(\sum_{i=1}^{n} X_i/\sqrt{n}\,\big\|\,Z\right)$ is
monotonically decreasing in n. This means that the Central Limit Theorem can be
viewed as an equivalent of the Second Law of Thermodynamics. An alternative proof
of this monotonicity result is given by Tulino and Verdú [94], based on the MMSE
characterization of mutual information of [44].
The form of (5) in the case n = 2 was previously a well-known result; indeed (see for
example [36]) it is equivalent to Shannon’s Entropy Power Inequality [84, Theorem
15]. However, Theorem 1.6 allows a stronger form of the Entropy Power Inequality
to be proved, referring to ‘leave-one-out’ sums (see [8, Theorem 3]). This was gen-
eralized by Madiman and Barron [72], who considered the entropy power of sums
of arbitrary subsets of the original variables. The more recent paper [73] reviews
in detail relationships between the Entropy Power Inequality and results in convex
geometry, including the Brunn-Minkowski inequality.

5. [Concavity of entropy] In the setting of probability measures taking values on the
real line R, there has been considerable interest in optimal transportation of mea-
sures; that is, given densities f0 and f1 , to find a sequence of densities ft smoothly
interpolating between them in a way which minimises some appropriate cost func-
tion. For example, using the natural quadratic cost function induces the Wasserstein
distance W2 between f0 and f1 . This theory is extensively reviewed in the two books
by Villani [95, 96], even for probability measures taking values on a Riemannian man-
ifold. In this setting, work by authors including Lott, Sturm and Villani [71, 89, 90]
relates the concavity of the entropy h(ft ) as a function of t to the Ricci curvature of
the underlying manifold.
Key works in this setting include those by Benamou and Brenier [15, 16] (see also
[30]), who gave a variational characterization of the Wasserstein distance between
f0 and f1 . Authors such as Caffarelli [27] and Cordero-Erausquin [33] used these
ideas to give new proofs of results such as log-Sobolev and transport inequalities.
Concavity of entropy also plays a key role in the field of information geometry (see
[2, 3, 78]), where the Hessian of entropy induces a metric on the space of probability
measures.
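As promised in Example 1.2, here is a minimal numerical sketch (my own, not from the paper; NumPy and SciPy assumed) of the Gaussian Poincaré inequality (3) with constant σ², for one arbitrary smooth test function g.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Numerical check of the Gaussian Poincare inequality of Example 1.2:
# Var_p(g) <= sigma^2 * E_p[g'(X)^2] for p = N(mu, sigma^2).
mu, sigma = 1.0, 2.0
p = norm(loc=mu, scale=sigma).pdf
g = np.tanh                                  # arbitrary smooth test function
g_prime = lambda x: 1.0 / np.cosh(x) ** 2

mean_g = quad(lambda x: g(x) * p(x), -np.inf, np.inf)[0]
var_g = quad(lambda x: (g(x) - mean_g) ** 2 * p(x), -np.inf, np.inf)[0]
rhs = sigma ** 2 * quad(lambda x: g_prime(x) ** 2 * p(x), -np.inf, np.inf)[0]

print(var_g, rhs, var_g <= rhs)              # expect True: Poincare constant sigma^2
```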

2 Technical definitions in the discrete case


In the remainder of this paper, we describe results which can be seen as discrete analogues
of the results of Section 1. In that section, a distinguished role was played by the set of
Gaussian densities, a set which has attractive properties including closure under convolu-
tion and scaling, and an explicit value for their entropies. We argue that a similar role
is played by the Poisson mass functions, a class which is preserved on convolution and
thinning (see Definition 2.2 below), even though (see [1]) we only have entropy bounds.
We now make some definitions which will be useful throughout the remainder of the
paper. From now on, we will assume all random variables take values on the positive integers
$\mathbb{Z}_+$. Given a probability mass function (distribution) P, we write $\lambda_P = \sum_{x=0}^{\infty} x P(x)$ for
its mean and $\mathrm{Var}_P = \sum_{x=0}^{\infty}(x-\lambda_P)^2 P(x)$ for its variance. In particular, write
$\Pi_\lambda(x) = e^{-\lambda}\lambda^x/x!$ for the mass function of a Poisson random variable with mean λ, and
$B_{n,p}(x) = \binom{n}{x} p^x(1-p)^{n-x}$ for the mass function of a Binomial Bin(n, p). We write H(P) for the
discrete entropy of a probability distribution P, and interchangeably write H(X) for X ∼ P.
For any random variable X we write P_X, defined by P_X(x) = P(X = x), for its probability
mass function.

Given a function f and mass function P, we write $\lambda_P(f) = \sum_{x=0}^{\infty} f(x)P(x)$ for the
expectation of f with respect to P, write $\mathrm{Var}_P(f) = \sum_{x=0}^{\infty} P(x)(f(x)-\lambda_P(f))^2$ for the variance, and define
$$\mathrm{Ent}_P(f) = \sum_{x=0}^{\infty} P(x)f(x)\log f(x) - \lambda_P(f)\log\lambda_P(f). \qquad (6)$$
Note that if f = Q/P, where Q is a probability mass function, then
$$\mathrm{Ent}_P(f) = \sum_{x=0}^{\infty} Q(x)\log\left(\frac{Q(x)}{P(x)}\right) = D(Q\|P), \qquad (7)$$
the relative entropy from Q to P . We define the operators ∆ and ∆∗ (which are adjoint with
respect to counting measure) by ∆f (x) = f (x + 1) − f (x) and ∆∗ f (x) = f (x − 1) − f (x),
with the convention that f (−1) = 0.
We now make two further key definitions: in Condition 1 we define the ultra-log-concave
(ULC) random variables (as introduced by Pemantle [77] and Liggett [69]), and in Definition
2.2 we define the thinning operation introduced by Rényi [79] for point processes.

Condition 1 (ULC). For any λ, define the class of ultra-log-concave (ULC) mass functions
$$\mathrm{ULC}(\lambda) = \{P : \lambda_P = \lambda \text{ and } P(v)/\Pi_\lambda(v) \text{ is log-concave}\}.$$
Equivalently, we require that $vP(v)^2 \ge (v+1)P(v+1)P(v-1)$ for all v.
The ULC property is preserved on convolution; that is, if P ∈ ULC(λ_P) and Q ∈
ULC(λ_Q) then P ⋆ Q ∈ ULC(λ_P + λ_Q) (see [97, Theorem 1] or [69, Theorem 2]). The ULC
class contains a variety of parametric families of random variables, including the Poisson
and sums of independent Bernoulli variables (Poisson–Binomial), in particular the binomial
distribution.
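As a quick sanity check of Condition 1 (my sketch, not from the paper; SciPy assumed), the equivalent inequality $vP(v)^2 \ge (v+1)P(v+1)P(v-1)$ can be verified directly for a binomial mass function; the parameters n = 20 and p = 0.3 are arbitrary.

```python
import numpy as np
from scipy.stats import binom

# Check of the ULC inequality v*P(v)^2 >= (v+1)*P(v+1)*P(v-1) from Condition 1
# for a binomial mass function (an illustrative choice of n and p).
n, p = 20, 0.3
v = np.arange(1, n)                      # interior points; boundary terms vanish
P = binom(n, p).pmf

lhs = v * P(v) ** 2
rhs = (v + 1) * P(v + 1) * P(v - 1)
print(np.all(lhs >= rhs))                # expect True: Bin(n, p) is ULC
```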
A weaker condition than ULC (Condition 1) is given in [57], which corresponds to
Assumption A of [28]:
Definition 2.1. Given a probability mass function P, write
$$E_P(x) := \frac{P(x)^2 - P(x-1)P(x+1)}{P(x)P(x+1)} = \frac{P(x)}{P(x+1)} - \frac{P(x-1)}{P(x)}. \qquad (8)$$
Condition 2 (c-log-concavity). If EP (x) ≥ c for all x ∈ Z+ , we say that P is c-log-concave.
[28, Section 3.2] showed that if P is ULC then it is c-log-concave, with c = P (0)/P (1).
In [57, Proposition 4.2] we show that DBE(c), a discrete form of the Bakry-Émery condi-
tion, is implied by c-log-concavity.
Another condition weaker than the ULC property (Condition 1) was discussed in [35],
related to the definition of a size-biased distribution. For a mass function P , we define
its size-biased version P ∗ by P ∗ (x) = (x + 1)P (x + 1)/λP (note that some authors define
this to be xP (x)/λP ; the difference is simply an offset). Now P ∈ ULC(λP ) means that
P ∗ is dominated by P in likelihood ratio ordering (write P ∗ ≤lr P ). This is a stronger
assumption than that P ∗ is stochastically dominated by P (write P ∗ ≤st P ) – see [82] for a
review of stochastic ordering results. As described in [34, 35, 75], such orderings naturally
arise in many contexts where P is the mass function of a sum of negatively dependent
random variables.
The other construction at the heart of this paper is the thinning operation introduced
by Rényi [79]:

Definition 2.2. Given probability mass function P, define the α-thinned version $T_\alpha P$ to be
$$T_\alpha P(x) = \sum_{y=x}^{\infty}\binom{y}{x}\alpha^x(1-\alpha)^{y-x}P(y). \qquad (9)$$
Equivalently, if Y ∼ P then $T_\alpha P$ is the distribution of the random variable $T_\alpha Y := \sum_{i=1}^{Y} B_i$,
where $B_1, B_2, \ldots$ are IID Bernoulli(α) random variables, independent of Y.
A key result for Section 4 is the fact that (see [55, Proposition 3.7]) if P ∈ ULC(λP )
then Tα P ∈ ULC(αλP ). Further properties of this thinning operation are reviewed in
[47], including the fact that Tα preserves several parametric families, such as the Poisson,
binomial and negative binomial. In that √ paper, we argue that the thinning operation
Tα is the discrete equivalent of scaling by α and that the ‘mean-preserving
√ √ transform’
Tα X + T1−α Y is equivalent to the ‘variance-preserving transform’ αX + 1 − αY in the
continuous case.

3 Score function and Poisson approximation


A well-known phenomenon, often referred to as the ‘law of small numbers’, states that in
a triangular array of ‘small’ random variables, the row sums converge in distribution to
the Poisson probability distribution Π_λ. For example, the binomial laws Bin(n, λ/n) converge
to Π_λ as n → ∞. This problem has been extensively studied, with a
variety of strong bounds proved using techniques such as the Stein–Chen method (see for
example [12] for a summary of this technique and its applications).
Following the pioneering work of Linnik [70] applying information theoretic ideas to
normal approximation, it was natural to wonder whether Poisson approximation could be
considered in a similar framework. An early and important contribution in this direction
was made by Johnstone and MacGibbon [61], who made the following natural definition
(also used in [62, 76]), which mirrors the Fisher-type expression of (4):
Definition 3.1. Given a probability distribution P, define
$$I(P) := \sum_{x=0}^{\infty} P(x)\left(\frac{P(x-1)}{P(x)} - 1\right)^2. \qquad (10)$$

Johnstone and MacGibbon showed that this quantity has a number of desirable features:
first, a Cramér-Rao lower bound I(P) ≥ 1/Var_P (see [61, Lemma 1.2]), with equality if and
only if P is Poisson; second, a projection identity which implies the subadditivity result
I(P ⋆ Q) ≤ (I(P) + I(Q))/4 (see [61, Proposition 2.2]). In theory, these two results should
allow us to deduce Poisson convergence in the required manner. However, there is a serious
problem. Expanding (10) we obtain
$$I(P) = \left(\sum_{x=0}^{\infty}\frac{P(x-1)^2}{P(x)}\right) - 1. \qquad (11)$$
For example, if P is a Bernoulli(p) distribution then the bracketed sum in (11) becomes
P (0)2 /P (1) + P (1)2 /P (2) = (1 − p)2 /p + p2 /0 = ∞. Indeed, for any mass function P
with finite support I(P ) = ∞. To counter this problem, a new definition was made by
Kontoyiannis, Harremoës and Johnson in [65].
Definition 3.2 ([65]). For probability mass function P with mean λ_P, define the scaled
score function ρ_P and scaled Fisher information K by
$$\rho_P(x) := \frac{(x+1)P(x+1)}{\lambda_P P(x)} - 1, \qquad (12)$$
$$K(P) := \lambda_P \sum_{x=0}^{\infty} P(x)\,\rho_P(x)^2. \qquad (13)$$

Note that this definition does not suffer from the problems of Definition 3.1 described
above.

Example 3.3. For P Bernoulli(p), the scaled score function can be evaluated as $\rho_P(0) = P(1)/(\lambda_P P(0)) - 1 = p/(1-p)$ and $\rho_P(1) = 2P(2)/(\lambda_P P(1)) - 1 = -1$, and the scaled Fisher
information as $K(P) = p^2/(1-p)$.
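A small numerical sketch (my own; NumPy/SciPy assumed) of the scaled score (12) and scaled Fisher information (13) for mass functions stored as truncated vectors; it reproduces the Bernoulli value $K(P) = p^2/(1-p)$ of Example 3.3 and shows that K essentially vanishes for a (truncated) Poisson law.

```python
import numpy as np
from scipy.stats import poisson

def scaled_fisher_K(P):
    """Scaled Fisher information (13) of a mass function stored as a vector P[0..N]."""
    x = np.arange(len(P))
    lam = np.sum(x * P)
    rho = np.empty_like(P, dtype=float)
    # scaled score (12); the final entry uses P(N+1) = 0, so rho(N) = -1
    rho[:-1] = (x[1:] * P[1:]) / (lam * P[:-1]) - 1.0
    rho[-1] = -1.0
    return lam * np.sum(P * rho ** 2)

p = 0.2
print(scaled_fisher_K(np.array([1.0 - p, p])), p ** 2 / (1 - p))   # equal
print(scaled_fisher_K(poisson(3.0).pmf(np.arange(60))))            # approximately 0
```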
Further, K retains the desirable features of Definition 3.1: first, positivity of the perfect
square ensures that K(P) ≥ 0, with equality if and only if P is Poisson; second, a
projection identity [65, Lemma, p. 471] ensures that the following subadditivity result holds:

Theorem 3.4 ([65, Proposition 3]). If $S = \sum_{i=1}^{n} X_i$ is the sum of n independent random
variables $X_i$ with means $\lambda_i$, then writing $\lambda_S = \sum_{i=1}^{n}\lambda_i$:
$$K(P_S) \le \frac{1}{\lambda_S}\sum_{i=1}^{n}\lambda_i\,K(P_{X_i}). \qquad (14)$$

As discussed in [65], this result implies Poisson convergence at rates of an optimal order,
in a variety of senses including Hellinger distance and total variation. However, perhaps
the most interesting consequence comes via the following modified logarithmic Sobolev
inequality of Bobkov and Ledoux [22], proved via a tensorization argument:
Theorem 3.5 ([22], Corollary 4). For any function f taking positive values:
$$\mathrm{Ent}_{\Pi_\lambda}(f) \le \lambda\sum_{x=0}^{\infty}\Pi_\lambda(x)\frac{(f(x+1)-f(x))^2}{f(x)}. \qquad (15)$$

Observe that (see [65, Proposition 2]) taking f = P/Π_λ for some probability mass
function P with mean λ_P = λ, the RHS of (15) can be written as
$$\lambda\sum_{x=0}^{\infty}\Pi_\lambda(x)f(x)\left(\frac{f(x+1)}{f(x)}-1\right)^2 = \lambda\sum_{x=0}^{\infty}P(x)\left(\frac{(x+1)P(x+1)}{\lambda_P P(x)}-1\right)^2, \qquad (16)$$

and so (15) becomes D(P‖Π_λ) ≤ K(P). Hence, combining Theorems 3.4 and 3.5, we
deduce convergence in the strong sense of relative entropy in a general setting.

Example 3.6. (See [65, Example 1].) Combining Example 3.3 with Theorems 3.4 and 3.5 we deduce
that
$$D(B_{n,\lambda/n}\|\Pi_\lambda) \le \frac{\lambda^2}{n(n-\lambda)}, \qquad (17)$$
so via Pinsker's inequality [66] we deduce that
$$\|B_{n,\lambda/n} - \Pi_\lambda\|_{TV} \le \lambda\sqrt{\frac{2}{n(n-\lambda)}}, \qquad (18)$$
giving the same order of convergence (if not the optimal constant) as results stated in [12].
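A quick numerical check (my own, not from the paper; SciPy assumed) that the bound (17) holds for one choice of n and λ, computing the relative entropy directly from the two mass functions.

```python
import numpy as np
from scipy.stats import binom, poisson

# Direct check of (17): D(B_{n,lambda/n} || Pi_lambda) <= lambda^2 / (n(n - lambda)).
n, lam = 50, 3.0
x = np.arange(n + 1)                      # the binomial is supported on 0..n
B = binom(n, lam / n).pmf(x)
Pi = poisson(lam).pmf(x)

D = np.sum(B * np.log(B / Pi))            # relative entropy D(B || Pi)
print(D, lam ** 2 / (n * (n - lam)))      # expect D <= the bound
```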
We make the following brief remarks, the first of which will be useful in the proof of
the Poisson maximum entropy property in Section 4:
Remark 3.7. An equivalent formulation of the ULC property of Condition 1 is that P ∈
ULC(λP ) if and only if the scaled score function ρP (x) is a decreasing function of x.
Remark 3.8. In [13], definitions of score functions and subadditive inequalities are ex-
tended to the compound Poisson setting. The resulting compound Poisson approximation
bounds are close to the strongest known results, due to authors such as Roos [81].
Remark 3.9. The fact that two competing definitions of the score function and Fisher
information are possible (see Definition 3.1 and Definition 3.2) corresponds to the two
definitions of the Stein operator in [67, Section 1.2].

4 Poisson maximum entropy property


We next prove a maximum entropy property for the Poisson distribution. The first results
in this direction were proved by Shepp and Olkin [86] and by Mateev [74], who showed
that, for any n, entropy is maximised among sums of n independent Bernoulli variables (Poisson-binomial laws) with mean λ by the
binomial Bin(n, λ/n) distribution. Using a limiting argument, Harremoës [46] showed that
the Poisson Πλ maximises the entropy in the closure of the set of Poisson-binomial variables
with mean λ (though note that the Poisson itself does not lie in this set).
Johnson [55] gave the following argument, based on the idea of ‘smart paths’. For
some function Λ(α), surprisingly, it can be easiest to prove Λ(1) ≤ Λ(0) by proving the
stronger result that Λ′(α) ≤ 0 for all α ∈ [0, 1]. The key idea in this case is to introduce
an interpolation between probability measures using the thinning operation of Definition
2.2, and to show that it satisfies a partial differential equation corresponding to the action
of the M/M/∞ queue (see also [31]).
Lemma 4.1. Write $P_\alpha(z) = \mathbb{P}(T_\alpha X + T_{1-\alpha}Y = z)$, where X ∼ P_X with mean λ and
Y ∼ Π_λ. Then (in a form reminiscent of (2) above):
$$\frac{\partial}{\partial\alpha}P_\alpha(x) = \frac{\lambda}{\alpha}\,\Delta^*\bigl(P_\alpha(x)\rho_\alpha(x)\bigr), \qquad (19)$$
where $\rho_\alpha := \rho_{P_\alpha}$ is the score in the sense of Definition 3.2 above.

The gradient form of the RHS of this partial derivative is key for us, and is reminiscent
of the continuous probability models studied in [4]. Such representations will lie at the
heart of the analysis of entropy in the Shepp–Olkin setting [50, 51, 86] in Section 7.
Theorem 4.2. If P ∈ ULC(λ) then

H(P ) ≤ H(Πλ ), (20)

with equality if and only if P ≡ Πλ .


Proof. We consider the functional $\Lambda(\alpha) = -\sum_{x=0}^{\infty} P_\alpha(x)\log\Pi_\lambda(x)$ as a function of α.
Using Lemma 4.1 and the adjoint property of ∆ and ∆* we deduce that
$$\Lambda'(\alpha) = -\sum_{x=0}^{\infty}\frac{\partial}{\partial\alpha}P_\alpha(x)\log\Pi_\lambda(x) = -\frac{\lambda}{\alpha}\sum_{x=0}^{\infty}\Delta^*\bigl(P_\alpha(x)\rho_\alpha(x)\bigr)\log\Pi_\lambda(x)
= \frac{\lambda}{\alpha}\sum_{x=0}^{\infty}P_\alpha(x)\rho_\alpha(x)\log\left(\frac{x+1}{\lambda}\right). \qquad (21)$$

Now, the assumption that P ∈ ULC(λ) means that $P_\alpha \in \mathrm{ULC}(\lambda)$, so by Remark 3.7 the
score function $\rho_\alpha(z)$ is decreasing in z. Hence (21) is the covariance of an increasing and
a decreasing function, so by ‘Chebyshev’s other inequality’ (see [63, Equation (1.7)]) it is
negative.

In other words, X ∈ ULC(λ) makes Λ′(α) ≤ 0, so Λ(α) is a decreasing function of α.
Since $P_0 = \Pi_\lambda$ and $P_1 = P$, we deduce that
$$H(X) \le -\sum_{x=0}^{\infty} P(x)\log\Pi_\lambda(x) \qquad (22)$$
$$= \Lambda(1) \le \Lambda(0) = -\sum_{x=0}^{\infty}\Pi_\lambda(x)\log\Pi_\lambda(x) = H(\Pi_\lambda),$$

and the result is proved. Here (22) follows by the positivity of relative entropy.
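A numerical illustration of Theorem 4.2 (my sketch, not part of the proof; SciPy assumed): the binomial Bin(n, λ/n) is ULC with mean λ, so its entropy should not exceed $H(\Pi_\lambda)$; entropies are computed from truncated mass-function vectors.

```python
import numpy as np
from scipy.stats import binom, poisson

def entropy(P):
    """Discrete entropy H(P) of a mass-function vector, ignoring zero entries."""
    P = P[P > 0]
    return -np.sum(P * np.log(P))

n, lam = 30, 4.0
H_binomial = entropy(binom(n, lam / n).pmf(np.arange(n + 1)))
H_poisson = entropy(poisson(lam).pmf(np.arange(200)))   # truncation; tail negligible
print(H_binomial, H_poisson, H_binomial <= H_poisson)   # expect True by Theorem 4.2
```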
Remark 4.3. These ideas were developed by Yu [103], who proved that for any n, λ the
binomial mass function Bn,λ/n maximises entropy in the class of mass functions P such
that P/Bn,λ/n is log-concave (this class is referred to as the ULC(n) class by [77] and [69]).
Remark 4.4. A related extension was given in the compound Poisson case in [59], one
assumption of which was removed by Yu [105]. Yu’s result [105, Theorem 3] showed that
(assuming it is log-concave) the compound Poisson distribution CP (λ, Q) is maximum
entropy among all distributions with ‘claim number distribution’ in ULC(λ) and given
‘claim size distribution’ Q. Yu [105, Theorem 2] also proved a corresponding result for
compound binomial distributions.
Remark 4.5. The assumption in Theorem 4.2 that P ∈ ULC(λP ) was weakened in recent
work of Daly [34, Corollary 2.3], who proved the same result assuming that P ∗ ≤st P .

5 Poincaré and log-Sobolev inequalities
We give a definition which mimics Definition 1.1 in the discrete case:
Definition 5.1. We say that a probability mass function P has Poincaré constant R_P if,
for all sufficiently well-behaved functions g,
$$\mathrm{Var}_P(g) \le R_P \sum_{x=0}^{\infty} P(x)\,(\Delta g(x))^2. \qquad (23)$$

Example 5.2. Klaassen’s paper [64] (see also [26]) showed that the Poisson random vari-
able Πλ has Poincaré constant λ. As in Example 1.2, taking g(t) = t, we deduce that
RP ≥ VarP for all P with finite variance. The result of Klaassen can also be proved by
expanding in Poisson–Charlier polynomials (see [91]), to mimic the Hermite polynomial
expansion of Example 1.2.
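A small numerical check (my own; SciPy assumed) of the Poisson Poincaré inequality of Example 5.2, namely (23) with constant λ for $P = \Pi_\lambda$, using an arbitrary test function g on a truncated range.

```python
import numpy as np
from scipy.stats import poisson

# Check of the discrete Poincare inequality (23) with R_P = lambda for P = Pi_lambda.
lam, N = 5.0, 120                        # truncation N chosen so the tail is negligible
x = np.arange(N + 1)
P = poisson(lam).pmf(x)
g = np.sqrt(x + 1.0)                     # arbitrary test function

mean_g = np.sum(P * g)
var_g = np.sum(P * (g - mean_g) ** 2)
dirichlet = np.sum(P[:-1] * (g[1:] - g[:-1]) ** 2)        # sum of P(x) * (Delta g(x))^2
print(var_g, lam * dirichlet, var_g <= lam * dirichlet)   # expect True
```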
However, we would like to generalize and extend Klaassen's work, in the spirit of Theorems 1.4 and 1.5 (originally due to Bakry and Émery [9]). Some results in this direction
were given by Chafaï [31]; however, we briefly summarize two more recent approaches, using
ideas described in two separate papers.
In [35, Theorem 1.1], using a construction based on Klaassen's original paper [64], it
was proved that R_P ≤ λ_P, assuming that P* ≤_st P. Recall (see [82]) that this is implied by
P* ≤_lr P, which is a restatement of the ULC property (Condition 1). We deduce the
following result [35, Corollary 2.4]:
Theorem 5.3. For P ∈ ULC(λ):
VarP ≤ RP ≤ λP .
The second approach, appearing in [57], is based on a birth-and-death Markov chain
construction introduced by Caputo, Dai Pra and Posta [28], generalizing the thinning-based
interpolation of Lemma 4.1. In [57], ideas based on the Bakry-Émery calculus are used to
prove the following two main results, taken from [57, Theorem 1.5] and [57, Theorem 1.3],
which correspond to Theorems 1.4 and 1.5 respectively.
Theorem 5.4 (Poincaré inequality). Any probability mass function P whose support is the
whole of the positive integers Z+ and which satisfies the c-log-concavity condition (Condi-
tion 2) has Poincaré constant 1/c.
Note that (see the discussion at the end of [57]) Theorem 5.4 typically gives a weaker
result than Theorem 5.3, under weaker conditions. Next we describe a modified form of
log-Sobolev inequality proved in [57]:
Theorem 5.5 (New modified log-Sobolev inequality). Fix probability mass function P
whose support is the whole of the positive integers Z+ and which satisfies the c-log-concavity
condition (Condition 2). For any function f with positive values:
$$\mathrm{Ent}_P(f) \le \frac{1}{c}\sum_{x=0}^{\infty} P(x)\,f(x+1)\left(\log\frac{f(x+1)}{f(x)} - 1 + \frac{f(x)}{f(x+1)}\right). \qquad (24)$$
Remark 5.6. As described in [57], Theorem 5.5 generalizes a modified log-Sobolev inequal-
ity of Wu [100], which holds in the case where P = Πλ (see [104] for a more direct proof of
this result). Wu’s result in turn strengthens Theorem 3.5, the modified logarithmic Sobolev
inequality of Bobkov and Ledoux [22]. It can be seen directly that Theorem 5.5 tightens
Theorem 3.5 using the bound log u ≤ u − 1 with u = f (x + 1)/f (x).

Note that since Theorems 5.4 and 5.5 are proved under the c-log-concavity condition
(Condition 2), they hold under the stronger assumption that P ∈ ULC(λ) (Condition 1),
taking c = P (0)/P (1).

Remark 5.7. Note that (for reasons similar to those affecting the Johnstone–MacGibbon
Fisher information quantity I(P ) of Definition 3.1), Theorems 5.4 and 5.5 are restricted
to mass functions supported on the whole of Z+ . In order to consider mass functions such
as the Binomial Bn,p , it would be useful to remove this assumption.
However, one possibility is that in this case, we may wish to modify the form of the
derivative used in Definition 5.1. In the paper [52], for mass functions P supported on the
finite set {0, 1, . . . , n}, a ‘mixed derivative’
$$\nabla_n f(x) = \left(1 - \frac{x}{n}\right)\bigl(f(x+1) - f(x)\bigr) + \frac{x}{n}\bigl(f(x) - f(x-1)\bigr) \qquad (25)$$
was introduced. This interpolates between left and right derivatives, according to the po-
sition where the derivative is taken. In [52] it was shown that the Binomial Bn,p mass
function satisfies a Poincaré inequality with respect to ∇n . The proof was based on an ex-
pansion in Krawtchouk polynomials (see [91]), and exactly parallels the results of Chernoff
(Example 1.2) and Klaassen (Example 5.2). It would be of interest to prove results that
mimic Theorems 5.4 and 5.5 for the derivative ∇n of (25).
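For concreteness, here is a direct implementation of the mixed derivative (25) (my own sketch; the boundary terms need no special treatment since the relevant weights vanish at x = 0 and x = n).

```python
import numpy as np

def mixed_derivative(f, n):
    """Mixed derivative (25) of a function given as a vector f[0..n]."""
    x = np.arange(n + 1)
    fwd = np.append(f[1:], 0.0) - f          # f(x+1) - f(x); its weight vanishes at x = n
    bwd = f - np.append(0.0, f[:-1])         # f(x) - f(x-1); its weight vanishes at x = 0
    return (1.0 - x / n) * fwd + (x / n) * bwd

n = 10
f = np.arange(n + 1, dtype=float) ** 2       # f(x) = x^2 as a test function
print(mixed_derivative(f, n))                # interpolates forward/backward differences
```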

6 Monotonicity of entropy
Next we describe results concerning the monotonicity of entropy on summation and thin-
ning, in a regime described in [47] as the ‘law of thin numbers’. The first interesting feature
is that (unlike the Gaussian case of Theorem 1.6) monotonicity of entropy and of relative
entropy are not equivalent. By convex ordering arguments, Yu [104] showed that:

Theorem 6.1. Given a random variable X with probability mass function P_X and mean
λ, we write D(X) for D(P_X‖Π_λ). For independent and identically distributed $X_i$, with
mass function P and mean λ:

1. relative entropy $D\!\left(\sum_{i=1}^{n} T_{1/n} X_i\right)$ is monotone decreasing in n,

2. If P ∈ ULC(λ), the entropy $H\!\left(\sum_{i=1}^{n} T_{1/n} X_i\right)$ is monotone increasing in n.
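A numerical illustration of part 2 of Theorem 6.1 (my own sketch, reusing the thinning formula (9); SciPy assumed): for a ULC starting law, here Bin(4, 0.3), the entropy of the n-fold convolution of the (1/n)-thinned law should increase with n.

```python
import numpy as np
from scipy.stats import binom

def thin(P, alpha):
    """alpha-thinning (9) of a mass-function vector P[0..N]."""
    N = len(P) - 1
    return np.array([np.sum(binom.pmf(x, np.arange(x, N + 1), alpha) * P[x:])
                     for x in range(N + 1)])

def entropy(P):
    P = P[P > 0]
    return -np.sum(P * np.log(P))

P = binom(4, 0.3).pmf(np.arange(5))          # a ULC law with mean lambda = 1.2
for n in (1, 2, 3, 4, 5):
    Q, S = thin(P, 1.0 / n), np.ones(1)
    for _ in range(n):                        # n-fold convolution of the thinned law
        S = np.convolve(S, Q)
    print(n, entropy(S))                      # entropies should be increasing in n
```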

In fact, implicit in the work of Yu [104] is the following stronger theorem:

Theorem 6.2. Given positive $\alpha_i$ such that $\sum_{i=1}^{n+1}\alpha_i = 1$, and writing $\alpha^{(j)} = 1-\alpha_j$, then
for any independent $X_i$,
$$n\,D\!\left(\sum_{i=1}^{n+1} T_{\alpha_i} X_i\right) \le \sum_{j=1}^{n+1}\alpha^{(j)}\,D\!\left(\sum_{i\neq j} T_{\alpha_i/\alpha^{(j)}} X_i\right).$$

In [60, Theorem 3.2], the following result was proved:


Theorem 6.3. Given positive $\alpha_i$ such that $\sum_{i=1}^{n+1}\alpha_i = 1$, and writing $\alpha^{(j)} = 1-\alpha_j$, then
for any independent ULC $X_i$,
$$n\,H\!\left(\sum_{i=1}^{n+1} T_{\alpha_i} X_i\right) \ge \sum_{j=1}^{n+1}\alpha^{(j)}\,H\!\left(\sum_{i\neq j} T_{\alpha_i/\alpha^{(j)}} X_i\right).$$

Comparison with (5) shows that this is a direct analogue of the result of Artstein et
al. [8], replacing differential entropies h with discrete entropy H, and replacing scalings
by $\sqrt{\alpha}$ by thinnings by α. Since the result of Artstein et al. is a strong one, and implies
strengthened forms of the Entropy Power Inequality, this equivalence is good news for us.
However, there are two serious drawbacks.
Remark 6.4. First, the proof of Theorem 6.3, which is based on an analysis of ‘free
energy’ terms similar to the Λ(α) of Section 4, does not lend itself to easy generalization or
extension. It is possible that a different proof (perhaps using a variational characterization
similar to that of [7, 11]) may lend more insight.
Remark 6.5. Second, Theorem 6.3 does not lead to a single universally accepted discrete
Entropy Power Inequality in the way that Theorem 1.6 did. As discussed in [60], several
natural reformulations of the Entropy Power Inequality fail, and while [60, Theorem 2.5]
does prove a particular discrete Entropy Power Inequality, it is by no means the only
possible result of this kind. Indeed, a growing literature continues to discuss the right
formulation of this inequality.
For example, [83, 99, 101, 102] consider replacing integer addition by binary addition,
with the paper [102] introducing the so-called Mrs. Gerber’s Lemma, extended to a more
general group context by Anantharam and Jog [5, 53]. Harremoës and Vignat [48] (see also
[85]) consider conditions under which the original functional form of the Entropy Power
Inequality continues to hold. Other authors use ideas from additive combinatorics and
sumset theory, with Haghighatshoar, Abbe and Telatar [45] adding an explicit error term
and Wang, Woo and Madiman [98] giving a reformulation based on rearrangements. It
appears that there is considerable scope for extensions and unifications of all this theory.

7 Entropy concavity and the Shepp–Olkin conjecture


In 1981, Shepp and Olkin [86] made a conjecture regarding the entropy of Poisson-binomial
random variables. That is, for any vector p := (p_1, . . . , p_m) where 0 ≤ p_i ≤ 1, we can write
P_p for the probability mass function of the random variable S := X_1 + . . . + X_m, where
the X_i are independent Bernoulli(p_i) variables. Shepp and Olkin conjectured that the entropy
H(P_p) is a concave function of the parameters. We can simplify this conjecture by
considering the affine case, where each p_i(t) = (1 − t)p_i(0) + t p_i(1).
This result is plausible since Shepp and Olkin [86] (see also [74]) proved that for any m
the entropy H(B_{m,p}) of a binomial random variable is concave in p, and also claimed in
their paper that it was true for the sum of m independent Bernoulli random variables when
m = 2, 3. Since then, progress was limited. In [106, Theorem 2] it was proved that the
entropy is concave when for each i either pi (0) = 0 or pi (1) = 0. (In fact, this follows from
the case n = 2 of Theorem 6.3 above). Hillion [49] proved the case of the Shepp–Olkin
conjecture where all pi (t) are either constant or equal to t.
However, in recent work of Hillion and Johnson [50, 51], the Shepp–Olkin conjecture
was proved, in a result stated as [51, Theorem 1.2]:

Theorem 7.1 (Shepp-Olkin Theorem). For any m ≥ 1, the function p ↦ H(P_p) is concave.
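A numerical illustration of Theorem 7.1 (my own sketch, not the paper's proof): the Poisson-binomial mass function is built by convolving Bernoulli laws, and the entropy along an arbitrary affine path p(t) = (1 − t)p(0) + t p(1) is checked to be concave via second differences.

```python
import numpy as np

def poisson_binomial_pmf(p):
    """Mass function of X_1 + ... + X_m with X_i independent Bernoulli(p_i)."""
    P = np.ones(1)
    for pi in p:
        P = np.convolve(P, [1.0 - pi, pi])
    return P

def entropy(P):
    P = P[P > 0]
    return -np.sum(P * np.log(P))

p0 = np.array([0.1, 0.8, 0.3, 0.5])          # arbitrary endpoints of an affine path
p1 = np.array([0.6, 0.2, 0.9, 0.4])
t = np.linspace(0.0, 1.0, 201)
H = np.array([entropy(poisson_binomial_pmf((1 - s) * p0 + s * p1)) for s in t])
print(np.all(np.diff(H, 2) <= 1e-10))        # second differences <= 0: concavity
```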

We briefly summarise the strategy of the proof, and the ideas involved. In [50], Theorem
7.1 was proved in the monotone case where each $p_i' := p_i'(t)$ has the same sign. In [51],
this was extended to the general case; in fact, the monotone case of [50] is shown to be
the worst case. In order to bound the second derivative of entropy $\frac{\partial^2}{\partial t^2} H(P_{p(t)})$, we need to
control the first and second derivatives of the probability mass function $P_{p(t)}$, in the spirit
of Lemma 4.1. Indeed, we write (see [51, Equations (8), (9)]) the derivatives of the mass
function in gradient form (see (2) and (19) for comparison):
$$\frac{\partial P_{p(t)}}{\partial t}(k) = g(k-1) - g(k), \qquad (26)$$
$$\frac{\partial^2 P_{p(t)}}{\partial t^2}(k) = h(k-2) - 2h(k-1) + h(k), \qquad (27)$$
for certain functions g(k) and h(k), about which we can be completely explicit. The key
property (referred to as Condition 4 in [50]) is that for all k:

$$h(k)\bigl(f(k+1)^2 - f(k)f(k+2)\bigr) \le 2\,g(k)\,g(k+1)\,f(k+1) - g(k)^2 f(k+2) - g(k+1)^2 f(k), \qquad (28)$$
which allows us to bound the non-linear terms arising in the 2nd derivative $\frac{\partial^2}{\partial t^2} H(P_{p(t)})$.

The proof of (28) is based on a cubic inequality [50, Property(m)] satisfied by Poisson-
binomial mass functions, which relates to an iterated log-concavity result of Brändén [24].
The argument in [50] was based around the idea of optimal transport of probability
measures in the discrete setting, where it was shown that if all the p0i have the same sign
then the Shepp–Olkin path provides an optimal interpolation, in the sense of a discrete
formula of Benamou–Brenier type [15, 16] (see also [38, 41] which also consider optimal
transport of discrete measures, including analysis of discrete curvature in the sense of [28]).

8 Open problems
We briefly list some open problems, corresponding to each research direction described in
Sections 3 to 7. This is not a complete list of open problems in the area, but gives an
indication of some possible future research directions.

1. [Poisson approximation] The information theoretic approach to Poisson approximation described in Section 3 holds only in the case of independent summands. In
the original paper [65] bounds are given on the relative entropy, based on the data
processing inequality, which hold for more general dependence structures. However,
these data processing results do not give Poisson approximation bounds of optimal
order. It is an interesting problem to generalize the projection identity [65, Lemma,
p. 471] to the dependent case (perhaps resulting in an inequality), under some appro-
priate model of (negative) dependence, allowing Poisson approximation results to be
proved.
Further, although (as described above) such information theoretic approximation
results hold for Poisson and compound Poisson limit distributions, it would be of
interest to generalize this theory to a wider class of discrete distributions.

2. [Maximum entropy] While not strictly a question to do with discrete random vari-
ables, as described in Section 1 it remains an interesting problem to find classes within
which the stable laws are maximum entropy. As with the ULC class of Condition 1,
or the fixed variance class within which the Gaussian maximises entropy, we want
such a class to be well-behaved on convolution and scaling. Preliminary work in this
direction, including the fact that stable laws are not in general maximum entropy
in their own domain of normal attraction (in the sense of [39]), is described in [56],
and an alternative approach based on non-linear diffusion equations and fractional
calculus is given by Toscani [92].

3. [Concentration inequalities] As described at the end of Section 5, it would be of interest to extend these results to the case of mass functions supported on a finite set
{0, 1, . . . , n}. We would like to introduce concavity conditions mirroring Conditions
1 or 2, under which results of the form Theorem 5.4 and 5.5 can be proved for the
derivative ∇n of (25). In addition, it would be of interest to see whether a version
of Theorem 5.5 holds under a stochastic ordering assumption P ∗ ≤st P , in the spirit
of [34].

4. [Monotonicity] As in Section 6, it is of interest to develop a new and more illuminating proof of Theorem 6.3, perhaps based on transport inequalities as in [8]. It is
possible that such a proof will give some indication of the correct formulation of the
discrete Entropy Power Inequality, which is itself a key open problem.

5. [Entropy concavity] As described in [51], it is natural to conjecture that the q-Rényi [80] and q-Tsallis [93] entropies are concave functions for q sufficiently small.

Note that while they are monotone functions of each other, the q-Rényi and q-Tsallis
entropies need not be concave for the same q. We quote the following generalized
Shepp-Olkin conjecture from [51, Conjecture 4.2]:

(a) There is a critical $q_R^*$ such that the q-Rényi entropy of all Bernoulli sums is
concave for $q \le q_R^*$, and the entropy of some interpolation is convex for $q > q_R^*$.
(b) There is a critical $q_T^*$ such that the q-Tsallis entropy of all Bernoulli sums is
concave for $q \le q_T^*$, and the entropy of some interpolation is convex for $q > q_T^*$.

Indeed we conjecture that $q_R^* = 2$ and $q_T^* = 3.65986\ldots$, the root of $2 - 4q + 2^q = 0$ (a
quick numerical check of this value is sketched after this list). We
remark that the form of discrete Entropy Power Inequality proposed by [98] (based
on the theory of rearrangements) also holds for Rényi entropies.
In addition, Shepp and Olkin [86] made another conjecture regarding Bernoulli sums,
that the entropy H(Pp ) is a monotone increasing function in p if all pi ≤ 1/2, which
remains open with essentially no published results.
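As promised in item 5 above, a one-line numerical check (my own sketch; SciPy assumed) that the quoted value of $q_T^*$ is the non-trivial root of $2 - 4q + 2^q = 0$:

```python
from scipy.optimize import brentq

# Solve 2 - 4q + 2**q = 0 on an interval bracketing the non-trivial root
# (q = 1 is also a root, but lies outside the bracket used here).
q_star = brentq(lambda q: 2.0 - 4.0 * q + 2.0 ** q, 3.0, 4.0)
print(q_star)   # approximately 3.65986...
```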

Acknowledgments
The author thanks the Institute for Mathematics and Its Applications for the invitation
and funding to speak at the workshop ‘Information Theory and Concentration Phenomena’
in Minneapolis in April 2015. In addition, he would like to thank the organisers and
participants of this workshop for stimulating discussions. The author thanks Fraser Daly,
Mokshay Madiman and an anonymous referee for helpful comments on earlier drafts of
this paper.

References
[1] J. A. Adell, A. Lekuona, and Y. Yu. Sharp bounds on the entropy of the Poisson law
and related quantities. IEEE Trans. Inform. Theory, 56(5):2299–2306, May 2010.

[2] S.-i. Amari, O. E. Barndorff-Nielsen, R. E. Kass, S. L. Lauritzen, and C. R. Rao.


Differential geometry in statistical inference. Institute of Mathematical Statistics Lec-
ture Notes—Monograph Series, 10. Institute of Mathematical Statistics, Hayward,
CA, 1987.

[3] S.-i. Amari and H. Nagaoka. Methods of information geometry, volume 191 of Trans-
lations of Mathematical Monographs. American Mathematical Society, Providence,
RI, 2000.

[4] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows in metric spaces and in the
space of probability measures. Lectures in Mathematics ETH Zürich. Birkhäuser
Verlag, Basel, second edition, 2008.

[5] V. Anantharam. Counterexamples to a proposed Stam inequality on finite groups.
IEEE Trans. Inform. Theory, 56(4):1825 –1827, 2010.

[6] C. Ané, S. Blachere, D. Chafaï, P. Fougeres, I. Gentil, F. Malrieu, C. Roberto, and


G. Scheffer. Sur les inégalités de Sobolev logarithmiques. Panoramas et Syntheses,
10:217, 2000.

[7] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. On the rate of convergence in the
entropic central limit theorem. Probab. Theory Related Fields, 129(3):381–390, 2004.

[8] S. Artstein, K. M. Ball, F. Barthe, and A. Naor. Solution of Shannon’s problem on


the monotonicity of entropy. J. Amer. Math. Soc., 17(4):975–982 (electronic), 2004.

[9] D. Bakry and M. Émery. Diffusions hypercontractives. In Séminaire de probabilités,


XIX, volume 1123 of Lecture Notes in Math., pages 177–206. Springer, Berlin, 1985.

[10] D. Bakry, I. Gentil, and M. Ledoux. Analysis and geometry of Markov diffusion
operators, volume 348 of Grundlehren der mathematischen Wissenschaften. Springer,
2014.

[11] K. Ball, F. Barthe, and A. Naor. Entropy jumps in the presence of a spectral gap.
Duke Math. J., 119(1):41–63, 2003.

[12] A. Barbour, L. Holst, and S. Janson. Poisson Approximation. Clarendon Press,


Oxford, 1992.

[13] A. Barbour, O. T. Johnson, I. Kontoyiannis, and M. Madiman. Compound Poisson


approximation via local information quantities. Electronic Journal of Probability,
15:1344–1369, 2010.

[14] A. R. Barron. Entropy and the Central Limit Theorem. Ann. Probab., 14(1):336–342,
1986.

[15] J.-D. Benamou and Y. Brenier. A numerical method for the optimal time-continuous
mass transport problem and related problems. In Monge Ampère equation: appli-
cations to geometry and optimization (Deerfield Beach, FL, 1997), volume 226 of
Contemp. Math., pages 1–11. Amer. Math. Soc., Providence, RI, 1999.

[16] J.-D. Benamou and Y. Brenier. A computational fluid mechanics solution to the
Monge-Kantorovich mass transfer problem. Numer. Math., 84(3):375–393, 2000.

[17] N. M. Blachman. The convolution inequality for entropy powers. IEEE Trans.
Information Theory, 11:267–271, 1965.

[18] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Convergence to stable laws in relative


entropy. Journal of Theoretical Probability, 26(3):803–818, 2013.

[19] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Rate of convergence and Edgeworth-
type expansion in the entropic central limit theorem. Ann. Probab., 41(4):2479–2512,
2013.
[20] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Berry–Esseen bounds in the entropic
central limit theorem. Probability Theory and Related Fields, 159(3-4):435–478, 2014.
[21] S. G. Bobkov, G. P. Chistyakov, and F. Götze. Fisher information and convergence
to stable laws. Bernoulli, 20(3):1620–1646, 2014.
[22] S. G. Bobkov and M. Ledoux. On modified logarithmic Sobolev inequalities for
Bernoulli and Poisson measures. J. Funct. Anal., 156(2):347–365, 1998.
[23] A. Borovkov and S. Utev. On an inequality and a related characterisation of the
normal distribution. Theory Probab. Appl., 28(2):219–228, 1984.
[24] P. Brändén. Iterated sequences and the geometry of zeros. J. Reine Angew. Math.,
658:115–131, 2011.
[25] L. D. Brown. A proof of the Central Limit Theorem motivated by the Cramér-Rao
inequality. In G. Kallianpur, P. R. Krishnaiah, and J. K. Ghosh, editors, Statistics
and Probability: Essays in Honour of C.R. Rao, pages 141–148. North-Holland, New
York, 1982.
[26] T. Cacoullos. On upper and lower bounds for the variance of a function of a random
variable. Ann. Probab., 10(3):799–809, 1982.
[27] L. A. Caffarelli. Monotonicity properties of optimal transportation and the FKG
and related inequalities. Communications in Mathematical Physics, 214(3):547–563,
2000.
[28] P. Caputo, P. Dai Pra, and G. Posta. Convex entropy decay via the Bochner-Bakry-
Emery approach. Ann. Inst. Henri Poincaré Probab. Stat., 45(3):734–753, 2009.
[29] E. Carlen and A. Soffer. Entropy production by block variable summation and
Central Limit Theorems. Comm. Math. Phys., 140(2):339–371, 1991.
[30] E. A. Carlen and W. Gangbo. Constrained steepest descent in the 2-Wasserstein
metric. Ann. of Math. (2), 157(3):807–846, 2003.
[31] D. Chafaï. Binomial-Poisson entropic inequalities and the M/M/∞ queue. ESAIM
Probability and Statistics, 10:317–339, 2006.
[32] H. Chernoff. A note on an inequality involving the normal distribution. Ann. Probab.,
9(3):533–535, 1981.
[33] D. Cordero-Erausquin. Some applications of mass transport to Gaussian-type in-
equalities. Arch. Ration. Mech. Anal., 161(3):257–269, 2002.

[34] F. Daly. Negative dependence and stochastic orderings. arxiv:1504.06493, 2015.

[35] F. Daly and O. T. Johnson. Bounds on the Poincaré constant under negative depen-
dence. Statistics and Probability Letters, 83:511–518, 2013.

[36] A. Dembo, T. M. Cover, and J. A. Thomas. Information theoretic inequalities. IEEE


Trans. Information Theory, 37(6):1501–1518, 1991.

[37] Y. Derriennic. Entropie, théorèmes limite et marches aléatoires. In H. Heyer, editor,


Probability Measures on Groups VIII, Oberwolfach, number 1210 in Lecture Notes in
Mathematics, pages 241–284, Berlin, 1985. Springer-Verlag. In French.

[38] M. Erbar and J. Maas. Ricci curvature of finite Markov chains via convexity of the
entropy. Archive for Rational Mechanics and Analysis, 206:997–1038, 2012.

[39] B. V. Gnedenko and A. N. Kolmogorov. Limit distributions for sums of independent


random variables. Addison-Wesley, Cambridge, Mass, 1954.

[40] B. V. Gnedenko and V. Y. Korolev. Random Summation: Limit Theorems and


Applications. CRC Press, Boca Raton, Florida, 1996.

[41] N. Gozlan, C. Roberto, P.-M. Samson, and P. Tetali. Displacement convexity of


entropy and related inequalities on graphs. Probability Theory and Related Fields,
160(1–2):47–94, 2014.

[42] L. Gross. Logarithmic Sobolev inequalities. Amer. J. Math., 97(4):1061–1083, 1975.

[43] A. Guionnet and B. Zegarlinski. Lectures on logarithmic Sobolev inequalities. In


Séminaire de Probabilités, XXXVI, volume 1801 of Lecture Notes in Math., pages
1–134. Springer, Berlin, 2003.

[44] D. Guo, S. Shamai, and S. Verdú. Mutual information and minimum mean-square
error in Gaussian channels. IEEE Trans. Inform. Theory, 51(4):1261–1282, 2005.

[45] S. Haghighatshoar, E. Abbe, and I. E. Telatar. A new entropy power inequality


for integer-valued random variables. IEEE Transactions on Information Theory,
60(7):3787–3796, 2014.

[46] P. Harremoës. Binomial and Poisson distributions as maximum entropy distributions.


IEEE Trans. Information Theory, 47(5):2039–2041, 2001.

[47] P. Harremoës, O. T. Johnson, and I. Kontoyiannis. Thinning, entropy and the law
of thin numbers. IEEE Trans. Inform. Theory, 56(9):4228–4244, 2010.

[48] P. Harremoës and C. Vignat. An Entropy Power Inequality for the binomial fam-
ily. JIPAM. J. Inequal. Pure Appl. Math., 4, 2003. Issue 5, Article 93; see also
http://jipam.vu.edu.au/.

[49] E. Hillion. Concavity of entropy along binomial convolutions. Electron. Commun.
Probab., 17(4):1–9, 2012.

[50] E. Hillion and O. T. Johnson. Discrete versions of the transport equation and the
Shepp-Olkin conjecture. Annals of Probability, 44(1):276–306, 2016.

[51] E. Hillion and O. T. Johnson. A proof of the Shepp-Olkin entropy concavity conjec-
ture. Bernoulli (to appear), 2016. See also arxiv:1503.01570.

[52] E. Hillion, O. T. Johnson, and Y. Yu. A natural derivative on [0, n] and a binomial
Poincaré inequality. ESAIM Probability and Statistics, 16:703–712, 2014.

[53] V. Jog and V. Anantharam. The entropy power inequality and Mrs. Gerber’s Lemma
for groups of order 2^n. IEEE Transactions on Information Theory, 60(7):3773–3786,
2014.

[54] O. T. Johnson. Information theory and the Central Limit Theorem. Imperial College
Press, London, 2004.

[55] O. T. Johnson. Log-concavity and the maximum entropy property of the Poisson
distribution. Stoch. Proc. Appl., 117(6):791–802, 2007.

[56] O. T. Johnson. A de Bruijn identity for symmetric stable laws. In submission, see
arXiv:1310.2045, 2013.

[57] O. T. Johnson. A discrete log-Sobolev inequality under a Bakry-Émery type condi-


tion. In submission, see: arxiv:1507.06268, 2015.

[58] O. T. Johnson and A. R. Barron. Fisher information inequalities and the Central
Limit Theorem. Probability Theory and Related Fields, 129(3):391–409, 2004.

[59] O. T. Johnson, I. Kontoyiannis, and M. Madiman. Log-concavity, ultra-log-concavity,


and a maximum entropy property of discrete compound Poisson measures. Discrete
Applied Mathematics, 161:1232–1250, 2013.

[60] O. T. Johnson and Y. Yu. Monotonicity, thinning and discrete versions of the Entropy
Power Inequality. IEEE Trans. Inform. Theory, 56(11):5387–5395, 2010.

[61] I. Johnstone and B. MacGibbon. Une mesure d’information caractérisant la loi de


Poisson. In Séminaire de Probabilités, XXI, pages 563–573. Springer, Berlin, 1987.

[62] A. Kagan. A discrete version of the Stam inequality and a characterization of the
Poisson distribution. J. Statist. Plann. Inference, 92(1-2):7–12, 2001.

[63] J. F. C. Kingman. Uses of exchangeability. Ann. Probability, 6(2):183–197, 1978.

[64] C. Klaassen. On an inequality of Chernoff. Ann. Probab., 13(3):966–974, 1985.

[65] I. Kontoyiannis, P. Harremoës, and O. T. Johnson. Entropy and the law of small
numbers. IEEE Trans. Inform. Theory, 51(2):466–472, 2005.

[66] S. Kullback. A lower bound for discrimination information in terms of variation.


IEEE Trans. Information Theory, 13:126–127, 1967.

[67] C. Ley and Y. Swan. Stein’s density approach for discrete distributions and infor-
mation inequalities. See arxiv:1211.3668, 2012.

[68] E. Lieb. Proof of an entropy conjecture of Wehrl. Comm. Math. Phys., 62:35–41,
1978.

[69] T. M. Liggett. Ultra logconcave sequences and negative dependence. J. Combin.


Theory Ser. A, 79(2):315–325, 1997.

[70] Y. Linnik. An information-theoretic proof of the Central Limit Theorem with the
Lindeberg Condition. Theory Probab. Appl., 4:288–299, 1959.

[71] J. Lott and C. Villani. Ricci curvature for metric-measure spaces via optimal trans-
port. Ann. of Math. (2), 169(3):903–991, 2009.

[72] M. Madiman and A. Barron. Generalized entropy power inequalities and mono-
tonicity properties of information. IEEE Trans. Inform. Theory, 53(7):2317–2329,
2007.

[73] M. Madiman, J. Melbourne, and P. Xu. Forward and reverse Entropy Power Inequal-
ities in convex geometry. See: arxiv:1604.04225, 2016.

[74] P. Mateev. The entropy of the multinomial distribution. Teor. Verojatnost. i Prime-
nen., 23(1):196–198, 1978.

[75] N. Papadatos and V. Papathanasiou. Poisson approximation for a sum of dependent


indicators: an alternative approach. Adv. in Appl. Probab., 34(3):609–625, 2002.

[76] V. Papathanasiou. Some characteristic properties of the Fisher information matrix


via Cacoullos-type inequalities. J. Multivariate Anal., 44(2):256–265, 1993.

[77] R. Pemantle. Towards a theory of negative dependence. J. Math. Phys., 41(3):1371–


1390, 2000.

[78] C. R. Rao. On the distance between two populations. Sankhya, 9:246–248, 1948.

[79] A. Rényi. A characterization of Poisson processes. Magyar Tud. Akad. Mat. Kutató
Int. Közl., 1:519–527, 1956.

[80] A. Rényi. On measures of entropy and information. In J. Neyman, editor, Proceedings


of the 4th Berkeley Conference on Mathematical Statistics and Probability, pages 547–
561, Berkeley, 1961. University of California Press.

[81] B. Roos. Kerstan’s method for compound Poisson approximation. Ann. Probab.,
31(4):1754–1771, 2003.

[82] M. Shaked and J. G. Shanthikumar. Stochastic orders. Springer Series in Statistics.


Springer, New York, 2007.

[83] S. Shamai and A. Wyner. A binary analog to the entropy-power inequality. IEEE
Trans. Inform. Theory, 36(6):1428–1430, Nov 1990.

[84] C. E. Shannon. A mathematical theory of communication. Bell System Tech. J.,


27:379–423, 623–656, 1948.

[85] N. Sharma, S. Das, and S. Muthukrishnan. Entropy power inequality for a family of
discrete random variables. In 2011 IEEE International Symposium on Information
Theory Proceedings (ISIT), pages 1945–1949. IEEE, 2011.

[86] L. A. Shepp and I. Olkin. Entropy of the sum of independent Bernoulli random
variables and of the multinomial distribution. In Contributions to probability, pages
201–206. Academic Press, New York, 1981.

[87] R. Shimizu. On Fisher’s amount of information for location family. In Patil, G. P.


and Kotz, S. and Ord, J. K., editor, A Modern Course on Statistical Distributions
in Scientific Work, Volume 3, pages 305–312. Reidel, 1975.

[88] A. J. Stam. Some inequalities satisfied by the quantities of information of Fisher and
Shannon. Information and Control, 2:101–112, 1959.

[89] K.-T. Sturm. On the geometry of metric measure spaces. I. Acta Math., 196(1):65–
131, 2006.

[90] K.-T. Sturm. On the geometry of metric measure spaces. II. Acta Math., 196(1):133–
177, 2006.

[91] G. Szegő. Orthogonal Polynomials. American Mathematical Society, New York,


revised edition, 1958.

[92] G. Toscani. The fractional Fisher information and the central limit theorem for stable
laws. Ricerche di Matematica (online first), doi:10.1007/s11587-015-0253-9, 1-21,
2015.

[93] C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statis-


tical Physics, 52:479–487, 1988.

[94] A. Tulino and S. Verdú. Monotonic decrease of the non-Gaussianness of the sum
of independent random variables: a simple proof. IEEE Trans. Inform. Theory,
52(9):4295–4297, 2006.

[95] C. Villani. Topics in optimal transportation, volume 58 of Graduate Studies in Math-
ematics. American Mathematical Society, Providence, RI, 2003.

[96] C. Villani. Optimal transport: Old and New, volume 338 of Grundlehren der Mathe-
matischen Wissenschaften. Springer-Verlag, Berlin, 2009.

[97] D. W. Walkup. Pólya sequences, binomial convolution and the union of random sets.
J. Appl. Probability, 13(1):76–85, 1976.

[98] L. Wang, J. O. Woo, and M. Madiman. A lower bound on the Rényi entropy of
convolutions in the integers. In 2014 IEEE International Symposium on Information
Theory (ISIT), pages 2829–2833. IEEE, 2014.

[99] H. S. Witsenhausen. Some aspects of convexity useful in information theory. IEEE


Trans. Inform. Theory, 26(3):265–271, 1980.

[100] L. Wu. A new modified logarithmic Sobolev inequality for Poisson point processes
and several applications. Probab. Theory Related Fields, 118(3):427–438, 2000.

[101] A. D. Wyner. A theorem on the entropy of certain binary sequences and applications.
II. IEEE Trans. Information Theory, 19(6):772–777, 1973.

[102] A. D. Wyner and J. Ziv. A theorem on the entropy of certain binary sequences and
applications. I. IEEE Trans. Information Theory, 19(6):769–772, 1973.

[103] Y. Yu. On the maximum entropy properties of the binomial distribution. IEEE
Trans. Inform. Theory, 54(7):3351–3353, July 2008.

[104] Y. Yu. Monotonic convergence in an information-theoretic law of small numbers.


IEEE Trans. Inform. Theory, 55(12):5412–5422, 2009.

[105] Y. Yu. On the entropy of compound distributions on nonnegative integers. IEEE


Trans. Inform. Theory, 55(8):3645–3650, 2009.

[106] Y. Yu and O. T. Johnson. Concavity of entropy under thinning. In Proceedings of


the 2009 IEEE International Symposium on Information Theory, 28th June - 3rd
July 2009, Seoul, pages 144–148, 2009.

