Lecture Notes 2
Probability Inequalities
Inequalities are useful for bounding quantities that might otherwise be hard to compute.
They will also be used in the theory of convergence.
Theorem 1 (The Gaussian Tail Inequality) Let $X \sim N(0,1)$. Then
$$
P(|X| > \epsilon) \leq \frac{2 e^{-\epsilon^2/2}}{\epsilon}.
$$
If $X_1, \ldots, X_n \sim N(0,1)$ then
$$
P(|\overline{X}_n| > \epsilon) \leq \frac{1}{\sqrt{n}\,\epsilon}\, e^{-n\epsilon^2/2}.
$$
Proof. The density of $X$ is $\phi(x) = (2\pi)^{-1/2} e^{-x^2/2}$. Since $\phi'(s) = -s\,\phi(s)$, we have, for $\epsilon > 0$,
$$
P(X > \epsilon) = \int_\epsilon^\infty \phi(s)\, ds
\leq \int_\epsilon^\infty \frac{s}{\epsilon}\, \phi(s)\, ds
= -\frac{1}{\epsilon}\int_\epsilon^\infty \phi'(s)\, ds
= \frac{\phi(\epsilon)}{\epsilon}
= \frac{e^{-\epsilon^2/2}}{\sqrt{2\pi}\,\epsilon}.
$$
By symmetry,
$$
P(|X| > \epsilon) \leq \frac{2\phi(\epsilon)}{\epsilon} \leq \frac{2 e^{-\epsilon^2/2}}{\epsilon}.
$$
Now let $X_1, \ldots, X_n \sim N(0,1)$. Then $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i \sim N(0, 1/n)$. Thus $\overline{X}_n \stackrel{d}{=} n^{-1/2} Z$ where $Z \sim N(0,1)$, and
$$
P(|\overline{X}_n| > \epsilon) = P(n^{-1/2}|Z| > \epsilon) = P(|Z| > \sqrt{n}\,\epsilon)
\leq \frac{2\phi(\sqrt{n}\,\epsilon)}{\sqrt{n}\,\epsilon}
= \sqrt{\frac{2}{\pi}}\; \frac{e^{-n\epsilon^2/2}}{\sqrt{n}\,\epsilon}
\leq \frac{1}{\sqrt{n}\,\epsilon}\, e^{-n\epsilon^2/2},
$$
since $\sqrt{2/\pi} < 1$.
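As a quick numerical illustration (not part of the notes' derivation; it assumes Python with numpy and scipy is available), the sketch below compares the exact Gaussian tail probability with the bounds of Theorem 1.

```python
import numpy as np
from scipy.stats import norm

# Compare P(|X| > eps) for X ~ N(0,1) with the bound 2*exp(-eps^2/2)/eps.
for eps in [0.5, 1.0, 2.0, 3.0]:
    exact = 2 * norm.sf(eps)                   # exact two-sided tail
    bound = 2 * np.exp(-eps**2 / 2) / eps      # Gaussian tail inequality
    print(f"eps={eps:.1f}  exact={exact:.5f}  bound={bound:.5f}")

# Sample-mean version: P(|Xbar_n| > eps) <= exp(-n eps^2/2) / (sqrt(n) eps).
n, eps = 100, 0.3
exact = 2 * norm.sf(np.sqrt(n) * eps)          # Xbar_n ~ N(0, 1/n)
bound = np.exp(-n * eps**2 / 2) / (np.sqrt(n) * eps)
print(f"n={n}, eps={eps}: exact={exact:.2e}  bound={bound:.2e}")
```

The bound is loose (even trivial) for small $\epsilon$, but it captures the exponential decay of the tail as $\epsilon$ or $n$ grows.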
Theorem 2 (Markov's inequality) Let $X$ be a non-negative random variable and suppose that $E(X)$ exists. For any $t > 0$,
$$
P(X > t) \leq \frac{E(X)}{t}. \qquad (1)
$$
Proof. Since $X > 0$,
$$
E(X) = \int_0^\infty x\, p(x)\, dx = \int_0^t x\, p(x)\, dx + \int_t^\infty x\, p(x)\, dx
\geq \int_t^\infty x\, p(x)\, dx \geq t \int_t^\infty p(x)\, dx = t\, P(X > t).
$$
Theorem 3 (Chebyshev's inequality) Let $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$. Then,
$$
P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}
\quad \text{and} \quad
P(|Z| \geq k) \leq \frac{1}{k^2} \qquad (2)
$$
where $Z = (X - \mu)/\sigma$. In particular, $P(|Z| > 2) \leq 1/4$ and $P(|Z| > 3) \leq 1/9$.
Proof. We use Markov's inequality to conclude that
$$
P(|X - \mu| \geq t) = P(|X - \mu|^2 \geq t^2) \leq \frac{E(X - \mu)^2}{t^2} = \frac{\sigma^2}{t^2}.
$$
The second part follows by setting $t = k\sigma$.
If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $\overline{X}_n = n^{-1}\sum_{i=1}^n X_i$, then $\mathrm{Var}(\overline{X}_n) = \mathrm{Var}(X_1)/n = p(1-p)/n$ and
$$
P(|\overline{X}_n - p| > \epsilon) \leq \frac{\mathrm{Var}(\overline{X}_n)}{\epsilon^2}
= \frac{p(1-p)}{n\epsilon^2} \leq \frac{1}{4 n \epsilon^2}
$$
since $p(1-p) \leq \frac{1}{4}$ for all $p$.
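The following short simulation (an illustration, not from the notes; it assumes numpy) estimates $P(|\overline{X}_n - p| > \epsilon)$ for Bernoulli data and compares it with the Chebyshev bound above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps, reps = 200, 0.3, 0.1, 20000

# Empirical frequency of a deviation larger than eps.
xbar = rng.binomial(n, p, size=reps) / n
freq = np.mean(np.abs(xbar - p) > eps)

cheby = p * (1 - p) / (n * eps**2)   # Chebyshev with the true variance
worst = 1 / (4 * n * eps**2)         # bound using p(1-p) <= 1/4
print(f"empirical={freq:.4f}  Chebyshev={cheby:.4f}  worst-case={worst:.4f}")
```

The empirical tail probability is far smaller than the Chebyshev bound, which motivates the sharper exponential bounds below.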
Hoeffding's Inequality
Hoeffding's inequality is similar in spirit to Markov's inequality, but it is sharper.
We begin with the following important result.
Lemma 4 Suppose that $E(X) = 0$ and that $a \leq X \leq b$. Then
$$
E(e^{tX}) \leq e^{t^2 (b-a)^2/8}.
$$
Recall that a function $g$ is convex if for each $x, y$ and each $\alpha \in [0,1]$,
$$
g(\alpha x + (1-\alpha) y) \leq \alpha g(x) + (1-\alpha) g(y).
$$
Proof. Since $a \leq X \leq b$, we can write $X$ as a convex combination of $a$ and $b$, namely, $X = \alpha b + (1-\alpha) a$ where $\alpha = (X-a)/(b-a)$. By the convexity of the function $y \mapsto e^{ty}$ we have
$$
e^{tX} \leq \alpha e^{tb} + (1-\alpha) e^{ta} = \frac{X-a}{b-a}\, e^{tb} + \frac{b-X}{b-a}\, e^{ta}.
$$
Take expectations of both sides and use the fact that $E(X) = 0$ to get
$$
E(e^{tX}) \leq \frac{-a}{b-a}\, e^{tb} + \frac{b}{b-a}\, e^{ta} = e^{g(u)} \qquad (3)
$$
where $u = t(b-a)$, $g(u) = -\gamma u + \log(1 - \gamma + \gamma e^u)$ and $\gamma = -a/(b-a)$. Note that $g(0) = g'(0) = 0$. Also, $g''(u) \leq 1/4$ for all $u > 0$. By Taylor's theorem, there is a $\xi \in (0, u)$ such that
$$
g(u) = g(0) + u g'(0) + \frac{u^2}{2}\, g''(\xi) = \frac{u^2}{2}\, g''(\xi) \leq \frac{u^2}{8} = \frac{t^2 (b-a)^2}{8}.
$$
Hence, $E(e^{tX}) \leq e^{g(u)} \leq e^{t^2(b-a)^2/8}$.
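A small numerical check of Lemma 4 (not part of the proof; the two-point distribution below is an assumption chosen just so that $E(X)=0$):

```python
import numpy as np

# Check E(e^{tX}) <= e^{t^2 (b-a)^2 / 8} for a centered, bounded X.
# X takes values in {a, b} = {-1, 2} with P(X=a)=b/(b-a), P(X=b)=-a/(b-a),
# which forces E(X) = 0.
a, b = -1.0, 2.0
vals = np.array([a, b])
probs = np.array([b, -a]) / (b - a)
assert abs(np.dot(vals, probs)) < 1e-12   # mean zero

for t in [0.1, 0.5, 1.0, 2.0]:
    mgf = np.dot(probs, np.exp(t * vals))          # E(e^{tX})
    bound = np.exp(t**2 * (b - a)**2 / 8)          # Hoeffding's lemma bound
    print(f"t={t:.1f}  E(e^(tX))={mgf:.4f}  bound={bound:.4f}")
```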
Next, we need to use Chernoff's method.

Lemma 5 Let $X$ be a random variable. Then
$$
P(X > \epsilon) \leq \inf_{t \geq 0} e^{-t\epsilon}\, E(e^{tX}).
$$
Proof. For any $t > 0$,
$$
P(X > \epsilon) = P(e^X > e^\epsilon) = P(e^{tX} > e^{t\epsilon}) \leq e^{-t\epsilon}\, E(e^{tX}).
$$
Since this is true for every $t \geq 0$, the result follows.
Theorem 6 (Hoeffding's Inequality) Let $Y_1, \ldots, Y_n$ be iid observations such that $E(Y_i) = \mu$ and $a \leq Y_i \leq b$. Then, for any $\epsilon > 0$,
$$
P\left(|\overline{Y}_n - \mu| \geq \epsilon\right) \leq 2 e^{-2 n \epsilon^2/(b-a)^2}. \qquad (4)
$$
Corollary 7 If $X_1, X_2, \ldots, X_n$ are independent with $P(a \leq X_i \leq b) = 1$ and common mean $\mu$, then, with probability at least $1 - \delta$,
$$
|\overline{X}_n - \mu| \leq \sqrt{\frac{c}{2n} \log\left(\frac{2}{\delta}\right)} \qquad (5)
$$
where $c = (b-a)^2$.
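Corollary 7 yields a finite-sample confidence interval. Here is a minimal sketch (assuming numpy and data bounded in $[0,1]$; the helper name `hoeffding_halfwidth` is just for illustration):

```python
import numpy as np

def hoeffding_halfwidth(n, delta, a=0.0, b=1.0):
    """Half-width of the level (1 - delta) interval from Corollary 7."""
    c = (b - a)**2
    return np.sqrt(c / (2 * n) * np.log(2 / delta))

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=500)      # any data bounded in [0, 1]
delta = 0.05
h = hoeffding_halfwidth(len(x), delta)
print(f"mean estimate {x.mean():.3f} +/- {h:.3f}  (holds w.p. >= {1 - delta})")
```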
Proof. Without loss of generality, we assume that $\mu = 0$. First we have
$$
P(|\overline{Y}_n| \geq \epsilon) = P(\overline{Y}_n \geq \epsilon) + P(\overline{Y}_n \leq -\epsilon)
= P(\overline{Y}_n \geq \epsilon) + P(-\overline{Y}_n \geq \epsilon).
$$
Next we use Chernoff's method. For any $t > 0$, we have, from Markov's inequality, that
$$
P(\overline{Y}_n \geq \epsilon) = P\left( \sum_{i=1}^n Y_i \geq n\epsilon \right)
= P\left( e^{t \sum_{i=1}^n Y_i} \geq e^{tn\epsilon} \right)
\leq e^{-tn\epsilon}\, E\left( e^{t \sum_{i=1}^n Y_i} \right)
= e^{-tn\epsilon} \prod_{i} E(e^{tY_i}) = e^{-tn\epsilon} \left( E(e^{tY_1}) \right)^n.
$$
From Lemma 4, $E(e^{tY_i}) \leq e^{t^2(b-a)^2/8}$. So
$$
P(\overline{Y}_n \geq \epsilon) \leq e^{-tn\epsilon}\, e^{t^2 n (b-a)^2/8}.
$$
This is minimized by setting $t = 4\epsilon/(b-a)^2$, giving
$$
P(\overline{Y}_n \geq \epsilon) \leq e^{-2n\epsilon^2/(b-a)^2}.
$$
Applying the same argument to $P(-\overline{Y}_n \geq \epsilon)$ yields the result.
Example 8 Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. From Hoeffding's inequality,
$$
P(|\overline{X}_n - p| > \epsilon) \leq 2 e^{-2n\epsilon^2}.
$$
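To see how much sharper this is than Chebyshev, the sketch below (illustrative only, assuming numpy) tabulates both bounds for the Bernoulli sample mean as $n$ grows:

```python
import numpy as np

# Tail bounds for |Xbar_n - p| with eps = 0.1.
eps = 0.1
for n in [50, 100, 500, 1000]:
    cheby = 1 / (4 * n * eps**2)           # Chebyshev (worst-case variance)
    hoeff = 2 * np.exp(-2 * n * eps**2)    # Hoeffding
    print(f"n={n:5d}  Chebyshev={cheby:.4f}  Hoeffding={hoeff:.6f}")
```

Chebyshev decays like $1/n$ while Hoeffding decays exponentially in $n$.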
The Bounded Difference Inequality
So far we have focused on sums of random variables. The following result extends Hoeffding's inequality to more general functions $g(x_1, \ldots, x_n)$. Here we consider McDiarmid's inequality, also known as the Bounded Difference inequality.
Theorem 9 (McDiarmid) Let $X_1, \ldots, X_n$ be independent random variables. Suppose that
$$
\sup_{x_1, \ldots, x_n, x_i'} \left| g(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - g(x_1, \ldots, x_{i-1}, x_i', x_{i+1}, \ldots, x_n) \right| \leq c_i \qquad (6)
$$
for $i = 1, \ldots, n$. Then
$$
P\Big( g(X_1, \ldots, X_n) - E\big(g(X_1, \ldots, X_n)\big) \geq \epsilon \Big)
\leq \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2} \right). \qquad (7)
$$
Proof. Let $V_i = E(g \mid X_1, \ldots, X_i) - E(g \mid X_1, \ldots, X_{i-1})$. Then
$$
g(X_1, \ldots, X_n) - E\big(g(X_1, \ldots, X_n)\big) = \sum_{i=1}^n V_i
$$
and $E(V_i \mid X_1, \ldots, X_{i-1}) = 0$. Using a similar argument as in Hoeffding's Lemma we have
$$
E(e^{tV_i} \mid X_1, \ldots, X_{i-1}) \leq e^{t^2 c_i^2/8}. \qquad (8)
$$
Now, for any $t > 0$,
$$
P\Big( g(X_1, \ldots, X_n) - E\big(g(X_1, \ldots, X_n)\big) \geq \epsilon \Big)
= P\left( \sum_{i=1}^n V_i \geq \epsilon \right)
= P\left( e^{t \sum_{i=1}^n V_i} \geq e^{t\epsilon} \right)
\leq e^{-t\epsilon}\, E\left( e^{t \sum_{i=1}^n V_i} \right)
$$
$$
= e^{-t\epsilon}\, E\left( e^{t \sum_{i=1}^{n-1} V_i}\, E\left( e^{t V_n} \,\Big|\, X_1, \ldots, X_{n-1} \right) \right)
\leq e^{-t\epsilon}\, e^{t^2 c_n^2/8}\, E\left( e^{t \sum_{i=1}^{n-1} V_i} \right)
\leq \cdots \leq e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^n c_i^2/8}.
$$
The result follows by taking $t = 4\epsilon / \sum_{i=1}^n c_i^2$.
Example 10 If we take $g(x_1, \ldots, x_n) = n^{-1} \sum_{i=1}^n x_i$ then we get back Hoeffding's inequality.
Example 11 Suppose we throw $m$ balls into $n$ bins. What fraction of bins are empty? Let $Z$ be the number of empty bins and let $F = Z/n$ be the fraction of empty bins. We can write $Z = \sum_{i=1}^n Z_i$ where $Z_i = 1$ if bin $i$ is empty and $Z_i = 0$ otherwise. Then
$$
\mu = E(Z) = \sum_{i=1}^n E(Z_i) = n(1 - 1/n)^m = n e^{m \log(1 - 1/n)} \approx n e^{-m/n}
$$
and $\theta = E(F) = \mu/n \approx e^{-m/n}$. How close is $Z$ to $\mu$? Note that the $Z_i$'s are not independent, so we cannot just apply Hoeffding. Instead, we proceed as follows.
Define variables $X_1, \ldots, X_m$ where $X_s = i$ if ball $s$ falls into bin $i$. Then $Z = g(X_1, \ldots, X_m)$. If we move one ball into a different bin, then $Z$ can change by at most 1. Hence, (6) holds with $c_i = 1$ and so
$$
P(|Z - \mu| > t) \leq 2 e^{-2t^2/m}.
$$
Recall that the fraction of empty bins is $F = Z/n$ with mean $\theta = \mu/n$. We have
$$
P(|F - \theta| > t) = P(|Z - \mu| > nt) \leq 2 e^{-2n^2 t^2/m}.
$$
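The balls-in-bins example is easy to simulate. The sketch below (not from the notes; it assumes numpy, and the particular values of $n$, $m$ and $t$ are arbitrary choices) compares the empirical deviation probability of $Z$ with the McDiarmid bound:

```python
import numpy as np

rng = np.random.default_rng(2)
n_bins, m_balls, reps = 100, 50, 5000

# Z = number of empty bins after throwing m balls uniformly into n bins.
Z = np.empty(reps)
for r in range(reps):
    bins = rng.integers(0, n_bins, size=m_balls)   # X_s = bin of ball s
    Z[r] = n_bins - np.unique(bins).size

mu = n_bins * (1 - 1 / n_bins)**m_balls            # exact E(Z)
t = 10.0
freq = np.mean(np.abs(Z - mu) > t)
bound = 2 * np.exp(-2 * t**2 / m_balls)            # McDiarmid with c_i = 1
print(f"E(Z)={mu:.2f}  empirical P(|Z-mu|>{t})={freq:.4f}  bound={bound:.4f}")
```

The bound holds but is conservative; its value is that it requires only the bounded-difference property, not independence of the $Z_i$'s.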
Bounds on Expected Values
Theorem 12 (Cauchy-Schwarz inequality) If $X$ and $Y$ have finite variances then
$$
E|XY| \leq \sqrt{E(X^2)\, E(Y^2)}. \qquad (9)
$$
The Cauchy-Schwarz inequality can be written as
$$
\mathrm{Cov}^2(X, Y) \leq \sigma_X^2\, \sigma_Y^2.
$$
Recall that a function $g$ is convex if for each $x, y$ and each $\alpha \in [0,1]$,
$$
g(\alpha x + (1-\alpha) y) \leq \alpha g(x) + (1-\alpha) g(y).
$$
If $g$ is twice differentiable and $g''(x) \geq 0$ for all $x$, then $g$ is convex. It can be shown that if $g$ is convex, then $g$ lies above any line that touches $g$ at some point, called a tangent line. A function $g$ is concave if $-g$ is convex. Examples of convex functions are $g(x) = x^2$ and $g(x) = e^x$. Examples of concave functions are $g(x) = -x^2$ and $g(x) = \log x$.
Theorem 13 (Jensen's inequality) If $g$ is convex, then
$$
E g(X) \geq g(E X). \qquad (10)
$$
If $g$ is concave, then
$$
E g(X) \leq g(E X). \qquad (11)
$$
Proof. Let $L(x) = a + bx$ be a line, tangent to $g(x)$ at the point $E(X)$. Since $g$ is convex, it lies above the line $L(x)$. So,
$$
E g(X) \geq E L(X) = E(a + bX) = a + b\,E(X) = L(E(X)) = g(E X).
$$
Example 14 From Jensen's inequality we see that $E(X^2) \geq (E X)^2$.
Example 15 (Kullback-Leibler Distance) Define the Kullback-Leibler distance between two densities $p$ and $q$ by
$$
D(p, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx.
$$
Note that $D(p, p) = 0$. We will use Jensen to show that $D(p, q) \geq 0$. Let $X \sim p$. Then
$$
-D(p, q) = E \log \frac{q(X)}{p(X)}
\leq \log E\left( \frac{q(X)}{p(X)} \right)
= \log \int p(x)\, \frac{q(x)}{p(x)}\, dx = \log \int q(x)\, dx = \log(1) = 0.
$$
So, $-D(p, q) \leq 0$ and hence $D(p, q) \geq 0$.
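A quick numerical check (illustrative only, assuming numpy; the discrete distributions and the helper name `kl` are made up for this example, whereas the notes state the result for densities):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler distance D(p, q) = sum_x p(x) log(p(x)/q(x))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, q), kl(q, p), kl(p, p))   # both >= 0, and D(p, p) = 0
```

Note that $D(p, q) \neq D(q, p)$ in general, so the "distance" is not symmetric.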
Example 16 It follows from Jensen's inequality that three types of means can be ordered. Assume that $a_1, \ldots, a_n$ are positive numbers and define the arithmetic, geometric and harmonic means as
$$
a_A = \frac{1}{n}(a_1 + \cdots + a_n), \qquad
a_G = (a_1 \cdots a_n)^{1/n}, \qquad
a_H = \frac{1}{\frac{1}{n}\left( \frac{1}{a_1} + \cdots + \frac{1}{a_n} \right)}.
$$
Then $a_H \leq a_G \leq a_A$.
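A one-line check of this ordering on arbitrary positive numbers (a sketch, assuming numpy; the sample values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.uniform(0.1, 10.0, size=8)            # positive numbers

a_A = a.mean()                                 # arithmetic mean
a_G = np.exp(np.mean(np.log(a)))               # geometric mean
a_H = 1.0 / np.mean(1.0 / a)                   # harmonic mean
print(f"harmonic={a_H:.3f} <= geometric={a_G:.3f} <= arithmetic={a_A:.3f}")
assert a_H <= a_G <= a_A
```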
Suppose we have an exponential bound on $P(X_n > \epsilon)$. In that case we can bound $E(X_n)$ as follows.

Theorem 17 Suppose that $X_n \geq 0$ and that for every $\epsilon > 0$,
$$
P(X_n > \epsilon) \leq c_1 e^{-c_2 n \epsilon^2} \qquad (12)
$$
for some $c_2 > 0$ and $c_1 > 1/e$. Then,
$$
E(X_n) \leq \sqrt{\frac{C}{n}} \qquad (13)
$$
where $C = (1 + \log(c_1))/c_2$.
Proof. Recall that for any nonnegative random variable $Y$, $E(Y) = \int_0^\infty P(Y \geq t)\, dt$. Hence, for any $a > 0$,
$$
E(X_n^2) = \int_0^\infty P(X_n^2 \geq t)\, dt
= \int_0^a P(X_n^2 \geq t)\, dt + \int_a^\infty P(X_n^2 \geq t)\, dt
\leq a + \int_a^\infty P(X_n^2 \geq t)\, dt.
$$
Equation (12) implies that $P(X_n^2 > t) \leq c_1 e^{-c_2 n t}$. Hence,
$$
E(X_n^2) \leq a + \int_a^\infty P(X_n^2 \geq t)\, dt
\leq a + c_1 \int_a^\infty e^{-c_2 n t}\, dt = a + \frac{c_1 e^{-c_2 n a}}{c_2 n}.
$$
Set $a = \log(c_1)/(n c_2)$ and conclude that
$$
E(X_n^2) \leq \frac{\log(c_1)}{n c_2} + \frac{1}{n c_2} = \frac{1 + \log(c_1)}{n c_2}.
$$
Finally, we have
$$
E(X_n) \leq \sqrt{E(X_n^2)} \leq \sqrt{\frac{1 + \log(c_1)}{n c_2}}.
$$
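For instance (a sketch under the assumption of Bernoulli data, not in the notes), Hoeffding gives $P(|\widehat{p}_n - p| > \epsilon) \leq 2 e^{-2n\epsilon^2}$, which is (12) with $c_1 = 2$ and $c_2 = 2$, so Theorem 17 gives $E|\widehat{p}_n - p| \leq \sqrt{(1 + \log 2)/(2n)}$. The simulation below compares this with the simulated mean deviation (assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(4)
p, reps = 0.3, 20000
c1, c2 = 2.0, 2.0    # from Hoeffding: P(|phat - p| > eps) <= 2 exp(-2 n eps^2)

for n in [50, 200, 1000]:
    phat = rng.binomial(n, p, size=reps) / n
    mean_dev = np.mean(np.abs(phat - p))               # simulated E|phat - p|
    bound = np.sqrt((1 + np.log(c1)) / (c2 * n))       # Theorem 17
    print(f"n={n:5d}  E|phat-p| ~ {mean_dev:.4f}  bound={bound:.4f}")
```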
Now we consider bounding the maximum of a set of random variables.
Theorem 18 Let $X_1, \ldots, X_n$ be random variables. Suppose there exists $\sigma > 0$ such that $E(e^{t X_i}) \leq e^{t^2 \sigma^2/2}$ for all $t > 0$. Then
$$
E\left( \max_{1 \leq i \leq n} X_i \right) \leq \sigma \sqrt{2 \log n}. \qquad (14)
$$
Proof. By Jensen's inequality,
$$
\exp\left\{ t\, E\left( \max_{1 \leq i \leq n} X_i \right) \right\}
\leq E \exp\left\{ t \max_{1 \leq i \leq n} X_i \right\}
= E\left( \max_{1 \leq i \leq n} \exp\{ t X_i \} \right)
\leq \sum_{i=1}^n E\left( \exp\{ t X_i \} \right) \leq n\, e^{t^2 \sigma^2/2}.
$$
Thus, taking logarithms and dividing by $t$,
$$
E\left( \max_{1 \leq i \leq n} X_i \right) \leq \frac{\log n}{t} + \frac{t \sigma^2}{2}.
$$
The result follows by setting $t = \sqrt{2 \log n}/\sigma$.
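Standard normal variables satisfy $E(e^{tX_i}) = e^{t^2/2}$, so Theorem 18 applies with $\sigma = 1$. The sketch below (illustrative, assuming numpy) compares the simulated $E(\max_i X_i)$ with $\sqrt{2\log n}$:

```python
import numpy as np

rng = np.random.default_rng(5)
reps = 5000

# For X_1,...,X_n ~ N(0,1), E(e^{tX_i}) = e^{t^2/2}, so sigma = 1 in Theorem 18.
for n in [10, 100, 1000]:
    max_sim = rng.standard_normal((reps, n)).max(axis=1).mean()
    bound = np.sqrt(2 * np.log(n))
    print(f"n={n:5d}  E(max X_i) ~ {max_sim:.3f}  bound={bound:.3f}")
```

The bound captures the correct $\sqrt{\log n}$ growth of the expected maximum.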
$O_P$ and $o_P$
In statistics, probability and machine learning, we make use of $o_P$ and $O_P$ notation.

Recall first that $a_n = o(1)$ means that $a_n \to 0$ as $n \to \infty$; $a_n = o(b_n)$ means that $a_n/b_n = o(1)$.

$a_n = O(1)$ means that $a_n$ is eventually bounded, that is, for all large $n$, $|a_n| \leq C$ for some $C > 0$; $a_n = O(b_n)$ means that $a_n/b_n = O(1)$.
We write $a_n \asymp b_n$ if both $a_n/b_n$ and $b_n/a_n$ are eventually bounded. In computer science this is written as $a_n = \Theta(b_n)$, but we prefer using $a_n \asymp b_n$ since, in statistics, $\Theta$ often denotes a parameter space.

Now we move on to the probabilistic versions. Say that $Y_n = o_P(1)$ if, for every $\epsilon > 0$,
$$
P(|Y_n| > \epsilon) \to 0.
$$
Say that $Y_n = o_P(a_n)$ if $Y_n/a_n = o_P(1)$.

Say that $Y_n = O_P(1)$ if, for every $\epsilon > 0$, there is a $C > 0$ such that
$$
P(|Y_n| > C) \leq \epsilon.
$$
Say that $Y_n = O_P(a_n)$ if $Y_n/a_n = O_P(1)$.
Let's use Hoeffding's inequality to show that sample proportions are within $O_P(1/\sqrt{n})$ of the true mean. Let $Y_1, \ldots, Y_n$ be coin flips, i.e. $Y_i \in \{0, 1\}$. Let $p = P(Y_i = 1)$. Let
$$
\widehat{p}_n = \frac{1}{n} \sum_{i=1}^n Y_i.
$$
We will show that $\widehat{p}_n - p = o_P(1)$ and $\widehat{p}_n - p = O_P(1/\sqrt{n})$.

We have that
$$
P(|\widehat{p}_n - p| > \epsilon) \leq 2 e^{-2n\epsilon^2} \to 0
$$
and so $\widehat{p}_n - p = o_P(1)$. Also,
$$
P\left( \sqrt{n}\, |\widehat{p}_n - p| > C \right)
= P\left( |\widehat{p}_n - p| > \frac{C}{\sqrt{n}} \right)
\leq 2 e^{-2C^2} < \epsilon
$$
if we pick $C$ large enough. Hence, $\sqrt{n}(\widehat{p}_n - p) = O_P(1)$ and so
$$
\widehat{p}_n - p = O_P\left( \frac{1}{\sqrt{n}} \right).
$$
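The $O_P(1/\sqrt{n})$ statement says that the rescaled deviation $\sqrt{n}\,|\widehat{p}_n - p|$ stays stochastically bounded as $n$ grows. A short simulation (illustrative only, assuming numpy) shows this stabilization:

```python
import numpy as np

rng = np.random.default_rng(6)
p, reps = 0.4, 20000

# sqrt(n) * (phat_n - p) should remain stochastically bounded as n grows.
for n in [100, 1000, 10000]:
    phat = rng.binomial(n, p, size=reps) / n
    scaled = np.sqrt(n) * np.abs(phat - p)
    print(f"n={n:6d}  P(sqrt(n)|phat-p| > 1.5) ~ {np.mean(scaled > 1.5):.4f}")
```

The exceedance probability stays roughly constant in $n$, as the $O_P(1)$ claim predicts.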
Now consider $m$ coins with probabilities $p_1, \ldots, p_m$. Then
$$
P\left( \max_j |\widehat{p}_j - p_j| > \epsilon \right)
\leq \sum_{j=1}^m P\left( |\widehat{p}_j - p_j| > \epsilon \right) \qquad \text{(union bound)}
$$
$$
\leq \sum_{j=1}^m 2 e^{-2n\epsilon^2} \qquad \text{(Hoeffding)}
$$
$$
= 2 m e^{-2n\epsilon^2} = 2 \exp\left( -(2n\epsilon^2 - \log m) \right).
$$
Suppose that $m \leq e^{n^\gamma}$ where $0 \leq \gamma < 1$. Then
$$
P\left( \max_j |\widehat{p}_j - p_j| > \epsilon \right) \leq 2 \exp\left( -(2n\epsilon^2 - n^\gamma) \right) \to 0.
$$
Hence,
$$
\max_j |\widehat{p}_j - p_j| = o_P(1).
$$
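The union-bound argument can also be checked by simulation. The sketch below (not from the notes; it assumes numpy, and the choice $m = n$ is just one example of subexponential growth) tracks the maximum deviation over $m$ coins together with the union/Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(7)
eps = 0.1

# m coins, n tosses each; m grows with n but far slower than e^n.
for n in [100, 500, 2000]:
    m = n                                           # illustrative choice of m
    p = rng.uniform(0.2, 0.8, size=m)               # true probabilities p_1,...,p_m
    phat = rng.binomial(n, p) / n                   # one phat_j per coin
    max_dev = np.max(np.abs(phat - p))
    bound = 2 * m * np.exp(-2 * n * eps**2)         # union bound + Hoeffding
    print(f"n={n:5d} m={m:5d}  max_j|phat_j-p_j|={max_dev:.3f}  "
          f"P(max > {eps}) <= {min(bound, 1):.4f}")
```

Even though the number of coins grows, the maximum deviation shrinks, because the bound pays only a $\log m$ penalty against the $2n\epsilon^2$ exponent.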