Multiple Random Variables
MING GAO
DASE @ ECNU
(for course related communications)
[email protected]
Random vector
Definition
An n-dimensional random vector is a function from a sample
space Ω into Rⁿ, the n-dimensional Euclidean space.
Joint PMF
Definition
Let (X, Y) be a discrete bivariate random vector. Then the
function f(x, y) from R² into R defined by
f(x, y) = P(X = x, Y = y)
is called the joint probability mass function (joint pmf) of (X, Y).
Probability calculation
For any set A ⊂ R², the joint pmf gives
P((X, Y) ∈ A) = Σ_{(x,y)∈A} f(x, y).
Expectation
Expectations of functions of random vectors are computed just
as with univariate r.v.s. Let g(x, y) be a real-valued function
defined for all possible values (x, y) of the discrete random vector
(X, Y). Then g(X, Y) is itself a random variable and its
expected value E(g(X, Y)) is given by
E(g(X, Y)) = Σ_{(x,y)∈R²} g(x, y) f(x, y).
Question:
For the above given (X, Y), what is the average value of XY?
Answer: Letting g(x, y) = xy, we compute E(XY) = E(g(X, Y)). Thus,
E(XY) = 2 × 0 × (1/36) + ··· + 7 × 5 × (1/18) = 13 11/18.
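Such computations are easy to check mechanically by tabulating the joint pmf and summing g(x, y) f(x, y). A minimal Python sketch follows; the table below is a small hypothetical pmf (the slide's actual table is not reproduced in the text), so replace it with the real one to reproduce E(XY) = 13 11/18.
Python sketch:

def expectation(g, joint_pmf):
    """E(g(X, Y)) = sum over (x, y) of g(x, y) * f(x, y)."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

# Hypothetical joint pmf used only for illustration.
joint_pmf = {(0, 0): 1/4, (0, 1): 1/4, (1, 0): 1/4, (1, 1): 1/4}
print(expectation(lambda x, y: x * y, joint_pmf))  # E(XY) = 0.25 for this table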
Properties of joint pmf
Properties
For any (x, y), f(x, y) ≥ 0 since f(x, y) is a probability.
Since (X, Y) is certain to be in R²,
Σ_{(x,y)∈R²} f(x, y) = P((X, Y) ∈ R²) = 1.
Marginal pmf
Theorem
Let (X, Y) be a discrete bivariate random vector with joint
pmf f_{X,Y}(x, y). Then the marginal pmfs of X and Y,
fX(x) = P(X = x) and fY(y) = P(Y = y), are given by
fX(x) = Σ_{y∈R} f_{X,Y}(x, y),   fY(y) = Σ_{x∈R} f_{X,Y}(x, y).
Proof.
For any x ∈ R, let Ax = {(x, y) | y ∈ R}. That is, Ax is the
line in the plane with first coordinate equal to x. Then, for any
x ∈ R:
fX(x) = P(X = x) = P((X, Y) ∈ Ax) = Σ_{(x,y)∈Ax} f_{X,Y}(x, y) = Σ_{y∈R} f_{X,Y}(x, y).
Example
Given the above joint pmf, we can compute the marginal pmf
of Y.
Joint PDF
Definition
A function f(x, y) from R² into R is called a joint probability
density function (joint pdf) of the continuous bivariate random
vector (X, Y) if, for every A ⊂ R²,
P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy.
Example
For f(x, y) = 6xy², 0 < x < 1, 0 < y < 1 (and 0 otherwise),
∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy = ∫₀¹ ∫₀¹ 6xy² dx dy
= ∫₀¹ 3x²y² |₀¹ dy = ∫₀¹ 3y² dy = y³ |₀¹ = 1.
Calculating probability I
For the pdf above and A = {(x, y) | x + y ≥ 1}, we thus have
P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy = ∫₀¹ ∫_{1−y}^{1} 6xy² dx dy = 9/10.
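As a sanity check, this probability can be estimated numerically. The sketch below uses plain Monte Carlo over the unit square, assuming the pdf f(x, y) = 6xy² stated above; it is an illustrative check, not part of the original derivation.
Python sketch:

import numpy as np

# Since the unit square has area 1, P(A) = E[f(U, V) * 1{U + V >= 1}]
# for (U, V) uniform on the square.
rng = np.random.default_rng(0)
u = rng.uniform(0, 1, size=1_000_000)
v = rng.uniform(0, 1, size=1_000_000)
estimate = np.mean(6 * u * v**2 * (u + v >= 1))
print(estimate)  # close to 0.9 = 9/10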
Calculating marginal pdf
To calculate fY(y), we note that for y ≥ 1 or y ≤ 0, f(x, y) = 0 for every x.
Thus for y ≥ 1 or y ≤ 0, we have
fY(y) = ∫_{−∞}^{+∞} f(x, y) dx = 0.
For 0 < y < 1, fY(y) = ∫₀¹ 6xy² dx = 3y². Hence
fY(y) = 3y², 0 < y < 1;
fY(y) = 0, otherwise.
Calculating probability II
Let f(x, y) = e^{−y}, 0 < x < y < ∞, and A = {(x, y) | x + y ≥ 1}.
Notice that region A is an unbounded region with three sides
given by the lines y = x, x + y = 1 and x = 0. To integrate
over this region, we would have to break it into at least
two parts to write the appropriate limits of integration.
Thus P((X, Y) ∈ A) is most easily calculated via the complement,
whose region is bounded:
P((X, Y) ∈ A) = 1 − P(X + Y < 1) = 1 − ∫₀^{1/2} ∫_x^{1−x} e^{−y} dy dx
             = 2e^{−1/2} − e^{−1} ≈ 0.845.
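The sketch below checks this value by simulation, assuming the factorization implied by f(x, y) = e^{−y} on 0 < x < y: the marginal fY(y) = y e^{−y} is a Gamma(2, 1) density and X | Y = y is uniform on (0, y).
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=1.0, size=1_000_000)   # Y ~ Gamma(2, 1)
x = y * rng.uniform(0.0, 1.0, size=y.size)            # X | Y=y ~ Uniform(0, y)
print(np.mean(x + y >= 1))                            # simulated P(X + Y >= 1)
print(2 * np.exp(-0.5) - np.exp(-1.0))                # analytic 2e^{-1/2} - e^{-1}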
Joint cdf
The joint probability distribution of (X, Y) can be completely
described with the joint cdf rather than with the joint pmf or
joint pdf.
The joint cdf is the function F(x, y) defined by
F(x, y) = P(X ≤ x, Y ≤ y).
The joint cdf is usually not very handy for discrete cases;
For continuous bivariate random vectors,
F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) dt ds,
and
∂²F(x, y) / ∂x∂y = f(x, y).
Conditional pmf
Let (X, Y) be a discrete bivariate random vector with joint pmf
f(x, y) and marginal pmfs fX(x) and fY(y). For any x such that
P(X = x) = fX(x) > 0, the conditional pmf of Y given that
X = x is the function of y denoted by f(y|x) and defined by
f(y|x) = P(Y = y | X = x) = f(x, y) / fX(x).
Likewise, for any y such that fY(y) > 0, the conditional pmf of X
given Y = y is
f(x|y) = P(X = x | Y = y) = f(x, y) / fY(y).
Example
Define the joint pmf of (X, Y) by
f(0, 10) = f(0, 20) = 2/18,  f(1, 10) = f(1, 30) = 3/18,
f(1, 20) = 4/18,  f(2, 30) = 4/18.
First, the marginal pmf of X is
fX(0) = f(0, 10) + f(0, 20) = 4/18,  fX(2) = f(2, 30) = 4/18,
fX(1) = f(1, 10) + f(1, 20) + f(1, 30) = 10/18.
For x = 0,
f(10|0) = f(0, 10)/fX(0) = 1/2,  f(20|0) = f(0, 20)/fX(0) = 1/2.
Example Cont'd
For x = 1,
f(10|1) = f(1, 10)/fX(1) = 3/10,
f(20|1) = f(1, 20)/fX(1) = 4/10,
f(30|1) = f(1, 30)/fX(1) = 3/10.
For x = 2,
f(30|2) = f(2, 30)/fX(2) = 1.
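The marginal and conditional pmfs above can be verified mechanically; the short Python sketch below recomputes them from the joint pmf table of this example.
Python sketch:

from collections import defaultdict
from fractions import Fraction as F

# Joint pmf of (X, Y) from the example above.
joint = {(0, 10): F(2, 18), (0, 20): F(2, 18), (1, 10): F(3, 18),
         (1, 30): F(3, 18), (1, 20): F(4, 18), (2, 30): F(4, 18)}

fX = defaultdict(F)
for (x, y), p in joint.items():
    fX[x] += p
print(dict(fX))  # marginal pmf of X: {0: 4/18, 1: 10/18, 2: 4/18}

# Conditional pmf of Y given X = x: f(y|x) = f(x, y) / fX(x).
cond = {(y, x): p / fX[x] for (x, y), p in joint.items()}
print(cond[(10, 1)], cond[(20, 1)], cond[(30, 1)], cond[(30, 2)])  # 3/10 4/10 3/10 1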
Conditional pdf
Analogously, if (X, Y) is a continuous bivariate random vector with joint
pdf f(x, y) and marginal pdfs fX(x) and fY(y), then for any x with
fX(x) > 0 the conditional pdf of Y given X = x is
f(y|x) = f(x, y) / fX(x).
Calculating conditional pdf
For the joint pdf f(x, y) = e^{−y}, 0 < x < y < ∞, the marginal of X is
fX(x) = ∫_x^∞ e^{−y} dy = e^{−x}, x > 0, so
f(y|x) = e^{−y} / e^{−x} = e^{−(y−x)}, y > x.
Conditional expectation
If g(Y) is a function of Y, then the conditional expected value
of g(Y) given that X = x is denoted by E(g(Y)|x) and is
defined by
E(g(Y)|x) = Σ_y g(y) f(y|x)            (discrete case),
E(g(Y)|x) = ∫_{−∞}^{∞} g(y) f(y|x) dy   (continuous case).
Calculating conditional expectation and variance
Given the above example, the conditional expected value of Y given
X = x can be calculated as
E(Y|X = x) = ∫_x^∞ y e^{−(y−x)} dy = 1 + x.
Similarly, since Y given X = x is distributed as x plus an exponential(1) r.v.,
Var(Y|X = x) = E(Y²|X = x) − (E(Y|X = x))² = 1.
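A simulation check of these conditional moments, assuming the representation Y = x + Z with Z ~ exponential(1) derived above (the conditioning value x is arbitrary):
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
x = 0.7                                                # arbitrary conditioning value
y = x + rng.exponential(scale=1.0, size=1_000_000)     # Y | X = x
print(y.mean())   # close to 1 + x = 1.7
print(y.var())    # close to 1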
Independent r.v.s
Let (X, Y) be a bivariate random vector with joint pdf or pmf
f(x, y) and marginal pdfs or pmfs fX(x) and fY(y). Then X
and Y are called independent r.v.s if, for any x ∈ R and y ∈ R,
f(x, y) = fX(x) fY(y).
If X and Y are independent, the conditional pdf of Y given
X = x is
f(y|x) = f(x, y) / fX(x) = fX(x) fY(y) / fX(x) = fY(y).
For any A ⊂ R and x ∈ R,
P(Y ∈ A | x) = ∫_A f(y|x) dy = ∫_A fY(y) dy = P(Y ∈ A).
Checking independence I
Define the joint pmf of (X, Y) by
f(10, 1) = f(20, 1) = f(20, 2) = 1/10,
f(10, 2) = f(10, 3) = 1/5,  f(20, 3) = 3/10.
The marginal pmfs are
fX(10) = fX(20) = 1/2,
fY(1) = 1/5,  fY(2) = 3/10,  fY(3) = 1/2.
Thus, the r.v.s X and Y are not independent since
f(10, 3) = 1/5 ≠ 1/4 = (1/2)(1/2) = fX(10) fY(3).
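A mechanical independence check from a joint pmf table, in the same spirit:
Python sketch:

from collections import defaultdict
from fractions import Fraction as F

joint = {(10, 1): F(1, 10), (20, 1): F(1, 10), (20, 2): F(1, 10),
         (10, 2): F(1, 5), (10, 3): F(1, 5), (20, 3): F(3, 10)}

fX, fY = defaultdict(F), defaultdict(F)
for (x, y), p in joint.items():
    fX[x] += p
    fY[y] += p

independent = all(joint.get((x, y), F(0)) == fX[x] * fY[y]
                  for x in fX for y in fY)
print(independent)                      # False
print(joint[(10, 3)], fX[10] * fY[3])   # 1/5 vs 1/4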
Lemma for independent r.v.s
Let (X, Y) be a bivariate random vector with joint pdf or pmf
f(x, y). Then X and Y are independent r.v.s if and only if there
exist functions g(x) and h(y) such that, for every x ∈ R and
y ∈ R,
f(x, y) = g(x) h(y).
Proof.
⇒: Immediate from the definition (take g = fX and h = fY).
⇐: Let f(x, y) = g(x) h(y). We define
∫_{−∞}^{∞} g(x) dx = c   and   ∫_{−∞}^{∞} h(y) dy = d.
Proof Cont'd
cd = (∫_{−∞}^{∞} g(x) dx)(∫_{−∞}^{∞} h(y) dy) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x) h(y) dx dy
   = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.
Thus we have cd = 1. Furthermore,
fX(x) = ∫_{−∞}^{∞} g(x) h(y) dy = d·g(x)  and  fY(y) = ∫_{−∞}^{∞} g(x) h(y) dx = c·h(y),
so that
fX(x) fY(y) = cd·g(x) h(y) = g(x) h(y) = f(x, y),
which is the definition of independence.
Theorem for independent r.v.s
Let X and Y be independent r.v.s.
For any A ⊂ R and B ⊂ R, P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B);
moreover, for functions g(x) and h(y), E(g(X) h(Y)) = E(g(X)) E(h(Y)).
For instance, if additionally E(X) = E(Y) = 1 and Var(X) = 1, then
E(X²Y) = (E(X²))(E(Y))
       = (Var(X) + (E(X))²) E(Y)
       = (1 + 1²) · 1 = 2.
MGF of a sum of normal variables
Let X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²) be independent r.v.s.
Then, the mgfs of X and Y are
MX(t) = exp(µ1 t + σ1² t²/2),
MY(t) = exp(µ2 t + σ2² t²/2).
Theorem
Z = X + Y has mgf
MZ(t) = MX(t) MY(t) = exp((µ1 + µ2) t + (σ1² + σ2²) t²/2),
so Z ∼ N(µ1 + µ2, σ1² + σ2²).
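A quick simulation check of this theorem (illustrative only; the parameter values below are arbitrary):
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 1.0, 2.0, -0.5, 0.5        # arbitrary example parameters
x = rng.normal(mu1, s1, size=1_000_000)
y = rng.normal(mu2, s2, size=1_000_000)
z = x + y
print(z.mean(), z.var())                       # close to 0.5 and 4.25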
Distribution of bivariate function
Consider new r.v.s U = g1(X, Y) and V = g2(X, Y). For any set B ⊂ R²,
let A = {(x, y) | (g1(x, y), g2(x, y)) ∈ B}. Thus
P((U, V) ∈ B) = P((X, Y) ∈ A),
i.e., the probability distribution of (U, V) is completely deter-
mined by the probability distribution of (X, Y).
Transformation of discrete r.v.s
If (X, Y) is a discrete bivariate random vector, then there is only
a countable set of values for which the joint pmf of (X, Y) is
positive. Call this set A.
Define the sets
A = {(x, y) | x ∈ N, y ∈ N},
B = {(u, v) | v ∈ N, u ≥ v, u ∈ N}.
Distribution of the sum of Poisson variables
Let X and Y be independent Poisson r.v.s with parameters θ1
and θ2, respectively. Thus the joint pmf of (X, Y) is
f(x, y) = (θ1^x e^{−θ1} / x!)(θ2^y e^{−θ2} / y!),  x ∈ N, y ∈ N.
Using the transformation U = X + Y, V = Y (with the sets A and B above)
and summing the joint pmf of (U, V) over v, one finds
U = X + Y ∼ Poisson(θ1 + θ2).
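An empirical check that the sum of independent Poisson r.v.s is Poisson with the summed parameter (the parameter values are arbitrary):
Python sketch:

import math
import numpy as np

# Compare the empirical pmf of X + Y with the Poisson(theta1 + theta2) pmf.
rng = np.random.default_rng(0)
theta1, theta2 = 1.5, 2.0                      # arbitrary example parameters
u = rng.poisson(theta1, 1_000_000) + rng.poisson(theta2, 1_000_000)
for k in range(5):
    empirical = np.mean(u == k)
    exact = math.exp(-(theta1 + theta2)) * (theta1 + theta2) ** k / math.factorial(k)
    print(k, round(empirical, 4), round(exact, 4))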
Transformation of continuous r.v.s
If (X, Y) is a continuous bivariate random vector with joint pdf
f_{X,Y}(x, y), then the joint pdf of (U, V) can be expressed in terms
of f_{X,Y}(x, y).
Define the sets
A = {(x, y) | f_{X,Y}(x, y) > 0},
B = {(u, v) | u = g1(x, y), v = g2(x, y) for some (x, y) ∈ A}.
For the simplest version of this result we assume that the trans-
formation u = g1(x, y) and v = g2(x, y) defines a one-to-one
transformation of A onto B.
For such a one-to-one, onto transformation, we can obtain the
inverse transformation x = h1(u, v) and y = h2(u, v). The
role played by a derivative in the univariate case is now played
by a quantity called the Jacobian of the transformation.
Transformation of continuous r.v.s
We further define the Jacobian determinant of the transforma-
tion as
J = ∂x/∂u · ∂y/∂v − ∂x/∂v · ∂y/∂u,
where ∂x/∂u = ∂h1(u, v)/∂u, ∂y/∂v = ∂h2(u, v)/∂v,
∂x/∂v = ∂h1(u, v)/∂v, ∂y/∂u = ∂h2(u, v)/∂u.
The joint pdf of (U, V) is 0 outside the set B and on the set B
is given by
f_{U,V}(u, v) = f_{X,Y}(h1(u, v), h2(u, v)) |J|.
Sum and difference of normal variables
Let X and Y be independent standard normal r.v.s. Consider
the transformation U = X + Y and V = X − Y; thus we have
g1(x, y) = x + y,  g2(x, y) = x − y,
h1(u, v) = (u + v)/2,  h2(u, v) = (u − v)/2.
Furthermore,
J = ∂x/∂u · ∂y/∂v − ∂x/∂v · ∂y/∂u = (1/2)(−1/2) − (1/2)(1/2) = −1/2.
Hence the joint pdf of (U, V) is
f_{U,V}(u, v) = f_{X,Y}((u + v)/2, (u − v)/2) |J|
            = (1/(2π)) exp(−(u² + v²)/4) · (1/2)
            = ((1/√(4π)) e^{−u²/4}) ((1/√(4π)) e^{−v²/4}),  (u, v) ∈ R².
Analysis
The joint pdf has factored into a function of u and a
function of v . By the above lemma, U and V are
independent.
U ∼ N(0, 2) and V ∼ N(0, 2).
This important fact, that sums and differences of
independent normal r.v.s are independent normal r.v.s, is
true regardless of the means of X and Y , so long as
Var (X ) = Var (Y ).
Theorem
Let X and Y be independent r.v.s. Let g (x) be a function only
of x and h(y ) be a function only of y . Then the r.v.s U = g (X )
and V = h(Y ) are independent.
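A simulation illustrating both conclusions: the sample variances of U and V are near 2 and their sample correlation is near 0.
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = rng.standard_normal(1_000_000)
u, v = x + y, x - y
print(u.var(), v.var())          # both close to 2
print(np.corrcoef(u, v)[0, 1])   # close to 0 (U and V are independent)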
Distribution of the ratio of normal variables
Let X and Y be independent N(0, 1) r.v.s. Consider the trans-
formation U = X/Y and V = |Y|.
Note that this transformation is not one-to-one since the points
(x, y) and (−x, −y) are both mapped into the same (u, v) point.
Summing the contributions of the two branches, the marginal pdf of
U = X/Y works out to fU(u) = 1/(π(1 + u²)), i.e., the ratio of two
independent standard normals has a Cauchy distribution.
Binomial-Poisson hierarchy
An insect lays a large number of eggs, each surviving with prob-
ability p. On the average, how many eggs will survive?
The “large number” of eggs laid is a r.v., often taken to be
Poisson(λ). Furthermore, if we assume that each egg’s survival
is independent, then we have Bernoulli trials. Let
X |Y ∼ Binomial(Y , p),
Y ∼ Poisson(λ).
Recall that we use notation such as X |Y ∼ Binomial(Y , p) to
mean that the conditional distribution of X given Y = y is
Binomial(y , p).
Binomial-Poisson hierarchy Cont'd
P(X = x) = Σ_{y=0}^{∞} P(X = x, Y = y) = Σ_{y=0}^{∞} P(X = x | Y = y) P(Y = y)
         = Σ_{y=x}^{∞} [(y choose x) p^x (1 − p)^{y−x}] [λ^y e^{−λ} / y!]
           (the terms with y < x vanish)
         = ((λp)^x e^{−λ} / x!) Σ_{y=x}^{∞} ((1 − p)λ)^{y−x} / (y − x)!
         = ((λp)^x e^{−λ} / x!) Σ_{t=0}^{∞} ((1 − p)λ)^t / t!
         = ((λp)^x e^{−λ} / x!) e^{(1−p)λ} = ((λp)^x / x!) e^{−λp}.
Thus, any marginal inference on X is with respect to a
Poisson(λp) distribution, with Y playing no part at all.
The answer to the original question is now easy to compute:
E(X) = λp.
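A simulation check of the marginal Poisson(λp) conclusion (λ and p are arbitrary here):
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
lam, p = 10.0, 0.3                       # arbitrary example parameters
y = rng.poisson(lam, size=1_000_000)     # number of eggs laid
x = rng.binomial(y, p)                   # number of survivors given Y
print(x.mean(), x.var())                 # both close to lam * p = 3 (Poisson property)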
Theorem for expectation of conditional expectation
If X and Y are any two r.v.s, then
E(X) = E(E(X|Y)).
Proof (continuous case): writing the joint pdf as f(x, y) = f(x|y) fY(y),
E(X) = ∫∫ x f(x, y) dx dy = ∫ [∫ x f(x|y) dx] fY(y) dy.
Thus, we have
E(X) = ∫ E(X|y) fY(y) dy = E(E(X|Y)).
Mixture distribution
From the above theorem, we can easily compute the expected
number of survivors:
E(X) = E(E(X|Y)) = E(pY) = pλ.
Definition
A r.v. X is said to have a mixture distribution if the distribution
of X depends on a quantity that also has a distribution.
Rethinking the three-stage model
Note that this three-stage model can also be thought of as a
two-stage hierarchy by combining the last two stages. If Y|Λ ∼
Poisson(Λ) and Λ ∼ exponential(β), then
P(Y = y) = P(Y = y, 0 < Λ < ∞) = ∫₀^∞ f(y, λ) dλ
         = ∫₀^∞ f(y|λ) f(λ) dλ = ∫₀^∞ [e^{−λ} λ^y / y!] (1/β) e^{−λ/β} dλ
         = (1/(β y!)) ∫₀^∞ λ^y e^{−λ(1+β^{−1})} dλ = (1/(β y!)) Γ(y + 1) (1/(1 + β^{−1}))^{y+1}
         = (1/(1 + β)) (1/(1 + β^{−1}))^y.
It forms a negative binomial pmf (with r = 1 and success probability
p = 1/(1 + β)). Therefore, our three-stage
hierarchy is equivalent to the two-stage hierarchy
X|Y ∼ Binomial(Y, p),
Y ∼ negative binomial(1, 1/(1 + β)).
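A numerical check of the mixture pmf above (β is arbitrary); note that exponential(β) here means mean β, matching the density (1/β)e^{−λ/β} used in the derivation.
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
beta = 2.0                                         # arbitrary example parameter
lam = rng.exponential(scale=beta, size=1_000_000)  # Lambda ~ exponential(beta)
y = rng.poisson(lam)                               # Y | Lambda ~ Poisson(Lambda)
for k in range(4):
    exact = (1 / (1 + beta)) * (beta / (1 + beta)) ** k
    print(k, round(np.mean(y == k), 4), round(exact, 4))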
Beta-binomial hierarchy
X|P ∼ Binomial(n, P),
P ∼ beta(α, β).
By iterating the expectation, we calculate the mean of X as
E(X) = E(E(X|P)) = E(nP) = nα/(α + β).
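A simulation check of this iterated-expectation result (n, α, β are arbitrary):
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 20, 2.0, 5.0                  # arbitrary example parameters
p = rng.beta(alpha, beta, size=1_000_000)
x = rng.binomial(n, p)
print(x.mean(), n * alpha / (alpha + beta))    # both close to 40/7 ≈ 5.714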
Conditional variance identity
Theorem
For any two r.v.s X and Y,
Var(X) = E(Var(X|Y)) + Var(E(X|Y)),
provided that the expectations exist.
Covariance and correlation
In this section, we discuss two numerical measures of the strength of
a relationship between two r.v.s, the covariance and correlation.
The covariance and correlation of X and Y are the numbers
defined by
Cov(X, Y) = E((X − µX)(Y − µY)),
ρXY = Cov(X, Y) / (σX σY),
where µX = E(X), µY = E(Y), σX² = Var(X), σY² = Var(Y). Both
measure the strength of a (linear) relationship between X and Y.
Theorem
For any r.v.s X and Y,
Cov(X, Y) = E(XY) − µX µY.
Proof.
Cov(X, Y) = E((X − µX)(Y − µY))
          = E(XY − µX Y − µY X + µX µY)
          = E(XY) − µX E(Y) − µY E(X) + µX µY
          = E(XY) − µX µY.
Theorem
If r.v.s X and Y are independent r.v.s, then Cov(X, Y) = 0 and ρXY = 0.
Proof.
Since X and Y are independent, we have E(XY) = E(X)E(Y). Thus
Cov(X, Y) = E(XY) − E(X)E(Y) = 0 and ρXY = 0.
Theorem
If X and Y are any two r.v.s and a and b are any two constants, then
Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y).
Proof.
The mean of aX + bY is E(aX + bY) = aµX + bµY. Thus,
Var(aX + bY) = E((a(X − µX) + b(Y − µY))²)
             = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y).
Theorem
For any r.v.s X and Y, −1 ≤ ρXY ≤ 1, and |ρXY| = 1 if and only if
P(Y = aX + b) = 1 for some constants a ≠ 0 and b.
Proof.
Consider the function h(t) defined by
h(t) = E(((X − µX)t + (Y − µY))²) = t² σX² + 2t Cov(X, Y) + σY² ≥ 0.
Proof Cont'd
Since h(t) ≥ 0 for all t, the quadratic has at most one real root, so its
discriminant satisfies
∆ = (2 Cov(X, Y))² − 4σX² σY² ≤ 0.
This is equivalent to
−σX σY ≤ Cov(X, Y) ≤ σX σY,  i.e.,  −1 ≤ ρXY = Cov(X, Y)/(σX σY) ≤ 1.
|ρXY| = 1 if and only if h(t) has a single root. But since
((X − µX)t + (Y − µY))² ≥ 0, the expected value h(t) =
E(((X − µX)t + (Y − µY))²) = 0 if and only if
P(((X − µX)t + (Y − µY))² = 0) = 1.
This is equivalent to
P((X − µX)t + (Y − µY) = 0) = 1,
i.e., Y is a linear function of X with probability 1.
Example
Let X have a uniform(−1, 1) distribution and Z have a
uniform(0, 1/10) distribution. Suppose X and Z are independent.
Let Y = X² + Z and consider the random vector (X, Y). The
conditional distribution of Y given X = x is uniform(x², x² + 1/10).
The joint pdf of (X, Y) is
f(x, y) = 5,  −1 < x < 1,  x² < y < x² + 1/10.
There is a strong relationship between X
and Y, as indicated by the conditional
distribution of Y given X = x.
In fact, E(X) = E(X³) = 0, and since X and Z are independent,
E(XZ) = E(X)E(Z). Hence
Cov(X, Y) = E(X(X² + Z)) − E(X)(E(X² + Z)) = 0, and ρXY = 0.
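A simulation of this example, showing a sample correlation near zero even though Y is a function of X plus small noise:
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000_000)
z = rng.uniform(0, 0.1, size=1_000_000)
y = x**2 + z
print(np.corrcoef(x, y)[0, 1])   # close to 0, although Y depends strongly on X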
Bivariate normal pdf
Let µX, µY ∈ R, σX, σY ∈ R⁺ and ρ ∈ (−1, 1) be five real
numbers. The bivariate normal pdf with means µX and µY,
variances σX² and σY², and correlation ρ is the bivariate pdf given
by
f(x, y) = (2πσX σY √(1 − ρ²))^{−1}
          · exp( −(1/(2(1 − ρ²))) [ ((x − µX)/σX)² − 2ρ((x − µX)/σX)((y − µY)/σY) + ((y − µY)/σY)² ] ).
Multivariate distributions Cont'd
Let g(x) = g(x1, ···, xn) be a real-valued function defined on
the sample space of X. Then the expected value of g(X) is
E(g(X)) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} g(x) f(x) dx   (continuous case),
E(g(X)) = Σ_{x∈Rⁿ} g(x) f(x)                     (discrete case).
Multinomial distribution
Definition
Let m and n be positive integers and let p1, ···, pn be numbers
satisfying 0 ≤ pi ≤ 1 and Σ_{i=1}^n pi = 1. Then the random vector
(X1, ···, Xn) has a multinomial distribution with m trials and cell
probabilities p1, ···, pn if its joint pmf is
f(x1, ···, xn) = (m! / (x1! ··· xn!)) p1^{x1} ··· pn^{xn}
on the set of (x1, ···, xn) with nonnegative integer entries summing to m.
Multinomial theorem
For any numbers q1, ···, qn and positive integer m,
(q1 + ··· + qn)^m = Σ (m! / (x1! ··· xn!)) q1^{x1} ··· qn^{xn},
where the sum is over all nonnegative integers (x1, ···, xn) with Σ xi = m.
Marginal pmf of multinomial distribution
Let B be the set of (x1, ···, x_{n−1}) with nonnegative integer entries
summing to m − xn. Then
f(xn) = Σ_{(x1,···,x_{n−1})∈B} (m! / (x1! ··· xn!)) p1^{x1} ··· pn^{xn}
      = Σ_{(x1,···,x_{n−1})∈B} (m! / (x1! ··· xn!)) p1^{x1} ··· pn^{xn}
        · ((m − xn)! (1 − pn)^{m−xn}) / ((m − xn)! (1 − pn)^{m−xn})
      = (m! / (xn! (m − xn)!)) pn^{xn} (1 − pn)^{m−xn}
        · Σ_{(x1,···,x_{n−1})∈B} ((m − xn)! / (x1! ··· x_{n−1}!)) Π_{i=1}^{n−1} (pi / (1 − pn))^{xi}
      = (m! / (xn! (m − xn)!)) pn^{xn} (1 − pn)^{m−xn},
since the last sum equals 1 by the multinomial theorem (the probabilities
pi/(1 − pn), i = 1, ···, n − 1, sum to 1).
Hence, the marginal distribution of Xn is binomial(m, pn).
Similar arguments show that each of the other coordinates is
marginally binomially distributed.
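A quick simulation check that a single coordinate of a multinomial vector is binomial (m and the cell probabilities are arbitrary):
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
m, p = 30, np.array([0.2, 0.3, 0.5])
counts = rng.multinomial(m, p, size=1_000_000)
xn = counts[:, -1]                        # last coordinate
print(xn.mean(), m * p[-1])               # binomial mean m*pn = 15
print(xn.var(), m * p[-1] * (1 - p[-1]))  # binomial variance m*pn*(1-pn) = 7.5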
Mutually independent random vectors
Let X1, ···, Xn be random vectors with joint pdf or pmf f(x1, ···, xn)
and marginal pdfs or pmfs fXi(xi). They are called mutually independent
if, for every (x1, ···, xn),
f(x1, ···, xn) = fX1(x1) ··· fXn(xn) = Π_{i=1}^{n} fXi(xi).
Conditional pmf of multinomial distribution
f(x1, ···, x_{n−1} | xn) = f(x1, ···, xn) / f(xn)
  = [ (m! / (x1! ··· xn!)) p1^{x1} ··· pn^{xn} ]
    / [ (m! / (xn! (m − xn)!)) pn^{xn} (1 − pn)^{m−xn} ]
  = ((m − xn)! / (x1! ··· x_{n−1}!)) Π_{i=1}^{n−1} (pi / (1 − pn))^{xi},
i.e., given Xn = xn, the remaining coordinates are multinomial with
m − xn trials and cell probabilities pi/(1 − pn), i = 1, ···, n − 1.
Mgf of mutually independent random variables Cont'd
Corollary
Generalization
Tail bounds
Question
Consider the experiment of tossing a fair coin n times. What is
the probability that the number of heads exceeds 3n/4?
Notes
The tail bounds of a r.v. X are concerned with the probability
that it deviates significantly from its expected value E(X) on a
run of the experiment.
Markov inequality
Markov inequality
For a nonnegative r.v. X and any a > 0,
P(X > a) ≤ E(X)/a,   or equivalently   P(X > a E(X)) ≤ 1/a.
Proof.
P(X > a) = ∫_{x>a} fX(x) dx ≤ ∫_{x>a} (x/a) fX(x) dx ≤ ∫ (x/a) fX(x) dx = E(X)/a.
Example
With X the number of heads in n tosses of a fair coin, E(X) = n/2, so
P(X > 3n/4) ≤ (n/2)/(3n/4) = 2/3.
Chebyshev's inequality
Let X be a r.v. and let g(x) be a nonnegative
function. Then, for any r > 0,
P(g(X) ≥ r) ≤ E(g(X))/r.
Proof.
E(g(X)) = ∫_{−∞}^{∞} g(x) fX(x) dx ≥ ∫_{x: g(x)≥r} g(x) fX(x) dx
        ≥ r ∫_{x: g(x)≥r} fX(x) dx = r P(g(X) ≥ r).
Applying this with g(x) = e^{tx} for t > 0 gives, for any a,
P(X ≥ a) ≤ MX(t) / e^{at}.
Chernoff bound
Deriving Chernoff bound
Theorem (lower tail)
Let Xi be a sequence of independent r.v.s with P(Xi = 1) = pi
and P(Xi = 0) = 1 − pi. Let X = Σ_{i=1}^n Xi and µ = Σ_{i=1}^n pi. Then
P(X < (1 − δ)µ) < (e^{−δ} / (1 − δ)^{(1−δ)})^µ   and   P(X < (1 − δ)µ) < exp(−µδ²/2).
Proof.
For t > 0,
P(X < (1 − δ)µ) = P(exp(−tX) > exp(−t(1 − δ)µ))
                < Π_{i=1}^n E(exp(−tXi)) / exp(−t(1 − δ)µ),
using Markov's inequality and the independence of the Xi.
Proof of Chernoff bound Cont'd
Note that 1 − x < e^{−x} if x > 0, so
Π_{i=1}^n E(exp(−tXi)) = Π_{i=1}^n (pi e^{−t} + (1 − pi)) = Π_{i=1}^n (1 − pi(1 − e^{−t}))
                       < Π_{i=1}^n exp(pi(e^{−t} − 1)) = exp(µ(e^{−t} − 1)).
That is,
P(X < (1 − δ)µ) < exp(µ(e^{−t} − 1)) / exp(−t(1 − δ)µ) = exp(µ(e^{−t} + t − tδ − 1)).
Now it's time to choose t to make the bound as tight as possible.
Taking the derivative of µ(e^{−t} + t − tδ − 1) and setting −e^{−t} +
1 − δ = 0, we have t = ln(1/(1 − δ)), and
P(X < (1 − δ)µ) < (e^{−δ} / (1 − δ)^{(1−δ)})^µ.
Proof of second statement
To get the simpler form of the bound, we need to get rid of the
clumsy term (1 − δ)^{(1−δ)}. Note that
(1 − δ) ln(1 − δ) = (1 − δ)(−Σ_{i=1}^{∞} δ^i / i) > −δ + δ²/2.
Thus, we have
(1 − δ)^{(1−δ)} > exp(−δ + δ²/2).
Furthermore,
P(X < (1 − δ)µ) < (e^{−δ} / (1 − δ)^{(1−δ)})^µ
                < (e^{−δ} / e^{(−δ + δ²/2)})^µ = exp(−µδ²/2).
Chernoff bound (Upper tail)
Theorem
Let Xi be a sequence of independent r.v.s with P(Xi = 1) = pi
and P(Xi = 0) = 1 − pi. Let X = Σ_{i=1}^n Xi and µ = Σ_{i=1}^n pi. Then
P(X > (1 + δ)µ) < (e^δ / (1 + δ)^{(1+δ)})^µ,
P(X > (1 + δ)µ) < exp(−µδ²/4).
Example
Let X be the number of heads in n tosses of a fair coin; then µ = n/2 and,
taking δ = 1/2, we have
P(X > 3n/4) = P(X > (1 + 1/2)(n/2)) < exp(−(n/2)(1/2)²/4) = exp(−n/32).
If we toss the coin 1000 times, the probability is less than
exp(−125/4).
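The sketch below compares the Markov bound (2/3), the Chernoff bound exp(−n/32), and the exact binomial tail probability P(X > 3n/4) for a few values of n, to show how much sharper the Chernoff bound is.
Python sketch:

import math

def exact_tail(n):
    """P(X > 3n/4) for X ~ Binomial(n, 1/2)."""
    k0 = math.floor(3 * n / 4)
    return sum(math.comb(n, k) for k in range(k0 + 1, n + 1)) / 2**n

for n in (20, 100, 1000):
    print(n, exact_tail(n), 2 / 3, math.exp(-n / 32))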
Hoeffding inequality
Let X1, X2, ···, Xn be i.i.d. observations such that E(Xi) = µ
and a ≤ Xi ≤ b. Then, for any ε > 0,
P(|X̄ − µ| ≥ ε) ≤ 2 exp(−2nε² / (b − a)²),  where X̄ = (1/n) Σ_{i=1}^n Xi.
Example
If X1, X2, ···, Xn ∼ Bernoulli(p), then a = 0 and b = 1, so
in terms of Hoeffding inequality we have
P(|X̄ − p| > ε) ≤ 2 exp(−2nε²).
If p = 0.5,
P(X̄ − 0.5 > 1/4) < P(|X̄ − 0.5| > 1/4) ≤ 2 exp(−n/8).
Outline
Joint and Marginal Distributions
Conditional Distribution and Independence
Bivariate Transformations
Hierarchical Models and Mixture Distributions
Covariance and Correlation
Multivariate Distributions
Inequalities
Numerical Inequalities
Functional Inequalities
Take-aways
Lemma
Let a and b be any positive numbers, and let p and q be any
positive numbers satisfying 1/p + 1/q = 1. Then
(1/p) a^p + (1/q) b^q ≥ ab,
with equality if and only if a^p = b^q.
Proof.
Fix b, and consider the function
g(a) = (1/p) a^p + (1/q) b^q − ab.
Proof cont'd
Then g'(a) = a^{p−1} − b, which is zero when a^{p−1} = b, i.e., when a^p = b^q
(using q = p/(p − 1)). Since g''(a) = (p − 1) a^{p−2} > 0, this is a minimum,
and at the minimum ab = b^{1/(p−1)} · b = b^q, so the minimum value is
g(a) = (1/p) b^q + (1/q) b^q − b^q = 0.
Hence g(a) ≥ 0 for all a > 0, with equality only when a^p = b^q.
Hölder's inequality
Let X and Y be any two r.v.s, and let p and q satisfy 1/p + 1/q = 1.
Then
|E(XY)| ≤ E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.
Proof.
The first inequality follows from −|XY| ≤ XY ≤ |XY|. To
prove the second inequality, define
a = |X| / (E|X|^p)^{1/p}   and   b = |Y| / (E|Y|^q)^{1/q}.
Applying the above lemma,
(1/p) |X|^p / (E|X|^p) + (1/q) |Y|^q / (E|Y|^q) ≥ |XY| / ((E|X|^p)^{1/p} (E|Y|^q)^{1/q}).
Now take expectations of both sides. The expectation of the
left-hand side is 1/p + 1/q = 1, and rearrangement gives the conclusion.
Cauchy-Schwarz inequality
Perhaps the most famous special case of Hölder's inequality is
that for which p = q = 2: for any two r.v.s X and Y,
|E(XY)| ≤ E|XY| ≤ (E|X|²)^{1/2} (E|Y|²)^{1/2}.
Minkowski's inequality
Let X and Y be any two r.v.s. Then for 1 ≤ p < ∞,
(E|X + Y|^p)^{1/p} ≤ (E|X|^p)^{1/p} + (E|Y|^p)^{1/p}.
Proof:
E|X + Y|^p = E(|X + Y| |X + Y|^{p−1})
           ≤ E(|X| |X + Y|^{p−1}) + E(|Y| |X + Y|^{p−1}),
by the triangle inequality. Applying Hölder's inequality (with exponents p
and q = p/(p − 1)) to each term on the right and dividing both sides by
(E|X + Y|^p)^{1/q} gives the result.
Outline
Joint and Marginal Distributions
Conditional Distribution and Independence
Bivariate Transformations
Hierarchical Models and Mixture Distributions
Covariance and Correlation
Multivariate Distributions
Inequalities
Numerical Inequalities
Functional Inequalities
Take-aways
Convex inequality
A function g(x) is convex if
g(λx + (1 − λ)y) ≤ λ g(x) + (1 − λ) g(y)  for all x, y and all 0 < λ < 1;
g(x) is concave if −g(x) is convex.
Jensen's inequality
For any r.v. X, if g(x) is a convex function, then
E(g(X)) ≥ g(E(X)).
Equality holds if and only if, for every line a + bx that is tangent
to g(x) at x = E(X), P(g(X) = a + bX) = 1.
Proof.
To establish the inequality, let l(x) be a tangent line to g(x) at
the point (E(X), g(E(X))). Write l(x) = a + bx for some a and b.
Now, by the convexity of g we have g(x) ≥ a + bx. Since
expectations preserve inequalities,
E(g(X)) ≥ E(a + bX) = a + bE(X) = l(E(X)) = g(E(X)).
One immediate application of Jensen's inequality shows that
E(X²) ≥ (E(X))², since g(x) = x² is convex.
An inequality for means
Jensen's inequality can be used to prove an inequality between
three different kinds of means. If a1, ···, an are positive num-
bers, define
aA = (1/n)(a1 + a2 + ··· + an),              (arithmetic mean)
aG = (a1 · a2 · ··· · an)^{1/n},              (geometric mean)
aH = 1 / ((1/n)(1/a1 + 1/a2 + ··· + 1/an)).  (harmonic mean)
Then aH ≤ aG ≤ aA.
Let X be a r.v. with P(X = ai) = 1/n, i = 1, ···, n. Since log x is
concave, Jensen's inequality gives
log aG = (1/n) Σ_{i=1}^n log ai = E(log X) ≤ log E(X) = log aA.
So aG ≤ aA.
Now again use the fact that log x is concave to get
log(1/aH) = log((1/n) Σ_{i=1}^n 1/ai) = log E(1/X) ≥ E(log(1/X)) = −E(log X).
Since E(log X) = log aG, it then follows that log(1/aH) ≥ log(1/aG),
or aG ≥ aH.
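A quick numerical illustration of aH ≤ aG ≤ aA for randomly chosen positive numbers:
Python sketch:

import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.1, 10.0, size=8)           # arbitrary positive numbers
arithmetic = a.mean()
geometric = np.exp(np.log(a).mean())
harmonic = 1.0 / np.mean(1.0 / a)
print(harmonic <= geometric <= arithmetic)   # True
print(harmonic, geometric, arithmetic)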
Covariance inequality
If X is a r.v. with finite mean µ and g(x) is a nondecreasing function,
then E(g(X)(X − µ)) ≥ 0, since (g(X) − g(µ)) and (X − µ) always have the
same sign and E(g(X)(X − µ)) = E((g(X) − g(µ))(X − µ)) ≥ 0.
Theorem
If X is a r.v., g(x) and h(x) are any functions s.t. E(g(X)), E(h(X)),
and E(g(X)h(X)) exist.
If g(x) is nondecreasing and h(x) is nonincreasing, then
E(g(X)h(X)) ≤ E(g(X))E(h(X)).
If g(x) and h(x) are both nondecreasing or both nonincreasing, then
E(g(X)h(X)) ≥ E(g(X))E(h(X)).
Take-aways
Conclusions
Joint and marginal distributions
Continuous distributions
Independence
Bivariate transformation
Hierarchical models and mixture distributions
Multivariate distribution
Inequalities