
Statistical Inference

Lecture 4: Multiple Random Variables

MING GAO

DASE @ ECNU
(for course related communications)
[email protected]

Mar. 21, 2018


Outline
Joint and Marginal Distributions
Conditional Distribution and Independence
Bivariate Transformations
Hierarchical Models and Mixture Distributions
Covariance and Correlation
Multivariate Distributions
Inequalities
Numerical Inequalities
Functional Inequalities
Take-aways
2 / 97
Random vector
Definition
An n-dimensional random vector is a function from a sample
space Ω into R^n, n-dimensional Euclidean space.

Example
Consider the experiment of tossing two fair dice. Let
X = sum of the two dice, and Y = |difference of the two dice|.

• For the sample point (3, 3), X = 6 and Y = 0.
• For the sample point (4, 1) or (1, 4), X = 5 and Y = 3.
• Since each of the 36 sample points in Ω is equally likely,
  P(X = 5, Y = 3) = 2/36 = 1/18.
3 / 97
Joint PMF
Definition
Let (X, Y) be a discrete bivariate random vector. Then the
function f(x, y) from R² into R defined by

f(x, y) = P(X = x, Y = y)

is called the joint probability mass function or joint pmf of
(X, Y). If it is necessary, the notation fX,Y(x, y) will be used.

Example
There are 21 possible values of (X, Y). Two of these values are
f(5, 3) = 1/18 and f(6, 0) = 1/36.
4 / 97
Probability calculation
The joint pmf can be used to compute the probability of any
event defined in terms of (X, Y). Let A be any subset of R².
Then

P((X, Y) ∈ A) = Σ_{(x,y)∈A} f(x, y).

Example
Let A = {(x, y) | x = 7 and y ≤ 4}. Thus

P(A) = P(X = 7, Y ≤ 4) = f(7, 1) + f(7, 3) = 1/18 + 1/18 = 1/9.
5 / 97
Expectation
Expectations of functions of random vectors are computed just
as with univariate r.v.s. Let g(x, y) be a real-valued function
defined for all possible values (x, y) of the discrete random vec-
tor (X, Y). Then g(X, Y) is itself a random variable and its
expected value E(g(X, Y)) is given by

E(g(X, Y)) = Σ_{(x,y)∈R²} g(x, y)f(x, y).

Question:
For the above given (X, Y), what is the average value of XY?
Answer: Letting g(x, y) = xy, we compute E(XY) = E(g(X, Y)).
Thus,

E(XY) = 2 · 0 · (1/36) + · · · + 7 · 5 · (1/18) = 13 + 11/18.
6 / 97
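The numbers above can be checked by brute-force enumeration. A minimal Python sketch (not part of the lecture) rebuilds the joint pmf of (X, Y) from the 36 equally likely outcomes and evaluates E(XY):

```python
# Sketch: enumerate the 36 outcomes of two fair dice to verify the joint pmf
# values and E(XY) quoted above.
from fractions import Fraction
from itertools import product

joint = {}                      # joint pmf of (X, Y) = (sum, |difference|)
for a, b in product(range(1, 7), repeat=2):
    x, y = a + b, abs(a - b)
    joint[(x, y)] = joint.get((x, y), Fraction(0)) + Fraction(1, 36)

print(joint[(5, 3)])            # 1/18
print(joint[(6, 0)])            # 1/36
exy = sum(x * y * p for (x, y), p in joint.items())
print(exy)                      # 245/18, i.e. 13 + 11/18
```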
Properties of joint pmf
Properties
• For any (x, y), f(x, y) ≥ 0 since f(x, y) is a probability.
• Since (X, Y) is certain to be in R²,
  Σ_{(x,y)∈R²} f(x, y) = P((X, Y) ∈ R²) = 1.
• It turns out that any nonnegative function from R² to R
  that is nonzero for at most a countable number of (x, y)
  pairs and sums to 1 is the joint pmf for some bivariate
  discrete random vector (X, Y).
7 / 97
Marginal pmf
Theorem
Let (X, Y) be a discrete bivariate random vector with joint
pmf fX,Y(x, y). Then the marginal pmfs of X and Y,
fX(x) = P(X = x) and fY(y) = P(Y = y), are given by

fX(x) = Σ_{y∈R} fX,Y(x, y),   fY(y) = Σ_{x∈R} fX,Y(x, y).

Proof.
For any x ∈ R, let Ax = {(x, y) | y ∈ R}. That is, Ax is the
line in the plane with first coordinate equal to x. Then, for any
x ∈ R:

fX(x) = P(X = x) = P(X = x, −∞ < Y < ∞) = P((X, Y) ∈ Ax)
      = Σ_{(x,y)∈Ax} fX,Y(x, y) = Σ_{y∈R} fX,Y(x, y).
8 / 97
Example
Given the above joint pmf, we can compute the marginal pmf
of Y.

fY(0) = fX,Y(2, 0) + fX,Y(4, 0) + fX,Y(6, 0)
      + fX,Y(8, 0) + fX,Y(10, 0) + fX,Y(12, 0) = 1/6.

Similarly, we have fY(1) = 5/18, fY(2) = 2/9, fY(3) = 1/6,
fY(4) = 1/9, and fY(5) = 1/18.
Note that Σ_{k=0}^{5} fY(k) = 1, as it must, since these are the
only six possible values of Y.
9 / 97
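A short sketch, under the same two-dice setup, that recovers the marginal pmf of Y by summing probabilities over the outcomes; the printed values match the list above:

```python
# Sketch: marginal pmf of Y = |difference| for two fair dice, obtained by
# accumulating the probability of each outcome by its value of Y.
from fractions import Fraction
from itertools import product

fY = {}
for a, b in product(range(1, 7), repeat=2):
    y = abs(a - b)
    fY[y] = fY.get(y, Fraction(0)) + Fraction(1, 36)

for y in sorted(fY):
    print(y, fY[y])          # 0 1/6, 1 5/18, 2 2/9, 3 1/6, 4 1/9, 5 1/18
print(sum(fY.values()))      # 1
```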
Joint PDF
Definition
A function f(x, y) from R² into R is called a joint probability
density function or joint pdf of the continuous bivariate random
vector (X, Y) if, for every A ⊂ R²,

P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy.

• If g(x, y) is a real-valued function, then the expected
  value of g(X, Y) is defined to be

  E(g(X, Y)) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} g(x, y)f(x, y) dx dy.

• The marginal probability density functions of X and Y are

  fX(x) = ∫_{−∞}^{+∞} fX,Y(x, y) dy,   fY(y) = ∫_{−∞}^{+∞} fX,Y(x, y) dx.
10 / 97
Example
Define a joint pdf by

f(x, y) = 6xy²,  0 < x < 1 and 0 < y < 1;
f(x, y) = 0,     otherwise.

It is indeed a joint pdf, since
• f(x, y) ≥ 0 for all (x, y) in the defined range;
• it integrates to 1:

  ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} f(x, y) dx dy = ∫_0^1 ∫_0^1 6xy² dx dy
  = ∫_0^1 3x²y²|_0^1 dy = ∫_0^1 3y² dy = y³|_0^1 = 1.
11 / 97
Calculating probability I
Now consider calculating a probability such as P(X + Y ≥ 1).
Letting A = {(x, y) | x + y ≥ 1}, we want P((X, Y) ∈ A).

A = {(x, y) | x + y ≥ 1, 0 < x < 1, 0 < y < 1}
  = {(x, y) | x ≥ 1 − y, 0 < x < 1, 0 < y < 1}
  = {(x, y) | 1 − y ≤ x < 1, 0 < y < 1}

Thus,

P((X, Y) ∈ A) = ∫∫_A f(x, y) dx dy = ∫_0^1 ∫_{1−y}^1 6xy² dx dy = 9/10.
12 / 97
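A numerical cross-check of the last two computations, assuming scipy is available; dblquad integrates the pdf 6xy² over the unit square and over the region x ≥ 1 − y:

```python
# Sketch: numerical check of f(x, y) = 6xy^2 on (0,1)x(0,1): total mass 1 and
# P(X + Y >= 1) = 9/10.  dblquad expects the inner variable as first argument.
from scipy.integrate import dblquad

f = lambda x, y: 6 * x * y**2

# total mass: outer x in (0,1), inner y in (0,1)
total, _ = dblquad(lambda y, x: f(x, y), 0, 1, lambda x: 0, lambda x: 1)
print(total)        # ~1.0

# P(X + Y >= 1): outer y in (0,1), inner x from 1 - y to 1
p, _ = dblquad(lambda x, y: f(x, y), 0, 1, lambda y: 1 - y, lambda y: 1)
print(p)            # ~0.9
```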
Calculating marginal pdf
To calculate fX(x), we note that for x ≥ 1 or x ≤ 0, f(x, y) = 0.
Thus for x ≥ 1 or x ≤ 0, we have

fX(x) = ∫_{−∞}^{+∞} f(x, y) dy = 0.

For 0 < x < 1, we have

fX(x) = ∫_{−∞}^{+∞} f(x, y) dy = ∫_0^1 6xy² dy = 2xy³|_0^1 = 2x.

Similarly, we can calculate

fY(y) = 3y²,  0 < y < 1;
fY(y) = 0,    otherwise.
13 / 97
Calculating probability II
Let f(x, y) = e^{−y}, 0 < x < y < ∞, and A = {(x, y) | x + y ≥ 1}.
Notice that region A is an unbounded region with three sides
given by the lines y = x, x + y = 1 and x = 0. To integrate
over this region, we would have to break the region into at least
two parts to write the appropriate limits of integration.
Thus P((X, Y) ∈ A) can be calculated as

P(X + Y ≥ 1) = 1 − P(X + Y < 1)
             = 1 − ∫_0^{1/2} ∫_x^{1−x} e^{−y} dy dx
             = 1 − ∫_0^{1/2} (e^{−x} − e^{−(1−x)}) dx
             = 2e^{−1/2} − e^{−1}.
14 / 97
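A similar numerical sketch for this unbounded-region example, again assuming scipy; it evaluates the complement P(X + Y < 1) over 0 < x < 1/2, x < y < 1 − x and compares with 2e^{−1/2} − e^{−1}:

```python
# Sketch: numerical check of P(X + Y >= 1) = 2*exp(-1/2) - exp(-1) for the
# joint pdf f(x, y) = exp(-y) on 0 < x < y < infinity.
import numpy as np
from scipy.integrate import dblquad

# complement region: x + y < 1 forces 0 < x < 1/2 and x < y < 1 - x
p_lt, _ = dblquad(lambda y, x: np.exp(-y), 0, 0.5,
                  lambda x: x, lambda x: 1 - x)
print(1 - p_lt)                        # ~0.8452
print(2 * np.exp(-0.5) - np.exp(-1))   # same value
```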
Joint cdf
The joint probability distribution of (X, Y) can be completely
described with the joint cdf rather than with the joint pmf or
joint pdf.
The joint cdf is the function F(x, y) defined by

F(x, y) = P(X ≤ x, Y ≤ y).

• The joint cdf is usually not very handy for discrete cases;
• For a continuous bivariate random vector,

  F(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(s, t) ds dt

  and

  ∂²F(x, y) / ∂x∂y = f(x, y).
15 / 97
Conditional pmf
Let (X, Y) be a discrete bivariate random vector with joint pmf
f(x, y) and marginal pmfs fX(x) and fY(y). For any x such that
P(X = x) = fX(x) > 0, the conditional pmf of Y given that
X = x is the function of y denoted by f(y|x) and defined by

f(y|x) = P(Y = y|X = x) = f(x, y) / fX(x).

For any y such that P(Y = y) = fY(y) > 0, the conditional
pmf of X given that Y = y is the function of x denoted by
f(x|y) and defined by

f(x|y) = P(X = x|Y = y) = f(x, y) / fY(y).
16 / 97
Example
Define the joint pmf of (X, Y) by

f(0, 10) = f(0, 20) = 2/18,  f(1, 10) = f(1, 30) = 3/18,
f(1, 20) = 4/18,  f(2, 30) = 4/18.

First, the marginal pmf of X is

fX(0) = f(0, 10) + f(0, 20) = 4/18,  fX(2) = f(2, 30) = 4/18,
fX(1) = f(1, 10) + f(1, 20) + f(1, 30) = 10/18.

For x = 0,

f(10|0) = f(0, 10)/fX(0) = 1/2,  f(20|0) = f(0, 20)/fX(0) = 1/2.
17 / 97
Example Cont’d
For x = 1,

f(10|1) = f(1, 10)/fX(1) = 3/10,
f(20|1) = f(1, 20)/fX(1) = 4/10,
f(30|1) = f(1, 30)/fX(1) = 3/10.

For x = 2,

f(30|2) = f(2, 30)/fX(2) = 1.
18 / 97
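A small sketch that reproduces the conditional pmfs of this example by dividing each joint probability by the appropriate marginal fX(x), using exact fractions:

```python
# Sketch: conditional pmfs f(y|x) computed from the joint pmf table above.
from fractions import Fraction

joint = {(0, 10): Fraction(2, 18), (0, 20): Fraction(2, 18),
         (1, 10): Fraction(3, 18), (1, 30): Fraction(3, 18),
         (1, 20): Fraction(4, 18), (2, 30): Fraction(4, 18)}

fX = {}
for (x, y), p in joint.items():
    fX[x] = fX.get(x, Fraction(0)) + p       # marginal pmf of X

for (x, y), p in sorted(joint.items()):
    print(f"f({y}|{x}) = {p / fX[x]}")
# f(10|0) = 1/2, f(20|0) = 1/2, f(10|1) = 3/10, f(20|1) = 2/5,
# f(30|1) = 3/10, f(30|2) = 1
```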
Conditional pdf
Let (X, Y) be a continuous bivariate random vector with joint
pdf f(x, y) and marginal pdfs fX(x) and fY(y). For any x such
that fX(x) > 0, the conditional pdf of Y given that X = x is
the function of y denoted by f(y|x) and defined by

f(y|x) = f(x, y) / fX(x).

For any y such that fY(y) > 0, the conditional pdf of X given
that Y = y is the function of x denoted by f(x|y) and defined by

f(x|y) = f(x, y) / fY(y).
19 / 97
Calculating conditional pdf
Let f(x, y) = e^{−y}, 0 < x < y < ∞, as in the earlier example.
We need to compute the conditional pdf of Y given X = x.
The marginal pdf of X is computed as
• For x ≤ 0, fX(x) = 0 since f(x, y) = 0;
• For x > 0,

  fX(x) = ∫_{−∞}^{∞} f(x, y) dy = ∫_x^{∞} e^{−y} dy = e^{−x}.

Thus, the conditional pdf of Y given X = x is
• f(y|x) = f(x, y)/fX(x) = e^{−y}/e^{−x} = e^{−(y−x)}, if y > x;
• f(y|x) = f(x, y)/fX(x) = 0/e^{−x} = 0, if y ≤ x.
20 / 97
Conditional expectation
If g(Y) is a function of Y, then the conditional expected value
of g(Y) given that X = x is denoted by E(g(Y)|x) and is
defined by

E(g(Y)|x) = Σ_y g(y)f(y|x)              (discrete case),
E(g(Y)|x) = ∫_{−∞}^{∞} g(y)f(y|x) dy    (continuous case).

• The conditional expected value has all of the properties of
  the usual expected value;
• E(Y|x) provides the best guess at Y based on knowledge of X.
21 / 97
Calculating conditional expectation and variance
Given the above example, the conditional expected value of Y
given X = x can be calculated as

E(Y|X = x) = ∫_x^{∞} y e^{−(y−x)} dy = 1 + x.

The conditional variance can be computed as

Var(Y|X = x) = E(Y²|x) − (E(Y|x))²
             = ∫_x^{∞} y² e^{−(y−x)} dy − (∫_x^{∞} y e^{−(y−x)} dy)² = 1.

Note that the marginal distribution of Y is gamma(2, 1), which
has Var(Y) = 2. Given the knowledge that X = x, the variabil-
ity in Y is considerably reduced.
22 / 97
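A sketch that checks E(Y|X = x) = 1 + x and Var(Y|X = x) = 1 by numerical integration of the conditional pdf e^{−(y−x)}; the test point x = 0.7 is an arbitrary choice, not from the lecture:

```python
# Sketch: numerical check of the conditional mean and variance for
# f(y|x) = exp(-(y - x)), y > x.
import numpy as np
from scipy.integrate import quad

x = 0.7                                                # arbitrary test point
f_cond = lambda y: np.exp(-(y - x))                    # conditional pdf of Y given X = x
m1, _ = quad(lambda y: y * f_cond(y), x, np.inf)       # E(Y|x)
m2, _ = quad(lambda y: y**2 * f_cond(y), x, np.inf)    # E(Y^2|x)
print(m1, m2 - m1**2)                                  # ~1.7 and ~1.0
```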
Independent r.v.s
Let (X, Y) be a bivariate random vector with joint pdf or pmf
f(x, y) and marginal pdfs or pmfs fX(x) and fY(y). Then X
and Y are called independent r.v.s if, for any x ∈ R and y ∈ R,

f(x, y) = fX(x)fY(y).

• If X and Y are independent, the conditional pdf of Y given
  X = x is

  f(y|x) = f(x, y)/fX(x) = fX(x)fY(y)/fX(x) = fY(y).

• For any A ⊂ R and x ∈ R,

  P(Y ∈ A|x) = ∫_A f(y|x) dy = ∫_A fY(y) dy = P(Y ∈ A).
23 / 97
Checking independence I
Define the joint pmf of (X, Y) by

f(10, 1) = f(20, 1) = f(20, 2) = 1/10,
f(10, 2) = f(10, 3) = 1/5,  f(20, 3) = 3/10.

The marginal pmfs are

fX(10) = fX(20) = 1/2,
fY(1) = 1/5,  fY(2) = 3/10,  fY(3) = 1/2.

Thus, the r.v.s X and Y are not independent since

f(10, 3) = 1/5 ≠ (1/2)(1/2) = fX(10)fY(3).
24 / 97
Lemma for independent r.v.s
Let (X, Y) be a bivariate random vector with joint pdf or pmf
f(x, y). Then X and Y are independent r.v.s if and only if there
exist functions g(x) and h(y) such that, for every x ∈ R and
y ∈ R,

f(x, y) = g(x)h(y).

Proof.
⇒: Easy to prove based on the definition.
⇐: Let f(x, y) = g(x)h(y). We define

∫_{−∞}^{∞} g(x) dx = c,   ∫_{−∞}^{∞} h(y) dy = d.
25 / 97
Proof Cont’d

cd = (∫_{−∞}^{∞} g(x) dx)(∫_{−∞}^{∞} h(y) dy) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y) dx dy
   = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

Furthermore, the marginal pdfs are given by

fX(x) = ∫_{−∞}^{∞} g(x)h(y) dy = d·g(x),   fY(y) = ∫_{−∞}^{∞} g(x)h(y) dx = c·h(y).

Thus we have

f(x, y) = g(x)h(y) = g(x)h(y)cd = fX(x)fY(y).

That is, X and Y are independent.
26 / 97
Checking independence II
Consider the joint pdf f(x, y) = (1/384) x²y⁴ e^{−y−x/2}, x > 0 and
y > 0.
Question: Please confirm whether the r.v.s X and Y are independent.
Answer: If we define

g(x) = x² e^{−x/2},  x > 0;   g(x) = 0, otherwise,
h(y) = (1/384) y⁴ e^{−y},  y > 0;   h(y) = 0, otherwise,

then f(x, y) = g(x)h(y) for all x ∈ R and y ∈ R. In terms of
the lemma, we conclude that X and Y are independent r.v.s.
Note that we do not have to compute the marginal pdfs.
27 / 97
Theorem for independent r.v.s
Let X and Y be independent r.v.s.
• For any A ⊂ R and B ⊂ R,

  P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B),

  i.e., the events {X ∈ A} and {Y ∈ B} are independent events;
• Let g(x) and h(y) be functions only of x and y,
  respectively; then

  E(g(X)h(Y)) = (E(g(X)))(E(h(Y))).

• The moment generating function of the r.v. Z = X + Y is
  given by

  MZ(t) = MX(t)MY(t).
28 / 97
Expectation of independent r.v.s
Let X and Y be independent exponential(1) r.v.s. Then

P(X ≥ 4, Y < 3) = P(X ≥ 4)P(Y < 3) = e^{−4}(1 − e^{−3}).

• Letting g(x) = x² and h(y) = y, we see that

  E(X²Y) = (E(X²))(E(Y)) = (Var(X) + (E(X))²)E(Y) = (1 + 1²)·1 = 2.
29 / 97
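A Monte Carlo sketch (sample size and seed are arbitrary) that checks both displayed values for independent exponential(1) variables:

```python
# Sketch: Monte Carlo check of P(X >= 4, Y < 3) = exp(-4)(1 - exp(-3)) and
# E(X^2 Y) = 2 for independent exponential(1) r.v.s X and Y.
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
x = rng.exponential(1.0, n)
y = rng.exponential(1.0, n)

print(np.mean((x >= 4) & (y < 3)), np.exp(-4) * (1 - np.exp(-3)))  # both ~0.0174
print(np.mean(x**2 * y))                                           # ~2.0
```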
MGF of a sum of normal variables
Let X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²) be independent r.v.s.
Then, the mgfs of X and Y are

MX(t) = exp(µ1 t + σ1² t²/2),
MY(t) = exp(µ2 t + σ2² t²/2).

In terms of the theorem, the mgf of Z = X + Y is

MZ(t) = MX(t)MY(t) = exp((µ1 + µ2)t + (σ1² + σ2²)t²/2).

Theorem
Let X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²) be independent r.v.s.
Then, the r.v. Z = X + Y has a N(µ1 + µ2, σ1² + σ2²) distribution.
30 / 97
Distribution of bivariate function
Let (X, Y) be a bivariate random vector with a known proba-
bility distribution. Now consider a new bivariate random vector
(U, V) defined by U = g1(X, Y) and V = g2(X, Y), where
gi(x, y) is some specified function.
For any B ⊂ R², (U, V) ∈ B if and only if (X, Y) ∈ A, where

A = {(x, y) | (g1(x, y), g2(x, y)) ∈ B}.

Thus

P((U, V) ∈ B) = P((X, Y) ∈ A),

i.e., the probability distribution of (U, V) is completely deter-
mined by the probability distribution of (X, Y).
31 / 97
Transformation of discrete r.v.s
If (X, Y) is a discrete bivariate random vector, then there is only
a countable set of values for which the joint pmf of (X, Y) is
positive. Call this set A.
Define the set

B = {(u, v) | u = g1(x, y) and v = g2(x, y) for some (x, y) ∈ A}.

Then B is the countable set of possible values for the discrete
random vector (U, V). And if, for any (u, v) ∈ B, we define

A_{uv} = {(x, y) ∈ A | u = g1(x, y) and v = g2(x, y)},

then the joint pmf of (U, V) can be computed as

fU,V(u, v) = P(U = u, V = v) = P((X, Y) ∈ A_{uv}) = Σ_{(x,y)∈A_{uv}} fX,Y(x, y).
32 / 97
Distribution of the sum of Poisson variables
Let X and Y be independent Poisson r.v.s with parameters θ1
and θ2, respectively. Thus the joint pmf of (X, Y) is

fX,Y(x, y) = (θ1^x e^{−θ1} / x!)(θ2^y e^{−θ2} / y!),  x ∈ N, y ∈ N.

Now define U = X + Y and V = Y. That is, g1(x, y) = x + y
and g2(x, y) = y. Thus,

A = {(x, y) | x ∈ N, y ∈ N},
B = {(u, v) | v ∈ N, u ≥ v, u ∈ N}.

fU,V(u, v) = fX,Y(u − v, v) = (θ1^{u−v} e^{−θ1} / (u − v)!)(θ2^v e^{−θ2} / v!).
33 / 97
Distribution of the sum of Poisson variables Cont’d
In this example it is interesting to compute the marginal pmf of
U. Thus

fU(u) = Σ_{v=0}^{u} (θ1^{u−v} e^{−θ1} / (u − v)!)(θ2^v e^{−θ2} / v!)
      = e^{−(θ1+θ2)} Σ_{v=0}^{u} θ1^{u−v} θ2^v / ((u − v)! v!)
      = (e^{−(θ1+θ2)} / u!) Σ_{v=0}^{u} C(u, v) θ1^{u−v} θ2^v
      = (e^{−(θ1+θ2)} / u!) (θ1 + θ2)^u
      = ((θ1 + θ2)^u / u!) e^{−(θ1+θ2)}.

Theorem
Let X and Y be independent Poisson r.v.s with parameters θ1
and θ2, respectively. Then X + Y ∼ Poisson(θ1 + θ2).
34 / 97
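A quick sketch that checks the theorem numerically: the convolution of two Poisson pmfs, computed exactly as in the marginal sum above, agrees with the Poisson(θ1 + θ2) pmf (the parameter values are arbitrary):

```python
# Sketch: the convolution of Poisson(theta1) and Poisson(theta2) pmfs matches
# the Poisson(theta1 + theta2) pmf at a test point u.
from math import exp, factorial

def pois(k, theta):
    return theta**k * exp(-theta) / factorial(k)

theta1, theta2, u = 2.0, 3.5, 4
conv = sum(pois(u - v, theta1) * pois(v, theta2) for v in range(u + 1))
print(conv, pois(u, theta1 + theta2))   # both ~0.156
```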
Transformation of continuous r.v.s
If (X, Y) is a continuous bivariate random vector with joint pdf
fX,Y(x, y), then the joint pdf of (U, V) can be expressed in terms
of fX,Y(x, y).
Define the sets

A = {(x, y) | fX,Y(x, y) > 0},
B = {(u, v) | u = g1(x, y) and v = g2(x, y) for some (x, y) ∈ A}.

For the simplest version of this result we assume that the trans-
formation u = g1(x, y) and v = g2(x, y) defines a one-to-one
transformation of A onto B.
For such a one-to-one, onto transformation, we can obtain the
inverse transformation x = h1(u, v) and y = h2(u, v). The
role played by a derivative in the univariate case is now played
by a quantity called the Jacobian of the transformation.
35 / 97
Transformation of continuous r.v.s
We further define the Jacobian determinant of the transforma-
tion as

J = det [ ∂x/∂u  ∂x/∂v ; ∂y/∂u  ∂y/∂v ] = (∂x/∂u)(∂y/∂v) − (∂x/∂v)(∂y/∂u),

where ∂x/∂u = ∂h1(u, v)/∂u, ∂y/∂v = ∂h2(u, v)/∂v,
∂x/∂v = ∂h1(u, v)/∂v, ∂y/∂u = ∂h2(u, v)/∂u.
The joint pdf of (U, V) is 0 outside the set B and on the set B
is given by

fU,V(u, v) = fX,Y(h1(u, v), h2(u, v))|J|,

where |J| is the absolute value of J.
Note that it is sometimes just as difficult to determine the set
B and verify that the transformation is one-to-one as it is to
substitute into the formula.
36 / 97
Sum and difference of normal variables
Let X and Y be independent, standard normal r.v.s. Consider
the transformation U = X + Y and V = X − Y; thus we have

g1(x, y) = x + y,  g2(x, y) = x − y,
h1(u, v) = (u + v)/2,  h2(u, v) = (u − v)/2.

Furthermore,

J = det [ 1/2  1/2 ; 1/2  −1/2 ] = −1/2.

fU,V(u, v) = fX,Y(h1(u, v), h2(u, v))|J|
           = (1/(4π)) e^{−((u+v)/2)²/2} e^{−((u−v)/2)²/2}
           = ((1/(√(2π)·√2)) e^{−u²/4})((1/(√(2π)·√2)) e^{−v²/4}).
37 / 97
Analysis
• The joint pdf has factored into a function of u and a
  function of v. By the above lemma, U and V are independent.
• U ∼ N(0, 2) and V ∼ N(0, 2).
• This important fact, that sums and differences of
  independent normal r.v.s are independent normal r.v.s, is
  true regardless of the means of X and Y, so long as
  Var(X) = Var(Y).

Theorem
Let X and Y be independent r.v.s. Let g(x) be a function only
of x and h(y) be a function only of y. Then the r.v.s U = g(X)
and V = h(Y) are independent.
38 / 97
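A simulation sketch of this analysis: for independent N(0, 1) draws, U = X + Y and V = X − Y each have variance about 2 and are essentially uncorrelated (seed and sample size arbitrary):

```python
# Sketch: simulate U = X + Y and V = X - Y for independent N(0,1) X, Y and
# check that Var(U) = Var(V) = 2 and the sample correlation is near 0.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)
u, v = x + y, x - y

print(np.var(u), np.var(v))        # ~2, ~2
print(np.corrcoef(u, v)[0, 1])     # ~0
```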
Distribution of the ratio of normal variables
Let X and Y be independent N(0, 1) r.v.s. Consider the trans-
formation U = X/Y and V = |Y|.
Note that this transformation is not one-to-one since the points
(x, y) and (−x, −y) are both mapped into the same (u, v) point.
Let

A1 = {(x, y) : y > 0},  A2 = {(x, y) : y < 0},  A0 = {(x, y) : y = 0}.

Thus, B = {(u, v) : v > 0} is the image of both A1 and A2
under the transformation.
The inverse transformations from B to A1 and from B to A2 are
given by

x = h11(u, v) = uv,   y = h21(u, v) = v;
x = h12(u, v) = −uv,  y = h22(u, v) = −v.
39 / 97
Distribution of the ratio of normal variables Cont’d
Note that the Jacobians from the two inverses are J1 = J2 = v,
and the joint pdf is

fX,Y(x, y) = (1/(2π)) e^{−x²/2} e^{−y²/2}.

Thus, we obtain

fU,V(u, v) = (1/(2π)) e^{−(uv)²/2} e^{−v²/2} |v| + (1/(2π)) e^{−(−uv)²/2} e^{−(−v)²/2} |v|
           = (v/π) e^{−(u²+1)v²/2},  −∞ < u < ∞, 0 < v < ∞.

From this the marginal pdf of U can be computed to be

fU(u) = ∫_0^{∞} (v/π) e^{−(u²+1)v²/2} dv = (1/(2π)) ∫_0^{∞} e^{−(u²+1)z/2} dz   (z = v²)
      = 1 / (π(u² + 1)).

So we see that the ratio of two independent standard normal
r.v.s is a Cauchy r.v.
40 / 97
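A sketch comparing the empirical cdf of X/Y with the standard Cauchy cdf F(u) = 1/2 + arctan(u)/π at a few arbitrary points:

```python
# Sketch: the ratio X/Y of two independent N(0,1) r.v.s has the standard
# Cauchy cdf; compare empirical and exact values.
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)
u = x / y

for t in (-2.0, 0.0, 1.0, 5.0):
    print(np.mean(u <= t), 0.5 + np.arctan(t) / np.pi)   # pairs agree closely
```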
Binomial-Poisson hierarchy
An insect lays a large number of eggs, each surviving with prob-
ability p. On the average, how many eggs will survive?
The “large number” of eggs laid is a r.v., often taken to be
Poisson(λ). Furthermore, if we assume that each egg’s survival
is independent, then we have Bernoulli trials. Let

X = number of survivors,  Y = number of eggs laid.

Thus, we have a hierarchical model as

X|Y ∼ Binomial(Y, p),
Y ∼ Poisson(λ).

Recall that we use notation such as X|Y ∼ Binomial(Y, p) to
mean that the conditional distribution of X given Y = y is
Binomial(y, p).
41 / 97
Binomial-Poisson hierarchy Cont’d

P(X = x) = Σ_{y=0}^{∞} P(X = x, Y = y) = Σ_{y=0}^{∞} P(X = x|Y = y)P(Y = y)
         = Σ_{y=0}^{∞} [C(y, x) p^x (1 − p)^{y−x}][λ^y e^{−λ} / y!]
         = ((λp)^x e^{−λ} / x!) Σ_{y=x}^{∞} ((1 − p)λ)^{y−x} / (y − x)!
         = ((λp)^x e^{−λ} / x!) Σ_{t=0}^{∞} ((1 − p)λ)^t / t!   (t = y − x)
         = ((λp)^x e^{−λ} / x!) e^{(1−p)λ} = ((λp)^x / x!) e^{−λp}.

Thus, any marginal inference on X is with respect to a
Poisson(λp) distribution, with Y playing no part at all.
The answer to the original question is now easy to compute:
E(X) = λp.
42 / 97
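A simulation sketch of the hierarchy (λ and p chosen arbitrarily); the sample mean and variance of X both come out near λp, consistent with a Poisson(λp) marginal:

```python
# Sketch: simulate X|Y ~ Binomial(Y, p), Y ~ Poisson(lam) and compare the
# marginal behaviour of X with Poisson(lam * p).
import numpy as np

rng = np.random.default_rng(3)
lam, p, n = 6.0, 0.3, 10**6
y = rng.poisson(lam, n)
x = rng.binomial(y, p)            # numpy accepts an array of trial counts

print(x.mean(), lam * p)          # both ~1.8
print(x.var())                    # ~1.8 as well: Poisson variance equals its mean
```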
Theorem for expectation of conditional expectation
If X and Y are any two r.v.s, then

E(X) = E(E(X|Y)),

provided that the expectations exist.

Proof.
Let f(x, y) denote the joint pdf of X and Y. By definition,
we have

E(X) = ∫∫ x f(x, y) dx dy = ∫ [∫ x f(x|y) dx] fY(y) dy.

Thus, we have

E(X) = ∫ E(X|y) fY(y) dy = E(E(X|Y)).

Replace integrals by sums to prove the discrete case.
43 / 97
Mixture distribution
From the above theorem, we can easily compute the expected
number of survivors:

E(X) = E(E(X|Y)) = E(pY) = pλ.

Definition
A r.v. X is said to have a mixture distribution if the distribution
of X depends on a quantity that also has a distribution.

In the above example, the Poisson(λp) distribution is a mixture
distribution since it is the result of combining a Binomial(Y, p) with
Y ∼ Poisson(λ).
In general, we can say that hierarchical models lead to mixture
distributions.
44 / 97
Example generalization
Instead of one mother insect, there are a large number of moth-
ers and one mother is chosen at random. We are still interested
in knowing the average number of survivors, but it is no longer
clear that the number of eggs laid follows the same Poisson
distribution for each mother.
The following three-stage hierarchy may be more appropriate.
Let

X = number of survivors,  X|Y ∼ Binomial(Y, p),
Y|Λ ∼ Poisson(Λ),  Λ ∼ exponential(β).

Thus, the expectation of X can easily be calculated as

E(X) = E(E(X|Y)) = E(pY) = E(E(pY|Λ)) = E(pΛ) = pβ.
45 / 97
Rethinking the three-stage model
Note that this three-stage model can also be thought of as a
two-stage hierarchy by combining the last two stages. If Y|Λ ∼
Poisson(Λ) and Λ ∼ exponential(β), then

P(Y = y) = P(Y = y, 0 < Λ < ∞) = ∫_0^{∞} f(y, λ) dλ
         = ∫_0^{∞} f(y|λ)f(λ) dλ = ∫_0^{∞} [e^{−λ} λ^y / y!] (1/β) e^{−λ/β} dλ
         = (1/(β y!)) ∫_0^{∞} λ^y e^{−λ(1+β^{−1})} dλ
         = (1/(β y!)) Γ(y + 1) (1/(1 + β^{−1}))^{y+1}
         = (1/(1 + β)) (β/(1 + β))^y.

This is a negative binomial pmf. Therefore, our three-stage
hierarchy is equivalent to the two-stage hierarchy

X|Y ∼ Binomial(Y, p),
Y ∼ negative binomial(r = 1, p = 1/(1 + β)).
46 / 97
Beta-binomial hierarchy
One generalization of the binomial distribution is to allow the
success probability to vary according to a distribution. A stan-
dard model for this situation is

X|P ∼ Binomial(n, P),
P ∼ beta(α, β).

By iterating the expectation, we calculate the mean of X as

E(X) = E(E(X|P)) = E(nP) = nα/(α + β).
49 / 97
Conditional variance identity
Theorem
For any two r.v.s,

Var(X) = E(Var(X|Y)) + Var(E(X|Y)),

provided that the expectations exist.

Proof.
By definition, we have

Var(X) = E((X − E(X))²) = E([X − E(X|Y) + E(X|Y) − E(X)]²).

The cross term vanishes: E([X − E(X|Y)][E(X|Y) − E(X)]) = 0.
For the two remaining terms,

E([X − E(X|Y)]²) = E(E{[X − E(X|Y)]²|Y}) = E(Var(X|Y)),
E([E(X|Y) − E(X)]²) = Var(E(X|Y)).
50 / 97
Beta-binomial hierarchy Cont’d
To calculate the variance of X, we start from

Var(X) = Var(E(X|P)) + E(Var(X|P)).

Note that E(X|P) = nP and Var(X|P) = nP(1 − P), where
P ∼ beta(α, β). Thus

Var(E(X|P)) = Var(nP) = n² αβ / ((α + β)²(α + β + 1)),

E(Var(X|P)) = nE(P(1 − P))
            = (nΓ(α + β)/(Γ(α)Γ(β))) ∫_0^1 p(1 − p) p^{α−1}(1 − p)^{β−1} dp
            = n (Γ(α + β)/(Γ(α)Γ(β))) (Γ(α + 1)Γ(β + 1)/Γ(α + β + 2))
            = nαβ / ((α + β)(α + β + 1)).

Thus we have

Var(X) = nαβ(α + β + n) / ((α + β)²(α + β + 1)).
51 / 97
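A Monte Carlo sketch of the beta-binomial mixture (n, α, β arbitrary) checking the mean and variance formulas just derived:

```python
# Sketch: simulate X|P ~ Binomial(n, P), P ~ beta(a, b) and compare the sample
# mean/variance of X with n*a/(a+b) and n*a*b*(a+b+n)/((a+b)^2 (a+b+1)).
import numpy as np

rng = np.random.default_rng(4)
n, a, b, m = 10, 2.0, 3.0, 10**6
p = rng.beta(a, b, m)
x = rng.binomial(n, p)

print(x.mean(), n * a / (a + b))                                      # ~4.0
print(x.var(), n * a * b * (a + b + n) / ((a + b)**2 * (a + b + 1)))  # ~6.0
```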
Dirichlet-multinomial hierarchy
Suppose we have a die of K sides. We toss the die and the
probability of landing on side k is p(t = k|f) = f_k. We throw
the die N times and obtain a set of results s = {s1, s2, · · · , sN}.
The joint probability is

p(s|f) = Π_{n=1}^{N} p(s_n|f) = f_1^{n_1} f_2^{n_2} · · · f_K^{n_K} = Π_{i=1}^{K} f_i^{n_i},

where n_i is the number of times side i occurs.

Suppose that f follows a Dirichlet distribution with α as hyper-
parameter. Then we express the probability of f as

Dir(f|α) = (Γ(Σ_{k=1}^{K} α_k) / Π_{k=1}^{K} Γ(α_k)) Π_{k=1}^{K} f_k^{α_k−1}.
52 / 97
Example Cont’d
If we want to estimate the parameter f based on the observation
of s, then we can express f in the following manner:

p(f|s, α) = p(s|f, α)p(f|α) / ∫ p(s|f, α)p(f|α) df
          = [Π_{i=1}^{K} f_i^{n_i}] [Γ(Σ_k α_k)/Π_k Γ(α_k)] Π_k f_k^{α_k−1}
            / ∫ [Π_{i=1}^{K} f_i^{n_i}] [Γ(Σ_k α_k)/Π_k Γ(α_k)] Π_k f_k^{α_k−1} df
          = [Π_k f_k^{n_k+α_k−1}] / ∫ Π_k f_k^{n_k+α_k−1} df
          = (Γ(Σ_{k=1}^{K} (n_k + α_k)) / Π_{k=1}^{K} Γ(n_k + α_k)) Π_{k=1}^{K} f_k^{n_k+α_k−1}.

Notice that after estimating f based on the s observations, f
still follows a Dirichlet distribution, now with parameter α + n,
where n = (n1, n2, · · · , nK). This property is known as
conjugate priors. Based on this property, estimating the
parameters f_i after observing N trials is a simple counting
procedure.
53 / 97
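A tiny sketch of the counting procedure: with made-up prior hyperparameters and observations, the posterior Dirichlet parameter is simply α plus the vector of face counts:

```python
# Sketch: Dirichlet-multinomial conjugacy as a counting update; the numbers
# below are illustration values, not from the lecture.
import numpy as np

alpha = np.array([1.0, 1.0, 2.0, 0.5])          # prior hyperparameters (K = 4)
s = np.array([2, 0, 1, 2, 3, 2, 0, 2])          # observed faces, coded 0..K-1
counts = np.bincount(s, minlength=len(alpha))   # n = (n_1, ..., n_K)

posterior = alpha + counts                      # Dirichlet(alpha + n)
print(posterior)                                # [3.  2.  6.  1.5]
print(posterior / posterior.sum())              # posterior mean estimate of f
```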
Covariance and correlation
In this section, we discuss two numerical measures of the strength of
a relationship between two r.v.s, the covariance and correlation.
The covariance and correlation of X and Y are the numbers
defined by

Cov(X, Y) = E((X − µX)(Y − µY)),
ρXY = Cov(X, Y) / (σX σY),

where the value of ρXY is also called the correlation coefficient.

• If large values of X tend to be observed with large values of Y
  and small values of X with small values of Y, then Cov(X, Y)
  will be positive.
• Thus the sign of Cov(X, Y) gives information regarding the
  relationship between X and Y.
54 / 97
Theorem
For any r.v.s X and Y,

Cov(X, Y) = E(XY) − µX µY.

Proof.
Cov(X, Y) = E((X − µX)(Y − µY))
          = E(XY − µX Y − µY X + µX µY)
          = E(XY) − µX E(Y) − µY E(X) + µX µY
          = E(XY) − µX µY.

The correlation is always between −1 and 1, with the values −1
and 1 indicating a perfect linear relationship between X and Y.
55 / 97
Example of correlation
Let the joint pdf of (X, Y) be f(x, y) = 1, 0 < x < 1, x < y < x + 1.
The marginal distribution of X is uniform(0, 1), so µX = 1/2 and
σX² = 1/12.
The marginal distribution of Y is fY(y) = y, 0 < y < 1, and
fY(y) = 2 − y, 1 ≤ y < 2, so µY = 1 and σY² = 1/6.

E(XY) = ∫_0^1 ∫_x^{x+1} xy dy dx = ∫_0^1 (1/2) x y²|_x^{x+1} dx
      = ∫_0^1 (x² + x/2) dx = 7/12.

ρXY = Cov(X, Y)/(σX σY) = (7/12 − (1/2)·1) / √((1/12)(1/6)) = 1/√2.
56 / 97
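A Monte Carlo sketch of this example: sampling X ∼ uniform(0, 1) and then Y = X + uniform(0, 1) reproduces the joint pdf f(x, y) = 1 on the strip, and the sample correlation is close to 1/√2:

```python
# Sketch: empirical check of rho_XY = 1/sqrt(2) for f(x, y) = 1 on
# 0 < x < 1, x < y < x + 1.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 10**6)
y = x + rng.uniform(0, 1, 10**6)    # given X = x, Y is uniform(x, x + 1)

print(np.corrcoef(x, y)[0, 1], 1 / np.sqrt(2))   # both ~0.707
```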
Theorem
If r.v.s X and Y are independent, then Cov(X, Y) = 0 and
ρXY = 0.

Proof.
Since X and Y are independent, we have E(XY) = E(X)E(Y).
Thus

Cov(X, Y) = E(XY) − E(X)E(Y) = 0,  and ρXY = 0.

The converse is false. For X ∼ f(x − θ), with f symmetric around 0
and E(X) = θ, let Y be the indicator function Y = I(|X − θ| < 2).
Then X and Y are obviously not independent. However,

E(XY) = ∫_{−∞}^{∞} x I(|x − θ| < 2) f(x − θ) dx = ∫_{−2}^{2} (t + θ)f(t) dt
      = θ ∫_{−2}^{2} f(t) dt = E(X)E(Y)   (since ∫_{−2}^{2} t f(t) dt = 0).

Thus, it is easy to find uncorrelated, dependent r.v.s.
57 / 97
Theorem
If X and Y are any two r.v.s, and a and b are any two constants,
then

Var(aX + bY) = a²Var(X) + b²Var(Y) + 2abCov(X, Y).

If X and Y are independent r.v.s, then

Var(aX + bY) = a²Var(X) + b²Var(Y).

Proof.
The mean of aX + bY is E(aX + bY) = aµX + bµY. Thus,

Var(aX + bY) = E(((aX + bY) − (aµX + bµY))²)
             = E((a(X − µX) + b(Y − µY))²)
             = E(a²(X − µX)² + b²(Y − µY)² + 2ab(X − µX)(Y − µY))
             = a²Var(X) + b²Var(Y) + 2abCov(X, Y).
58 / 97
Theorem
If X and Y are any two r.v.s,
a. −1 ≤ ρXY ≤ 1.
b. |ρXY| = 1 if and only if there exist numbers a ≠ 0 and b
   such that P(Y = aX + b) = 1. If ρXY = 1, then a > 0; and
   if ρXY = −1, then a < 0.

Proof.
Consider the function h(t) defined by

h(t) = E(((X − µX)t + (Y − µY))²).

Expanding this expression, we obtain

h(t) = t² E((X − µX)²) + 2t E((X − µX)(Y − µY)) + E((Y − µY)²)
     = t² σX² + 2t Cov(X, Y) + σY².
59 / 97
Proof Cont’d
Since h(t) ≥ 0 for all t, the quadratic has at most one real root,
so its discriminant satisfies

∆ = (2Cov(X, Y))² − 4σX²σY² ≤ 0.

This is equivalent to

−σXσY ≤ Cov(X, Y) ≤ σXσY,  i.e.,  −1 ≤ ρXY = Cov(X, Y)/(σXσY) ≤ 1.

|ρXY| = 1 if and only if h(t) has a single root. But since
((X − µX)t + (Y − µY))² ≥ 0, the expected value
h(t) = E(((X − µX)t + (Y − µY))²) = 0 if and only if

P(((X − µX)t + (Y − µY))² = 0) = 1.

This is equivalent to

P((X − µX)t + (Y − µY) = 0) = 1.
60 / 97
Proof Cont’d
This is P(Y = aX + b) = 1 with a = −t and b = µX t + µY,
where t is the root of h(t). Using the quadratic formula, we see
that this root is t = −Cov(X, Y)/σX². Thus a = −t has the same
sign as ρXY, proving the final assertion.

If there is a line y = ax + b (a ≠ 0) such that the values of (X, Y)
have a high probability of being near this line, then the correlation
between X and Y will be near 1 or −1.
But if no such line exists, the correlation will be near 0. This is an
intuitive notion of the linear relationship that is being measured by
correlation.
61 / 97
Example
Let X have a uniform(−1, 1) distribution and Z have a
uniform(0, 1/10) distribution. Suppose X and Z are independent.
Let Y = X² + Z and consider the random vector (X , Y ). The
conditional distribution of Y given X = x is uniform(x², x² + 1/10).
The joint pdf of (X , Y ) is

    f (x, y ) = 5,   −1 < x < 1, x² < y < x² + 1/10.

There is a strong relationship between X and Y , as indicated by the
conditional distribution of Y given X = x.
In fact, E (X ) = E (X³) = 0 and, since X and Z are independent,
E (XZ ) = E (X )E (Z ). Hence

    Cov (X , Y ) = E (X (X² + Z )) − E (X )E (X² + Z ) = 0 and ρXY = 0.
62 / 97
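A short simulation of this example (Python/NumPy; the seed and sample size are arbitrary) makes the point concrete: the sample correlation between X and Y is essentially zero even though Y is almost a deterministic function of X.

    import numpy as np

    # Simulation sketch of the example above: Y = X^2 + Z is tightly determined
    # by X, yet the linear correlation is ~ 0.
    rng = np.random.default_rng(1)
    n = 1_000_000
    x = rng.uniform(-1, 1, n)
    z = rng.uniform(0, 0.1, n)
    y = x**2 + z

    print(np.corrcoef(x, y)[0, 1])        # close to 0: no linear association
    print(np.corrcoef(x**2, y)[0, 1])     # close to 1: Y is nearly a function of X^2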
Bivariate normal pdf
Let µX , µY ∈ R, σX , σY ∈ R+ and ρ ∈ [−1, 1] be five real
numbers. The bivariate normal pdf with means µX and µY ,
variances σX² and σY², and correlation ρ is the bivariate pdf given
by

    f (x, y ) = (2πσX σY √(1 − ρ²))^{−1}
              · exp( −1/(2(1 − ρ²)) [ ((x − µX )/σX )² − 2ρ((x − µX )/σX )((y − µY )/σY ) + ((y − µY )/σY )² ] )

• The marginal distribution of X is N(µX , σX²);
• The marginal distribution of Y is N(µY , σY²);
• The correlation between X and Y is ρXY = ρ;
• For any constants a and b, the distribution of aX + bY is
  N(aµX + bµY , a²σX² + b²σY² + 2abρσX σY ).
63 / 97
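These properties can be checked numerically. The sketch below (Python/NumPy, with one arbitrary choice of µX, µY, σX, σY, ρ) samples from the bivariate normal and compares the sample marginal moments, the correlation, and a linear combination with the values stated above.

    import numpy as np

    # Sketch checking the stated bivariate normal properties for one
    # illustrative parameter choice.
    rng = np.random.default_rng(2)
    mu_x, mu_y, sd_x, sd_y, rho = 1.0, -2.0, 1.5, 0.5, 0.7
    cov = [[sd_x**2,           rho * sd_x * sd_y],
           [rho * sd_x * sd_y, sd_y**2]]
    x, y = rng.multivariate_normal([mu_x, mu_y], cov, size=500_000).T

    print(x.mean(), x.std())              # marginal of X: ~ N(muX, sdX^2)
    print(np.corrcoef(x, y)[0, 1])        # ~ rho
    a, b = 3.0, -1.0
    w = a * x + b * y
    print(w.mean(), w.var())              # ~ a*muX + b*muY and a^2 sdX^2 + b^2 sdY^2 + 2ab rho sdX sdY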
Multivariate distributions
We will use boldface letters to denote multiple variates. Thus, we
write X to denote the r.v.s X1 , · · · , Xn and x to denote the sample
x1 , · · · , xn .
The random vector X = (X1 , · · · , Xn ) has a sample space that
is a subset of Rn .
• If (X1 , · · · , Xn ) is a discrete random vector, then the joint
  pmf of (X1 , · · · , Xn ) is the function defined by
  f (x) = f (x1 , · · · , xn ) = P(X1 = x1 , · · · , Xn = xn ), and

      for any A ⊂ Rn , P(X ∈ A) = Σ_{x∈A} f (x).

• If (X1 , · · · , Xn ) is a continuous random vector, then the
  joint pdf of (X1 , · · · , Xn ) is the function f (x) = f (x1 , · · · , xn ) satisfying

      for any A ⊂ Rn , P(X ∈ A) = ∫ · · · ∫_A f (x) dx.
64 / 97
Multivariate distributions Cont’d
Let g (x) = g (x1 , · · · , xn ) be a real-valued function defined on
the sample space of X. Then the expected value of g (X) is

    E (g (X)) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g (x)f (x) dx   (continuous case)

    E (g (X)) = Σ_{x∈Rn} g (x)f (x)   (discrete case)

The marginal distribution of (X1 , · · · , Xk ), the first k coordinates of
X = (X1 , · · · , Xn ), is given by the pdf or pmf

    f (x1 , · · · , xk ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f (x1 , · · · , xn ) dxk+1 · · · dxn

    f (x1 , · · · , xk ) = Σ_{(xk+1 ,··· ,xn )∈Rn−k} f (x1 , · · · , xn )
65 / 97
Multinomial distribution
Multinomial Theorem

Let n and m be positive integers, and let A be the set of vectors
x = (x1 , · · · , xn ) such that each xi is a nonnegative integer and
Σ_{i=1}^n xi = m. Then for any real numbers p1 , · · · , pn ,

    (p1 + · · · + pn )^m = Σ_{x∈A} (m!/(x1 ! · · · xn !)) p1^{x1} · · · pn^{xn} .

Let n and m be positive integers and p1 , · · · , pn be numbers
satisfying 0 ≤ pi ≤ 1, i = 1, · · · , n, and Σ_{i=1}^n pi = 1. Then
X = (X1 , · · · , Xn ) has a multinomial distribution with m trials
and cell probabilities p1 , · · · , pn if the joint pmf of X is

    f (x1 , · · · , xn ) = (m!/(x1 ! · · · xn !)) p1^{x1} · · · pn^{xn} = m! Π_{i=1}^n pi^{xi}/xi ! .
66 / 97
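As a concrete illustration, the sketch below (Python with SciPy; m = 10 and p = (0.2, 0.3, 0.5) are purely illustrative values) evaluates the joint pmf and checks numerically that summing it over the other coordinates reproduces a binomial marginal, as derived on the following slide.

    from scipy.stats import multinomial, binom

    # Sketch with illustrative numbers: m = 10 trials over n = 3 cells.
    m, p = 10, [0.2, 0.3, 0.5]
    print(multinomial.pmf([2, 3, 5], n=m, p=p))    # joint pmf at x = (2, 3, 5)

    # Summing the joint pmf over x1 with x2 = 3 fixed recovers the
    # binomial(m, p2) marginal.
    print(sum(multinomial.pmf([x1, 3, m - 3 - x1], n=m, p=p) for x1 in range(m - 2)))
    print(binom.pmf(3, m, 0.3))                    # the two numbers match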
Marginal pdf of multinomial distribution
    f (xn ) = Σ_{(x1 ,··· ,xn−1 )∈B} (m!/(x1 ! · · · xn !)) p1^{x1} · · · pn^{xn}

            = Σ_{(x1 ,··· ,xn−1 )∈B} (m!/(x1 ! · · · xn !)) p1^{x1} · · · pn^{xn}
              · ((m − xn )!(1 − pn )^{m−xn })/((m − xn )!(1 − pn )^{m−xn })

            = (m!/(xn !(m − xn )!)) pn^{xn} (1 − pn )^{m−xn}
              · Σ_{(x1 ,··· ,xn−1 )∈B} ((m − xn )!/(x1 ! · · · xn−1 !)) Π_{i=1}^{n−1} (pi /(1 − pn ))^{xi}

            = (m!/(xn !(m − xn )!)) pn^{xn} (1 − pn )^{m−xn} ,

where B is the set of vectors (x1 , · · · , xn−1 ) with nonnegative integer entries
summing to m − xn ; the remaining sum equals 1 by the Multinomial Theorem.
Hence, the marginal distribution of Xn is binomial(m, pn ).
Similar arguments show that each of the other coordinates is
marginally binomially distributed.
67 / 97
Mutually independent random vectors

Let (X1 , · · · , Xn ) be random vectors with joint pdf or pmf
f (x1 , · · · , xn ). Let fXi (xi ) denote the marginal pdf or pmf of
Xi . Then (X1 , · · · , Xn ) are called mutually independent random
vectors if, for every (x1 , · · · , xn ),

    f (x1 , · · · , xn ) = Π_{i=1}^n fXi (xi ).

If the Xi are all one-dimensional, then (X1 , · · · , Xn ) are called
mutually independent random variables.

68 / 97
Conditional pdf of multinomial distribution
    f (x1 , · · · , xn−1 |xn ) = f (x1 , · · · , xn )/f (xn )

        = [ (m!/(x1 ! · · · xn !)) p1^{x1} · · · pn^{xn} ] / [ (m!/(xn !(m − xn )!)) pn^{xn} (1 − pn )^{m−xn} ]

        = ((m − xn )!/(x1 ! · · · xn−1 !)) Π_{i=1}^{n−1} (pi /(1 − pn ))^{xi}

• This is the pmf of a multinomial distribution with m − xn
  trials and cell probabilities p1 /(1 − pn ), · · · , pn−1 /(1 − pn ).
• The conditional distribution of any subset of the coordinates
  of X1 , · · · , Xn given the values of the rest of the coordinates
  is a multinomial distribution.
• We see from the conditional distribution that the
  coordinates of the vector X1 , · · · , Xn are related. It turns
  out that all of the pairwise covariances are negative and are
  given by Cov (Xi , Xj ) = E [(Xi − mpi )(Xj − mpj )] = −mpi pj .
69 / 97
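The negative pairwise covariance is easy to verify by simulation. The sketch below (Python/NumPy; m, p, and the seed are arbitrary choices) compares the sample variance and covariance of multinomial counts with m pi(1 − pi) and −m pi pj.

    import numpy as np

    # Simulation sketch of multinomial(m, p) draws, checking the binomial
    # marginal variance m*p_i*(1 - p_i) and the covariance -m*p_i*p_j.
    rng = np.random.default_rng(3)
    m, p = 20, np.array([0.2, 0.3, 0.5])
    counts = rng.multinomial(m, p, size=500_000)     # each row is one draw (X1, X2, X3)

    print(counts[:, 0].var(), m * p[0] * (1 - p[0]))                   # Var(X1)
    print(np.cov(counts[:, 0], counts[:, 1])[0, 1], -m * p[0] * p[1])  # Cov(X1, X2)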
Mgf of mutually independent random variables

Let (X1 , · · · , Xn ) be mutually independent r.v.s.

• Let g1 , · · · , gn be real-valued functions such that gi (xi ) is a
  function only of xi , i = 1, · · · , n. Then

      E (Π_{i=1}^n gi (Xi )) = Π_{i=1}^n E (gi (Xi )).

• Let MX1 (t), · · · , MXn (t) be their mgfs, and Z = X1 + · · · + Xn .
  Then the mgf of Z is

      MZ (t) = Π_{i=1}^n MXi (t).

70 / 97
Mgf of mutually independent random variables Cont’d
Corollary

Let (X1 , · · · , Xn ) be mutually independent r.v.s with mgfs
MX1 (t), · · · , MXn (t). Let ai and bi be fixed constants,
and Z = Σ_{i=1}^n (ai Xi + bi ). Then the mgf of Z is

    MZ (t) = e^{t(Σ bi )} Π_{i=1}^n MXi (ai t).

Example
Let (X1 , · · · , Xn ) be mutually independent r.v.s, where Xi has a
Gamma(αi , β) distribution with mgf MXi (t) = (1 − βt)^{−αi } , t < 1/β.
Then the mgf of Z = X1 + · · · + Xn is

    MZ (t) = Π_{i=1}^n MXi (t) = Π_{i=1}^n (1 − βt)^{−αi } = (1 − βt)^{−Σ_{i=1}^n αi } .

This is the mgf of a Gamma(Σ_{i=1}^n αi , β) distribution.
71 / 97
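A simulation sketch of this example (Python with NumPy/SciPy; the shape parameters, scale, and sample size are arbitrary choices) compares the empirical distribution of the sum with the claimed Gamma(Σαi, β) distribution.

    import numpy as np
    from scipy.stats import gamma, kstest

    # Sketch of the Gamma example: a sum of independent Gamma(alpha_i, beta)
    # r.v.s with a common scale beta is Gamma(sum(alpha_i), beta).
    rng = np.random.default_rng(4)
    alphas, beta = [0.5, 1.0, 2.5], 2.0
    z = sum(rng.gamma(shape=a, scale=beta, size=200_000) for a in alphas)

    # Kolmogorov-Smirnov test against Gamma(sum(alphas), scale=beta); a large
    # p-value is consistent with the mgf argument above.
    print(kstest(z, gamma(a=sum(alphas), scale=beta).cdf))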
Linear combination of independent normal r.v.s

Let (X1 , · · · , Xn ) be mutually independent r.v.s with Xi ∼
N(µi , σi²). Let ai and bi be fixed constants. Then

    Z = Σ_{i=1}^n (ai Xi + bi ) ∼ N( Σ_{i=1}^n (ai µi + bi ), Σ_{i=1}^n ai² σi² ).

• A linear combination of independent normal r.v.s is normally
  distributed.
• It can be proved with the above corollary.

72 / 97
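The sketch below (Python/NumPy; the coefficients and parameters are arbitrary) checks the stated mean and variance of Z by simulation; normality itself follows from the result above.

    import numpy as np

    # Sketch: Z = sum_i (a_i X_i + b_i) for independent normals has mean
    # sum(a_i mu_i + b_i) and variance sum(a_i^2 sigma_i^2).
    rng = np.random.default_rng(5)
    mus    = np.array([0.0, 1.0, -2.0])
    sigmas = np.array([1.0, 0.5, 2.0])
    a      = np.array([1.0, -3.0, 0.5])
    b      = np.array([0.0, 2.0, 1.0])

    x = rng.normal(mus, sigmas, size=(200_000, 3))   # each row: (X1, X2, X3)
    z = (a * x + b).sum(axis=1)

    print(z.mean(), (a * mus + b).sum())             # means agree
    print(z.var(), (a**2 * sigmas**2).sum())         # variances agree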
Generalization

Let (X1 , · · · , Xn ) be random vectors. Then X1 , · · · , Xn are mutually
independent random vectors if and only if there exist functions
gi (xi ) such that the joint pdf or pmf of (X1 , · · · , Xn ) can
be written as

    f (x1 , · · · , xn ) = Π_{i=1}^n gi (xi ).

Let X1 , · · · , Xn be random vectors and let gi (xi ) be a function
only of xi . Then the random variables Ui = gi (Xi ) are mutually
independent.

73 / 97
Tail bounds

Question
Consider the experiment of tossing a fair coin n times. What is
the probability that the number of heads exceeds 3n/4?

Notes
Tail bounds for a r.v. X concern the probability
that X deviates significantly from its expected value E (X ) on a
run of the experiment.

74 / 97
Markov inequality
Markov inequality

If X is a nonnegative r.v. and 0 < a < +∞, then

    P(X ≥ a) ≤ E (X )/a,   or equivalently   P(X ≥ aE (X )) ≤ 1/a.

Proof.

    P(X ≥ a) = ∫_{x≥a} f (x) dx ≤ ∫_{x≥a} (x/a) f (x) dx ≤ (1/a) ∫_0^∞ x f (x) dx = E (X )/a.

Example
For X = # heads in n tosses of a fair coin, E (X ) = n/2, so

    P(X > 3n/4) ≤ (n/2)/(3n/4) = 2/3.
75 / 97
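For the coin-tossing question, the Markov bound can be compared with the exact binomial tail. The sketch below (Python with SciPy; the values of n are illustrative) shows how loose the bound is.

    from scipy.stats import binom

    # A quick look at how loose Markov's bound is for P(X > 3n/4),
    # X ~ binomial(n, 1/2).
    for n in (20, 100, 1000):
        exact  = binom.sf(3 * n / 4, n, 0.5)   # exact P(X > 3n/4)
        markov = (n / 2) / (3 * n / 4)         # E(X)/(3n/4) = 2/3, independent of n
        print(n, exact, markov)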
Chebyshev’s inequality
Let X be a random variable and let g (x) be a nonnegative
function. Then, for any r > 0,

    P(g (X ) ≥ r ) ≤ E (g (X ))/r .

Proof.

    E (g (X )) = ∫_{−∞}^{∞} g (x)fX (x) dx ≥ ∫_{x:g (x)≥r} g (x)fX (x) dx
              ≥ r ∫_{x:g (x)≥r} fX (x) dx = rP(g (X ) ≥ r )

Rearranging now produces the desired inequality.


76 / 97
The most widely used form of Chebyshev’s inequality
Let g (x) = (x − µ)²/σ², where µ = E (X ) and σ² = Var (X ). For
convenience write r = t². Then

    P((X − µ)²/σ² ≥ t²) ≤ E ((X − µ)²/σ²)/t² = 1/t².

• i.e., P(|X − µ| ≥ tσ) ≤ 1/t² and P(|X − µ| ≤ tσ) ≥ 1 − 1/t².
• For example, tossing a fair coin n times,

      P(X > 3n/4) < P(|X − n/2| > n/4) ≤ Var (X )/(n/4)² = 4/n.

• Many other probability inequalities exist similar in spirit to
  Chebyshev’s inequality, e.g., for t > 0,

      P(X ≥ a) ≤ MX (t)/e^{at} .
77 / 97
Chernoff bound
Deriving Chernoff bound

Let Xi be a sequence of independent r.v.s with P(Xi = 1) = pi
and P(Xi = 0) = 1 − pi , and let X = Σ_{i=1}^n Xi and µ = Σ_{i=1}^n pi .
Then for 0 < δ < 1,

• P(X < (1 − δ)µ) < ( e^{−δ} / (1 − δ)^{(1−δ)} )^µ
• P(X < (1 − δ)µ) < exp (−µδ²/2)

Proof.
For t > 0,

    P(X < (1 − δ)µ) = P( exp (−tX ) > exp (−t(1 − δ)µ) )
                    < Π_{i=1}^n E (exp (−tXi )) / exp (−t(1 − δ)µ) ,

by Markov’s inequality and the independence of the Xi .

78 / 97
Proof of Chernoff bound Cont’d
Note that 1 − x < e^{−x} if x > 0, so

    Π_{i=1}^n E (exp (−tXi )) = Π_{i=1}^n (pi e^{−t} + (1 − pi )) = Π_{i=1}^n (1 − pi (1 − e^{−t} ))
                              < Π_{i=1}^n exp (pi (e^{−t} − 1)) = exp (µ(e^{−t} − 1)).

That is,

    P(X < (1 − δ)µ) < exp (µ(e^{−t} − 1)) / exp (−t(1 − δ)µ) = exp (µ(e^{−t} + t − tδ − 1)).

Now it is time to choose t to make the bound as tight as possible.
Taking the derivative of µ(e^{−t} + t − tδ − 1) with respect to t and setting
−e^{−t} + 1 − δ = 0, we have t = ln (1/(1 − δ)) > 0, and

    P(X < (1 − δ)µ) < ( e^{−δ} / (1 − δ)^{(1−δ)} )^µ .
79 / 97
Proof of second statement
To get the simpler form of the bound, we need to get rid of the
clumsy term (1 − δ)^{(1−δ)} . Note that

    (1 − δ) ln (1 − δ) = (1 − δ)( − Σ_{i=1}^∞ δ^i /i ) > −δ + δ²/2 .

Thus, we have

    (1 − δ)^{(1−δ)} > exp (−δ + δ²/2) .

Furthermore,

    P(X < (1 − δ)µ) < ( e^{−δ} / (1 − δ)^{(1−δ)} )^µ < ( e^{−δ} / e^{(−δ+δ²/2)} )^µ = exp (−µδ²/2).
80 / 97
Chernoff bound (Upper tail)
Theorem
Let Xi be a sequence of independent r.v.s with P(Xi = 1) = pi
and P(Xi = 0) = 1 − pi , and let X = Σ_{i=1}^n Xi and µ = Σ_{i=1}^n pi .

• P(X > (1 + δ)µ) < ( e^δ / (1 + δ)^{(1+δ)} )^µ
• P(X > (1 + δ)µ) < exp (−µδ²/4)

Example
Let X be # heads in n tosses of a fair coin. Then µ = n/2, and with
δ = 1/2 we have

    P(X > 3n/4) = P(X > (1 + 1/2)(n/2)) < exp (−(n/2)δ²/4) = exp (−n/32).

If we toss the coin 1000 times, the probability is less than
exp (−125/4).
81 / 97
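Putting the bounds side by side for the running coin example: the sketch below (Python with SciPy; the values of n are illustrative) prints the exact tail probability together with the Markov, Chebyshev, and Chernoff bounds derived above.

    import math
    from scipy.stats import binom

    # Comparison sketch of the bounds for P(X > 3n/4), X ~ binomial(n, 1/2).
    for n in (100, 1000):
        exact     = binom.sf(3 * n / 4, n, 0.5)
        markov    = 2 / 3                      # E(X)/(3n/4)
        chebyshev = 4 / n                      # Var(X)/(n/4)^2
        chernoff  = math.exp(-n / 32)          # exp(-mu * delta^2 / 4) with delta = 1/2
        print(n, exact, markov, chebyshev, chernoff)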
Hoeffding inequality
Let X1 , X2 , · · · , Xn be i.i.d. observations such that E (Xi ) = µ
and a ≤ Xi ≤ b. Then, for any ε > 0,

    P(|X̄ − µ| > ε) ≤ 2 exp (−2nε²/(b − a)²),

where X̄ = (1/n) Σ_{i=1}^n Xi is the sample mean.

Example
If X1 , X2 , · · · , Xn ∼ Bernoulli(p),
• in terms of Hoeffding’s inequality, we have

      P(|X̄ − p| > ε) ≤ 2 exp (−2nε²);

• if p = 0.5 and ε = 1/4,

      P(X̄ − 0.5 > 1/4) < P(|X̄ − 0.5| > 1/4) ≤ 2 exp (−n/8).
82 / 97
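The sketch below (Python/NumPy; n, the number of replications, and the seed are arbitrary) estimates the left-hand side of the Bernoulli(0.5) example by simulation and compares it with the Hoeffding bound.

    import numpy as np

    # Empirical sketch of the Bernoulli(0.5) example: the observed frequency of
    # |mean - 0.5| > 1/4 sits below the Hoeffding bound 2*exp(-n/8).
    rng = np.random.default_rng(6)
    n, reps, eps = 40, 200_000, 0.25
    means = rng.binomial(n, 0.5, size=reps) / n      # sample means of n Bernoulli(0.5) draws

    print((np.abs(means - 0.5) > eps).mean())        # empirical probability
    print(2 * np.exp(-2 * n * eps**2))               # Hoeffding bound = 2*exp(-n/8)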
Outline
Joint and Marginal Distributions
Conditional Distribution and Independence
Bivariate Transformations
Hierarchical Models and Mixture Distributions
Hierarchical Models and Mixture Distributions
Covariance and Correlation
Multivariate Distributions
Inequalities
Numerical Inequalities
Functional Inequalities
Take-aways
83 / 97
Lemma
Let a and b be any positive numbers, and let p and q be any
positive numbers satisfying 1/p + 1/q = 1. Then

    (1/p)a^p + (1/q)b^q ≥ ab,

with equality if and only if a^p = b^q .

Proof.
Fix b, and consider the function

    g (a) = (1/p)a^p + (1/q)b^q − ab.

To minimize g (a), differentiate and set equal to 0:

    (d/da) g (a) = 0 ⇒ a^{p−1} − b = 0 ⇒ b = a^{p−1} .
84 / 97
Proof cont’d

A check of the second derivative will establish that this is indeed
a minimum. Noting that (p − 1)q = p, the value of the function
at the minimum is

    (1/p)a^p + (1/q)(a^{p−1} )^q − a·a^{p−1} = (1/p)a^p + (1/q)a^p − a^p = 0.

Since the minimum is unique, equality holds only if a^{p−1} = b,
which is equivalent to a^p = b^q .

The inequalities in this subsection, although often stated in terms of
expectations, rely mainly on properties of numbers. In fact, they are
all based on the preceding simple lemma.

85 / 97
Hölder’s inequality
Let X and Y be any two r.v.s, and let p and q satisfy 1/p + 1/q = 1.
Then

    |E (XY )| ≤ E |XY | ≤ (E |X |^p )^{1/p} (E |Y |^q )^{1/q} .

Proof.
The first inequality follows from −|XY | ≤ XY ≤ |XY |. To
prove the second inequality, define

    a = |X |/(E |X |^p )^{1/p}   and   b = |Y |/(E |Y |^q )^{1/q} .

Applying the above lemma,

    (1/p) |X |^p /E |X |^p + (1/q) |Y |^q /E |Y |^q ≥ |XY |/((E |X |^p )^{1/p} (E |Y |^q )^{1/q} ).
Now take expectations of both sides. The expectation of the
left-hand side is 1, and rearrangement gives the conclusion.
86 / 97
Cauchy-Schwarz inequality
For any two r.v.s X and Y ,

    |E (XY )| ≤ E |XY | ≤ (E |X |²)^{1/2} (E |Y |²)^{1/2} .

Perhaps the most famous special case of Hölder’s inequality is
that for which p = q = 2.

Example: covariance inequality

If X and Y have means µX and µY , and variances σX² and σY² ,
respectively, we can apply the Cauchy-Schwarz inequality to get

    E |(X − µX )(Y − µY )| ≤ {E (X − µX )²}^{1/2} {E (Y − µY )²}^{1/2} .

Squaring both sides and using statistical notation, we have

    (Cov (X , Y ))² ≤ σX² σY² .
87 / 97
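A quick numerical illustration of the covariance inequality (Python/NumPy; the particular dependent pair is an arbitrary choice):

    import numpy as np

    # Numerical check of Cov(X, Y)^2 <= sigmaX^2 * sigmaY^2 on one dependent pair.
    rng = np.random.default_rng(7)
    x = rng.exponential(scale=2.0, size=200_000)
    y = np.sqrt(x) + rng.normal(size=x.size)         # dependence on X plus noise

    lhs = np.cov(x, y)[0, 1] ** 2
    rhs = x.var() * y.var()
    print(lhs, rhs, lhs <= rhs)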
Special cases of Hölder’s inequality
If we set Y = 1, we get

    E |X | ≤ (E |X |^p )^{1/p} ,   1 < p < ∞.

For 1 < r < p, if we replace |X | by |X |^r , we obtain

    E |X |^r ≤ (E |X |^{pr} )^{1/p} ,   1 < p < ∞.

Now write s = pr (note that s > r ) and rearrange terms to get

    {E |X |^r }^{1/r} ≤ (E |X |^s )^{1/s} ,   1 < r < s < ∞,

which is known as Liapounov’s inequality.
88 / 97
Minkowski’s inequality
Let X and Y be any two r.v.s. Then for 1 ≤ p < ∞,

    {E |X + Y |^p }^{1/p} ≤ {E |X |^p }^{1/p} + {E |Y |^p }^{1/p} .

Proof:

    E |X + Y |^p = E (|X + Y ||X + Y |^{p−1} )
                 ≤ E (|X ||X + Y |^{p−1} ) + E (|Y ||X + Y |^{p−1} ),

where we have used the fact that |X + Y | ≤ |X | + |Y |.
Now apply Hölder’s inequality to each expectation on the right-
hand side of the above inequality to get

    E |X + Y |^p ≤ {E |X |^p }^{1/p} {E |X + Y |^{q(p−1)} }^{1/q} + {E |Y |^p }^{1/p} {E |X + Y |^{q(p−1)} }^{1/q} .

Now divide through by {E |X + Y |^{q(p−1)} }^{1/q} ; noting that q(p − 1) = p
and 1 − 1/q = 1/p, we obtain the conclusion.
89 / 97
A new version of Hölder’s inequality

For numbers ai and bi , i = 1, 2, · · · , n, the inequality

    Σ_{i=1}^n |ai bi | ≤ ( Σ_{i=1}^n |ai |^p )^{1/p} ( Σ_{i=1}^n |bi |^q )^{1/q} ,   1/p + 1/q = 1,

also holds. An important special case of the conclusion occurs when bi = 1
and p = q = 2. We then have

    (1/n) ( Σ_{i=1}^n |ai | )² ≤ Σ_{i=1}^n ai² .
90 / 97
Outline
Joint and Marginal Distributions
Conditional Distribution and Independence
Bivariate Transformations
Hierarchical Models and Mixture Distributions
Hierarchical Models and Mixture Distributions
Covariance and Correlation
Multivariate Distributions
Inequalities
Numerical Inequalities
Functional Inequalities
Take-aways
91 / 97
Convex inequality
A function g (x) is convex if

    g (λx + (1 − λ)y ) ≤ λg (x) + (1 − λ)g (y ),

for all x and y , and 0 < λ < 1. The function g (x) is concave if
−g (x) is convex.
Informally, we can think of convex functions as functions that
“hold water”, that is, they are bowl-shaped (g (x) = x² is convex),
while concave functions “spill water” (g (x) = log x is concave).
More formally, convex functions lie below lines connecting any
two points. As λ ranges from 0 to 1, λg (x1 ) + (1 − λ)g (x2 ) defines
a line connecting g (x1 ) and g (x2 ). This line lies above g (x) if
g (x) is convex.
92 / 97
Jensen’s inequality
For any r.v. X , if g (x) is a convex function, then

    E (g (X )) ≥ g (E (X )).

Equality holds if and only if, for every line a + bx that is tangent
to g (x) at x = E (X ), P(g (X ) = a + bX ) = 1.

Proof.
To establish the inequality, let l(x) be a tangent line to g (x) at
the point (E (X ), g (E (X ))). Write l(x) = a + bx for some a and b.
Now, by the convexity of g we have g (x) ≥ a + bx. Since
expectations preserve inequalities,

    E (g (X )) ≥ E (a + bX ) = a + bE (X ) = l(E (X )) = g (E (X )).

One immediate application of Jensen’s Inequality shows that
E (X²) ≥ (E (X ))², since g (x) = x² is convex.
93 / 97
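Two quick numerical instances of Jensen’s inequality (Python/NumPy; the Gamma distribution used for X is an arbitrary positive-valued choice):

    import numpy as np

    # Numeric sketch of Jensen's inequality for two convex choices of g
    # applied to an arbitrary positive r.v. X.
    rng = np.random.default_rng(8)
    x = rng.gamma(shape=2.0, scale=1.5, size=200_000)

    print((x**2).mean(), x.mean()**2)      # E(X^2) >= (E X)^2   (g(x) = x^2)
    print((1 / x).mean(), 1 / x.mean())    # E(1/X) >= 1/E(X)    (g(x) = 1/x on x > 0)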
An inequality for means
Jensen’s inequality can be used to prove an inequality between
three different kinds of means. If a1 , · · · , an are positive numbers, define

    aA = (1/n)(a1 + a2 + · · · + an ),   (arithmetic mean)
    aG = (a1 · a2 · · · · · an )^{1/n} ,   (geometric mean)
    aH = 1 / ( (1/n)(1/a1 + 1/a2 + · · · + 1/an ) ).   (harmonic mean)

An inequality relating these means is

    aH ≤ aG ≤ aA .

To apply Jensen’s inequality, let X be a r.v. with range
a1 , · · · , an and P(X = ai ) = 1/n, i = 1, · · · , n.
94 / 97
An inequality for means Cont’d
Since log x is a concave function, Jensen’s inequality shows that
E (log X ) ≤ log E (X ); hence

    log aG = (1/n) Σ_{i=1}^n log ai = E (log X ) ≤ log E (X ) = log aA ,

so aG ≤ aA .
Now again use the fact that log x is concave to get

    log (1/aH ) = log ( (1/n) Σ_{i=1}^n (1/ai ) ) = log E (1/X ) ≥ E (log (1/X )) = −E (log X ).

Since E (log X ) = log aG , it then follows that log (1/aH ) ≥ log (1/aG ),
or aG ≥ aH .
95 / 97
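A quick numerical check of aH ≤ aG ≤ aA (Python/NumPy; the numbers a1, ..., an are arbitrary positive values):

    import numpy as np

    # Sketch of the harmonic-geometric-arithmetic mean inequality.
    a = np.array([0.3, 1.0, 2.5, 7.0, 11.0])

    a_arith = a.mean()
    a_geom  = np.exp(np.log(a).mean())
    a_harm  = 1 / (1 / a).mean()
    print(a_harm, a_geom, a_arith, a_harm <= a_geom <= a_arith)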
Covariance inequality
If X is a r.v. with finite mean µ and g (x) is a nondecreasing function,
then E (g (X )(X − µ)) ≥ 0, since

    E (g (X )(X − µ)) = E (g (X )(X − µ)[I(−∞,0) (X − µ) + I(0,∞) (X − µ)])
                      ≥ E (g (µ)(X − µ)I(−∞,0) (X − µ)) + E (g (µ)(X − µ)I(0,∞) (X − µ))
                      = g (µ)E (X − µ) = 0,

where the inequality holds because g is nondecreasing: g (X )(X − µ) ≥ g (µ)(X − µ)
on both events.

Theorem

If X is a r.v. and g (x) and h(x) are any functions s.t. E (g (X )), E (h(X )),
and E (g (X )h(X )) exist:
• If g (x) is nondecreasing and h(x) is nonincreasing, then
  E (g (X )h(X )) ≤ E (g (X ))E (h(X )).
• If g (x) and h(x) are both nondecreasing or both nonincreasing, then
  E (g (X )h(X )) ≥ E (g (X ))E (h(X )).
96 / 97
Take-aways

Conclusions
 Joint and marginal distributions
 Continuous distributions
 Independence
 Bivariate transformation
 Hierarchical models and mixture distributions
 Multivariate distribution
 Inequalities

97 / 97
