
DOUGLAS RUBIN, PhD

A Complete Solutions Guide to Pishro-Nik's
Introduction to Probability, Statistics and Random Processes
Contents

1  Basic Concepts

2  Combinatorics: Counting Methods

3  Discrete Random Variables

4  Continuous and Mixed Random Variables

5  Joint Distributions

6  Methods for More Than Two Random Variables

7  Limit Theorems and Convergence of Random Variables

8  Statistical Inference I: Classical Methods

9  Statistical Inference II: Bayesian Inference

10 Introduction to Simulation Using Python

11 Recursive Methods
Chapter 1

Basic Concepts


Problem 1.

(a)
A ∪ B = {1, 2, 3} ∪ {2, 3, 4, 5, 6, 7} = {1, 2, 3, 4, 5, 6, 7}

(b)

(A ∪ C) − B = {1, 2, 3} ∪ {7, 8, 9, 10} − {2, 3, 4, 5, 6, 7}


= {1, 2, 3, 7, 8, 9, 10} − {2, 3, 4, 5, 6, 7}
= {1, 8, 9, 10}

(c)

Ā ∪ (B − C) = ({1, 2, . . . , 10} − A) ∪ ({2, 3, 4, 5, 6, 7} − {7, 8, 9, 10})


= {4, 5, 6, 7, 8, 9, 10} ∪ {2, 3, 4, 5, 6}
= {2, 3, . . . , 10}

(d) No, A, B and C do not partition S, since 2 and 3 belong to both A and B, and 7 belongs to both B and C.

Problem 2.

(a)
[6, 8] ∪ [2, 7) = [2, 8]

(b)
[6, 8] ∩ [2, 7) = [6, 7)

(c)
[0, 1]c = (−∞, 0) ∪ (1, ∞)

(d)
[6, 8] − (2, 7) = [7, 8]

Problem 3.

(a)

(A ∪ B) − (A ∩ B) = (A ∪ B) ∩ (A ∩ B)^c
                  = (A ∪ B) ∩ (A^c ∪ B^c),

where I have used De Morgan.

(b)
B − C = B ∩ Cc

(c)
(A ∩ C) ∪ (A ∩ B)

(d)
(C − A − B) ∪ ((A ∩ B) − C)

Problem 4.
(a)
A = {(H, H), (H, T )}

(b)
B = {(H, T ), (T, H), (T, T )}

(c)
C = {(H, T ), (T, H)}

Problem 5.
(a) |A2 | is half of the numbers from 1 to 100, so |A2 | = 50. To solve for |A3 | note that there are 2
numbers between each pair of elements in A3 where A3 is assumed to be pre-sorted (e.g., 4, 5
are between 3 and 6). There are also |A3 |−1 of these pairs, and thus |A3 |+2(|A3 |−1)+3 = 100,
where I have added 3 to account for the numbers at the beginning and end of the sequence
which are not divisible by 3 (1, 2 and 100). Thus, I find that |A3 | = 33. |A4 | is exactly half
of |A2 |, and thus |A4 | = 25. Finally, to solve for |A5 | we may use the same method we used
to solve for |A3 |: |A5 | + 4(|A5 | − 1) + 4 = 100, from which we find that |A5 | = 20.

(b) By inclusion-exclusion:

|A2 ∪ A3 ∪ A5 | = |A2 | + |A3 | + |A5 | − |A2 ∩ A3 | − |A2 ∩ A5 | − |A3 ∩ A5 | + |A2 ∩ A3 ∩ A5 |.

Note that |A2 ∩ A3 | = |A6 | = 16, |A2 ∩ A5 | = |A10 | = 10, |A3 ∩ A5 | = |A15 | = 6, where |A10 |
and |A15 | were found by counting (since there are very few elements in these sets), and |A6 |
was found by the same method I used to compute |A3 |. Lastly, the intersection of all 3 sets
is given by the set of multiples of 30, so that |A2 ∩ A3 ∩ A5 | = |{30, 60, 90}| = 3. Therefore:
|A2 ∪ A3 ∪ A5 | = 50 + 33 + 20 − 16 − 10 − 6 + 3 = 74.
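As a quick sanity check (not part of the original solution), these counts can be verified by brute force; the following is a minimal sketch assuming a standard Python 3 interpreter:

# Brute-force check of the counts used in Problem 5 for the integers 1..100.
S = range(1, 101)
A2 = {n for n in S if n % 2 == 0}
A3 = {n for n in S if n % 3 == 0}
A5 = {n for n in S if n % 5 == 0}
print(len(A2), len(A3), len(A5))   # 50 33 20
print(len(A2 | A3 | A5))           # 74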

Problem 6. From the following figure, it is clear that |B| = 10 + 20 + 15 = 45.


Figure: Venn diagram of S in which B overlaps A1, A2 and A3; the three overlap regions contain 10, 20 and 15 elements of B, respectively.

Problem 7.
(a) A is a subset of a countable set, N, and is thus countable.

(b) As shown in the book, if we can write any set, S in the form:
S = ∪_{i∈B} ∪_{j∈C} {q_ij},    (1.1)

where B and C are countable sets, then S is a countable set. It is easy to see that we may
re-write B as:

B = ∪_{i∈Q} ∪_{j∈Q} {a_i + b_j √2},    (1.2)

where q_ij ≡ a_i + b_j √2, and thus B is countable.

(c) C is uncountable. One way to prove this is to note that for all x ∈ [0, 1], (x, 0) ∈ C, so that
C ⊃ [0, 1], i.e, C is a superset of an uncountable set and is thus uncountable.

Problem 8. I first prove that An ⊂ An+1 (a proper subset) for all n = 1, 2, . . .. To do this, it
suffices to prove that (n − 1)/n < n/(n + 1), which I do with proof by contradiction. By assuming
(n − 1)/n ≥ n/(n + 1), after a little algebra, one concludes that −1 ≥ 0, which is clearly a
contradiction, and therefore (n − 1)/n < n/(n + 1). Thus the union of all the A_n s is given by the
largest set in the sequence, A_∞ = lim_{n→∞} [0, (n − 1)/n). Since (n − 1)/n → 1 as n → ∞ (e.g., by
L'Hôpital's rule), A_∞ = [0, 1), and thus:

A = ∪_{n=1}^{∞} A_n = [0, 1).

Problem 9. As with the previous problem, one may show that An+1 ⊂ An for all n = 1, 2, . . . by
proving that 1/(n + 1) < 1/n. This is somewhat obvious, but if you really want to be formal, you
can prove it with a proof by contradiction. Therefore, the intersection of all the An s is given by
the smallest limiting set A_∞: 0 belongs to every A_n = [0, 1/n), while any x > 0 is eventually excluded, so A_∞ = {0}, and thus:

A = ∩_{n=1}^{∞} A_n = {0}.

Problem 10.

(a) To motivate the bijection (the one-to-one mapping between 2N and C) we are about to
construct, note that for every set in 2N , a natural number n will either appear once, or not at
all. Therefore, it is convenient to indicate its presence in the set with a 1 and its absence with
a 0. For example {1, 3, 6} will get mapped to the sequence 101001000 . . . (this is implicitly
assuming that we have pre-ordered the elements in the particular set from 2N ). In general,
the bijective mapping we use f : 2N → C, is given by:

f (x) = 1(1 ∈ x)1(2 ∈ x) . . . ,

where 1 is the so-called indicator function which is 1 if its argument evaluates to true and
0 otherwise. To prove that this mapping is bijective, we must prove it is both injective and
surjective.
To prove it is injective, I use a proof by contradiction. Assume it is not injective. Under this
assumption there exists x, x′ ∈ 2^N, where x ≠ x′, such that f (x) = f (x′). x and x′ can either
have the same cardinality, or they can be different. Without loss of generality, if they are
different, let us call x the one with the larger cardinality. Since x ≠ x′ there exists at least 1
natural number n in x which is not in x′. Therefore in the sequences f (x) and f (x′), there
is at least one value in the sequences which does not match up, namely the value at position
n, and therefore f (x) ≠ f (x′), which violates our original assumption.
The proof of surjectivity is also straightforward.

(b) Any number x ∈ [0, 1) has a binary expansion given by x = b_1/2 + b_2/2^2 + . . ., and therefore
we can construct a bijective mapping between [0, 1) and C by mapping x to its sequence of binary
digits b_1 b_2 . . . (i.e., dropping the leading "0." of the expansion). Since there is a bijection between
2^N and C and a bijection between C and [0, 1) (and given the fact that the composition of 2
bijections is a bijection), there is thus a bijection between 2^N and [0, 1). Assuming (correctly so)
that the interval [0, 1) is uncountable, then so too is 2^N.

Problem 11. As shown in the previous problem, there is a bijection between [0, 1) and C. There-
fore, if C is uncountable, then so too is [0, 1). We can use what is known as Cantor’s diagonal
argument to prove that C is uncountable.
Let us try to search for a bijective mapping between C and N. Suppose, for example, that the
first few mappings are given by:

1 → 0000000 . . .
2 → 1111111 . . .
3 → 0101010 . . .
4 → 1010101 . . .
5 → 1101011 . . .
6 → 0011011 . . .
7 → 1000100 . . .
..
.
Let us now construct a new sequence, s ∈ C by enumerating the complement of the elements
along the diagonal of the mapping (which I have highlighted in boldface above), s = 1011101 . . ..
By construction, s differs from every proposed mapping since the nth digit in s is different than
the nth digits in all of the mappings. Thus, no natural number gets mapped to s, and hence
the proposed mapping is not surjective. The mappings I chose for illustration in this example for
1, . . . , 7 were arbitrary, and this argument applies to any potential mapping. Therefore, there is
no bijective mapping between N and C, and hence no bijection between [0, 1) and N. Thus, the
interval [0, 1) is uncountable.

Problem 12.

(a) The domain is {H, T }3 and the codomain is N ∪ {0}

(b) range(f ) = {0, 1, 2, 3}

(c) x can be all triplets that contain exactly 2 heads: (H, H, T ), (H, T, H) or (T, H, H).

Problem 13.

(a) The universal set is partitioned by the events a, b, d, and thus P (b) = 1 − P (a) − P (d) =
1 − 0.5 − 0.25 = 0.25.

(b) Since the events b and d are disjoint, by the 3rd axiom of probability, P (b∪d) = P (b)+P (d) =
0.25 + 0.25 = 0.5.

Problem 14.

(a) By inclusion-exclusion: P (A ∩ B) = P (A) + P (B) − P (A ∪ B) = 0.4 + 0.7 − 0.9 = 0.2.



(b)

P (Ac ∩ B) = P (B − A)
= P (B) − P (A ∩ B)
= 0.7 − 0.2
= 0.5

(c)

P (A − B) = P (A) − P (A ∩ B)
= 0.4 − 0.2
= 0.2

(d) By drawing the Venn diagram, one can see that

P (Ac − B) = P (S) − P (A ∪ B)
= 1 − 0.9
= 0.1,

where S is the universal set.

(e) By drawing the Venn diagram, one can see that

P (Ac ∪ B) = P (S) − P (A − B)
= 1 − 0.2
= 0.8.

(f)

P (A ∩ (B ∪ Ac )) = P ((A ∩ B) ∪ (A ∩ Ac ))
= P ((A ∩ B) ∪ ∅)
= P (A ∩ B)
= 0.2.

Problem 15.

(a) The second roll is independent of the first, so we only need to consider the second roll, in
which case P (X2 = 4) = 1/6 since this is a finite sample space with equal probabilities for all
outcomes.

(b) The sample space is {1, 2, . . . , 6}×{1, 2, . . . , 6}, which has a cardinality of 36, and the possible
outcomes corresponding to the event that X1 + X2 = 7 are given by the set
{(1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3)}, which has a cardinality of 6, and therefore P (X1 +
X2 = 7) = 6/36 = 1/6.

(c) Listing out the tuples that satisfy the second condition in a matrix-like representation, we
have:
(1, 4)  (1, 5)  (1, 6)
(2, 4)  (2, 5)  (2, 6)
  ...
(6, 4)  (6, 5)  (6, 6),

of which there are 3 × 6 elements. However, the first condition does not allow the elements
(2, 4), (2, 5), (2, 6), and thus the total size of the event space is 3 × 6 − 3 = 15. Thus
P (X1 ≠ 2 ∩ X2 ≥ 4) = 15/36 = 5/12.
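These counts are easy to confirm by enumerating the 36 equally likely outcomes; the snippet below is a quick check (not part of the original solution), assuming Python 3 and its standard library:

# Enumerate the outcomes of two fair die rolls to check parts (b) and (c).
from fractions import Fraction

outcomes = [(x1, x2) for x1 in range(1, 7) for x2 in range(1, 7)]
p_b = Fraction(sum(1 for x1, x2 in outcomes if x1 + x2 == 7), len(outcomes))
p_c = Fraction(sum(1 for x1, x2 in outcomes if x1 != 2 and x2 >= 4), len(outcomes))
print(p_b, p_c)   # 1/6 5/12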

Problem 16.
(a) The formula for a geometric series will be useful here: Σ_{k=0}^{∞} c r^k = c/(1 − r) for |r| < 1. To
solve for c, we can use the normalization constraint:

1 = Σ_{k=1}^{∞} P (k)
  = −c + Σ_{k=0}^{∞} c (1/3)^k
  = −c + c/(1 − 1/3),

and therefore c = 2.

(b)

P ({2, 4, 6}) = P ({2} ∪ {4} ∪ {6})
             = P (2) + P (4) + P (6)
             = 2 (1/3^2 + 1/3^4 + 1/3^6)
             = 182/729

(c)

P ({3, 4, 5, . . .}) = 2 Σ_{k=3}^{∞} (1/3)^k
                    = −2 (1 + 1/3 + 1/9) + 2 Σ_{k=0}^{∞} (1/3)^k
                    = −2 (13/9) + 2 (3/2)
                    = 1/9

This answer may also be computed as 1 − P (1) − P (2).
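A quick numerical check of these results (not part of the original solution), assuming a standard Python 3 interpreter:

# Check the PMF P(k) = 2*(1/3)**k, k = 1, 2, ...
P = lambda k: 2 * (1 / 3) ** k
print(sum(P(k) for k in range(1, 200)))           # ~1.0  (normalization, so c = 2)
print(P(2) + P(4) + P(6), 182 / 729)              # both ~0.2497
print(sum(P(k) for k in range(3, 200)), 1 / 9)    # both ~0.1111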

Problem 17. Let us write down what we know in equations. Let a, b, c, d represent the events
that teams, A, B, C and D win the tournament respectively. Then as stated in the problem,
P (a) = P (b), P (c) = 2P (d) and P (a ∪ c) = 0.6. Since the events partition the sample space,
P (a ∪ c) = P (a) + P (c). We know one more equation, which is that the probabilities must sum to
one: P (a) + P (b) + P (c) + P (d) = 1. We therefore have a linear system with 4 equations and 4
unknowns, and it will thus be convenient to write this in matrix notation in order to solve for the
probabilities:

[ 1 −1  0  0 ] [ P(a) ]   [ 0   ]
[ 0  0  1 −2 ] [ P(b) ] = [ 0   ]
[ 1  0  1  0 ] [ P(c) ]   [ 0.6 ]
[ 1  1  1  1 ] [ P(d) ]   [ 1   ]

=⇒

[ P(a) ]   [ 1 −1  0  0 ]^{−1} [ 0   ]   [  2  1 −3  2 ] [ 0   ]   [ 0.2 ]
[ P(b) ] = [ 0  0  1 −2 ]      [ 0   ] = [  1  1 −3  2 ] [ 0   ] = [ 0.2 ]
[ P(c) ]   [ 1  0  1  0 ]      [ 0.6 ]   [ −2 −1  4 −2 ] [ 0.6 ]   [ 0.4 ]
[ P(d) ]   [ 1  1  1  1 ]      [ 1   ]   [ −1 −1  2 −1 ] [ 1   ]   [ 0.2 ]

Notice that, as required, the probabilities sum to 1.
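The same system can be solved numerically; the snippet below is a minimal check (not part of the original solution), assuming numpy is installed:

# Solve the 4x4 linear system for the tournament win probabilities.
import numpy as np

M = np.array([[1, -1, 0,  0],
              [0,  0, 1, -2],
              [1,  0, 1,  0],
              [1,  1, 1,  1]], dtype=float)
b = np.array([0, 0, 0.6, 1])
print(np.linalg.solve(M, b))   # [0.2 0.2 0.4 0.2]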

Problem 18.
(a) P (T ≤ 1) = 1/16

(b)

P (T > 2) = 1 − P (T ≤ 2)
          = 1 − 4/16
          = 3/4

(c)

P (1 ≤ T ≤ 3) = P (T ≤ 3) − P (T < 1)
              = 9/16 − 1/16
              = 1/2

Problem 19. The solutions to the quadratic are given by the quadratic formula:

X = (−1 ± √(1 − 4AB)) / (2A),    (1.3)

Figure 1.1: The region of the unit square (below the curve y = 1/(4x)) giving real solutions.

which has real solutions iff the condition 1 − 4AB ≥ 0 is satisfied. We therefore seek P (1 − 4AB ≥ 0),
which, since the point (A, B) is picked uniformly in the unit square, is the fraction of area in the
unit square which satisfies this constraint. Therefore points which satisfy the following inequalities
contribute to this probability:

y ≤ (1/4)(1/x),
x ≤ 1,
and
y ≤ 1,

where the last 2 inequalities follow since the randomly drawn points must lie within the unit square.
The area in the unit square which satisfies these constraints is shown in Fig. 1.1.
It is clear from the figure that the area is given by:

P (real solns.) = 1/4 + ∫_{1/4}^{1} (1/(4x)) dx
               = 1/4 + (1/4) ln 4
               ≈ 0.60.
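A Monte Carlo estimate (not part of the original solution) agrees with this value; a minimal sketch assuming a standard Python 3 interpreter:

# Estimate P(1 - 4AB >= 0) for A, B drawn uniformly from [0, 1].
import random

trials = 1_000_000
hits = sum(1 - 4 * random.random() * random.random() >= 0 for _ in range(trials))
print(hits / trials)   # ~0.597, close to 1/4 + (1/4) ln 4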

Problem 20.

(a) To solve this problem, note that:



A ≡ ∪_{i=1}^{∞} A_i = A_1 ∪ (A_2 − A_1) ∪ (A_3 − A_2) ∪ . . . ,

where in the figure, A_1 is the innermost circle, (A_2 − A_1) is the "annulus" around A_1, (A_3 − A_2)
is the next "annulus" and so forth. It is clear that the union of A_1 and all of the annuli results
in A, and that these regions are pairwise disjoint. I utilize the previous equation in the
desired proof:

Figure 1.2: Venn diagram of events A1 , A2 , . . .

Proof.

P (A) = P (A_1) + Σ_{i=2}^{∞} P (A_i − A_{i−1})
      = P (A_1) + lim_{n→∞} Σ_{i=2}^{n} P (A_i − A_{i−1})
      = P (A_1) + lim_{n→∞} Σ_{i=2}^{n} [P (A_i) − P (A_{i−1})]
      = P (A_1) + lim_{n→∞} {[P (A_2) − P (A_1)] + [P (A_3) − P (A_2)] + [P (A_4) − P (A_3)]
        + . . . + [P (A_n) − P (A_{n−1})]}
      = P (A_1) + lim_{n→∞} [P (A_n) − P (A_1)]
      = lim_{n→∞} P (A_n).  ∎

(b) Redefining A ≡ ∩_{i=1}^{∞} A_i, we seek to find P (A). If A_1, A_2, . . . is a sequence of decreasing
events, then A_1^c, A_2^c, . . . must be a sequence of increasing events, and we can therefore utilize
the result of part (a) on the sequence of complements (as well as De Morgan):

P (A^c) = P ((∩_{i=1}^{∞} A_i)^c) = P (∪_{i=1}^{∞} A_i^c) = lim_{n→∞} P (A_n^c).

A few more steps completes the proof:

Proof.

P (A) = 1 − P (A^c)
      = 1 − lim_{n→∞} P (A_n^c)
      = lim_{n→∞} [1 − P (A_n^c)]
      = lim_{n→∞} P (A_n).  ∎

Problem 21.
(a) Let us define new events, B_i, such that B_1 = A_1, B_2 = A_2 − A_1, B_3 = A_3 − A_2 − A_1, . . .. Note
that the B_i s are disjoint. Also note that:

∪_{i=1}^{n} B_i = A_1 ∪ (A_2 − A_1) ∪ (A_3 − A_2 − A_1) ∪ . . . ∪ (A_n − A_{n−1} − . . . − A_1)
               = A_1 ∪ A_2 ∪ A_3 ∪ . . . ∪ A_n
               = ∪_{i=1}^{n} A_i,

and for the same reason ∪_{i=1}^{∞} B_i = ∪_{i=1}^{∞} A_i. Using these facts, the proof is now straightforward:

Proof.

P (∪_{i=1}^{∞} A_i) = P (∪_{i=1}^{∞} B_i)
                   = Σ_{i=1}^{∞} P (B_i)
                   = lim_{n→∞} Σ_{i=1}^{n} P (B_i)
                   = lim_{n→∞} P (∪_{i=1}^{n} B_i)
                   = lim_{n→∞} P (∪_{i=1}^{n} A_i).  ∎

(b) To prove this second result I use the previous result as well as De Morgan (twice):

Proof.

P (∩_{i=1}^{∞} A_i) = 1 − P (∪_{i=1}^{∞} A_i^c)
                   = 1 − lim_{n→∞} P (∪_{i=1}^{n} A_i^c)
                   = lim_{n→∞} [1 − P (∪_{i=1}^{n} A_i^c)]
                   = lim_{n→∞} P (∩_{i=1}^{n} A_i).  ∎

Problem 22. Let A_coffee be the event that a customer purchases coffee and A_cake be the event that a
customer purchases cake. We know that P (A_coffee) = 0.7, P (A_cake) = 0.4 and P (A_coffee, A_cake) = 0.2.
Thus, the conditional probability we seek is:

P (A_coffee |A_cake) = P (A_coffee, A_cake) / P (A_cake) = 0.2 / 0.4 = 0.5.

Problem 23.

(a)
P (A|B) = P (A ∩ B) / P (B) = (0.1 + 0.1) / (0.1 + 0.1 + 0.1 + 0.05) ≈ 0.57

(b)
P (C|B) = P (C ∩ B) / P (B) = (0.1 + 0.05) / (0.1 + 0.1 + 0.1 + 0.05) ≈ 0.43

(c)
P (B|A ∪ C) = P (B ∩ (A ∪ C)) / P (A ∪ C) = (0.1 + 0.1 + 0.05) / (0.1 + 0.2 + 0.1 + 0.1 + 0.05 + 0.15) ≈ 0.36

(d)
P (B|A, C) = P (B ∩ A ∩ C) / P (A ∩ C) = 0.1 / (0.1 + 0.1) = 0.5

Problem 24.

(a)
P (2 ≤ X ≤ 5) = 3/10 = 0.3

(b)
P (X ≤ 2|X ≤ 5) = 2/5 = 0.4

(c)
P (3 ≤ X ≤ 8|X ≥ 4) = P (3 ≤ X ≤ 8 ∩ X ≥ 4) / P (X ≥ 4) = 4/6 = 2/3

Problem 25. Let ON denote the event that a student lives on campus, OF F denote the event
that a student lives off campus and A denote the event that a student receives an A. Given the
data I compute the following probabilities:

P (ON ) ≈ 200/600 = 1/3

P (A) ≈ 120/600 = 1/5

P (A ∩ ON ) = P (A) − P (A ∩ OFF ) ≈ 1/5 − 80/600 = 1/15

If the events ON and A are independent, then P (A ∩ ON ) = P (A)P (ON ). Looking at the
probabilities above, we see that the data suggests this relationship, and thus the data suggests that
getting an A and living on campus are independent.

Problem 26. Let N1 be the number of times out of n that a 1 is rolled, N6 be the number of
times out of n that a 6 is rolled and Xi be the value of the ith roll. Then:

Figure 1.3: Tree diagram for Problem 27. The branches are P (G) = 0.8 with P (E|G) = 0.1 and
P (E^c|G) = 0.9, and P (G^c) = 0.2 with P (E|G^c) = 0.3 and P (E^c|G^c) = 0.7, giving the joint
probabilities 0.08, 0.72, 0.06 and 0.14 at the leaves.

P (N1 ≥ 1 ∩ N6 ≥ 1) = 1 − P ((N1 ≥ 1 ∩ N6 ≥ 1)^c)
 = 1 − P (N1 = 0 ∪ N6 = 0)
 = 1 − [P (X1 ≠ 1, X2 ≠ 1, . . . , Xn ≠ 1) + P (X1 ≠ 6, X2 ≠ 6, . . . , Xn ≠ 6)
   − P ((X1 ≠ 1, X2 ≠ 1, . . . , Xn ≠ 1) ∩ (X1 ≠ 6, X2 ≠ 6, . . . , Xn ≠ 6))]
 = 1 − [(5/6)^n + (5/6)^n − P (X1 ≠ 1, X1 ≠ 6, X2 ≠ 1, X2 ≠ 6, . . . , Xn ≠ 1, Xn ≠ 6)]
 = 1 − [2 (5/6)^n − P (X1 ≠ 1, X1 ≠ 6)^n]
 = 1 − [2 (5/6)^n − (4/6)^n]
 = 1 − (2 (5^n) − 4^n)/6^n.

In the second line I have used De Morgan, and I have also used the fact, several times, that the
outcome of roll i is independent of the outcome of roll j. Testing a few values of n, I find that,
when n = 1, the probability is 0, which makes sense because at the very minimum we would need
at least one 1 and one 6, which cannot happen if we have only rolled once. The probability then
monotonically increases, which also makes sense because it becomes more and more likely that we
roll at least one 1 and at least one 6 the more rolls we throw. Note that as a sanity check, one can
show that limn→∞ 1 − (2(5n ) − 4n )/(6n ) is 1, so that our formula for the probability is bounded
between 0 and 1. Also note that this formula can also be obtained more easily with combinatorics,
which will be introduced in Chapter 2.
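The formula can also be cross-checked against a simulation; the following sketch (not part of the original solution) assumes a standard Python 3 interpreter:

# Compare the closed-form probability with a simulation for a few values of n.
import random

def p_formula(n):
    return 1 - (2 * 5**n - 4**n) / 6**n

def p_sim(n, trials=200_000):
    hits = 0
    for _ in range(trials):
        rolls = [random.randint(1, 6) for _ in range(n)]
        hits += (1 in rolls) and (6 in rolls)
    return hits / trials

for n in (1, 2, 5, 10):
    print(n, round(p_formula(n), 4), round(p_sim(n), 4))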

Problem 27.

(a) Refer to Fig. 1.3.

(b) P (E) = P (E ∩ G) + P (E ∩ Gc ) = 0.08 + 0.06 = 0.14

(c) P (G|E^c) = P (G ∩ E^c) / P (E^c) = 0.72 / (1 − 0.14) ≈ 0.84.

Problem 28. Let Ai be the event that the ith (i = 1, 2, 3) unit of the 3 picks is defective, while
the other 2 are not defective. Note that A1 , A2 and A3 are all disjoint, since it is impossible for any
unit to be both defective and not defective simultaneously. The probability we seek is therefore:

P (A1 ∪ A2 ∪ A3 ) = P (A1 ) + P (A2 ) + P (A3 )
 = (5/100)(95/99)(94/98) + (95/100)(5/99)(94/98) + (95/100)(94/99)(5/98)
 ≈ 0.14.

Problem 29. Let F be the event that the system is functional, and Ci be the event that component
i is functional.

(a) P (F ) = P (C1 , C2 , C3 ) = P1 P2 P3

(b) By inclusion-exclusion:

P (F ) = P (C1 ∪ C2 ∪ C3 )
= P1 + P2 + P3 − P1 P2 − P1 P3 − P2 P3 + P1 P2 P3

(c)

P (F ) = P ((C1 , C3 ) ∪ (C2 , C3 ))
= P (C1 , C3 ) + P (C2 , C3 ) − P ((C1 ∩ C3 ) ∩ (C2 ∩ C3 ))
= P1 P3 + P2 P3 − P (C1 , C2 , C3 )
= P1 P3 + P2 P3 − P1 P2 P3

(d) P (F ) = P ((C1 ∩ C2 ) ∪ C3 ) = P1 P2 + P3 − P1 P2 P3

(e) P (F ) = P (C1 , C2 , C5 )+P (C3 , C4 , C5 )−P (C1 , C2 , C3 , C4 , C5 ) = P1 P2 P5 +P3 P4 P5 −P1 P2 P3 P4 P5

Problem 30.

(a) The region in the unit square corresponding to set A can be made more clear if we write the
absolute value as a piecewise function:

|x − y| ≤ 1/2  =⇒  { x − y ≤ 1/2 if x ≥ y,   y − x ≤ 1/2 if x < y }
               =⇒  { y ≥ x − 1/2 if y ≤ x,   y ≤ x + 1/2 if y > x }.

This piecewise description, along with the fact that A must be bounded in the unit square, leads
to the hashed region in Fig. 1.4. The region corresponding to set B is just the area in the unit
square above the 45° line (corresponding to the gray shaded region in Fig. 1.4).

(b) Using a little geometry, I find: P (A) = 1 − 2 · (1/2)(1/2)(1/2) = 3/4 and P (B) = 1/2.

(c) Again, using some geometry, I find: P (A ∩ B) = 1/2 − (1/2)(1/2)(1/2) = 3/8. Since P (A)P (B) =
(3/4)(1/2) = 3/8, the 2 events are indeed independent.
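A Monte Carlo check of this independence (not part of the original solution), assuming a standard Python 3 interpreter:

# Estimate P(A), P(B) and P(A and B) for points drawn uniformly from the unit square.
import random

trials = 1_000_000
nA = nB = nAB = 0
for _ in range(trials):
    x, y = random.random(), random.random()
    a = abs(x - y) <= 0.5
    b = y > x                      # the region above the 45-degree line
    nA += a
    nB += b
    nAB += a and b
print(nA / trials, nB / trials, nAB / trials, (nA / trials) * (nB / trials))
# ~0.75  ~0.50  ~0.375  ~0.375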

Problem 31.
Figure 1.4: The unit square for Problem 30, with the lines y = x + 0.5, y = x and y = x − 0.5 drawn.
The shaded region represents the set B and the hashed region represents the set A.

(a) Let s be the event that the received email is spam and r be the event that the received email
contains the word refinance. From the problem statement, we know that P (s) = 0.5 (so that
P (s^c) = 0.5), P (r|s) = 0.01 and P (r|s^c) = 0.00001. Using Bayes' rule:

P (s|r) = P (r|s)P (s) / P (r)
        = P (r|s)P (s) / [P (r|s)P (s) + P (r|s^c)P (s^c)]
        = (0.01)(0.5) / [(0.01)(0.5) + (0.00001)(0.5)]
        ≈ 0.999

Problem 32.

(a) There are 4 possible paths from A to B: 1 to 4 (path 1), 2 to 5 (path 2), 1 to 3 to 5 (path
3), 2 to 3 to 4 (path 4). Let Pi be the event that path i is open. Only 1 path needs to be
open for event A to occur, so the probability of A is given by the probability of P1 or P2 or
P3 or P4 . We expand this probability with inclusion-exclusion, making sure to enumerate all
unique pairs and all unique triplets:

P (A) = P (P1 ∪ P2 ∪ P3 ∪ P4 )
= P (P1 ) + P (P2 ) + P (P3 ) + P (P4 )
− P (P1 ∩ P2 ) − P (P1 ∩ P3 ) − P (P1 ∩ P4 ) − P (P2 ∩ P3 ) − P (P2 ∩ P4 ) − P (P3 ∩ P4 )
+ P (P1 ∩ P2 ∩ P3 ) + P (P1 ∩ P2 ∩ P4 ) + P (P1 ∩ P3 ∩ P4 ) + P (P2 ∩ P3 ∩ P4 )
− P (P1 ∩ P2 ∩ P3 ∩ P4 )
= P (B1 ∩ B4 ) + P (B2 ∩ B5 ) + P (B1 ∩ B3 ∩ B5 ) + P (B2 ∩ B3 ∩ B4 )
− P ((B1 ∩ B4 ) ∩ (B2 ∩ B5 )) − P ((B1 ∩ B4 ) ∩ (B1 ∩ B3 ∩ B5 )) − P ((B1 ∩ B4 ) ∩ (B2 ∩ B3 ∩ B4 ))
− P ((B2 ∩ B5 ) ∩ (B1 ∩ B3 ∩ B5 )) − P ((B2 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
− P ((B1 ∩ B3 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
+ P ((B1 ∩ B4 ) ∩ (B2 ∩ B5 ) ∩ (B1 ∩ B3 ∩ B5 )) + P ((B1 ∩ B4 ) ∩ (B2 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
+ P ((B1 ∩ B4 ) ∩ (B1 ∩ B3 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
+ P ((B2 ∩ B5 ) ∩ (B1 ∩ B3 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
− P ((B1 ∩ B4 ) ∩ (B2 ∩ B5 ) ∩ (B1 ∩ B3 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
= P (B1 ∩ B4 ) + P (B2 ∩ B5 ) + P (B1 ∩ B3 ∩ B5 ) + P (B2 ∩ B3 ∩ B4 )
− P (B1 ∩ B4 ∩ B2 ∩ B5 ) − P (B1 ∩ B4 ∩ B3 ∩ B5 ) − P (B1 ∩ B4 ∩ B2 ∩ B3 )
− P (B2 ∩ B5 ∩ B1 ∩ B3 ) − P (B2 ∩ B5 ∩ B3 ∩ B4 ) − P (B1 ∩ B3 ∩ B5 ∩ B2 ∩ B4 )
+ P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 ) + P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 )
+ P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 ) + P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 )
− P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 )
= P1 P4 + P2 P5 + P1 P3 P5 + P2 P3 P4 − P1 P4 P2 P5 − P1 P4 P3 P5 − P1 P4 P2 P3
− P2 P5 P1 P3 − P2 P5 P3 P4 + 2P1 P2 P3 P4 P5
= P1 P4 (1 − P2 P5 − P3 P5 − P2 P3 ) + P2 P5 + P1 P3 P5 + P2 P3 P4
− P2 P5 P1 P3 − P2 P5 P3 P4 + 2P1 P2 P3 P4 P5

As a sanity check, if bridge 3 does not exist (i.e., if P3 = 0), then there are only 2 paths and
by inclusion-exclusion, P (A) = P1 P4 + P2 P5 − P1 P4 P2 P5 . In the limit that P3 = 0, we see
that, indeed, the above formula matches this probability.
(b) To solve for P (B3 |A) I use Bayes’ rule:

P (A|B3 )P3
P (B3 |A) = .
P (A)
P (A) has already been calculated. To solve for the probability of A conditioned on B3 we
need only to condition each probability term in P (A) on B3 , which effectively turns all the
P3 terms in the formula for P (A) to unity. Therefore,

P (A|B3 ) = P1 P4 (1 − P2 P5 − P5 − P2 ) + P2 P5 + P1 P5 + P2 P4 − P2 P5 P1 − P2 P5 P4 + 2P1 P2 P4 P5 ,

and we can insert P (A|B3 ) and P (A) into Bayes’ rule to obtain the answer:

P (B3 |A) =
[P1 P4 P3 (1 − P2 P5 − P5 − P2 ) + P3 P2 P5 + P3 P1 P5 + P3 P2 P4 − P3 P2 P5 P1 − P3 P2 P5 P4 + 2P1 P2 P3 P4 P5 ]
/ [P1 P4 (1 − P2 P5 − P3 P5 − P2 P3 ) + P2 P5 + P1 P3 P5 + P2 P3 P4 − P2 P5 P1 P3 − P2 P5 P3 P4 + 2P1 P2 P3 P4 P5 ].

Problem 33. Without loss of generality, let us call the door that you picked door 1, and let us
arbitrarily denote the remaining doors by 2 and 3. Let Ci denote the event that the car is behind
door i and Hi denote the event that the host opens door i. The original probability that you
guessed the door with the car is P (C1 ) = 1/3. Since the host will not open door 1, and he will also
not open the door with the car behind it, we have the following probabilities:

P (H1 |C1 ) = 0
1
P (H2 |C1 ) =
2
1
P (H3 |C1 ) =
2
P (H1 |C2 ) = 0
P (H2 |C2 ) = 0
P (H3 |C2 ) = 1
P (H1 |C3 ) = 0
P (H2 |C3 ) = 1
P (H3 |C3 ) = 0

If the host opens door 3, we would like to know P (C2 |H3 ), because if this value is higher than
1/3, it is in our interest to switch to door 2. Likewise if the host opens door 2, we would like
to know P (C3 |H2 ) to know if we should switch to door 3. Given the symmetry of the problem
P (C2 |H3 ) = P (C3 |H2 ), so I only need to compute the probability once, which I do using Bayes'
rule:
P (C2 |H3 ) = P (H3 |C2 )P (C2 ) / [P (H3 |C1 )P (C1 ) + P (H3 |C2 )P (C2 ) + P (H3 |C3 )P (C3 )]
           = (1 · 1/3) / ((1/2)(1/3) + 1 · (1/3) + 0)
           = 2/3.
It is therefore in your interest to switch to door 2 if the host opens door 3 or to switch to door 3 if
the host opens door 2.
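The classic result is easy to confirm by simulation; the sketch below (not part of the original solution) assumes a standard Python 3 interpreter:

# Monty Hall simulation: switching wins about 2/3 of the time.
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randint(1, 3)
        pick = 1                                               # we always start with door 1
        host = random.choice([d for d in (2, 3) if d != car])  # host never reveals the car
        if switch:
            pick = next(d for d in (2, 3) if d != host)
        wins += (pick == car)
    return wins / trials

print(play(switch=False), play(switch=True))   # ~0.333  ~0.667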

Problem 34.

(a) P (A) = 1/6, P (B) = |{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}|/36 = 1/6, P (A, B) = 1/36.
Since P (A)P (B) = 1/36 = P (A, B), the events are indeed independent.

(b) P (C) = 1/6, so that P (A)P (C) = 1/36, and P (A, C) = 1/36, and therefore they are indepen-
dent.

(c) P (B)P (C) = 1/36, P (B, C) = 1/36, so yes, they are independent.

(d) The events A, B and A, C and B, C are pairwise independent. We also need to check if
P (A, B, C) = P (A)P (B)P (C). The probability P (A, B, C) equals 0 since those events
cannot all occur at once, whereas P (A)P (B)P (C) ≠ 0. Therefore, the events A, B and C
are not independent.

Problem 35. Let X1 denote the outcome of the first flip, W denote the event that I win, and let
the probability of tails be q (= 1 − p). From Bayes' rule, the probability that the first flip was heads
given that I won the game is:

P (X1 = H|W ) = P (W |X1 = H)P (X1 = H) / P (W ).

I first calculate the probability of winning:

P (W ) = P (HH) + P (T HH) + P (HT HH) + P (T HT HH) + . . .
       = [P (HH) + P (HT HH) + . . .] + [P (T HH) + P (T HT HH) + . . .]
       = [q^0 p^2 + qp^3 + . . .] + [qp^2 + q^2 p^3 + . . .]
       = (q^0 p + qp^2 + . . .)p + (q^0 p^2 + qp^3 + . . .)q

Note that since P (W ) = P (W |X1 = H)p + P (W |X1 = T )q, the first term in the parentheses in
the above equation represents P (W |X1 = H) while the second term in the parentheses represents
P (W |X1 = T ). I solve for both of these separately:

P (W |X1 = H) = q^0 p + qp^2 + . . .
             = (qp)^0 p + (qp)^1 p + . . .
             = p Σ_{k=0}^{∞} (qp)^k
             = p/(1 − qp),

while

P (W |X1 = T ) = q^0 p^2 + qp^3 + . . .
             = (qp)^0 p^2 + (qp)^1 p^2 + . . .
             = p^2 Σ_{k=0}^{∞} (qp)^k
             = p^2/(1 − qp),

where I have used the formula for a geometric series. Thus we can compute the probability of
winning as

P (W ) = P (W |X1 = H)p + P (W |X1 = T )q
       = p^2/(1 − qp) + p^2 q/(1 − qp).

Finally, we can plug all of these formulas into Bayes' equation:

P (X1 = H|W ) = P (W |X1 = H)P (X1 = H) / P (W )
             = [p^2/(1 − qp)] / [p^2/(1 − qp) + p^2 q/(1 − qp)]
             = 1/(1 + q)
             = 1/(2 − p).

Problem 36. Let Hn+1 denote the event that the (n + 1)th flip is a head, H . . . H denote the
event of observing n heads and F denote the event that we pick the fair coin. We would like to find
P (Hn+1 |H . . . H), and we know that from the law of total probability P (Hn+1 ) = P (Hn+1 |F )P (F )+
P (Hn+1 |F c )P (F c ). By conditioning all of the probabilities on H . . . H, this equation gives a formula
for the probability we desire:

P (Hn+1 |H . . . H) = P (Hn+1 |F, H . . . H)P (F |H . . . H) + P (Hn+1 |F^c, H . . . H)P (F^c |H . . . H).   (1.4)

The terms P (Hn+1 |F, H . . . H) and P (Hn+1 |F^c, H . . . H) are conditionally independent of H . . . H
given F (or F^c) and therefore these probabilities are 1/2 and 1 respectively. We may obtain
P (F |H . . . H) from Bayes’ rule:

P (F |H . . . H) = P (H . . . H|F )P (F ) / [P (H . . . H|F )P (F ) + P (H . . . H|F^c )P (F^c )]
               = (1/2)^n (1/2) / [(1/2)^n (1/2) + 1 · (1/2)]
               = 1/(1 + 2^n).

Thus:

P (Hn+1 |H . . . H) = (1/2) · 1/(1 + 2^n) + 1 · (1 − 1/(1 + 2^n))
                   = 1 − 1/(2(1 + 2^n)).

We can check this formula for the extremes that n = 0 and n → ∞. In the first case, if n = 0, we
can calculate the probability of heads directly: P (H) = (1/2)(1/2) + 1(1/2) = 3/4, which matches
what the formula predicts when n = 0. When n → ∞, we would expect that the coin is probably
unfair, so that the probability of the next flip landing heads is 1. Indeed, this is what the formula
predicts in the limit that n → ∞.

Problem 37. Let Xi denote the number of girls for the ith child. Note that Xi can only take on
values 0 or 1. We seek the probability P (X1 = 1, . . . , Xn = 1|X1 + . . . + Xn ≥ 1). This can be
re-written with Bayes’ rule:
P (X1 = 1, . . . , Xn = 1|X1 + . . . + Xn ≥ 1)
  = P (X1 + . . . + Xn ≥ 1|X1 = 1, . . . , Xn = 1) P (X1 = 1, . . . , Xn = 1) / P (X1 + . . . + Xn ≥ 1).   (1.5)
In the numerator, the first term is just 1 since X1 + . . . + Xn is guaranteed to be at least 1 if
all X1 , . . . , Xn are equal to 1. The second term is (1/2)n since all boy/girl events are independent.
To calculate the denominator, it is easier to consider its complement: P (X1 + . . . + Xn ≥ 1) =
1 − P (X1 + . . . + Xn = 0) = 1 − P (X1 = 0, . . . , Xn = 0) = 1 − (1/2)n . Putting all of these into
Bayes’ rule, I obtain:

P (X1 = 1, . . . , Xn = 1|X1 + . . . + Xn ≥ 1) = 1/(2^n − 1).

We can test this formula for low values of n. For n = 1, P (X1 = 1|X1 = 1) = 1, which the
above formula predicts. For n = 2, by listing out the boy/girl event space, it is not difficult to
determine that P (X1 = 1, X2 = 1|X1 + X2 ≥ 1) = 1/3, which is also what the above formula predicts.

Problem 38. Let L be the event that the family has at least 1 daughter named Lilia, G . . . G be
the event that the n children are girls, BG . . . G be the event that the first child is a boy and the
following n − 1 children are girls, GBG . . . G be the event that the first child is a girl, the second
is a boy and following n − 2 children are girls, etc.... We are interested in P (G . . . G|L) which we
can obtain with Bayes’ rule:

P (G . . . G|L) = P (L|G . . . G)P (G . . . G) / P (L).   (1.6)

The denominator is given by:

P (L) = 1 − P (L^c)
      = 1 − P (G . . . G)[P (L^c |G . . . G) + P (L^c |BG . . . G) + P (L^c |GBG . . . G) + . . . + P (L^c |B . . . B)]
      = 1 − P (G . . . G)[(1 − α)^n + (1 − α)^{n−1} + (1 − α)^{n−1} + . . . + (1 − α)^0]
      = 1 − P (G . . . G) Σ_{k=0}^{n} [n!/(k!(n − k)!)] (1 − α)^k
      = 1 − P (G . . . G)(2 − α)^n.

In the third line I used the fact that to have no daughters named Lilia given n daughters, none of
the n daughters can be named Lilia, which occurs with probability (1 − α)^n. In the fourth line, I
used the fact that the total number of sequences of the form BG . . . G, GBG . . . G, . . ., is given
by the number of permutations of n elements (n!) divided by the number of repeats for any element
(which for the Gs is k!, where k is the number of Gs in the sequence, and which for the Bs is
(n − k)!). This is a simple combinatorics problem which will be discussed in the following chapter.
Finally, to evaluate the summation, I used the binomial theorem.
We need one more probability, which is P (L|G . . . G) = 1 − P (L^c |G . . . G) = 1 − (1 − α)^n.
Substituting all of these probabilities into Bayes' rule, I obtain:
P (G . . . G|L) = [1 − (1 − α)^n] (1/2)^n / [1 − (1/2)^n (2 − α)^n]
              = [1 − (1 − α)^n] / [2^n − (2 − α)^n]
              ≈ [1 − (1 − nα)] / [2^n − 2^n (1 − nα/2)]
              = 1/2^{n−1},

where I have used a Taylor expansion to simplify the polynomial terms since α ≪ 1. The case of
n = 2 corresponds to Problem 7 of section 1.4.5 in the book. Evaluating my formula with n = 2 I
find that P (GG|L) = (2 − α)/(4 − α) ≈ 1/2 which is the same formula as the answer to Problem
7 of section 1.4.5.

Problem 39. Let R be the event that a randomly chosen child is a girl, and G . . . G be the event
that the family has n girls. We seek to find P (G . . . G|R), which we can get from Bayes’ rule:

P (G . . . G|R) = P (R|G . . . G)P (G . . . G) / P (R).
R is certain to happen conditioned on G . . . G, P (G . . . G) is simply (1/2)^n and the probability
of randomly choosing a girl from a family without any prior information about genders is 1/2 as
shown for the n = 1 (S = {G, B}) and n = 2 (S = {BB, BG, GB, GG}) cases below:
n = 1:

P (R) = P (R|G)P (G) + P (R|B)P (B) = 1 · (1/2) + 0 · (1/2) = 1/2,
n=2:

P (R) = P (R|BB)P (BB) + P (R|BG)P (BG) + P (R|GB)P (GB) + P (R|GG)P (GG)
      = 0 · (1/4) + (1/2)(1/4) + (1/2)(1/4) + 1 · (1/4)
      = 1/2.

Therefore P (G . . . G|R) = 1/2^{n−1}.
Chapter 2

Combinatorics: Counting Methods


Problem 1. We can use the multiplication principle, making sure to enumerate all the cream/
sugar/milk possibilities:

4 · 3 · [C(3, 0) + C(3, 1) + C(3, 2) + C(3, 3)] = 4 · 3 · 8 = 96,   (2.1)

where C(n, k) denotes the binomial coefficient "n choose k".

Problem 2. Let N be the number of unique permutations of the 8 people in the 12 chairs. The 4
empty chairs are indistinguishable, so, for any given unique permutation, the permutations amongst
those 4 chairs do not count toward the number of unique permutations, N . We know that the total
number of permutations (including the non-unique ones) is 12!, and therefore 12! = N 4!,
so that

N = 12!/4! = 19958400.   (2.2)
Problem 3.

(a) Let B represent the set of the 20 black cell phones, {b1 , b2 , . . . , b20 }, and W represent the set
of the 30 white cell phones, {w1 , w2 , . . . , w30 }. Let B be the set containing all possible sets of
the 4 distinct black cell phones that were chosen (without replacement) from the 20 black cell
phones, B = {{b1 , b2 , b3 , b4 }, {b1 , b2 , b3 , b5 } . . . {b17 , b18 , b19 , b20 }}, and W be the corresponding
set for the 6 white cell phones. Therefore, the sets of sets representing all unique ways to
obtain 4 black cell phones and 6 white cell phones is given by {B1 ∪ W1 , B1 ∪ W2 , . . . , B|B| ∪ W|W| },
whose total cardinality can be seen to be |B||W|. |B| is clearly C(20, 4), and |W| is clearly C(30, 6),
so the size of this set is C(20, 4) C(30, 6). The sample space for this experiment is all possible unique
sets of size 10 that can be chosen from B ∪ W . Therefore, the probability of obtaining exactly
4 black cell phones is given by:

P (4 black phones) = C(20, 4) C(30, 6) / C(50, 10) ≈ 0.28.   (2.3)
In this problem I somewhat laboriously spelled out how to obtain the proper number of
sets from the sample space with exactly 4 black cell phones. I did this for the purpose of
illustration since this type of situation arises commonly in combinatorics problems. In the
future I will typically be more terse.

(b)

P (NB < 3) = P (NB = 0 ∪ NB = 1 ∪ NB = 2)
           = P (NB = 0) + P (NB = 1) + P (NB = 2)
           = C(20, 0) C(30, 10)/C(50, 10) + C(20, 1) C(30, 9)/C(50, 10) + C(20, 2) C(30, 8)/C(50, 10)
           ≈ 0.14
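These hypergeometric probabilities are easy to evaluate directly; a quick check (not part of the original solution), assuming Python 3.8+ for math.comb:

# Evaluate parts (a) and (b) of Problem 3 with exact binomial coefficients.
from math import comb

p_a = comb(20, 4) * comb(30, 6) / comb(50, 10)
p_b = sum(comb(20, k) * comb(30, 10 - k) for k in range(3)) / comb(50, 10)
print(round(p_a, 3), round(p_b, 3))   # 0.28 0.139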

Problem 4.

(a) The sample space is all possible sets of length 5 chosen from the 52 cards, and the events we
are interested in are all possible sets of size 5, exactly one of which is an ace. Therefore:

P (NA = 1) = C(4, 1) C(48, 4) / C(52, 5) ≈ 0.30.

(b) Let NA ≥ 1 be the event that the hand contains at least 1 ace. It will be easier to consider
the complement of this event:

P (NA ≥ 1) = 1 − P (NA = 0)
           = 1 − C(48, 5)/C(52, 5)
           ≈ 0.34.

Problem 5. It will be convenient to use Bayes’ rule so that we can move NA ≥ 1 to the first slot
of P (·|·):

P (NA = 2|NA ≥ 1) = P (NA ≥ 1|NA = 2)P (NA = 2) / P (NA ≥ 1).   (2.4)

We have already computed the denominator in the previous problem. In the numerator P (NA ≥
1|NA = 2) = 1 since the probability of obtaining at least 1 ace is unity if we already know there
are 2 aces. The probability of obtaining exactly 2 aces is

P (NA = 2) = C(4, 2) C(48, 3) / C(52, 5) ≈ 0.04,

and therefore:

P (NA = 2|NA ≥ 1) ≈ 1 · 0.04 / 0.34 ≈ 0.12.   (2.5)
Problem 6. Let C4 be the event that C receives exactly 4 spades. Each player has 13 cards, and
between players A and B, we know there are 7 spades, and 19 non-spades. This leaves 6 spades
and 20 non-spades to be chosen amongst players C and D. If the 26 cards are first dealt to A and
B, and another 13 are dealt to C, then the probability that C obtains exactly 4 spades is:

P (C4 ) = C(6, 4) C(20, 9) / C(26, 13) ≈ 0.24.

Problem 7. Let J be the event that Joe is chosen and Y be the event that you are chosen. By
inclusion-exclusion:
P (J ∪ Y ) = P (J) + P (Y ) − P (J, Y ).
 
There are C(1, 1) C(49, 14) different ways Joe can be chosen and the same number of ways you can
be chosen. There are C(2, 2) C(48, 13) different ways both you and Joe can be chosen, and thus:

P (J ∪ Y ) = 2 C(49, 14)/C(50, 15) − C(48, 13)/C(50, 15) ≈ 0.51.

Problem 8. In general, for a sequence with n elements, r of which are unique, the number of
unique permutations is given by:

N = n!/(n_1! n_2! . . . n_r!),   (2.6)

where n_i is the number of repeats of the ith unique element in the original sequence. This can
easily be shown, since the total number of permutations must equal the number of unique
permutations times the number of ways the repeated elements can be permuted amongst
themselves: n! = N n_1! n_2! . . . n_r!. For example, one unique permutation of the
word “Massachusetts” is Massachusetts itself. We see that the “a”s can be permuted 2! ways
amongst second and fifth position, while still forming the word Massachusetts. Likewise, the
“s”s can be permuted 4! ways and the “t”s 2! ways, resulting in 2!4!2! permutations of all letters
which result in this unique permutation. Thus, the total number of ways of arranging the word
“Massachusetts” is:

N = n!/(n_a! n_s! n_t!) = 13!/(2! 4! 2!) = 64864800.   (2.7)
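As a quick check (not part of the original solution), the same count can be produced programmatically, assuming Python 3:

# Count distinct rearrangements of "massachusetts" via the multinomial formula.
from math import factorial
from collections import Counter

word = "massachusetts"
n_unique = factorial(len(word))
for multiplicity in Counter(word).values():   # e.g., 's' appears 4 times
    n_unique //= factorial(multiplicity)
print(n_unique)   # 64864800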
Problem 9.

(a) Using the formula for the binomial distribution, I find:


 
P (k = 8) = C(20, 8) p^8 (1 − p)^{20−8}.

(b) Since both the number of heads and number of tails must be > 8, the possible observed
number of heads (tails) can be 9 (11) or 10 (10) or 11 (9). These are disjoint events, so the
total probability we are interested in is

P ({k = 9, k = 10, k = 11}) = P (k = 9) + P (k = 10) + P (k = 11)
 = C(20, 9) p^9 (1 − p)^{20−9} + C(20, 10) p^{10} (1 − p)^{20−10} + C(20, 11) p^{11} (1 − p)^{20−11}
 = C(20, 9) p^9 (1 − p)^9 (1 − 2p + 2p^2) + C(20, 10) p^{10} (1 − p)^{10}.

Problem 10. Let u denote a move up and r denote a move to the right. A path from (0, 0)
to (20, 10) can be represented by a sequence of us and rs. Note that in every possible sequence,
there must be 10 us and 20 rs because we always need to travel 10 units up and 20 units to the
right regardless of the path. Therefore, the problem reduces to ascertaining the number of unique
sequences with 10 us and 20 rs, which, from Problem 8 we can see to be:

30!/(20! 10!) = 30045015.
Problem 11. Let A denote the event that the message passes through (10, 5) on its way to (20, 10).
To reach the point (10, 5) on the way from (0, 0) to (20, 10), the first 15 entries of the sequence must
have exactly 5 us and 10 rs. This may occur in any number of the unique permutations 15!/(10!5!).
To reach (20, 10), the remaining entries must also contain exactly 5 us and 10 rs, again giving
15!/(10!5!) unique permutations from (10, 5) to (20, 10). The total number of unique permutations
starting at (0, 0) and going through (10, 5) on its way to (20, 10) is therefore (15!/(10!5!))2 , so that
the probability that the message goes through (10, 5) is:

P (A) = [15!/(10! 5!)]^2 / [30!/(10! 20!)] ≈ 0.30.

Problem 12. Let A denote the event that the message passes through (10, 5). This occurs if, out
of the first 15 entries of the sequence there are exactly 5 us and 10 rs in any order. For a binary
outcome experiment, the probability of obtaining 5 us with probability pa is given by the binomial
distribution:
P (A) = C(15, 5) p_a^5 (1 − p_a)^{10}.

Problem 13. Let pi be the probability of flipping a heads for coin i (i ∈ {1, 2}), let Ci be the
event that coin i is chosen. Using the law of total probability and the binomial distribution, I find:

(a)

P (NH ≥ 3) = P (NH = 3 ∪ NH = 4 ∪ NH = 5)
           = P (NH = 3) + P (NH = 4) + P (NH = 5)
           = Σ_{i=1}^{2} [P (NH = 3|Ci )P (Ci ) + P (NH = 4|Ci )P (Ci ) + P (NH = 5|Ci )P (Ci )]
           = (1/2) Σ_{i=1}^{2} [C(5, 3) p_i^3 (1 − p_i)^2 + C(5, 4) p_i^4 (1 − p_i) + C(5, 5) p_i^5]
           ≈ 0.35.

(b) From Bayes’:


P (C2 |NH ≥ 3) = P (NH ≥ 3|C2 )P (C2 ) / P (NH ≥ 3),

where P (NH ≥ 3) has already been solved, P (C2 ) = 0.5 and

P (NH ≥ 3|C2 ) = Σ_{j=3}^{5} C(5, j) p_2^j (1 − p_2)^{5−j} ≈ 0.21.

Therefore, the probability we are interested in is

P (C2 |NH ≥ 3) ≈ 0.21 · 0.5 / 0.35 = 0.3.
The fact that this probability is less than 0.5 makes sense, since more heads were observed
than tails, and so it is more probable that coin 1 was chosen since the probability that it
lands heads is higher.
Problem 14. There are 13!/(3! 5! 5!) different ways Hannah and Sarah can be arranged on the same
team. However, we do not care about players being assigned to a particular team name, we just
care about the number of possible divisions. Therefore, to avoid over-counting, we must divide
this value by 2!. Likewise, there are 15!/(5! 5! 5!) total ways to construct 3 teams of 5 each, and we
must divide by 3! since we only care about the number of possible divisions:

P (H and S in same division) = [(1/2!) · 13!/(3! 5! 5!)] / [(1/3!) · 15!/(5! 5! 5!)] ≈ 0.29.
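A quick cross-check of this probability (not part of the original solution), both from the counting formula and by simulation, assuming a standard Python 3 interpreter:

# Probability that two particular players land on the same team of 5 (15 players, 3 teams).
from math import factorial
import random

exact = (factorial(13) // (factorial(3) * factorial(5) ** 2) // 2) / \
        (factorial(15) // factorial(5) ** 3 // 6)

trials, same = 200_000, 0
players = list(range(15))          # players 0 and 1 play the roles of Hannah and Sarah
for _ in range(trials):
    random.shuffle(players)
    teams = [set(players[0:5]), set(players[5:10]), set(players[10:15])]
    same += any({0, 1} <= t for t in teams)
print(round(exact, 4), round(same / trials, 4))   # both ~0.2857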

Problem 15. We would like to find P (N1 > 1 ∪ N2 > 1 ∪ . . . ∪ N6 > 1) = 1 − P (N1 ≤ 1, N2 ≤
1, . . . , N6 ≤ 1). For the first roll, we therefore have 6 allowable options, for the second 5 allowable
options, and so on. Therefore the probability is:

P (N1 > 1 ∪ N2 > 1 ∪ . . . ∪ N6 > 1) = 1 − (6 · 5 · 4 · 3 · 2)/6^5 ≈ 0.91.

Problem 16.
(a) Let A be the desired event. If the first 15 cards are to have 10 red cards, then there are
C(10, 5) C(10, 10) different possible groups for the first 15 cards, which we can arrange in 15! possible
ways. There are C(5, 5) possible groups for the remaining 5 cards, which we can arrange in 5! possible
ways. Finally, the total number of permutations of the 20 cards is 20!, and therefore:

P (A) = C(10, 5) C(10, 10) 15! C(5, 5) 5! / 20! ≈ 0.02.

(b) Let A′ be the event we desire. This problem is almost identical to the first:

P (A′) = C(10, 7) C(10, 8) 15! C(3, 3) C(2, 2) 5! / 20! ≈ 0.35.

Problem 17. Let Bi be the event that I choose bag i (i ∈ {1, 2}) and Nr = 2 be the event that I
choose exactly 2 red marbles out of the 5. Using Bayes’

P (B1 |Nr = 2) = P (Nr = 2|B1 )P (B1 ) / [P (Nr = 2|B1 )P (B1 ) + P (Nr = 2|B2 )P (B2 )].   (2.8)

The probability of choosing either bag is 1/2 and the probability of Nr = 2 conditioned on choosing
bag 1 is

P (Nr = 2|B1 ) = C(6, 2) C(10, 3) / C(16, 5) ≈ 0.41,

while the probability of Nr = 2 conditioned on choosing bag 2 is

P (Nr = 2|B2 ) = C(6, 2) C(15, 3) / C(21, 5) ≈ 0.34.

Sticking all relevant probabilities into Bayes', I arrive at

P (B1 |Nr = 2) = 0.41 · 0.5 / (0.41 · 0.5 + 0.34 · 0.5) ≈ 0.55.   (2.9)

Problem 18. Let E^c denote the event that an error has not occurred on a given trial, and let X_i
denote the outcome of the ith trial. We seek the probability of all sequences of length n, ending in
E^c, where the first n − 1 entries can be any sequence, provided they contain exactly k − 1 E^c s,
which I denote by A_{n−1}. The probability we desire is P (A_{n−1}, X_n = E^c) = P (A_{n−1}) P (X_n = E^c)
by independence. I note that P (A_{n−1}) is a binomial probability, and therefore:

P (A_{n−1}, X_n = E^c) = C(n − 1, k − 1) p^{k−1} (1 − p)^{(n−1)−(k−1)} · p = C(n − 1, k − 1) p^k (1 − p)^{n−k}.

Problem 19. Let yi ≡ xi − 1 for i = 1, . . . , 5, and therefore all yi s can take on values {0, 1, 2, . . .}.
The equation for which we are trying to find the number of distinct integer solutions then becomes:

y1 + y2 + y3 + y4 + y5 = 95, (2.10)
which has C(5 + 95 − 1, 95) = C(99, 95) integer solutions.

Problem 20. It is not difficult to explicitly enumerate the total number of solutions when
x1 = 0, 1, . . . , 10. The total number of integer valued solutions is thus the number of solutions for
when x1 = 0, plus the number of solutions for when x1 = 1, . . . plus the total number of solutions
for when x1 = 10. In each one of these instances, we must find the number of integer solutions for
the equation
x2 + x3 + x4 = 100 − i,

(where x2 , x3 , x4 ∈ {0, 1, 2, . . .}), which has C(3 + 100 − i − 1, 100 − i) solutions. Therefore, the total
number of integer solutions for this equation, N , with x1 ∈ {0, 1, . . . , 10} is:

N = Σ_{i=0}^{10} C(3 + 100 − i − 1, 100 − i)
  = Σ_{i=0}^{10} (1/2)(10302 − 203 i + i^2)
  = (11 · 10302)/2 − (203/2)(0 + 1 + 2 + . . . + 10) + (1/2)(0 + 1 + 4 + . . . + 100)
  = 51271.
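This count is small enough to verify by brute force; a quick check (not part of the original solution), assuming a standard Python 3 interpreter:

# Count nonnegative integer solutions of x1 + x2 + x3 + x4 = 100 with x1 <= 10.
count = 0
for x1 in range(11):
    for x2 in range(101 - x1):
        for x3 in range(101 - x1 - x2):
            count += 1          # x4 = 100 - x1 - x2 - x3 is then automatically >= 0
print(count)   # 51271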

Problem 21. Let A1 = {(x1 , x2 , x3 ) : x1 + x2 + x3 = 100, x1 ∈ {41, 42, . . .}, x2 , x3 ∈ {0, 1, 2, . . .}},
and let A2 and A3 be defined analogously. By inclusion-exclusion, the total number of possible
unique integer solutions to this problem is then:

|A1 ∪ A2 ∪ A3 | = |A1 | + |A2 | + |A3 | − |A1 ∩ A2 | − |A1 ∩ A3 | − |A2 ∩ A3 | + |A1 ∩ A2 ∩ A3 |


= 3|A1 | − 3|A1 ∩ A2 | + |A1 ∩ A2 ∩ A3 |,

where the second equality follows from symmetry. The cardinality of |A1 ∩ A2 ∩ A3 | is 0 since it is
impossible to have all xi s > 40 and constrained to add to 100.
The cardinality of |A1 | can be found by letting y1 ≡ x1 − 41, so that y1 , x2 , x3 ∈ {0, 1, 2, . . .}:
y1 + x2 + x3 = 59, which has C(3 + 59 − 1, 59) = C(61, 59) solutions. The cardinality of |A1 ∩ A2 |
can be found by letting y1 ≡ x1 − 41 and y2 ≡ x2 − 41 so that y1 , y2 , x3 ∈ {0, 1, 2, . . .}:
y1 + y2 + x3 = 18, which has C(3 + 18 − 1, 18) = C(20, 18) solutions. Therefore, the total number
of solutions to this problem is:

|A1 ∪ A2 ∪ A3 | = 3 C(61, 59) − 3 C(20, 18) = 4920.
The following bit of python code confirms what we derived theoretically:
In [1]: i = range(101); j = range(101); k = range(101)

In [2]: tups = [(x, y, z) for x in i for y in j for z in k]

In [3]: len([x for x in tups if x[0] + x[1] + x[2] == 100
   ...:      and (x[0] > 40 or x[1] > 40 or x[2] > 40)])
Out[3]: 4920
Chapter 3

Discrete Random Variables


Problem 1.

(a) RX = {0, 1, 2}

(b) P (X ≥ 1.5) = P (X = 2) = 1/6

(c) P (0 < X < 2) = P (X = 1) = 1/3

(d)

P (X = 0|X < 2) = P (X = 0 ∩ X < 2) / P (X < 2)
               = P (X = 0) / P (X < 2)
               = (1/2) / (1/2 + 1/3)
               = 3/5

Problem 2. From the set-up of the problem, we have that:




PX (x) = { p0   for x = 0
           p    for x = 1
           p    for x = 2
           p0   for x = 3
           0    otherwise,

as well as the following equation P (X = 1 or X = 2) = (1/2) P (X = 0 or X = 3), so that p = p0/2.
Finally, we know that the PMF must be normalized, leading to the following coupled equations:

2p0 + 2p = 1
p = (1/2) p0,

which, when solved, results in p = 1/6 and p0 = 1/3. Thus, the PMF for this problem is:

PX (x) = { 1/3  for x = 0
           1/6  for x = 1
           1/6  for x = 2
           1/3  for x = 3
           0    otherwise,

which can easily be verified to be normalized.

Problem 3. The range of both X and Y is {1, 2, . . . , 6}, so that RZ = {−5, −4, . . . , 4, 5}. We may
find the PMF by conditioning and using the law of total probability:

P (Z = k) = P (X − Y = k)
          = Σ_{y=1}^{6} P (X = k + Y |Y = y)P (Y = y)
          = Σ_{y=1}^{6} P (X = k + y|Y = y)P (Y = y)
          = Σ_{y=1}^{6} P (X = k + y)P (Y = y)
          = Σ_{y=1}^{6} P (X = k + y) (1/6)
          = (1/6) Σ_{y=1}^{6} (1/6) 1{1 ≤ k + y ≤ 6},

where the fourth equality follows since X and Y are independent, and where 1{·} is the so called
indicator function which is equal to 1 if its argument evaluates to true, and 0 otherwise. By explicitly
evaluating the sum for all k, I find that P (Z = −5) = 1/36, P (Z = −4) = 2/36, . . . , P (Z = 0) =
6/36, P (Z = 1) = 5/36, . . . , P (Z = 5) = 1/36, which can be conveniently written as:

P (Z = k) = (6 − |k|)/36,   (3.1)
and which can explicitly be checked to be normalized.

Problem 4.

(a) Since X and Y are independent:

P (X ≤ 2, Y ≤ 2) = P (X ≤ 2)P (Y ≤ 2)
                 = [PX (1) + PX (2)][PY (1) + PY (2)]
                 = (1/4 + 1/8)(1/6 + 1/6)
                 = 1/8.

(b) By inclusion-exclusion (and also using independence):

P (X > 2 ∪ Y > 2) = P (X > 2) + P (Y > 2) − P (X > 2, Y > 2)
 = P (X > 2) + P (Y > 2) − P (X > 2)P (Y > 2)
 = [PX (3) + PX (4)] + [PY (3) + PY (4)] − [PX (3) + PX (4)][PY (3) + PY (4)]
 = (1/8 + 1/2) + (1/3 + 1/3) − (1/8 + 1/2)(1/3 + 1/3)
 = 7/8.

(c) Since X and Y are independent, P (X > 2|Y > 2) = P (X > 2) = 1/8 + 1/2 = 5/8.

(d) I use conditioning, the law of total probability and independence to solve for this:

P (X < Y ) = Σ_{y=1}^{4} P (X < Y |Y = y)P (Y = y)
           = Σ_{y=1}^{4} P (X < y|Y = y)P (Y = y)
           = Σ_{y=1}^{4} P (X < y)P (Y = y)
           = P (X < 1)P (Y = 1) + P (X < 2)P (Y = 2) + P (X < 3)P (Y = 3) + P (X < 4)P (Y = 4)
           = P (X = 1)P (Y = 2) + [P (X = 1) + P (X = 2)]P (Y = 3)
             + [P (X = 1) + P (X = 2) + P (X = 3)]P (Y = 4)
           = (1/4)(1/6) + (1/4 + 1/8)(1/3) + (1/4 + 1/8 + 1/8)(1/3)
           = 1/3.
Problem 5. Let Xi denote the number of cars that student i owns (which can be either 0 or 1).
We then seek the probability P (X1 + X2 + . . . + X50 > 30), where the probability that Xi = 1
is 1/2 for all i. In other words, we seek the probability of obtaining at least 31 successes out of
50 Bernoulli trials. We may obtain the probability of 31 successes out of 50 Bernoulli trials by
evaluating a Binomial(50, 0.5) distribution at 31, the probability of 32 successes out of 50 Bernoulli
trials by evaluating a Binomial(50, 0.5) distribution at 32, etc ... . Therefore:

P (X1 + X2 + . . . + X50 > 30) = Σ_{k=31}^{50} C(50, k) (1/2)^k (1/2)^{50−k}
                              = (1/2)^{50} Σ_{k=31}^{50} C(50, k)
                              ≈ 0.06,
where the summation has been evaluated numerically.
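A quick evaluation of this binomial tail (not part of the original solution), assuming Python 3.8+ for math.comb:

# P(more than 30 of 50 students own a car) under the Binomial(50, 0.5) model.
from math import comb

p = sum(comb(50, k) for k in range(31, 51)) / 2 ** 50
print(p)   # ~0.0595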
Problem 6. The formula for P (XN = 0) was derived in the book and is given by:
P (XN = 0) = 1/2! − 1/3! + . . . + (−1)^N /N!.
I will need this formula in my answer below. Let Ai (i = 1, . . . , N ) be the event that the ith person
receives their hat. Therefore, for XN = 1:

P (XN = 1) = P (A1 , A2^c, A3^c, . . . , AN^c) + P (A1^c, A2 , A3^c, . . . , AN^c) + . . . + P (A1^c, A2^c, A3^c, . . . , AN )
           = N · P (A1 , A2^c, A3^c, . . . , AN^c)
           = N P (A1 ) P (A2^c, A3^c, . . . , AN^c)
           = N (1/N ) P (XN −1 = 0)
           = P (XN −1 = 0),

where I have used symmetry in the second equality, independence in the third and the fact that
the probability that person 1 gets their hat out of N hats is 1/N .
For XN = 2:
P (XN = 2) = Σ_{i<j} P (Ai , Aj )P (XN −2 = 0)
           = C(N, 2) P (A1 , A2 )P (XN −2 = 0)
           = C(N, 2) (1/N )(1/(N − 1)) P (XN −2 = 0)
           = (1/2!) P (XN −2 = 0),
where I am summing over all N choose 2 unordered pairs of people who get their hats. The
probability that the first person gets their hat is 1/N while the probability that the second person
gets their hat is 1/(N − 1).
Continuing in this fashion one can see the general formula for the PMF we would like to derive:
P (XN = k) = (1/k!) P (XN −k = 0)   for k = 0, 1, 2, . . . , N,

where

P (XN −k = 0) = 1/2! − 1/3! + . . . + (−1)^{N −k}/(N − k)!.

Problem 7. Computing the probabilities will be simplified by noting that P (X > 5) = 1 − P (X ≤
5), and P (X > 5|X < 8) = P (5 < X < 8)/P (X < 8). Note that I do not explicitly evaluate the
formulas to obtain a numerical answer, but this can easily be done numerically on a computer.

(a) X ∼ Geom(1/5)

(i)
P (X > 5) = 1 − Σ_{k=1}^{5} (1/5)(1 − 1/5)^{k−1}

(ii)
P (2 < X ≤ 6) = Σ_{k=3}^{6} (1/5)(1 − 1/5)^{k−1}

(iii)
P (X > 5|X < 8) = P (5 < X < 8) / P (X < 8)
               = [Σ_{k=6}^{7} (1/5)(1 − 1/5)^{k−1}] / [Σ_{k=1}^{7} (1/5)(1 − 1/5)^{k−1}]

(b) X ∼ Bin(10, 1/3)

(i)
P (X > 5) = 1 − Σ_{k=0}^{5} C(10, k) (1/3)^k (1 − 1/3)^{10−k}

(ii)
P (2 < X ≤ 6) = Σ_{k=3}^{6} C(10, k) (1/3)^k (1 − 1/3)^{10−k}

(iii)
P (X > 5|X < 8) = [Σ_{k=6}^{7} C(10, k) (1/3)^k (1 − 1/3)^{10−k}] / [Σ_{k=0}^{7} C(10, k) (1/3)^k (1 − 1/3)^{10−k}]

(c) X ∼ Pascal(3, 1/2)

(i)
P (X > 5) = 1 − Σ_{k=3}^{5} C(k − 1, 2) (1/2)^3 (1 − 1/2)^{k−3}

(ii)
P (2 < X ≤ 6) = Σ_{k=3}^{6} C(k − 1, 2) (1/2)^3 (1 − 1/2)^{k−3}

(iii)
P (X > 5|X < 8) = [Σ_{k=6}^{7} C(k − 1, 2) (1/2)^3 (1 − 1/2)^{k−3}] / [Σ_{k=3}^{7} C(k − 1, 2) (1/2)^3 (1 − 1/2)^{k−3}]

(d) X ∼ Hypergeom(10, 10, 12)

(i)
P (X > 5) = 1 − Σ_{j=2}^{5} C(10, j) C(10, 12 − j) / C(20, 12)

(ii)
P (2 < X ≤ 6) = Σ_{j=3}^{6} C(10, j) C(10, 12 − j) / C(20, 12)

(iii)
P (X > 5|X < 8) = [Σ_{j=6}^{7} C(10, j) C(10, 12 − j) / C(20, 12)] / [Σ_{j=2}^{7} C(10, j) C(10, 12 − j) / C(20, 12)]

(e) X ∼ Pois(5)

(i)
P (X > 5) = 1 − Σ_{k=0}^{5} e^{−5} 5^k / k!

(ii)
P (2 < X ≤ 6) = Σ_{k=3}^{6} e^{−5} 5^k / k!

(iii)
P (X > 5|X < 8) = [Σ_{k=6}^{7} e^{−5} 5^k / k!] / [Σ_{k=0}^{7} e^{−5} 5^k / k!]
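As the text notes, these sums are easy to evaluate numerically; the sketch below (not part of the original solution) computes the part (i) probabilities, assuming Python 3.8+ for math.comb:

# Evaluate P(X > 5) for each model in Problem 7 directly from the sums above.
from math import comb, exp, factorial

geom  = 1 - sum(0.2 * 0.8 ** (k - 1) for k in range(1, 6))
binom = 1 - sum(comb(10, k) * (1 / 3) ** k * (2 / 3) ** (10 - k) for k in range(6))
pasc  = 1 - sum(comb(k - 1, 2) * 0.5 ** 3 * 0.5 ** (k - 3) for k in range(3, 6))
hyper = 1 - sum(comb(10, j) * comb(10, 12 - j) / comb(20, 12) for j in range(2, 6))
pois  = 1 - sum(exp(-5) * 5 ** k / factorial(k) for k in range(6))
print(geom, binom, pasc, hyper, pois)   # e.g., the geometric case gives (4/5)**5 ~ 0.328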

Problem 8.

(a) In general, for this problem P (X = x) = P (F1 , F2 , . . . Fx−1 , Sx ) = P (F1 )P (F2 ) . . . P (Fx−1 )P (Sx ),
where I have used independence. Therefore P (X = 1) = P (S1 ) = 1/2,

P (X = 2) = P (F1 )P (S2 )
          = (1/2)[1 − (1/2)^2]
          = 3/8

and

P (X = 3) = P (F1 )P (F2 )P (S3 )
          = (1/2)(1/2)^2 [1 − (1/2)^3]
          = 7/64.

(b) By inspection, one can determine that the general formula for P (X = k) for k = 1, 2, . . . is:

P (X = k) = [Π_{j=0}^{k−1} (1/2)^j] [1 − (1/2)^k].

(c)

P (X > 2) = 1 − P (X ≤ 2)
          = 1 − [P (X = 1) + P (X = 2)]
          = 1 − (1/2 + 3/8)
          = 1/8

(d)

P (X = 2|X > 1) = P (X = 2, X > 1) / P (X > 1)
                = P (X = 2) / [1 − P (X = 1)]
                = (3/8) / (1 − 1/2)
                = 3/4

Problem 9. To prove this equation, I will work on the RHS and LHS of the equation separately.
Let me first simplify the LHS:
P (X > m + l|X > m) = P (X > m + l, X > m) / P (X > m)
                    = P (X > m + l) / P (X > m)
                    = [Σ_{k=m+l+1}^{∞} p(1 − p)^{k−1}] / [Σ_{j=m+1}^{∞} p(1 − p)^{j−1}]
                    = p(1 − p)^{m+l} [1 + (1 − p) + (1 − p)^2 + . . .] / (p(1 − p)^m [1 + (1 − p) + (1 − p)^2 + . . .])
                    = (1 − p)^l.
The RHS can also be simplified:

P (X > l) = Σ_{k=l+1}^{∞} p(1 − p)^{k−1}
          = p(1 − p)^l [1 + (1 − p) + (1 − p)^2 + . . .]
          = p(1 − p)^l · 1/(1 − (1 − p))
          = (1 − p)^l,
where in the third equality I summed the geometric series. This result is exactly what I obtained
when I simplified the LHS of the equation, and thus the equation is proved.

Problem 10.
(a) We have dealt with this type of problem extensively in the combinatorics chapter. The
probability we seek is |A|/|S|, where A is the set of all possible ways we can pick exactly 4
red balls out of 10, and S is the set of all possible ways to pick 10 balls out of the 50. Let Xr
be total number of red balls drawn. We thus we:

20

30
4 6
P (Xr = 4) = 
50 ≈ 0.28. (3.2)
10

(b) We seek to find P (Xr = 4|Xr ≥ 3). The inequality will be easier to deal with if I put it in
the first slot of P (·|·), and thus I start by employing Bayes’ rule:

P (Xr = 4|Xr ≥ 3) = P (Xr ≥ 3|Xr = 4)P (Xr = 4) / P (Xr ≥ 3)
 = P (Xr = 4) / [1 − P (Xr = 0) − P (Xr = 1) − P (Xr = 2)]
 = [C(20, 4) C(30, 6)/C(50, 10)] / [1 − C(20, 0) C(30, 10)/C(50, 10) − C(20, 1) C(30, 9)/C(50, 10) − C(20, 2) C(30, 8)/C(50, 10)]
 = C(20, 4) C(30, 6) / [C(50, 10) − C(30, 10) − C(20, 1) C(30, 9) − C(20, 2) C(30, 8)]
 ≈ 0.33,

where in the second equality I have used the fact that P (Xr ≥ 3|Xr = 4) = 1 and in the third
equality I used what was derived in the previous part of this problem.
Problem 11.
(a) The average number of emails received on the weekend is 2 per hour or 8 per 4 hours. Since
we are modeling this process with a Poisson distribution, the probability that you receive 0
emails on the weekend per 4 hour interval is:

P (k = 0) = e^{−8} · 8^0 / 0! ≈ 3.4 × 10^{−4}.   (3.3)
(b) The average number of emails received on the weekend is 2 per hour and 10 per hour on
any weekday. Let Awd be the event that a weekday was chosen and Awe be the event that a
weekend was chosen. This problem can be solved using Bayes' rule:

P (Awd |k = 0) = P (k = 0|Awd )P (Awd ) / [P (k = 0|Awd )P (Awd ) + P (k = 0|Awe )P (Awe )]
              = e^{−10} · (5/7) / [e^{−10} · (5/7) + e^{−2} · (2/7)]
              ≈ 8.4 × 10^{−4}.

Problem 12. The CDF can easily be computed from the PMF:
 

FX (x) = { 0                                   for x < −2
           0.2                                 for −2 ≤ x < −1
           0.2 + 0.3 = 0.5                     for −1 ≤ x < 0
           0.2 + 0.3 + 0.2 = 0.7               for 0 ≤ x < 1
           0.2 + 0.3 + 0.2 + 0.2 = 0.9         for 1 ≤ x < 2
           0.2 + 0.3 + 0.2 + 0.2 + 0.1 = 1     for x ≥ 2.
See Fig. 3.1 for a plot of this function.
Problem 13. Whenever there is a jump in the CDF at a value of x, this indicates that that value
of x is in the range of X. Therefore, RX = {0, 1, 2, 3}. The probability at x can be found by
subtracting out the probabilities at values < x from FX (x). Therefore, the following equations give
the probabilities we need:


P (0) = FX (0)
P (1) = FX (1) − P (0)
P (2) = FX (2) − P (1) − P (0)
P (3) = FX (3) − P (2) − P (1) − P (0),

and when plugging in the values for FX (x), this leads to:



PX (x) = { 1/6  for x = 0
           1/3  for x = 1
           1/4  for x = 2
           1/4  for x = 3.

As a sanity check, these probabilities do indeed add up to 1.


Figure 3.1: The associated CDF for the PMF of Problem 12.

Problem 14.

(a)
E[X] = 1 · 0.5 + 2 · 0.3 + 3 · 0.2 = 1.7

(b)
E[X 2 ] = 1 · 0.5 + 4 · 0.3 + 9 · 0.2 = 3.5

=⇒
V ar[X] = E[X^2] − E[X]^2 = 3.5 − 1.7^2 = 0.61

=⇒
SD[X] = √(V ar[X]) ≈ 0.78

(c) Using LOTUS:

E[Y ] = Σ_{x∈RX} (2/x) PX (x)
      = (2/1) · 0.5 + (2/2) · 0.3 + (2/3) · 0.2 ≈ 1.43.

Problem 15. The range of X is {1, 2, 3, . . .}. For x ≥ 5, these values get mapped to 0, 1, 2, . . ..
The values x = 1, 2, 3, 4 get mapped to 4, 3, 2, 1, and thus RY = {0, 1, 2, . . .}. To solve for the
corresponding PMF, note that P (X = k) = (1/3)(2/3)^{k−1}, and that PY (y = k) = P (Y = k) =
P (|X − 5| = k). We therefore have:

P_Y(y = 0) = P(X = 5) = \frac{1}{3}\left(\frac{2}{3}\right)^{4}
P_Y(y = 1) = P(X = 4 \text{ or } X = 6) = \frac{1}{3}\left(\frac{2}{3}\right)^{3} + \frac{1}{3}\left(\frac{2}{3}\right)^{5}
P_Y(y = 2) = P(X = 3 \text{ or } X = 7) = \frac{1}{3}\left(\frac{2}{3}\right)^{2} + \frac{1}{3}\left(\frac{2}{3}\right)^{6}
P_Y(y = 3) = P(X = 2 \text{ or } X = 8) = \frac{1}{3}\left(\frac{2}{3}\right)^{1} + \frac{1}{3}\left(\frac{2}{3}\right)^{7}
P_Y(y = 4) = P(X = 1 \text{ or } X = 9) = \frac{1}{3}\left(\frac{2}{3}\right)^{0} + \frac{1}{3}\left(\frac{2}{3}\right)^{8}
P_Y(y = 5) = P(X = 10) = \frac{1}{3}\left(\frac{2}{3}\right)^{9}
P_Y(y = 6) = P(X = 11) = \frac{1}{3}\left(\frac{2}{3}\right)^{10}
\vdots

One can easily check that this distribution is normalized by summing all terms on the RHS of the equations: $(1/3)\sum_{k=0}^{\infty}(2/3)^k = 1$, where I have summed the geometric series.

Problem 16. I first note that the range of Y is {0, 1, 2, 3, 4, 5}, so that its PMF is

P_Y(y = 0) = P(X = -10 \text{ or } X = -9 \text{ or } \ldots \text{ or } X = 0) = \frac{11}{21}
P_Y(y = 1) = P(X = 1) = \frac{1}{21}
P_Y(y = 2) = P(X = 2) = \frac{1}{21}
P_Y(y = 3) = P(X = 3) = \frac{1}{21}
P_Y(y = 4) = P(X = 4) = \frac{1}{21}
P_Y(y = 5) = P(X = 5 \text{ or } X = 6 \text{ or } \ldots \text{ or } X = 10) = \frac{6}{21},
which indeed sums to 1.

Problem 17. Since E[X] was found to be 1/p for the geometric distribution from Example 3.12 in
the book, if we can solve for E[X 2 ] then we can compute the variance with V ar[X] = E[X 2 ]−E[X]2 .
To do this, we will need a few formulas involving the geometric series. I claim that:

\sum_{k=0}^{\infty} x^{k} = \frac{1}{1-x}, \qquad |x| < 1,

\sum_{k=0}^{\infty} k x^{k-1} = \frac{1}{(1-x)^2}, \qquad |x| < 1,

and

\sum_{k=0}^{\infty} k^{2} x^{k-1} = \frac{1+x}{(1-x)^3}, \qquad |x| < 1.
The first formula is simply the sum of a geometric series, the second was already proved in the
book in Example 3.12. I now prove the third formula.

Proof. We can take derivatives of the LHS and RHS of the second equation above to prove the third. Differentiating the LHS results in:

\frac{d}{dx}\sum_{k=0}^{\infty} k x^{k-1} = \sum_{k=0}^{\infty} k(k-1)x^{k-2}
= \sum_{k=1}^{\infty} k^{2} x^{k-2} - \sum_{k=1}^{\infty} k x^{k-2}
= \sum_{j=0}^{\infty} (j+1)^{2} x^{j-1} - \sum_{j=0}^{\infty} (j+1) x^{j-1}
= \sum_{j=0}^{\infty} j^{2} x^{j-1} + 2\sum_{j=0}^{\infty} j x^{j-1} + \sum_{j=0}^{\infty} x^{j-1} - \sum_{j=0}^{\infty} j x^{j-1} - \sum_{j=0}^{\infty} x^{j-1}
= \sum_{j=0}^{\infty} j^{2} x^{j-1} + \sum_{j=0}^{\infty} j x^{j-1}
= \sum_{j=0}^{\infty} j^{2} x^{j-1} + \frac{1}{(1-x)^2},

where I have made the substitution j = k - 1. Differentiating the RHS results in:

\frac{d}{dx}\frac{1}{(1-x)^2} = \frac{2}{(1-x)^3},

and putting the two together completes the proof.

I may now solve for E[X^2]:

E[X^2] = \sum_{k=1}^{\infty} k^{2}\, p(1-p)^{k-1}
= p \sum_{k=0}^{\infty} k^{2} (1-p)^{k-1}
= \frac{p\,[1 + (1-p)]}{[1-(1-p)]^3}
= \frac{2-p}{p^2},

so that the variance is:

Var[X] = \frac{2-p}{p^2} - \frac{1}{p^2} = \frac{1-p}{p^2}.
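A quick Monte Carlo sanity check of Var[X] = (1 - p)/p^2; the value p = 0.3 below is an arbitrary choice.

```python
# Monte Carlo check of the geometric variance.
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
samples = rng.geometric(p, size=1_000_000)   # support {1, 2, 3, ...}
print(samples.var(), (1 - p) / p**2)         # both ~7.78
```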

Problem 18. In Problem 5 from 3.1.6 of the book, we showed that if X1 , X2 , . . . , Xm ∼ Geom(p) =
P ascal(1, p) (iid), then X = X1 + X2 + . . . + Xm ∼ P ascal(m, p). Therefore V ar[X] = V ar[X1 ] +
V ar[X2 ]+. . .+V ar[Xm ] = m(1−p)/(p2 ), by linearity in variance of independent random variables.

Problem 19. I use LOTUS repeatedly in this problem and linearity of expectation.
 
E[X] = E\left[-\frac{Y}{2} + \frac{3}{2}\right] = -\frac{1}{2}E[Y] + \frac{3}{2} = 1

Var[X] = E[X^2] - E[X]^2
= E\left[\frac{Y^2}{4} - \frac{3Y}{2} + \frac{9}{4}\right] - 1
= \frac{1}{4}E[Y^2] - \frac{3}{2}E[Y] + \frac{9}{4} - 1
= \frac{9}{4} - \frac{3}{2} + \frac{9}{4} - 1
= 2

Problem 20.

(a) The range of X is {1, 2, 3, 4, 5, 6} and the probability for any of these values, x, is simply
Nx /1000, where Nx is the number of households with x people. Therefore:


P_X(x) = \begin{cases}
0.1 & x = 1 \\
0.2 & x = 2 \\
0.3 & x = 3 \\
0.2 & x = 4 \\
0.1 & x = 5 \\
0.1 & x = 6.
\end{cases}

The expected value of X is: E[X] = 1 · 0.1 + 2 · 0.2 + 3 · 0.3 + 4 · 0.2 + 5 · 0.1 + 6 · 0.1 = 3.3.

(b) The probability of picking a person from a household with k people is equal to the total
number of people in households with k people divided by the total number of people in the
town. In other words, P (Y = k) = (k · Nk )/3300, so that:



P_Y(y) = \begin{cases}
1/33 & y = 1 \\
4/33 & y = 2 \\
9/33 & y = 3 \\
8/33 & y = 4 \\
5/33 & y = 5 \\
6/33 & y = 6,
\end{cases}

and E[Y ] = 1 · (1/33) + 2 · (4/33) + 3 · (9/33) + 4 · (8/33) + 5 · (5/33) + 6 · (6/33) = 43/11.
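The size-biased mean E[Y] = 43/11 is easy to check by sampling a person uniformly from the 3300 townspeople; a small sketch:

```python
# Build the 3300 townspeople explicitly and average the household size of a
# uniformly chosen person.
import numpy as np

sizes = np.array([1, 2, 3, 4, 5, 6])
counts = np.array([100, 200, 300, 200, 100, 100])   # N_k households of size k
households = np.repeat(sizes, counts)               # one entry per household
people = np.repeat(households, households)          # one entry per person (3300 total)
print(people.mean(), 43 / 11)                       # both ~3.909
```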

Problem 21.
[Figure 3.2: The expected number of tries E[X] to observe all unique coupons at least once, plotted for N = 1 to 50.]

(a) It takes 1 try to observe the first unique coupon. Let this first coupon be called type C1 . Let
the random variable, X1 , be the number of times it takes to observe a coupon different than
type C1 . Call this type C2 . Let the random variable, X2 , be the number of times it takes to
observe a coupon different than type C1 and C2 . Call this type C3 . Let us proceed in this
fashion until we observe N − 1 unique coupons. Finally, let the random variable XN −1 be
the number of times it takes to observe a coupon different than type C1 , C2 , . . . , CN −1 , and
call this coupon type CN . Therefore, the total number of times it takes to observe all unique
coupons at least once is X = 1 + X1 + X2 + . . . + XN −1 .
For each Xi , if we consider choosing C1 , C2 , . . . or, Ci−1 as a failure and Ci as a success, we
see that this is nothing more than a geometric random variable with probability (N − i)/N
of success (since there are N − i un-observed coupons left). Therefore, X1 , X2 , . . . , XN −1 ∼
Geom( NN−i ). Further let X0 ∼ Geom( NN−0 ), and note that the probability of observing X0 = 1
for this distribution is unity since we are sure to have a success on the first trial. Thus, if we
desire, we can replace 1 in X = 1 + X1 + X2 + . . . + XN −1 with the “random variable” X0 .

(b) The expected number of tries it takes to observe all unique coupons at least once is:

E[X] = E[1 + X_1 + X_2 + \ldots + X_{N-1}]
= 1 + E[X_1] + E[X_2] + \ldots + E[X_{N-1}]
= 1 + \frac{N}{N-1} + \frac{N}{N-2} + \ldots + \frac{N}{N-(N-1)}
= N \sum_{i=0}^{N-1} \frac{1}{N-i}.

The summation can be written in terms of a special function (called the digamma function),
but I believe it is more illustrative to plot the actual function itself. In Fig. 3.2, I show E[X]
for N = 1 to 50 which I calculated numerically with the summation formula I derived above.
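The formula can also be compared against a direct simulation of the coupon-collecting process; a sketch for N = 10:

```python
# Coupon collector: formula E[X] = N * sum_{i=0}^{N-1} 1/(N-i) vs. simulation.
import numpy as np

def collect_once(N, rng):
    seen, draws = set(), 0
    while len(seen) < N:
        seen.add(int(rng.integers(N)))   # each coupon type equally likely
        draws += 1
    return draws

rng = np.random.default_rng(1)
N = 10
formula = N * sum(1 / (N - i) for i in range(N))
simulated = np.mean([collect_once(N, rng) for _ in range(20_000)])
print(formula, simulated)   # both ~29.3
```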

Problem 22.

(a) Let X' be the number of tosses until the game ends. We recognize that X' is distributed as a geometric random variable with p = q = 1/2 since the coin is fair. The range of X' is R_{X'} = \{1, 2, 3, 4, \ldots\}. Let the random variable X denote the amount of money won from the game, which has range R_X = \{1, 2, 4, 8, \ldots\}. The function f : R_{X'} \to R_X is given by the bijective mapping 2^{X'-1}. Thus, the PMF of X is given by P(X = x) = P(X' = x'), where x' is the pre-image of x under f. That is, the PMF of X is given by: P(X = 1) = P(X' = 1) = p, P(X = 2) = P(X' = 2) = p^2, P(X = 4) = P(X' = 3) = p^3, P(X = 8) = P(X' = 4) = p^4, \ldots.
Thus the expected value of X is given by the following summation, which we see diverges:
E[X] = \sum_{x \in R_X} x P(X = x) = \sum_{k=1}^{\infty} 2^{k-1} P(X' = k)
= \sum_{k=1}^{\infty} 2^{k-1} \left(\frac{1}{2}\right)\left(\frac{1}{2}\right)^{k-1}
= \sum_{k=1}^{\infty} \frac{1}{2}
= \infty.

Thus, only considering your expected winnings (and ignoring issues like the variance of your
winnings and your particular risk tolerance) you would be willing to pay any amount of money
to play this game.
(b) By noting that \ldots, 2^{6-1}, 2^{7-1}, 2^{8-1}, \ldots = \ldots, 32, 64, 128, \ldots, one sees that when X' = 8, X = 2^7 = 128, which is the first time that X takes on a value greater than 65. Therefore, the probability we desire is:

P(X > 65) = \sum_{k=8}^{\infty} P(X' = k)
= 1 - \sum_{k=1}^{7} \left(\frac{1}{2}\right)\left(\frac{1}{2}\right)^{k-1}
= 1 - \left[\frac{1}{2} + \left(\frac{1}{2}\right)^{2} + \ldots + \left(\frac{1}{2}\right)^{7}\right]
= \frac{1}{128}.
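A simulation sketch of part (b): with a fair coin that ends the game on heads, the payout exceeds $65 exactly when the first 7 tosses are all tails, so the empirical frequency should be close to 1/128 ≈ 0.0078.

```python
# Simulate the payout 2**(tosses - 1) and estimate P(payout > 65).
import numpy as np

rng = np.random.default_rng(2)

def payout():
    tosses = 1
    while rng.random() < 0.5:      # tails with probability 1/2 -> keep tossing
        tosses += 1
    return 2 ** (tosses - 1)

n = 1_000_000
print(sum(payout() > 65 for _ in range(n)) / n, 1 / 128)
```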

(c) This problem is very similar to part a, except that the summation is truncated when x takes on the value 2^{30}, which occurs when k = 31; thereafter, the payout remains 2^{30}. Therefore, the expected value of Y is:

E[Y] = \sum_{k=1}^{31} 2^{k-1} P(X' = k) + \sum_{k=32}^{\infty} 2^{30} P(X' = k)
= \sum_{k=1}^{31} \frac{1}{2} + 2^{30} \sum_{k=32}^{\infty} \left(\frac{1}{2}\right)^{k}
= \frac{31}{2} + 2^{30}\, 2^{-32} \sum_{k'=0}^{\infty} \left(\frac{1}{2}\right)^{k'}
= \frac{31}{2} + 2^{30}\, 2^{-32} \frac{1}{1 - 1/2}
= 16.

We therefore see that in part a, the majority of the summation that contributes to the expectation value of X occurs much later in the series. This is called a "paradox" since, in the first part the expectation value was infinite, but in the second part, even though 2^{30} is a very large number, the expected winnings are much lower than what one would have guessed.

Problem 23. The goal is to find:

\alpha^* = \arg\min_{\alpha \in \mathbb{R}} f(\alpha),

where

f(\alpha) = E[(X - \alpha)^2]
= E[X^2 - 2\alpha X + \alpha^2]
= E[X^2] - 2\alpha\mu + \alpha^2.

Therefore:

\alpha^* = \arg\min_{\alpha \in \mathbb{R}} \{E[X^2] - 2\alpha\mu + \alpha^2\} = \arg\min_{\alpha \in \mathbb{R}} \{-2\alpha\mu + \alpha^2\},

which can be found by setting the derivative of this expression,

\frac{d}{d\alpha}(-2\alpha\mu + \alpha^2) = -2\mu + 2\alpha,

equal to zero and solving for \alpha^*. This results in \alpha^* = \mu.

Problem 24. If you choose to roll the die for a second time, your expected winnings is E[Y ] = 3.5.
Therefore, if you roll less than 3.5 on the first roll (i.e., 1, 2 or 3) you should roll again because you
expect to do better on the second roll. However, if you roll a 4, 5 or 6, you will expect to do worse
on the second roll, so you should not roll again.
Given this strategy, your expected winnings is:

E[W] = E[X \cdot 1\{X > 3\}] + E[Y \cdot 1\{X \le 3\}]
= E[X \cdot 1\{X > 3\}] + E[Y]\,E[1\{X \le 3\}]
= \frac{1}{6}\sum_{x=1}^{6} x\,1\{x > 3\} + \frac{7}{2}\cdot\frac{1}{6}\sum_{x=1}^{6} 1\{x \le 3\}
= \frac{1}{6}(4 + 5 + 6) + \frac{7}{2}\cdot\frac{1}{6}(1 + 1 + 1)
= \frac{17}{4}
= 4.25,

where, in the second equality, E[Y 1{X ≤ 3}] = E[Y ]E[1{X ≤ 3}] since X and Y are independent
(given the set strategy).
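The strategy and its expected winnings of 4.25 are easy to check by simulation; a sketch:

```python
# Re-roll only if the first roll is 3 or less; average winnings should be ~4.25.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
first = rng.integers(1, 7, size=n)
second = rng.integers(1, 7, size=n)
winnings = np.where(first > 3, first, second)
print(winnings.mean())   # ~4.25
```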

Problem 25.

(a) In Fig. 3.3 I have plotted both P (X ≥ x) and P (X ≤ x) for this PMF. It is clear from this
figure that in the range [2, ∞), P (X ≤ x) ≥ 1/2, and that in the range (−∞, 2] P (X ≥ x) ≥
1/2. The only value that these ranges share in common is 2, and this is therefore the median
for this PMF.

[Figure 3.3: P(X ≥ x) and P(X ≤ x) for the given PMF of Problem 25a.]

[Figure 3.4: P(X ≥ x) and P(X ≤ x) for a die roll, Problem 25b.]

(b) In Fig. 3.4 I have plotted both P (X ≥ x) and P (X ≤ x) for a die roll. It is clear from this
figure that in the range [3, ∞), P (X ≤ x) ≥ 1/2, and that in the range (−∞, 4] P (X ≥ x) ≥
1/2. The (not unique) medians for this distribution are the intersection of these 2 sets, which
is the interval [3, 4].

(c) We can compute P (X ≤ x) with the geometric distribution explicitly with:

kx u
X
P (X ≤ x) = p q k−1 ,
k=1

where q = 1 − p and kxu is the appropriate upper integer bound which depends on the (not
necessarily integer) value of x. By considering the staircase shape of P (X ≤ x), one can
realize that for any x, P (X ≤ x) = P (X ≤ bxc), which holds up until dxe (where b·c and
d·e are defined as rounding down and up to the nearest integer respectively). Therefore, if
we want to find the lowest value of x, x0 for P (X ≤ x0 ) still equals P (X ≤ x), this occurs at
the integer value x0 = bxc. bxc is therefore the appropriate value to use for kxu , and we can

explicitly compute a formula for P (X ≤ x):

P (X ≤ x) = P (X ≤ bxc)
bxc
X
−1
= pq qk
k=1
 
bxc
X
= pq −1 −q 0 + qk 
k=0
 

X ∞
X
= pq −1 −1 + qk − qk 
k=0 k=bmc+1
" ∞ ∞
#
X X
k0
= pq −1 −1 + q k − q bxc+1 q
k=0 k0 =0
" #
−1 1 q bxc+1
= pq −1 + −
1−q 1−q
= 1 − q bxc .

Any value m, for which P (X ≤ m) ≥ 1/2 is a potential candidate for the median (but of
course we still have to consider the values of x for which P (X ≥ x) ≥ 1/2) and the lowest
value for which this occurs, call it bmL c, can now be found by setting P (X ≤ bmL c) = 1/2,
resulting in:
1
bmL c = .
log2 1/q
Similarly:

P (X ≥ x) = P (X ≥ dxe)

X
−1
= pq qk
k=dxe

X
dxe−1 0
= pq qk
k0 =0
pq dxe−1
=
1−q
= q dxe−1 .

Thus, the highest value for which P (X ≥ x) ≥ 1/2, call it dmU e is found when this equation
equals 1/2, resulting in:
1
dmU e = + 1.
log2 1/q
Therefore, for the geometric distribution, P (X ≤ m) ≥ 1/2 and P (X ≥ m) ≥ 1/2 for
x ∈ [bmL c, dmU e]. This interval thus gives the (not unique) medians for the geometric
distribution.
Chapter 4

Continuous and Mixed Random Variables

Problem 1.
(a) We recognize that this is a uniform random variable, so its CDF is:


F_X(x) = \begin{cases}
0 & x < 2 \\
\frac{x-2}{4} & 2 \le x \le 6 \\
1 & x > 6.
\end{cases}

(b) For a uniform random variable, the expectation value is at the midpoint: E[X] = 2 + [(6 −
2)]/2 = 4.

Problem 2.
(a) By normalization, $1 = c\int_0^{\infty} e^{-4x}\,dx$, which leads to c = 4.

(b)
F_X(x) = \begin{cases}
0 & x \le 0 \\
\int_0^x 4e^{-4x'}\,dx' = 1 - e^{-4x} & x > 0.
\end{cases}

(c)
Z 5
P (2 < X < 5) = 4 e−4x dx = e−8 − e−20 .
2

(d) To find the expectation value, I use integration by parts:

Z ∞
E[X] = 4 xe−4x dx
0
Z ∞
∞
=− xe−4x 0 + e−4x dx
0
1
= ,
4
where the limits in the first term were evaluated using L’hopital’s rule.
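Parts (c) and (d) can be cross-checked with scipy's exponential distribution, which is parameterized by scale = 1/λ (so scale = 1/4 here); a sketch:

```python
# Numerical check of P(2 < X < 5) and E[X] for X ~ Exp(rate 4).
import numpy as np
from scipy.stats import expon

X = expon(scale=1 / 4)
print(X.cdf(5) - X.cdf(2), np.exp(-8) - np.exp(-20))   # P(2 < X < 5)
print(X.mean())                                        # E[X] = 1/4
```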

Problem 3.
(a) Using LOTUS:
   Z 1  
n 2 2 n 2 2
E X X + = x x + dx
3 0 3
1 1  2  1 1

= x3+n + xn+1
3+n 0 3 n+1 0
 
1 2 1
= + for n = 1, 2, 3, . . .
3+n 3 n+1

(b) We have already found E[X] and E[X 2 ] in the first part and thus:

V ar[X] = E[X 2 ] − E[X]2


     
1 2 1 1 2 1 2
= + − +
3+2 3 2+1 3+1 3 1+1
59
= .
720

Problem 4.
(a) For this problem, we have RX = [0, 1] and RY = [1/e, 1]. Thus in range of x = 0 to 1, the
CDF of Y is given by:

FY (y) = P (Y ≤ y)
= P (e−X ≤ y)
= P (X ≥ − ln y)
= 1 − FX (− ln y)
= 1 + ln y,

where I used the fact that for x ∈ [0, 1] for a uniform 0, 1 distribution, FX (x) = x. Therefore:


F_Y(y) = \begin{cases}
0 & y < \frac{1}{e} \\
1 + \ln y & \frac{1}{e} \le y \le 1 \\
1 & y > 1.
\end{cases}

(b)
f_Y(y) = \frac{dF_Y}{dy} = \begin{cases}
\frac{1}{y} & \frac{1}{e} \le y \le 1 \\
0 & \text{otherwise.}
\end{cases}

(c)
Z 1
1 1
E[Y ] = y dy = 1 −
1/e y e

Problem 5.

(a) The range of X and Y are RX = (0, 2] and RY = (0, 4], so that for y ∈ RY , we have

FY (y) = P (Y ≤ y)
= P (X 2 ≤ y)

= P (0 < X < y)
Z √y
5 4
= x dx
0 32
1
= y 5/2 ,
32
and therefore: 

0 for y ≤ 0
FY (y) = 1 5/2
y for 0 < y ≤ 4
 32

1 for y > 4.

(b) 

0 for y ≤ 0
dFY 5 3/2
fY (y) = = 64 y for 0 < y ≤ 4
dy 

0 for y > 4

(c)
Z 4
5
E[Y ] = y · y 3/2 ≈ 2.9
64 0

Problem 6. We can convert the PDF for X to the PDF for Y using the method of transformations:
(
 y  d(y/α) λ
λ −α y
fY (y) = fX = αe for y > 0
,
α dy 0 otherwise

and since both α, λ > 0, Y ∼ Exp(λ/α).

Problem 7.

(a) We can prove this relation using integration by parts:

Z ∞
n
E[X ] = xn λe−λx dx
0
∞ n Z ∞

= −xn e−λx + xn−1 λeλx dx
0 λ 0
n
= E[X n−1 ],
λ

where the first term evaluated to zero by repeated application of L’Hopital’s rule.

(b) We can use several properties of the Gamma function to prove this relation:

Z ∞
E[X n ] = xn λe−λx dx
0
λ
= Γ(n + 1)
λn+1
n!
= n,
λ

where in the second equality I used the second property of the Gamma function given in the
book, and in the third equality I used the fourth property of the Gamma function given in
the book.

Problem 8.

(a)
 
0−3
P (X > 0) = 1 − Φ ≈ 0.84
3

(b)
   
8−3 −3 − 3
P (−3 < X < 8) = Φ −Φ ≈ 0.93
3 3

(c)

P (X > 5, X > 3)
P (X > 5|X > 3) =
P (X > 3)
P (X > 5)
=
P (X > 3)

1 − Φ 5−3
3 
=
1 − Φ 3−3
3
≈ 0.50

Problem 9. By Theorem 4.3 in the book, if X ∼ N (3, 9), and Y = 5−X, then Y ∼ N (−3+5, 9) =
N (2, 9).

(a)  
2−3
P (X > 2) = 1 − Φ ≈ 0.63
3

(b)    
3−2 −1 − 2
P (−1 < Y < 3) = Φ −Φ ≈ 0.47
3 3

(c)

P (X > 4|Y < 2) = P (X > 4|5 − X < 2)


P (X > 4, X > 3)
=
P (X > 3)
P (X > 4)
=
P (X > 3)

1 − Φ 4−3
3 
=
1 − Φ 3−3
3
≈ 0.74
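All three parts can be verified with scipy.stats.norm (recall the standard deviation is 3 here, and Y = 5 − X ∼ N(2, 9)); a sketch:

```python
# Numerical check of the normal-distribution probabilities in this problem.
from scipy.stats import norm

X = norm(loc=3, scale=3)
Y = norm(loc=2, scale=3)
print(1 - X.cdf(2))                        # part (a) ~0.63
print(Y.cdf(3) - Y.cdf(-1))                # part (b) ~0.47
print((1 - X.cdf(4)) / (1 - X.cdf(3)))     # part (c) ~0.74
```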

Problem 10. I first note that RX = R, and RY = [0, ∞). The range of X can be partitioned into
2 regions, X ≤ 0 and X > 0 which are strictly decreasing, and increasing respectively, where the
corresponding inverse transformation back to X for both of these regions is:
(
−Y 2 for X ≤ 0
X=
Y2 for X > 0.

Therefore:

d(y 2 ) 2
2 d(−y )
2
fY (y) = fX (y ) + fX (−y )
dy dy
4 y 4
= √ ye− 2 for y ≥ 0,

which, as a sanity check, I made sure analytically integrates to unity over the range 0 to infinity.

Problem 11.

(a)
Z ∞
P (X > 2) = 2 e−2x = e−4
2

(b) I calculate E[Y ] using LOTUS:

Z ∞
E[Y ] = (2 + 3x)2e−2x dx
0
Z ∞ Z ∞
−2x
=2 2e dx + 3 2xe−2x dx
0 0
= 2 + 3E[X]
3
=2+
2
7
= ,
2
where I have used the fact that E[X] for Exp(λ) is 1/λ (as computed in the book). To
compute V ar[Y ], I first must compute E[Y 2 ], which I do using LOTUS:

Z ∞
E[Y 2 ] = (2 + 3x)2 2e−2x dx
0
Z ∞ Z ∞ Z ∞
−2x −2x
=4 2e dx + 12 2xe dx + 9 2x2 e−2x dx
0 0 0
= 4 + 12E[X] + 9E[X 2 ]
12 9 · 2
=4+ +
2 4
29
= ,
2

where I have used the fact that E[X 2 ] for Exp(λ) is 2/(λ2 ) (as computed in the book).
Finally, the variance is:

9
V ar[Y ] = E[Y 2 ] − E[Y ]2 = .
4

(c)

P (X > 2|Y < 11) = P (X > 2|2 + 3X < 11)


= P (X > 2|X < 3)
P (2 < X < 3)
=
P (X < 3)
R 3 −2x
2 e dx
= R23
2 0 e−2x dx
e2 − 1
=
e6 − 1
≈ 1.6 × 10−2

Problem 12. The equations defining the median for a continuous variable, P (X < m) = 1/2 and
P (X ≥ m) = 1/2, are actually equivalent. That is, P (X < m) = 1/2 ⇔ P (X ≥ m) = 1/2 (which
can easily be verified), so we can use whichever is convenient. Since we know the CDFs for the
desired distributions, so using the condition that P (X < m) = 1/2 will be most convenient.

(a) The CDF for the U nif (a, b) is:




0 for x < a
P (X < x) = x−a
for a ≤ x ≤ b
 b−a

1 for x > b,
where I have ignored the equality in the argument of P , since P (X = x) = 0 for a continuous
random variable. Setting this equation equal to 1/2 and solving for m, I find:
b+a
m= ,
2
which is the mean of the uniform distribution, which was expected since the uniform distri-
bution is symmetric about its mean.
(b) The CDF for the Exp(λ) is:
(
0 for y < 0
P (Y < y) =
1 − e−λy for y ≥ 0,
where again, I have ignored the equality in the argument of P , since P (Y = y) = 0. Thus we
see that 1/2 = 1 − exp (−λm) =⇒ m = ln 2/λ.
(c) For the N (µ, σ 2 ),

P (W < m) = P (W ≤ m)
 
m−µ

σ
1
= .
2
Since the standard normal is symmetric about 0, this implies that Φ(0) = 1/2, and therefore
(m − µ)/σ = 0, and thus m = µ. Since we knew that a Gaussian is symmetric about its
mean, this is what we expected.
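As a quick numerical illustration of part (b), scipy's exponential median agrees with ln 2/λ; λ = 2 below is an arbitrary choice.

```python
# Check the exponential median m = ln(2)/lambda.
import numpy as np
from scipy.stats import expon

lam = 2.0
print(expon(scale=1 / lam).median(), np.log(2) / lam)   # both ~0.3466
```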

Problem 13.
(a) See Fig. 4.1 for a plot of the CDF. X is a mixed random variable because there is a jump in
the CDF at x = 1/4 (indicating a probability “point mass” at 1/4 of P (X = 1/4) = 1/2) and
the CDF does not exhibit the staircase shape associated with only discrete random variables.
(b)
   
1 1
P X≤ = FX
3 3
1 1
= +
3 2
5
=
6
[Figure 4.1: The CDF for Problem 13.]

(c)
   
1 1
P X≥ =1−P X <
4 4
    
1 1
=1− P X ≤ −P X =
4 4
    
1 1
= 1 − FX −P X =
4 4
 
1 1 1
=1− + −
4 2 2
3
=
4

(d) The CDFs for both the discrete and continuous contributions can be written piece-wise as:
(
0 for x < 14
D(x) = 1
2 for x ≥ 14 ,

and 

0 for x < 0
C(x) = x for 21 ≤ x ≤ 1

1
2

2 for x > 12 .
These functions can be re-written using the unit step function:
 
1 1
D(x) = u x − ,
2 4

and    
1 1 1
C(x) = xu(x) − xu x − + u x− ,
2 2 2
where, in C(x), I have started subtracting off the linear equation at x = 1/2, and adding a
constant 1/2 at x = 1/2 so as to keep the function flat at 1/2 after x = 1/2.

(e) Since C(x) increases linearly from 0 to 1/2, and has total probability mass 1/2, we expect
c(x) to be uniform with height 1 over the range 0 to 1/2. Differentiating C(x)

d
c(x) = C(x)
dx      
du(x) 1 d 1 1 d 1
= u(x) + x −u x− −x u x− + u x−
dx 2 dx 2 2 dx 2
     
1 1 1 1
= u(x) + xδ(x) − u x − − xδ x − + δ x−
2 2 2 2
 
1
= u(x) − u x − ,
2
which is exactly what we had anticipated. Here I used the fact that when x 6= 1/2, xδ(x−1/2)
and (−1/2)δ(x − 1/2) equal 0, while for for x = 1/2, (−1/2)δ(x − 1/2) = (−1/2)δ(0) and
xδ(x − 1/2) = (1/2)δ(0). Also, in either case, xδ(x) = 0.
(f)
Z ∞ X
E[X] = xc(x)dx + xk ak
−∞ k
Z 1/2  
1 1
= xdx + P X =
0 4 4
1 1/2 1 1
= x2 dx + ·
2 0 4 2
1
=
4
Problem 14.

(a) The generalized PDF of X is:


d
fX (x) = FX (x)
dx
d d
= D(x) + C(x)
dx  dx

1 1
= δ x− + c(x)
2 4
    
1 1 1
= δ x− + u(x) − u x − ,
2 4 2
where c(x) was found in the previous problem.
(b)
Z ∞   Z ∞ Z ∞  
1 1 1
E[X] = xδ x − dx + xu(x)dx − xu x − dx
2 −∞ 4 −∞ −∞ 2
Z   Z 1/2
1 ∞ 1
= xδ x − dx + xdx
2 −∞ 4 0
1 1 1
= · +
2 4 8
1
=
4

(c)

Z ∞ 
 Z 1/2
2 1 2 1
E[X ] = x δ x− dx + x2 dx
2 −∞ 4 0
 
1 1 2 1 3 1/2
= + x
2 4 3 0
7
=
96

=⇒

V ar[X] = E[X 2 ] − E[X]2


 2
7 1
= −
96 4
1
=
96

Problem 15.

(a) From the form of the given generalized PDF, it is clear that there are 2 probability point
masses at x = 1 and x = −2 (with P (X = 1) = 1/6 and P (X = −2) = 1/3), as well as a
continuous random variable contribution from a Gaussian PDF. Since the continuous PDF
contributes 0 probability at specific points, P (X = 1) = 1/6 and P (X = −2) = 1/3.

(b)

Z ∞ 
1 1 1 1 − x2
P (X ≥ 1) = δ(x + 2) + δ(x − 1) + √ e 2 dx
1 3 6 2 2π
Z ∞
1 1 1 x2
= + √ e− 2 dx
6 2 1 2π
1 1
= + [1 − Φ(1)]
6 2
≈ 0.25

(c)

P (X = 1, X ≥ 1)
P (X = 1|X ≥ 1) =
P (X ≥ 1)
P (X = 1)
=
P (X ≥ 1)
1
6
= 1
6 + 12 [1 − Φ(1)]
≈ 0.68

(d) We can calculate E[X] by explicitly integrating over the generalized PDF:
Z ∞  
1 1 1 1 x2
E[X] = x δ(x + 2) + δ(x − 1) + √ e− 2 dx
−∞ 3 6 2 2π
Z ∞
1 1 1 1 x2
= (−2) + (1) + x √ e− 2 dx
3 6 2 −∞ 2π
−2 1 1
= + + ·0
3 6 2
1
=− ,
2
where the integral in the second line is equal to zero, since this is just the mean of a standard
normal distribution.
We can also calculate E[X 2 ] by explicitly integrating over the generalized PDF:
Z ∞  
2 2 1 1 1 1 − x2
E[X ] = x δ(x + 2) + δ(x − 1) + √ e 2 dx
−∞ 3 6 2 2π
Z ∞
1 1 1 1 x 2
= (4) + (1) + x2 √ e− 2 dx
3 6 2 −∞ 2π
4 1 1
= + + (1)
3 6 2
= 2,

where the integral in the second line is equal to 1, since this is just the variance of a standard
normal distribution.
Thus:

V ar[X] = E[X 2 ] − E[X]2


 2
1
=2− −
2
7
= .
4

Problem 16.

(a) Let D denote the event that the device is defective, and let P (D) = pd = 0.02. By the law of
total probability, we have:

FX (x) = P (X ≤ x)
= P (X ≤ x|D)P (D) + P (X ≤ x|Dc )P (Dc )
= u(x)pd + (1 − e−λx )(1 − pd )u(x).

By differentiating, we can find the generalized PDF:


d
fX (x) = FX (x)
dx h i
= pd δ(x) + (1 − pd ) (1 − e−λx )δ(x) + u(x)λe−λx
= pd δ(x) + (1 − pd )u(x)λe−λx ,

where I have used the fact that at x 6= 0, δ(x) = 0, and at x = 0, (1 − exp (−λx)) = 0, so that
the term, (1−e−λx )δ(x) is 0 for all x. We also could have written this PDF down immediately
by realizing that there is a probability point mass at x = 0 with total probability 0.02, and
there is a continuous probability contribution from the exponential distribution which must
integrate to 1 − 0.02 = 0.98.

(b)

Z ∞h i
P (X ≥ 1) = pd δ(x) + (1 − pd )u(x)λe−λx dx
1
Z ∞
= (1 − pd ) λe−λx dx
1
−λ
= (1 − pd )e
= (0.98)e−2
≈ 0.133

(c)

P (X > 2, X ≥ 1)
P (X > 2|X ≥ 1) =
P (X ≥ 1)
P (X > 2)
=
P (X ≥ 1)
R ∞ −λx
e dx
= R2∞ −λx
1 e dx
= e−λ
= e−2
≈ 0.135

(d) The expectation value of X is:

Z ∞ h i
E[X] = x pd δ(x) + (1 − pd )u(x)λe−λx dx
−∞
Z ∞
= (1 − pd ) xλe−λx dx
0
1
= (1 − pd )
λ
1
= (0.98)
2
= 0.49,

where I have used the fact that E[X] = 1/λ for an exponential distribution.

The expectation value of X 2 is:


Z ∞ h i
2
E[X ] = x2 pd δ(x) + (1 − pd )u(x)λe−λx dx
0
Z ∞
= (1 − pd ) x2 λe−λx dx
0
2
= (1 − pd ) 2
λ
1
= (0.98)
2
= 0.49,

where I have used the fact that E[X 2 ] = 2/λ2 for an exponential distribution. Therefore, the
variance is:

V ar[X] = 0.49 − 0.492


≈ 0.25.

Problem 17.

(a) We realize that for Lap(0, 1), fX (x) is an even function, while x is an odd function and
therefore E[X] = 0. Also,
Z ∞
1
2
E[X ] = 2 x2 e−x dx = 2,
0 2

where I have used the fact that since we are integrating an even function times an even
function we need only integrate from 0 to ∞ and multiply by 2. I have also used the fact
that the integrand is E[X 2 ] of an Exp(1) distribution, and we know this integral evaluates
to 2/λ2 . Therefore V ar[X] = 2.

(b) We can solve for fY (y) using the method of transformations:

   
y−µ d y−µ
fY (y) = fX
b dy b
( 
1
exp y−µ 1
for y−µ
b <0
= 12  b y−µb 
1 y−µ
2 exp − b b for b ≥ 0
( 
1
exp y−µ for y < µ
= 2b1
 b y−µ 
2b exp − b for y ≥ µ,

which is Lap(µ, b).

(c) Since

E[Y ] = E[bX + µ]
= bE[X] + µ
= µ,

and

E[Y 2 ] = E[(bX + µ)2 ]


= b2 E[X 2 ] + 2bµE[X] + µ2
= 2b2 + µ2 ,

the variance is V ar[Y ] = E[Y 2 ] − E[Y ]2 = 2b2 .

Problem 18. We see firstly, that RX = R and RY = [0, ∞). Also note that Y = |X| = −X
for X < 0 and X for X ≥ 0. I use the method of transformations, breaking fX (x) into 2 strictly
(1) (2)
monotonic regions. Let fX (x) = (1/2b) exp (x/b) and fX (x) = (1/2b) exp (−x/b), then:


(1) dy (2) d(−y)
fY (y) = fX (−y) + fX (y)
dy dy
1  y 1  y
= exp − + exp −
2b b 2b b
1  y
= exp − ,
b b

which is Exp(1/b).

Problem 19. Letting u ≡ 1 + x2 , the expectation value becomes:

Z 0 Z ∞
1 x 1 x
E[X] = 2
dx + dx
−∞ π 1 + x 0 π 1 + x2
Z 1 Z ∞
1 du 1 du
= +
∞ 2π u 1 2π u
1 1 1 ∞

= ln(1 + x2 ) + ln(1 + x2 )
2π ∞ 2π 1
= −∞ + ∞,

which is not well defined.

Z ∞
2 1 x2
E[X ] = 2
dx
−∞ π 1 + x
Z
2 ∞ x2
= dx
π 0 1 + x2
2
= [x − arctan(x)]∞ 0
π
2
= lim [x − arctan(x)]
π x→∞
2 2 π
= lim x − ·
π x→∞ π 2
=∞

Problem 20.

(a) The expectation value is:


Z ∞
1 2 2
E[X] = 2 x2 e−x /2σ dx
σ
√ 0 Z ∞ 
2π 2 2 −x2 /2σ 2
= √ x e dx
2σ 2πσ 0

2π 2
σ
2σ r
π
=σ ,
2

where I have use the fact that the term in the brackets is the same integral one must compute
to find the variance of a 0, σ 2 normal distribution.

(b) The integral we must calculate is


Z x
x0 − x0 22 0
FX (x) = e 2σ dx ,
0 σ2

which can be computed by a simple substitution, u ≡ x0 2 /(2σ 2 ), so that:

Z x2
2σ 2 x2
FX (x) = e−u du = 1 − e− 2σ2 .
0

(c) The range of both X and Y is [0, ∞). Therefore, for all y ∈ [0, ∞):

   2 
y2 d y
fY (y) = fX
2σ 2 dy 2σ 2

 y − y22
e 2σ for y ≥ 0
= σ2
0 for y < 0,

which is Rayleigh(σ).

Problem 21.

(a) The CDF is given by:


Z x
−α−1
FX (x) = αxαm x0 dx0
xm
 x α
m
=1−
x

for x ≥ xm and 0 otherwise.



(b)
P (X > 3xm , X > 2xm )
P (X > 3xm |X > 2xm ) =
P (X > 2xm )
P (X > 3xm )
=
P (X > 2xm )
1 − P (X ≤ 3xm )
=
1 − P (X ≤ 2xm )
1 − FX (3xm )
=
1 − FX (2xm )
h  α i
xm
1 − 1 − 3x m
= h  α i
xm
1 − 1 − 2xm
 α
2
=
3

(c) The expectation value is:


Z ∞
E[X] = αxαm x−α−1 xdx
xm
αxαm h x i∞
=
1 − α x α xm
αxm
= ,
α−1
where the limx→∞ (x/xα ) term evaluates to 0 since α > 2. The expectation value of X 2 is:
Z ∞
2 α
E[X ] = αxm x1−α xdx
xm
αx2m
= ,
α−2
where the limx→∞ (x2 /xα ) term evaluates to 0 since α > 2. Thus, the variance is

V ar[X] = E[X 2 ] − E[X]2


 
αx2m αxm 2
= −
α−2 α−1
αxm 2
= .
(α − 1)2 (α − 2)

Problem 22.
(a)

FX (x) = P (X ≤ x)
= P (eσZ+µ ≤ x)
 
ln x − µ
=P Z≤
σ
 
ln x − µ

σ

(b) I first find the PDF for the log-normal distribution:


d
fX (x) = FX (x)
dx  
d ln x − µ
= Φ
dx σ
 
0 ln x − µ 1

σ xσ
 
ln x − µ 1
= fZ
σ xσ
"   #
1 1 1 ln x − µ 2
= √ exp − ,
x 2πσ 2 σ

for all x ∈ (0, ∞). The expectation value of X is thus


Z ∞ "   #
1 1 ln x − µ 2
E[X] = √ exp − ,
2πσ 0 2 σ

which can be simplified with the following substitution:

u ≡ ln x − µ,

=⇒
1
du ≡ dx = e−(u+µ) dx,
x
so that:
Z ∞  
1 1 u2
E[X] = √ exp − · 2 exp (u + µ)du
2πσ −∞ 2 σ
  Z ∞  
σ2 1 1 2 2
= exp µ + √ exp − 2 (u − σ ) du
2 2πσ −∞ 2σ
 2

σ
= exp µ + .
2
To go from the first equality to the second, I used the “completing the squares” trick to make
the exponent in the form of a Gaussian√ for easy integration. In going from the second equality
to the third, I used the fact that 1/( 2πσ) times the integral evaluates to 1 since this is just
a σ 2 , σ 2 Gaussian.
The expectation value of X 2 is
Z "  2 #

1 1 ln x − µ
E[X 2 ] = √ x exp − ,
2πσ 0 2 σ

and we can make the same substitutions, resulting in


Z ∞  
2 1 1 u2
E[X ] = √ exp − · 2 exp (2u + 2µ)du
2πσ −∞ 2 σ
Z ∞  
2
 1 1 2 2
= exp 2µ + 2σ √ exp − 2 (u − 2σ ) du
2πσ −∞ 2σ
2

= exp 2µ + 2σ ,

where, again, I used the completing the squares trick, and where the 1/( 2πσ) times the
integral evaluates to 1 since this is just a 2σ 2 , σ 2 Gaussian.
Therefore:

Var[X] = E[X^2] - E[X]^2 = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right).
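These moments can be checked against scipy.stats.lognorm, which uses s = σ and scale = e^µ; the values of µ and σ in the sketch below are arbitrary.

```python
# Compare the derived log-normal mean and variance with scipy.
import numpy as np
from scipy.stats import lognorm

mu, sigma = 0.5, 0.8
X = lognorm(s=sigma, scale=np.exp(mu))
print(X.mean(), np.exp(mu + sigma**2 / 2))
print(X.var(), np.exp(2 * mu + sigma**2) * (np.exp(sigma**2) - 1))
```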

Problem 23. The expectation value is:


n
E[Y ] = E[X1 + X2 + . . . + Xn ] = nE[X1 ] = ,
λ
since the expectation value for an Exp(λ) distribution is 1/λ.
Since X1 , X2 , . . . , Xn and iid, the variance is linear, and thus:
n
V ar[Y ] = V ar[X1 + X2 + . . . + Xn ] = nV ar[X1 ] = ,
λ2
since the variance for an Exp(λ) distribution is 1/λ2 .
Chapter 5

Joint Distributions


Problem 1.

(a)
1 1
P (X ≤ 2, Y > 1) = 0 + =
12 12

(b)
 
1 1 5

 3 + 12 for x = 1 
 12 for x = 1
PX (x) = 16 + 0 for x = 2 = 1
for x = 2

1 1

5
6

12 + 3 for x = 4 12 for x = 4

( (
1 1 1 7
PY (y) = 3 + 6 + 12 for y = 1
= 12 for y = 1
1 1 5
12 + 0 + 3 for y = 2 12 for y = 2

(c)
1
P (Y = 2, X = 1) 12 1
P (Y = 2|X = 1) = = 5 =
P (X = 1) 12
5

1 5
P (Y = 2|X = 1) = 6= P (Y = 2) =
5 12

=⇒ not independent

Problem 2.

(a) The ranges of X, Y and Z are: RX = {1, 2, 4}, RY = {1, 2} and RX = {−3, −2, −1, 0, 2}.
The mapping g(x, y) = x − 2y (where g : RX × RY → RZ ) is explicitly given by:

(1, 1) → −1
(1, 2) → −3
(2, 1) → 0
(2, 2) → −2
(4, 1) → 2
(4, 2) → 0.

Thus we have the following:





P (X − 2Y = −3) for z = −3



P (X − 2Y = −2) for z = −2

PZ (z) = P (X − 2Y = −1) for z = −1



P (X − 2Y = 0) for z =0



P (X − 2Y = 2) for z =2


PXY (1, 2) for z = −3



 = −2
PXY (2, 2) for z
= PXY (1, 1) for z = −1



PXY (2, 1) + PXY (4, 2) for z =0



PXY (4, 1) for z =2

1

 for z = −3


12


0 for z = −2
= 31 for z = −1


 1 for z = 0


 2
1
12 for z = 2,

which, as a sanity check, add up to 1.

(b)

P (X = 2|Z = 0) = P (X = 2|X − 2Y = 0)
= P (X = 2|X = 2Y )
P (X = 2, X = 2Y )
=
P (X = 2Y )
P (X = 2, Y = 1)
=
P (X = 2, Y = 1) + P (X = 4, Y = 2)
1
=
3

Problem 3. Let A be the event that the first coin we pick is the fair coin. We can find the joint
PMF by conditioning on this event, and realizing that once conditioned, X and Y are independent
(i.e., X and Y are conditionally independent given A):

PXY (x, y) = P (X = x, Y = y|A)P (A) + P (X = x, Y = y|Ac )P (Ac )


   
1 1
= P1/2 (x)P2/3 (y) + P2/3 (x)P1/2 (y) ,
2 2

where Pp (z) is the PMF associated with a Bern(p) trial. This PMF can be written conveniently
as Pp (z) = pz (1 − p)1−z , so that the joint PMF is
 2  y  1−y  2  x  1−x
1 2 1 1 2 1
PXY (x, y) = + .
2 3 3 2 3 3

This can also be written in tabular form as:



Y =0 Y =1

1 1
X=0 6 4

1 1
X=1 4 3

To check if X and Y are independent I first check (x, y) = (0, 0). Adding the table horizontally and vertically, the marginalized PMFs at these values are P_X(0) = 5/12 and P_Y(0) = 5/12, and thus P_X(0)P_Y(0) = 25/144 \neq 1/6, so X and Y are not independent.
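A simulation sketch of the experiment (pick a coin at random, toss it for X, then toss the other coin for Y) reproduces the joint PMF in the table above:

```python
# Simulate the two-coin experiment and tabulate the empirical joint PMF.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
first_is_fair = rng.random(n) < 0.5
p_x = np.where(first_is_fair, 1 / 2, 2 / 3)    # bias of the coin tossed first
p_y = np.where(first_is_fair, 2 / 3, 1 / 2)    # bias of the other coin
X = (rng.random(n) < p_x).astype(int)
Y = (rng.random(n) < p_y).astype(int)

for x in (0, 1):
    for y in (0, 1):
        print(x, y, np.mean((X == x) & (Y == y)))   # ~1/6, 1/4, 1/4, 1/3
```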

Problem 4.

(a) I first find the marginalized PMFs:

X∞
1
PX (k) = k+l
2
l=1
" ∞  l
#
1 X 1
= k −1 +
2 2
l=0
1
= k,
2
and by symmetry
1
PY (l) = ,
2l
We then have that:
1
PXY (k, l) =
2k+l
1
=
2k 2l
1 1
= k· l
2 2
= PX (k)PY (l) ∀ (k, l) ∈ N × N,

and thus X and Y are independent.

(b) We can easily enumerate all pairs of (x, y) that satisfy this inequality:

P (X 2 + Y 2 ≤ 10) = PXY (1, 1) + PXY (1, 2) + PXY (1, 3) + PXY (2, 1) + PXY (2, 2) + PXY (3, 1)
1 1 1 1 1 1
= + 2
+ 3
+ 2 + 2 2+ 3
2·2 2·2 2·2 2 ·2 2 ·2 2 ·2
11
= .
16

Problem 5.

(a)  1

 3
for x = 1 

 1 1 1
 4
for x = 1
 3 + 61+ 12 7
PX|Y (x|1) = for x = 2 = 2
1
6
+ 16 + 12
1 for x = 2

 3 
1
7

 1
for x = 4
 1
12
for x = 4 7
3
+ 16 + 12
1

(b)
4 2 1 12
E[X|Y = 1] = (1) · + (2) · + (4) · =
7 7 7 7
(c)
     
12 2 4 12 2 2 12 2 1 52
V ar[X|Y = 1] = 1 − · + 2− · + 4− · =
7 7 7 7 7 7 49

Problem 6. We know that X ∼ P ois(10) and since each customer is female independent of the
other customers, if the total number of customers in an hour is n, then the total number of female
customers in an hour is the sum of n independent Bernoulli random variables. In other words,
Y |X = n ∼ Bin(n, 3/4). Therefore, the joint PMF is:

P (X = n, Y = y) = P (Y = y|X = n)P (X = n)
   y  n−y n −10
n 3 1 10 e
= .
y 4 4 n!

Problem 7. We know that for a Geom(p) distribution the mean is 1/p and the variance is (1 −
p)/p2 , so we should expect these answers. We can find the mean by conditioning on the first “toss”:

E[X] = E[X|H]p(H) + E[X|H c ]p(H c )


= 1 · p + (1 + E[X])(1 − p),

where E[X|H] = 1 since if we know the first toss is a heads, the experiment is done so that the
mean is 1, and E[X|H c ] = (1 + E[X]) since if the first toss is a tails, then we’ve wasted 1 toss, and
since the geometric distribution is memoryless, it starts over at the next toss. Solving this equation
for E[X] we find that E[X] = 1/p, which is what we expected.
We can solve for E[X 2 ] in a similar manner:

E[X 2 ] = E[X 2 |H]p(H) + E[X 2 |H c ]p(H c )


= 1 · p + E[(1 + X)2 ](1 − p)
= p + E[X 2 ](1 − p) + 2E[X](1 − p) + (1 − p),

where E[X 2 |H] = 1 for the same reason as above, and E[X 2 |H c ] = E[(1 + X)2 ] since, as above,
we’ve wasted 1 toss on the first toss, and then the experiment starts over on the second. Solving this
equation, I find E[X 2 ] = (2 − p)/p2 . The variance is thus: V ar[X] = E[X 2 ] − E[X]2 = (1 − p)/p2 ,
which is what we expected.
iid
Problem 8. If X, Y ∼ Geom(p), the we can easily find the joint PMF and use LOTUS to solve
for the expectation. The joint PMF is:

PXY (x, y) = PX (x)PY (y) = p(1 − p)x−1 p(1 − p)y−1 for x, y = 1, 2, . . . , (5.1)

where I have multiplied the marginal PMFs since X and Y are independent. Using LOTUS:
[Figure 5.1: A visual representation of the set C for Problems 9 and 10.]

  X ∞  2
∞ X 
X2 + Y 2 x + y2
E = p(1 − p)x−1 p(1 − p)y−1
XY xy
x=1 y=1
∞ X
X ∞ ∞ X
X ∞
x y
= p(1 − p)x−1 p(1 − p)y−1 + p(1 − p)x−1 p(1 − p)y−1
y x
x=1 y=1 y=1 x=1
X∞ X ∞
x
=2 p(1 − p)x−1 p(1 − p)y−1
y
x=1 y=1

X ∞
X
x−1 1
=2 xp(1 − p) p(1 − p)y−1 ,
y
x=1 y=1

where going from the second to third line we realize that due to the symmetry, both of the sums
are the same. In the last line the first sum is just the mean of a Geom(p) distribution (1/p). We
can simplify the second sum by utilizing the following Taylor expansion:

X xk
− ln(1 − x) = for |x| < 1.
k
k=1

Thus, we arrive at:


 
X2 + Y 2 2
E = p(1 − p)−1 (− ln p)
XY p
 
2 1
= ln .
1−p p

Problem 9. To better understand what is in the set C, note that C is the set of (x, y) ∈
Z × Z, such that y ≤ 2 − x2 for y ≥ 0 and y ≥ x2 − 2 for y < 0. To visualize C, I plot
the set Z × Z as grey points in Fig. 5.1 as well as the lines y = 2 − x2 and y = x2 − 2. The
shaded grey region (and the lines themselves) represents the region satisfying the 2 conditions,
and thus any grey point in this region (or on the lines) is in C. Therefore, more explicitly,
C = {(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (0, −1), (1, −1), (0, −2), (−1, 0), (−1, 1), (−1, −1)}.

(a) The joint PMF is: (


1
11 for (x, y) ∈ C
PXY (x, y) =
0 otherwise.
By looking at Fig. 5.1 and adding vertically and horizontally, we can easily determine that
the marginal PMFs are:

3

 11 for x = −1
PX (x) = 11 5
for x = 0

3
11 for x = 1
and 
1

 for y = −2


11

 3
for y = −1
 11
PY (y) = 3
for y =0


11

 3
for y =1

 11
1
11 for y = 2.

(b) Since there are 3 points at Y = 1, and each point is equally as likely, the total probability mass
at Y = 1 is 3/11, while the total probability mass at (−1, 1), (0, 1), (1, 1) is 1/11 respectively.
Therefore: 
1

 3 for x = −1
PX|Y (x|1) = 31 for x = 0

1
3 for x = 1.

(c) X and Y are not independent since, for example, at X = −1: P (X = −1|Y = 1) = 1/3 6=
P (X = −1) = 3/11.

(d) Using LOTUS, we have


X 1 X 1  
E[XY 2 ] = xy 2 PXY (x, y) = xy 2 = 1 · 12 + 1 · (−1)2 − 1 · 12 − 1 · (−1)2 = 0,
11 11
(x,y)∈C (x,y)∈C

where only 4 points contribute to the sum (since the rest have zeros).

Problem 10.

(a)
X 1 1 1
E[X|Y = 1] = xPX|Y (x|1) = (−1) · + (0) · + (1) · = 0
3 3 3
x∈RX|Y =1

(b)
X 1 1 1 2
V ar[X|Y = 1] = x2 PX|Y (x|1) = (1) · + (0) · + (1) · =
3 3 3 3
x∈RX|Y =1

(c) One can easily see that the PMF, PX||Y |≤1 (x) is exactly the same as the PMF for PX|Y (x|1),
and therefore the expectation and variance will be the same, thus E[X||Y | ≤ 1] = 0.

(d) For the same reason as part c of this problem E[X 2 ||Y | ≤ 1] = 2/3.

Problem 11. If there are n cars in the shop, then X = X1 +X2 +. . .+Xn , where Xi is a Bern(3/4)
random variable (as specified in the problem), and where X1 , X2 , . . . , Xn are all independent (as
specified in the problem). Thus we have that X|N = n ∼ Bin(n, 3/4) and for the same reason,
Y |N = n ∼ Bin(n, 1/4).
(a) Noting that RX = RY = {0, 1, 2, 3}, we can use the law of total probability to find both of
the marginal PMFs, which are:
3
X
PX (x) = P (X = x|N = n)PN (n)
n=0
X3    x  n−x
n 3 1
= PN (n),
x 4 4
n=0
and
3
X
PY (y) = P (Y = y|N = n)PN (n)
n=0
X3    y  n−y
n 1 3
= PN (n).
y 4 4
n=0
I compute both of these PMFs numerically to find:


0.180 for x = 0




0.258 for x = 1
PX (x) ≈ 0.352 for x = 2



 0.211 for x = 3



0 otherwise,
and 

0.570 for y = 0




0.336 for y = 1
PY (y) ≈ 0.086 for y = 2



0.008 for y = 3



0 otherwise,
which, as a sanity, both add up to approximately 1. We see that since a 4 door car is
more likely than a 2 door car, the marginalized PMF for the 4 door cars skews towards high
numbers, while the marginalized PMF for the 2 door cars skews towards lower numbers.
(b) We can find the joint PMF for X and Y by conditioning on N and using the law of total
probability:
3
X
PXY (x, y) = P (X = x, Y = y|N = n)PN (n),
n=0
where we can get rid of the sum because the probability is 0 if x + y 6= n:
PXY (x, y) = P (X = x, Y = y|N = x + y)PN (x + y)
= P (X = x|Y = y, N = x + y)P (Y = y|N = x + y)PN (x + y)
= P (Y = y|N = x + y)PN (x + y)
   y  x
x+y 1 3
= PN (x + y),
y 4 4

where in the second line I have used the chain rule of probability, in the third I have used the
fact that given Y = y and N = x + y, we are sure that X = x, and in the fourth line I have
used the fact that Y |N = x + y ∼ Bin(x + y, 1/4). I compute the joint PMF numerically and
present the results in the following table:

Y =0 Y =1 Y =2 Y =3

X=0 0.125 0.031 0.016 0.008

PXY (x, y) ≈ X=1 0.094 0.094 0.070 0

X=2 0.141 0.211 0 0

X=3 0.211 0 0 0

As a check, I made sure that the above PMF sums to approximately 1.

(c) X and Y are not independent since P_{XY}(x, y) \neq P_X(x)P_Y(y) for some (x, y). For example, P_{XY}(0, 0) = 0.125, while P_X(0)P_Y(0) \approx 0.180 \cdot 0.570 = 0.103.

Problem 12. I first note that RX = RY = {1, 2, 3, 4, 5} and RZ = {−4, −3, . . . , 3, 4}. I can find
PZ (z) by conditioning on either X or Y and by using independence:

P_Z(z) = P(Z = z)
= P(Y = X - z)
= \sum_{x=1}^{5} P(Y = X - z \mid X = x) P_X(x)
= \sum_{x=1}^{5} \frac{1}{5} P(Y = x - z \mid X = x)
= \frac{1}{5} \sum_{x=1}^{5} P(Y = x - z)
= \frac{1}{25} \sum_{x=1}^{5} 1\{x - z \in R_Y\},

where in going from the fourth to fifth line I used independence. Thus we have:

P_Z(z) = \begin{cases}
1/25 & z = -4 \\
2/25 & z = -3 \\
3/25 & z = -2 \\
4/25 & z = -1 \\
5/25 & z = 0 \\
4/25 & z = 1 \\
3/25 & z = 2 \\
2/25 & z = 3 \\
1/25 & z = 4.
\end{cases}

Problem 13.

(a)
P_X(x) = \begin{cases}
\frac{1}{6} + \frac{1}{6} + \frac{1}{8} = \frac{11}{24} & x = 0 \\
\frac{1}{8} + \frac{1}{6} + \frac{1}{4} = \frac{13}{24} & x = 1
\end{cases}

P_Y(y) = \begin{cases}
\frac{1}{6} + \frac{1}{8} = \frac{7}{24} & y = 0 \\
\frac{1}{6} + \frac{1}{6} = \frac{1}{3} & y = 1 \\
\frac{1}{8} + \frac{1}{4} = \frac{3}{8} & y = 2
\end{cases}

(b)
P_{X|Y}(x|0) = \begin{cases}
\frac{1/6}{1/6 + 1/8} = \frac{4}{7} & x = 0 \\
\frac{1/8}{1/6 + 1/8} = \frac{3}{7} & x = 1
\end{cases}

P_{X|Y}(x|1) = \begin{cases}
\frac{1/6}{1/6 + 1/6} = \frac{1}{2} & x = 0 \\
\frac{1/6}{1/6 + 1/6} = \frac{1}{2} & x = 1
\end{cases}

P_{X|Y}(x|2) = \begin{cases}
\frac{1/8}{1/8 + 1/4} = \frac{1}{3} & x = 0 \\
\frac{1/4}{1/8 + 1/4} = \frac{2}{3} & x = 1
\end{cases}

(c) We know that



E[X|Y = 0] with probability PY (0)
Z = E[X|Y = 1] with probability PY (1)


E[X|Y = 2] with probability PY (2),
or in other words: 

PY (0) for z = E[X|Y = 0]
PZ (z) = PY (1) for z = E[X|Y = 1]


PY (2) for z = E[X|Y = 2].

We already know the marginal PMF of Y , and thus what is left to calculate is E[X|Y = y]
for all y ∈ RY :
X
E[X|Y = 0] = xPX|Y (x|0)
x∈RX
   
4 3
=0 +1
7 7
3
= ,
7

X
E[X|Y = 1] = xPX|Y (x|1)
x∈RX
   
1 1
=0 +1
2 2
1
= ,
2

X
E[X|Y = 2] = xPX|Y (x|2)
x∈RX
   
1 2
=0 +1
3 3
2
= .
3
Finally, we have that:

7

 24 for z = 37
1
PZ (z) = for z = 12

3
3

8 for z = 23 .

(d) For this problem we are checking that the law of iterated expectations holds. That is, we
need to check explicitly that E[X] = E[E[X|Y ]], where the outer expectation on the RHS is
over Y . Computing the LHS I have:
X    
11 13 13
E[X] = xPX (x) = 0 +1 = .
24 24 24
x∈RX

Computing the RHS I have:

E[Z] = EY [E[X|Y ]]
X
= E[X|Y = y]PY (y)
y∈RY
        
3 7 1 1 2 3
= + +
7 24 2 3 3 8
13
= .
24
The LHS and RHS agree, and thus the law of iterated expectations holds.

(e)
X
E[Z 2 ] = z 2 PZ (z)
z∈RZ
 2    2    2  
3 7 1 1 2 3
= + +
7 24 2 3 3 8
17
=
56
=⇒  2
2 17 2 13 41
V ar[Z] = E[Z ] − E[Z] = − =
56 24 4032

Problem 14.

(a) As with the previous problem, we know that




V ar[X|Y = 0] with probability PY (0)
V = V ar[X|Y = 1] with probability PY (1)


V ar[X|Y = 2] with probability PY (2),
or in other words: 

PY (0) for v = V ar[X|Y = 0]
PV (v) = PY (1) for v = V ar[X|Y = 1]


PY (2) for v = V ar[X|Y = 2].

Thus, we must compute V ar[X|Y = y] = E[X 2 |Y = y] + E[X|Y = y]2 for all y ∈ RY .


E[X|Y = y] was already computed in the previous problem, and, since to compute E[X 2 |Y =
y], both terms in the summation are the same as the two terms in the summation to compute
E[X|Y = y] (since 02 = 0 and 12 = 1), we have that E[X 2 |Y = y] = E[X|Y = y] (this can
be seen more clearly if one explicitly writes out the summation for E[X 2 |Y = y]). Thus we
have that:
 2
2 2 3 3 12
V ar[X|Y = 0] = E[X |Y = 0] − E[X|Y = 0] = − = ,
7 7 49
 2
2 1 2 1 1
V ar[X|Y = 1] = E[X |Y = 1] − E[X|Y = 1] = − = ,
2 2 4
and  2
2 2 2 2 2
V ar[X|Y = 2] = E[X |Y = 2] − E[X|Y = 2] = − = .
3 3 9
The PMF for V is thus 
7

 24 for v = 12
49
1
PV (v) = for v = 14

3
3

8 for v = 29 .

(b)
X 12 7 1 1 2 3 5
E[V ] = vPV (v) = · + · + · =
49 24 4 3 9 8 21
v∈RV

(c) In this problem we are checking that the law of total variance, V ar[X] = EY [V ar[X|Y ]] +
V arY [E[X|Y ]], holds (where the subscript Y on the expectation and variance denotes with
respect to the random variable Y .) Computing the LHS:

V ar[X] = E[X 2 ] − E[X]2


 2
X X
= x2 PX (x) −  xPX (x)
x∈RX x∈RX
 
11 13 11 13 2
= 02 · + 12 · − 0· +1·
24 24 24 24
143
= .
576
Computing the RHS:

EY [V ar[X|Y ]] + V arY [E[X|Y ]] = E[V ] + V ar[Z]


5 41
= +
21 4032
143
= ,
576

which is in agreement with the LHS of the equation. Note that E[V ] and V ar[Z] were
computed in this problem and the previous problem.

Problem 15. The law of total expectation gives:



X
E[Y ] = E[Y |N = n]PN (n)
n=0

" N
#
X X
= E Xi |N = n PN (n)
n=0 i=1

" n
#
X X
= E Xi PN (n)
n=0 i=1
X∞ Xn
= E[Xi ]PN (n)
n=0 i=1
X∞ X n
e−β β n
= E[Xi ]
n!
n=0 i=1
X∞ X n
1 e−β β n
=
λ n!
n=0 i=1

e−β X nβ n
= ,
λ n!
n=0

where in going from the second to the third line I have used the fact that Xi and N are independent
(for all i), in going from the third to the fourth line I have used the linearity of expectation,
and in going from the fifth to sixth line I have used the fact that for an Exp(λ) distribution,
E[X] = 1/λ . This summation can be computed by considering the Taylor expansion of the

P∞ n
exponential, exp(x) = n=0 (x )/n!. Taking the derivative of both sides of this formula with
respect to x, we find that the desired sum is:

X nxn
= xex ,
n!
n=0

and hence
β
E[Y ] =
.
λ
P
The calculation for V ar[Y ] is similar. For this calculation, we will need ∞ 2 n
n=0 (n x )/n!, which
can be found with the same differentiation strategy. I differentiate the equation for the previous
summation with respect to x once more and solve for the desired summation to find

X n2 xn
= xex + x2 ex ,
n!
n=0

where I have used the chain rule in differentiating.


To find V ar[Y ] I now solve for E[Y 2 ], which is only moderating more complicated than solving
for E[Y ]. The law of total expectation gives:

X
E[Y 2 ] = E[Y 2 |N = n]PN (n)
n=0
 !2 

X N
X
= E Xi |N = n PN (n)
n=0 i=1
 !2 

X n
X
= E Xi  PN (n)
n=0 i=1
 

X Xn X
= E Xi2 + Xj Xk  PN (n)
n=0 i=1 j,k:j6=k
 
∞ X
X n X 
2
= E[Xi ] + E[Xj Xk ] PN (n)
 
n=0 i=1 j,k:j6=k
 
X∞ X n X 
= E[Xi2 ] + E[Xj ]E[Xk ] PN (n)
 
n=0 i=1 j,k:j6=k
 
X∞ X n X 1
2
= + PN (n)
 λ2 λ2 
n=0 i=1 j,k:j6=k
∞ 
X 
2n n2 − n e−β β n
= +
λ2 λ2 n!
n=0
∞ ∞
e−β X nβ n e−β X n2 β n
= + 2 ,
λ2 n! λ n!
n=0 n=0

where in going from the second to third line I have used the fact that Xi and N are independent
(for all i), in going from the third to fourth line I have broken the square of the summation
P into the
summation of the squares plus the summation of the cross-terms. The notation j,k:j6=k denotes

a sum over all possible tuples of (j, k), where j, k = 1, 2, . . . , n, except the tuples where j = k. In
going from the fourth to fifth line I have used the linearity of expectation, in going from the fifth
to sixth line I have used the independence of all Xi s, and in going from the sixth to seventh line I
have used the fact that for an Exp(λ) distribution, E[X 2 ] = 2/λ2 (as calculated in the book). The
first summationP summation has already been solved for, and to solve the second summation, I use
n2 xn
the formula for ∞ n=0 n! as derived above. Thus I have that

1 β e−β  β 
E[Y 2 ] = · + 2 βe + β 2 eβ
λ λ λ
β 2 + 2β
= .
λ2

Finally we have that:

V ar[Y ] = E[Y 2 ] − E[Y ]2


β 2 + 2β β 2
= − 2
λ2 λ

= 2.
λ

Problem 16.

(a)

Z 1Z ∞
1= fXY (x, y)dxdy
0 0
Z 1Z ∞ Z 1Z ∞
1 −x y
= e dxdy + c dxdy
0 0 2 0 0 (1 + x)2
Z 1Z ∞ Z 1Z ∞
1 −x y
= e dxdy + c dudy
0 0 2 0 1 u2
1 c
= +
2 2

=⇒
c=1

(b)

  Z 1/2 Z 1
1
P 0 ≤ X ≤ 1, 0 ≤ Y ≤ = fXY (x, y)dxdy
2 0 0
Z 1/2 Z 1  
1 −x y
= e + dxdy
0 0 2 (1 + x)2
1 1
= (1 − e−1 ) +
4 16
≈ 0.22

(c)
Z 1Z 1
P (0 ≤ X ≤ 1) = fXY (x, y)dxdy
0 0
Z 1Z 1 
1 −x y
= e + dxdy
0 0 2 (1 + x)2
1 1
= (1 − e−1 ) +
2 4
≈ 0.57

Problem 17.

(a)
Z
fX (x) = fXY (x, y)dy
RY
Z ∞
= e−xy dy
0
 −xy ∞
e
=−
x 0
1
=
x
=⇒ (
1
x for 1 ≤ x < e
fX (x) =
0 otherwise

Z
fY (y) = fXY (x, y)dx
R
Z eX
= e−xy dx
1
 −xy e
e
=−
y 1
 
1 1 1
= − ey
y ey e

=⇒ (
1 1 1

y ey − eey for y > 0
fY (y) =
0 otherwise

(b)
Z √
√ 1Z e
P (0 ≤ Y ≤ 1, 1 ≤ X ≤ e) = e−xy dxdy
0 1

Problem 18.

(a)
Z
fX (x) = fXY (x, y)dy
RY
Z 2 
1 2 1
= x + y dy
0 4 6
1 2 1
= x +
2 3
=⇒ (
1 2 1
2x + 3 for −1 ≤ x ≤ 1
fX (x) =
0 otherwise

Z
fY (y) = fXY (x, y)dx
RX
Z 1 
1 2 1
= x + y dx
4 −1 6
1 1
= y+
3 6
=⇒ (
1 1
3y + 6 for 0 ≤ y ≤ 2
fY (y) =
0 otherwise

(b)
Z 1Z 1 
1 2 1
P (X > 0, Y < 1) = x + y dxdy
0 0 4 6
1
=
6

(c) Using inclusion-exclusion:

P (X > 0 ∪ Y < 1) = P (X > 0) + P (Y < 1) − P (X > 0, Y < 1)


Z 1Z 2  Z 1 Z 1 
1 2 1 1 2 1 1
= x + y dydx + x + y dydx −
0 0 4 6 −1 0 4 6 6
1 1 1
+ −
2 3 6
2
= .
3

(d)

P (X > 0, Y < 1)
P (X > 0|Y < 1) =
P (Y < 1)
1
6
= 1
3
1
= .
2
[Figure 5.2: The region of integration for Problem 18 (e) (shaded region).]

(e) We must be slightly careful in choosing the bounds of integration for this problem. The upper
bound of the y integral is the upper bound of RY , and the lower bound of the y integral is
max{0, −x}, and not simply −x. This is because for x > 0, −x < 0, but the lower bound of
the range of Y is 0. An illustration of the domain of the double integral is shown in Fig. 5.2.
The probability we seek is thus:

P (X + Y > 0) = P (Y > −X)


Z 1Z 2  
1 2 1
= x + y dydx
−1 max{0,−x} 4 6
Z 1 2
1 2 1
= x y + y2 dx
−1 4 12 max{0,−x}
Z Z 1
1 1 2 1 
= x (2 − max{0, −x}) dx + 4 − max{0, −x}2 dx
4 −1 12 −1
Z 0 Z 1 Z 0 Z 1
1 2 1 2 1 2
 1
= x (2 + x) dx + 2x dx + 4 − x dx + 4dx
4 −1 4 0 12 −1 12 0
131
= .
144

Problem 19.

(a)

∂FXY
fXY (x, y) =
∂x∂y
= e−x 2e−2y

=⇒ (
e−x 2e−2y for x, y > 0
fXY (x, y) =
0 otherwise

(b)
  Z ∞ Z x/2
1
P Y > X = e−x 2e−2y dydx
2
Z0 ∞ 0

= e−x − e−2x dx
0
1
=
2

(c) X and Y are independent because the joint PDF can be factored into the product of 2 marginal
PDFs. Specifically, the joint PDF can be factored into fX (x)fY (y) where X ∼ Exp(1) and
where where Y ∼ Exp(2).

Problem 20.

(a) To calculate the PDF, we simply need to condition on X > 0, and since a N (0, 1) distribution
is symmetric about zero, we know that P (X > 0) = 1/2:

fX (x)
fX|X>0 (x) =
P (X > 0)

 √2 − x22
e for x > 0
= 2π
0 otherwise.

To find the conditional CDF, we need only to integrate the Gaussian:


Z x
1 x0
2
FX|X>0 (x) = 2 √ e− 2 dx0

0 
1
= 2 Φ(x) − ,
2
so that (
2Φ(x) − 1 for x > 0
FX|X>0 (x) =
0 otherwise.

(b)
Z ∞
E[X|X > 0] = xfX|X>0 (x)dx
0
Z ∞
2 x2
=√ xe− 2 dx
2π 0
Z ∞
1 u
=√ e− 2 du
2π 0
2
=√

(c) We can compute E[X 2 |X > 0] by noting that if Y ∼ N (0, 1), then:

1 = E[Y 2 ]
Z ∞
2 y2
=√ y 2 e− 2 dy,
2π 0

where I have used the fact that y 2 times exp (−y 2 /2) is an even function, so I need only
integrate from 0 to infinity and multiply by 2. Thus we have that
Z ∞
2
E[X |X > 0] = x2 fX|X>0 (x)dx
0
Z ∞
2 x2
=√ x2 e− 2 dx
2π 0
= 1.
Finally, we have that:
V ar[X|X > 0] = E[X 2 |X > 0] − (E[X|X > 0])2
 2
2
=1− √

π−2
= .
π
Problem 21.
(a) I first find the marginal PDF of Y :
Z  1 
2 1
fY (y) = x + y dx
−1 3
2
= (1 + y),
3
so that we have
fXY (x, y)
fX|Y (x|y) =
fY (y)
x + 13 y
2
= 2 ,
3 (1 + y)
and therefore: (
3x2 +y
2(1+y) for −1 ≤ x ≤ 1
fX|Y (x|y) =
0 otherwise.
(b)
Z 1
P (X > 0|Y = y) = fX|Y (x|y)dx
0
Z 1
1 
3x2 + y dx
2(1 + y) 0
1
= .
2
Notice that the probability, P (X > 0|Y = y), does not depend on y.
(c) We have already found the marginal PDF of Y , and now I find the marginal PDF of X:
Z 1 
1
fX (x) = x2 + y dy
0 3
1
= x2 + .
6
We thus see that $f_X(x)f_Y(y) = \frac{2x^2}{3} + \frac{y}{9} + \frac{2yx^2}{3} + \frac{1}{9} \neq f_{XY}(x, y)$, and so X and Y are not independent.



Problem 22. I start by first finding the marginal PDF of X:


Z 1 
1 2 2
fX (x) = x + y dy
0 2 3
1 2 1
= x + ,
2 3
which is valid for −1 ≤ x ≤ 1, otherwise fX (x) = 0.
I now find the PDF of Y conditioned on X = 0:
fXY (0, y)
fY |X (y|0) =
fX (0)
2
3y
= 1
3
= 2y

valid for 0 ≤ y ≤ 1. I may now calculate E[Y |X = 0],


Z
E[Y |X = 0] = yfY |X (y|0)dy
RY |X=0
Z 1
= 2y 2 dy
0
2
= ,
3
and E[Y 2 |X = 0],
Z
E[Y 2 |X = 0] = y 2 fY |X (y|0)dy
RY |X=0
Z 1
= 2y 3 dy
0
1
= .
2
Therefore, the variance is: V ar[Y |X = 0] = E[Y 2 |X = 0] − (E[Y |X = 0])2 = 1/2 − (2/3)2 = 1/18.

Problem 23.
(a) The set E is a diamond shaped region in R2 , upper-bounded by 1 − |x| and lower-bounded by
|x| − 1, as shown in Fig. 5.3. The area of the region is thus 4 times the area of a triangle with
a base length of 1 and height length of 1: 4 · (1/2) · (1) · (1) = 2. Since the total probability
must integrate to unity, we thus have c = 1/2.

(b) The marginal PDF of X is given by


Z 1−|x|
1
fX (x) = 2 dy
0 2
= 1 − |x|,

so, that (
1 − |x| for −1 ≤ x ≤ 1
fX (x) =
0 otherwise.
[Figure 5.3: A visual representation of the set E for Problem 23.]

By symmetry, we also know that

(
1 − |y| for −1 ≤ y ≤ 1
fY (y) =
0 otherwise.

(c) The conditional PDF is given by

fXY (x, y)
fX|Y (x|y) =
fY (y)
(
1
2(1−|y|) for x, y ∈ E
=
0 otherwise.

(d) X and Y are not independent, as it is clear that $f_{X|Y}(x|y) \neq f_X(x)$.

Problem 24. The marginal PDFs for X and Y are given by

(
1
2 for 0 ≤ x ≤ 2
fX (x) =
0 otherwise,

and
(
1
2 for 0 ≤ y ≤ 2
fY (y) =
0 otherwise.

I solve for the desired probability by conditioning on Y and using the fact that X and Y are

independent:
Z 2
P (XY < 1) = P (XY < 1|Y = y)fY (y)dy
0
Z 2
= P (Xy < 1)fY (y)dy
0
Z 2  
1
= P X< fY (y)dy
0 y
Z 2 Z min{2,1/y}
= fX (x)fY (y)dxdy
0 0
Z  
1 2 1
= min 2, dy
4 0 y
Z 1/2 Z 2 !
1 1
= 2dy + dy
4 0 1/2 y
1 ln 2
= +
4 2
≈ 0.60.
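A Monte Carlo check of P(XY < 1) = 1/4 + ln 2/2 ≈ 0.597; a sketch:

```python
# Estimate P(XY < 1) for independent X, Y ~ Unif(0, 2).
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 2, size=1_000_000)
y = rng.uniform(0, 2, size=1_000_000)
print(np.mean(x * y < 1), 0.25 + np.log(2) / 2)   # both ~0.597
```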

Problem 25. The easiest way to solve this problem will be to use the law of iterated expectations
and the law of total variance. The following information will be useful: for X ∼ Exp(1), E[X] = 1,
V ar[X] = 1 and E[X 2 ] = 2, and for Y |X ∼ U nif (0, X), E[Y |X] = X/2 and V ar[Y |X] = X 2 /12.

(a) I use the law of iterated expectations, where the subscript on the first expectation denotes
an expectation over X:

E[Y ] = EX [E[Y |X]]


 
X
= EX
2
1
= .
2

(b) I use the law of total variance, where the subscripts denote expectation and variance over X:

V ar[Y ] = EX [V ar[Y |X]] + V arX [E[Y |X]]


 2  
X X
= EX + V arX
12 2
5
= .
12

Problem 26. For X ∼ U nif (0, 1) we have: E[X] = 1/2 and E[X 2 ] = 1/3.

(a) Since X and Y are independent, the expectation of the product is the product of the expec-
tations:

E[XY ] = E[X]E[Y ]
1
= .
4

(b) Since X and Y are independent E[g(X)h(Y )] = E[g(X)]E[h(Y )]:


 
E[eX+Y ] = E eX eY
   
= E eX E eY .

I can compute E[eX ] using LOTUS:


Z
 X
 1
E e = ex dx
0
= e − 1,

and plugging into the previous equation I find that


 
E eX+Y = (e − 1)2 .

(c)

E[X 2 + Y 2 + XY ] = E[X 2 ] + E[Y 2 ] + E[XY ]


1 1 1
= + +
3 3 4
11
=
12

(d) We can compute this expectation with a 2D LOTUS over the joint distribution of X and Y .
Since X and Y are independent, fXY (x, y) = fX (x)fY (y) = 1 for x, y ∈ [0, 1]:
Z 1Z 1
E[Y eXY ] = yexy dxdy
0 0
Z 1
= (1 − ey ) dy
0
= e − 2.

Problem 27. I first note that RX = RY = [0, 1] and that RZ = [0, ∞). I solve for the CDF of Z
by conditioning on X and using the fact that X and Y are independent:

FZ (z) = P (Z ≤ z)
 
X
=P ≤z
Y
 
X
=P Y ≥
z
Z ∞  
x
= P Y ≥ |X = x fX (x)dx
−∞ z
Z 1  
x
= P Y ≥ |X = x dx
0 z
Z 1 
x
= P Y ≥ dx,
0 z

where in the last line I have used the fact that X and Y are independent. To solve for P (Y ≥ X/z)
by integrating FY (y) over y, some care must be taken in the limits of integration. Since x ∈ [0, 1]
[Figure 5.4: a) The x–y plane, where the shaded region denotes the region of non-zero probability for the PDF at hand. b) A plot of the CDF of Z.]

and z ∈ [0, ∞), this implies that x/z ∈ [0, ∞). However, we know that fY (y) = 0 for y > 1 and thus
the lower bound of integration is not simply x/z, but is min{1, x/z}. This can be seen pictorially in
Fig. 5.4 a) (from a point of view of integrating over the joint PDF to solve the problem, rather than
the conditional PDF), which is the x − y plane, where the grey region corresponds to the region of
non-zero joint probability density. For any given z, P (Y ≥ X/z) represents the total probability
mass above the the line defined by x/z. For example, I have, I have drawn 3 different lines in 5.4
a) corresponding to z = 1/2 (highest line), z = 1 (middle line) and z = 2 (lowest line). For each
of these z values, P (Y ≥ X/z) is the fraction of the grey box above that line. We see that when
z increases from 0 to 1 (corresponding to the vertical line along the y-axis and the line y = x),
the total probability mass above the line increases smoothly. However, due to the edge of the box,
there is a kink in the total probability mass above the line when transitioning from z < 1 to z > 1,
which is the same reason we will get a kink in the function FZ (z) due to min{1, x/z}.
Continuing with the calculation:

Z 1
x
FZ (z) = P Y ≥ dx,
0 z
Z 1Z 1
= fY (y)dydx,
0 min{1,x/z}
Z 1h n x oi
= 1 − min 1, dx
0 z
Z 1 n xo
=1− min 1, dx
z
0Z 1 Z 1 
x
=1− 1{x ≥ z}dx + 1{x < z}dx ,
0 0 z

where I have picked out the proper value of the min function by utilizing an indicator function
(which is much nicer to use in an integral since it “kills” the integral whenever the logical condition
evaluates to false).

Thus, for z > 1 we have that:


Z 1
x
FZ (z) = 1 − dx
0 z
1
=1− ,
2z
while for z ∈ [0, 1] we have that:
Z z Z 1 
x
FZ (z) = 1 − dx + dx
0 z z
z
= .
2
In summary we have:

z

2 for 0 ≤ z ≤ 1
FZ (z) = 1 − 1
for z > 1


2z
0 otherwise,

which I plot in Fig. 5.4 b). Notice that even though this is a piecewise function, it appears very
smooth because, at the transition (z = 1), both the actual function and the first derivative match
between the two piecewise regions.
To find the PDF we need only to differentiate:

dFZ (z)
fZ (z) =
dz

1

2 for 0 ≤ z ≤ 1
1
= 2z 2 for z > 1


0 otherwise.
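The CDF derived above is easy to verify by simulating Z = X/Y directly; a sketch:

```python
# Compare the empirical CDF of Z = X/Y with the piecewise formula.
import numpy as np

rng = np.random.default_rng(6)
z_samples = rng.random(1_000_000) / rng.random(1_000_000)
for z in (0.5, 1.0, 2.0, 4.0):
    analytic = z / 2 if z <= 1 else 1 - 1 / (2 * z)
    print(z, np.mean(z_samples <= z), analytic)
```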

Problem 28.

(a) To find fU |X (x|u) I first solve for the conditional CDF then differentiate with respect to u:

FU |X (u|x) = P (U ≤ u|X = x)
= P (X + Y ≤ u|X = x)
= P (x + Y ≤ u|X = x)
= P (Y ≤ u − x)
= Φ(u − x),

where in the fourth line I have used the fact that X and Y are independent. To find the
conditional PDF I now differentiate:
∂FU |X
= Φ0 (u − x)
∂u
1 (u−x)2
= √ e− 2 ,

and thus we see that:


U |X = x ∼ N (x, 1).

(b) If X ∼ N (µx , σx2 ) and X ∼ N (µy , σy2 ) are independent, then, as shown in the book by the
method of convolution, X + Y ∼ N (µx + µy , σx2 + σy2 ), and thus:

U ∼ N (0, 2).

(c) To find fX|U (x|u), I use Bayes' rule for PDFs:

fU |X (u|x)fX (x)
fX|U (x|u) =
fU (u)
(u−x)2 x 2
√1 e− 2 √1 e− 2
2π 2π
= u 2
1

2 π
e− 4
1 (u−x)2 x2 u2
= √ e− 2 − 2 + 4
π
1 u 2
= √ e−(x− 2 )
π
2
(x− u
2)
1 −
2· 1
=√ √ e 2 ,
2π(1/ 2)

where I have used the “completing the square” trick in the exponential to make it more
Gaussian. We recognize this distribution as a normal, u/2, 1/2 distribution:
 
u 1
X|U = u ∼ N , .
2 2

(d) Since X|U = u ∼ N (u/2, 1/2), we have:
\[
E[X \mid U = u] = \frac{u}{2}, \qquad \mathrm{Var}[X \mid U = u] = \frac{1}{2}.
\]
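A short numerical illustration of parts (a)–(d), assuming X, Y iid N(0, 1) as in the problem: because (X, U) is jointly Gaussian, the least-squares slope of X on U equals Cov[X, U]/Var[U] = 1/2 and the residual variance equals Var[X|U] = 1/2. This is only a sanity-check sketch with an arbitrary sample size.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n = 10**6
x = rng.standard_normal(n)
y = rng.standard_normal(n)
u = x + y                                        # U = X + Y ~ N(0, 2)

slope, intercept = np.polyfit(u, x, 1)           # slope should be ~0.5
resid_var = np.var(x - (slope * u + intercept))  # should be ~0.5

print(np.var(u), slope, resid_var)               # ~2.0, ~0.5, ~0.5
\end{verbatim}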

Problem 29. This problem can be solved using the method of transformations. Since X and Y
are independent, we have an axis-aligned 2D Gaussian for the joint distribution:

fXY (x, y) = fX (x)fY (y)


1 − 1 (x2 +y2 )
= e 2 .

I define the functions h1 and h2 as:
(
X = h1 (R, Θ) = R cos Θ
Y = h2 (R, Θ) = R sin Θ,

so that, according to the method of transformations:


∂h
1 ∂h1
fRΘ (r, θ) = fXY (h1 (r, θ), h2 (r, θ)) ∂h
∂r
2
∂θ
∂h2 .
∂r ∂θ

The Jacobian is easy to calculate:



cos θ −r sin θ

J =
sin θ r cos θ
= r(cos2 θ + sin2 θ)
= r,

where I have used the Pythagorean trigonometric identity. Thus, we have that:

1 − (r cos θ)2 − (r sin θ)2


fRΘ (r, θ) = e 2 e 2 r

r − 1 r2
= e 2 ,

where I have again used the Pythagorean trigonometric identity.
If R and Θ are independent, then we can factor fRΘ (r, θ) into fR (r)fΘ (θ). To help determine
what these 2 functions are, I integrate the joint distribution:
Z ∞Z π
1= fRΘ (r, θ)drdθ
0 −π
Z ∞Z π
r − 1 r2
= e 2 drdθ
0 −π 2π
Z ∞
1 2
= re− 2 r dr.
0

1 2
Thus, we see that the function re− 2 r , which only depends on r, is always positive (for r ≥ 0) and
integrates to 1. This is the marginal distribution of R. The function 1/(2π) is always positive and
integrates to 1 for θ ∈ [−π, π], and this is thus the marginal distribution of Θ. Therefore we see
that fRΘ (r, θ) can be factored into fR (r)fΘ (θ), and thus R and Θ are independent.
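As an informal numerical check of this independence (a sketch only; the sample size and the choice of angular sectors are arbitrary), one can transform standard normal pairs to polar coordinates and verify that the distribution of R looks the same in every sector of Θ:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n = 10**6
x = rng.standard_normal(n)
y = rng.standard_normal(n)

r = np.hypot(x, y)            # R = sqrt(X^2 + Y^2), Rayleigh-distributed
theta = np.arctan2(y, x)      # Theta in (-pi, pi]

# Near-zero correlation is only a necessary condition; the per-sector means of R
# should also be (nearly) identical if R and Theta are independent.
print(np.corrcoef(r, theta)[0, 1])
for lo in (-np.pi, -np.pi / 2, 0.0, np.pi / 2):
    sector = (theta >= lo) & (theta < lo + np.pi / 2)
    print(lo, r[sector].mean())   # all ~ sqrt(pi/2) ~ 1.2533
\end{verbatim}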

Problem 30. If X, Y are iid U nif (0, 1), then the joint density is:
\[
f_{XY}(x, y) =
\begin{cases}
1 & \text{for } x, y \in [0, 1]\\
0 & \text{otherwise}.
\end{cases}
\]

The Jacobian has already been calculated in the previous problem (J = r), so that
\[
f_{R\Theta}(r, \theta) = r\, f_{XY}(r\cos\theta, r\sin\theta) =
\begin{cases}
r & \text{for } r\cos\theta,\, r\sin\theta \in [0, 1]\\
0 & \text{otherwise},
\end{cases}
\]

where in Fig. 5.5, I have indicated where in the r − θ plane fRΘ (r, θ) is non-zero.
We can further examine the constraints r cos θ, r sin θ ∈ [0, 1] to gain more insight. Satisfying
these conditions is equivalent to simultaneously satisfying the following four conditions:


 r cos θ ≤ 1


r sin θ ≤ 1
r cos θ ≥ 0



r sin θ ≥ 0.


Figure 5.5: The r − θ plane for Problem 30. The grey region denotes the region in the plane where fRΘ (r, θ) is non-zero.

Figure 5.6: r sin θ (solid line) and r cos θ (dashed line) for 0 ≤ θ ≤ π/2.

Since r is always positive, the last 2 conditions yield cos θ ≥ 0 and sin θ ≥ 0, which only happens
in the first quadrant, i.e., 0 ≤ θ ≤ π/2. If we plot the first 2 conditions for 0 ≤ θ ≤ π/2, as in
Fig. 5.6, we see that when 0 ≤ θ ≤ π/4, r cos θ ≤ 1 =⇒ r sin θ ≤ 1 and when π/4 ≤ θ ≤ π/2,
r sin θ ≤ 1 =⇒ r cos θ ≤ 1.
Thus, we can re-write fRΘ (r, θ) with the constraints specified a little more explicitly:


r for 0 ≤ θ ≤ π4 and r ≤ cos1 θ
fRΘ (r, θ) = r for π4 ≤ θ ≤ π2 and r ≤ sin1 θ


0 otherwise,

(where the inequalities did not flip when I divided by cos θ and sin θ because these are both positive
in the first quadrant). Note that these constraints imply that fRΘ (r, θ) > 0 in the unit square and
fRΘ (r, θ) = 0 outside of the unit square, as we would expect for X, Y ∼ U nif (0, 1) as shown in
Fig. 5.7. The figure shows the unit square in the x − y plane (where the probability is non-zero),
and it shows that for θ less than π/4, r is constrained by 0 ≤ r ≤ 1/ cos θ (for values of r within the
unit square). One can similarly show in this figure that for θ greater than π/4, 0 ≤ r ≤ 1/ sin θ.

Figure 5.7: The unit square in the x − y plane. In polar coordinates, to be within the unit square, it can be seen geometrically that r is constrained by 0 ≤ r ≤ 1/ cos θ for 0 ≤ θ ≤ π/4, and by 0 ≤ r ≤ 1/ sin θ for π/4 < θ ≤ π/2.

We can check explicitly that this PDF integrates to 1:


\begin{align*}
\int_0^{\pi/4}\!\!\int_0^{1/\cos\theta} r\,dr\,d\theta + \int_{\pi/4}^{\pi/2}\!\!\int_0^{1/\sin\theta} r\,dr\,d\theta
&= \frac{1}{2}\int_0^{\pi/4} \frac{d\theta}{\cos^2\theta} + \frac{1}{2}\int_{\pi/4}^{\pi/2} \frac{d\theta}{\sin^2\theta}\\
&= \frac{1}{2}\Big[\tan\theta\Big]_0^{\pi/4} - \frac{1}{2}\left[\frac{\cos\theta}{\sin\theta}\right]_{\pi/4}^{\pi/2}\\
&= 1.
\end{align*}

For this problem R and Θ are not independent, because there is no way to factor fRΘ (r, θ) into an
equation of just r times an equation of just θ, since the values of r over which the PDF is non-zero
explicitly depend on the values of θ.

Problem 31. The covariance can be computed straight from its definition:

Cov[X, Y ] = E[XY ] − E[X]E[Y ]


1 X
2 1
! 2 
X X X
= xyPXY (x, y) − xPX (x)  yPY (y)
x=0 y=0 x=0 y=0
       
1 1 1 1 1 1 1 1 1
=1· +2· − 1· + + 1· + +2· +
6 6 8 6 6 4 6 8 6
1
= .
24
To calculate ρXY , I first calculate the variances. For X we have
1
X  
1 1 1 11
E[X] = xPX (x) = 1 · + + = ,
8 6 6 24
x=0

1
X  
2 2 1 1 1 11
E[X ] = x PX (x) = 1 · + + =
8 6 6 24
x=0

=⇒
V ar[X] = E[X 2 ] − E[X]2
 2
11 11
= −
24 24
143
= ,
576
and for Y we have
2
X    
1 1 1 1
E[Y ] = yPY (y) = 1 · + +2· + = 1,
4 6 8 6
y=0
2
X    
2 2 1 1 2 1 1 19
E[Y ] = y PY (y) = 1 · + +2 · + =
4 6 8 6 12
y=0
=⇒
V ar[Y ] = E[Y 2 ] − E[Y ]2
19
= − 12
12
7
= .
12
Finally, the correlation is:
\[
\rho_{XY} = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} = \frac{1/24}{\sqrt{\frac{143}{576} \cdot \frac{7}{12}}} \approx 0.11. \tag{5.2}
\]

Thus, there is a weak, positive correlation between X and Y .


Problem 32. We can use several of the items in Lemma 5.3 in the book to solve this problem:
Cov[Z, W ] = Cov[11 − X + X 2 Y, 3 − Y ]
= Cov[−X + X 2 Y, −Y ]
= Cov[X, Y ] − Cov[X 2 Y, Y ]
= −Cov[X 2 Y, Y ]

= − E[X 2 Y 2 ] − E[X 2 Y ]E[Y ]

= − E[X 2 ]E[Y 2 ] − E[X 2 ]E[Y ]2
= −(1 − 0)
= −1,
where in the second line I have used item 5 of Lemma 5.3, in the third item 7, and in the fourth item 2; in the sixth line I have used the fact that X and Y are independent, and in the seventh the fact that X, Y ∼ N (0, 1).
Problem 33. To solve this problem I use several of the items in Lemma 5.3 in the book. Since Z
and W are independent, we have that:
0 = Cov[Z, W ]
= Cov[2X − Y, X + Y ]
= 2Cov[X, X] + 2Cov[X, Y ] − Cov[Y, X] − Cov[Y, Y ]
= 2V ar[X] + Cov[X, Y ] − V ar[Y ]
= 2 · 4 + Cov[X, Y ] − 9

=⇒
Cov[X, Y ] = 1.
The correlation is now straightforward to calculate:

Cov[X, Y ] 1 1
ρXY = p =√ = .
V ar[X]V ar[Y ] 4·9 6

Problem 34. We know that X ∼ U nif (1, 3) (so that E[X] = 2) and Y |X = x ∼ Exp(x) (so
that E[Y |X] = 1/x). Since Cov[X, Y ] = E[XY ] − E[X]E[Y ], and since we know the distribution
of Y |X = x we can probably solve most of the expectations by conditioning on X and using the
law of iterated expectations. To solve for E[Y ] I use the law of iterated expectations (where the
subscript X denotes an expectation over the random variable X):

E[Y ] = EX [E[Y |X]]


 
1
= EX
X
Z 3
1 1
= dx
2 1 x
1
= ln 3.
2
To solve for E[XY ] I also condition on X and “take out what is known”:

E[XY ] = EX [E[XY |X]]


= EX [XE[Y |X]]
 
1
= EX X
X
= 1.

Thus, we have that the covariance is:

Cov[X, Y ] = E[XY ] − E[X]E[Y ]


1
= 1 − 2 · ln 3
2
= 1 − ln 3.
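A brief Monte Carlo sanity check of this covariance (arbitrary sample size); note that numpy parameterizes the exponential by the scale 1/x rather than the rate x:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n = 10**6
x = rng.uniform(1, 3, n)                 # X ~ Unif(1, 3)
y = rng.exponential(scale=1.0 / x)       # Y | X = x ~ Exp(rate = x)

cov_mc = np.mean(x * y) - np.mean(x) * np.mean(y)
print(cov_mc, 1 - np.log(3))             # both ~ -0.0986
\end{verbatim}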

Problem 35. The covariance is:

Cov[Z, W ] = Cov[7 + X + Y, 1 + Y ]
= Cov[X + Y, Y ]
= Cov[X, Y ] + V ar[Y ]
=0+1
= 1,

where I have used the fact that Cov[X, Y ] = 0 since X and Y are independent. Calculating the
variances is easy as well:

V ar[Z] = V ar[7 + X + Y ] = V ar[X] + V ar[Y ] = 2,



and
V ar[W ] = V ar[1 + Y ] = V ar[Y ] = 1.
Thus we have that the correlation is:
Cov[Z, W ] 1
ρZW = p = √ ≈ 0.71.
V ar[Z]V ar[W ] 2
Problem 36.
(a)
2
X + 2Y ∼ N (µX + 2µY , σX + 4σY2 + 2 · 2ρσX σY ) = N (1, 4)
=⇒  
3−1
P (X + 2Y ≤ 3) = Φ = Φ(1) ≈ 0.84.
2
(b)
Cov[X − Y, X + 2Y ] = Cov[X, X] + 2Cov[X, Y ] − Cov[X, Y ] − 2Cov[Y, Y ]
2
= σX + ρσX σY − 2σY2
=1

Problem 37.
(a)
2
X + 2Y ∼ N (µX + 2µY , σX + 4σY2 + 2 · 2ρσX σY ) = N (3, 8)
=⇒
P (X + 2Y > 4) = 1 − P (X + 2Y ≤ 4)
 
4−3
=1−Φ √
8
 
1
=1−Φ √
2 2
≈ 0.36.

(b) Since X and Y are uncorrelated, jointly normal random variables they are independent, and
thus:
E[X 2 Y 2 ] = E[X 2 ]E[Y 2 ]
= (V ar[X] + E[X]2 )(V ar[Y ] + E[Y ]2 )
2
= (σX + µ2X )(σY2 + µ2Y )
= 10.

Problem 38.
(a) X and Y are jointly normal random variables, and thus by Theorem 5.4 in the book:
 
x − µX
Y |X = x ∼ N µY + ρσY , (1 − ρ2 )σY2
σX
 
3(x − 2) 27
=N 1− , .
4 4
We therefore can immediately read off that
3(3 − 2) 1
E[Y |X = 3] = 1 − = .
4 4

(b) Using the same distribution:


27
V ar[Y |X = 2] = .
4
(c) To solve for this problem, I define the random variables U = X + 2Y and V = X + Y and I
solve for the distribution of U |V . Since X and Y are jointly normal random variables, so too
are U and V (since aU + bV = aX + 2aY + bX + bY = (a + b) · X + (2a + b) · Y which we know
is normal for all a, b) and thus Theorem 5.4 in the book gives an equation for the distribution
of U |V . But first to use this formula, we will need to explicitly compute the distributions of
U and V . The distribution of U is
2
U ∼ N (µX + 2µY , σX + 4σY2 + 2 · 2ρσX σY )
= N (4, 28),

and the distribution of V is


2
V ∼ N (µX + µY , σX + σY2 + 2ρσX σY )
= N (3, 7).

I will also need to compute ρU V , so I here solve for Cov[U, V ]:

Cov[U, V ] = Cov[X + 2Y, X + Y ]


= Cov[X, X] + Cov[X, Y ] + 2Cov[Y, X] + 2Cov[Y, Y ]
2
= σX + 3ρσX σY + 2σY2
= 13.

Thus, we have that:


Cov[U, V ] 13 13
ρU V = p =√ =√ ,
V ar[U ]V ar[V ] 28 · 7 196
which, as a sanity check is in between -1 and 1.
Putting everything together into Theorem 5.4 in the book we have:
 
3 − µV 2 2
U |V = 3 ∼ N µU + ρU V σU , (1 − ρU V )σU
σV
   
132
= N 4, 1 − · 28
196
 
27
= N 4, .
7
Finally, now that I have the distribution, I can compute the desired probability:

\begin{align*}
P(X + 2Y \le 5 \mid X + Y = 3) &= P(U \le 5 \mid V = 3)\\
&= \Phi\left(\frac{5 - 4}{\sqrt{27/7}}\right)\\
&= \Phi\left(\sqrt{\frac{7}{27}}\right)\\
&\approx 0.69.
\end{align*}
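As a numerical cross-check (a sketch, not part of the original solution), one can evaluate the conditional CDF of U | V = 3 ∼ N (4, 27/7) directly, and also condition a Monte Carlo sample on a narrow band around X + Y = 3, using the parameters implied by the solution above (µX = 2, µY = 1, σX² = 4, σY² = 9, ρ = −1/2); the band width and sample size are tuning choices.
\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Direct evaluation from U | V = 3 ~ N(4, 27/7):
print(norm.cdf(5, loc=4, scale=np.sqrt(27 / 7)))      # ~0.69

# Monte Carlo with a narrow conditioning band around X + Y = 3:
rng = np.random.default_rng(4)
mean = [2, 1]
cov = [[4, -3], [-3, 9]]                              # Cov[X, Y] = rho*sx*sy = -3
xy = rng.multivariate_normal(mean, cov, size=2 * 10**6)
x, y = xy[:, 0], xy[:, 1]
band = np.abs(x + y - 3) < 0.02
print(np.mean(x[band] + 2 * y[band] <= 5))            # ~0.69
\end{verbatim}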
Chapter 6

Methods for More Than Two


Random Variables


Problem 1.

(a)
Z
fXY (x, y) = fXY Z (x, y, z)dz
(R 1
= 0 (x + y)dz for 0 ≤ x, y ≤ 1
0 otherwise
(
x + y for 0 ≤ x, y ≤ 1
=
0 otherwise

(b)
Z
fX (x) = fXY (x, y)dy
(R 1
= 0 (x + y)dy for 0 ≤ x ≤ 1
0 otherwise
(
x + 21 for 0 ≤ x ≤ 1
=
0 otherwise

(c) First note that:


Z
fZ (z) = fXY Z (x, y, z)dxdy
(R 1 R 1
= 0 0 (x + y)dxdy for 0 ≤ z ≤ 1
0 otherwise
(
1 for 0 ≤ z ≤ 1
=
0 otherwise,

so that
fXY Z (x, y, z)
fXY |Z (x, y|z) = = fXY Z (x, y, z).
fZ (z)

(d)
fXY |Z (x, y|z) = fXY Z (x, y, z) = fXY (x, y)

=⇒ X and Y are independent Z

Problem 2.
Since X, Y , Z are independent, fXY |Z (x, y|1) = fXY (x, y) = fX (x)fY (y), so that:

E[XY |Z = 1] = E[XY ] = E[X]E[Y ] = 0,

and
E[X 2 Y 2 Z 2 |Z = 1] = E[X 2 Y 2 ] = E[X 2 ]E[Y 2 ] = 1.

Problem 3. To solve this problem, I first state a general result for a multivariate normal. Suppose
that
     
\[
\begin{bmatrix} X_A \\ X_B \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_A \\ \mu_B \end{bmatrix}, \begin{bmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{bmatrix} \right),
\]
where XA, µA ∈ R^m, XB, µB ∈ R^n, ΣAA ∈ R^{m×m}, ΣBB ∈ R^{n×n}, ΣAB ∈ R^{m×n} and ΣBA = ΣAB^T. (Note that, here, I have written this vector equation out in the so-called “partitioned form” for convenience.) Then, it is not difficult to show that¹:
\[
X_A \mid X_B = x_B \sim N\left( \mu_A + \Sigma_{AB}\Sigma_{BB}^{-1}(x_B - \mu_B),\; \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA} \right).
\]

I now define the random variable, U = Y + Z, solve for the joint PDF of X, Y, U , then condition
on U using the formula above so that I may compute E[XY |Y + Z = 1]. Since U = Y + Z, and
Y and Z are 2 independent N (1, 1) distributions, then U ∼ N (2, 2). Recall that for 3 marginally
normal distributions, the joint distribution is:
     
\[
\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} \sim N\left( \begin{bmatrix} E[X_1] \\ E[X_2] \\ E[X_3] \end{bmatrix}, \begin{bmatrix} \mathrm{Var}[X_1] & \mathrm{Cov}[X_1, X_2] & \mathrm{Cov}[X_1, X_3] \\ \mathrm{Cov}[X_2, X_1] & \mathrm{Var}[X_2] & \mathrm{Cov}[X_2, X_3] \\ \mathrm{Cov}[X_3, X_1] & \mathrm{Cov}[X_3, X_2] & \mathrm{Var}[X_3] \end{bmatrix} \right).
\]
Thus, to solve for the joint distribution of X, Y, U, all that is left to do is to calculate the covariance terms involving U: Cov[U, X] = Cov[Y + Z, X] = 0 and Cov[U, Y] = Cov[Y + Z, Y] = Var[Y], so that the joint distribution is:
\[
\begin{bmatrix} X \\ Y \\ U \end{bmatrix} \sim N\left( \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 2 \end{bmatrix} \right).
\]
To solve for the distribution of X, Y |U = 1, it is not difficult to identify that, here, ΣAA = I2 (the
2 × 2 identity matrix), ΣAB = [0, 1]T , ΣBA = [0, 1], and ΣBB = 2, where I identify XA with
[X, Y ]T and XB with U . The mean of the conditional distribution is therefore given by
     
1 0 1 1
µA|B = + (1 − 2) = 1 ,
1 1 2 2

and the covariance matrix of the conditional distribution is given by:


     
1 0 0 1  1 0
ΣA|B = − 0, 1 = .
0 1 1 2 0 21

Finally, I have the conditional distribution I desire:


   
1 1 0
X, Y |U = 1 ∼ N 1 , .
2 0 12

We would like to solve for E[XY |U = 1], which we can get easily from the covariance term of the
above distribution: E[XY |U = 1] = Cov[X, Y |U = 1] + E[X|U = 1]E[Y |U = 1] = 0 + (1) · (1/2) =
1/2.
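The partitioned-Gaussian conditioning formula quoted above is easy to verify numerically; the following numpy sketch reproduces the conditional mean, conditional covariance and E[XY | U = 1] obtained in this solution.
\begin{verbatim}
import numpy as np

mu = np.array([1.0, 1.0, 2.0])                 # mean of (X, Y, U)
Sigma = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 1.0, 2.0]])            # covariance of (X, Y, U)

A, B = [0, 1], [2]                             # A = (X, Y), B = (U,)
S_AA = Sigma[np.ix_(A, A)]
S_AB = Sigma[np.ix_(A, B)]
S_BB = Sigma[np.ix_(B, B)]

xB = np.array([1.0])                           # condition on U = 1
mu_cond = mu[A] + S_AB @ np.linalg.solve(S_BB, xB - mu[B])
Sigma_cond = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)

# E[XY | U = 1] = Cov[X, Y | U = 1] + E[X | U = 1] * E[Y | U = 1]
print(mu_cond)                                       # [1.0, 0.5]
print(Sigma_cond)                                    # [[1, 0], [0, 0.5]]
print(Sigma_cond[0, 1] + mu_cond[0] * mu_cond[1])    # 0.5
\end{verbatim}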

Problem 4. Due to the symmetry of the problem, Y1 , Y2 , . . . Yn are all identically distributed,
and thus:
E[Y ] = E[Y1 + Y2 + . . . + Yn ] = E[Y1 ] + E[Y2 ] + . . . + E[Yn ] = nE[Y1 ],
1
for example, see: Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.

and
n
X X
V ar[Y ] = V ar[Y1 + Y2 + . . . + Yn ] = V ar[Yi ] + 2 Cov[Yi , Yj ] = nV ar[Y1 ] + 2nCov[Y1 , Y2 ].
i=1 i<j

The reason there is a factor of n in front of the covariance term can be seen from the matrix below
(an n = 5 example). The summation over i < j means that we can only consider pairs of i, j
above the diagonal (shaded cells). Further, it is evident that Yi , Yj pairs that are 2 or more apart
are independent. For example, Y2 = X2 X3 , Y3 = X3 X4 , Y4 = X4 X5 , so that Y2 and Y3 are not
independent because they share the X3 random variable, but Y2 and Y4 are independent because
they share no random variables (and all the Xi s are independent). Since independence =⇒ no
covariance, the only Yi , Yj pairs that contribute to the sum are the n − 1 terms right above the
diagonal (as indicated by the spades in the figure). We also cannot forget that the Yn , Y1 pair is
not independent because they share the X1 random variable (the spade in the upper right hand
corner). We thus see that there are n pairs that contribute to the sum.

Y1 Y2 Y3 Y4 Y5
Y1 ♠ ♠
Y2 ♠
Y3 ♠
Y4 ♠
Y5

iid
It remains to compute E[Y1 ], V ar[Y1 ] and Cov[Y1 , Y2 ]. Since Y1 = X1 X2 , and X1 , X2 ∼ Bern(p),
the range of Y1 is {0, 1}, with probability p2 of obtaining 1 (X1 = 1 and X2 = 1). In other words,
Y1 ∼ Bern(p2 ), so that E[Y1 ] = p2 and V ar[Y1 ] = p2 (1 − p2 ). All that is left to do is to compute
the covariance:

Cov[Y1 , Y2 ] = E[Y1 Y2 ] − E[Y1 ]E[Y2 ]


= E[X1 X2 X2 X3 ] − E[X1 X2 ]E[X2 X3 ]
= E[X1 ]E[X22 ]E[X3 ] − E[X1 ]E[X2 ]2 E[X3 ]
= p · p · p − p · p2 · p
= p3 (1 − p),

where in the second line I have used the fact that all the Xs are independent, and in the fourth
line I have used the fact that for a Bern(p) distribution, p(1 − p) = E[X 2 ] − p2 .
Thus, we have that:
E[Y ] = np2

and

V ar[Y ] = np2 (1 − p2 ) + 2np3 (1 − p)


= np2 (3p + 1)(1 − p).

Problem 5.

(a) To solve for the expectation, note that:


E[X] = E[X1 + X2 + . . . + Xk ]
= E[X1 ] + E[X2 ] + . . . + E[Xk ]
1
X 1
X 1
X
= jP (X1 = j) + jP (X2 = j) + . . . + jP (Xk = j)
j=0 j=0 j=0

= P (X1 = 1) + P (X2 = 1) + . . . + P (Xk = 1),


where in the second line I have used the linearity of expectation. To solve this problem, we
therefore need to solve for P (Xi = 1) for all i = 1, 2, . . . , k.
To solve for P (Xi = 1), first suppose that we draw all b+r balls and create a specific sequence
of blues and reds. Note that all possible sequences are equally likely to occur. To see this, as
an example, suppose r = 3 and b = 2 and we draw the sequence RRBRB. The probability
that this occurs is
3 3−1 2 3−2 2−1 3!2!
· · · · = .
3+2 3+2−1 3+2−2 3+2−3 3+2−4 (3 + 2)!
Suppose instead we had drawn the sequence BRRBR. The probability that this sequence
occurs is:
2 3 3−1 2−1 3−2 3!2!
· · · · = .
3+2 3+2−1 3+2−2 3+2−3 3+2−4 (3 + 2)!
Notice that since the probability of all possible sequences with r = 3 and b = 2 is simply the
product of the same terms in the numerators and the same terms in the denominators, but
in a different order, the probability of any possible sequence is the same product. Thus, in
general, the probability of any specific sequence occurring is r!b!/(b + r)!. As a check that all
possible sequences are equally as likely and the probability of each sequence is r!b!/(b + r)!, if
we multiply this probability by the total number of distinct sequences the result should be 1.
Taking into account the indistinguishability of all red balls and all blue balls, the total number
of distinct sequences is (b + r)!/(r!b!). Multiplying these values together indeed results in 1.
Now that we know that all outcomes are equally as likely and that there is a finite sample
space, we can use combinatorics to find the probabilities we are after. Let us concentrate on
the ith draw and compute the probability that the ith draw is blue. Since all sequences are
equally likely to occur, to compute this probability, we need only to divide the total number
of unique sequences with a blue ball in the ith spot by the total number of possible unique
sequences. Since all red balls are indistinguishable and all blue balls are indistinguishable,
as above, the total number of unique sequences is thus (b + r)!/(r!b!). The total number of
unique sequences with a blue ball in the ith spot is (b + r − 1)!/[r!(b − 1)!], and therefore the
probability of obtaining a blue ball in the ith spot, P (Xi = 1), is:
(b+r−1)!
r!(b−1)! b
P (Xi = 1) = (b+r)!
= .
b+r
r!b!

The desired expectation value is therefore


E[X] = P (X1 = 1) + P (X2 = 1) + . . . + P (Xk = 1)
b b b
= + + ... +
b+r b+r b+r
| {z }
k times
kb
= .
b+r

(b) To solve for V ar[X], I already have E[X], so I just need to solve for E[X 2 ]:

E[X 2 ] = E[(X1 + X2 + . . . + Xk )2 ]
 
Xk X
=E Xi2 + Xi Xj 
i=1 i,j:i6=j
k
X X
= E[Xi2 ] + E[Xi Xj ],
i=1 i,j:i6=j

P
where in the third line I have used the linearity of expectation, and where the notation i,j:i6=j
refers to a summation over all i, j pairs (i, j=1, 2, . . . , k) expect the pairs for which i = j
(which was accounted for in the first summation).

We have already pretty much solved for E[Xi2 ]:

1
X
E[Xi2 ] = j 2 P (Xi = j)
j=0

= P (Xi = 1)
b
= ,
b+r

so that the first summation is:


k
X kb
E[Xi2 ] = .
b+r
i=1

The second summation is slightly more difficult. The strategy I take is to condition on one of
the random variables and to use the law of total expectation (since Xi , Xj ∈ {0, 1}, so that
one of the terms will go to zero):

E[Xi Xj ] = E[Xi Xj |Xj = 0]P (Xj = 0) + E[Xi Xj |Xj = 1]P (Xj = 1)


= E[Xi |Xj = 1]P (Xj = 1)
" 1 #
X
= lP (Xi = l|Xj = 1) P (Xj = 1)
l=0
= P (Xi = 1|Xj = 1)P (Xj = 1),

I thus need to solve for P (Xi = 1|Xj = 1), which I can do in a very similar combinatorial
fashion as I did for P (Xi = 1). For this probability, the total sample space is all sequences
of size k with b blue balls and r red balls, with a blue ball in the j th spot. The size of the
sample space is thus (b + r − 1)!/[(b − 1)!r!]. The number of unique sequences with a blue
ball in the j th spot and a blue ball in the ith spot is (b + r − 2)!/[r!(b − 2)!], so that

(b+r−2)!
r!(b−2)! b−1
P (Xi = 1|Xj = 1) = = .
(b+r−1)! b+r−1
(b−1)!r!

Finally, the second summation is


X X
E[Xi Xj ] = P (Xi = 1|Xj = 1)P (Xj = 1)
i,j:i6=j i,j:i6=j
X b−1 b
=
b+r−1b+r
i,j:i6=j
(k 2 − k)(b − 1) b
= ,
b+r−1 b+r
so that
kb (k 2 − k)(b − 1) b
E[X 2 ] = +
b+r b+r−1 b+r
kbr + k 2 b2 − bk 2
= .
(b + r)(b + r − 1)

The variance is thus:

\begin{align*}
\mathrm{Var}[X] &= E[X^2] - E[X]^2\\
&= \frac{kbr + k^2 b^2 - bk^2}{(b + r)(b + r - 1)} - \left(\frac{kb}{b + r}\right)^2\\
&= \frac{kbr(b + r - k)}{(b + r)^2 (b + r - 1)}.
\end{align*}
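A small simulation sanity check of these two formulas; the particular values b = 3, r = 5 and k = 4 below are arbitrary illustrative choices.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
b, r, k = 3, 5, 4
urn = np.array([1] * b + [0] * r)      # 1 = blue, 0 = red

trials = 200_000
draws = np.array([rng.permutation(urn)[:k].sum() for _ in range(trials)])

print(draws.mean(), k * b / (b + r))                                      # ~1.5
print(draws.var(), k * b * r * (b + r - k) / ((b + r)**2 * (b + r - 1)))  # ~0.536
\end{verbatim}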

Problem 6. I start by writing out the definition of the MGF:

MX (s) = E[esX ]

X
= p(1 − p)k−1 esk
k=1
(∞ )
p X
s k
= [(1 − p)e ] − 1 ,
1−p
k=0

where we recognize that the summation is a geometric series, and is finite provided (1 − p)es < 1.
Using the formula for a geometric series, and simplifying, I have that:

pes
MX (s) = ,
1 + (p − 1)es

for s < − ln(1 − p).

Problem 7. We can solve this problem by realizing that the k th derivative of the MGF evaluated
at s = 0 gives the k th moment of the distribution:

dMX
E[X] =
ds s=0 
d 1 1 s 1 2s
= + e + e
ds 4 2 4 s=0
= 1,

2 d2 MX
E[X ] =
ds2 s=0 
d2 1 1 s 1 2s
= 2 + e + e
ds 4 2 4 s=0
3
= .
2

We therefore have that V ar[X] = 3/2 − 12 = 1/2.

Problem 8. We already know from Problem 5 in section 6.1.6 of the book that the MGF for a
N (µ, σ 2 ) distribution is M (s) = exp(sµ + σ 2 s2 /2), and since the MGF of the sum of independent
random variables is the product of the MGFs of the random variables, we have that:

MX+Y (s) = MX (s)MY (s)


 2 s2   
σX σY2 s2
= exp sµX + exp sµY +
2 2
 2

s 2
= exp s(µX + µY ) + (σX + σY2 ) .
2
2 + σ 2 ) distribution. Further, by Theorem 6.1
We recognize this as the MGF of a N (µX + µY , σX Y
in the book, the MGF of a random variable uniquely determines its distribution, so that indeed
X + Y ∼ N (µX + µY , σX2 + σ 2 ).
Y

Problem 9. As a note, for the Laplace distribution, λ > 0.

MX (s) = E[esX ]
Z
λ ∞ −λ|x|+sx
= e dx
2 −∞
Z 0 Z ∞ 
λ x(λ+s) x(s−λ)
= e dx + e dx
2 −∞ 0

Notice that, for both integrals to be finite, we have the conditions that λ + s > 0 (for the first
integral) and s − λ < 0 (for the second), or in other words |s| < λ. Assuming these two conditions,
the integral can easily be evaluated, and is:

λ2
MX (s) = .
λ2 − s2

Problem 10.

MX (s) = E[esX ]
Z ∞ sx α α−1 −λx
e λ x e
= dx
0 Γ(α)
Z ∞
λα
= xα−1 e−(λ−s)x dx
Γ(α) 0
λα Γ(α)
= for s < λ
Γ(α) (λ − s)α
 α
λ
=
λ−s

Problem 11. For Xi ∼ Exp(λ), from Example 6.5 in the book, we have that MXi = λ/(λ − s)
for s < λ. Moreover since the MGF of the sum of independent random variables is the product of
the MGFs of the random variables, we have that:
MY (s) = MX1 (s)MX2 (s) . . . MXn (s)
 n
λ
= ,
λ−s
which, from the previous problem, we notice is the MGF of a Gamma(n, λ) random variable. By
Theorem 6.1 in the book, the MGF of a random variable uniquely determines its distribution, so
Y ∼ Gamma(n, λ).

Problem 12. By the definition of the characteristic function, we have that:


φY (ω) = E[eiωY ]
= E[eiω(aX+b) ]
= eiωb E[ei(aω)X ]
= eiωb φX (aω).

Problem 13.
(a) To solve for E[U ], I first find the marginal PDFs:
(R 1 (
1 3 1
0 2 (3x + y)dy for 0 ≤ x ≤ 1 x+ for 0 ≤ x ≤ 1
fX (x) = = 2 4 ,
0 otherwise 0 otherwise
and (R 1 (
1 1
0 2 (3x for 0 ≤ y ≤ 1
+ y)dx y + 34 for 0 ≤ y ≤ 1
fY (y) = = 2 .
0 otherwise 0 otherwise
R1 R1
Thus, E[X] = 0 x(3x/2 + 1/4)dx = 5/8 and E[Y ] = 0 y(y/2 + 3/4)dy = 13/24, so that
  5
E[X] 8 .
E[U ] = = 13
E[Y ] 24

(b) In order to solve forRU , I will first need to compute E[X 2 ], E[Y 2 ] and E[XY ]:
Z 1  
2 2 3 1 11
E[X ] = x x+ dx = ,
0 2 4 24
Z 1  
1 3 3
E[Y 2 ] = y2 y+ dy = ,
0 2 4 8
and Z Z
1 1 1 1
E[XY ] = xy(3x + y)dxdy = .
2 0 0 3
I can now immediately write down the correlation matrix:
RU = E[U U T ]
 
E[X 2 ] E[XY ]
=
E[Y X] E[Y 2 ]
 11 1 
= 241
3 .
3
3 8

(c) The covariance matrix is:

CU = E[U U T ] − E[U ]E[U ]T


 
E[X]2 E[X]E[Y ]
= RU −
E[Y ]E[X] E[Y ]2
  "   #
11 1 5 2 5 13
= 24 3 − 8  8 24
1 3 13 5 13 2
3 8 24 8 24
 13 1

192 − 192
= 1 47 .
− 192 576

Problem 14.

(a) First note that the range of Y is [0, 1]. Since we know the distribution of Y |X = x and the
distribution of X, the law of total probability for PDFs will probably be useful. Note that
fY |X (y|x) = 1/x for 0 ≤ y ≤ x and 0 otherwise, which can be written as (1/x)1{0 ≤ y ≤ x},
which will be helpful in the integral to get the bounds of integration correct:
Z 1
fY (y) = fY |X (y|x)fX (x)dx
0
Z 1
1
= 1{0 ≤ y ≤ x}dx
0 x
Z 1
1
= 1{y ≤ x}dx for y > 0
0 x
Z 1
1
= dx
y x
= − ln y,

and thus (
− ln y for 0 ≤ y ≤ 1
fY (y) =
0 otherwise,
which I checked integrates to 1.
Finding the PDF of Z is very similar to that of finding the PDF for Y . Firstly, the range of
Z is [0, 2]. Note that in this case fZ|X (z|x) = 1/2x for 0 ≤ z ≤ 2x and 0 otherwise, which
can be written as (1/2x)1{0 ≤ z ≤ 2x}, so that the integral above becomes:
Z 1
1
fZ (z) = 1{z ≤ 2x}dx for z > 0
0 2x
Z 1
1
= dx
z/2 2x
ln 2 ln z
= − ,
2 2
and thus (
ln 2 ln z
2 − 2 for 0 ≤ z ≤ 2
fZ (z) =
0 otherwise,
which I checked integrates to 1.

(b) Using the chain rule of probability, we have that:


fXY Z (x, y, z) = fZ|XY (z|x, y)fY |X (y|x)fX (x)
= fZ|X (z|x)fY |X (y|x)fX (x),
where in the second line I used the fact that Z and Y are conditionally independent given X.
We thus have that
(
1
2 for 0 ≤ x ≤ 1, 0 ≤ y ≤ x, 0 ≤ z ≤ 2x
fXY Z (x, y, z) = 2x
0 otherwise,
which I again checked integrates to 1.

Problem 15.
(a) As stated in the problem, we have that
     
X1 1 4 1
∼N , ,
X2 2 1 1
and for a bivariate normal, we know that
     
X1 E[X1 ] V ar[X1 ] Cov[X1 , X2 ]
∼N , .
X2 E[X2 ] Cov[X2 , X1 ] V ar[X2 ]
Thus, I have that X2 ∼ N (2, 1), so that:
P (X2 > 0) = 1 − P (X2 ≤ 0)
 
0−2
=1−Φ
1
= Φ(2)
≈ 0.98.

(b)
Y = AX + b
   
2 1   −1
  X1
= −1 1 + 0

X2
1 3 1
 
2X1 + X2 − 1
=  −X1 + X2 
X1 + 3X2 + 1
=⇒
 
2E[X1 ] + E[X2 ] − 1
E[Y ] =  −E[X1 ] + E[X2 ] 
E[X1 ] + 3E[X2 ] + 1
 
2·1+2−1
=  −1 + 2 
1+3·2+1
 
3
= 1

8

(c) We know that a linear combination of a multivariate Gaussian random variable is also Gaus-
sian. Specifically, Y is distributed as Y ∼ N (AE[X] + b, ACX AT ), and thus the covariance
matrix of Y is

CY = ACX AT
 
2 1   
4 1 2 −1 1
= −1 1
1 1 1 1 3
1 3
 
21 −6 18
= −6 3 −13 .
18 −3 19

Notice that, as it should be, CY is symmetric.

(d) As with the first part of this problem, we know that

     
Y1 E[Y1 ] V ar[Y1 ] Cov[Y1 , Y2 ] Cov[Y1 , Y3 ]
Y2  ∼ N E[Y2 ] , Cov[Y2 , Y1 ] V ar[Y2 ] Cov[Y2 , Y3 ] ,
Y3 E[Y3 ] Cov[Y3 , Y1 ] Cov[Y3 , Y2 ] V ar[Y3 ]

so that Y2 ∼ N (1, 3), and therefore:


 
2−1
P (Y2 ≤ 2) = Φ √ ≈ 0.72.
3

Problem 16. To solve this problem, I first review how to “complete the square” for matrices. For
a ∈ R, x, b ∈ Rm and C ∈ Rm×m (and symmetric), a quadratic of the form

1
a + bT x + xT Cx
2
can be factored into the form
1
(x − m)T M (x − m) + v,
2
where
M = C,

m = −C −1 b,

and
1
v = a − bT C −1 b.
2
I now explicitly write out the MGF of X:

MX (s, t, r) = E[esX1 +tX2 +rX3 ]


Z  
T 1 1 T −1
= exp{s x} 3/2 1/2
exp − (x − µ) Σ (x − µ) d3 x
R3 (2π) |Σ| 2
Z  
1 1 T −1
= exp − (x − µ) Σ (x − µ) + s x d3 x,
T
(2π)3/2 |Σ|1/2 R3 2

where xT ≡ [x1 , x2 , x3 ] and sT ≡ [s, t, r]. To make the exponent more Gaussian looking, I now
expand the exponent out and complete the square (note that since Σ is symmetric, then so too is
Σ−1 ):

1 1 
− (x − µ)T Σ−1 (x − µ) + sT x = − xT Σ−1 x − xT Σ−1 µ − µT Σ−1 x + µT Σ−1 µ + sT x
2 2
1 1
= − xT Σ−1 x + µT Σ−1 x − µT Σ−1 µ + sT x
2 2
1 T −1 1
= − x Σ x + (s + µ Σ )x − µT Σ−1 µ,
T T −1
2 2

where I have used the fact that (xT Σ−1 µ)T = xT Σ−1 µ since this is just a real number. I can now
read off a, b and C:
1
a = − µT Σ−1 µ,
2
bT = sT + µT Σ−1
and
C = −Σ−1 ,
so that
b = (bT )T = s + Σ−1 µ
and
−1
C −1 = −Σ−1 = −Σ.
Finally, the exponent can be re-expressed as

1
− (x − m̃)T M̃ (x − m̃) + ṽ,
2
where
m̃ = Σ(s + Σ−1 µ),

M̃ = Σ−1
and
1 1
ṽ = − µT Σ−1 µ + (sT + µT Σ−1 )Σ(s + Σ−1 µ)
2 2
T
s Σs
= sT µ + ,
2
so that the integral becomes:
Z  
1 1 T −1
MX (s, t, r) = exp(ṽ) exp − (x − m̃) Σ (x − m̃) d3 x
(2π)3/2 |Σ|1/2 R3 2
 T

s Σs
= exp sT µ + ∀s ∈ R3 .
2

I have used the fact that the integral is that of a Gaussian integrated over its entire domain, so
that the integral evaluates to 1. Note that we probably could have guessed this form of the MGF
of X, since it is the vector analogue of the 1 dimensional case: MX (s) = exp{sµ + σ 2 s2 /2}, as
found in Problem 5 of 6.1.6 in the book.

The specified values of the mean vector and covariance matrix are:
 
1
µ = 2 ,
0

and  
9 1 −1
Σ =  1 4 2 ,
−1 2 4
and plugging in these specific values into the equation I derived above, and multiplying the matrices,
I finally arrive at:
   
9 2
MX (s, t, r) = exp s s + 1 + 2t(t + 1) + 2r − rs + st + 2rt .
2

Problem 17. Let Ai, i = 1, 2, 3, 4, be the event that the ith component fails, so that a failure occurs under the event A1 ∪ A2 ∪ A3 ∪ A4. We can thus obtain an upper bound on the probability that a failure occurs using the union bound:
\[
P\left(\bigcup_{i=1}^{4} A_i\right) \le \sum_{i=1}^{4} P(A_i) \le 4 p_f = \frac{1}{25}.
\]
Problem 18.
iid
(a) First note that for the random position of a node (Xi , Yi ), X1 , X2 , . . . , Xn , Y1 , Y2 , . . . , Yn ∼
U nif (0, 1). Let us call the node under consideration node j, and let the set S be defined as
S ≡ {1, 2, . . . , n} − {j}. The probability that the node is isolated, pd , is:

!
\ 
pd = P (Xj − Xi )2 + (Yj − Yi )2 > r2
i∈S
Z !
1Z 1 \ 
= P (Xi − Xj )2 + (Yi − Yj )2 > r2 Xj = xj , Yj = yj fXj Y j (xj , yj )dxj dyj
0 0 i∈S
Z !
1Z 1 \ 
= P (Xi − xj )2 + (Yi − yj )2 > r2 fXj (xj )fYj (yj )dxj dyj
0 0 i∈S
Z 1Z 1 Y 
= P (Xi − xj )2 + (Yi − yj )2 > r2 fXj (xj )fYj (yj )dxj dyj
0 0 i∈S
Z 1Z 1Y 
= 1 − P (Xi − xj )2 + (Yi − yj )2 ≤ r2 fXj (xj )fYj (yj )dxj dyj
0 0 i∈S
Z Z 1
1 n−1
= 1 − P (X1 − xj )2 + (Y1 − yj )2 ≤ r2 fXj (xj )fYj (yj )dxj dyj ,
0 0

where in the third and fourth lines I have used the fact that the random variables are inde-
pendent, and in the last line, I have used symmetry. I now must compute P ((X1 − xj )2 +

Figure 6.1: Two example nodes at (xj , yj ) for Problem 18.

(Y1 − yj )2 ≤ r2 ). If the given point (xj , yj ) is near the middle of the square (as in the upper
point in Fig. 6.1) then the probability of this event is simply the area of this circle (shaded
grey region). However, we notice that if (xj , yj ) is near the edge of the square, part of the
shaded circle will get cutoff. In fact, if (xj , yj ) is exactly at one of the corners of the unit
square, for example, at (0, 0) as in Fig. 6.1, then the amount of shaded area is minimized at
πr2 /4. Thus, for any given (xj , yj ), P ((X1 − xj )2 + (Y1 − yj )2 ≤ r2 ) ≥ πr2 /4. Therefore,
Z 1Z 1 n−1
πr2
pd ≤ 1− fXj (xj )fYj (yj )dxj dyj
0 0 4
 n−1
πr2
= 1−
4

(b) Let Ai be the event that the ith node is isolated. Then the probability we seek is:
\begin{align*}
P\left(\bigcup_{i=1}^{n} A_i\right) &\le \sum_{i=1}^{n} P(A_i)\\
&= \sum_{i=1}^{n} p_d\\
&\le n\left(1 - \frac{\pi r^2}{4}\right)^{n-1}.
\end{align*}
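To get a feel for how conservative this bound is, here is a small Monte Carlo sketch (the choices n = 20 and r = 0.3 are arbitrary) that estimates the probability that one particular node is isolated and compares it to the bound (1 − πr²/4)^(n−1) from part (a):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
n, r = 20, 0.3
trials = 100_000

pts = rng.uniform(0, 1, size=(trials, n, 2))
# Squared distances from node 0 to every other node, in each trial:
d2 = np.sum((pts[:, 1:, :] - pts[:, :1, :])**2, axis=2)
isolated = np.all(d2 > r**2, axis=1)

print(isolated.mean())                      # estimated p_d for one node
print((1 - np.pi * r**2 / 4)**(n - 1))      # upper bound, ~0.248 here
\end{verbatim}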

Problem 19. For X ∼ Geom(p), E[X] = 1/p, so that the Markov inequality is:
E[X] 1
P (X ≥ a) ≤ = .
a pa

Figure 6.2: Comparison of the Markov upper bound to the exact probability P (X ≥ a) as a function of p, for a = 1 and a = 2 (Problem 19).

The exact probability is:



\begin{align*}
P(X \ge a) &= \sum_{k=a}^{\infty} p(1 - p)^{k-1}\\
&= \frac{p}{1-p}\left(\sum_{k=0}^{\infty} (1 - p)^k - \sum_{k=0}^{a-1} (1 - p)^k\right)\\
&= \frac{p}{1-p}\left(\frac{1}{1 - (1 - p)} - \frac{1 - (1 - p)^a}{1 - (1 - p)}\right)\\
&= (1 - p)^{a-1}.
\end{align*}

The Markov upper bound is greater than or equal to the exact probability for a ≥ 1 and 0 < p < 1
as shown for a few values of a in Fig. 6.2
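A minimal numerical comparison of the Markov bound 1/(pa) with the exact tail (1 − p)^(a−1) (the grid of p values and the choices of a are arbitrary):
\begin{verbatim}
import numpy as np

for a in (1, 2, 5):
    p = np.linspace(0.05, 0.95, 7)
    markov = 1 / (p * a)              # Markov upper bound E[X]/a = 1/(p a)
    exact = (1 - p)**(a - 1)          # exact tail P(X >= a) for X ~ Geom(p)
    assert np.all(markov >= exact)    # the bound always dominates
    print(a, np.round(markov, 3), np.round(exact, 3))
\end{verbatim}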

Problem 20.
V ar[X] 1−p
P (|X − E[X]| ≥ b) ≤ =
b2 pb2
Problem 21.
 
σ2 σ2
P (X ≥ a) = P X + ≥a+
a a
 2
2  2 !
σ σ2
=P X+ ≥ a+
a a
h 2
i
E (X + σa )2
≤  2 Markov0 s Inequality
2
a + σa
σ4
σ2 + a2
= σ2 2
a2 (1 + a2
)
σ2
=
a2 + σ 2
Problem 22.

(a)

P (X ≤ 80 or X ≥ 120) = P (|X − 100| ≥ 20)


= P (|X − E[X]| ≥ 20)
225
≤ 2 by Chebyshev
20
9
=
16

(b)
225 45
P (X ≥ 120) ≤ 2
=
225 + 120 293
iid
Problem 23. We know from Problem 11 that if X1 , X2 , . . . , Xn ∼ Exp(λ), then Y ≡ X1 + X2 +
. . . + Xn ∼ Gamma(n, λ). The relevant Chernoff bound is given by:

P (Y ≥ a) ≤ min{e−sa MX (s)},
s>0

where, in this case the MX (s) is the MGF for a Gamma(n, λ) distribution. This MGF was solved
for in Problem 10, and is given by
 n
λ
MX (s) = for s < λ,
λ−s
and therefore we must minimize the objective function over 0 < s < λ. Let the optimal value
be called s? . I solve for s? in the standard calculus manner by setting the derivative equal 0. I
then check to make sure that this optimal value is within the interval (0, λ). The derivative of the
objective can be found easily with the chain rule:
 n  n  n
d −sa λ −sa λ −sa λ 1
e = −ae + ne ,
ds λ−s λ−s λ−s λ−s
and setting this equal to zero and solve for s? results in
n
s? = λ − .
a
As stipulated in the problem a > n/λ, which means that n/a is positive and less than λ. Thus we
have that s? ∈ (0, λ) as required.
The desired bound is therefore
?
P (Y ≥ a) ≤ e−s MX (s? )
 n
−λa+n λa
=e .
n
We can understand the behavior of this function as n → ∞ by expanding the exponential in
powers of 1/n:
 n     n
−λa+n λa −λa 1 1
e =e 1+O (λa)n
n n n
 n  n+1 !
λa 1
= e−λa +O ,
n n

and we thus see that the upper bound goes to 0 exponentially fast as n goes to infinity.
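A quick numerical sketch comparing this Chernoff bound with the exact Gamma tail probability (the values of n, λ and a below are arbitrary, subject to a > n/λ):
\begin{verbatim}
import numpy as np
from scipy.stats import gamma

lam, n = 1.0, 5
for a in (8.0, 10.0, 15.0):
    bound = np.exp(-lam * a + n) * (lam * a / n)**n
    exact = gamma.sf(a, n, scale=1.0 / lam)   # P(Y >= a), Y ~ Gamma(n, lam)
    print(a, exact, bound)                    # bound >= exact in every case
\end{verbatim}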

Problem 24. Using some properties of absolute values I have that:

E[|X + Y |p−1 |X|] = E[|(X + Y )p−1 ||X|]


= E[|(X + Y )p−1 X|]
p p−1 1
≤ E[|(X + Y )p−1 | p−1 ] p E[|X|p ] p
p−1 1
= E[|X + Y |p ] p [|X|p ] p ,

where in the third line I have used Hölder’s inequality, E[|U V |] ≤ E[|U |α ]1/α E[|V |β ]1/β , with
1 < α, β < ∞ and 1/α + 1/β = 1, and where I have specifically chosen α = p/(p − 1) and β = p.
Using the inequality provided in the book,

E[|X + Y |p ] ≤ E[|X + Y |p−1 |X|] + E[|X + Y |p−1 |Y |]


p−1 1 1
≤ E[|X + Y |p ] p (E[|X|p ] p + E[|Y |p ] p ),

and multiplying both sides of this equation by E[|X + Y |p ](1−p/p) (which we can do without flipping
the inequality sign since we know that this quantity is positive) yields the desired result:
1 1 1
E[|X + Y |p ] p ≤ E[|X|p ] p + E[|Y |p ] p .

Problem 25.

(a)
d2
(x − x3 ) = −6x
dx2
=⇒ (
+ (convex) for x < 0
g 00 (x) =
− (concave) for x > 0,
Since we know X is a positive random variable, by Jensen’s inequality, we have:

E[X − X 3 ] ≤ E[X] − E[X]3 = −990.

(b)
d2 √ 1
2
(x ln x) =
dx 2x
00
=⇒ g (x) > 0 (convex) for x > 0 =⇒
√ p √
E[X ln X] ≥ E[X] ln E[X] = 10 ln 10

(c) The function is a typical absolute value function, an upward v shape hitting y = 0 at x = 2,
which is clearly convex, since a straight line drawn from any 2 points on the graph is always
above the graph. Therefore, I have that E[|2 − X|] ≥ |2 − E[X]| = 8.

Problem 26. Taking the second derivative, we have that d2 /dx2 (x3 − 6x2 ) = 6x − 12. Setting to
zero and solving for x, I find that the second derivative is negative for x < 2 and positive for x > 2.
Since the range of X is (0, 2), we have that g(x) = x3 − 6x2 is concave in this interval. By Jensen’s
inequality, this implies that E[Y ] = E[g(X)] ≤ g(E[X]) = E[X]3 − 6E[X]2 = 1 − 6 = −5.
Chapter 7

Limit Theorems and Convergence of


Random Variables


Problem 1.

(a)

1
E[Mn ] = E[X1 + . . . + Xn ]
n
1
= nE[X1 ]
n
1
=
2

1
V ar[Mn ] = V ar[X1 + . . . + Xn ]
n2
1
= 2 nV ar[X1 ] (independence)
n
1
=
12n

(b)
   
1 1 1
P Mn − ≥ = P |M − E[M ]| ≥
2 100
n n
100
V ar[Mn ]
≤ 
1 2
100
2500
=
3n

(c)  
1 2500
lim P |Mn − E[Mn ]| ≥ ≤ lim =0
n→∞ 100 n→∞ 3n
=⇒  
1 1
lim P Mn − ≥ =0
n→∞ 2 100

Problem 2. Let X1, X2, . . . , X365 be the number of accidents on day 1, day 2, . . ., day 365, so that the total number of accidents in the year is Y = X1 + . . . + X365. We know that X1, X2, . . . , X365 are iid Poiss(λ), where λ = 10 accidents/day, so that µ = E[Xi] = λ = 10 and σ = √(Var[Xi]) = √λ = √10. Using the central limit theorem, I have
\begin{align*}
P(Y > 3800) &= P\left(\frac{Y - 365\cdot 10}{\sqrt{365\cdot 10}} > \frac{3800 - 365\cdot 10}{\sqrt{365\cdot 10}}\right)\\
&= P\left(Z_{365} > \frac{3800 - 365\cdot 10}{\sqrt{365\cdot 10}}\right)\\
&= 1 - P\left(Z_{365} \le \frac{3800 - 365\cdot 10}{\sqrt{365\cdot 10}}\right)\\
&\approx 1 - \Phi\left(\frac{3800 - 365\cdot 10}{\sqrt{365\cdot 10}}\right)\\
&\approx 6.5\times 10^{-3}.
\end{align*}

Problem 3. Let the random variable Xi be 0 if the ith bit is not received in error and 1 if it is. Notice that X1, X2, . . . , X1000 are iid Bern(0.1), and let the total number of errors, Y, be Y = X1 + . . . + X1000. Note that µ = E[Xi] = p = 0.1 and σ = √(Var[Xi]) = √(p(1 − p)) = √0.09. We seek the probability of decoding failure, in other words P (Y > 125):
\begin{align*}
P(Y > 125) &= 1 - P(Y \le 125)\\
&= 1 - P\left(\frac{Y - 1000\cdot 0.1}{\sqrt{0.09\cdot 1000}} \le \frac{125 - 1000\cdot 0.1}{\sqrt{0.09\cdot 1000}}\right)\\
&= 1 - P\left(Z_{1000} \le \frac{125 - 1000\cdot 0.1}{\sqrt{0.09\cdot 1000}}\right)\\
&\approx 1 - \Phi\left(\frac{25}{3\sqrt{10}}\right)\\
&= 4.2\times 10^{-3}.
\end{align*}

Problem 4. Let the random variable, Xi be 0 if the ith student does not have a car and 1 if the ith
iid
student does have a car. Notice that X1 , X2 , . . . , X50 ∼ Bern(0.5), and
p let the totalpnumber of cars,
Y , be Y = X1 + . . . + X50 . Note that µ = E[Xi ] = p = 0.5 and σ = V ar[Xi ] = p(1 − p) = 0.5.
We seek the probability that there are not enough car spaces, in other words P (Y > 30):

P (Y > 30) = P (Y > 29.5) (using the continuity correction)


= 1 − P (Y ≤ 29.5)
 
Y − 50 · 0.5 29.5 − 50 · 0.5
=1−P √ ≤ √
50 · 0.5 50 · 0.5
 
29.5 − 50 · 0.5
= 1 − P Z50 ≤ √
50 · 0.5
≈ 1 − Φ (1.27)
≈ 0.10.

Problem 5. Let N , a random variable, be the number of jobs processed in 7 hours (420 mins).
We seek the probability that the number of jobs processed in 7 hours is less than or equal to 40,
P (N ≤ 40). This can be rephrased as the probability that the total time to processes 40 jobs is
greater than or equal to 7 hours:

P (N ≤ 40) = P (X1 + . . . + X40 ≥ 420)


 
X1 + . . . + X40 − 40 · 10 420 − 40 · 10
=P √ √ ≥ √ √
40 · 2 40 · 2
 
5
= P Z40 ≥ √
5
 
5
= 1 − P Z40 < √
5
 
5
≈1−Φ √
5
−2
≈ 1.3 × 10 .

Problem 6. Let Xi be the number of heads flipped on toss i, so that the total proportion of heads out of n tosses, X, is X = (X1 + . . . + Xn)/n. Notice that X1, . . . , Xn are iid Bern(0.5), so that µ = E[Xi] = p = 0.5 and σ = √(Var[Xi]) = √(0.5²) = 0.5. To be at least 95% sure that 0.45 ≤ X ≤ 0.55, we have that:
\begin{align*}
0.95 &\le P(0.45 \le X \le 0.55)\\
&= P\left(\frac{0.45n - n\cdot 0.5}{\sqrt{n}\cdot 0.5} \le \frac{X_1 + \ldots + X_n - n\cdot 0.5}{\sqrt{n}\cdot 0.5} \le \frac{0.55n - n\cdot 0.5}{\sqrt{n}\cdot 0.5}\right)\\
&= P\left(-0.1\sqrt{n} \le Z_n \le 0.1\sqrt{n}\right)\\
&\approx \Phi(0.1\sqrt{n}) - \Phi(-0.1\sqrt{n})\\
&= 2\Phi(0.1\sqrt{n}) - 1.
\end{align*}
Thus, I have that 0.95 ≲ 2Φ(0.1√n) − 1. Applying the inverse normal CDF function to this inequality, I arrive at:
\[
n \gtrsim \left\lceil 100\left(\Phi^{-1}\left(\tfrac{1.95}{2}\right)\right)^{2}\right\rceil = 385. \tag{7.1}
\]
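The numerical value in (7.1) can be reproduced directly with scipy's inverse normal CDF; a minimal sketch:
\begin{verbatim}
import math
from scipy.stats import norm

n_min = math.ceil(100 * norm.ppf(1.95 / 2)**2)
print(n_min)   # 385
\end{verbatim}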
Problem 7. Note that X1, X2, . . . , Xn are iid with µ = E[Xi] = 0 and σ = √(Var[Xi]) = 2, so we
can use the CLT. To be at least 95% sure that the final estimate is within 0.1 units of q, we require:
0.95 ≤ P (q − 0.1 ≤ Mn ≤ q + 0.1)
 
X1 + . . . + Xn + nq
= P q − 0.1 ≤ ≤ q + 0.1
n
= P ((q − 0.1)n − nq ≤ X1 + . . . + Xn ≤ (q + 0.1)n − nq)
 
(q − 0.1)n − nq X1 + . . . + Xn (q + 0.1)n − nq
=P √ ≤ √ ≤ √
2 n 2 n 2 n
 √ √ 
−0.1 n 0.1 n
=P ≤ Zn ≤
2 2
 √   √ 
0.1 n −0.1 n
≈Φ −Φ
2 2
 √ 
0.1 n
= 2Φ − 1.
2

We therefore have that 0.95 ≲ 2Φ(0.1√n/2) − 1. Applying the inverse normal CDF function to this inequality, I arrive at:
\[
n \gtrsim \left\lceil 400\left(\Phi^{-1}\left(\tfrac{1.95}{2}\right)\right)^{2}\right\rceil = 1537. \tag{7.2}
\]

Problem 8. To solve this problem, I first compute the limit of exp[n(x − 1)]/{1 + exp[n(x − 1)]}
for x > 0 as n goes to ∞. Notice that this function has different behavior for x = 1 (in which case
the limit evaluates easily to 1/2), 0 < x < 1 (in which case the limit evaluates easily to 0) and for
x > 1 (in which case the numerator and denominator evaluate to infinity). Using L’hopital’s rule
in this case I find that the limit evaluates to 1. Therefore, I have that:

( 
en(x−1)
limn→∞ 1+en(x−1) for x > 0 0 for −∞ < x < 1
lim FXn (x) = = 12 for x = 1
n→∞ 0 otherwise 

1 for x > 1.
For the “random variable”, X, that takes on a value of 1 with probability 1, the CDF is:
(
0 for −∞ < x < 1
FX (x) =
1 for x ≥ 1

Thus, we see that limn→∞ FXn (x) = FX (x) everywhere FX (x) is continuous (i.e, R − {1}), and
d
hence Xn −
→ X.

Problem 9. To solve this problem, I first state without proof the following 2 limits:
enx + xenx
lim  = x for 0 ≤ x ≤ 1,
n→∞ 1 + n+1 en
n

and
enx + enx
lim  =1 for x > 1.
n→∞ 1 + n+1 en
n
I therefore have that:
 

0 for x < 0 
 enx +xenx
0 for x < 0
lim FXn (x) = limn→∞ 1+( n+1
for 0 ≤ x ≤ 1 = x for 0 ≤ x ≤ 1
n )
en
n→∞ 
 

limn→∞ nx
e +e nx
for x > 1 1 for x > 1,
1+( n+1
n )
en

d
which is the same CDF as a U nif (0, 1) distribution. Hence, Xn −
→ X for X ∼ U nif (0, 1).

Problem 10.
(a)
lim P (|Xn − 0| ≥ ) = lim P (Xn ≥ ) (since Xn ≥ 0)
n→∞ n→∞
(
1
2 for  ≤ n
= n
0 for  > n
1
= lim 2
n→∞ n
=0
p
=⇒ Xn →
− 0
(b)
lim E[|Xn − 0|r ] = lim E[Xnr ] (since Xn ≥ 0)
n→∞ n→∞
1 r
= lim n
n→∞ n2
= lim nr−2
n→∞
= 0 (for 1 ≤ r < 2)
Lr
=⇒ Xn −→ 0 (for 1 ≤ r < 2)
(c) For r ≥ 2,
lim E[|Xn − 0|r ] = lim E[Xnr ] (since Xn ≥ 0)
n→∞ n→∞
1 r
= lim n
n→∞ n2
= lim nr−2 ,
n→∞

which converges to 1 for r = 2 and diverges for r > 2. Therefore, Xn does not converge to 0
in the rth mean for r ≥ 2.

P∞
(d) To solve this problem I use Theorem 7.5 in the book, and must thus show that n=1 P (|Xn | >
) (for all  > 0) is finite:


X ∞
X
P (|Xn | > ) = P (Xn > )
n=1 n=1

X 1
=
n2
n=de
X∞
1

n2
n=1
π2
=
6
< ∞,

where in the first line I have used the fact that Xn is always greater than or equal to zero.

Problem 11. This is a hypergeometric experiment with b = n + δ (with δ = 0, 1, 2, . . .), r = n


and k = 10, so that the PMF for X10 , X11 , . . . is given by:

n+δ
 n

x 10−x
PXn (x) = 2n+δ
 ,
10

for x = 0, 1, . . . 10 (and 0 otherwise). Since X, X10 , X11 , . . . are non-negative random integers (for
X ∼ Bin(10, 0.5)), by Theorem 7.1 in the book, we need only prove that limn→∞ PXn (x) = PX (x)
to prove convergence in distribution. As for the RHS of this equation, for X ∼ Bin(10, 0.5), the
PMF is given by:
 
10
PX (x) = (0.5)10 ,
x

for x = 0, 1, . . . , 10 (and 0 otherwise).


As for the LHS of this equation, taking the limit of PXn (x) as n → ∞ I have that:

n+δ
 n

x 10−x
lim PXn (x) = lim 2n+δ

n→∞ n→∞
10
     
n+δ n 2n + δ −1
= lim lim lim .
n→∞ x n→∞ 10 − x n→∞ 10

The first limit can be found easily by expanding the factorial in the numerator:
 
n+δ 1
lim = lim (n + δ)(n + δ − 1) . . . (n + δ − x + 1)
n→∞ x x! n→∞
1 
= lim nx + O nx−1
x! n→∞
nx
= lim ,
n→∞ x!

and the remaining 2 limits can be worked out similarly. Plugging these limits in, I have that:
     −1
nx n10−x (2n)10
lim PXn (x) = lim lim lim
n→∞ n→∞ x!n→∞ (10 − x)! n→∞ 10!
(  −1 )
n x n 10−x (2n) 10
= lim · ·
n→∞ x! (10 − x)! 10!
 
10
= (0.5)10 .
x

d
Thus limn→∞ PXn (x) = PX (x), and by Theorem 7.1 Xn −
→ X.

Problem 12. Let


X1 + . . . + Xn − nµX
Xn = √ ,
σX n
where X1 , . . . , Xn are iid from any distribution with finite mean µX and finite variance σX 2 . Also,

let Yn be defined analogously (and let all Xi s be independent from all Yi s, so that Xn is independent
d
of Yn ). Moreover, let X ∼ N (0, 1) and let Y = X. Now, from the CLT, we know that Xn −
→X
d
and Yn −→Y.
From the CLT, we also know that in the limit that n → ∞, Xn + Yn is simply the sum of
two independent standard normal random variables, so that in this limit Xn + Yn ∼ N (0, 2).
Also, since X + Y = 2X, we have that X + Y ∼ N (0, 4), since for X ∼ N (E[X ], V ar[X ]),
Y = aX + b ∼ N (aE[X ] + b, a2 V ar[X ]) (see Sec. 6.1.5 from the book). Thus, in this example, I
d d
have that Xn −
→ X and Yn −
→ Y , but Xn + Yn does not converge in distribution to X + Y .

d
Problem 13. Xn −
→ 0 since:

Z ∞
n −nx
lim P (|Xn | ≥ ) = lim 2 e dx (by symmetry)
n→∞ n→∞  2
1
= lim n
n→∞ e
= 0.

Problem 14. This can easily be proven by realizing that Xn is never negative (so |Xn | = Xn ),
and by re-expressing the integral over the PDF in terms of an indicator functions depending on
whether  > 1/n (in which case the lower bound of the integral is ) or whether  ≤ 1/n (in which
case the integral evaluates to 1):

lim P (|Xn | ≥ ) lim P (Xn ≥ )


n→∞ n→∞
  Z ∞  
1 1 −2 1
= lim 1  > x dx + 1  ≤
n→∞ n  n n
Z ∞        
1 1 1
= x dx lim 1  >
−2
lim + lim 1  ≤
n→∞ n n→∞ n n→∞ n
Z ∞
= x−2 dx · 1 · 0 + 0

= 0.

Problem 15. For convenience, I first write Xn in summation notation:

n−1
!
1 X
Xn = Yi Y1+1 + Yn Y1 .
n
i=1

To solve this problem, I will use Chebyshev’s inequality, and will thus need to compute E[Xn ] and
V ar[Xn ]. Computing E[Xn ]:

n−1
!
1 X
E[Xn ] = E[Yi Y1+1 ] + E[Yn Y1 ]
n
i=1
n−1
!
1 X
= E[Yi ]E[Y1+1 ] + E[Yn ]E[Y1 ]
n
i=1
n−1
!
1 X
2 2
= µ +µ
n
i=1
= µ2 ,

where in the first line I have used the linearity of expectation and in the second I have used the
fact that all Y s are independent.
Solving for V ar[Xn ] is slightly more tricky. To do this, I will first need to compute Cov[Yi Yi+1 , Yi+1 Yi+2 ]
for i = 1, 2, . . . n − 2 (I will also need to compute Cov[Yn−1 Yn , Yn Y1 ] and Cov[Yn Y1 , Y1 Y2 ], but it
is not difficult to show that the following computation gives the same answer for these 2 covari-
ances) and V ar[Yi Yi+1 ] for i = 1, 2, . . . n − 1 (I will also need to compute V ar[Yn Y1 ] but, again,
it is not difficult to show that the following computation gives the same answer for this variance).
Computing the covariance:

Cov[Yi Yi+1 , Yi+1 Yi+2 ] = E[Yi Yi+1 Yi+1 Yi+2 ] − E[Yi Yi+1 ]E[Yi+1 Yi+2 ]
2
= E[Yi ]E[Yi+1 ]E[Yi+2 ] − E[Yi ]E[Yi+1 ]2 E[Yi+2 ]
= µ2 (E[Yi+1
2
] − E[Yi+1 ]2 )
= µ2 σ 2 ,

where in the second line I have used independence. Now I compute the variance:

V ar[Yi Yi+1 ] = E[(Yi Yi+1 )2 ] − (E[Yi Yi+1 ])2


= E[Yi2 Yi+1
2
] − E[Yi ]2 E[Yi+1 ]2
= E[Yi2 ]E[Yi+1
2
] − E[Yi ]2 E[Yi+1 ]2
= (σ 2 + E[Yi ]2 )(σ 2 + E[Yi+1 ]2 )] − E[Yi ]2 E[Yi+1 ]2
= (σ 2 + µ2 )(σ 2 + µ2 )] − µ2 µ2
= σ 4 + 2σ 2 µ2 ,

where in the second and third lines I have used independence. I now compute V ar[Xn ]:
"n−1 #
1 X
V ar[Xn ] = 2 V ar Yi Y1+1 + Yn Y1
n
i=1
n−1
X n−2
X
1
= V ar[Yi Y1+1 ] + V ar[Yn Y1 ] + 2 Cov[Yi Yi+1 , Yi+1 Yi+2 ] + 2Cov[Yn−1 Yn , Yn Y1 ]
n2
i=1 i=1
!
+ 2Cov[Yn Y1 , Y1 Y2 ]

1
= [n(σ 4 + 2σ 2 µ2 ) + 2n(µ2 σ 2 )]
n2
σ2 2
= (σ + 4µ2 ).
n
For the summation of the covariances, I have only summed over the covariances of adjacent pairs
of Yi Yi+1 , since pairs that are 2 or more away from each other have zero covariance since they
are independent (since they do not share any Y random variables). To see why this is the proper
summation over the covariances, I illustrate the summation in a matrix form for X5 below. We must
sum all off diagonal terms, however, only adjacent pairs contribute non zero covariance, indicated
by the spades in the figure. It is not difficult to see that my summation corresponds exactly to
adding the cells containing spades in this figure.

Y1 Y2 Y2 Y3 Y3 Y4 Y4 Y5 Y5 Y1

Y1 Y2 ♠ ♠

Y2 Y3 ♠ ♠

Y3 Y4 ♠ ♠

Y4 Y5 ♠ ♠

Y5 Y1 ♠ ♠

Finally, I complete the problem using Chebyshev’s inequality:

lim P (|Xn − µ2 | ≥ ) = lim P (|Xn − E[Xn ]| ≥ )


n→∞ n→∞
V ar[Xn ]
≤ lim
n→∞ 2
σ (σ 2 + 4µ2 )
2
= lim
n→∞ n2
= 0.

Since probabilities cannot be less than 0, I conclude that limn→∞ P (|Xn − µ2 | ≥ ) = 0, so that
p
Xn →− µ2 .
Pn
Problem 16. Using some simple algebra, since Xn = (Πni=1 Yi )1/n , I have that ln Xn = 1
n i=1 ln Yi .

Using the WLLN, I therefore have that:


!
1 X
n

lim P (| ln Xn − γ| ≥ ) = lim P ln Yi − E[ln Yi ] ≥ 
n→∞ n→∞ n
i=1
= 0.

p
This therefore implies that ln Xn →− γ. Now, by the Continuous Mapping Theorem (Theorem 7.7 in
p p
the book), since exp(·) is a continuous function, exp(ln Xn ) → − eγ .
− exp(γ), or in other words: Xn →

Problem 17. To solve this problem, I compute E[|Yn − λ|2 ], keeping in mind that for a P oiss(λ)
distribution, E[X] = V ar[X] = λ:

E[|Yn − λ|2 ] = E[(Yn − λ)2 ]


" 2 #
1
=E Xn − λ
n
 
1 2
= E 2 (Xn − λn)
n
1  
= 2 E (Xn − E[Xn ])2
n
1
= 2 V ar[Xn ]
n
1
= 2 nλ
n
λ
= .
n
m.s.
I thus have that limn→∞ E[|Yn − λ|2 ] = 0, so that Yn −−→ λ.

Problem 18. Using Minkowski’s inequality, I have

E[|Xn + Yn − (X + Y )|r ] = E[|(Xn − X) + (Yn − Y )|r ]


≤ E[|(Xn − X)|r ]1/r + E[|(Yn − Y )|r ]1/r ,

so that:

lim E[|Xn + Yn − (X + Y )|r ] ≤ lim E[|(Xn − X)|r ]1/r + lim E[|(Yn − Y )|r ]1/r
n→∞ n→∞ n→∞
 1/r  1/r
= lim E[|(Xn − X)|r ] + lim E[|(Yn − Y )|r ]
n→∞ n→∞
= 0,

Lr Lr
where the last line follows since Xn −→ X and Yn −→ Y . Since, |Xn + Yn − (X + Y )|r ≥ 0,
Lr
E[|Xn + Yn − (X + Y )|r ] ≥ 0, so that the inequality must hold with equality, and thus Xn + Yn −→
X +Y.
P∞
Problem 19. To solve this problem, I utilize Theorem 7.5 in the book and show that n=1 P (|Xn | >
) (for all  > 0) is finite:


X ∞
X
P (|Xn | > ) = P (Xn > )
n=1 n=1
X∞ Z ∞ n2 x2
= n2 xe− 2 dx
n=1 
X∞
n2 2
= e− 2 ,
n=1

where in the first line I have used the fact that the random variable Xn is never negative and in
the third line I have solved the integral with a substitution of u = n2 2 /2.
Now, it is not difficult to show that for positive µ and n ≥ 1, as we have in this case, that
−x 2µ 2
e ≤ e−xµ , and therefore, each term in the above summation is ≤ e−n /2 :

X ∞
X n2
P (|Xn | > ) ≤ e− 2

n=1 n=1
2
e− 2
= 2
1 − e− 2
< ∞ (for  > 0).

Problem 20. Note that for X2 = Y1 , X3 = Y1 Y2 , X4 = Y1 Y2 Y3 , . . ., with Yn ∼ Bern(n/(n + 1)),


RXn = {0, 1}. It is therefore not difficult to show that for this sequence, for 0 <  < 1:

X ∞
X 1
P (|Xn | > ) =
k+1
n=2 k=1
= ∞.

Therefore, we cannot simply appeal to Theorem 7.5 and must thus use Theorem 7.6. That is, we
must show that for any  > 0, limm→∞ P (Am ) = 1, where the set Am is defined in the book. I
show this for 0 <  < 1. For this interval, from the definition of Am :

Am = {|Xn | < , ∀n ≥ m}
= {Xn < , ∀n ≥ m}
= {Xn = 0, ∀n ≥ m}.

Evaluating the probability of this event, I have that:

P (Am ) = P ({Xn = 0, ∀n ≥ m})


= P ({Xm = 0, Xm+1 = 0, . . .})
= P (Ym−1 Ym−2 . . . Y1 = 0, Ym Ym−1 . . . Y1 = 0, . . .)
= P (Ym−1 Ym−2 . . . Y1 = 0)P (Ym Ym−1 . . . Y1 = 0, Ym+1 Ym . . . Y1 = 0, . . . |Ym−1 Ym−2 . . . Y1 = 0)
= P (Ym−1 Ym−2 . . . Y1 = 0)
= 1 − P ((Ym−1 = 0 ∪ Ym−2 = 0 ∪ . . . ∪ Y1 = 0)c )
= 1 − P (Ym−1 = 1, Ym−2 = 1, . . . , Y1 = 1)
m−1
Y k
=1− ,
k+1
k=1

where in the fifth line I have used the fact that if Ym−1 Ym−2 . . . Y1 = 0, then at least 1 Yi (i =
1, . . . , m − 1) is 0. Since the random variables Ym Ym−1 . . . Y1 , Ym+1 Ym . . . Y1 , . . . are all products
of Ym−1 Ym−2 . . . Y1 , given that Ym−1 Ym−2 . . . Y1 = 0, we know for sure that all random variables
Ym Ym−1 . . . Y1 , Ym+1 Ym . . . Y1 , . . . are 0. In the seventh line I have used De Morgan’s law, and in
the eighth I have used independence.
The product is easily solved:
m−1
Y k 1 2 m−2 m−1 1
= · ... · = ,
k+1 2 3 m−1 m m
k=1

where all denominators have cancelled out with the next numerator except for the last one. I
therefore have that
 
1
lim P (Am ) = lim 1 −
m→∞ m→∞ m
= 1,
a.s.
and therefore, by Theorem 7.6 Xn −−→ 0.
Chapter 8

Statistical Inference I: Classical


Methods


Problem 1.

(a) Using the formulas for the sample mean, sample variance and sample standard deviation, I
find that:

X̄ ≈ 164.3 lbs,

S 2 ≈ 383.7 lbs2 ,

and
S ≈ 19.59 lbs.

Problem 2. To calculate the bias of this estimator, I first compute its expectation:
 !2 
X n
1
E[Θ̂] = E  Xk 
n
k=1
 
Xn X
1
= 2E  Xk2 + Xi Xj 
n
k=1 i,j:i6=j
 
n
1 X X
= 2 E[Xk2 ] + E[Xi ]E[Xj ]
n
k=1 i,j:i6=j
1 
= 2 n(σ 2 + E[Xk ]2 ) + (n2 − n)E[Xk ]2
n
1 
= 2 n(σ 2 + µ2 ) + (n2 − n)µ2
n
σ2
= + µ2 ,
n
P
where the notation i,j:i6=j refers to a sum over all pairs of i, j (i, j = 1, . . . , n) except for the pairs
where i = j. In the third line I have used the linearity of expectation and independence. The bias
is thus:
σ2
B(Θ̂) = E[Θ̂] − θ = E[Θ̂] − µ2 = ,
n

and since B(Θ̂) 6= 0, Θ̂ is a biased estimator of θ.

Problem 3.

(a) To solve this problem, I first compute the expectation of Xi :


Z 1  
1
E[Xi ] = θ x− + 1 xdx
0 2
θ θ 1
= − + .
3 4 2

I now compute the expectation of the estimator:

E[Θ̂n ] = E[12X̄ − 6]
" n
#
12 X
=E Xi − 6
n
i=1
n
12 X
= E[Xi ] − 6
n
i=1
Xn  
12 θ θ 1
= − + −6
n 3 4 2
i=1
 
θ θ 1
= 12 − + −6
3 4 2
= θ.

I therefore have that B(Θ̂n ) = E[Θ̂n ] − θ = 0, and so Θ̂n is an unbiased estimator of θ.

(b) I will use Chebyshev’s inequality to show that this is a consistent estimator and I will therefore
need to compute V ar[Θ̂n ]. To do this, I first compute E[Xi2 ]:
Z 1   
2 1
E[Xi ] = θ x− + 1 x2 dx
0 2
θ θ 1
= − + .
4 6 3
Now I compute E[Θ̂n ]:
 
E[Θ̂2n ] = E (12X̄ − 6)2
 !2 
2 Xn Xn
12 12
=E 2 Xi − 6 · 2 · Xi + 36
n n
i=1 i=1
 
Xn X n
144  2  144 X
= 2 E[Xi ] + E[Xi ]E[Xj ] − E[Xi ] + 36
n n
i=1 i,j:i6=j i=1
144 
= 2 nE[Xi2 ] + (n2 − n)E[Xi ]2 − 144E[Xi ] + 36,
n
P
where the notation i,j:i6=j refers to a sum over all pairs of i, j (i, j = 1, . . . , n) except
for the pairs where i = j. In this derivation, I have used the linearity of expectation and
independence. Plugging in E[Xi2 ] and E[Xi ] and simplifying, I find that
 
12 1
E[Θ̂2n ] = + θ2 1 − ,
n n
so that

V ar[Θ̂n ] = E[Θ̂2n ] − E[Θ̂n ]2


 
12 2 1
= +θ 1− −θ
n n
12 − θ2
= .
n

p
To show that Θ̂n is a consistent estimator, I must show that Θ̂n →
− θ. Using Chebyshev’s
inequality, I have:
lim P (|Θ̂n − θ| ≥ ) = lim P (|Θ̂n − E[Θ̂n ]| ≥ )
n→∞ n→∞
V ar[Θ̂n ]
≤ lim
n→∞ 2
12 − θ2
= lim
n→∞ n2
= 0.

Since probabilities cannot be negative, I have that limn→∞ P (|Θ̂n − θ| ≥ ) = 0 for all  > 0,
and thus Θ̂n is consistent.
(c) Since I have already computed the variance and bias of Θ̂n , computing the mean squared
error is easy:
12 − θ2
M SE(Θ̂n ) = V ar[Θ̂n ] + B[Θ̂n ]2 = .
n
Problem 4.
4
Y
L(x1 , . . . , x4 ; p) = PXi (xi ; p)
i=1
4
Y
= p(1 − p)xi −1
i=1
= p (1 − p)2−1 (1 − p)3−1 (1 − p)3−1 (1 − p)5−1
4

= p4 (1 − p)9

Problem 5.
4
Y
L(x1 , . . . , x4 ; θ) = fXi (xi ; θ)
i=1
4
Y
= θe−θxi
i=1
4 −2.35θ −1.55θ −3.25θ −2.65θ
=θ e e e e
4 −9.8θ
=θ e

Problem 6. Since log(·) is a monotonic increasing function on R, argmaxθ∈R L(x; θ) = argmaxθ∈R log L(x; θ).
This can easily be proven by considering the definition of a strictly monotonic function.

Problem 7.

(a) For a single data point, X, and our estimator Θ̂ (which is a function, f , of X) of σ 2 , we have
that E[Θ̂] = E[f (X)] = σ 2 , where the last equality follows because we want the estimator to
be unbiased. Therefore, we are searching for a function such that:
Z ∞
1 x2
√ f (x)e− 2σ2 dx = σ 2 .
−∞ 2πσ
Since the PDF is that of N (0, σ 2 ), (i.e., it has mean zero), it is clear that the function that
satisfies this equation is f (x) = x2 , and therefore Θ̂ = X 2 .

(b)
1 x2
ln L(x; σ 2 ) = − ln 2π − ln σ − 2
2 2σ
(c) Taking the derivative of the above equation with respect to σ,

∂ ln L 1 x2
= − + 3,
∂σ σ σ
setting equal to zero,
1 x2
0=− + 3 ,
σ̂M L σ̂M L

and solving for σ̂M L , I find that σ̂M L = |x|.

Problem 8.

(a)
n
Y
L(x1 , . . . , xn ; λ) = PXi (xi ; λ)
i=1
Yn
e−λ λxi
=
xi !
i=1
n
Y
Pn 1
= e−nλ λ i=1 xi
xi !
i=1

(b) The log-likelihood, `, is:

`(λ) = ln L(x1 , . . . , xn ; λ)
Pn n
X
−λn
= ln e + ln λ i=1 xi
+ ln xi !−1
i=1
n
X n
X
= −λn + xi ln λ − ln xi !.
i=1 i=1

Differentiating this respect to λ, and setting equal to zero, I have:


n
X
1
0 = −n + xi .
λ̂M L i=1

Solving for the maximum likelihood estimate, I have that:


n
1X
λ̂M L = xi ,
n
i=1

that is, the maximum likelihood estimate of λ is simply the sample mean.
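A small numerical illustration (using an arbitrary simulated sample): maximizing the Poisson log-likelihood over λ recovers the sample mean, as derived above.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

rng = np.random.default_rng(7)
x = rng.poisson(3.7, size=500)          # simulated sample; true lambda arbitrary

neg_loglik = lambda lam: -poisson.logpmf(x, lam).sum()
res = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded")

print(res.x, x.mean())                  # numerically identical
\end{verbatim}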

Problem 9. To solve for the CDF for the ith order statistic, let us assume that X1 , X2 , . . . , Xn
are a random sample from a continuous distribution with CDF, FX (x). I fix a value x ∈ R, and
define the indicator random variable, Ij , by

(
1 if Xj ≤ x
Ij (Xj ) =
0 if Xj > x,
where Ij = 1 is a “success” and Ij = 0 is a “failure.” Note that, since all Xj s are iid, the probability
of a success, P (Xj ≤ x), is the same for each trial and is given by FX (x). Therefore, I have that
iid P
Ij ∼ Bern(FX (x)). I now define the random variable, Y = nj=1 Ij , and since this is the sum of n
independent Bernoulli random trials, it has a distribution: Y ∼ Bin(n, FX (x)).
Now, given that Y ∼ Bin(n, FX (x)), the quantity, P (Y ≥ i) is therefore the probability that
there are at least i successes out of n trials. Given our definition of “success”, and given that the
number of trials n is simply the number of observations, this can be re-phrased as the probability
that there are at least i observations out of n with values less than or equal to x.
We desire to find P (X(i) ≤ x), the probability that the ith biggest observation out of n obser-
vations has a value less than or equal to x. In other words, we desire to find the probability that
there are at least i observations out of n with a value less than or equal to x. Notice that this is
exactly P (Y ≥ i), so that:

FX(i) (x) = P (X(i) ≤ x)


= P (Y ≥ i)
Xn  
n
= [FX (x)]k [1 − FX (x)]n−k .
k
k=i

Problem 10. Let region 1 be defined as the interval (−∞, x], region 2 as the interval (x, x + δ],
(where δ is a small positive number) and region 3 as the interval (x + δ, ∞). By the definition of
the PDF, the probability that the ith order statistic is in region 2 is given by P (x < X(i) ≤ x + δ) ≈
fX(i) (x)δ. In other words, for δ small enough, P (x < X(i) ≤ x + δ) is the probability that, out of n
samples, there are i − 1 samples in region 1, one in region 2 and n − i in region 3.
Now, since all samples are iid from a distribution with PDF fX (x) and CDF FX (x), the
probability that a sample lands in region 1, is
p1 = P (X ≤ x) = FX (x),
in region 2 is
p2 = P (x ≤ X ≤ x + δ) ≈ fX (x)δ,
and in region 3 is
p3 = P (X > x + δ) = 1 − FX (x + δ) ≈ 1 − FX (x).
Notice that if we define si as the event that a sample, out of n samples, lands in region i (with
associated probability, pi ), this is precisely a multinomial experiment with 3 possible outcomes.
Thus the probability that out of n samples (trials), there are i − 1 in region 1, one in region 2 and
n − i in region 3 is given by:
n!/[(i − 1)! 1! (n − i)!] · p_1^{i−1} p_2 p_3^{n−i}.
However, this is precisely the probability f_{X_(i)}(x)δ. Therefore, I have that:
f_{X_(i)}(x)δ = n!/[(i − 1)! 1! (n − i)!] · p_1^{i−1} p_2 p_3^{n−i}
             = n!/[(i − 1)! (n − i)!] · [F_X(x)]^{i−1} f_X(x)δ [1 − F_X(x)]^{n−i}.
Canceling the δ from both sides of the equation gives the desired result.

Problem 11. Since n is relatively large, the variance is known, and we would like an approximate
confidence interval for θ = E[Xi ], we can calculate the confidence interval by employing the CLT

and by using √n(X̄ − θ) as the pivotal quantity. This computation is done in the book and the
interval is given by:
[X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n].
The quantities in this interval are X̄ = 50.1, σ = 9, n = 100 and z_{α/2} = z_{0.025} = 1.96. Using
these values, I find that the 95% confidence interval is given by: [48.3, 51.9]. Note that z_{α/2} can be
computed in Python with scipy.stats.norm.ppf(1-alpha/2).
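For concreteness, the interval can be reproduced with a few lines of SciPy (my own check, using only the numbers quoted above):

import numpy as np
from scipy.stats import norm

xbar, sigma, n, alpha = 50.1, 9, 100, 0.05
z = norm.ppf(1 - alpha / 2)                  # ~1.96
half_width = z * sigma / np.sqrt(n)
print(xbar - half_width, xbar + half_width)  # ~[48.3, 51.9]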

Problem 12. In this problem, we choose a random sample of size n from a population, X1 , X2 , . . . , Xn ,
where these random variables are iid Bern(θ), where Xi is 1 if the ith voter intends to vote for
Candidate A, and 0 otherwise.
(a) We require there to be at least a 90% probability that the sample proportion, X̄ is within 3
percentage points of the actual proportion, θ. In math, this is:

P (θ − 0.03 ≤ X̄ ≤ θ + 0.03) ≥ 0.9,

and algebraically manipulating the argument, it is easy to show that the 90% confidence
interval we require is:
[X̄ − 0.03, X̄ + 0.03].

Following along Example 8.18 in the book, utilizing the CLT, and obtaining a conservative
estimate for the interval by using σmax , which for a Bernoulli distribution is 1/2 (since we do
not actually know σ), the proper interval is given by:
[X̄ − z_{α/2}/(2√n), X̄ + z_{α/2}/(2√n)].
Comparing this interval with the one above, we see that:
z_{α/2}/(2√n) = 0.03.
Thus, we require n to be at least:
⌈(z_{α/2}/(2 · 0.03))²⌉ = 748,

where I have used z_{α/2} = z_{0.05} ≈ 1.64.

(b) Using the same formula as above, but with z_{α/2} = z_{0.005} ≈ 2.58, I find that n must be at least
1849.
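The required sample sizes can also be checked numerically. The sketch below (my addition) uses the exact normal quantiles, so it returns integers slightly different from the 748 and 1849 obtained above with the rounded values 1.64 and 2.58:

import numpy as np
from scipy.stats import norm

def min_n(margin, conf):
    # smallest n with z_{alpha/2} / (2 sqrt(n)) <= margin
    z = norm.ppf(1 - (1 - conf) / 2)
    return int(np.ceil((z / (2 * margin)) ** 2))

print(min_n(0.03, 0.90), min_n(0.03, 0.99))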

Problem 13. For this problem, since n is relatively large, I use the standard approximate con-
fidence interval derived using the CLT. The variance, however, is unknown, but since n is large
should be well approximated by the sample variance, S². The proper confidence interval is thus:
[X̄ − z_{α/2} S/√n, X̄ + z_{α/2} S/√n],

and using n = 100, X̄ = 110.5, S² = 45.6, and z_{0.025} ≈ 1.96, I find the 95% confidence interval for
the distribution mean to be approximately [109.2, 111.8].

Problem 14.

(a) For an n = 36 random sample from N (µ, σ 2 ), with µ and σ 2 unknown, the proper pivotal

quantity to use to estimate µ is T = (X̄ − µ)/(S/√n), which, because it has a t distribution,
results in a confidence interval of:
[X̄ − t_{α/2,n−1} S/√n, X̄ + t_{α/2,n−1} S/√n],

as shown in the book. For the desired confidence levels (90%, 95%, 99%), the appropriate
t values are: t_{0.05,35} ≈ 1.69, t_{0.025,35} ≈ 2.03, t_{0.005,35} ≈ 2.72, and the corresponding confidence
intervals are: [34.8, 36.8], [34.6, 37.0] and [34.2, 37.4]. We see that as the confidence level
increases, the width of the interval gets wider since we desire more confidence that the actual
value of µ is encompassed by that random interval. Note that t α2 ,n−1 can be computed in
Python with scipy.stats.t.ppf(1-alpha/2, n-1).

(b) The proper pivotal quantity to use to estimate σ 2 is Q = (n − 1)S 2 /σ 2 , which because it has
a χ² distribution, results in a confidence interval of:
[(n − 1)S²/χ²_{α/2,n−1}, (n − 1)S²/χ²_{1−α/2,n−1}],

as shown in the book. Computing the proper χ²_{α/2,n−1} and χ²_{1−α/2,n−1} values, I find the following
90%, 95% and 99% confidence intervals for σ²: [8.78, 19.47], [8.22, 21.3] and [7.26, 25.4]. Again,
we see that as the confidence level increases, the width of the interval gets wider since we desire
more confidence that the actual value of σ² is encompassed by that random interval. Note
that χ²_{α/2,n−1} can be computed in Python with scipy.stats.chi2.ppf(1-alpha/2, n-1).
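Both intervals are easy to compute with SciPy. In the sketch below (my addition) the values of X̄ and S² are placeholders, since the raw data are not reproduced in this solution; substitute the actual sample statistics:

import numpy as np
from scipy.stats import t, chi2

n, alpha = 36, 0.05
xbar, s2 = 35.8, 12.6                       # hypothetical sample statistics

tq = t.ppf(1 - alpha / 2, n - 1)
ci_mean = (xbar - tq * np.sqrt(s2 / n), xbar + tq * np.sqrt(s2 / n))

ci_var = ((n - 1) * s2 / chi2.ppf(1 - alpha / 2, n - 1),
          (n - 1) * s2 / chi2.ppf(alpha / 2, n - 1))

print(ci_mean, ci_var)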

Problem 15.

(a) We recognize that since the data are drawn iid from a normal distribution, since σ 2 is known,
and since the hypotheses are of the form Ho : µ = µo and HA : µ ≠ µo, this is a 2-sided
z-test, as outlined in Table 8.2 in the book. Thus, if the statistic W = (X̄ − µo)/(σ/√n)
satisfies |W| ≤ z_{α/2}, then we fail to reject the null hypothesis; otherwise we reject it in favor of
the alternative hypothesis.
Computing X̄ and W , I find:
X̄ ≈ 5.96,
and
W ≈ 2.15.
At a level of α = 0.05, the proper threshold is z0.025 ≈ 1.96. Since W > z α2 , we reject Ho in
favor of HA at a significance level of 0.05.

(b) In this case, since the data are drawn iid from a normal distribution with known variance,
the proper (1 − α)100% confidence interval to use as shown in Section 8.3.3 of the book is:
[X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n],

which, when plugging in the particular values for this problem results in a 95% confidence
interval of approximately [5.08, 6.84].

The value µo = 5 is not within this interval. As shown in Section 8.4.3 of this book, for this

type of hypothesis test, since we accept Ho at a level of α if |(X̄ − µo)/(σ/√n)| ≤ z_{α/2}, this
results in the condition that we accept Ho if:
µo ∈ [X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n].

I.e., for this test, if µo is in the (1 − α)100% confidence interval, we accept Ho at a level
of α, otherwise we do not. Since µo = 5 is not in the calculated confidence interval, this
corresponds to rejecting Ho in favor of HA , which is indeed what we found above.

Problem 16.

(a) As with the previous problem, since the data are drawn iid from a normal distribution with
known variance, the proper (1 − α)100% confidence interval to use as shown in Section 8.3.3
of the book is:
[X̄ − z_{α/2} σ/√n, X̄ + z_{α/2} σ/√n].
For this problem X̄ ≈ 17.0, z α2 = z0.05 ≈ 1.64, so that the 90% confidence interval is approx-
imately [16.45, 17.55]. The value of µo is not included in this interval, which, as explained
above, means that we reject the null hypothesis at a significance level of α = 0.1.

(b) As shown in Section 8.4.3 of the book, the proper test statistic to use is W = (X̄ − µo)/(σ/√n),
and if |W | ≤ z α2 (see Table 8.2) we cannot reject the null hypothesis. For this problem, W ≈ 3,
z α2 = z0.05 ≈ 1.64, and therefore we reject Ho at a significance level of α = 0.1.

Problem 17. In this problem, the random sample comes from an unknown distribution with
unknown variance and with a rather large n (n = 150). Since the hypotheses we would like to test
correspond to Ho : µ = 50 and HA : µ > 50, this will most likely be a 1-sided z-test (using the

sample variance). Here I work out the test explicitly. Using W as my statistic, (X̄ − µo)/(S/√n),
If Ho is true, then we would expect X̄ ≈ µo and W ≈ 0. On the other hand, if HA is true we expect
X̄ > µo and W > 0. Therefore I employ the following test: if W ≤ c I fail to reject Ho , while if
W > c I reject Ho in favor of HA .
To solve for c I must bound the probability of making a Type I error:

P (Type I error) = P (reject Ho |Ho )


= P (W > c|Ho )
= 1 − Φ(c) (since W ∼ N (0, 1) under Ho )
≤α

Therefore, the critical value, c, occurs at equality: 1 − Φ(c) = α, or in other words c = zα . Thus,
if W ≤ zα I fail to reject Ho , while if W > zα I reject Ho in favor of HA .
For this problem X̄ = 52.28, S 2 = 30.9, and so W ≈ 5.02, while zα = z0.05 ≈ 1.64, so that I
reject Ho in favor of HA at a significance level of 0.05.
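The test is quick to re-trace in Python; the sketch below (my addition) just reproduces the numbers used above:

import numpy as np
from scipy.stats import norm

n, xbar, s2, mu0, alpha = 150, 52.28, 30.9, 50, 0.05
W = (xbar - mu0) / np.sqrt(s2 / n)
print(W, norm.ppf(1 - alpha))                # W ~ 5.02 > z_0.05 ~ 1.64, so reject Ho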

Problem 18. In this problem, the random sample comes from a normal distribution with unknown
variance, and the hypotheses we are testing are of the form Ho : µ ≥ µo and HA : µ < µo , and I
therefore use a 1-sided t-test. As indicated in Table 8.4, we fail to reject Ho if W ≥ −tα,n−1 . For
this problem,
X̄ = (27.72 + 22.24 + 32.86 + 19.66 + 35.34)/5 ≈ 27.56,

S² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)² ≈ 44.84,
so that
W = (X̄ − µo)/(S/√n) ≈ (27.56 − 30)/(√44.84/√5) ≈ −0.81.
Also, for this problem −tα,n−1 = −t0.05,4 ≈ −2.13. Since W ≥ −tα,n−1 , we fail to reject the null
hypothesis at a level of 0.05.

Problem 19. Since the random sample is drawn from an unknown distribution with unknown
variance, but the sample number is relatively large (n = 121), I can use a z-test with the sample
variance. Moreover, the hypotheses we are testing are of the form Ho : µ = µo and HA : µ < µo .

I therefore use a 1-sided z-test, where if W < −zα with W = (X̄ − µo)/(S/√n), then I reject the
null hypothesis at a significance level of α (see table 8.4).
The p-value is the probability of making a Type I error when the statistic threshold is set to
that which was observed (w1 ), in this case w1 ≈ −0.81. Thus, the p-value for this problem is:

p − value = P (Type I error when c = w1 |Ho )


= P (W < w1 |Ho )
= Φ(w1 )
≈ 0.035,

where, because of the CLT, I have used the CDF of a Gaussian.

Problem 20.
(a) We would like to test the hypotheses that Ho : θ ≥ 0.1 and HA : θ < 0.1, which, since equality
gives us a worse case scenario (as shown in Section 8.4.3 of the book), can be simplified to:

Ho : θ = θo = 0.1
HA : θ < θo .

(b) If we let Xi = 1 if the ith student has allergies, and 0 otherwise, we see that Xi ∼ Bern(θ),
so that E[Xi ] = θ and V ar[Xi ] = θ(1 − θ). Now, under the null hypothesis, and under the
CLT (since n is large), I have that

(nX̄ − nθo)/√(nθo(1 − θo)) ∼ N(0, 1).
This is a convenient test statistic to use since I have its distribution, and since if the alternative
hypothesis is true, X̄ will be small (and so will the statistic), while if the null hypothesis is
true, X̄ will be large (and so will the statistic). This suggests the following test: if
(nX̄ − nθo)/√(nθo(1 − θo)) < c
then reject the null hypothesis in favor of the alternative hypothesis, while if
(nX̄ − nθo)/√(nθo(1 − θo)) ≥ c,

fail to reject the null hypothesis.


Calculating the value of the statistic for the particular instance of the data that we have
collected:
w1 = (21 − 0.1 · 225)/√(225 · 0.1 · (1 − 0.1)) ≈ −0.33.
Now, the p-value is the probability of making a Type I error when the test threshold, c, is set
to be w1 :

p − value = P (Type I error with c = w1 )


= P (reject Ho with c = w1 |Ho )
= P((nX̄ − nθo)/√(nθo(1 − θo)) < w1 | Ho)
= Φ(w1 )
≈ 0.37.

(c) Since the p-value is the lowest significance level α that results in rejecting the null hypothesis,
at a level of α = 0.05, we cannot reject the null hypothesis.
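The p-value above is quick to verify numerically (my own check):

import numpy as np
from scipy.stats import norm

n, theta0, successes = 225, 0.1, 21
w1 = (successes - n * theta0) / np.sqrt(n * theta0 * (1 - theta0))
print(w1, norm.cdf(w1))                      # ~ -0.33 and ~0.37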

Problem 21.

(a) Using the equations for simple linear regression I have the following:

x̄ = (1/n) Σ_{i=1}^{n} x_i = (−5 − 3 + 0 + 2 + 1)/5 = −1,

ȳ = (1/n) Σ_{i=1}^{n} y_i = (−2 + 1 + 4 + 6 + 3)/5 = 2.4,

s_xx = Σ_{i=1}^{n} (x_i − x̄)² = (−5 + 1)² + (−3 + 1)² + (0 + 1)² + (2 + 1)² + (1 + 1)² = 34,

s_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) = (−5 + 1)(−2 − 2.4) + (−3 + 1)(1 − 2.4) + (0 + 1)(4 − 2.4)
       + (2 + 1)(6 − 2.4) + (1 + 1)(3 − 2.4) = 34,

β̂1 = s_xy/s_xx = 34/34 = 1,
and
β̂0 = ȳ − β̂1 x̄ = 2.4 − 1(−1) = 3.4.
The regression line is given by ŷ = β̂0 + β̂1 x, and therefore

ŷ = 3.4 + x.

(b) The regression predictions for the training data are:


ŷ1 = 3.4 − 5 = −1.6
ŷ2 = 3.4 − 3 = 0.4
ŷ3 = 3.4 + 0 = 3.4
ŷ4 = 3.4 + 2 = 5.4
ŷ5 = 3.4 + 1 = 4.4.

(c) The residuals are:


e1 = y1 − ŷ1 = −2 + 1.6 = −0.4
e2 = y2 − ŷ2 = 1 − 0.4 = 0.6
e3 = y3 − ŷ3 = 4 − 3.4 = 0.6
e4 = y4 − ŷ4 = 6 − 5.4 = 0.6
e5 = y5 − ŷ5 = 3 − 4.4 = −1.4.
Pn
As a check, we know that i=1 ei = 0, which is indeed the case.
(d) To calculate the coefficient of determination, I first need to compute s_yy:
s_yy = Σ_{i=1}^{n} (y_i − ȳ)² = (−2 − 2.4)² + (1 − 2.4)² + (4 − 2.4)² + (6 − 2.4)² + (3 − 2.4)² = 37.2,

so that
r² = s²_xy/(s_xx s_yy) = 34²/(34 · 37.2) ≈ 0.91.
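The hand computation can be verified with a short Numpy sketch (my addition):

import numpy as np

x = np.array([-5, -3, 0, 2, 1])
y = np.array([-2, 1, 4, 6, 3])

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = sxy / sxx
b0 = y.mean() - b1 * x.mean()
r2 = sxy ** 2 / (sxx * np.sum((y - y.mean()) ** 2))

print(b0, b1, r2)                            # ~3.4, 1.0, 0.91
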
Problem 22.
(a) Using the equations for simple linear regression I have the following:

x̄ = (1/n) Σ_{i=1}^{n} x_i = (1 + 3)/2 = 2,

ȳ = (1/n) Σ_{i=1}^{n} y_i = (3 + 7)/2 = 5,

s_xx = Σ_{i=1}^{n} (x_i − x̄)² = (1 − 2)² + (3 − 2)² = 2,

s_xy = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) = (1 − 2)(3 − 5) + (3 − 2)(7 − 5) = 4,

β̂1 = s_xy/s_xx = 4/2 = 2,
and
β̂0 = ȳ − β̂1 x̄ = 5 − 2 · 2 = 1.
The regression line is given by ŷ = β̂0 + β̂1 x, and therefore

ŷ = 1 + 2x.

(b) The regression predictions for the training data are:

ŷ1 = 1 + 2 · 1 = 3

ŷ2 = 1 + 2 · 3 = 7.

(c) The residuals are:

e1 = y1 − ŷ1 = 3 − 3 = 0

e2 = y2 − ŷ2 = 7 − 7 = 0.

Pn
As a check, we know that i=1 ei = 0, which is indeed the case.

(d) To calculate the coefficient of determination, I first need to compute s_yy:

s_yy = Σ_{i=1}^{n} (y_i − ȳ)² = (3 − 5)² + (7 − 5)² = 8,

so that
r² = s²_xy/(s_xx s_yy) = 4²/(2 · 8) = 1.

(e) Since there are only 2 data points in the training set, the regression line that minimizes the
sum of squared errors goes exactly through those 2 points, and thus r² = 1. This is a perfect fit
to the training data; however, it will probably not generalize well to new, unseen data, and
is therefore probably a poor predictive model.

Problem 23.

(a) According to this model, Yi ∼ N (βo + β1 xi , σ 2 ). To solve for the distribution of β̂1 , note that
it is fairly easy to find the distribution of a sum (or a linear combination) of independent
normal random variables. However, due to the fact that each Yi is in the sum that comprises
Ȳ , clearly the term, Sxy is not a sum of independent random variables. In order to express
the formula for β̂1 as a linear combination of the Yi s (which are independent), I expand the

formula for β̂1 and group each Yi term. With ci ≡ (xi − x̄), I have:

β̂1 = S_xy/s_xx
   = (1/s_xx) Σ_{i=1}^{n} c_i (Y_i − Ȳ)
   = (1/s_xx) { Y_1 [c_1 − (1/n)(c_1 + c_2 + . . . + c_n)] + Y_2 [c_2 − (1/n)(c_1 + c_2 + . . . + c_n)]
              + . . . + Y_n [c_n − (1/n)(c_1 + c_2 + . . . + c_n)] }
   = Σ_{i=1}^{n} Y_i (1/s_xx) [ (x_i − x̄) − (1/n) Σ_{j=1}^{n} (x_j − x̄) ]
   = Σ_{i=1}^{n} Y_i (x_i − x̄)/s_xx
   = Σ_{i=1}^{n} U_i,

where the second-to-last step uses the fact that Σ_{j=1}^{n} (x_j − x̄) = 0.

Now, since each Ui is a normal random variable (Yi ) multiplied by a constant, the distribution
for each Ui is given by:

U_i ∼ N( (βo + β1 x_i)(x_i − x̄)/s_xx , σ² (x_i − x̄)²/s²_xx ).

Note that now, β̂1 is a sum of independent, normal random variables, so the distribution for
β̂1 is simply normal, where the mean is the sum of the means and where the variance is the
sum of the variances:

β̂1 ∼ N( Σ_{i=1}^{n} (βo + β1 x_i)(x_i − x̄)/s_xx , σ² Σ_{i=1}^{n} (x_i − x̄)²/s²_xx ).

(b) Before I show that β̂1 is unbiased, first note the following:

s_xx = Σ_{i=1}^{n} (x_i − x̄)²
     = Σ_{i=1}^{n} (x_i² − 2x̄ x_i + x̄²)
     = Σ_{i=1}^{n} x_i² − 2x̄ n (1/n) Σ_{i=1}^{n} x_i + Σ_{i=1}^{n} x̄²
     = Σ_{i=1}^{n} x_i² − n x̄²
     = Σ_{i=1}^{n} x_i² − n x̄ (1/n) Σ_{i=1}^{n} x_i
     = Σ_{i=1}^{n} (x_i² − x̄ x_i)
     = Σ_{i=1}^{n} x_i (x_i − x̄).

Now, it can be shown that β̂1 is unbiased by simplifying the expectation of β̂1 as given above:

E[β̂1] = Σ_{i=1}^{n} (βo + β1 x_i)(x_i − x̄)/s_xx
       = (βo/s_xx) Σ_{i=1}^{n} (x_i − x̄) + (β1/s_xx) Σ_{i=1}^{n} x_i (x_i − x̄)
       = (βo/s_xx)(nx̄ − nx̄) + β1
       = β1.

(c) The variance can be further simplified by canceling out a factor of sxx :

Var[β̂1] = (σ²/s²_xx) Σ_{i=1}^{n} (x_i − x̄)² = σ²/s_xx.

Problem 24.

(a) This problem can be solved in a very similar manner to that of the previous problem. I first
expand out β̂o, use the fact that β̂1 = Σ_{i=1}^{n} Y_i (x_i − x̄)/s_xx (as found above), and group each

Yi term:

β̂o = Ȳ − β̂1 x̄
   = (1/n)(Y_1 + Y_2 + . . . + Y_n) − x̄ [Y_1 (x_1 − x̄)/s_xx + Y_2 (x_2 − x̄)/s_xx + . . . + Y_n (x_n − x̄)/s_xx]
   = Y_1 [1/n − x̄(x_1 − x̄)/s_xx] + Y_2 [1/n − x̄(x_2 − x̄)/s_xx] + . . . + Y_n [1/n − x̄(x_n − x̄)/s_xx]
   = Σ_{i=1}^{n} Y_i [1/n − x̄(x_i − x̄)/s_xx]
   = Σ_{i=1}^{n} U_i.

As in the previous problem, since each Ui is a normal random variable (Yi ) multiplied by a
constant, the distribution for each Ui is given by:

U_i ∼ N( (βo + β1 x_i)[1/n − x̄(x_i − x̄)/s_xx] , σ² [1/n − x̄(x_i − x̄)/s_xx]² ).

Also, as above, β̂o is now a sum of independent, normal random variables, so the distribution
for β̂o is simply normal, where the mean is the sum of the means and where the variance is
the sum of the variances:

β̂o ∼ N( Σ_{i=1}^{n} (βo + β1 x_i)[1/n − x̄(x_i − x̄)/s_xx] , σ² Σ_{i=1}^{n} [1/n − x̄(x_i − x̄)/s_xx]² ).

(b) To show that β̂o is unbiased, I can simplify E[β̂o] as found above (using s_xx = Σ_{i=1}^{n} x_i (x_i − x̄)
as found in the previous problem):

E[β̂o] = Σ_{i=1}^{n} (βo + β1 x_i)[1/n − x̄(x_i − x̄)/s_xx]
       = Σ_{i=1}^{n} βo/n − (βo x̄/s_xx) Σ_{i=1}^{n} (x_i − x̄) + (β1/n) Σ_{i=1}^{n} x_i − (β1 x̄/s_xx) Σ_{i=1}^{n} x_i (x_i − x̄)
       = βo − (βo x̄/s_xx)(nx̄ − nx̄) + β1 x̄ − β1 x̄
       = βo.

(c) For any i = 1, 2, . . . , n, using β̂1 = Σ_{i=1}^{n} Y_i (x_i − x̄)/s_xx (as derived in the previous problem),

I have that:

 
Cov[β̂1, Y_i] = Cov( Σ_{j=1}^{n} Y_j (x_j − x̄)/s_xx , Y_i )
             = Σ_{j=1}^{n} Cov( Y_j (x_j − x̄)/s_xx , Y_i )
             = Σ_{j=1}^{n} ((x_j − x̄)/s_xx) Cov[Y_j, Y_i]
             = ((x_i − x̄)/s_xx) Var[Y_i]
             = (x_i − x̄)σ²/s_xx,

where in the second to last line I have used the fact that all Yi s are independent, so that
Cov[Y_i, Y_j] = 0 for i ≠ j.

(d) Again, using β̂1 = Σ_{i=1}^{n} Y_i (x_i − x̄)/s_xx, I have that:

 
Cov[β̂1, Ȳ] = Cov( Σ_{i=1}^{n} Y_i (x_i − x̄)/s_xx , (1/n) Σ_{j=1}^{n} Y_j )
            = Σ_{i,j} Cov( Y_i (x_i − x̄)/s_xx , Y_j/n )
            = Σ_{i,j} ((x_i − x̄)/(n s_xx)) Cov[Y_i, Y_j]
            = Σ_{i=1}^{n} ((x_i − x̄)/(n s_xx)) Var[Y_i]
            = (σ²/(n s_xx)) (Σ_{i=1}^{n} x_i − Σ_{i=1}^{n} x̄)
            = (σ²/(n s_xx))(nx̄ − nx̄)
            = 0.

Again, I have used the fact that all Yi s are independent, so that Cov[Y_i, Y_j] = 0 for i ≠ j.

(e) The variance of β̂o can be further simplified to give the desired result:
Var[β̂o] = σ² Σ_{i=1}^{n} [1/n − x̄(x_i − x̄)/s_xx]²
        = σ² Σ_{i=1}^{n} [1/n² − 2x̄(x_i − x̄)/(n s_xx) + x̄²(x_i − x̄)²/s²_xx]
        = σ² [1/n − (2x̄/(n s_xx)) Σ_{i=1}^{n} (x_i − x̄) + (x̄²/s²_xx) Σ_{i=1}^{n} (x_i − x̄)²]
        = σ² [1/n − (2x̄/(n s_xx))(nx̄ − nx̄) + x̄²/s_xx]
        = σ² (s_xx + n x̄²)/(n s_xx)
        = σ² (Σ_{i=1}^{n} (x_i − x̄)² + n x̄²)/(n s_xx)
        = σ² (Σ_{i=1}^{n} (x_i² − 2x̄ x_i + x̄²) + n x̄²)/(n s_xx)
        = σ² Σ_{i=1}^{n} x_i² / (n s_xx).
Chapter 9

Statistical Inference II: Bayesian Inference


Problem 1. The posterior density can be found with Bayes’ rule:

f_{X|Y}(x|2) = P_{Y|X}(2|x) f_X(x) / ∫ P_{Y|X}(2|x) f_X(x) dx,

where
PY |X (2|x) = x(1 − x).
I therefore have that:
f_{X|Y}(x|2) = x²(1 − x)² / ∫_0^1 x²(1 − x)² dx
             = 30x²(1 − x)²

for 0 ≤ x ≤ 1 and 0 otherwise. As a sanity check, I made sure the posterior integrates to 1.
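That check is easy to reproduce numerically (my addition): the normalizing constant 1/∫_0^1 x²(1 − x)² dx should equal 30, and the resulting posterior should integrate to 1:

from scipy.integrate import quad

c = 1 / quad(lambda x: x**2 * (1 - x)**2, 0, 1)[0]
total = quad(lambda x: c * x**2 * (1 - x)**2, 0, 1)[0]
print(c, total)                              # 30.0, 1.0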

Problem 2. From Baye’s rule we know that the posterior density for 0 ≤ x ≤ 1 is:
f_{X|Y}(x|5) ∝_x P_{Y|X}(5|x) f_X(x) = 3x³(1 − x)⁴,

(and 0 otherwise), where the symbol ∝_x means proportional to as a function of x. Therefore, the
MAP estimate is given by
x̂M AP = arg max{3x3 (1 − x)4 },
x∈[0,1]

which can be found setting the derivative of the argument equal to zero and solving for x:

0 = 9x̂2M AP (1 − x̂M AP )4 − 12x̂3M AP (1 − x̂M AP )3

=⇒ x̂M AP = 3/7, which is indeed in the interval [0, 1]

Problem 3. The conditional PDF of X|Y is given by

f_{X|Y}(x|y) = f_{XY}(x, y)/f_Y(y) ∝_x f_{XY}(x, y),

so that the MAP estimate of x is:

x̂M AP = arg max{fXY (x, y)}.


x∈[0,1]

On the other hand, the conditional PDF of Y |X is given by

f_{Y|X}(y|x) = f_{XY}(x, y)/f_X(x) = f_{XY}(x, y) / ∫ f_{XY}(x, y) dy,

so that the ML estimate of x is:


 
x̂_ML = arg max_{x∈[0,1]} { f_{XY}(x, y) / ∫ f_{XY}(x, y) dy }.

Since the joint PDF is


f_{XY}(x, y) = x + (3/2)y² for 0 ≤ x, y ≤ 1, and 0 otherwise,

we see that fXY (x, y) is maximized when x = 1, and therefore x̂M AP = 1.


From the equation above:

f_{Y|X}(y|x) = (x + (3/2)y²) / ∫_0^1 (x + (3/2)y²) dy
             = (2x + 3y²)/(2x + 1)
             = 1 + (3y² − 1)/(2x + 1),

and thus we see that, to maximize this expression over x, if 3y² − 1 ≤ 0 we should make the denominator
large (take x = 1), while if 3y² − 1 > 0 we should make it small (take x = 0). Therefore:
x̂_ML = 1 for y ≤ 1/√3, and 0 otherwise.

Problem 4. The posterior distribution is:

f_{X|Y}(x|y) = f_{Y|X}(y|x) f_X(x) / ∫ f_{Y|X}(y|x) f_X(x) dx
             = (xy − x/2 + 1)(2x² + 1/3) / ∫_0^1 (xy − x/2 + 1)(2x² + 1/3) dx
             = (2x³y + xy/3 − x³ − x/6 + 2x² + 1/3) / ((2/3)y + 2/3),

so that

x̂_M = E[X|Y = y]
     = (1/((2/3)y + 2/3)) ∫_0^1 (2x⁴y + x²y/3 − x⁴ − x²/6 + 2x³ + x/3) dx
     = (46y + 37)/(60(y + 1)).

Problem 5.

(a) First note that since X ∼ N (0, 1), W ∼ N (0, 1) (where X and W are independent), we have
that Y = 2X + W ∼ N (0, 5). Now, aY + bX = a(2X + W ) + bX = (2a + b)X + aW which is
normal for all a, b ∈ R, and thus X and Y are jointly normal. Thus, by Theorem 5.4 in the
book, I have that:
X̂_M = E[X|Y] = µ_X + ρ σ_X (Y − µ_Y)/σ_Y.

From the distributions of X and Y, I have that µ_X = µ_Y = 0, σ_Y = √5, σ_X = 1, and:

Cov[X, Y ] = Cov[X, 2X + W ]
= 2V ar[X] + Cov[X, W ]
= 2V ar[X] (since X, W are independent)
= 2.

Plugging in the values, I find that


X̂_M = E[X|Y]
     = (2/√5) · (Y/√5)
     = 2Y/5.
(b) To solve this problem, I use the facts that X and W are independent and E[X 2 ] = E[W 2 ] = 1:
MSE = E[(X − X̂_M)²]
    = E[X²] − 2E[X X̂_M] + E[X̂_M²]
    = 1 − 2E[X(2Y/5)] + E[(2Y/5)²]
    = 1 − 2E[(2/5)(2X² + XW)] + E[(4/25)(4X² + 4XW + W²)]
    = 1 − (4/5)(E[2X²] + E[X]E[W]) + (4/25)(E[4X²] + 4E[X]E[W] + E[W²])
    = 1 − (4/5)(2 · 1) + (4/25)(4 · 1 + 1)
    = 1/5.
(c) We know that E[X²] = 1, and from the derivation above we can see that E[X̂_M²] = E[(2Y/5)²] =
20/25 = 4/5. Also, from the derivation above, E[X̃²] = MSE = 1/5. Therefore E[X̂_M²] +
E[X̃²] = 4/5 + 1/5 = 1, and so the relation is verified.


Problem 6.
(a) The linear MMSE estimator is given by
X̂_L = (Cov[X, Y]/Var[Y]) (Y − E[Y]) + E[X],
and therefore I must find these values. First note that since X ∼ U nif (0, 1), E[X] = 1/2
and V ar[X] = 1/12, which implies that E[X 2 ] = 1/3. Also note that since Y |X = x ∼
Exp(1/(2x)), E[Y |X] = 2X and V ar[Y |X] = 4X 2 . Now, since I know the distribution of
Y |X, to find E[Y ], I use the law of iterated expectation:
E[Y ] = E[E[Y |X]]
= E[2X]
= 1.
Similarly, to find V ar[Y ], I use the law of total variance:
V ar[Y ] = E[V ar[Y |X]] + V ar[E[Y |X]]
= E[4X 2 ] + V ar[2X]
= 4E[X 2 ] + 4V ar[X]
= 4 · (1/3) + 4 · (1/12)
= 5/3.

To find Cov[X, Y ], I first solve for E[XY ], again using the law of iterated expectation:

E[XY ] = E[E[XY |X]]


= E[XE[Y |X]]
= E[X2X]
= 2E[X 2 ]
= 2/3.
Thus I have that:

Cov[X, Y ] = E[XY ] − E[X]E[Y ]


= (2/3) − (1/2) · 1
= 1/6.
Plugging all these values into the formula for X̂L , I find that:
X̂_L = (1/10) Y + 2/5.

(b) The MSE of X̂L is given by:

MSE = (1 − ρ²) Var[X]
    = (1 − Cov[X, Y]²/(Var[X] Var[Y])) Var[X]
    = (1 − (1/6)²/((1/12)(5/3))) · (1/12)
    = 1/15.

(c) First, note that E[Y 2 ] = V ar[Y ] + (E[Y ])2 = 5/3 + 1 = 8/3. Now, I have that:

E[X̃Y] = E[(X − X̂_L)Y]
       = E[XY − ((1/10)Y + 2/5)Y]
       = E[XY] − (1/10)E[Y²] − (2/5)E[Y]
       = 2/3 − (1/10)(8/3) − 2/5
       = 0.

Problem 7.

(a) First note that Y ∼ N(0, σ²_X + σ²_W). Also note that aX + bY = (a + b)X + bW, which is normally
distributed for all a, b ∈ R, and thus X and Y are jointly normal, and so by Theorem 5.4 in
the book:
X̂_M = E[X|Y] = µ_X + ρ σ_X (Y − µ_Y)/σ_Y.

Solving for the covariance is easy: Cov[X, Y ] = Cov[X, X + W ] = V ar[X] + Cov[X, W ] =


V ar[X] = σX2 , where Cov[X, W ] = 0 since X and W are independent. I thus have that:

X̂_M = µ_X + (Cov[X, Y]/(σ_X σ_Y)) σ_X (Y − µ_Y)/σ_Y
     = (σ²_X/(σ_X √(σ²_X + σ²_W))) · σ_X · (Y/√(σ²_X + σ²_W))
     = (σ²_X/(σ²_X + σ²_W)) Y.

(b) The MSE is:

MSE(X̂_M) = E[(X − X̂_M)²]
          = E[X²] − 2E[X X̂_M] + E[X̂_M²]
          = E[X²] − 2(σ²_X/(σ²_X + σ²_W)) E[XY] + (σ²_X/(σ²_X + σ²_W))² E[Y²]
          = E[X²] − 2(σ²_X/(σ²_X + σ²_W)) E[X(X + W)] + (σ²_X/(σ²_X + σ²_W))² E[(X + W)²]
          = E[X²] − 2(σ²_X/(σ²_X + σ²_W)) (E[X²] + E[X]E[W]) + (σ²_X/(σ²_X + σ²_W))² (E[X²] + 2E[X]E[W] + E[W²])
          = σ²_X σ²_W/(σ²_X + σ²_W),

where in the second to last line I have used that X and W are independent, and where I have
used that E[X²] = σ²_X, E[W²] = σ²_W, and E[X] = E[W] = 0.
W

Problem 8. Following along Example 9.9 of the book, I use the principle of orthogonality to solve
for X̂L . I first note that since E[X] = E[W1 ] = E[W2 ] = 0, E[Y1 ] = E[Y2 ] = 0. We would like the
linear MMSE estimator to be of the form:

X̂L = aY1 + bY2 + c.

Now, using the first of the two orthogonality principle equations I can solve for c:

0 = E[X̃] = E[X − X̂L ] = E[X] − aE[Y1 ] − bE[Y2 ] − c = −c,

so that c = 0.
Using the second of the two orthogonality principle equations I may solve for the remaining
constants by noting that

Cov[X̃, Yj ] = Cov[X − X̂L , Yj ] = 0 for j = 1, 2,

and thus:
Cov[X, Yj ] = Cov[X̂L , Yj ] for j = 1, 2.
I now compute the four covariances, and plug them into the above equations (one for j = 1 and
one for j = 2) to obtain a coupled set of equations in a and b.

For the Cov[X, Yj ] covariances, I have

Cov[X, Y1 ] = Cov[X, 2X + W1 ] = 2V ar[X] + Cov[X, W1 ] = 2V ar[X] = 10,

where I have used the fact that since X, W1 are independent, their covariance is 0. Also,

Cov[X, Y2 ] = Cov[X, X + W2 ] = V ar[X] + Cov[X, W2 ] = V ar[X] = 5,

where again I have used independence.


I now solve for the other two covariances:

Cov[X̂L , Y1 ] = Cov[aY1 + bY2 , Y1 ]


= aCov[Y1 , Y1 ] + bCov[Y2 , Y1 ]
= aCov[2X + W1 , 2X + W1 ] + bCov[X + W2 , 2X + W1 ]
= a(4V ar[X] + 4Cov[X, W1 ] + V ar[W1 ])
+ b(2V ar[X] + Cov[X, W1 ] + 2Cov[W2 , X] + Cov[W2 , W1 ])
= a(4V ar[X] + V ar[W1 ]) + 2bV ar[X]
= a(4 · 5 + 2) + 2 · b · 5
= 22a + 10b,

and

Cov[X̂L , Y2 ] = Cov[aY1 + bY2 , Y2 ]


= aCov[Y1 , Y2 ] + bCov[Y2 , Y2 ]
= aCov[2X + W1 , X + W2 ] + bCov[X + W2 , X + W2 ]
= a(2V ar[X] + 2Cov[X, W2 ] + Cov[W1 , X] + Cov[W1 , W2 ])
+ b(V ar[X] + 2Cov[X, W2 ] + V ar[W2 ])
= 2aV ar[X] + b(V ar[X] + V ar[W2 ])
= 2 · a · 5 + b(5 + 5)
= 10a + 10b

where in both computations, I have used the independence of X, W1 , W2 .


Putting these 4 equations together, I get a coupled set of algebraic equations in a and b:

10 = 22a + 10b
5 = 10a + 10b,

resulting in a = 5/12 and b = 1/12. The linear MMSE estimator of X is thus:


5 1
X̂L = Y1 + Y2 .
12 12
Problem 9. In order to use the vector formula

X̂L = CXY CY−1 (Y − E[Y ]) + E[X],

to solve for the linear MMSE estimator, I need to compute the vectors/matrices that go into this
formula. Firstly, it can easily be shown that
   
E[Y1 ] 0
E[Y ] = = .
E[Y2 ] 0

The covariance matrix of Y is:


 
E[(Y1 − E[Y1 ])2 ] E[(Y1 − E[Y1 ])(Y2 − E[Y2 ])]
CY =
E[(Y2 − E[Y2 ])(Y1 − E[Y1 ])] E[(Y2 − E[Y2 ])2 ]
 
E[Y12 ] E[Y1 Y2 ]
=
E[Y2 Y1 ] E[Y22 ]
 
E[(2X + W1 )2 ] E[(2X + W1 )(X + W2 )]
=
E[(X + W2 )(2X + W1 )] E[(X + W2 )2 ]
 
E[4X 2 + 4XW1 + W12 ] E[2X 2 + 3XW1 + W1 W2 ]
=
E[2X 2 + 3XW1 + W1 W2 ] E[X 2 + 2XW2 + W22 ]
 
4E[X 2 ] + E[W12 ] 2E[X 2 ]
=
2E[X 2 ] E[X 2 ] + E[W22 ]
 
4·5+2 2·5
=
2·5 5+5
 
22 10
= ,
10 10

where I have used the fact that X, W1 , W2 are independent.


The cross-covariance matrix of X and Y is:
 
CXY = E[(X − E[X])(Y1 − E[Y1 ])] E[(X − E[X])(Y2 − E[Y2 ])]
 
= E[XY1 ] E[XY2 ]
 
= E[X(2X + W1 )] E[X(X + W2 )]
 
= 2E[X 2 ] + E[XW1 ] E[X 2 ] + E[XW2 ]
 
= 2E[X 2 ] E[X 2 ]
 
= 10 5 ,

where, again, I have used the fact that X, W1 , W2 are independent.


Plugging these matrices into the formula for X̂_L, I finally have:
X̂_L = [10 5] [[22, 10], [10, 10]]^{−1} Y
     = (5/12) Y_1 + (1/12) Y_2,
which matches exactly with the answer found in the previous problem which utilized the orthogo-
nality principle.
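The matrix arithmetic is easily reproduced with Numpy (my own check of the coefficients found above):

import numpy as np

C_XY = np.array([[10.0, 5.0]])               # cross-covariance of X with (Y1, Y2)
C_Y = np.array([[22.0, 10.0],
                [10.0, 10.0]])               # covariance matrix of (Y1, Y2)

print(C_XY @ np.linalg.inv(C_Y))             # ~[[0.4167, 0.0833]] = [5/12, 1/12]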

Problem 10. To solve this problem, I use the vector formula approach as in the previous problem.
It can easily be shown that    
E[Y1 ] 0
E[Y ] = E[Y2 ] = 0 .
E[Y3 ] 0
Also, since E[Yi ] = 0 (for i = 1, 2, 3), the covariance matrix of Y is
 
E[Y12 ] E[Y1 Y2 ] E[Y1 Y3 ]
CY = E[Y2 Y1 ] E[Y22 ] E[Y2 Y3 ]
E[Y3 Y1 ] E[Y3 Y2 ] E[Y32 ].

Therefore, I need to compute all of the pairwise expectations along the diagonal and in the upper
right triangle of the matrix. Notice that all of the pairwise expectations are of the form E[Yi Yj ],
where Yi = ai X + bi Wi , so I derive a general formula for all 6 pairs for easy computation of the
expectations:

E[Yi Yj ] = E[(ai X + bi Wi )(aj X + bj Wj )]


= ai aj E[X 2 ] + ai bj E[XWj ] + bi aj E[Wi X] + bi bj E[Wi Wj ]
= ai aj E[X 2 ] + bi bj E[Wi2 ]δi,j
= 5ai aj + b2i E[Wi2 ]δi,j ,

where δi,j is the Kronecker-delta function, and where I used independence several times. Using this
formula, it is easy to find that  
22 10 10
CY = 10 10 5 
10 5 17.
Since E[X] = 0 and E[Yi ] = 0 (for i = 1, 2, 3), the cross-covariance matrix is also easy to
compute
 
CXY = E[XY1 ] E[XY2 ] E[XY3 ]
 
= E[X(2X + W1 )] E[X(X + W2 )] E[X(X + W3 )]
 
= 2E[X 2 ] E[X 2 ] E[X 2 ]
 
= 10 5 5

Plugging these matrices into the vector formula for X̂_L, I finally have:
X̂_L = [10 5 5] [[22, 10, 10], [10, 10, 5], [10, 5, 17]]^{−1} Y
     = (60/149) Y_1 + (12/149) Y_2 + (5/149) Y_3.
Problem 11.

(a) The linear MMSE estimator is given by:

Cov[X, Y ]
X̂L = (Y − E[Y ]) + E[X],
V ar[Y ]
and so I must compute the various terms in this equation.
The expectations are
3
E[X] = E[Y ] = ,
7
while the covariance is

Cov[X, Y ] = E[XY ] − E[X]E[Y ]


 2
3
=0−
7
9
=− .
49

Finally, the variance of Y is:


     
3 2 1 3 3 2 3
V ar[Y ] = 0 − · + + 1− ·
7 7 7 7 7
12
= .
49

Plugging these values into the above equation for X̂L , I find that:
3 3
X̂L = − Y + .
4 4

(b) The MMSE estimator is given by


X̂M = E[X|Y ],
and the conditional PMFs can easily be found to be
(
1
for x = 0
PX|Y (x|0) = 43
4 for x = 1

and (
1 for x = 0
PX|Y (x|1) =
0 for x = 1.
This results in the conditional expectations, E[X|Y = 0] = 3/4 and E[X|Y = 1] = 0, which
can be combined into one expression using an indicator random variable:
3
X̂M = 1{Y = 0}.
4

(c) The MSE of X̂M is given by:

M SE(X̂M ) = E[(X − X̂M )2 ]


X 3
2
= x − 1{y = 0} PXY (x, y)
x,y
4
 2  2  2
3 1 3 3 3 3
= 0 − 1{0 = 0} · + 1 − 1{0 = 0} · + 0 − 1{1 = 0} ·
4 7 4 7 4 7
 2  2
3 1 3 3
= − · + 1− ·
4 7 4 7
3
= .
28

Problem 12.

(a) Following along with the previous problem, I must calculate the terms that go into the formula
for X̂L . The expectations are:
1 1 1
E[X] = + =
3 6 2
and
1 1 2
E[Y ] = + 2 · = ,
3 6 3

while the covariance is:

Cov[X, Y ] = E[XY ] − E[X]E[Y ]


1 2 1
=2· − ·
6 3 2
= 0.

Therefore, the linear MMSE estimator is simply a constant:

1
X̂L = E[X] = .
2

(b) The MSE for X̂L is given by:

M SE(X̂L ) = E[(X − X̂L )2 ]


1
= E[(X − )2 ]
2
1
= E[X 2 ] − E[X] +
4
1 1 1
= − +
2 2 4
1
= .
4

(c) To find X̂M , I must first find the conditional PMF of X:

(
1
3 for x = 0
PX|Y (x|0) = 2
3 for x = 1,

(
1 for x = 0
PX|Y (x|1) =
0 for x = 1.

and
(
0 for x = 0
PX|Y (x|1) =
1 for x = 1.

Thus, I have the conditional expectations, E[X|Y = 0] = 2/3, E[X|Y = 1] = 0, and E[X|Y =
2] = 1, which can be combined into one expression using an indicator random variable:

X̂M = E[X|Y ]
2
= 1{Y = 0} + 1{Y = 2}.
3

(d) The MSE of X̂M is given by:

M SE(X̂M ) = E[(X − X̂M )2 ]


X 2
2
= x − 1{y = 0} − 1{y = 2} PXY (x, y)
x,y
3
 2  2
2 1 2 1
= 0 − 1{0 = 0} − 1{0 = 2} · + 1 − 1{0 = 0} − 1{0 = 2} ·
3 6 3 3
 2  2
2 1 2 1
+ 0 − 1{1 = 0} − 1{1 = 2} · + 1 − 1{2 = 0} − 1{2 = 2} ·
3 3 3 6
 2  2
2 1 2 1 1 1
= 0− · + 1− · + (0 − 0)2 · + (1 − 1)2 ·
3 6 3 3 3 6
1
= .
9

Problem 13. In this problem, the two hypotheses are:

Ho : X = 1
H1 : X = −1,

where the priors are P (Ho ) = p and P (H1 ) = 1 − p. Note that given Ho , Y = 2 + W so that
Y |Ho ∼ N (2, σ 2 ), and that given H1 , Y = −2 + W so that Y |H1 ∼ N (−2, σ 2 ). The posterior
probability of Ho is thus:

P(Ho|Y = y) ∝ f_Y(y|Ho) P(Ho) = (1/(√(2π)σ)) e^{−(y−2)²/(2σ²)} p,

while the posterior probability of H1 is thus:

P(H1|Y = y) ∝ f_Y(y|H1) P(H1) = (1/(√(2π)σ)) e^{−(y+2)²/(2σ²)} (1 − p).

The MAP decision rule for this problem is to accept Ho if the posterior probability under Ho is
greater than or equal to the posterior probability under H1 . In other words, we accept Ho if:

(1/(√(2π)σ)) e^{−(y−2)²/(2σ²)} p ≥ (1/(√(2π)σ)) e^{−(y+2)²/(2σ²)} (1 − p),

otherwise we accept H1 .
Using some algebra, this rule can be re-written as: if
 
y ≥ (σ²/2) ln((1 − p)/p),

then accept Ho , otherwise accept H1 .

Problem 14. The error probability is given by:

Pe = P (choose H1 |Ho )P (Ho ) + P (choose Ho |H1 )P (H1 ).



We can replace “choose H1 ” and “choose Ho ” with our decision rule to compute the conditional
probabilities:

P(choose H1|Ho) = P(Y < (σ²/2) ln((1 − p)/p) | X = 1)
                = P(2X + W < (σ²/2) ln((1 − p)/p) | X = 1)
                = P(2 + W < (σ²/2) ln((1 − p)/p))
                = Φ((σ/2) ln((1 − p)/p) − 2/σ).
Likewise, the other conditional probability is given by

P(choose Ho|H1) = P(Y ≥ (σ²/2) ln((1 − p)/p) | X = −1)
                = P(2X + W ≥ (σ²/2) ln((1 − p)/p) | X = −1)
                = P(−2 + W ≥ (σ²/2) ln((1 − p)/p))
                = 1 − Φ((σ/2) ln((1 − p)/p) + 2/σ).
Plugging these conditional probabilities into the formula for the error probability, I find that

Pe = Φ((σ/2) ln((1 − p)/p) − 2/σ) p + [1 − Φ((σ/2) ln((1 − p)/p) + 2/σ)] (1 − p).

Problem 15. Using the minimum cost decision method, we should accept Ho if P (Ho |y)C10 ≥
P (H1 |y)C01 . Note that C01 is the cost of choosing Ho (there is no malfunction) given H1 is true
(there is a malfunction). That is, C01 is the cost of missing a malfunction, so that, as specified in
the problem C01 = 30C10 .
The left hand side of the inequality decision rule is therefore:

P (Ho |y)C10 = (1 − P (H1 |y))C10


= 0.9C10 ,

while the right hand side of the inequality is

P (H1 |y)C01 = 0.1 · 30C10


= 3C10 .

Since the costs are usually not negative, we see that P (H1 |y)C01 > P (Ho |y)C10 and we thus should
accept H1 , the hypothesis that there is a malfunction.

Problem 16. For X, Y jointly normal, we know from Theorem 5.4 in the book that:

X|Y = y ∼ N( µ_X + ρ σ_X (y − µ_Y)/σ_Y , (1 − ρ²) σ²_X ).
Since X ∼ N (2, 1) and Y ∼ N (1, 5), using the above formula, I have that the posterior distribution
is X|Y = 1 ∼ N (2, 15/16).

When choosing a confidence interval, to keep things symmetric, we usually choose the confidence
interval such that α/2 of the probability is in the left tail of the distribution (i.e., P (X < a|Y = 1) =
α/2) and α/2 of the probability is in the right tail of the distribution (i.e., P (X > b|Y = 1) = α/2).
Therefore, to find the 90% credible interval, [a, b], I have the following equations:

0.05 = Φ((a − 2)/(√15/4)),
and
1 − 0.05 = Φ((b − 2)/(√15/4)).
Solving for a and b by using the inverse Gaussian CDF, I find that the 90% credible interval is
approximately [0.41, 3.6].
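The interval endpoints can be obtained directly from the inverse Gaussian CDF (my addition):

import numpy as np
from scipy.stats import norm

mu, sd = 2, np.sqrt(15) / 4                  # posterior is N(2, 15/16)
print(norm.ppf(0.05, loc=mu, scale=sd), norm.ppf(0.95, loc=mu, scale=sd))   # ~[0.41, 3.6]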
Problem 17.
(a) The posterior distribution can be found using Bayes’ rule. For x > 0 :
f_{X|Y}(x|y) ∝_x P_{Y|X}(y|x) f_X(x)
             ∝_x e^{−x} x^y x^{α−1} e^{−βx}
             = e^{−x(β+1)} x^{α+y−1}
             ∝_x (β + 1)^{α+y} e^{−x(β+1)} x^{α+y−1},
x
while for x ≤ 0, fX|Y (x|y) = 0. The notation ∝ means proportional to as a function of x.
Notice that since α > 0 and y ≥ 0, y + α is therefore greater than 0. Also, since β > 0,
1+β > 0. Therefore, this function is exactly of the form of a Gamma(α+y, β+1) distribution,
and so we know that the normalizing constant is Γ(α + y).
(b) In the previous part of the problem, I showed that X|Y = y ∼ Gamma(α + y, β + 1), and
therefore:
f_{X|Y}(x|y) = (β + 1)^{α+y} x^{α+y−1} e^{−x(β+1)} / Γ(α + y) for x > 0, and 0 otherwise.

(c) For U ∼ Gamma(α, λ), as shown in Section 4.2.4 of the book, E[U] = α/λ and Var[U] =
α/λ². Therefore, I have
E[X|Y] = (α + Y)/(β + 1)
and
Var[X|Y] = (α + Y)/(β + 1)².
Problem 18.
(a) The posterior distribution can be found using Bayes’ rule. For 0 ≤ x ≤ 1 :
f_{X|Y}(x|y) ∝_x P_{Y|X}(y|x) f_X(x)
             ∝_x x^y (1 − x)^{n−y} x^{α−1} (1 − x)^{β−1}
             = x^{α+y−1} (1 − x)^{β+n−y−1},
while for x < 0 and x > 1, fX|Y (x|y) = 0. Now, since α > 0 and y ≥ 0, α + y > 0. Also, since
β > 0 and n ≥ y, β + n − y > 0. Thus, since this equation, up to a normalization constant,
has the exact same functional form as a Beta(α + y, β + n − y) distribution, the posterior is
given by this distribution.

(b) Since the posterior is given by a Beta(α + y, β + n − y) distribution, I have that:


( Γ(α+β+n) α+y−1 (1 − x)β+n−y−1 for 0 ≤ x ≤ 1
Γ(α+y)Γ(β+n−y) x
fX|Y (x|y) =
0 otherwise.

(c) Plugging in my values for α and β into the formulas for the expectation and variance of a
Beta distribution, I find:
E[X|Y] = (α + Y)/(α + β + n)
and
Var[X|Y] = (α + Y)(β + n − Y)/[(α + β + n)²(α + β + n + 1)].

Problem 19.

(a) Since Y |X = x ∼ Geom(x) I have that:

PY |X (y|x) = x(1 − x)y−1 y = 1, 2, 3, . . . .

Now, the posterior distribution can be found using Bayes’ rule. For 0 ≤ x ≤ 1 :
f_{X|Y}(x|y) ∝_x P_{Y|X}(y|x) f_X(x)
             ∝_x x(1 − x)^{y−1} x^{α−1} (1 − x)^{β−1}
             = x^{(α+1)−1} (1 − x)^{(β+y−1)−1},

while for x < 0 and x > 1, fX|Y (x|y) = 0. Since α > 0, α + 1 > 0. Also, since β > 0,
y − 1 ≥ 0, β + y − 1 > 0. We thus see that up to a normalizing constant, this is the PDF for
a Beta(α + 1, β + y − 1) distribution, and hence X|Y = y ∼ Beta(α + 1, β + y − 1).

(b) Since X|Y = y ∼ Beta(α + 1, β + y − 1), the posterior distribution is:


f_{X|Y}(x|y) = [Γ(α+β+y)/(Γ(α+1)Γ(β+y−1))] x^{(α+1)−1} (1 − x)^{(β+y−1)−1} for 0 ≤ x ≤ 1, and 0 otherwise.

(c) Plugging in my values for α and β into the formulas for the expectation and variance of a
Beta distribution, I find:
E[X|Y] = (α + 1)/(α + β + Y)
and
Var[X|Y] = (α + 1)(β + Y − 1)/[(α + β + Y)²(α + β + Y + 1)].

Problem 20.
iid
(a) Since Yi |X = x ∼ Exp(x), I have that
f_{Y_i|X}(y|x) = x e^{−xy} for y > 0, and 0 otherwise.

The likelihood function is thus:

L(Y; x) = f_{Y_1,...,Y_n|X}(y_1, . . . , y_n|x)
        = ∏_{i=1}^{n} f_{Y_i|X}(y_i|x) (by independence)
        = ∏_{i=1}^{n} x e^{−x y_i}
        = x^n e^{−x Σ_{i=1}^{n} y_i}.

(b) Since X ∼ Gamma(α, β), I have that f_X(x) ∝_x x^{α−1} e^{−βx} for x > 0 and f_X(x) = 0 otherwise.
Therefore, for x > 0, the posterior is:
f_{X|Y_1,...,Y_n}(x|y_1, . . . , y_n) ∝_x f_{Y_1,...,Y_n|X}(y_1, . . . , y_n|x) f_X(x)
                                     = L(Y; x) f_X(x)
                                     ∝_x x^n e^{−x Σ_{i=1}^{n} y_i} x^{α−1} e^{−βx}
                                     = x^{α+n−1} e^{−x(Σ_{i=1}^{n} y_i + β)},

while for x ≤ 0, f_{X|Y_1,...,Y_n}(x|y_1, . . . , y_n) = 0. Now, since α and n are greater than 0, so too is
α + n. Further, since it is assumed that Y_i|X ∼ Exp(X), it is implicit that all y_i s are greater
than 0, and since β > 0, so too is Σ_{i=1}^{n} y_i + β. Therefore, we see that, up to a normalizing
constant, the posterior has the functional form of a Gamma(α + n, Σ_{i=1}^{n} y_i + β) distribution,
so that X|Y = y ∼ Gamma(α + n, Σ_{i=1}^{n} y_i + β).

(c) The posterior PDF is given by:

f_{X|Y_1,...,Y_n}(x|y_1, . . . , y_n) = (Σ_{i=1}^{n} y_i + β)^{α+n} x^{α+n−1} e^{−x(Σ_{i=1}^{n} y_i + β)} / Γ(α + n) for x > 0, and 0 otherwise.

(d) For U ∼ Gamma(α, λ), as shown in Section 4.2.4 of the book, E[U] = α/λ and Var[U] =
α/λ². Therefore, I have
E[X|Y] = (α + n)/(Σ_{i=1}^{n} Y_i + β)
and
Var[X|Y] = (α + n)/(Σ_{i=1}^{n} Y_i + β)².
Chapter 10

Introduction to Simulation Using Python


The Python programming language is one of the most popular languages in both academia and
industry. It is heavily used in data science for simple data analysis and complex machine learning.
By most accounts, in the last few years, Python has eclipsed the R programming language in
popularity for scientific/statistical computation. Its popularity is due to intuitive and readable
syntax that can be implemented in a powerful object oriented programming paradigm, if so desired,
as well as being open source. It is for these reasons that I decided to transcribe the Introduction to
Simulation chapter in Pishro-Nik’s Introduction to Probability, Statistics and Random Processes
book into Python.
This entire chapter was written in a Jupyter notebook, an interactive programming environment,
primarily for Python, that can be run locally in a web browser. Jupyter notebooks are ideal for
quick and interactive data analysis, incorporating markdown functionality for clean presentations
and code sharing. If you are a fan of RStudio, you will most likely be fond of Jupyter notebooks.
This entire notebook is available freely at https://github.com/dsrub/solutions_to_probability_
statistics.
Additionally, much of this code was written using the Numpy/SciPy library, Python’s main library
for scientific computation and numerical methods. Numpy has a relatively clear and well docu-
mented API (https://docs.scipy.org/doc/numpy/reference/index.html), a reference which I utilize
almost daily.
I start with a few basic imports, and define several functions I will use throughout the rest of this
chapter.
#define html style element for notebook formatting
from IPython.core.display import HTML

with open('style.txt', 'r') as myfile:


notebook_style = myfile.read().replace('\n', '')

HTML(notebook_style)

#import some relevant packages and plot inline


import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

#define a few functions I will be using throughout the rest of the notebook

#function to print several of the RGNs to the screen


def print_vals(RNG_function, *args):
for i in range(5):
print('X_' + str(i)+' = ', RNG_function(*args))

#plotting function
def plot_results(x, y, xlim=None, ylim=None, xlabel=None, ylabel=None, \
title=None, labels=None):

plt.figure(1, figsize = (6, 4))


plt.rc('text', usetex=True)
plt.rc('font', family = 'serif')

if labels:
plt.plot(x[0], y[0], label=labels[0], linewidth = 2)
plt.plot(x[1], y[1], label=labels[1], linewidth = 2)
plt.legend(loc='upper right')
else:
plt.plot(x, y, linewidth = 2)
if xlim:
plt.xlim(xlim)
if ylim:
plt.ylim(ylim)
if xlabel:
plt.xlabel(xlabel, size = 15)
if ylabel:
plt.ylabel(ylabel, size = 15)
if title:
plt.title(title, size=15)

plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15);

Example 1. (Bernoulli) Simulate tossing a coin with probability of heads p.


Solution: We can utilize the algorithm presented in the book, which uses random vari-
ables drawn from a U nif (0, 1) distribution. The following function implements this
algorithm in Python to generate a Bern(p) (pseudo) random variable.
def draw_bern(p, N):
"""
A Bern(p) pseudo-RNG
"""
U = np.random.uniform(size = N)
if N == 1: U = U[0]
X = (U < p) + 0

return X
#print a few examples of the RGNs to the screen
p = 0.5
print_vals(draw_bern, p, 1)

X_0 = 0
X_1 = 0
X_2 = 0
X_3 = 0
X_4 = 1
Note that we can directly sample from a Bern(p) distribution with Numpy’s binomial random
number generator (RNG) by setting n = 1 with: np.random.binomial(1, p).
Example 2. (Coin Toss Simulation) Write code to simulate tossing a fair coin to see how the law
of large numbers works.

Solution: I draw 1000 Bern(0.5) random variables and compute the cumulative average.
#generate data, compute proportion of heads and plot

#set a seed for reproducibility


np.random.seed(2)

X = draw_bern(0.5, 1000)
avg = np.cumsum(X)/(np.arange(1000) + 1)
plot_results(np.arange(1000) + 1, avg, xlabel='$N$', ylabel='Proportion of heads')

#reset seed
np.random.seed(0)

Example 3. (Binomial) Generate a Bin(50, 0.2) random variable.


Solution: If X1 , X2 , . . . , Xn are drawn iid from a Bern(p) distribution, then we can
express a Bin(n, p) random variable as X = X1 + X2 + . . . + Xn . Therefore we can
utilize the code we have already written for drawing a Bern(p) random variable to draw
a Bin(n, p) random variable.
def draw_bin(n, p, N):
"""
A Bin(n, p) pseudo-RNG
"""
if N > 1:
U = np.random.uniform(0, 1, (N, n))
X = np.sum(U < p, axis = 1)

else:
U = np.random.uniform(0, 1, n)
X = np.sum(U < p)

return X
#print a few examples of the RGNs to the screen
n = 50
p = 0.2
print_vals(draw_bin, n, p, 1)

X_0 = 8
X_1 = 17
X_2 = 3
X_3 = 13
X_4 = 10
Note that we can directly sample from a Bin(n, p) distribution with Numpy’s binomial RNG with:
np.random.binomial(n, p).
Example 4. Write an algorithm to simulate the value of a random variable X such that:



0.35 for x=1


0.15 for x=2
PX (x) =


0.4 for x=3


0.1 for x = 4.
Solution: We can utilize the algorithm presented in the book which divides the unit
interval into 4 partitioned sets and uses a uniformly drawn random variable.
def draw_general_discrete(P, R_X, N):
"""
A pseudo-RNG for any arbitrary discrete PMF specified by R_X and
corresponding probabilities P
"""
F_X = np.cumsum([0] + P)

X_arr = []
U_arr = np.random.uniform(0, 1, size = N)
for U in U_arr:
X = R_X[np.sum(U > F_X)-1]

#take care of edge case where U = 0


if U == 0:
X = R_X[0]
X_arr.append(X)
if N == 1: X_arr = X_arr[0]

return X_arr

#print a few examples of the RGNs to the screen


P = [0.35, 0.15, 0.4, 0.1]
R_X = [1, 2, 3, 4]
print_vals(draw_general_discrete, P, R_X, 1)

X_0 = 2
X_1 = 4
X_2 = 3
X_3 = 3
X_4 = 4
Note that we can directly sample from a discrete PMF using Numpy’s multinomial RNG. A multi-
nomial distribution is the k dimensional analogue of a binomial distribution, where k > 2. The
multinomial distribution is a distribution over random vectors, X (of size k), where the entries
in the vectors can take on values from 0, 1, . . . n, subject to X1 + X2 + . . . + Xk = n, where Xi
represents the ith component of X.
If a binomial random variable represents the number of heads we flip out of n coin tosses (where
the probability of heads is p), then a multinomial random variable represents the number of times
we roll a 1, the number of times we roll a 2, . . ., the number of times we roll a k, when rolling
a k sided die n times. For each roll, the probability of rolling the ith face of the die is pi (where
∑k
i=1 pi = 1). We store the value for the number times we roll the i
th face of the die in X . To
i
denote a random vector drawn from a multinomial distribution, the notation, X ∼ M ult(n, p), is
typical, where p denotes the k dimensional vector with the ith component of p given by pi .
To directly sample from a discrete PMF with (ordered) range array R_X and associated prob-
ability array P we can use Numpy’s multinomial RNG function by setting n = 1 (one roll).
To sample one time we can use the code: X = R_X[np.argmax(np.random.multinomial(1,
pvals=P))], and to sample N times, we can use the code: X = [R_X[np.argmax(x)] for x in
np.random.multinomial(1, pvals=P, size=N)].
Additionally, to sample from an arbitrary discrete PMF, we can also use Numpy’s choice function,
which samples randomly from a specified list, where each entry in the list is sampled according to a
specified probability. To sample N values from an array R_X, with corresponding probability array
P, we can use the code: X = np.random.choice(R_X, size=N, replace=True, p=P). Make sure
to specify replace=True to sample with replacement.
Example 5. (Exponential) Generate an Exp(1) random variable.
Solution: Using the method of inverse transformation, as shown in the book, for a
strictly increasing CDF, F , the random variable X = F −1 (U ), where U ∼ U nif (0, 1),
has distribution X ∼ F . Therefore, it is not difficult to show that,

−(1/λ) ln(U) ∼ Exp(λ),

where the fact that 1 − U ∼ U nif (0, 1) has been used.


def draw_exp(lam, N):
"""
An Exp(lambda) pseudo-RNG using the method of inverse transformation
"""
U = np.random.uniform(0, 1, size = N)
if N == 1:
U = U[0]
X = (-1/lam)*np.log(U)

return X

#print a few examples of the RGNs to the screen


lam = 1
print_vals(draw_exp, lam, 1)

X_0 = 2.4838379957
X_1 = 0.593858616083
X_2 = 0.53703944167
X_3 = 0.0388069650697
X_4 = 1.23049637556
Note that we can directly sample from an Exp(λ) distribution with Numpy’s exponential RNG
with: np.random.exponential(lam).
Example 6. (Gamma) Generate a Gamma(20, 1) random variable.
Solution: If X1 , X2 , . . . , Xn are drawn iid from an Exp(λ) distribution, then Y = X1 +
X2 + . . . + Xn ∼ Gamma(n, λ). Therefore, to generate a Gamma(n, λ) random variable,
we need only to generate n independent Exp(λ) random variables and add them.
def draw_gamma(alpha, lam, N):
"""
A Gamma(n, lambda) pseudo-RNG using the method of inverse transformation
"""
n = alpha
if N > 1:
U = np.random.uniform(0, 1, size = (N, n))
X = np.sum((-1/lam)*np.log(U), axis = 1)

else:
U = np.random.uniform(0, 1, size = n)
X = np.sum((-1/lam)*np.log(U))

return X
#print a few examples of the RGNs to the screen
alpha = 20
lam = 1
print_vals(draw_gamma, alpha, lam, 1)

X_0 = 17.4925086879
X_1 = 20.6155480241
X_2 = 26.9115218192
X_3 = 22.3654600391
X_4 = 22.331744631
Note that we can directly sample from a Gamma(n, λ) distribution with Numpy’s gamma RNG
with: np.random.gamma(shape, scale).
Example 7. (Poisson) Generate a Poisson random variable. Hint: In this example, use the fact
that the number of events in the interval [0, t] has Poisson distribution when the elapsed times
between the events are Exponential.
Solution: As shown in the book, we need only to continuously generate Exp(λ) variables

and count the number of draws it takes for the sum to be greater than 1. The Poisson
random variable is then the count minus 1.
def draw_poiss(lam, N):
"""
A Poiss(lambda) pseudo-RNG
"""
X_list = []

for _ in range(N):
summ = 0
count = 0
while summ <= 1:
summ += draw_exp(lam, 1)
count += 1
X_list.append(count-1)

if N == 1:
return X_list[0]
else:
return X_list

#print a few examples of the RGNs to the screen


lam = 1
print_vals(draw_poiss, lam, 1)

X_0 = 0
X_1 = 2
X_2 = 2
X_3 = 1
X_4 = 2
Note that we can directly sample from a P oiss(λ) distributions with Numpy’s: np.random.poisson(lam)
function.
Example 8. (Box-Muller) Generate 5000 pairs of normal random variables and plot both his-
tograms.
Solution: Using the Box-Muller transformation as described in the book:
def draw_gaus_pairs(N):
"""
An N(0, 1) pseudo-RNG to draw N pairs of indepedent using the Box-Muller
transformation
"""
U1 = np.random.uniform(size = N)
U2 = np.random.uniform(size = N)

Z1 = np.sqrt(-2*np.log(U1))*np.cos(2*np.pi*U2)
Z2 = np.sqrt(-2*np.log(U1))*np.sin(2*np.pi*U2)

return (Z1, Z2)

#print a few examples of the RGNs to the screen


Z1_arr, Z2_arr = draw_gaus_pairs(5)

for i, (Z1, Z2) in enumerate(zip(Z1_arr, Z2_arr)):


print('(Z_1, Z_2)_' + str(i)+' = (', Z1, Z2, ')')

(Z_1, Z_2)_0 = ( 0.722134435205 -0.189448731182 )


(Z_1, Z_2)_1 = ( -0.918558147113 0.247330492682 )
(Z_1, Z_2)_2 = ( -1.42078058592 -0.914027516141 )
(Z_1, Z_2)_3 = ( 1.19799155228 -1.49105841693 )
(Z_1, Z_2)_4 = ( -0.65055423687 0.179187077215 )
In addition to plotting the histograms (plot in the first panel below) I also make a scatter plot of
the 2 Gaussian random variables. The Box-Muller method produces pairs of independent random
variables, and indeed, in the plot we see a bivariate Normal distribution with no correlation, i.e., it
is axis-aligned (recall that independence =⇒ ρ = 0). I further compute the correlation coefficient
between Z1 and Z2 and it is indeed very close to 0.
#plot the histograms and scatter plot

#set seed for reproducibility


np.random.seed(8)

#generate data
Z1_arr, Z2_arr = draw_gaus_pairs(5000)

#plot histograms
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

bins = np.linspace(-5, 5, 50)


ax1.hist(Z1_arr, bins, alpha=0.5, normed=1, label='$Z_1$', edgecolor = 'black')
ax1.hist(Z2_arr, bins, alpha=0.5, normed=1, label='$Z_2$', edgecolor = 'black')
ax1.legend(loc='upper right')
ax1.set_xlabel('$Z$', size = 15)
ax1.set_ylabel('Probability Density', size = 15)
ax1.tick_params(labelsize=15)

#plot scatter plot


ax2.scatter(Z1_arr, Z2_arr, s=2)
ax2.set_xlabel('$Z_1$', size = 15)
ax2.set_ylabel('$Z_2$', size = 15)
ax2.set_ylim((-4, 4))
ax2.set_xlim((-4, 4))
ax2.tick_params(labelsize=15)

print('correlation coefficient = ', np.corrcoef(Z1_arr, Z2_arr)[0, 1])

#reset seed

np.random.seed(0)

correlation coefficient = 0.0177349514518

Note that we can directly sample from a N (0, 1) distribution with Numpy’s normal RNG with:
np.random.randn(d0, d1, ..., dn), where d0, d1, …, dn are the dimensions of the desired output
array.
Exercise 1. Write Python programs to generate Geom(p) and P ascal(m, p) random variables.
Solution: As in the book, I generate Bern(p) random variables until the first success
and count the number of draws to generate a Geom(p) random variable. To generate
a P ascal(m, p) random variable, I generate Bern(p) random variables until I obtain m
successes and count the number of draws.
def draw_geom(p, N):
"""
A Geom(p) pseudo-RNG
"""
X_list = []
for _ in range(N):

count = 0
X = 0
while X == 0:
X = draw_bern(p, 1)
count += 1
X_list.append(count)

if N == 1:
return X_list[0]
else:
return X_list

#print a few examples of the RGNs to the screen


p = 0.2
print_vals(draw_geom, p, 1)

X_0 = 15
X_1 = 1
X_2 = 1
X_3 = 8
X_4 = 2
def draw_pascal(m, p, N):
"""
A Pascal(m, p) pseudo-RNG
"""
X_list = []
for _ in range(N):
count_succ = 0
count = 0
while count_succ < m:
X = draw_bern(p, 1)
count_succ += X
count += 1
X_list.append(count)

if N == 1:
return X_list[0]
else:
return X_list

#print a few examples of the RGNs to the screen


p = 0.2
m = 2
print_vals(draw_pascal, m, p, 1)

X_0 = 17
X_1 = 10
X_2 = 7
X_3 = 3
X_4 = 4
Note that we can directly sample from Geom(p) and P ascal(m, p) distributions with Numpy’s
np.random.geometric(p) and np.random.negative_binomial(n, p) functions respectively.
Exercise 2. (Poisson) Use the algorithm for generating discrete random variables to obtain a
Poisson random variable with parameter λ = 2.
Solution:
from scipy.misc import factorial

def draw_poiss2(lam, N):


"""
A Poiss(lambda) pseudo-RNG using the method to generate an
arbitrary discrete random variable
"""
X_list = []

for _ in range(N):
P = np.exp(-lam)
i = 0
U = np.random.uniform()
while U >= P:
i += 1
P += np.exp(-lam)*lam**i/(factorial(i)+0)

X_list.append(i)

if N == 1:
return X_list[0]
else:
return X_list

#print a few examples of the RGNs to the screen


lam = 2
print_vals(draw_poiss2, 2, 1)

X_0 = 3
X_1 = 0
X_2 = 5
X_3 = 2
X_4 = 5
Exercise 3. Explain how to generate a random variable with the density


f(x) = 2.5 x √x = 2.5 x^{3/2}

for 0 < x < 1.



Solution: The CDF is given by F_X(x) = 2.5 ∫_0^x x′^{3/2} dx′ = x^{5/2}, and therefore F_X^{−1}(x) =
x2/5 . Using the method of inverse transformation, if U ∼ U nif (0, 1), then FX−1 (U ) is
distributed according to the desired distribution.
def draw_dist3():
"""
A pseudo-RNG for the distribution in Exercise 3
"""
U = np.random.uniform()
return U**(0.4)
#print a few examples of the RGNs to the screen
print_vals(draw_dist3)

X_0 = 0.8178201131579468
X_1 = 0.8861754700680049
X_2 = 0.27369087549414306
X_3 = 0.6033871249144047
X_4 = 0.4285059109745954

Exercise 4. Use the inverse transformation method to generate a random variable having distri-
bution function

F_X(x) = (x² + x)/2,

for 0 ≤ x ≤ 1.
Solution: By inverting the CDF, we have that

F_X^{−1}(x) = −1/2 + √(1/4 + 2x),

for 0 ≤ x ≤ 1.
def draw_dist4():
"""
A pseudo-RNG for the distribution in Exercise 4
"""
U = np.random.uniform()
return -0.5 + np.sqrt(0.25 + 2*U)

#print a few examples of the RGNs to the screen


print_vals(draw_dist4)

X_0 = 0.417758353296
X_1 = 0.198180089883
X_2 = 0.441257859881
X_3 = 0.538521058539
X_4 = 0.115056902
Exercise 5. Let X have a standard Cauchy distribution, with distribution function

F_X(x) = (1/π) arctan x + 1/2.

Assuming you have U ∼ U nif (0, 1), explain how to generate X. Then, use this result to produce
1000 samples of X and compute the sample mean. Repeat the experiment 100 times. What do you
observe and why?
Solution: The inverse CDF is given by FX−1 (x) = tan[π(x − 1/2)].
def draw_stand_cauchy(N):
"""
A standard Cauchy pseudo-RNG using the method of inverse transformation
"""
U = np.random.uniform(size = N)
X = np.tan(np.pi*(U - 1/2))

if N == 1: return X[0]
else: return X

#print a few examples of the RGNs to the screen


print_vals(draw_stand_cauchy, 1)

X_0 = 0.691013110859
X_1 = 0.212342443875
X_2 = -0.907695727473
X_3 = 0.0731660554841
X_4 = -3.28946953204
#plot means for 100 trials

#set seed for reproducibility


np.random.seed(5)

#compute means and plot


means = [np.mean(np.array(draw_stand_cauchy(1000))) for _ in range(100)]

plot_results(range(100), means, xlabel='trial', ylabel='mean', \


title='Standard Cauchy mean')

#reset seed
np.random.seed(0)

We see that the means for each trial vary wildly. This is because the Cauchy distribution actually
has no mean.
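To make this concrete (a short supporting calculation of my own, not from the book), the contribution to the mean from the positive half-line alone already diverges:

$$\int_0^{\infty} x\,f_X(x)\,dx = \int_0^{\infty} \frac{x}{\pi(1+x^2)}\,dx = \lim_{b\to\infty}\frac{1}{2\pi}\ln(1+b^2) = \infty,$$

so $E[X]$ has the indeterminate form $\infty - \infty$ and is undefined. In fact, the average of n independent standard Cauchy random variables is itself standard Cauchy, so averaging more samples does not make the sample mean settle down.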
Exercise 6. (The Rejection Method) When we use the Inverse Transformation Method, we need
a simple form of the CDF, F (x), that allows direct computation of X = F −1 (U ). When F (x)
doesn’t have a simple form but the PDF, f (x), is available, random variables with density f (x)
can be generated by the rejection method. Suppose you have a method for generating a random
variable having density function g(x). Now, assume you want to generate a random variable having
density function f (x). Let c be a constant such that f (y)/g(y) ≤ c (for all y). Show that the
following method generates a random variable, X, with density function f (x).
1) initialize U and Y such that $U > \frac{f(Y)}{cg(Y)}$
repeat until $U \le \frac{f(Y)}{cg(Y)}$ {
    2) Generate Y having density g
    3) Generate a random number U from Unif(0, 1)
}
4) Set X = Y
Solution:
Firstly, as a technical matter, note that c ≥ 1, which can be shown by integrating both sides of
f (y) ≤ cg(y).
We see that this algorithm keeps iterating until it outputs a random variable Y, given that we know that $U \le \frac{f(Y)}{cg(Y)}$. Therefore, the goal is to show that the random variable $Y \mid U \le \frac{f(Y)}{cg(Y)}$ has PDF f(y) (or equivalently CDF F(y)). In other words, we must show that $P\left(Y \le y \,\middle|\, U \le \frac{f(Y)}{cg(Y)}\right) = F(y)$. I show this with Bayes' rule:

$$
P\left(Y \le y \,\middle|\, U \le \frac{f(Y)}{cg(Y)}\right)
= \frac{P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y \le y\right) P(Y \le y)}{P\left(U \le \frac{f(Y)}{cg(Y)}\right)}
= \frac{P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y \le y\right) G(y)}{P\left(U \le \frac{f(Y)}{cg(Y)}\right)}.
$$

Thus, we must calculate the quantities $P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y \le y\right)$ and $P\left(U \le \frac{f(Y)}{cg(Y)}\right)$.

As an intermediate step, note that

$$
\begin{aligned}
P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y = y\right) &= P\left(U \le \frac{f(y)}{cg(y)} \,\middle|\, Y = y\right) \\
&= P\left(U \le \frac{f(y)}{cg(y)}\right) \\
&= F_U\left(\frac{f(y)}{cg(y)}\right) \\
&= \frac{f(y)}{cg(y)},
\end{aligned}
$$

where in the second line I have used that U and Y are independent and in the fourth I have used the fact that for a uniform distribution $F_U(u) = u$. Notice that the requirement that $f(y)/g(y) \le c$ (for all y) is crucial at this step. This is because $f(y)/g(y) \le c \implies c > 0$ (since f(y) and g(y) are positive), so that $0 < f(y)/cg(y) \le 1$. If this condition did not hold, then the above expression would be $\min\left\{1, \frac{f(y)}{cg(y)}\right\}$ for positive c and 0 for negative c, which would interfere with the rest of the derivation.
I may now calculate $P\left(U \le \frac{f(Y)}{cg(Y)}\right)$:

$$
\begin{aligned}
P\left(U \le \frac{f(Y)}{cg(Y)}\right) &= \int_{-\infty}^{\infty} P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y = y\right) g(y)\,dy \\
&= \int_{-\infty}^{\infty} \frac{f(y)}{cg(y)}\, g(y)\,dy \\
&= \frac{1}{c}\int_{-\infty}^{\infty} f(y)\,dy \\
&= \frac{1}{c}.
\end{aligned}
$$

I now calculate the remaining quantity:

$$
\begin{aligned}
P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y \le y\right) &= \frac{P\left(U \le \frac{f(Y)}{cg(Y)},\, Y \le y\right)}{G(y)} \\
&= \frac{\int_{-\infty}^{\infty} P\left(U \le \frac{f(Y)}{cg(Y)},\, Y \le y \,\middle|\, Y = v\right) g(v)\,dv}{G(y)} \\
&= \frac{\int_{-\infty}^{\infty} P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y \le y,\, Y = v\right) P(Y \le y \,|\, Y = v)\, g(v)\,dv}{G(y)},
\end{aligned}
$$

where in the second line I have used the law of total probability, and in the third line I have used the definition of conditional probability. Note that:

$$
P(Y \le y \,|\, Y = v) = \begin{cases} 1 & \text{for } v \le y \\ 0 & \text{for } v > y, \end{cases}
$$

and thus

$$
\begin{aligned}
P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y \le y\right) &= \frac{\int_{-\infty}^{y} P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y \le y,\, Y = v\right) g(v)\,dv}{G(y)} \\
&= \frac{\int_{-\infty}^{y} P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y = v\right) g(v)\,dv}{G(y)} \\
&= \frac{\int_{-\infty}^{y} \frac{f(v)}{cg(v)}\, g(v)\,dv}{G(y)} \\
&= \frac{\frac{1}{c}F(y)}{G(y)},
\end{aligned}
$$

where in the second line I have used the fact that conditioning on Y = v already implies that Y ≤ y, since we only consider values of v less than or equal to y in the integration. In the third line I have used the expression for $P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y = y\right)$ that we derived above.

Inserting these quantities into Bayes' rule:

$$
P\left(Y \le y \,\middle|\, U \le \frac{f(Y)}{cg(Y)}\right)
= \frac{P\left(U \le \frac{f(Y)}{cg(Y)} \,\middle|\, Y \le y\right) G(y)}{P\left(U \le \frac{f(Y)}{cg(Y)}\right)}
= \frac{\frac{\frac{1}{c}F(y)}{G(y)}\, G(y)}{\frac{1}{c}}
= F(y),
$$

which is what we set out to prove.
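To connect the proof to code, here is a minimal generic sketch of the rejection method (my own helper, not from the book; the names rejection_sample and draw_g are hypothetical). The exercises below specialize this pattern to particular choices of f, g, and c.

import numpy as np

def rejection_sample(f, draw_g, g, c, N):
    """
    A generic rejection sampler: f is the target PDF, draw_g draws one
    sample from the proposal PDF g, and c satisfies f(y)/g(y) <= c for all y
    """
    X_list = []
    for _ in range(N):
        while True:
            Y = draw_g()                   # step 2: generate Y having density g
            U = np.random.uniform()        # step 3: generate U from Unif(0, 1)
            if U <= f(Y)/(c*g(Y)):         # accept with probability f(Y)/(c g(Y))
                break
        X_list.append(Y)                   # step 4: set X = Y
    return X_list[0] if N == 1 else X_list

For example, rejection_sample(lambda x: 20*x*(1-x)**3, np.random.uniform, lambda x: 1, 2.5, 5) draws from Beta(2, 4) in exactly the way Exercise 7 below does.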


Exercise 7. Use the rejection method to generate a random variable having density function
Beta(2, 4). Hint: Assume g(x) = 1 for 0 < x < 1.
Solution: I first visualize these distributions so we can get a handle on what we are
dealing with.
#plot Beta(2, 4) and Unif(0, 1)
from scipy.stats import beta

x1, x2 = np.linspace(0, 1, 1000), np.linspace(0, 1, 1000)
y1, y2 = beta.pdf(x1, 2, 4), x2*0+1
labels = ['Beta(2, 4)', 'Unif(0, 1)']

plot_results([x1, x2], [y1, y2], xlim=(0, 1), ylim=(0, 2.5), xlabel='$X$', \
             ylabel='PDF', labels=labels)

Since f (x)/g(x) (where f (x) is the PDF of the Beta and g(x) is the PDF of the uniform) needs to
be smaller than c for all x in the support of these distributions, a fine value of c to use would be
2.5 since it is evident from the plot that this value satisfies the requirement. The book uses the
smallest possible value of c, i.e., the max of the Beta(2, 4) distribution, which it derives analytically
and finds to be 135/64 ≈ 2.11. It is not necessary to use the smallest value of c, but will certainly
help the speed of the algorithm since the algorithm only stops when U ≤ f (Y )/cg(Y ). I will stick
with the value of 2.5 just to illustrate that the algorithm works for this value as well.
def draw_beta_2_4(N):
    """
    A Beta(2, 4) pseudo-RNG using the rejection method
    """
    c = 2.5

    X_list = []
    for _ in range(N):

        U = 1
        f_Y = 0
        g_Y = 1

        while U > f_Y/(c*g_Y):
            Y = np.random.uniform()
            U = np.random.uniform()

            f_Y = 20*Y*(1-Y)**3
            g_Y = 1

        X_list.append(Y)

    if N == 1:
        return X_list[0]
    else:
        return X_list

#print a few examples of the RNGs to the screen

print_vals(draw_beta_2_4, 1)

X_0 = 0.4236547993389047
X_1 = 0.07103605819788694
X_2 = 0.11827442586893322
X_3 = 0.5218483217500717
X_4 = 0.26455561210462697
Note that we can directly sample from a Beta(α, β) distribution with Numpy’s beta RNG with:
np.random.beta(a, b).
Exercise 8. Use the rejection method to generate a random variable having the Gamma(5/2, 1)
density function. Hint: Assume g(x) is the PDF of the Gamma(1, 2/5).
Solution: Note that there is a mistake in the phrasing of the question in the book: the PDF for g(x) should be Gamma(1, 2/5), not Gamma(5/2, 1). Also note that we cannot use the method that we used in Example 6, since in this case α is not an integer (however, we can use that method to draw from g(x)). I first visualize these distributions so we can get a handle on what we are dealing with.
187

#plot Gamma(5/2, 1) and Gamma(1, 2/5)

x1, x2 = np.linspace(0, 20, 1000), np.linspace(0, 20, 1000)
f, g = (4/(3*np.sqrt(np.pi)))*(x1**1.5)*np.exp(-x1), 0.4*np.exp(-0.4*x2)
labels = ['Gamma(5/2, 1)', 'Gamma(1, 2/5)']

plot_results([x1, x2], [f, g], xlim=(0, 15), ylim=(0, 0.4), xlabel='$X$', \
             ylabel='PDF', labels=labels)

The max{f(x)/g(x)} for x > 0 is approximately given by:

np.max(f/g)

1.6587150033103788

As a sanity check, this value is very close to the analytically derived value in the book, which is $\frac{10}{3\sqrt{\pi}}\left(\frac{5}{2}\right)^{3/2} e^{-3/2} \approx 1.6587162$. Therefore, I set the value of c to be 1.7, and use the function I wrote in Example 6, draw_gamma(alpha, lam, N), to draw from g(x).
def draw_gamma_2(alpha, lam, N):
    """
    A Gamma(5/2, 1) pseudo-RNG using the rejection method
    """
    c = 1.7

    X_list = []
    for _ in range(N):

        U = 1
        f_Y = 0
        g_Y = 1

        while U > f_Y/(c*g_Y):
            Y = draw_gamma(1, 0.4, 1)
            U = np.random.uniform()

            f_Y = (4/(3*np.sqrt(np.pi)))*(Y**1.5)*np.exp(-Y)
            g_Y = 0.4*np.exp(-0.4*Y)

        X_list.append(Y)

    if N == 1:
        return X_list[0]
    else:
        return X_list

#print a few examples of the RNGs to the screen

print_vals(draw_gamma_2, 5/2, 1, 1)

X_0 = 1.96233211971
X_1 = 1.22716649756
X_2 = 2.55754781375
X_3 = 0.900161721137
X_4 = 3.89706921546
Exercise 9. Use the rejection method to generate a standard normal random variable. Hint:
Assume g(x) is the PDF of the exponential distribution with λ = 1.
Solution: As in the book, to solve this problem, I use the rejection method to sample from a half Gaussian:

$$f(x) = \frac{2}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}},$$

with range (0, ∞), with an Exp(1) distribution for g(x). The book analytically computes max{f(x)/g(x)} to be $\sqrt{2e/\pi} \approx 1.32$, and I thus use c = 1.4. Once the algorithm is able to sample from the half Gaussian, to turn this distribution into a full Gaussian with range $\mathbb{R}$, one need only randomly multiply by −1. I therefore sample Q (∈ {0, 1}) from a Bern(0.5) distribution and multiply by 1 − 2Q (∈ {−1, 1}) in order to sample from the full Gaussian.
def draw_standard_normal(N):
    """
    A standard normal pseudo-RNG using the rejection method
    """
    c = 1.4

    X_list = []
    for _ in range(N):

        U = 1
        f_Y = 0
        g_Y = 1

        while U > f_Y/(c*g_Y):
            Y = draw_exp(1, 1)
            U = np.random.uniform()

            f_Y = (2/np.sqrt(2*np.pi))*np.exp(-(Y**2)/2)
            g_Y = np.exp(-Y)

        # draw Bern(0.5) random variable for the sign
        Q = draw_bern(0.5, 1)

        X_list.append(Y*(1-2*Q))

    if N == 1:
        return X_list[0]
    else:
        return X_list

#print a few examples of the RNGs to the screen

print_vals(draw_standard_normal, 1)

X_0 = 1.1538237197
X_1 = -2.28234324111
X_2 = -0.426012274543
X_3 = -1.40884434358
X_4 = -0.421092193245
Exercise 10. Use the rejection method to generate a Gamma(2, 1) random variable conditional on
its value being greater than 5. Hint: Assume g(x) is the density function of an exponential distribution.
Solution: As in the book, I use an Exp(0.5) conditioned on X > 5 as the distribution for g(x). It is not difficult to show, by integrating the PDF of this distribution, that $G^{-1}(x) = 5 - 2\ln(1-x)$ (where G is the CDF). I therefore use the method of inverse transformation to first draw a random variable Y from this distribution. Note that for $U \sim Unif(0, 1)$, $1 - U \sim Unif(0, 1)$, and therefore the formula for $G^{-1}(U)$ can be simplified to $5 - 2\ln(U)$. I then use the rejection method to sample from the desired distribution. By maximizing f(x)/g(x), the book shows that c must be greater than 5/3, and I therefore use c = 1.7.
def draw_gamma_2_1_cond_5(N):
    """
    A Gamma(2, 1) conditional on X>5 pseudo-RNG using the rejection method
    """
    c = 1.7

    X_list = []
    for _ in range(N):

        U = 1
        f_Y = 0
        g_Y = 1

        while U > f_Y/(c*g_Y):
            Y = 5 - 2*np.log(np.random.uniform())
            U = np.random.uniform()

            f_Y = Y*np.exp(5-Y)/6
            g_Y = np.exp((5-Y)/2)/2

        X_list.append(Y)

    if N == 1:
        return X_list[0]
    else:
        return X_list

#print a few examples of the RNGs to the screen

print_vals(draw_gamma_2_1_cond_5, 1)

X_0 = 6.76250850879
X_1 = 5.73497460514
X_2 = 5.14665551227
X_3 = 5.8087003199
X_4 = 5.66723645483
Notice that, as required, the random variables are all > 5.
As a final check to close this chapter, I draw samples from most of the RNG functions that I implemented above, compute the corresponding PMFs/PDFs, and compare to the theoretical distributions. I first check the discrete distributions, and I start by writing a function that will compute the empirical PMFs. Note that the phrase "empirical PMF" (and "empirical PDF") is standard terminology to refer to the probability distribution associated with a sample of data. Formally, for a collection of data, $\{x_i\}_{i=1}^N$, they are given by

$$P_X(x) = \frac{1}{N}\sum_{i=1}^N I(x = x_i)$$

for the empirical PMF, and by

$$f_X(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - x_i)$$

for the empirical PDF (where $I(\cdot)$ is the indicator function, and $\delta(\cdot)$ is the Dirac delta function).
def compute_PMFs(counts, xrange):
    """
    Compute empirical PMFs from a specified array of random variables,
    and a specified range
    """
    count_arr = []
    xrange2 = range(np.max([np.max(xrange), np.max(counts)])+1)
    for i in xrange2:
        count_arr.append(np.sum(counts==i))
    pmf = np.array(count_arr)/np.sum(np.array(count_arr))
    return pmf[np.min(xrange):np.max(xrange)+1]

I now compute the theoretical distributions, generate the data and compute the empirical distributions.
from scipy.stats import bernoulli, binom, poisson, geom, nbinom
#set seed for reproducibility
np.random.seed(1984)

x_ranges = [range(2), range(26), range(9), range(1, 11), range(4, 26), range(9)]

#compute PMF arrays for the theoretical distributions
numpy_dists = [bernoulli, binom, poisson, geom, nbinom, poisson]
numpy_args = [[0.5], [50, 0.2], [1], [0.5], [4, 0.5, 4], [1]]
numpy_y = [np_dist.pmf(xrange, *np_args) for np_dist, xrange, np_args in \
           zip(numpy_dists, x_ranges, numpy_args)]

N = 1000 #number of points to sample

# draw random variables from my functions and compute corresponding PMFs
my_rngs = [draw_bern, draw_bin, draw_poiss, draw_geom, draw_pascal, draw_poiss2]
my_args = [[0.5, N], [50, 0.2, N], [1, N], [0.5, N], [4, 0.5, N], [1, N]]
my_counts = [rng(*args) for rng, args in zip(my_rngs, my_args)]
my_y = [compute_PMFs(np.array(counts), xrange) for counts, xrange in \
        zip(my_counts, x_ranges)]

Finally, I plot the results.


#plot theoretical and empirical PMFs
names = ['Bern(0.5)', 'Bin(50, 0.2)', 'Poiss(1)', 'Geom(0.5)', 'Pascal(4, 0.5)', \
         'Poiss(1) (discrete RV method)']
legend_loc = ['upper right']*6
legend_loc[0] = 'upper center'

f, [[ax1, ax2, ax3], [ax4, ax5, ax6]] = plt.subplots(2, 3, figsize=(15, 10))
ax_arr = [ax1, ax2, ax3, ax4, ax5, ax6]

for i, ax in enumerate(ax_arr):
    ax.plot(x_ranges[i], numpy_y[i], 'bo', ms=8, label='Theoretical Dist.', \
            alpha=.8)
    ax.vlines(x_ranges[i], 0, numpy_y[i], colors='b', lw=5, alpha=0.5)
    ax.plot(x_ranges[i], my_y[i], 'o', ms=8, label='Empirical Dist.', \
            color='green', alpha=.8)

    ax.legend(loc=legend_loc[i], fontsize=11)
    ax.set_title(names[i], size = 15)
    ax.set_ylim(ymin=0)
    ax.tick_params(labelsize=15)

    if i in [3, 4, 5]:
        ax.set_xlabel('$X$', size = 15)
    if i in [0, 3]:
        ax.set_ylabel('PMF', size = 15)

We see that the empirical distributions match almost perfectly with the theoretical distributions,
with even better correspondence for larger N .
I now check some of the continuous RNG functions that I implemented in this chapter. I first start
by computing the theoretical distributions and generating the data.
from scipy.stats import expon, gamma, cauchy, beta, norm

x_ranges = [np.linspace(0, 8, 1000), np.linspace(0, 50, 1000), \
            np.linspace(-20, 20, 1000), np.linspace(0, 1, 1000), \
            np.linspace(0, 15, 1000), np.linspace(-5, 5, 1000)]

#compute PDF arrays for the theoretical distributions
numpy_dists = [expon, gamma, cauchy, beta, gamma, norm]
numpy_args = [[0, 1], [20, 0, 1], [], [2, 4], [5/2, 0, 1], []]
numpy_y = [np_dist.pdf(xrange, *np_args) for np_dist, xrange, np_args in \
           zip(numpy_dists, x_ranges, numpy_args)]

N = 1000 #number of points to sample

# draw random variables from my functions to be plotted as histograms in next cell
my_rngs = [draw_exp, draw_gamma, draw_stand_cauchy, draw_beta_2_4, draw_gamma_2, \
           draw_standard_normal]
my_args = [[1, N], [20, 1, N], [N], [N], [5/2, 1, N], [N]]
my_rvs = [rng(*args) for rng, args in zip(my_rngs, my_args)]

#reset seed
np.random.seed(0)

I now plot normalized histograms of the data and compare to the theoretical distributions. Again,
we see almost perfect correspondence between the empirical and theoretical distributions. The
correspondence becomes even better with larger values of N .
#plot theoretical and empirical PDFs
names = ['Exp(1) (inverse trans.)', 'Gamma(20, 1) (inverse trans.)', \
         'Cauchy(0, 1) (inverse trans.)', 'Beta(2, 4) (rejection)', \
         'Gamma(5/2, 1) (rejection)', 'N(0, 1) (rejection)']
bin_arr = [50, 35, 60, 45, 45, 35]
xlims = [(0, 8), (0, 50), (-20, 20), (0, 1), (0, 15), (-5, 5)]
range_arr = [None]*6
range_arr[2] = (-20, 20)

f, [[ax1, ax2, ax3], [ax4, ax5, ax6]] = plt.subplots(2, 3, figsize=(15, 10))
ax_arr = [ax1, ax2, ax3, ax4, ax5, ax6]

for i, ax in enumerate(ax_arr):
    ax.plot(x_ranges[i], numpy_y[i], label='Theoretical Dist.', color='black', \
            linewidth=3, alpha=.7)
    # density=True normalizes the histogram (this was normed=True in older matplotlib)
    ax.hist(my_rvs[i], bins=bin_arr[i], alpha=.5, edgecolor='black', density=True, \
            label='Empirical Dist.', range=range_arr[i])

    ax.set_title(names[i], size = 15)
    ax.legend(loc='upper right', fontsize=10)
    ax.set_xlim(xlims[i])
    ax.tick_params(labelsize=15)

    if i in [3, 4, 5]:
        ax.set_xlabel('$X$', size = 15)
    if i in [0, 3]:
        ax.set_ylabel('PDF', size = 15)
Chapter 11

Recursive Methods


Problem 1.

(a) The characteristic polynomial for this recursive formula is

$$x^2 - 2x + \frac{3}{4} = 0,$$

which has roots 1/2 and 3/2, and therefore:

$$a_n = \alpha\left(\frac{3}{2}\right)^n + \beta\left(\frac{1}{2}\right)^n.$$

Using the initial conditions $a_0 = 0$ and $a_1 = -1$ leads to $\alpha = -1$ and $\beta = 1$. Thus, the


solution to the recurrence equation is:

$$a_n = -\left(\frac{3}{2}\right)^n + \left(\frac{1}{2}\right)^n.$$

(b) The characteristic polynomial for this recursive formula is

$$x^2 - 4x + 4 = 0,$$

which can be factored into $(x - 2)^2 = 0$. The polynomial thus has one root, x = 2, with a multiplicity of 2, and therefore:

$$a_n = \alpha\, 2^n + \beta\, n 2^n.$$

Using the initial conditions $a_0 = 2$ and $a_1 = 6$ leads to $\alpha = 2$ and $\beta = 1$. Thus, the solution to the recurrence equation is:

$$a_n = 2^{n+1} + n 2^n.$$
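As a quick numerical sanity check of my own (not part of the original solution), the characteristic polynomials above correspond to the recurrences $a_n = 2a_{n-1} - \frac{3}{4}a_{n-2}$ and $a_n = 4a_{n-1} - 4a_{n-2}$, and any solution built from their roots must satisfy them; the closed forms can be verified directly:

# check (my own sketch) that the closed forms satisfy the recurrences implied
# by the characteristic polynomials above
a = lambda n: -(3/2)**n + (1/2)**n    # closed form from part (a)
b = lambda n: 2**(n + 1) + n*2**n     # closed form from part (b)

for n in range(2, 15):
    assert abs(a(n) - (2*a(n - 1) - 0.75*a(n - 2))) < 1e-9
    assert abs(b(n) - (4*b(n - 1) - 4*b(n - 2))) < 1e-9
print("both closed forms satisfy their recurrences")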

Problem 2.

(a) Let An,k be the event of observing exactly k heads out of n coin tosses, and let H denote the
event that the last coin toss is a heads. By conditioning on the last coin toss I obtain:

$$P(A_{n,k}) = P(A_{n,k} \mid H)\,p + P(A_{n,k} \mid H^c)(1-p) = P(A_{n-1,k-1})\,p + P(A_{n-1,k})(1-p),$$

where the equality follows because if the last coin toss is heads, then we need exactly k − 1 heads from the first n − 1 tosses, and if the last coin toss is tails, then we need exactly k heads from the first n − 1 tosses. Converting this to the notation used in the problem:

$$a_{n,k} = a_{n-1,k-1}\,p + a_{n-1,k}(1-p) \;\Longrightarrow\; a_{n+1,k+1} = a_{n,k}\,p + a_{n,k+1}(1-p).$$

(b) We recognize that this is precisely a binomial experiment, so the probability associated with exactly k heads out of n is given by $\binom{n}{k} p^k (1-p)^{n-k}$, and therefore, using the equation above,

$$\binom{n+1}{k+1} p^{k+1} (1-p)^{(n+1)-(k+1)} = p\binom{n}{k} p^k (1-p)^{n-k} + (1-p)\binom{n}{k+1} p^{k+1} (1-p)^{n-(k+1)},$$

which, when simplified, results in:

$$\binom{n+1}{k+1} = \binom{n}{k} + \binom{n}{k+1}.$$
We need the restriction that 0 ≤ k < n to hold for this equation to be true, since the original
recursion relation does not hold if k = n (in that case the last flip cannot be a tails since we
need all flips to be heads, so that the P (An,k |H c ) term should be 0).
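This is just Pascal's rule. As a quick numerical check of my own (not part of the original solution), it can be verified over a range of n and k satisfying this restriction:

from scipy.special import comb

# verify binom(n+1, k+1) == binom(n, k) + binom(n, k+1) for 0 <= k < n
for n in range(1, 20):
    for k in range(n):
        assert comb(n + 1, k + 1, exact=True) == \
               comb(n, k, exact=True) + comb(n, k + 1, exact=True)
print("Pascal's rule holds over the tested range")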

Problem 3. Let A be the desired event and let q = 1 − p be the probability of tails. To solve this
problem, I first condition on whether the first toss is a heads or tails:

P (A) = P (A|H)p + P (A|T )q.

To help solve for P (A|H), I now condition on the second toss:

$$P(A \mid H) = P(A \mid HH)\,p + P(A \mid HT)\,q = 1 \cdot p + P(A \mid T)\,q,$$

where P (A|HH) = 1 since if you flip 2 consecutive heads, the experiment is done and where
P (A|HT ) = P (A|T ), since the first heads does not matter because we are interested in 2 consecutive
heads, and 1 isolated heads does not get us any closer to the event A.
To help solve for P (A|T ), I also condition on the second toss:

$$P(A \mid T) = P(A \mid TH)\,p + P(A \mid TT)\,q = P(A \mid H)\,p + 0 \cdot q,$$

where P (A|T T ) = 0 since if you flip 2 consecutive tails, the experiment is done (and the desired event
did not occur) and where P (A|T H) = P (A|H), for essentially the same reason that P (A|HT ) =
P (A|T ) as described above.
I now re-express these 3 equations in slightly more readable notation:

$$\begin{cases} a = a_H\, p + a_T\, q \\ a_H = p + a_T\, q \\ a_T = a_H\, p, \end{cases}$$

and we see that we have a system of 3 equations with 3 unknowns. Solving for a and plugging q = 1 − p back in I find that:

$$a = \frac{p^2(2-p)}{1 - p(1-p)}.$$
As a check, we know that in the limit that p goes to 1, this expression should evaluate to unity (we definitely get HH before TT) and in the limit that p goes to 0 this expression should evaluate to 0 (we definitely get TT before HH). Indeed, it is easy to check that this expression satisfies these 2 limits.
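As an additional check of my own (in the spirit of the simulation chapter), a short Monte Carlo experiment agrees with this formula; the function name prob_HH_before_TT is hypothetical:

import numpy as np

def prob_HH_before_TT(p, n_trials=100000):
    """Monte Carlo estimate of the probability that two consecutive heads
    appear before two consecutive tails, for a coin with P(heads) = p"""
    wins = 0
    for _ in range(n_trials):
        prev = None
        while True:
            flip = 'H' if np.random.uniform() < p else 'T'
            if flip == prev:
                wins += (flip == 'H')
                break
            prev = flip
    return wins/n_trials

p = 0.6
print(prob_HH_before_TT(p), p**2*(2 - p)/(1 - p*(1 - p)))  # both approximately 0.663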

Problem 4. Let An be the event that the number of heads out of n tosses is divisible by 3, p be
the probability of heads and q = 1 − p be the probability of tails. To solve this problem recursively,
I first condition on whether the first toss is a heads or tails:

$$P(A_n) = P(A_n \mid H)\,p + P(A_n \mid T)\,q = P(A_n \mid H)\,p + P(A_{n-1})\,q.$$

Here, P (An |T ) = P (An−1 ) since if the first toss is a tails, as in the sequence below, we have observed
no heads,

T ...
n n−1 n−2 1

so that the experiment just starts over at 1 less total flips (n − 1), and is equivalent to the sequence
below:

... .
n−1 n−2 1

The numbers below the sequence n, n − 1, n − 2, . . . show the number of flips remaining before you
make that particular flip.
To solve for P (An |H), I condition on the second toss:

$$P(A_n \mid H) = P(A_n \mid HH)\,p + P(A_n \mid HT)\,q = P(A_n \mid HH)\,p + P(A_{n-1} \mid H)\,q.$$

Here, P (An |HT ) = P (An−1 |H) since the probability of An for the sequence below,

H T ...
n n−1 n−2 1

is the same as the probability of An for a sequence that starts with 1 heads with n − 1 flips:

H ... .
n−1 n−2 1

Finally, to solve for P (An |HH), I condition on the third toss:

$$P(A_n \mid HH) = P(A_n \mid HHH)\,p + P(A_n \mid HHT)\,q = P(A_{n-3})\,p + P(A_{n-1} \mid HH)\,q.$$

Here, P (An |HHT ) = P (An−1 |HH) since the probability of An for the sequence below,

H H T ...
n n−1 n−2 n−3 1

is the same as the probability of An for a sequence that starts with 2 heads with n − 1 flips:

H H ... .
n−1 n−2 n−3 1

Also, P (An |HHH) = P (An−3 ) since, if we have already gotten 3 heads in the first 3 flips, then the probability that the total number of heads in the sequence is divisible by 3 is the same as the probability that the number of heads in the remaining n − 3 flips is divisible by 3.
I summarize this set of recursive equations in somewhat more readable notation:

$$\begin{cases} a_n^{HH} = a_{n-3}\, p + a_{n-1}^{HH}\, q \\ a_n^{H} = a_n^{HH}\, p + a_{n-1}^{H}\, q \\ a_n = a_n^{H}\, p + a_{n-1}\, q. \end{cases}$$

We see that the equations are a coupled set of recursive equations, so they must be solved simultaneously by first solving for $a_n^{HH}$, then using this to solve for $a_n^{H}$, then using this to solve for $a_n$, iteratively until we reach the desired value of n. In order to do this, we will need several initial conditions which can easily be computed by hand. For a sequence with n = 1, the number of heads is divisible by 3 if we throw one tails (probability q). For a sequence with n = 2, the number of heads is divisible by 3 if we throw two tails (probability $q^2$). For a sequence with n = 3, the number of heads is divisible by 3 if we throw 3 tails or 3 heads (probability $p^3 + q^3$):

$$a_1 = q, \qquad a_2 = q^2, \qquad a_3 = p^3 + q^3.$$

For a sequence that starts with 1 head and with n = 1, the number of heads is never divisible by
3 (probability 0). For a sequence that starts with 1 head and with n = 2, the number of heads is
never divisible by 3 (probability 0). For a sequence that starts with 1 head and with n = 3, the
number of heads is only divisible by 3 if we throw 2 heads after the first (probability $p^2$):

$$a_1^{H} = 0, \qquad a_2^{H} = 0, \qquad a_3^{H} = p^2.$$

Finally, for a sequence that starts with 2 heads and with n = 1, the number of heads is never divisible by 3 (probability 0). For a sequence that starts with 2 heads and with n = 2, the number of heads is never divisible by 3 (probability 0). For a sequence that starts with 2 heads and with n = 3, the number of heads is only divisible by 3 if we throw 1 head after the first 2 (probability p):

$$a_1^{HH} = 0, \qquad a_2^{HH} = 0, \qquad a_3^{HH} = p.$$

I can check this coupled set of recursive equations by recognizing that we can compute P (An ) directly using the binomial distribution and only summing over the number of successes which are divisible by 3. This can be written as:

$$P(A_n) = \sum_{k=0}^{\lfloor n/3 \rfloor} \binom{n}{3k}\, p^{3k} q^{n-3k}.$$

I wrote a python function (below) to compute P (An ) using both methods. I compute P (An ) for a
range of n, for several values of p, and plot P (An ) calculated recursively against P (An ) calculated
with the binomial distribution as well as the 45 degree line in Fig. 11.1. If there is perfect agreement
between the 2 methods, the points should lie along this line, and indeed this is exactly what we
see.
Figure 11.1: Comparison of P (An ) calculated recursively against P (An ) calculated with the binomial distribution, for p = 0.2, 0.5, and 0.8 (Problem 4).

import numpy as np
from scipy.special import binom

def compute_binom_recur(p, N):
    """
    Compute probability that the number of heads out of N
    coin flips (each of probability p) is divisible by 3.
    Returns the probability computed from the binomial
    distribution and the probability computed from recursion.
    """
    #compute the probability from the binomial
    P_arr = []
    for n in range(1, N):
        P = np.sum(np.array([binom(n, 3*k)*p**(3*k)*(1-p)**(n-3*k) \
                             for k in range(0, int(np.floor(n/3)+1))]))
        P_arr.append(P)

    #compute the probability from recursion
    q = 1-p

    #initialize recursion
    an_HH = [0, 0, p]
    an_H = [0, 0, p**2]
    an = [q, q**2, p**3+q**3]

    for _ in range(N-4):
        #recursion update equations
        an_HH_new = an[-3]*p + an_HH[-1]*q
        an_H_new = an_HH_new*p + an_H[-1]*q
        an_new = an_H_new*p + an[-1]*q

        an_HH.append(an_HH_new)
        an_H.append(an_H_new)
        an.append(an_new)

    return (P_arr, an)
