Solutions To Probability Book
Chapter 1
Basic Concepts
Problem 1.
(a)
A ∪ B = {1, 2, 3} ∪ {2, 3, 4, 5, 6, 7} = {1, 2, 3, 4, 5, 6, 7}
(b)
(c)
(d) No, A, B and C do not partition S, since 2 and 3 belong to both A and B, and 7 belongs to both B and C.
Problem 2.
(a)
[6, 8] ∪ [2, 7) = [2, 8]
(b)
[6, 8] ∩ [2, 7) = [6, 7)
(c)
[0, 1]c = (−∞, 0) ∪ (1, ∞)
(d)
[6, 8] − (2, 7) = [7, 8]
Problem 3.
(a)
(A ∪ B) − (A ∩ B) = (A ∪ B) ∩ (A ∩ B)^c = (A ∪ B) ∩ (A^c ∪ B^c),
(b)
B − C = B ∩ Cc
(c)
(A ∩ C) ∪ (A ∩ B)
(d)
(C − A − B) ∪ ((A ∩ B) − C)
Problem 4.
(a)
A = {(H, H), (H, T )}
(b)
B = {(H, T ), (T, H), (T, T )}
(c)
C = {(H, T ), (T, H)}
Problem 5.
(a) |A2 | is half of the numbers from 1 to 100, so |A2 | = 50. To solve for |A3 | note that there are 2
numbers between each pair of elements in A3 where A3 is assumed to be pre-sorted (e.g., 4, 5
are between 3 and 6). There are also |A3 |−1 of these pairs, and thus |A3 |+2(|A3 |−1)+3 = 100,
where I have added 3 to account for the numbers at the beginning and end of the sequence
which are not divisible by 3 (1, 2 and 100). Thus, I find that |A3 | = 33. |A4 | is exactly half
of |A2 |, and thus |A4 | = 25. Finally, to solve for |A5 | we may use the same method we used
to solve for |A3 |: |A5 | + 4(|A5 | − 1) + 4 = 100, from which we find that |A5 | = 20.
(b) By inclusion-exclusion:
Note that |A2 ∩ A3 | = |A6 | = 16, |A2 ∩ A5 | = |A10 | = 10, |A3 ∩ A5 | = |A15 | = 6, where |A10 |
and |A15 | were found by counting (since there are very few elements in these sets), and |A6 |
was found by the same method I used to compute |A3 |. Lastly, the intersection of all 3 sets
is given by the set of multiples of 30, so that |A2 ∩ A3 ∩ A5 | = |{30, 60, 90}| = 3. Therefore:
|A2 ∪ A3 ∪ A5 | = 50 + 33 + 20 − 16 − 10 − 6 + 3 = 74.
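As an optional sanity check (a small Python sketch added here, not part of the original solution), the individual counts and the union can be enumerated directly:

nums = range(1, 101)
A2 = {n for n in nums if n % 2 == 0}
A3 = {n for n in nums if n % 3 == 0}
A5 = {n for n in nums if n % 5 == 0}
print(len(A2), len(A3), len(A5))   # 50 33 20
print(len(A2 | A3 | A5))           # 74, matching the inclusion-exclusion count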
Problem 7.
(a) A is a subset of a countable set, N, and is thus countable.
(b) As shown in the book, if we can write any set, S in the form:
S = ⋃_{i∈B} ⋃_{j∈C} {q_ij},     (1.1)
where B and C are countable sets, then S is a countable set. It is easy to see that we may
re-write B as:
B = ⋃_{i∈Q} ⋃_{j∈Q} {a_i + b_j √2},     (1.2)

where q_ij ≡ a_i + b_j √2, and thus B is countable.
(c) C is uncountable. One way to prove this is to note that for all x ∈ [0, 1], (x, 0) ∈ C, so that C contains the uncountable set [0, 1] × {0}, i.e., C is a superset of an uncountable set and is thus uncountable.
Problem 8. I first prove that An ⊂ An+1 (a proper subset) for all n = 1, 2, . . .. To do this, it
suffices to prove that (n − 1)/n < n/(n + 1), which I do with proof by contradiction. By assuming
(n − 1)/n ≥ n/(n + 1), after a little algebra, one concludes that −1 ≥ 0, which is clearly a
contradiction, and therefore (n − 1)/n < n/(n + 1). Thus the union of all the A_n's is given by the largest set in the sequence, which is A_∞ = lim_{n→∞} [0, (n − 1)/n). After applying L'Hôpital's rule, one can show that A_∞ = [0, 1), and thus:

A = ⋃_{n=1}^{∞} A_n = [0, 1).
Problem 9. As with the previous problem, one may show that An+1 ⊂ An for all n = 1, 2, . . . by
proving that 1/(n + 1) < 1/n. This is somewhat obvious, but if you really want to be formal, you
can prove it with a proof by contradiction. Therefore, the intersection of all the An s is given by
the smallest set, A_∞ = lim_{n→∞} [0, 1/n) = {0} (0 belongs to every A_n, while no positive number does), and thus:

A = ⋂_{n=1}^{∞} A_n = {0}.
Problem 10.
(a) To motivate the bijection (the one-to-one mapping between 2N and C) we are about to
construct, note that for every set in 2N , a natural number n will either appear once, or not at
all. Therefore, it is convenient to indicate its presence in the set with a 1 and its absence with
a 0. For example {1, 3, 6} will get mapped to the sequence 101001000 . . . (this is implicitly
assuming that we have pre-ordered the elements in the particular set from 2N ). In general,
the bijective mapping we use, f : 2^N → C, is given by f(x) = (1{1 ∈ x}, 1{2 ∈ x}, 1{3 ∈ x}, . . .),
where 1{·} is the so-called indicator function, which is 1 if its argument evaluates to true and
0 otherwise. To prove that this mapping is bijective, we must prove it is both injective and
surjective.
To prove it is injective, I use a proof by contradiction. Assume it is not injective. Under this assumption there exist x, x′ ∈ 2^N with x ≠ x′ such that f(x) = f(x′). x and x′ can either have the same cardinality, or they can be different. Without loss of generality, if they are different, let us call x the one with the larger cardinality. Since x ≠ x′ there exists at least one natural number n in x which is not in x′. Therefore the sequences f(x) and f(x′) differ in at least one position, namely position n, and therefore f(x) ≠ f(x′), which contradicts our assumption.
The proof of surjectivity is also straightforward.
(b) Any number x ∈ [0, 1) has a unique binary expansion x = b_1/2 + b_2/2^2 + . . ., and therefore we can construct a bijective mapping between [0, 1) and C by taking the sequence of binary digits b_1 b_2 b_3 . . . (that is, dropping the leading "0." of the expansion). Since there
is a bijection between 2N and C and a bijection between C and [0, 1) (and given the fact that
the composition of 2 bijections is a bijection) there is thus a bijection between 2N and [0, 1).
Assuming (correctly so) that the interval [0, 1) is uncountable, then so too is 2N .
Problem 11. As shown in the previous problem, there is a bijection between [0, 1) and C. There-
fore, if C is uncountable, then so too is [0, 1). We can use what is known as Cantor’s diagonal
argument to prove that C is uncountable.
Let us try to search for a bijective mapping between C and N. Suppose, for example, that the
first few mappings are given by:
1 → 0000000 . . .
2 → 1111111 . . .
3 → 0101010 . . .
4 → 1010101 . . .
5 → 1101011 . . .
6 → 0011011 . . .
7 → 1000100 . . .
..
.
Let us now construct a new sequence, s ∈ C by enumerating the complement of the elements
along the diagonal of the mapping (which I have highlighted in boldface above), s = 1011101 . . ..
By construction, s differs from every proposed mapping since the nth digit in s is different than
the nth digits in all of the mappings. Thus, no natural number gets mapped to s, and hence
the proposed mapping is not surjective. The mappings I chose for illustration in this example for
1, . . . , 7 were arbitrary, and this argument applies to any potential mapping. Therefore, there is
no bijective mapping between N and C, and hence no bijection between [0, 1) and N. Thus, the
interval [0, 1) is uncountable.
Problem 12.
(c) x can be all triplets that contain exactly 2 heads: (H, H, T ), (H, T, H) or (T, H, H).
Problem 13.
(a) The universal set is partitioned by the events a, b, d, and thus P (b) = 1 − P (a) − P (d) =
1 − 0.5 − 0.25 = 0.25.
(b) Since the events b and d are disjoint, by the 3rd axiom of probability, P (b∪d) = P (b)+P (d) =
0.25 + 0.25 = 0.5.
Problem 14.
(b)
P (Ac ∩ B) = P (B − A)
= P (B) − P (A ∩ B)
= 0.7 − 0.2
= 0.5
(c)
P (A − B) = P (A) − P (A ∩ B)
= 0.4 − 0.2
= 0.2
P (Ac − B) = P (S) − P (A ∪ B)
= 1 − 0.9
= 0.1,
P (Ac ∪ B) = P (S) − P (A − B)
= 1 − 0.2
= 0.8.
(f)
P (A ∩ (B ∪ Ac )) = P ((A ∩ B) ∪ (A ∩ Ac ))
= P ((A ∩ B) ∪ ∅)
= P (A ∩ B)
= 0.2.
Problem 15.
(a) The second roll is independent of the first, so we only need to consider the second roll, in
which case P (X2 = 4) = 1/6 since this is a finite sample space with equal probabilities for all
outcomes.
(b) The sample space is {1, 2, . . . , 6}×{1, 2, . . . , 6}, which has a cardinality of 36, and the possible
outcomes corresponding to the event that X1 + X2 = 7 are given by the set
{(1, 6), (6, 1), (2, 5), (5, 2), (3, 4), (4, 3)}, which has a cardinality of 6, and therefore P (X1 +
X2 = 7) = 6/36 = 1/6.
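Part (b) can also be checked by brute-force enumeration; the short Python sketch below is my addition, not part of the original solution:

outcomes = [(x1, x2) for x1 in range(1, 7) for x2 in range(1, 7)]
favorable = [o for o in outcomes if o[0] + o[1] == 7]
print(len(favorable), len(outcomes))   # 6 36, so P(X1 + X2 = 7) = 6/36 = 1/6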
(c) Listing out the tuples that satisfy the second condition in a matrix-like representation, we
have:
Problem 16.
(a) The formula for a geometric series will be useful here: Σ_{k=0}^{∞} c r^k = c/(1 − r) for |r| < 1. To solve for c, we can use the normalization constraint:

1 = Σ_{k=1}^{∞} P(k)
= −c + Σ_{k=0}^{∞} c (1/3)^k
= −c + c/(1 − 1/3),

and therefore c = 2.
(b)
(c)
P({3, 4, 5, . . .}) = Σ_{k=3}^{∞} 2 (1/3)^k
= −2[1 + 1/3 + 1/9] + 2 Σ_{k=0}^{∞} (1/3)^k
= −2(13/9) + 2(3/2)
= 1/9.

This answer may also have been computed as 1 − P(1) − P(2).
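Both c = 2 and the tail probability can be checked numerically; the following Python sketch (my addition, truncating the infinite sum at a large cutoff) reproduces the results:

from fractions import Fraction

pmf = [2 * Fraction(1, 3)**k for k in range(1, 200)]   # P(k) = 2*(1/3)**k, truncated
print(float(sum(pmf)))        # ≈ 1.0, so c = 2 normalizes the PMF
print(float(sum(pmf[2:])))    # ≈ 0.111 ≈ 1/9 = P({3, 4, 5, ...})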
Problem 17. Let us write down what we know in equations. Let a, b, c, d represent the events
that teams, A, B, C and D win the tournament respectively. Then as stated in the problem,
P (a) = P (b), P (c) = 2P (d) and P (a ∪ c) = 0.6. Since the events partition the sample space,
P (a ∪ c) = P (a) + P (c). We know one more equation, which is that the probabilities must sum to
one: P (a) + P (b) + P (c) + P (d) = 1. We therefore have a linear system with 4 equations and 4
unknowns, and it will thus be convenient to write this in matrix notation in order to solve for the
probabilities:
[ 1 −1  0  0 ] [P(a)]   [ 0   ]
[ 0  0  1 −2 ] [P(b)] = [ 0   ]
[ 1  0  1  0 ] [P(c)]   [ 0.6 ]
[ 1  1  1  1 ] [P(d)]   [ 1   ]

⟹

[P(a)]   [ 1 −1  0  0 ]⁻¹ [ 0   ]   [  2  1 −3  2 ] [ 0   ]   [ 0.2 ]
[P(b)] = [ 0  0  1 −2 ]   [ 0   ] = [  1  1 −3  2 ] [ 0   ] = [ 0.2 ]
[P(c)]   [ 1  0  1  0 ]   [ 0.6 ]   [ −2 −1  4 −2 ] [ 0.6 ]   [ 0.4 ]
[P(d)]   [ 1  1  1  1 ]   [ 1   ]   [ −1 −1  2 −1 ] [ 1   ]   [ 0.2 ]
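The same system can be solved numerically; this is a small sketch using NumPy, added here for verification and not part of the original solution:

import numpy as np

M = np.array([[1, -1, 0,  0],
              [0,  0, 1, -2],
              [1,  0, 1,  0],
              [1,  1, 1,  1]], dtype=float)
b = np.array([0, 0, 0.6, 1])
print(np.linalg.solve(M, b))   # [0.2 0.2 0.4 0.2]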
Problem 18.
(a) P(T ≤ 1) = 1/16

(b)

P(T > 2) = 1 − P(T ≤ 2) = 1 − 4/16 = 3/4

(c)

P(1 ≤ T ≤ 3) = P(T ≤ 3) − P(T < 1) = 9/16 − 1/16 = 1/2
Problem 19. The solutions to the quadratic are given by the quadratic formula:
X = (−1 ± √(1 − 4AB)) / (2A),     (1.3)
[Figure: the unit square, showing the curve y = 1/(4x) and the region below it, where 1 − 4AB ≥ 0.]
which has real solutions iff the condition 1 − 4AB ≥ 0 is satisfied. We therefore seek the probability
that P (1 − 4AB ≥ 0) (in the unit square), which, since the point (A, B) is picked uniformly, is the
fraction of area in the unit square which satisfies this constraint. Therefore points which satisfy
the following inequalities contribute to this probability:
y ≤ 1/(4x),    x ≤ 1    and    y ≤ 1,

where the last 2 inequalities follow since the randomly drawn points must lie within the unit square. The area in the unit square which satisfies these constraints is shown in the figure above.
It is clear from the figure that the area is given by:
P(real solns.) = 1/4 + (1/4) ∫_{1/4}^{1} (1/x) dx
= 1/4 + (1/4) ln 4
≈ 0.60.
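A quick Monte Carlo estimate agrees with this area calculation; the snippet below is a sketch I have added (the sample size is an arbitrary choice):

import math
import random

trials = 200_000
hits = sum(1 for _ in range(trials)
           if 1 - 4 * random.random() * random.random() >= 0)
print(hits / trials)               # ≈ 0.60
print(0.25 + math.log(4) / 4)      # exact value ≈ 0.597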
Problem 20.
where in the figure, A_1 is the innermost circle, (A_2 − A_1) is the "annulus" around A_1, (A_3 − A_2) is the next "annulus", and so forth. It is clear that the union of A_1 and all of the annuli is A, and that these regions are disjoint. I utilize the previous equation in the desired proof:
Proof.
P(A) = P(A_1) + Σ_{i=2}^{∞} P(A_i − A_{i−1})
= P(A_1) + lim_{n→∞} Σ_{i=2}^{n} P(A_i − A_{i−1})
= P(A_1) + lim_{n→∞} Σ_{i=2}^{n} [P(A_i) − P(A_{i−1})]
= P(A_1) + lim_{n→∞} {[P(A_2) − P(A_1)] + [P(A_3) − P(A_2)] + [P(A_4) − P(A_3)] + . . . + [P(A_n) − P(A_{n−1})]}
= P(A_1) + lim_{n→∞} [P(A_n) − P(A_1)]
= lim_{n→∞} P(A_n)
(b) Redefining A ≡ ⋂_{i=1}^{∞} A_i, we seek to find P(A). If A_1, A_2, . . . is a sequence of decreasing events, then A_1^c, A_2^c, . . . must be a sequence of increasing events, and we can therefore utilize the result of part (a) on the sequence of complements (as well as De Morgan):

P(A^c) = P((⋂_{i=1}^{∞} A_i)^c) = P(⋃_{i=1}^{∞} A_i^c) = lim_{n→∞} P(A_n^c).
Proof.
P(A) = 1 − P(A^c)
= 1 − lim_{n→∞} P(A_n^c)
= lim_{n→∞} [1 − P(A_n^c)]
= lim_{n→∞} P(A_n)
Problem 21.
(a) Let us define new events, B_i, such that B_1 = A_1, B_2 = A_2 − A_1, B_3 = A_3 − A_2 − A_1, . . .. Note that the B_i's are disjoint. Also note that:

⋃_{i=1}^{n} B_i = A_1 ∪ (A_2 − A_1) ∪ (A_3 − A_2 − A_1) ∪ . . . ∪ (A_n − A_{n−1} − . . . − A_1)
= A_1 ∪ A_2 ∪ A_3 ∪ . . . ∪ A_n
= ⋃_{i=1}^{n} A_i,

and for the same reason ⋃_{i=1}^{∞} B_i = ⋃_{i=1}^{∞} A_i. Using these facts, the proof is now straightforward:
Proof.
P(⋃_{i=1}^{∞} A_i) = P(⋃_{i=1}^{∞} B_i)
= Σ_{i=1}^{∞} P(B_i)
= lim_{n→∞} Σ_{i=1}^{n} P(B_i)
= lim_{n→∞} P(⋃_{i=1}^{n} B_i)
= lim_{n→∞} P(⋃_{i=1}^{n} A_i)
(b) To prove this second result I use the previous result as well as De Morgan (twice):
Proof.
P(⋂_{i=1}^{∞} A_i) = 1 − P(⋃_{i=1}^{∞} A_i^c)
= 1 − lim_{n→∞} P(⋃_{i=1}^{n} A_i^c)
= lim_{n→∞} [1 − P(⋃_{i=1}^{n} A_i^c)]
= lim_{n→∞} P(⋂_{i=1}^{n} A_i)
Problem 22. Let A_coffee be the event that a customer purchases coffee and A_cake be the event that a customer purchases cake. We know that P(A_coffee) = 0.7, P(A_cake) = 0.4 and P(A_coffee, A_cake) = 0.2. Thus, the conditional probability we seek is:
Problem 23.
(a)
P(A|B) = P(A ∩ B)/P(B) = (0.1 + 0.1)/(0.1 + 0.1 + 0.1 + 0.05) ≈ 0.57

(b)
P(C|B) = P(C ∩ B)/P(B) = (0.1 + 0.05)/(0.1 + 0.1 + 0.1 + 0.05) ≈ 0.43

(c)
P(B|A ∪ C) = P(B ∩ (A ∪ C))/P(A ∪ C) = (0.1 + 0.1 + 0.05)/(0.1 + 0.2 + 0.1 + 0.1 + 0.05 + 0.15) ≈ 0.36

(d)
P(B|A, C) = P(B ∩ (A ∩ C))/P(A ∩ C) = 0.1/(0.1 + 0.1) = 0.5
Problem 24.
(a)
P(2 ≤ X ≤ 5) = 3/10 = 0.3

(b)
P(X ≤ 2|X ≤ 5) = 2/5 = 0.4

(c)
P(3 ≤ X ≤ 8|X ≥ 4) = P(3 ≤ X ≤ 8 ∩ X ≥ 4)/P(X ≥ 4) = 4/6 = 2/3
Problem 25. Let ON denote the event that a student lives on campus, OF F denote the event
that a student lives off campus and A denote the event that a student receives an A. Given the
data I compute the following probabilities:
P(ON) ≈ 200/600 = 1/3
P(A) ≈ 120/600 = 1/5
P(A ∩ ON) = P(A) − P(A ∩ OFF) ≈ 1/5 − 80/600 = 1/15
If the events ON and A are independent, then P (A ∩ ON ) = P (A)P (ON ). Looking at the
probabilities above, we see that the data suggests this relationship, and thus the data suggests that
getting an A and living on campus are independent.
Problem 26. Let N1 be the number of times out of n that a 1 is rolled, N6 be the number of
times out of n that a 6 is rolled and Xi be the value of the ith roll. Then:
[Probability tree for Problem 27: first-level branches 0.8/0.2, second-level branches 0.9/0.1 and 0.3/0.7, giving leaf probabilities 0.72, 0.08, 0.06 and 0.14.]
In the second line I have used De Morgan, and I have also used the fact, several times, that the outcome of roll i is independent of the outcome of roll j. Testing a few values of n, I find that, when n = 1, the probability is 0, which makes sense because at the very minimum we would need at least one 1 and one 6, which cannot happen if we have only rolled once. The probability then monotonically increases, which also makes sense because it becomes more and more likely that we roll at least one 1 and at least one 6 the more times we roll. Note that as a sanity check, one can show that lim_{n→∞} 1 − (2 · 5^n − 4^n)/6^n is 1, so that our formula for the probability is bounded between 0 and 1. Also note that this formula can also be obtained more easily with combinatorics, which will be introduced in Chapter 2.
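For reference, the formula 1 − (2 · 5^n − 4^n)/6^n can be evaluated for a few values of n with the short Python sketch below (an addition, not part of the original solution):

def p_one_and_six(n):
    # probability of rolling at least one 1 and at least one 6 in n rolls
    return 1 - (2 * 5**n - 4**n) / 6**n

for n in (1, 2, 5, 10, 50):
    print(n, round(p_one_and_six(n), 4))   # 0.0 for n = 1, increasing toward 1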
Problem 27.
(c) P(G|E^c) = P(G ∩ E^c)/P(E^c) = 0.72/(1 − 0.14) ≈ 0.84.
Problem 28. Let Ai be the event that the ith (i = 1, 2, 3) unit of the 3 picks is defective, while
the other 2 are not defective. Note that A1 , A2 and A3 are all disjoint, since it is impossible for any
unit to be both defective and not defective simultaneously. The probability we seek is therefore:
Problem 29. Let F be the event that the system is functional, and Ci be the event that component
i is functional.
(a) P (F ) = P (C1 , C2 , C3 ) = P1 P2 P3
(b) By inclusion-exclusion:
P (F ) = P (C1 ∪ C2 ∪ C3 )
= P1 + P2 + P3 − P1 P2 − P1 P3 − P2 P3 + P1 P2 P3
(c)
P (F ) = P ((C1 , C3 ) ∪ (C2 , C3 ))
= P (C1 , C3 ) + P (C2 , C3 ) − P ((C1 ∩ C3 ) ∩ (C2 ∩ C3 ))
= P1 P3 + P2 P3 − P (C1 , C2 , C3 )
= P1 P3 + P2 P3 − P1 P2 P3
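The three reliability formulas can be evaluated side by side; the snippet below is a small sketch with arbitrary example values for P1, P2, P3 (my addition, not from the original text):

P1, P2, P3 = 0.9, 0.8, 0.7    # arbitrary example component reliabilities

series   = P1 * P2 * P3                                        # part (a): all in series
parallel = P1 + P2 + P3 - P1*P2 - P1*P3 - P2*P3 + P1*P2*P3     # part (b): all in parallel
mixed    = P1*P3 + P2*P3 - P1*P2*P3                            # part (c)
print(series, parallel, mixed)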
Problem 30.
(a) The region in the unit square corresponding to set A can be made more clear if we write the
absolute value as a piecewise function:
|x − y| ≤ 1/2  ⟹  { x − y ≤ 1/2 if x ≥ y;  y − x ≤ 1/2 if x < y }  ⟹  { y ≥ x − 1/2 if y ≤ x;  y ≤ x + 1/2 if y > x }.
This piecewise function, along with the fact that A must be bounded in the unit square, leads to the hashed region in Fig. 1.4. The region corresponding to set B is just the area in the unit square above the 45° line (corresponding to the gray shaded region in Fig. 1.4).
(b) Using a little geometry, I find: P(A) = 1 − 2 · (1/2)(1/2)(1/2) = 3/4 and P(B) = 1/2.

(c) Again, using some geometry, I find: P(A ∩ B) = 1/2 − (1/2)(1/2)(1/2) = 3/8. Since P(A)P(B) = (3/4)(1/2) = 3/8, the 2 events are indeed independent.
Problem 31.
Figure 1.4: The unit square for Problem 30. The shaded region represents the set B and the hashed
region represents the set A
(a) Let s be the event that the received email is spam and r be the event that the received email contains the word refinance. From the problem statement, we know that P(s) = 0.5 (so that P(s^c) = 0.5), P(r|s) = 0.01 and P(r|s^c) = 0.00001. Using Bayes' rule:

P(s|r) = P(r|s)P(s)/P(r)
= P(r|s)P(s) / [P(r|s)P(s) + P(r|s^c)P(s^c)]
= (0.01)(0.5) / [(0.01)(0.5) + (0.00001)(0.5)]
≈ 0.999
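Plugging the numbers into Bayes' rule in Python gives the same posterior (a small sketch added for verification, not part of the original solution):

p_s = 0.5
p_r_given_s, p_r_given_sc = 0.01, 0.00001
p_r = p_r_given_s * p_s + p_r_given_sc * (1 - p_s)
print(p_r_given_s * p_s / p_r)   # ≈ 0.999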
Problem 32.
(a) There are 4 possible paths from A to B: 1 to 4 (path 1), 2 to 5 (path 2), 1 to 3 to 5 (path
3), 2 to 3 to 4 (path 4). Let Pi be the event that path i is open. Only 1 path needs to be
open for event A to occur, so the probability of A is given by the probability of P1 or P2 or
P3 or P4 . We expand this probability with inclusion-exclusion, making sure to enumerate all
unique pairs and all unique triplets:
P (A) = P (P1 ∪ P2 ∪ P3 ∪ P4 )
= P (P1 ) + P (P2 ) + P (P3 ) + P (P4 )
− P (P1 ∩ P2 ) − P (P1 ∩ P3 ) − P (P1 ∩ P4 ) − P (P2 ∩ P3 ) − P (P2 ∩ P4 ) − P (P3 ∩ P4 )
+ P (P1 ∩ P2 ∩ P3 ) + P (P1 ∩ P2 ∩ P4 ) + P (P1 ∩ P3 ∩ P4 ) + P (P2 ∩ P3 ∩ P4 )
− P (P1 ∩ P2 ∩ P3 ∩ P4 )
= P (B1 ∩ B4 ) + P (B2 ∩ B5 ) + P (B1 ∩ B3 ∩ B5 ) + P (B2 ∩ B3 ∩ B4 )
− P ((B1 ∩ B4 ) ∩ (B2 ∩ B5 )) − P ((B1 ∩ B4 ) ∩ (B1 ∩ B3 ∩ B5 )) − P ((B1 ∩ B4 ) ∩ (B2 ∩ B3 ∩ B4 ))
− P ((B2 ∩ B5 ) ∩ (B1 ∩ B3 ∩ B5 )) − P ((B2 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
− P ((B1 ∩ B3 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
+ P ((B1 ∩ B4 ) ∩ (B2 ∩ B5 ) ∩ (B1 ∩ B3 ∩ B5 )) + P ((B1 ∩ B4 ) ∩ (B2 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
+ P ((B1 ∩ B4 ) ∩ (B1 ∩ B3 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
+ P ((B2 ∩ B5 ) ∩ (B1 ∩ B3 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
− P ((B1 ∩ B4 ) ∩ (B2 ∩ B5 ) ∩ (B1 ∩ B3 ∩ B5 ) ∩ (B2 ∩ B3 ∩ B4 ))
= P (B1 ∩ B4 ) + P (B2 ∩ B5 ) + P (B1 ∩ B3 ∩ B5 ) + P (B2 ∩ B3 ∩ B4 )
− P (B1 ∩ B4 ∩ B2 ∩ B5 ) − P (B1 ∩ B4 ∩ B3 ∩ B5 ) − P (B1 ∩ B4 ∩ B2 ∩ B3 )
− P (B2 ∩ B5 ∩ B1 ∩ B3 ) − P (B2 ∩ B5 ∩ B3 ∩ B4 ) − P (B1 ∩ B3 ∩ B5 ∩ B2 ∩ B4 )
+ P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 ) + P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 )
+ P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 ) + P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 )
− P (B1 ∩ B2 ∩ B3 ∩ B4 ∩ B5 )
= P1 P4 + P2 P5 + P1 P3 P5 + P2 P3 P4 − P1 P4 P2 P5 − P1 P4 P3 P5 − P1 P4 P2 P3
− P2 P5 P1 P3 − P2 P5 P3 P4 + 2P1 P2 P3 P4 P5
= P1 P4 (1 − P2 P5 − P3 P5 − P2 P3 ) + P2 P5 + P1 P3 P5 + P2 P3 P4
− P2 P5 P1 P3 − P2 P5 P3 P4 + 2P1 P2 P3 P4 P5
As a sanity check, if bridge 3 does not exist (i.e., if P3 = 0), then there are only 2 paths and
by inclusion-exclusions, P (A) = P1 P4 + P2 P5 − P1 P4 P2 P5 . In the limit that P3 = 0, we see
that, indeed, the above formula matches this probability.
(b) To solve for P (B3 |A) I use Bayes’ rule:
P(B3|A) = P(A|B3) P3 / P(A).
P (A) has already been calculated. To solve for the probability of A conditioned on B3 we
need only to condition each probability term in P (A) on B3 , which effectively turns all the
P3 terms in the formula for P (A) to unity. Therefore,
P (A|B3 ) = P1 P4 (1 − P2 P5 − P5 − P2 ) + P2 P5 + P1 P5 + P2 P4 − P2 P5 P1 − P2 P5 P4 + 2P1 P2 P4 P5 ,
and we can insert P(A|B3) and P(A) into Bayes' rule to obtain the answer:

P(B3|A) = [P1P4P3(1 − P2P5 − P5 − P2) + P3P2P5 + P3P1P5 + P3P2P4 − P3P2P5P1 − P3P2P5P4 + 2P1P2P3P4P5] / [P1P4(1 − P2P5 − P3P5 − P2P3) + P2P5 + P1P3P5 + P2P3P4 − P2P5P1P3 − P2P5P3P4 + 2P1P2P3P4P5].
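The inclusion-exclusion formula for P(A) can be checked by brute-force enumeration over the 2^5 bridge states; the sketch below is my addition, and the numerical values chosen for P1, ..., P5 are arbitrary:

from itertools import product

p = [0.9, 0.8, 0.7, 0.6, 0.5]                   # arbitrary example values for P1..P5
paths = [(0, 3), (1, 4), (0, 2, 4), (1, 2, 3)]  # bridge indices for the 4 paths

def prob_A(p):
    total = 0.0
    for state in product([0, 1], repeat=5):     # 1 means the bridge is open
        pr = 1.0
        for is_open, pi in zip(state, p):
            pr *= pi if is_open else (1 - pi)
        if any(all(state[i] for i in path) for path in paths):
            total += pr
    return total

P1, P2, P3, P4, P5 = p
formula = (P1*P4*(1 - P2*P5 - P3*P5 - P2*P3) + P2*P5 + P1*P3*P5 + P2*P3*P4
           - P2*P5*P1*P3 - P2*P5*P3*P4 + 2*P1*P2*P3*P4*P5)
print(prob_A(p), formula)   # the two values agree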
Problem 33. Without loss of generality, let us call the door that you picked door 1, and let us
arbitrarily denote the remaining doors by 2 and 3. Let Ci denote the event that the car is behind
door i and Hi denote the event that the host opens door i. The original probability that you
guessed the door with the car is P (C1 ) = 1/3. Since the host will not open door 1, and he will also
not open the door with the car behind it, we have the following probabilities:
P(H1|C1) = 0,  P(H2|C1) = 1/2,  P(H3|C1) = 1/2,
P(H1|C2) = 0,  P(H2|C2) = 0,    P(H3|C2) = 1,
P(H1|C3) = 0,  P(H2|C3) = 1,    P(H3|C3) = 0.
If the host opens door 3, we would like to know P (C2 |H3 ), because if this value is higher than
1/3, it is in our interest to switch to door 2. Likewise if the host opens door 2, we would like
to know P (C3 |H2 ) to know if we should switch to door 3. Given the symmetry of the problem
P(C2|H3) = P(C3|H2), so I only need to compute the probability once, which I do using Bayes' rule:
P(C2|H3) = P(H3|C2)P(C2) / [P(H3|C1)P(C1) + P(H3|C2)P(C2) + P(H3|C3)P(C3)]
= (1 · 1/3) / (1/2 · 1/3 + 1 · 1/3 + 0)
= 2/3.
It is therefore in your interest to switch to door 2 if the host opens door 3 or to switch to door 3 if
the host opens door 2.
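A quick Monte Carlo simulation reproduces the 2/3 switching probability; this sketch is my addition (the trial count is arbitrary):

import random

def switch_wins(trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randint(1, 3)
        pick = 1                                             # we always pick door 1
        host = random.choice([d for d in (2, 3) if d != car])  # host avoids our door and the car
        switched = ({1, 2, 3} - {pick, host}).pop()
        wins += (switched == car)
    return wins / trials

print(switch_wins())   # ≈ 2/3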
Problem 34.
(a) P (A) = 1/6, P (B) = |{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}|/36 = 1/6, P (A, B) = 1/36.
Since P (A)P (B) = 1/36 = P (A, B), the events are indeed independent.
(b) P(C) = 1/6, so that P(A)P(C) = 1/36, and P(A, C) = 1/36, and therefore they are independent.
(c) P (B)P (C) = 1/36, P (B, C) = 1/36, so yes, they are independent.
(d) The events A, B and A, C and B, C are pairwise independent. We also need to check if
P (A, B, C) = P (A)P (B)P (C). The probability of P (A, B, C) equals 0 since those events
cannot all occur at once, whereas P (A)P (B)P (C) 6= 0. Therefore, the events A, B and C
are not independent.
Problem 35. Let X1 denote the outcome of the first flip, W denote the event that I win, and let the probability of tails be q (= 1 − p). From Bayes' rule, the probability that the first flip was heads given that I won the game is:

Note that since P(W) = P(W|X1 = H)p + P(W|X1 = T)q, the first term in the parentheses in the above equation represents P(W|X1 = H) while the second term in the parentheses represents P(W|X1 = T). I solve for both of these separately:
P(W|X1 = H) = q^0 p + q p^2 + . . .
= (qp)^0 p + (qp)^1 p + . . .
= p Σ_{k=0}^{∞} (qp)^k
= p/(1 − qp),
while
P(W|X1 = T) = q^0 p^2 + q p^3 + . . .
= (qp)^0 p^2 + (qp)^1 p^2 + . . .
= p^2 Σ_{k=0}^{∞} (qp)^k
= p^2/(1 − qp),
where I have used the formula for a geometric series. Thus we can compute the probability of
winning as
Problem 36. Let Hn+1 denote the event that the (n + 1)th flip is a head, H . . . H denote the
event of observing n heads and F denote the event that we pick the fair coin. We would like to find
P (Hn+1 |H . . . H), and we know that from the law of total probability P (Hn+1 ) = P (Hn+1 |F )P (F )+
P (Hn+1 |F c )P (F c ). By conditioning all of the probabilities on H . . . H, this equation gives a formula
for the probability we desire:
P(F|H . . . H) = P(H . . . H|F)P(F) / [P(H . . . H|F)P(F) + P(H . . . H|F^c)P(F^c)]
= (1/2)^n (1/2) / [(1/2)^n (1/2) + 1 · (1/2)]
= 1/(1 + 2^n).
Thus:

P(H_{n+1}|H . . . H) = (1/2) · 1/(1 + 2^n) + 1 · (1 − 1/(1 + 2^n))
= 1 − 1/(2(1 + 2^n)).
We can check this formula for the extremes that n = 0 and n → ∞. In the first case, if n = 0, we
can calculate the probability of heads directly: P (H) = (1/2)(1/2) + 1(1/2) = 3/4, which matches
what the formula predicts when n = 0. When n → ∞, we would expect that the coin is probably
unfair, so that the probability of the next flip landing heads is 1. Indeed, this is what the formula
predicts in the limit that n → ∞.
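The same calculation can be checked exactly with fractions; this short sketch is an addition, not part of the original solution:

from fractions import Fraction

def p_next_head(n):
    post_fair = Fraction(1, 1 + 2**n)            # P(fair coin | n heads)
    return Fraction(1, 2) * post_fair + (1 - post_fair)

for n in (0, 1, 2, 10):
    assert p_next_head(n) == 1 - Fraction(1, 2 * (1 + 2**n))
print(p_next_head(0), p_next_head(10))   # 3/4 and 2049/2050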
Problem 37. Let Xi denote the number of girls for the ith child. Note that Xi can only take on
values 0 or 1. We seek the probability P (X1 = 1, . . . , Xn = 1|X1 + . . . + Xn ≥ 1). This can be
re-written with Bayes’ rule:
P(X1 = 1, . . . , Xn = 1|X1 + . . . + Xn ≥ 1) = 1/(2^n − 1).

We can test this formula for low values of n. For n = 1, P(X1 = 1|X1 = 1) = 1, which the above formula predicts. For n = 2, by listing out the boy/girl event space, it is not difficult to determine that P(X1 = 1, X2 = 1|X1 + X2 ≥ 1) = 1/3, which is also what the above formula predicts.
Problem 38. Let L be the event that the family has at least 1 daughter named Lilia, G . . . G be
the event that the n children are girls, BG . . . G be the event that the first child is a boy and the
following n − 1 children are girls, GBG . . . G be the event that the first child is a girl, the second
is a boy and following n − 2 children are girls, etc.... We are interested in P (G . . . G|L) which we
can obtain with Bayes’ rule:
P(G . . . G|L) = P(L|G . . . G)P(G . . . G) / P(L).     (1.6)

P(L) = 1 − P(L^c)
= 1 − P(G . . . G)[P(L^c|G . . . G) + P(L^c|BG . . . G) + P(L^c|GBG . . . G) + . . . + P(L^c|B . . . B)]
= 1 − P(G . . . G)[(1 − α)^n + (1 − α)^{n−1} + (1 − α)^{n−1} + . . . + (1 − α)^0]
= 1 − P(G . . . G) Σ_{k=0}^{n} [n!/(k!(n − k)!)] (1 − α)^k
= 1 − P(G . . . G)(2 − α)^n.
In the third line I used the fact that, given n daughters, to have no daughter named Lilia each of the n daughters must not be named Lilia, which occurs with probability (1 − α)^n. In the fourth line, I used the fact that the total number of sequences with exactly k girls (BG . . . G, GBG . . . G, . . .) is given by the number of permutations of n elements (n!) divided by the number of repeats for any element (which for the Gs is k!, where k is the number of Gs in the sequence, and which for the Bs is (n − k)!). This is a simple combinatorics problem which will be discussed in the following chapter. Finally, to evaluate the summation, I used the binomial theorem.
We need one more probability, which is P(L|G . . . G) = 1 − P(L^c|G . . . G) = 1 − (1 − α)^n. Substituting all of these probabilities into Bayes' rule, I obtain:
P(G . . . G|L) = [1 − (1 − α)^n](1/2)^n / [1 − (1/2)^n (2 − α)^n]
= [1 − (1 − α)^n] / [2^n − (2 − α)^n]
≈ [1 − (1 − nα)] / [2^n − 2^n(1 − nα/2)]
= 1/2^{n−1},

where I have used a Taylor expansion to simplify the polynomial terms since α ≪ 1. The case of
n = 2 corresponds to Problem 7 of section 1.4.5 in the book. Evaluating my formula with n = 2 I
find that P (GG|L) = (2 − α)/(4 − α) ≈ 1/2 which is the same formula as the answer to Problem
7 of section 1.4.5.
Problem 39. Let R be the event that a randomly chosen child is a girl, and G . . . G be the event
that the family has n girls. We seek to find P (G . . . G|R), which we can get from Bayes’ rule:
P(G . . . G|R) = P(R|G . . . G)P(G . . . G) / P(R).

R is certain to happen conditioned on G . . . G, P(G . . . G) is simply (1/2)^n, and the probability of randomly choosing a girl from a family without any prior information about genders is 1/2, as shown for the n = 1 (S = {G, B}) and n = 2 (S = {BB, BG, GB, GG}) cases below:

n = 1:
P(R) = P(R|G)P(G) + P(R|B)P(B) = 1 · (1/2) + 0 · (1/2) = 1/2,
n=2:
Chapter 2

Combinatorics: Counting Methods
Problem 1. We can use the multiplication principle, making sure to enumerate all the cream/sugar/milk possibilities:

4 · 3 · [C(3, 0) + C(3, 1) + C(3, 2) + C(3, 3)] = 4 · 3 · 8 = 96.     (2.1)
Problem 2. Let N be the number of unique permutations of the 8 people in the 12 chairs. The 4 empty chairs are indistinguishable, so, for any given unique permutation, the permutations amongst those 4 chairs do not count toward the number of unique permutations, N. We know that the total number of permutations (including the non-unique permutations) is 12!, and therefore 12! = N · 4!, so that

N = 12!/4! = 19958400.     (2.2)
Problem 3.
(a) Let B represent the set of the 20 black cell phones, {b_1, b_2, . . . , b_20}, and W represent the set of the 30 white cell phones, {w_1, w_2, . . . , w_30}. Let 𝓑 be the collection of all possible sets of 4 distinct black cell phones chosen (without replacement) from the 20 black cell phones, 𝓑 = {{b_1, b_2, b_3, b_4}, {b_1, b_2, b_3, b_5}, . . . , {b_17, b_18, b_19, b_20}}, and 𝓦 be the corresponding collection for the 6 white cell phones. Therefore, the collection of sets representing all unique ways to obtain 4 black cell phones and 6 white cell phones is {B_1 ∪ W_1, B_1 ∪ W_2, . . . , B_|𝓑| ∪ W_|𝓦|}, whose total cardinality can be seen to be |𝓑||𝓦|. |𝓑| is clearly C(20, 4), and |𝓦| is clearly C(30, 6), so the size of this set is C(20, 4) C(30, 6). The sample space for this experiment is all possible unique sets of size 10 that can be chosen from B ∪ W. Therefore, the probability of obtaining exactly 4 black cell phones is given by:

P(4 black phones) = C(20, 4) C(30, 6) / C(50, 10) ≈ 0.28.     (2.3)
In this problem I somewhat laboriously spelled out how to obtain the proper number of sets from the sample space with exactly 4 black cell phones. I did this for the purpose of illustration since this type of situation arises commonly in combinatorics problems. In the future I will typically be more terse.
(b)
Problem 4.
(a) The sample space is all possible sets of size 5 chosen from the 52 cards, and the events we are interested in are all possible sets of size 5 containing exactly one ace. Therefore:

P(N_A = 1) = C(4, 1) C(48, 4) / C(52, 5) ≈ 0.30.
(b) Let NA ≥ 1 be the event that the hand contains at least 1 ace. It will be easier to consider
the complement of this event:
P(N_A ≥ 1) = 1 − P(N_A = 0)
= 1 − C(48, 5)/C(52, 5)
≈ 0.34.
Problem 5. It will be convenient to use Bayes’ rule so that we can move NA ≥ 1 to the first slot
of P (·|·):
and therefore:
P(N_A = 2|N_A ≥ 1) ≈ (1 · 0.04)/0.34 ≈ 0.12.     (2.5)
Problem 6. Let C4 be the event that C receives exactly 4 spades. Each player has 13 cards, and
between players A and B, we know there are 7 spades, and 19 non-spades. This leaves 6 spades
and 20 non-spades to be chosen amongst players C and D. If the 26 cards are first dealt to A and
B, and another 13 are dealt to C, then the probability that C obtains exactly 4 spades is:
P(C_4) = C(6, 4) C(20, 9) / C(26, 13) ≈ 0.24.
Problem 7. Let J be the event that Joe is chosen and Y be the event that you are chosen. By
inclusion-exclusion:
P (J ∪ Y ) = P (J) + P (Y ) − P (J, Y ).
There are C(1, 1) C(49, 14) different ways Joe can be chosen and the same number of ways you can be chosen. There are C(2, 2) C(48, 13) different ways both you and Joe can be chosen, and thus:

P(J ∪ Y) = 2 C(49, 14)/C(50, 15) − C(48, 13)/C(50, 15) ≈ 0.51.
Problem 8. In general, for a sequence with n elements, r of which are unique, the number of
unique permutations is given by:
N = n!/(n_1! n_2! . . . n_r!),     (2.6)
where n_i is the number of repeats of the ith unique element in the original sequence. This can easily be shown, since the total number of permutations must equal the number of unique permutations times the number of ways the repeated elements within each unique permutation can be permuted amongst themselves: n! = N n_1! n_2! . . . n_r!. For example, one unique permutation of the
word “Massachusetts” is Massachusetts itself. We see that the “a”s can be permuted 2! ways
amongst second and fifth position, while still forming the word Massachusetts. Likewise, the
“s”s can be permuted 4! ways and the “t”s 2! ways, resulting in 2!4!2! permutations of all letters
which result in this unique permutation. Thus, the total number of ways of arranging the word
“Massachusetts” is:
N = n!/(n_a! n_s! n_t!) = 13!/(2! 4! 2!) = 64864800.     (2.7)
Problem 9.
(b) Since both the number of heads and number of tails must be > 8, the possible observed
number of heads (tails) can be 9 (11) or 10 (10) or 11 (9). These are disjoint events, so the
total probability we are interested in is
Problem 10. Let u denote a move up and r denote a move to the right. A path from (0, 0)
to (20, 10) can be represented by a sequence of us and rs. Note that in every possible sequence,
there must be 10 us and 20 rs because we always need to travel 10 units up and 20 units to the
right regardless of the path. Therefore, the problem reduces to ascertaining the number of unique
sequences with 10 us and 20 rs, which, from Problem 8 we can see to be:
30!/(20! 10!) = 30045015.
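The count can be verified directly in a line or two (my addition; math.comb requires Python 3.8+):

import math

print(math.comb(30, 10))                                                 # 30045015
print(math.factorial(30) // (math.factorial(20) * math.factorial(10)))   # 30045015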
Problem 11. Let A denote the event that the message passes through (10, 5) on its way to (20, 10).
To reach the point (10, 5) on the way from (0, 0) to (20, 10), the first 15 entries of the sequence must
have exactly 5 us and 10 rs. This may occur in any number of the unique permutations 15!/(10!5!).
To reach (20, 10), the remaining entries must also contain exactly 5 us and 10 rs, again giving
15!/(10!5!) unique permutations from (10, 5) to (20, 10). The total number of unique permutations
starting at (0, 0) and going through (10, 5) on its way to (20, 10) is therefore (15!/(10!5!))2 , so that
the probability that the message goes through (10, 5) is:
P(A) = [15!/(10! 5!)]^2 / [30!/(10! 20!)] ≈ 0.30
Problem 12. Let A denote the event that the message passes through (10, 5). This occurs if, out
of the first 15 entries of the sequence there are exactly 5 us and 10 rs in any order. For a binary
outcome experiment, the probability of obtaining 5 us with probability pa is given by the binomial
distribution:
P(A) = C(15, 5) p_a^5 (1 − p_a)^{10}.
Problem 13. Let pi be the probability of flipping a heads for coin i (i ∈ {1, 2}), let Ci be the
event that coin i is chosen. Using the law of total probability and the binomial distribution, I find:
(a)

P(N_H ≥ 3) = P(N_H = 3 ∪ N_H = 4 ∪ N_H = 5)
= P(N_H = 3) + P(N_H = 4) + P(N_H = 5)
= Σ_{i=1}^{2} [P(N_H = 3|C_i)P(C_i) + P(N_H = 4|C_i)P(C_i) + P(N_H = 5|C_i)P(C_i)]
= (1/2) Σ_{i=1}^{2} [C(5, 3) p_i^3 (1 − p_i)^2 + C(5, 4) p_i^4 (1 − p_i) + p_i^5]
≈ 0.35.
Problem 15. We would like to find P(N_1 > 1 ∪ N_2 > 1 ∪ . . . ∪ N_6 > 1) = 1 − P(N_1 ≤ 1, N_2 ≤ 1, . . . , N_6 ≤ 1). For the first roll, we therefore have 6 allowable options, for the second 5 allowable options, . . .. Therefore the probability is:

P(N_1 > 1 ∪ N_2 > 1 ∪ . . . ∪ N_6 > 1) = 1 − (6 · 5 · 4 · 3 · 2)/6^5 ≈ 0.91.
Problem 16.
(a) Let A be the desired event. If the first 15 cards are to have 10 red cards, then there are C(10, 10) C(10, 5) different possible groups for the first 15 cards, which we can arrange in 15! possible ways. There are C(5, 5) possible groups for the remaining 5 cards, which we can arrange in 5! possible ways. Finally, the total number of permutations of the 20 cards is 20!, and therefore:

P(A) = C(10, 10) C(10, 5) 15! C(5, 5) 5! / 20! ≈ 0.02.
(b) Let A′ be the event we desire. This problem is almost identical to the first:

P(A′) = C(10, 7) C(10, 8) 15! C(3, 3) C(2, 2) 5! / 20! ≈ 0.35.
Problem 17. Let Bi be the event that I choose bag i (i ∈ {1, 2}) and Nr = 2 be the event that I
choose exactly 2 red marbles out of the 5. Using Bayes' rule:

P(B_1|N_r = 2) = (0.41 · 0.5) / (0.41 · 0.5 + 0.34 · 0.5) ≈ 0.55.     (2.9)
Problem 18. Let E^c denote the event that an error does not occur on a given trial. We seek the probability of all sequences of length n ending in E^c, where the first n − 1 entries can be any sequence containing exactly k − 1 E^c's, an event I denote by A_{n−1}. The probability we desire is P(A_{n−1}, X_n = E^c) = P(A_{n−1})P(X_n = E^c) by independence. I note that P(A_{n−1}) is given by a binomial distribution, and therefore:

P(A_{n−1}, X_n = E^c) = C(n − 1, k − 1) p^{k−1} (1 − p)^{(n−1)−(k−1)} · p = C(n − 1, k − 1) p^k (1 − p)^{n−k}
Problem 19. Let yi ≡ xi − 1 for i = 1, . . . , 5, and therefore all yi s can take on values {0, 1, 2, . . .}.
The equation for which we are trying to find the number of distinct integer solutions then becomes:
y1 + y2 + y3 + y4 + y5 = 95, (2.10)
which has C(5 + 95 − 1, 95) = C(99, 95) integer solutions.
Problem 20. It is not difficult to explicitly enumerate the total number of solutions when x_1 = 0, 1, . . . , 10. The total number of integer valued solutions is the number of solutions when x_1 = 0, plus the number of solutions when x_1 = 1, . . ., plus the number of solutions when x_1 = 10. In each one of these instances, we must find the number of integer solutions of the equation

x_2 + x_3 + x_4 = 100 − i,

(where x_2, x_3, x_4 ∈ {0, 1, 2, . . .}), which has C(3 + 100 − i − 1, 100 − i) solutions. Therefore, the total number of
integer solutions for this equation, N , with x1 ∈ {0, 1, . . . , 10} is:
N = Σ_{i=0}^{10} C(3 + 100 − i − 1, 100 − i)
= (1/2) Σ_{i=0}^{10} (10302 − 203 i + i^2)
= (11 · 10302)/2 − (203/2)(0 + 1 + 2 + . . . + 10) + (1/2)(0 + 1 + 4 + . . . + 100)
= 51271.
Problem 21. Let A1 = {(x1 , x2 , x3 ) : x1 + x2 + x3 = 100, x1 ∈ {41, 42, . . .}, x2 , x3 ∈ {0, 1, 2, . . .}},
and let A2 and A3 be defined analogously. By inclusion-exclusion, the total number of possible
unique integer solutions to this problem is then:
where the second equality follows from symmetry. The cardinality |A_1 ∩ A_2 ∩ A_3| is 0 since it is impossible to have all x_i's > 40 while constrained to add to 100. The cardinality |A_1| can be found by letting y_1 ≡ x_1 − 41, so that y_1, x_2, x_3 ∈ {0, 1, 2, . . .}: y_1 + x_2 + x_3 = 59, which has C(3 + 59 − 1, 59) = C(61, 59) solutions. The cardinality |A_1 ∩ A_2| can be found by letting y_1 ≡ x_1 − 41 and y_2 ≡ x_2 − 41, so that y_1, y_2, x_3 ∈ {0, 1, 2, . . .}: y_1 + y_2 + x_3 = 18, which has C(3 + 18 − 1, 18) = C(20, 18) solutions. Therefore, the total number of solutions to this problem is:

|A_1 ∪ A_2 ∪ A_3| = 3 C(61, 59) − 3 C(20, 18) = 4920.
The following bit of python code confirms what we derived theoretically:
In [1]: i = range(101); j = range(101); k = range(101)
In [2]: tups = [(x, y, z) for x in i for y in j for z in k]
In [3]: len([x for x in tups if x[0] + x[1] + x[2] == 100
   ...:      and (x[0] > 40 or x[1] > 40 or x[2] > 40)])
Out[3]: 4920
Chapter 3

Discrete Random Variables
Problem 1.
(a) RX = {0, 1, 2}
(d)

P(X = 0|X < 2) = P(X = 0 ∩ X < 2)/P(X < 2)
= P(X = 0)/P(X < 2)
= (1/2)/(1/2 + 1/3)
= 3/5
which, when solved, results in p = 1/6 and p′ = 1/3. Thus, the PMF for this problem is:

P_X(x) = 1/3 for x = 0
       = 1/6 for x = 1
       = 1/6 for x = 2
       = 1/3 for x = 3
       = 0 otherwise,
Problem 3. The range of both X and Y is {1, 2, . . . , 6}, so that R_Z = {−5, −4, . . . , 4, 5}. We may
find the PMF by conditioning and using the law of total probability:
P(Z = k) = P(X − Y = k)
= Σ_{y=1}^{6} P(X − Y = k|Y = y)P(Y = y)
= Σ_{y=1}^{6} P(X = k + y|Y = y)P(Y = y)
= Σ_{y=1}^{6} P(X = k + y)P(Y = y)
= (1/6) Σ_{y=1}^{6} P(X = k + y)
= (1/6) Σ_{y=1}^{6} (1/6) 1{1 ≤ k + y ≤ 6},
where the fourth equality follows since X and Y are independent, and where 1{·} is the so called
indicator function which is equal to 1 if its argument evaluates to true, and 0 otherwise. By explicitly
evaluating the sum for all k, I find that P (Z = −5) = 1/36, P (Z = −4) = 2/36, . . . , P (Z = 0) =
6/36, P (Z = 1) = 5/36, . . . , P (Z = 5) = 1/36, which can be conveniently written as:
P(Z = k) = (6 − |k|)/36,     (3.1)
and which can explicitly be checked to be normalized.
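Enumerating the 36 equally likely (X, Y) pairs confirms this PMF; the following sketch is my addition, not part of the original solution:

from collections import Counter

counts = Counter(x - y for x in range(1, 7) for y in range(1, 7))
for k in range(-5, 6):
    assert counts[k] == 6 - abs(k)           # so P(Z = k) = (6 - |k|)/36
print(sum(counts.values()))                  # 36, i.e. the PMF is normalized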
Problem 4.
P(X ≤ 2, Y ≤ 2) = P(X ≤ 2)P(Y ≤ 2)
= [P_X(1) + P_X(2)][P_Y(1) + P_Y(2)]
= (1/4 + 1/8)(1/6 + 1/6)
= 1/8.
(c) Since X and Y are independent, P (X > 2|Y > 2) = P (X > 2) = 1/8 + 1/2 = 5/8.
(d) I use conditioning, the law of total probability and independence to solve for this:
P(X < Y) = Σ_{y=1}^{4} P(X < Y|Y = y)P(Y = y)
= Σ_{y=1}^{4} P(X < y|Y = y)P(Y = y)
= Σ_{y=1}^{4} P(X < y)P(Y = y)
P(X_1 + X_2 + . . . + X_{50} > 30) = Σ_{k=31}^{50} C(50, k) (1/2)^k (1/2)^{50−k}
= (1/2)^{50} Σ_{k=31}^{50} C(50, k)
≈ 0.06,
where the summation has been evaluated numerically.
Problem 6. The formula for P (XN = 0) was derived in the book and is given by:
P(X_N = 0) = 1/2! − 1/3! + . . . + (−1)^N /N!.

I will need this formula in my answer below. Let A_i (i = 1, . . . , N) be the event that the ith person receives their hat. Therefore, for X_N = 1:

P(X_N = 1) = P(A_1, A_2^c, A_3^c, . . . , A_N^c) + P(A_1^c, A_2, A_3^c, . . . , A_N^c) + . . . + P(A_1^c, A_2^c, A_3^c, . . . , A_N)
= N · P(A_1, A_2^c, A_3^c, . . . , A_N^c)
= N P(A_1)P(A_2^c, A_3^c, . . . , A_N^c)
= N (1/N) P(X_{N−1} = 0)
= P(X_{N−1} = 0),
where I have used symmetry in the second equality, independence in the third and the fact that
the probability that person 1 gets their hat out of N hats is 1/N .
For XN = 2:
P(X_N = 2) = Σ_{i<j} P(A_i, A_j) P(X_{N−2} = 0)
= C(N, 2) P(A_1, A_2) P(X_{N−2} = 0)
= C(N, 2) (1/N) (1/(N − 1)) P(X_{N−2} = 0)
= (1/2!) P(X_{N−2} = 0),
where I am summing over all N choose 2 unordered pairs of people who get their hats. The
probability that the first person gets their hat is 1/N while the probability that the second person
gets their hat is 1/(N − 1).
Continuing in this fashion one can see the general formula for the PMF we would like to derive:
P(X_N = k) = (1/k!) P(X_{N−k} = 0) for k = 0, 1, 2, . . . , N,

where

P(X_{N−k} = 0) = 1/2! − 1/3! + . . . + (−1)^{N−k} /(N − k)!.
Problem 7. Computing the probabilities will be simplified by noting that P(X > 5) = 1 − P(X ≤ 5), and P(X > 5|X < 8) = P(5 < X < 8)/P(X < 8). Note that I do not explicitly evaluate the formulas to obtain a numerical answer, but this can easily be done on a computer.
(a) X ∼ Geom(1/5)

(i) P(X > 5) = 1 − Σ_{k=1}^{5} (1/5)(1 − 1/5)^{k−1}

(ii) P(2 < X ≤ 6) = Σ_{k=3}^{6} (1/5)(1 − 1/5)^{k−1}

(iii) P(X > 5|X < 8) = P(5 < X < 8)/P(X < 8) = [Σ_{k=6}^{7} (1/5)(1 − 1/5)^{k−1}] / [Σ_{k=1}^{7} (1/5)(1 − 1/5)^{k−1}]
(b) X ∼ Binomial(10, 1/3)

(ii) P(2 < X ≤ 6) = Σ_{k=3}^{6} C(10, k) (1/3)^k (1 − 1/3)^{10−k}

(iii) P(X > 5|X < 8) = [Σ_{k=6}^{7} C(10, k) (1/3)^k (1 − 1/3)^{10−k}] / [Σ_{k=0}^{7} C(10, k) (1/3)^k (1 − 1/3)^{10−k}]
(c) X ∼ Pascal(3, 1/2)

(ii) P(2 < X ≤ 6) = Σ_{k=3}^{6} C(k − 1, 2) (1/2)^3 (1 − 1/2)^{k−3}

(iii) P(X > 5|X < 8) = [Σ_{k=6}^{7} C(k − 1, 2) (1/2)^3 (1 − 1/2)^{k−3}] / [Σ_{k=3}^{7} C(k − 1, 2) (1/2)^3 (1 − 1/2)^{k−3}]
(d) X ∼ Hypergeometric

(ii) P(2 < X ≤ 6) = Σ_{j=3}^{6} C(10, j) C(10, 12 − j) / C(20, 12)

(iii) P(X > 5|X < 8) = [Σ_{j=6}^{7} C(10, j) C(10, 12 − j) / C(20, 12)] / [Σ_{j=2}^{7} C(10, j) C(10, 12 − j) / C(20, 12)]
(e) X ∼ Pois(5)

(i) P(X > 5) = 1 − Σ_{k=0}^{5} e^{−5} 5^k /k!

(ii) P(2 < X ≤ 6) = Σ_{k=3}^{6} e^{−5} 5^k /k!

(iii) P(X > 5|X < 8) = [Σ_{k=6}^{7} e^{−5} 5^k /k!] / [Σ_{k=0}^{7} e^{−5} 5^k /k!]
Problem 8.
(a) In general, for this problem P(X = x) = P(F_1, F_2, . . . , F_{x−1}, S_x) = P(F_1)P(F_2) . . . P(F_{x−1})P(S_x), where I have used independence. Therefore P(X = 1) = P(S_1) = 1/2,

P(X = 2) = P(F_1)P(S_2) = (1/2)[1 − (1/2)^2] = 3/8,

and so on.
(b) By inspection, one can determine that the general formula for P (X = k) for k = 1, 2, . . . is:
"
Y 1 j
k−1 k #
P (X = k) = 1− 1 .
2 2
j=0
(c)
P(X > 2) = 1 − P(X ≤ 2)
= 1 − [P(X = 1) + P(X = 2)]
= 1 − [1/2 + 3/8]
= 1/8
(d)
P(X = 2|X > 1) = P(X = 2, X > 1)/P(X > 1)
= P(X = 2)/(1 − P(X = 1))
= (3/8)/(1 − 1/2)
= 3/4
Problem 9. To prove this equation, I will work on the RHS and LHS of the equation separately.
Let me first simplify the LHS:
P(X > m + l|X > m) = P(X > m + l, X > m)/P(X > m)
= P(X > m + l)/P(X > m)
= [Σ_{k=m+l+1}^{∞} p(1 − p)^{k−1}] / [Σ_{j=m+1}^{∞} p(1 − p)^{j−1}]
= [p(1 − p)^{m+l}(1 + (1 − p) + (1 − p)^2 + . . .)] / [p(1 − p)^m (1 + (1 − p) + (1 − p)^2 + . . .)]
= (1 − p)^l.
The RHS can also be simplified:
P(X > l) = Σ_{k=l+1}^{∞} p(1 − p)^{k−1} = (1 − p)^l,

which matches the LHS, completing the proof.
Problem 10.
(a) We have dealt with this type of problem extensively in the combinatorics chapter. The probability we seek is |A|/|S|, where A is the set of all possible ways we can pick exactly 4 red balls out of 10 picks, and S is the set of all possible ways to pick 10 balls out of the 50. Let X_r be the total number of red balls drawn. We thus have:

P(X_r = 4) = C(20, 4) C(30, 6) / C(50, 10) ≈ 0.28.     (3.2)
(b) We seek to find P (Xr = 4|Xr ≥ 3). The inequality will be easier to deal with if I put it in
the first slot of P (·|·), and thus I start by employing Bayes’ rule:
where in the second equality I have used the fact that P (Xr ≥ 3|Xr = 4) = 1 and in the third
equality I used what was derived in the previous part of this problem.
Problem 11.
(a) The average number of emails received on the weekend is 2 per hour or 8 per 4 hours. Since
we are modeling this process with a Poisson distribution, the probability that you receive 0
emails on the weekend per 4 hour interval is:
P(k = 0) = e^{−8} · 8^0 /0! ≈ 3.4 × 10^{−4}.     (3.3)
(b) The average number of emails received on the weekend is 2 per hour and 10 per hour on any weekday. Let A_wd be the event that a weekday was chosen and A_we be the event that a weekend was chosen. This problem can be solved using Bayes' rule:

P(A_wd|k = 0) = P(k = 0|A_wd)P(A_wd) / [P(k = 0|A_wd)P(A_wd) + P(k = 0|A_we)P(A_we)]
= e^{−10} · (5/7) / [e^{−10} · (5/7) + e^{−2} · (2/7)]
≈ 8.4 × 10^{−4}.
Problem 12. The CDF can easily be computed from the PMF:
F_X(x) = 0 for x < −2
       = 0.2 for −2 ≤ x < −1
       = 0.2 + 0.3 = 0.5 for −1 ≤ x < 0
       = 0.2 + 0.3 + 0.2 = 0.7 for 0 ≤ x < 1
       = 0.2 + 0.3 + 0.2 + 0.2 = 0.9 for 1 ≤ x < 2
       = 0.2 + 0.3 + 0.2 + 0.2 + 0.1 = 1 for x ≥ 2.
See Fig. 3.1 for a plot of this function.
Problem 13. Whenever there is a jump in the CDF at a value of x, this indicates that that value
of x is in the range of X. Therefore, RX = {0, 1, 2, 3}. The probability at x can be found by
subtracting out the probabilities at values < x from FX (x). Therefore, the following equations give
the probabilities we need:
P(0) = F_X(0)
P(1) = F_X(1) − P(0)
P(2) = F_X(2) − P(1) − P(0)
P(3) = F_X(3) − P(2) − P(1) − P(0),
and when plugging in the values for FX (x), this leads to:
P_X(x) = 1/6 for x = 0
       = 1/3 for x = 1
       = 1/4 for x = 2
       = 1/4 for x = 3.
Figure 3.1: The associated CDF for the PMF of problem 12.
Problem 14.
(a)
E[X] = 1 · 0.5 + 2 · 0.3 + 3 · 0.2 = 1.7
(b)

E[X^2] = 1 · 0.5 + 4 · 0.3 + 9 · 0.2 = 3.5
⟹ Var[X] = E[X^2] − E[X]^2 = 3.5 − 1.7^2 = 0.61
⟹ SD[X] = √(Var[X]) ≈ 0.78
(c)

E[Y] = Σ_{x∈R_X} (2/x) P_X(x) = (2/1) · 0.5 + (2/2) · 0.3 + (2/3) · 0.2 ≈ 1.43.
Problem 15. The range of X is {1, 2, 3, . . .}. For x ≥ 5, these values get mapped to 0, 1, 2, . . .. The values x = 1, 2, 3, 4 get mapped to 4, 3, 2, 1, and thus R_Y = {0, 1, 2, . . .}. To solve for the corresponding PMF, note that P(X = k) = (1/3)(2/3)^{k−1}, and that P_Y(y = k) = P(Y = k) = P(|X − 5| = k). We therefore have:
P_Y(y = 0) = P(X = 5) = (1/3)(2/3)^4
P_Y(y = 1) = P(X = 4 or X = 6) = (1/3)(2/3)^3 + (1/3)(2/3)^5
P_Y(y = 2) = P(X = 3 or X = 7) = (1/3)(2/3)^2 + (1/3)(2/3)^6
P_Y(y = 3) = P(X = 2 or X = 8) = (1/3)(2/3)^1 + (1/3)(2/3)^7
P_Y(y = 4) = P(X = 1 or X = 9) = (1/3)(2/3)^0 + (1/3)(2/3)^8
P_Y(y = 5) = P(X = 10) = (1/3)(2/3)^9
P_Y(y = 6) = P(X = 11) = (1/3)(2/3)^{10}
. . .
Problem 16. I first note that the range of Y is {0, 1, 2, 3, 4, 5}, so that its PMF is
P_Y(y = 0) = P(X = −10 or X = −9 or . . . or X = 0) = 11/21
P_Y(y = 1) = P(X = 1) = 1/21
P_Y(y = 2) = P(X = 2) = 1/21
P_Y(y = 3) = P(X = 3) = 1/21
P_Y(y = 4) = P(X = 4) = 1/21
P_Y(y = 5) = P(X = 5 or X = 6 or . . . or X = 10) = 6/21,
which indeed sums to 1.
Problem 17. Since E[X] was found to be 1/p for the geometric distribution from Example 3.12 in
the book, if we can solve for E[X 2 ] then we can compute the variance with V ar[X] = E[X 2 ]−E[X]2 .
To do this, we will need a few formulas involving the geometric series. I claim that:
Σ_{k=0}^{∞} x^k = 1/(1 − x),  |x| < 1,

Σ_{k=0}^{∞} k x^{k−1} = 1/(1 − x)^2,  |x| < 1,

and

Σ_{k=0}^{∞} k^2 x^{k−1} = (1 + x)/(1 − x)^3,  |x| < 1.
The first formula is simply the sum of a geometric series, the second was already proved in the
book in Example 3.12. I now prove the third formula.
Proof. We can take derivatives of the LHS and RHS of the second equation above to prove the
third. Differentiating the LHS results in:
d/dx Σ_{k=0}^{∞} k x^{k−1} = Σ_{k=0}^{∞} k(k − 1) x^{k−2}
= Σ_{k=1}^{∞} k^2 x^{k−2} − Σ_{k=1}^{∞} k x^{k−2}
= Σ_{j=0}^{∞} (j + 1)^2 x^{j−1} − Σ_{j=0}^{∞} (j + 1) x^{j−1}
= Σ_{j=0}^{∞} j^2 x^{j−1} + 2 Σ_{j=0}^{∞} j x^{j−1} + Σ_{j=0}^{∞} x^{j−1} − Σ_{j=0}^{∞} j x^{j−1} − Σ_{j=0}^{∞} x^{j−1}
= Σ_{j=0}^{∞} j^2 x^{j−1} + Σ_{j=0}^{∞} j x^{j−1}
= Σ_{j=0}^{∞} j^2 x^{j−1} + 1/(1 − x)^2,
where I have made the substitution j = k − 1. Differentiating the RHS results in:
d/dx [1/(1 − x)^2] = 2/(1 − x)^3,
and putting the two together completes the proof.
Problem 18. In Problem 5 from 3.1.6 of the book, we showed that if X1 , X2 , . . . , Xm ∼ Geom(p) =
P ascal(1, p) (iid), then X = X1 + X2 + . . . + Xm ∼ P ascal(m, p). Therefore V ar[X] = V ar[X1 ] +
V ar[X2 ]+. . .+V ar[Xm ] = m(1−p)/(p2 ), by linearity in variance of independent random variables.
Problem 19. I use LOTUS repeatedly in this problem and linearity of expectation.
E[X] = E[−Y/2 + 3/2]
= −(1/2) E[Y] + 3/2
= 1
Problem 20.
(a) The range of X is {1, 2, 3, 4, 5, 6} and the probability for any of these values, x, is simply
Nx /1000, where Nx is the number of households with x people. Therefore:
P_X(x) = 0.1 for x = 1
       = 0.2 for x = 2
       = 0.3 for x = 3
       = 0.2 for x = 4
       = 0.1 for x = 5
       = 0.1 for x = 6.
The expected value of X is: E[X] = 1 · 0.1 + 2 · 0.2 + 3 · 0.3 + 4 · 0.2 + 5 · 0.1 + 6 · 0.1 = 3.3.
(b) The probability of picking a person from a household with k people is equal to the total
number of people in households with k people divided by the total number of people in the
town. In other words, P (Y = k) = (k · Nk )/3300, so that:
P_Y(y) = 1/33 for y = 1
       = 4/33 for y = 2
       = 9/33 for y = 3
       = 8/33 for y = 4
       = 5/33 for y = 5
       = 6/33 for y = 6.
Problem 21.
Figure 3.2: The expected number of tries to observe all unique coupons at least once.
(a) It takes 1 try to observe the first unique coupon. Let this first coupon be called type C1 . Let
the random variable, X1 , be the number of times it takes to observe a coupon different than
type C1 . Call this type C2 . Let the random variable, X2 , be the number of times it takes to
observe a coupon different than type C1 and C2 . Call this type C3 . Let us proceed in this
fashion until we observe N − 1 unique coupons. Finally, let the random variable XN −1 be
the number of times it takes to observe a coupon different than type C1 , C2 , . . . , CN −1 , and
call this coupon type CN . Therefore, the total number of times it takes to observe all unique
coupons at least once is X = 1 + X1 + X2 + . . . + XN −1 .
For each Xi , if we consider choosing C1 , C2 , . . . or, Ci−1 as a failure and Ci as a success, we
see that this is nothing more than a geometric random variable with probability (N − i)/N
of success (since there are N − i un-observed coupons left). Therefore, X1 , X2 , . . . , XN −1 ∼
Geom( NN−i ). Further let X0 ∼ Geom( NN−0 ), and note that the probability of observing X0 = 1
for this distribution is unity since we are sure to have a success on the first trial. Thus, if we
desire, we can replace 1 in X = 1 + X1 + X2 + . . . + XN −1 with the “random variable” X0 .
(b) The expected number of tries it takes to observe all unique coupons at least once is:
E[X] = E[1 + X_1 + X_2 + . . . + X_{N−1}]
= 1 + E[X_1] + E[X_2] + . . . + E[X_{N−1}]
= 1 + N/(N − 1) + N/(N − 2) + . . . + N/(N − (N − 1))
= N Σ_{i=0}^{N−1} 1/(N − i).
The summation can be written in terms of a special function (called the digamma function),
but I believe it is more illustrative to plot the actual function itself. In Fig. 3.2, I show E[X]
for N = 1 to 50 which I calculated numerically with the summation formula I derived above.
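A short sketch of that numerical computation (my addition, with arbitrary sample values of N) is:

def expected_tries(N):
    return N * sum(1.0 / (N - i) for i in range(N))

for N in (1, 2, 10, 50):
    print(N, round(expected_tries(N), 2))    # e.g. N = 50 gives about 225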
Problem 22.
(a) Let X′ be the number of tosses until the game ends. We recognize that X′ is distributed as a geometric random variable with p = q = 1/2 since the coin is fair. The range of X′ is R_{X′} = {1, 2, 3, 4, . . .}. Let the random variable X denote the amount of money won from the game, which has range R_X = {1, 2, 4, 8, . . .}. The function f : R_{X′} → R_X is given by the bijective mapping 2^{X′−1}. Thus, the PMF of X is given by P(X = x) = P(X′ = x′), where x′ is the pre-image of x under f. That is, the PMF of X is given by: P(X = 1) = P(X′ = 1) = p, P(X = 2) = P(X′ = 2) = p^2, P(X = 4) = P(X′ = 3) = p^3, P(X = 8) = P(X′ = 4) = p^4, . . .. Thus the expected value of X is given by the following summation, which we see diverges:

E[X] = Σ_{x∈R_X} x P(X = x) = Σ_{k=1}^{∞} 2^{k−1} P(X′ = k)
= Σ_{k=1}^{∞} 2^{k−1} (1/2)(1/2)^{k−1}
= Σ_{k=1}^{∞} 1/2
= ∞.
Thus, only considering your expected winnings (and ignoring issues like the variance of your
winnings and your particular risk tolerance) you would be willing to pay any amount of money
to play this game.
(b) By noting that . . . , 2^{6−1}, 2^{7−1}, 2^{8−1}, . . . = . . . , 32, 64, 128, . . ., one sees that when X′ = 8, X = 2^7 = 128, which is the first time that X takes on a value greater than 65. Therefore, the probability we desire is:

P(X > 65) = Σ_{k=8}^{∞} P(X′ = k)
= 1 − Σ_{k=1}^{7} (1/2)(1/2)^{k−1}
= 1 − [1/2 + (1/2)^2 + . . . + (1/2)^7]
= 1/128.
(c) This problem is very similar to part a, except that the summation is truncated when x takes on the value 2^{30}, which occurs when k = 31; thereafter, the payout remains 2^{30}. Therefore, the expected value of Y is:

E[Y] = Σ_{k=1}^{31} 2^{k−1} P(X′ = k) + 2^{30} Σ_{k=32}^{∞} P(X′ = k)
= Σ_{k=1}^{31} 1/2 + 2^{30} Σ_{k=32}^{∞} (1/2)^k
= 31/2 + 2^{30} · 2^{−32} Σ_{k′=0}^{∞} (1/2)^{k′}
= 31/2 + 2^{30} · 2^{−32} · 1/(1 − 1/2)
= 16.
We therefore see that in part a, the majority of the contribution to the expected value of X comes from terms much later in the series. This is called a "paradox" since, in the first part the expected value was infinite, but in the second part, even though 2^{30} is a very large number, the expected winnings are much lower than what one might have guessed.
Problem 23. Write E[(X − α)^2] = E[X^2] − 2αμ + α^2, where μ ≡ E[X]. Therefore:

α* = arg min_{α∈R} {E[X^2] − 2αμ + α^2} = arg min_{α∈R} {−2αμ + α^2},

which I find by setting the derivative

d/dα (−2αμ + α^2) = −2μ + 2α

equal to zero, and solving for α*. This results in α* = μ.
Problem 24. If you choose to roll the die for a second time, your expected winnings is E[Y ] = 3.5.
Therefore, if you roll less than 3.5 on the first roll (i.e., 1, 2 or 3) you should roll again because you
expect to do better on the second roll. However, if you roll a 4, 5 or 6, you will expect to do worse
on the second roll, so you should not roll again.
Given this strategy, your expected winnings is:
where, in the second equality, E[Y 1{X ≤ 3}] = E[Y ]E[1{X ≤ 3}] since X and Y are independent
(given the set strategy).
Problem 25.
(a) In Fig. 3.3 I have plotted both P (X ≥ x) and P (X ≤ x) for this PMF. It is clear from this
figure that in the range [2, ∞), P (X ≤ x) ≥ 1/2, and that in the range (−∞, 2] P (X ≥ x) ≥
1/2. The only value that these ranges share in common is 2, and this is therefore the median
for this PMF.
[Figure 3.3: P(X ≤ x) and P(X ≥ x) for the PMF of part (a).]

[Figure 3.4: P(X ≤ x) and P(X ≥ x) for a fair die roll.]
(b) In Fig. 3.4 I have plotted both P (X ≥ x) and P (X ≤ x) for a die roll. It is clear from this
figure that in the range [3, ∞), P (X ≤ x) ≥ 1/2, and that in the range (−∞, 4] P (X ≥ x) ≥
1/2. The (not unique) medians for this distribution are the intersection of these 2 sets, which
is the interval [3, 4].
P(X ≤ x) = Σ_{k=1}^{k_x^u} p q^{k−1},
where q = 1 − p and k_x^u is the appropriate upper integer bound, which depends on the (not necessarily integer) value of x. By considering the staircase shape of P(X ≤ x), one can realize that for any x, P(X ≤ x) = P(X ≤ ⌊x⌋), which holds up until ⌈x⌉ (where ⌊·⌋ and ⌈·⌉ are defined as rounding down and up to the nearest integer respectively). Therefore, if we want to find the lowest value of x, x_0, for which P(X ≤ x_0) still equals P(X ≤ x), this occurs at the integer value x_0 = ⌊x⌋. ⌊x⌋ is therefore the appropriate value to use for k_x^u, and we can
P(X ≤ x) = P(X ≤ ⌊x⌋)
= p q^{−1} Σ_{k=1}^{⌊x⌋} q^k
= p q^{−1} [−q^0 + Σ_{k=0}^{⌊x⌋} q^k]
= p q^{−1} [−1 + Σ_{k=0}^{∞} q^k − Σ_{k=⌊x⌋+1}^{∞} q^k]
= p q^{−1} [−1 + Σ_{k=0}^{∞} q^k − q^{⌊x⌋+1} Σ_{k′=0}^{∞} q^{k′}]
= p q^{−1} [−1 + 1/(1 − q) − q^{⌊x⌋+1}/(1 − q)]
= 1 − q^{⌊x⌋}.
Any value m for which P(X ≤ m) ≥ 1/2 is a potential candidate for the median (but of course we still have to consider the values of x for which P(X ≥ x) ≥ 1/2), and the lowest value for which this occurs, call it ⌊m_L⌋, can now be found by setting P(X ≤ ⌊m_L⌋) = 1/2, resulting in:

⌊m_L⌋ = 1/log_2(1/q).
Similarly:
P(X ≥ x) = P(X ≥ ⌈x⌉)
= p q^{−1} Σ_{k=⌈x⌉}^{∞} q^k
= p q^{⌈x⌉−1} Σ_{k′=0}^{∞} q^{k′}
= p q^{⌈x⌉−1}/(1 − q)
= q^{⌈x⌉−1}.
Thus, the highest value for which P(X ≥ x) ≥ 1/2, call it ⌈m_U⌉, is found when this expression equals 1/2, resulting in:

⌈m_U⌉ = 1/log_2(1/q) + 1.

Therefore, for the geometric distribution, P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2 for m ∈ [⌊m_L⌋, ⌈m_U⌉]. This interval thus gives the (not unique) medians for the geometric distribution.
Chapter 4

Continuous and Mixed Random Variables
Problem 1.
(a) We recognize that this is a uniform random variable, so its CDF is:
F_X(x) = 0 for x < 2
       = (x − 2)/4 for 2 ≤ x ≤ 6
       = 1 for x > 6.
(b) For a uniform random variable, the expectation value is at the midpoint: E[X] = 2 + [(6 −
2)]/2 = 4.
Problem 2.
(a) By normalization, 1 = c ∫_0^∞ e^{−4x} dx, which leads to c = 4.
(b)

F_X(x) = 0 for x ≤ 0
       = ∫_0^x 4 e^{−4x′} dx′ = 1 − e^{−4x} for x > 0.
(c)
P(2 < X < 5) = ∫_2^5 4 e^{−4x} dx = e^{−8} − e^{−20}.
(d)

E[X] = 4 ∫_0^∞ x e^{−4x} dx
= [−x e^{−4x}]_0^∞ + ∫_0^∞ e^{−4x} dx
= 1/4,

where the limits in the first term were evaluated using L'Hôpital's rule.
Problem 3.
(a) Using LOTUS:
E[X^n (X^2 + 2/3)] = ∫_0^1 x^n (x^2 + 2/3) dx
= [1/(3 + n)] x^{3+n} |_0^1 + (2/3)[1/(n + 1)] x^{n+1} |_0^1
= 1/(3 + n) + (2/3) · 1/(n + 1) for n = 1, 2, 3, . . .
(b) We have already found E[X] and E[X 2 ] in the first part and thus:
Problem 4.
(a) For this problem, we have RX = [0, 1] and RY = [1/e, 1]. Thus in range of x = 0 to 1, the
CDF of Y is given by:
FY (y) = P (Y ≤ y)
= P (e−X ≤ y)
= P (X ≥ − ln y)
= 1 − FX (− ln y)
= 1 + ln y,
where I used the fact that for x ∈ [0, 1] for a uniform 0, 1 distribution, FX (x) = x. Therefore:
F_Y(y) = 0 for y < 1/e
       = 1 + ln(y) for 1/e ≤ y ≤ 1
       = 1 for y > 1.
(b)
f_Y(y) = dF_Y/dy = 0 for y < 1/e
                 = 1/y for 1/e ≤ y ≤ 1
                 = 0 for y > 1
(c)
E[Y] = ∫_{1/e}^{1} y · (1/y) dy = 1 − 1/e
Problem 5.
(a) The range of X and Y are RX = (0, 2] and RY = (0, 4], so that for y ∈ RY , we have
F_Y(y) = P(Y ≤ y)
= P(X^2 ≤ y)
= P(0 < X < √y)
= ∫_0^{√y} (5/32) x^4 dx
= (1/32) y^{5/2},

and therefore:

F_Y(y) = 0 for y ≤ 0
       = (1/32) y^{5/2} for 0 < y ≤ 4
       = 1 for y > 4.
(b)
f_Y(y) = dF_Y/dy = 0 for y ≤ 0
                 = (5/64) y^{3/2} for 0 < y ≤ 4
                 = 0 for y > 4
(c)
\[
E[Y] = \int_0^4 y \cdot \frac{5}{64}\, y^{3/2}\,dy \approx 2.9.
\]
Problem 6. We can convert the PDF for X to the PDF for Y using the method of transformations:
\[
f_Y(y) = f_X\!\left(\frac{y}{\alpha}\right)\left|\frac{d(y/\alpha)}{dy}\right| = \begin{cases} \frac{\lambda}{\alpha}\, e^{-\frac{\lambda}{\alpha} y} & \text{for } y > 0 \\ 0 & \text{otherwise}, \end{cases}
\]
which is the PDF of an Exp(λ/α) random variable.
Problem 7.
(a)
\[
\begin{aligned}
E[X^n] &= \int_0^\infty x^n \lambda e^{-\lambda x}\,dx \\
&= \left[-x^n e^{-\lambda x}\right]_0^\infty + \frac{n}{\lambda}\int_0^\infty x^{n-1}\lambda e^{-\lambda x}\,dx \\
&= \frac{n}{\lambda}\, E[X^{n-1}],
\end{aligned}
\]
where the first term evaluated to zero by repeated application of L’Hopital’s rule.
(b) We can use several properties of the Gamma function to prove this relation:
Z ∞
E[X n ] = xn λe−λx dx
0
λ
= Γ(n + 1)
λn+1
n!
= n,
λ
where in the second equality I used the second property of the Gamma function given in the
book, and in the third equality I used the fourth property of the Gamma function given in
the book.
Problem 8.
(a)
\[
P(X > 0) = 1 - \Phi\!\left(\frac{0-3}{3}\right) \approx 0.84
\]
(b)
\[
P(-3 < X < 8) = \Phi\!\left(\frac{8-3}{3}\right) - \Phi\!\left(\frac{-3-3}{3}\right) \approx 0.93
\]
(c)
\[
P(X > 5 \mid X > 3) = \frac{P(X > 5,\, X > 3)}{P(X > 3)} = \frac{P(X > 5)}{P(X > 3)} = \frac{1 - \Phi\!\left(\frac{5-3}{3}\right)}{1 - \Phi\!\left(\frac{3-3}{3}\right)} \approx 0.50
\]
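The three probabilities can be checked directly with scipy (a sketch I added; X ∼ N(3, 9), so the standard deviation is 3):
\begin{verbatim}
# Check the N(3, 9) probabilities of Problem 8.
from scipy.stats import norm

X = norm(loc=3, scale=3)
print(1 - X.cdf(0))                      # (a) ~ 0.84
print(X.cdf(8) - X.cdf(-3))              # (b) ~ 0.93
print((1 - X.cdf(5)) / (1 - X.cdf(3)))   # (c) ~ 0.50
\end{verbatim}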
Problem 9. By Theorem 4.3 in the book, if X ∼ N (3, 9), and Y = 5−X, then Y ∼ N (−3+5, 9) =
N (2, 9).
(a)
\[
P(X > 2) = 1 - \Phi\!\left(\frac{2-3}{3}\right) \approx 0.63
\]
(b)
\[
P(-1 < Y < 3) = \Phi\!\left(\frac{3-2}{3}\right) - \Phi\!\left(\frac{-1-2}{3}\right) \approx 0.47
\]
(c)
Problem 10. I first note that RX = R, and RY = [0, ∞). The range of X can be partitioned into
2 regions, X ≤ 0 and X > 0 which are strictly decreasing, and increasing respectively, where the
corresponding inverse transformation back to X for both of these regions is:
(
−Y 2 for X ≤ 0
X=
Y2 for X > 0.
Therefore:
\[
f_Y(y) = f_X(y^2)\left|\frac{d(y^2)}{dy}\right| + f_X(-y^2)\left|\frac{d(-y^2)}{dy}\right| = \frac{4}{\sqrt{2\pi}}\, y\, e^{-\frac{y^4}{2}} \quad \text{for } y \ge 0,
\]
which, as a sanity check, I made sure analytically integrates to unity over the range 0 to infinity.
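The same sanity check can be done numerically; the sketch below (my addition) integrates the derived density and compares a Monte Carlo estimate of E[Y] against it:
\begin{verbatim}
# Check that f_Y integrates to 1 and that its mean matches simulation,
# where Y = sqrt(|X|) with X ~ N(0, 1).
import numpy as np
from scipy.integrate import quad

f_Y = lambda y: 4 / np.sqrt(2 * np.pi) * y * np.exp(-y**4 / 2)
print(quad(f_Y, 0, np.inf)[0])                    # ~ 1

rng = np.random.default_rng(1)
y = np.sqrt(np.abs(rng.standard_normal(1_000_000)))
print(y.mean(), quad(lambda t: t * f_Y(t), 0, np.inf)[0])
\end{verbatim}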
Problem 11.
(a)
Z ∞
P (X > 2) = 2 e−2x = e−4
2
Z ∞
E[Y ] = (2 + 3x)2e−2x dx
0
Z ∞ Z ∞
−2x
=2 2e dx + 3 2xe−2x dx
0 0
= 2 + 3E[X]
3
=2+
2
7
= ,
2
where I have used the fact that E[X] for Exp(λ) is 1/λ (as computed in the book). To
compute V ar[Y ], I first must compute E[Y 2 ], which I do using LOTUS:
Z ∞
E[Y 2 ] = (2 + 3x)2 2e−2x dx
0
Z ∞ Z ∞ Z ∞
−2x −2x
=4 2e dx + 12 2xe dx + 9 2x2 e−2x dx
0 0 0
= 4 + 12E[X] + 9E[X 2 ]
12 9 · 2
=4+ +
2 4
29
= ,
2
where I have used the fact that E[X 2 ] for Exp(λ) is 2/(λ2 ) (as computed in the book).
Finally, the variance is:
9
V ar[Y ] = E[Y 2 ] − E[Y ]2 = .
4
(c)
Problem 12. The equations defining the median for a continuous variable, P (X < m) = 1/2 and
P (X ≥ m) = 1/2, are actually equivalent. That is, P (X < m) = 1/2 ⇔ P (X ≥ m) = 1/2 (which
can easily be verified), so we can use whichever is convenient. Since we know the CDFs for the
desired distributions, so using the condition that P (X < m) = 1/2 will be most convenient.
P (W < m) = P (W ≤ m)
m−µ
=Φ
σ
1
= .
2
Since the standard normal is symmetric about 0, this implies that Φ(0) = 1/2, and therefore
(m − µ)/σ = 0, and thus m = µ. Since we knew that a Gaussian is symmetric about its
mean, this is what we expected.
Problem 13.
(a) See Fig. 4.1 for a plot of the CDF. X is a mixed random variable because there is a jump in
the CDF at x = 1/4 (indicating a probability “point mass” at 1/4 of P (X = 1/4) = 1/2) and
the CDF does not exhibit the staircase shape associated with only discrete random variables.
(b)
1 1
P X≤ = FX
3 3
1 1
= +
3 2
5
=
6
Figure 4.1: The CDF FX(x) of the mixed random variable X.
(c)
1 1
P X≥ =1−P X <
4 4
1 1
=1− P X ≤ −P X =
4 4
1 1
= 1 − FX −P X =
4 4
1 1 1
=1− + −
4 2 2
3
=
4
(d) The CDFs for both the discrete and continuous contributions can be written piece-wise as:
\[
D(x) = \begin{cases} 0 & \text{for } x < \tfrac{1}{4} \\ \tfrac{1}{2} & \text{for } x \ge \tfrac{1}{4}, \end{cases}
\]
and
\[
C(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } 0 \le x \le \tfrac{1}{2} \\ \tfrac{1}{2} & \text{for } x > \tfrac{1}{2}. \end{cases}
\]
These functions can be re-written using the unit step function:
\[
D(x) = \frac{1}{2}\, u\!\left(x - \frac{1}{4}\right),
\]
and
\[
C(x) = x\,u(x) - x\,u\!\left(x - \frac{1}{2}\right) + \frac{1}{2}\, u\!\left(x - \frac{1}{2}\right),
\]
where, in C(x), I have started subtracting off the linear equation at x = 1/2, and adding a
constant 1/2 at x = 1/2 so as to keep the function flat at 1/2 after x = 1/2.
(e) Since C(x) increases linearly from 0 to 1/2, and has total probability mass 1/2, we expect
c(x) to be uniform with height 1 over the range 0 to 1/2. Differentiating C(x):
\[
\begin{aligned}
c(x) &= \frac{d}{dx}\, C(x) \\
&= u(x) + x\,\delta(x) - u\!\left(x - \tfrac{1}{2}\right) - x\,\delta\!\left(x - \tfrac{1}{2}\right) + \tfrac{1}{2}\,\delta\!\left(x - \tfrac{1}{2}\right) \\
&= u(x) - u\!\left(x - \tfrac{1}{2}\right),
\end{aligned}
\]
which is exactly what we had anticipated. Here I used the fact that for x ≠ 1/2 both xδ(x − 1/2) and (1/2)δ(x − 1/2) equal 0, while at x = 1/2 we have xδ(x − 1/2) = (1/2)δ(0), so the terms −xδ(x − 1/2) and (1/2)δ(x − 1/2) cancel. Also, in either case, xδ(x) = 0.
(f)
\[
\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x\, c(x)\,dx + \sum_k x_k a_k \\
&= \int_0^{1/2} x\,dx + \frac{1}{4}\, P\!\left(X = \frac{1}{4}\right) \\
&= \left[\frac{x^2}{2}\right]_0^{1/2} + \frac{1}{4}\cdot\frac{1}{2} \\
&= \frac{1}{4}
\end{aligned}
\]
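A quick simulation (my addition) confirms E[X] = 1/4: with probability 1/2 take the point mass at 1/4, otherwise draw from the uniform continuous part on (0, 1/2).
\begin{verbatim}
# Monte Carlo check of E[X] for the mixed random variable of Problem 13.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
discrete = rng.random(n) < 0.5            # point mass P(X = 1/4) = 1/2
x = np.where(discrete, 0.25, rng.uniform(0, 0.5, size=n))
print(x.mean())                           # ~ 0.25
\end{verbatim}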
Problem 14.
(c)
\[
\begin{aligned}
E[X^2] &= \frac{1}{2}\int_{-\infty}^{\infty} x^2\, \delta\!\left(x - \frac{1}{4}\right)dx + \int_0^{1/2} x^2\,dx \\
&= \frac{1}{2}\cdot\frac{1}{16} + \left[\frac{x^3}{3}\right]_0^{1/2} \\
&= \frac{7}{96}
\end{aligned}
\]
\[
\Longrightarrow \quad Var[X] = E[X^2] - E[X]^2 = \frac{7}{96} - \frac{1}{16} = \frac{1}{96}.
\]
Problem 15.
(a) From the form of the given generalized PDF, it is clear that there are 2 probability point
masses at x = 1 and x = −2 (with P (X = 1) = 1/6 and P (X = −2) = 1/3), as well as a
continuous random variable contribution from a Gaussian PDF. Since the continuous PDF
contributes 0 probability at specific points, P (X = 1) = 1/6 and P (X = −2) = 1/3.
(b)
Z ∞
1 1 1 1 − x2
P (X ≥ 1) = δ(x + 2) + δ(x − 1) + √ e 2 dx
1 3 6 2 2π
Z ∞
1 1 1 x2
= + √ e− 2 dx
6 2 1 2π
1 1
= + [1 − Φ(1)]
6 2
≈ 0.25
(c)
P (X = 1, X ≥ 1)
P (X = 1|X ≥ 1) =
P (X ≥ 1)
P (X = 1)
=
P (X ≥ 1)
1
6
= 1
6 + 12 [1 − Φ(1)]
≈ 0.68
(d) We can calculate E[X] by explicitly integrating over the generalized PDF:
Z ∞
1 1 1 1 x2
E[X] = x δ(x + 2) + δ(x − 1) + √ e− 2 dx
−∞ 3 6 2 2π
Z ∞
1 1 1 1 x2
= (−2) + (1) + x √ e− 2 dx
3 6 2 −∞ 2π
−2 1 1
= + + ·0
3 6 2
1
=− ,
2
where the integral in the second line is equal to zero, since this is just the mean of a standard
normal distribution.
We can also calculate E[X 2 ] by explicitly integrating over the generalized PDF:
Z ∞
2 2 1 1 1 1 − x2
E[X ] = x δ(x + 2) + δ(x − 1) + √ e 2 dx
−∞ 3 6 2 2π
Z ∞
1 1 1 1 x 2
= (4) + (1) + x2 √ e− 2 dx
3 6 2 −∞ 2π
4 1 1
= + + (1)
3 6 2
= 2,
where the integral in the second line is equal to 1, since this is just the variance of a standard
normal distribution.
Thus:
\[
Var[X] = E[X^2] - E[X]^2 = 2 - \frac{1}{4} = \frac{7}{4}.
\]
Problem 16.
(a) Let D denote the event that the device is defective, and let P (D) = pd = 0.02. By the law of
total probability, we have:
FX (x) = P (X ≤ x)
= P (X ≤ x|D)P (D) + P (X ≤ x|Dc )P (Dc )
= u(x)pd + (1 − e−λx )(1 − pd )u(x).
where I have used the fact that at x 6= 0, δ(x) = 0, and at x = 0, (1 − exp (−λx)) = 0, so that
the term, (1−e−λx )δ(x) is 0 for all x. We also could have written this PDF down immediately
by realizing that there is a probability point mass at x = 0 with total probability 0.02, and
there is a continuous probability contribution from the exponential distribution which must
integrate to 1 − 0.02 = 0.98.
(b)
Z ∞h i
P (X ≥ 1) = pd δ(x) + (1 − pd )u(x)λe−λx dx
1
Z ∞
= (1 − pd ) λe−λx dx
1
−λ
= (1 − pd )e
= (0.98)e−2
≈ 0.133
(c)
P (X > 2, X ≥ 1)
P (X > 2|X ≥ 1) =
P (X ≥ 1)
P (X > 2)
=
P (X ≥ 1)
R ∞ −λx
e dx
= R2∞ −λx
1 e dx
= e−λ
= e−2
≈ 0.135
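Before computing the moments, here is a small Monte Carlo sketch (my addition, with pd = 0.02 and λ = 2) that reproduces the numbers in parts (b) and (c) and previews E[X]:
\begin{verbatim}
# Simulate the mixed lifetime: point mass at 0 with prob 0.02, else Exp(2).
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
defective = rng.random(n) < 0.02
x = np.where(defective, 0.0, rng.exponential(scale=1/2, size=n))
print((x >= 1).mean(), 0.98 * np.exp(-2))          # P(X >= 1) ~ 0.133
print((x > 2).sum() / (x >= 1).sum(), np.exp(-2))  # P(X > 2 | X >= 1) ~ 0.135
print(x.mean())                                    # E[X] ~ 0.49
\end{verbatim}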
Z ∞ h i
E[X] = x pd δ(x) + (1 − pd )u(x)λe−λx dx
−∞
Z ∞
= (1 − pd ) xλe−λx dx
0
1
= (1 − pd )
λ
1
= (0.98)
2
= 0.49,
where I have used the fact that E[X] = 1/λ for an exponential distribution.
where I have used the fact that E[X 2 ] = 2/λ2 for an exponential distribution. Therefore, the
variance is:
Problem 17.
(a) We realize that for Lap(0, 1), fX (x) is an even function, while x is an odd function and
therefore E[X] = 0. Also,
Z ∞
1
2
E[X ] = 2 x2 e−x dx = 2,
0 2
where I have used the fact that since we are integrating an even function times an even
function we need only integrate from 0 to ∞ and multiply by 2. I have also used the fact
that the integrand is E[X 2 ] of an Exp(1) distribution, and we know this integral evaluates
to 2/λ2 . Therefore V ar[X] = 2.
y−µ d y−µ
fY (y) = fX
b dy b
(
1
exp y−µ 1
for y−µ
b <0
= 12 b y−µb
1 y−µ
2 exp − b b for b ≥ 0
(
1
exp y−µ for y < µ
= 2b1
b y−µ
2b exp − b for y ≥ µ,
(c) Since
E[Y ] = E[bX + µ]
= bE[X] + µ
= µ,
and
Problem 18. We see firstly, that RX = R and RY = [0, ∞). Also note that Y = |X| = −X
for X < 0 and X for X ≥ 0. I use the method of transformations, breaking fX (x) into 2 strictly
(1) (2)
monotonic regions. Let fX (x) = (1/2b) exp (x/b) and fX (x) = (1/2b) exp (−x/b), then:
(1) dy (2) d(−y)
fY (y) = fX (−y) + fX (y)
dy dy
1 y 1 y
= exp − + exp −
2b b 2b b
1 y
= exp − ,
b b
which is Exp(1/b).
Z 0 Z ∞
1 x 1 x
E[X] = 2
dx + dx
−∞ π 1 + x 0 π 1 + x2
Z 1 Z ∞
1 du 1 du
= +
∞ 2π u 1 2π u
1 1 1 ∞
= ln(1 + x2 ) + ln(1 + x2 )
2π ∞ 2π 1
= −∞ + ∞,
Z ∞
2 1 x2
E[X ] = 2
dx
−∞ π 1 + x
Z
2 ∞ x2
= dx
π 0 1 + x2
2
= [x − arctan(x)]∞ 0
π
2
= lim [x − arctan(x)]
π x→∞
2 2 π
= lim x − ·
π x→∞ π 2
=∞
Problem 20.
where I have use the fact that the term in the brackets is the same integral one must compute
to find the variance of a 0, σ 2 normal distribution.
Z x2
2σ 2 x2
FX (x) = e−u du = 1 − e− 2σ2 .
0
(c) The range of both X and Y is [0, ∞). Therefore, for all y ∈ [0, ∞):
2
y2 d y
fY (y) = fX
2σ 2 dy 2σ 2
y − y22
e 2σ for y ≥ 0
= σ2
0 for y < 0,
which is Rayleigh(σ).
Problem 21.
(b)
P (X > 3xm , X > 2xm )
P (X > 3xm |X > 2xm ) =
P (X > 2xm )
P (X > 3xm )
=
P (X > 2xm )
1 − P (X ≤ 3xm )
=
1 − P (X ≤ 2xm )
1 − FX (3xm )
=
1 − FX (2xm )
h α i
xm
1 − 1 − 3x m
= h α i
xm
1 − 1 − 2xm
α
2
=
3
Problem 22.
(a)
FX (x) = P (X ≤ x)
= P (eσZ+µ ≤ x)
ln x − µ
=P Z≤
σ
ln x − µ
=Φ
σ
u ≡ ln x − µ,
=⇒
1
du ≡ dx = e−(u+µ) dx,
x
so that:
Z ∞
1 1 u2
E[X] = √ exp − · 2 exp (u + µ)du
2πσ −∞ 2 σ
Z ∞
σ2 1 1 2 2
= exp µ + √ exp − 2 (u − σ ) du
2 2πσ −∞ 2σ
2
σ
= exp µ + .
2
To go from the first equality to the second, I used the “completing the squares” trick to make
the exponent in the form of a Gaussian√ for easy integration. In going from the second equality
to the third, I used the fact that 1/( 2πσ) times the integral evaluates to 1 since this is just
a σ 2 , σ 2 Gaussian.
The expectation value of X 2 is
Z " 2 #
∞
1 1 ln x − µ
E[X 2 ] = √ x exp − ,
2πσ 0 2 σ
Chapter 5
Joint Distributions
Problem 1.
(a)
\[
P(X \le 2,\, Y > 1) = 0 + \frac{1}{12} = \frac{1}{12}
\]
(b)
\[
P_X(x) = \begin{cases} \frac{1}{3} + \frac{1}{12} & \text{for } x = 1 \\ \frac{1}{6} + 0 & \text{for } x = 2 \\ \frac{1}{12} + \frac{1}{3} & \text{for } x = 4 \end{cases}
= \begin{cases} \frac{5}{12} & \text{for } x = 1 \\ \frac{1}{6} & \text{for } x = 2 \\ \frac{5}{12} & \text{for } x = 4 \end{cases}
\]
\[
P_Y(y) = \begin{cases} \frac{1}{3} + \frac{1}{6} + \frac{1}{12} & \text{for } y = 1 \\ \frac{1}{12} + 0 + \frac{1}{3} & \text{for } y = 2 \end{cases}
= \begin{cases} \frac{7}{12} & \text{for } y = 1 \\ \frac{5}{12} & \text{for } y = 2 \end{cases}
\]
(c)
\[
P(Y = 2 \mid X = 1) = \frac{P(Y = 2,\, X = 1)}{P(X = 1)} = \frac{1/12}{5/12} = \frac{1}{5}
\]
\[
P(Y = 2 \mid X = 1) = \frac{1}{5} \ne P_Y(2) = \frac{5}{12} \;\Longrightarrow\; X \text{ and } Y \text{ are not independent.}
\]
Problem 2.
(a) The ranges of X, Y and Z are: RX = {1, 2, 4}, RY = {1, 2} and RX = {−3, −2, −1, 0, 2}.
The mapping g(x, y) = x − 2y (where g : RX × RY → RZ ) is explicitly given by:
(1, 1) → −1
(1, 2) → −3
(2, 1) → 0
(2, 2) → −2
(4, 1) → 2
(4, 2) → 0.
P (X − 2Y = −3) for z = −3
P (X − 2Y = −2) for z = −2
PZ (z) = P (X − 2Y = −1) for z = −1
P (X − 2Y = 0) for z =0
P (X − 2Y = 2) for z =2
PXY (1, 2) for z = −3
= −2
PXY (2, 2) for z
= PXY (1, 1) for z = −1
PXY (2, 1) + PXY (4, 2) for z =0
PXY (4, 1) for z =2
1
for z = −3
12
0 for z = −2
= 31 for z = −1
1 for z = 0
2
1
12 for z = 2,
(b)
P (X = 2|Z = 0) = P (X = 2|X − 2Y = 0)
= P (X = 2|X = 2Y )
P (X = 2, X = 2Y )
=
P (X = 2Y )
P (X = 2, Y = 1)
=
P (X = 2, Y = 1) + P (X = 4, Y = 2)
1
=
3
Problem 3. Let A be the event that the first coin we pick is the fair coin. We can find the joint
PMF by conditioning on this event, and realizing that once conditioned, X and Y are independent
(i.e., X and Y are conditionally independent given A):
where Pp (z) is the PMF associated with a Bern(p) trial. This PMF can be written conveniently
as Pp (z) = pz (1 − p)1−z , so that the joint PMF is
\[
P_{XY}(x, y) = \frac{1}{2}\, P_{1/2}(x)\, P_{2/3}(y) + \frac{1}{2}\, P_{2/3}(x)\, P_{1/2}(y)
= \frac{1}{4}\left(\frac{2}{3}\right)^y\!\left(\frac{1}{3}\right)^{1-y} + \frac{1}{4}\left(\frac{2}{3}\right)^x\!\left(\frac{1}{3}\right)^{1-x}.
\]
Evaluating this for all (x, y) gives the following table:

            Y = 0    Y = 1
   X = 0     1/6      1/4
   X = 1     1/4      1/3
To check if X and Y are independent I first check (x, y) = (0, 0). Adding the table horizontally
and vertically, the marginalized PMFs at these values are PX (0) = 5/12 and PY (0) = 5/12, and
thus PX (0)PY (0) = 25/144 6= 1/6, so X and Y are not independent.
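The table and the independence check can be reproduced exactly (a sketch I added, assuming one fair coin and one coin with P(H) = 2/3, which is consistent with the table above):
\begin{verbatim}
# Rebuild the joint PMF table of Problem 3 and recheck independence.
from fractions import Fraction as F

def bern(p, z):                    # Bern(p) PMF
    return p if z == 1 else 1 - p

P = {(x, y): F(1, 2) * bern(F(1, 2), x) * bern(F(2, 3), y)
           + F(1, 2) * bern(F(2, 3), x) * bern(F(1, 2), y)
     for x in (0, 1) for y in (0, 1)}
PX0 = P[(0, 0)] + P[(0, 1)]
PY0 = P[(0, 0)] + P[(1, 0)]
print(P)                           # 1/6, 1/4, 1/4, 1/3
print(PX0 * PY0, P[(0, 0)])        # 25/144 vs 1/6 -> not independent
\end{verbatim}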
Problem 4.
X∞
1
PX (k) = k+l
2
l=1
" ∞ l
#
1 X 1
= k −1 +
2 2
l=0
1
= k,
2
and by symmetry
1
PY (l) = ,
2l
We then have that:
1
PXY (k, l) =
2k+l
1
=
2k 2l
1 1
= k· l
2 2
= PX (k)PY (l) ∀ (k, l) ∈ N × N,
(b) We can easily enumerate all pairs of (x, y) that satisfy this inequality:
P (X 2 + Y 2 ≤ 10) = PXY (1, 1) + PXY (1, 2) + PXY (1, 3) + PXY (2, 1) + PXY (2, 2) + PXY (3, 1)
1 1 1 1 1 1
= + 2
+ 3
+ 2 + 2 2+ 3
2·2 2·2 2·2 2 ·2 2 ·2 2 ·2
11
= .
16
Problem 5.
(a)
\[
P_{X|Y}(x \mid 1) = \begin{cases} \dfrac{1/3}{1/3 + 1/6 + 1/12} & \text{for } x = 1 \\[4pt] \dfrac{1/6}{1/3 + 1/6 + 1/12} & \text{for } x = 2 \\[4pt] \dfrac{1/12}{1/3 + 1/6 + 1/12} & \text{for } x = 4 \end{cases}
= \begin{cases} \frac{4}{7} & \text{for } x = 1 \\ \frac{2}{7} & \text{for } x = 2 \\ \frac{1}{7} & \text{for } x = 4 \end{cases}
\]
(b)
4 2 1 12
E[X|Y = 1] = (1) · + (2) · + (4) · =
7 7 7 7
(c)
12 2 4 12 2 2 12 2 1 52
V ar[X|Y = 1] = 1 − · + 2− · + 4− · =
7 7 7 7 7 7 49
Problem 6. We know that X ∼ P ois(10) and since each customer is female independent of the
other customers, if the total number of customers in an hour is n, then the total number of female
customers in an hour is the sum of n independent Bernoulli random variables. In other words,
Y |X = n ∼ Bin(n, 3/4). Therefore, the joint PMF is:
P (X = n, Y = y) = P (Y = y|X = n)P (X = n)
y n−y n −10
n 3 1 10 e
= .
y 4 4 n!
Problem 7. We know that for a Geom(p) distribution the mean is 1/p and the variance is (1 −
p)/p2 , so we should expect these answers. We can find the mean by conditioning on the first “toss”:
where E[X|H] = 1 since if we know the first toss is a heads, the experiment is done so that the
mean is 1, and E[X|H c ] = (1 + E[X]) since if the first toss is a tails, then we’ve wasted 1 toss, and
since the geometric distribution is memoryless, it starts over at the next toss. Solving this equation
for E[X] we find that E[X] = 1/p, which is what we expected.
We can solve for E[X 2 ] in a similar manner:
where E[X 2 |H] = 1 for the same reason as above, and E[X 2 |H c ] = E[(1 + X)2 ] since, as above,
we’ve wasted 1 toss on the first toss, and then the experiment starts over on the second. Solving this
equation, I find E[X 2 ] = (2 − p)/p2 . The variance is thus: V ar[X] = E[X 2 ] − E[X]2 = (1 − p)/p2 ,
which is what we expected.
iid
Problem 8. If X, Y ∼ Geom(p), the we can easily find the joint PMF and use LOTUS to solve
for the expectation. The joint PMF is:
PXY (x, y) = PX (x)PY (y) = p(1 − p)x−1 p(1 − p)y−1 for x, y = 1, 2, . . . , (5.1)
where I have multiplied the marginal PMFs since X and Y are independent. Using LOTUS:
Figure 5.1: A visual representation of the set C for Problems 9 and 10.
X ∞ 2
∞ X
X2 + Y 2 x + y2
E = p(1 − p)x−1 p(1 − p)y−1
XY xy
x=1 y=1
∞ X
X ∞ ∞ X
X ∞
x y
= p(1 − p)x−1 p(1 − p)y−1 + p(1 − p)x−1 p(1 − p)y−1
y x
x=1 y=1 y=1 x=1
X∞ X ∞
x
=2 p(1 − p)x−1 p(1 − p)y−1
y
x=1 y=1
∞
X ∞
X
x−1 1
=2 xp(1 − p) p(1 − p)y−1 ,
y
x=1 y=1
where going from the second to third line we realize that due to the symmetry, both of the sums
are the same. In the last line the first sum is just the mean of a Geom(p) distribution (1/p). We
can simplify the second sum by utilizing the following Taylor expansion:
∞
X xk
− ln(1 − x) = for |x| < 1.
k
k=1
Problem 9. To better understand what is in the set C, note that C is the set of (x, y) ∈
Z × Z, such that y ≤ 2 − x2 for y ≥ 0 and y ≥ x2 − 2 for y < 0. To visualize C, I plot
the set Z × Z as grey points in Fig. 5.1 as well as the lines y = 2 − x2 and y = x2 − 2. The
shaded grey region (and the lines themselves) represents the region satisfying the 2 conditions,
and thus any grey point in this region (or on the lines) is in C. Therefore, more explicitly,
C = {(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (0, −1), (1, −1), (0, −2), (−1, 0), (−1, 1), (−1, −1)}.
(b) Since there are 3 points at Y = 1, and each point is equally as likely, the total probability mass
at Y = 1 is 3/11, while the total probability mass at (−1, 1), (0, 1), (1, 1) is 1/11 respectively.
Therefore:
1
3 for x = −1
PX|Y (x|1) = 31 for x = 0
1
3 for x = 1.
(c) X and Y are not independent since, for example, at X = −1: P (X = −1|Y = 1) = 1/3 6=
P (X = −1) = 3/11.
where only 4 points contribute to the sum (since the rest have zeros).
Problem 10.
(a)
X 1 1 1
E[X|Y = 1] = xPX|Y (x|1) = (−1) · + (0) · + (1) · = 0
3 3 3
x∈RX|Y =1
(b)
X 1 1 1 2
V ar[X|Y = 1] = x2 PX|Y (x|1) = (1) · + (0) · + (1) · =
3 3 3 3
x∈RX|Y =1
(c) One can easily see that the PMF, PX||Y |≤1 (x) is exactly the same as the PMF for PX|Y (x|1),
and therefore the expectation and variance will be the same, thus E[X||Y | ≤ 1] = 0.
(d) For the same reason as part c of this problem E[X 2 ||Y | ≤ 1] = 2/3.
Problem 11. If there are n cars in the shop, then X = X1 +X2 +. . .+Xn , where Xi is a Bern(3/4)
random variable (as specified in the problem), and where X1 , X2 , . . . , Xn are all independent (as
specified in the problem). Thus we have that X|N = n ∼ Bin(n, 3/4) and for the same reason,
Y |N = n ∼ Bin(n, 1/4).
(a) Noting that RX = RY = {0, 1, 2, 3}, we can use the law of total probability to find both of
the marginal PMFs, which are:
3
X
PX (x) = P (X = x|N = n)PN (n)
n=0
X3 x n−x
n 3 1
= PN (n),
x 4 4
n=0
and
3
X
PY (y) = P (Y = y|N = n)PN (n)
n=0
X3 y n−y
n 1 3
= PN (n).
y 4 4
n=0
I compute both of these PMFs numerically to find:
0.180 for x = 0
0.258 for x = 1
PX (x) ≈ 0.352 for x = 2
0.211 for x = 3
0 otherwise,
and
0.570 for y = 0
0.336 for y = 1
PY (y) ≈ 0.086 for y = 2
0.008 for y = 3
0 otherwise,
which, as a sanity check, both add up to approximately 1. We see that since a 4 door car is
more likely than a 2 door car, the marginalized PMF for the 4 door cars skews towards high
numbers, while the marginalized PMF for the 2 door cars skews towards lower numbers.
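The PMF of N is not restated in this solution, so the sketch below (my addition) assumes PN = (1/8, 1/8, 1/4, 1/2) on {0, 1, 2, 3}, the choice consistent with the numbers quoted above; with that assumption the marginals are reproduced as follows.
\begin{verbatim}
# Recompute P_X and P_Y by conditioning on N (P_N is assumed, see text).
import numpy as np
from scipy.stats import binom

PN = {0: 1/8, 1: 1/8, 2: 1/4, 3: 1/2}
PX = [sum(binom.pmf(x, n, 3/4) * PN[n] for n in PN) for x in range(4)]
PY = [sum(binom.pmf(y, n, 1/4) * PN[n] for n in PN) for y in range(4)]
print(np.round(PX, 3))   # ~ [0.180, 0.258, 0.352, 0.211]
print(np.round(PY, 3))   # ~ [0.570, 0.336, 0.086, 0.008]
\end{verbatim}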
(b) We can find the joint PMF for X and Y by conditioning on N and using the law of total
probability:
3
X
PXY (x, y) = P (X = x, Y = y|N = n)PN (n),
n=0
where we can get rid of the sum because the probability is 0 if x + y 6= n:
PXY (x, y) = P (X = x, Y = y|N = x + y)PN (x + y)
= P (X = x|Y = y, N = x + y)P (Y = y|N = x + y)PN (x + y)
= P (Y = y|N = x + y)PN (x + y)
y x
x+y 1 3
= PN (x + y),
y 4 4
where in the second line I have used the chain rule of probability, in the third I have used the
fact that given Y = y and N = x + y, we are sure that X = x, and in the fourth line I have
used the fact that Y |N = x + y ∼ Bin(x + y, 1/4). I compute the joint PMF numerically and
present the results in the following table:
Y =0 Y =1 Y =2 Y =3
X=3 0.211 0 0 0
(c) X and Y are not independent since PXY (x, y) 6= PX (x)PY (y) ∀x, y. For example PXY (0, 0) =
0.125, while PX (0)PY (0) ≈ 0.180 · 0.570 = 0.103.
Problem 12. I first note that RX = RY = {1, 2, 3, 4, 5} and RZ = {−4, −3, . . . , 3, 4}. I can find
PZ (z) by conditioning on either X or Y and by using independence:
PZ (z) = P (Z = z)
= P (Y = X − z)
5
X
= P (Y = X − z|X = x)PX (x)
x=1
X5
1
= P (Y = x − z|X = x)
5
x=1
5
1X
= P (Y = x − z)
5
x=1
X5
1
= 1{x − z ∈ RY },
25
x=1
where in going from the fourth to fifth line I used independence. Thus we have:
1
for z = −4
25
2
for z = −3
25
3
for z = −2
25
4
for z = −1
25
PZ (z) = 5
for z =0
25
4
for z =1
25
3
25 for z =2
2
for z =3
25
1
25 for z = 4.
Problem 13.
(a)
( (
1 1 1 11
6 + 6 + 8 for x = 0 24 for x = 0
PX (x) = 1 1 1
= 13
8 + 6 + 4 for x = 1 24 for x = 1
1 1 7
6 + 8 for y = 0
24 for y = 0
PY (y) = 16 + 1
for y = 1 = 1
for y = 1
1
6
1
33
8 + 4 for y = 2 8 for y = 2
(b)
1 (
161 for x = 0 4
for x = 0
+8 7
PX|Y (x|0) = 6
1 =
8
for x = 1
3
7 for x = 1
1
6
+ 81
1 (
161 for x = 0 1
for x = 0
+6 2
PX|Y (x|1) = 6
1 =
6
for x = 1
1
2 for x = 1
1
6
+ 61
1 (
181 for x = 0 1
for x = 0
+4 3
PX|Y (x|2) = 8
1 =
4
for x = 1
2
3 for x = 1
1
8
+ 41
E[X|Y = 0] with probability PY (0)
Z = E[X|Y = 1] with probability PY (1)
E[X|Y = 2] with probability PY (2),
or in other words:
PY (0) for z = E[X|Y = 0]
PZ (z) = PY (1) for z = E[X|Y = 1]
PY (2) for z = E[X|Y = 2].
We already know the marginal PMF of Y , and thus what is left to calculate is E[X|Y = y]
for all y ∈ RY :
X
E[X|Y = 0] = xPX|Y (x|0)
x∈RX
4 3
=0 +1
7 7
3
= ,
7
X
E[X|Y = 1] = xPX|Y (x|1)
x∈RX
1 1
=0 +1
2 2
1
= ,
2
X
E[X|Y = 2] = xPX|Y (x|2)
x∈RX
1 2
=0 +1
3 3
2
= .
3
Finally, we have that:
7
24 for z = 37
1
PZ (z) = for z = 12
3
3
8 for z = 23 .
(d) For this problem we are checking that the law of iterated expectations holds. That is, we
need to check explicitly that E[X] = E[E[X|Y ]], where the outer expectation on the RHS is
over Y . Computing the LHS I have:
X
11 13 13
E[X] = xPX (x) = 0 +1 = .
24 24 24
x∈RX
E[Z] = EY [E[X|Y ]]
X
= E[X|Y = y]PY (y)
y∈RY
3 7 1 1 2 3
= + +
7 24 2 3 3 8
13
= .
24
The LHS and RHS agree, and thus the law of iterated expectations holds.
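The same check can be scripted (my addition); the joint PMF entries below are read off from the sums quoted in part (a) and should be treated as a reconstruction of the book's table.
\begin{verbatim}
# Verify E[X] = E[E[X|Y]] from the joint PMF of Problem 13.
from fractions import Fraction as F

P = {(0, 0): F(1, 6), (0, 1): F(1, 6), (0, 2): F(1, 8),
     (1, 0): F(1, 8), (1, 1): F(1, 6), (1, 2): F(1, 4)}
PY = {y: P[(0, y)] + P[(1, y)] for y in (0, 1, 2)}
EX = sum(x * p for (x, _), p in P.items())
EX_given_Y = {y: P[(1, y)] / PY[y] for y in PY}    # X is 0/1 valued
print(EX, sum(EX_given_Y[y] * PY[y] for y in PY))  # both 13/24
\end{verbatim}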
(e)
X
E[Z 2 ] = z 2 PZ (z)
z∈RZ
2 2 2
3 7 1 1 2 3
= + +
7 24 2 3 3 8
17
=
56
=⇒ 2
2 17 2 13 41
V ar[Z] = E[Z ] − E[Z] = − =
56 24 4032
Problem 14.
8 for v = 29 .
(b)
X 12 7 1 1 2 3 5
E[V ] = vPV (v) = · + · + · =
49 24 4 3 9 8 21
v∈RV
(c) In this problem we are checking that the law of total variance, V ar[X] = EY [V ar[X|Y ]] +
V arY [E[X|Y ]], holds (where the subscript Y on the expectation and variance denotes with
respect to the random variable Y .) Computing the LHS:
which is in agreement with the LHS of the equation. Note that E[V ] and V ar[Z] were
computed in this problem and the previous problem.
where in going from the second to the third line I have used the fact that Xi and N are independent
(for all i), in going from the third to the fourth line I have used the linearity of expectation,
and in going from the fifth to sixth line I have used the fact that for an Exp(λ) distribution,
E[X] = 1/λ . This summation can be computed by considering the Taylor expansion of the
P∞ n
exponential, exp(x) = n=0 (x )/n!. Taking the derivative of both sides of this formula with
respect to x, we find that the desired sum is:
∞
X nxn
= xex ,
n!
n=0
and hence
β
E[Y ] =
.
λ
P
The calculation for V ar[Y ] is similar. For this calculation, we will need ∞ 2 n
n=0 (n x )/n!, which
can be found with the same differentiation strategy. I differentiate the equation for the previous
summation with respect to x once more and solve for the desired summation to find
∞
X n2 xn
= xex + x2 ex ,
n!
n=0
where in going from the second to third line I have used the fact that Xi and N are independent
(for all i), in going from the third to fourth line I have broken the square of the summation
P into the
summation of the squares plus the summation of the cross-terms. The notation j,k:j6=k denotes
a sum over all possible tuples of (j, k), where j, k = 1, 2, . . . , n, except the tuples where j = k. In
going from the fourth to fifth line I have used the linearity of expectation, in going from the fifth
to sixth line I have used the independence of all Xi s, and in going from the sixth to seventh line I
have used the fact that for an Exp(λ) distribution, E[X 2 ] = 2/λ2 (as calculated in the book). The
first summationP summation has already been solved for, and to solve the second summation, I use
n2 xn
the formula for ∞ n=0 n! as derived above. Thus I have that
1 β e−β β
E[Y 2 ] = · + 2 βe + β 2 eβ
λ λ λ
β 2 + 2β
= .
λ2
Problem 16.
(a)
Z 1Z ∞
1= fXY (x, y)dxdy
0 0
Z 1Z ∞ Z 1Z ∞
1 −x y
= e dxdy + c dxdy
0 0 2 0 0 (1 + x)2
Z 1Z ∞ Z 1Z ∞
1 −x y
= e dxdy + c dudy
0 0 2 0 1 u2
1 c
= +
2 2
=⇒
c=1
(b)
Z 1/2 Z 1
1
P 0 ≤ X ≤ 1, 0 ≤ Y ≤ = fXY (x, y)dxdy
2 0 0
Z 1/2 Z 1
1 −x y
= e + dxdy
0 0 2 (1 + x)2
1 1
= (1 − e−1 ) +
4 16
≈ 0.22
(c)
Z 1Z 1
P (0 ≤ X ≤ 1) = fXY (x, y)dxdy
0 0
Z 1Z 1
1 −x y
= e + dxdy
0 0 2 (1 + x)2
1 1
= (1 − e−1 ) +
2 4
≈ 0.57
Problem 17.
(a)
Z
fX (x) = fXY (x, y)dy
RY
Z ∞
= e−xy dy
0
−xy ∞
e
=−
x 0
1
=
x
=⇒ (
1
x for 1 ≤ x < e
fX (x) =
0 otherwise
Z
fY (y) = fXY (x, y)dx
R
Z eX
= e−xy dx
1
−xy e
e
=−
y 1
1 1 1
= − ey
y ey e
=⇒ (
1 1 1
y ey − eey for y > 0
fY (y) =
0 otherwise
(b)
Z √
√ 1Z e
P (0 ≤ Y ≤ 1, 1 ≤ X ≤ e) = e−xy dxdy
0 1
Problem 18.
(a)
Z
fX (x) = fXY (x, y)dy
RY
Z 2
1 2 1
= x + y dy
0 4 6
1 2 1
= x +
2 3
=⇒ (
1 2 1
2x + 3 for −1 ≤ x ≤ 1
fX (x) =
0 otherwise
Z
fY (y) = fXY (x, y)dx
RX
Z 1
1 2 1
= x + y dx
4 −1 6
1 1
= y+
3 6
=⇒ (
1 1
3y + 6 for 0 ≤ y ≤ 2
fY (y) =
0 otherwise
(b)
Z 1Z 1
1 2 1
P (X > 0, Y < 1) = x + y dxdy
0 0 4 6
1
=
6
(d)
P (X > 0, Y < 1)
P (X > 0|Y < 1) =
P (Y < 1)
1
6
= 1
3
1
= .
2
Figure 5.2: The region of integration for Problem 18 (e) (shaded region).
(e) We must be slightly careful in choosing the bounds of integration for this problem. The upper bound of the y integral is the upper bound of RY, and the lower bound of the y integral is max{0, −x}, and not simply −x. This is because for x > 0 we have −x < 0, but the lower bound of the range of Y is 0. An illustration of the domain of the double integral is shown in Fig. 5.2.
The probability we seek is thus:
Problem 19.
(a)
∂FXY
fXY (x, y) =
∂x∂y
= e−x 2e−2y
=⇒ (
e−x 2e−2y for x, y > 0
fXY (x, y) =
0 otherwise
(b)
Z ∞ Z x/2
1
P Y > X = e−x 2e−2y dydx
2
Z0 ∞ 0
= e−x − e−2x dx
0
1
=
2
(c) X and Y are independent because the joint PDF can be factored into the product of the two marginal PDFs. Specifically, the joint PDF factors as fX(x)fY(y), where X ∼ Exp(1) and Y ∼ Exp(2).
Problem 20.
(a) To calculate the PDF, we simply need to condition on X > 0, and since a N (0, 1) distribution
is symmetric about zero, we know that P (X > 0) = 1/2:
fX (x)
fX|X>0 (x) =
P (X > 0)
√2 − x22
e for x > 0
= 2π
0 otherwise.
(b)
Z ∞
E[X|X > 0] = xfX|X>0 (x)dx
0
Z ∞
2 x2
=√ xe− 2 dx
2π 0
Z ∞
1 u
=√ e− 2 du
2π 0
2
=√
2π
(c) We can compute E[X 2 |X > 0] by noting that if Y ∼ N (0, 1), then:
1 = E[Y 2 ]
Z ∞
2 y2
=√ y 2 e− 2 dy,
2π 0
where I have used the fact that y 2 times exp (−y 2 /2) is an even function, so I need only
integrate from 0 to infinity and multiply by 2. Thus we have that
Z ∞
2
E[X |X > 0] = x2 fX|X>0 (x)dx
0
Z ∞
2 x2
=√ x2 e− 2 dx
2π 0
= 1.
Finally, we have that:
V ar[X|X > 0] = E[X 2 |X > 0] − (E[X|X > 0])2
2
2
=1− √
2π
π−2
= .
π
Problem 21.
(a) I first find the marginal PDF of Y :
Z 1
2 1
fY (y) = x + y dx
−1 3
2
= (1 + y),
3
so that we have
fXY (x, y)
fX|Y (x|y) =
fY (y)
x + 13 y
2
= 2 ,
3 (1 + y)
and therefore: (
3x2 +y
2(1+y) for −1 ≤ x ≤ 1
fX|Y (x|y) =
0 otherwise.
(b)
Z 1
P (X > 0|Y = y) = fX|Y (x|y)dx
0
Z 1
1
3x2 + y dx
2(1 + y) 0
1
= .
2
Notice that the probability, P (X > 0|Y = y), does not depend on y.
(c) We have already found the marginal PDF of Y , and now I find the marginal PDF of X:
Z 1
1
fX (x) = x2 + y dy
0 3
1
= x2 + .
6
We thus see that fX (x)fY (y) = 2x /3 + y/9 + 2yx2 /3 + 1/9 6= fXY (x, y), and so X and Y
2
Problem 23.
(a) The set E is a diamond shaped region in R2 , upper-bounded by 1 − |x| and lower-bounded by
|x| − 1, as shown in Fig. 5.3. The area of the region is thus 4 times the area of a triangle with
a base length of 1 and height length of 1: 4 · (1/2) · (1) · (1) = 2. Since the total probability
must integrate to unity, we thus have c = 1/2.
so, that (
1 − |x| for −1 ≤ x ≤ 1
fX (x) =
0 otherwise.
Figure 5.3: The diamond-shaped region E in the x–y plane.
(
1 − |y| for −1 ≤ y ≤ 1
fY (y) =
0 otherwise.
fXY (x, y)
fX|Y (x|y) =
fY (y)
(
1
2(1−|y|) for x, y ∈ E
=
0 otherwise.
(d) X and Y are not independent, as it is clear that fX|Y (x|y) 6= fX (x).
(
1
2 for 0 ≤ x ≤ 2
fX (x) =
0 otherwise,
and
(
1
2 for 0 ≤ y ≤ 2
fY (y) =
0 otherwise.
I solve for the desired probability by conditioning on Y and using the fact that X and Y are
independent:
Z 2
P (XY < 1) = P (XY < 1|Y = y)fY (y)dy
0
Z 2
= P (Xy < 1)fY (y)dy
0
Z 2
1
= P X< fY (y)dy
0 y
Z 2 Z min{2,1/y}
= fX (x)fY (y)dxdy
0 0
Z
1 2 1
= min 2, dy
4 0 y
Z 1/2 Z 2 !
1 1
= 2dy + dy
4 0 1/2 y
1 ln 2
= +
4 2
≈ 0.60.
Problem 25. The easiest way to solve this problem will be to use the law of iterated expectations
and the law of total variance. The following information will be useful: for X ∼ Exp(1), E[X] = 1,
V ar[X] = 1 and E[X 2 ] = 2, and for Y |X ∼ U nif (0, X), E[Y |X] = X/2 and V ar[Y |X] = X 2 /12.
(a) I use the law of iterated expectations, where the subscript on the first expectation denotes
an expectation over X:
(b) I use the law of total variance, where the subscripts denote expectation and variance over X:
Problem 26. For X ∼ U nif (0, 1) we have: E[X] = 1/2 and E[X 2 ] = 1/3.
(a) Since X and Y are independent, the expectation of the product is the product of the expec-
tations:
E[XY ] = E[X]E[Y ]
1
= .
4
(c)
(d) We can compute this expectation with a 2D LOTUS over the joint distribution of X and Y .
Since X and Y are independent, fXY (x, y) = fX (x)fY (y) = 1 for x, y ∈ [0, 1]:
Z 1Z 1
E[Y eXY ] = yexy dxdy
0 0
Z 1
= (1 − ey ) dy
0
= e − 2.
Problem 27. I first note that RX = RY = [0, 1] and that RZ = [0, ∞). I solve for the CDF of Z
by conditioning on X and using the fact that X and Y are independent:
FZ (z) = P (Z ≤ z)
X
=P ≤z
Y
X
=P Y ≥
z
Z ∞
x
= P Y ≥ |X = x fX (x)dx
−∞ z
Z 1
x
= P Y ≥ |X = x dx
0 z
Z 1
x
= P Y ≥ dx,
0 z
where in the last line I have used the fact that X and Y are independent. To solve for P (Y ≥ X/z)
by integrating FY (y) over y, some care must be taken in the limits of integration. Since x ∈ [0, 1]
Figure 5.4: a) The x − y plane where the shaded region denotes the region of non-zero probability
for the PDF at hand. b) A plot of the CDF of Z.
and z ∈ [0, ∞), this implies that x/z ∈ [0, ∞). However, we know that fY (y) = 0 for y > 1 and thus
the lower bound of integration is not simply x/z, but is min{1, x/z}. This can be seen pictorially in
Fig. 5.4 a) (from a point of view of integrating over the joint PDF to solve the problem, rather than
the conditional PDF), which is the x − y plane, where the grey region corresponds to the region of
non-zero joint probability density. For any given z, P (Y ≥ X/z) represents the total probability
mass above the line defined by y = x/z. For example, I have drawn 3 different lines in Fig. 5.4
a) corresponding to z = 1/2 (highest line), z = 1 (middle line) and z = 2 (lowest line). For each
of these z values, P (Y ≥ X/z) is the fraction of the grey box above that line. We see that when
z increases from 0 to 1 (corresponding to the vertical line along the y-axis and the line y = x),
the total probability mass above the line increases smoothly. However, due to the edge of the box,
there is a kink in the total probability mass above the line when transitioning from z < 1 to z > 1,
which is the same reason we will get a kink in the function FZ (z) due to min{1, x/z}.
Continuing with the calculation:
Z 1
x
FZ (z) = P Y ≥ dx,
0 z
Z 1Z 1
= fY (y)dydx,
0 min{1,x/z}
Z 1h n x oi
= 1 − min 1, dx
0 z
Z 1 n xo
=1− min 1, dx
z
0Z 1 Z 1
x
=1− 1{x ≥ z}dx + 1{x < z}dx ,
0 0 z
where I have picked out the proper value of the min function by utilizing an indicator function
(which is much nicer to use in an integral since it “kills” the integral whenever the logical condition
evaluates to false).
which I plot in Fig. 5.4 b). Notice that even though this is a piecewise function, it appears very
smooth because, at the transition (z = 1), both the actual function and the first derivative match
between the two piecewise regions.
To find the PDF we need only to differentiate:
\[
f_Z(z) = \frac{dF_Z(z)}{dz} = \begin{cases} \frac{1}{2} & \text{for } 0 \le z \le 1 \\ \frac{1}{2z^2} & \text{for } z > 1 \\ 0 & \text{otherwise}. \end{cases}
\]
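A Monte Carlo sketch (my addition) of the CDF confirms the piecewise form above:
\begin{verbatim}
# Check F_Z at a few points for Z = X/Y with X, Y iid Unif(0, 1).
import numpy as np

rng = np.random.default_rng(4)
x, y = rng.random(1_000_000), rng.random(1_000_000)
z = x / y
F = lambda t: t / 2 if t <= 1 else 1 - 1 / (2 * t)
for t in (0.5, 1.0, 2.0, 5.0):
    print(t, (z <= t).mean(), F(t))
\end{verbatim}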
Problem 28.
(a) To find fU |X (x|u) I first solve for the conditional CDF then differentiate with respect to u:
FU |X (u|x) = P (U ≤ u|X = x)
= P (X + Y ≤ u|X = x)
= P (x + Y ≤ u|X = x)
= P (Y ≤ u − x)
= Φ(u − x),
where in the fourth line I have used the fact that X and Y are independent. To find the
conditional PDF I now differentiate:
∂FU |X
= Φ0 (u − x)
∂u
1 (u−x)2
= √ e− 2 ,
2π
(b) If X ∼ N (µx , σx2 ) and X ∼ N (µy , σy2 ) are independent, then, as shown in the book by the
method of convolution, X + Y ∼ N (µx + µy , σx2 + σy2 ), and thus:
U ∼ N (0, 2).
fU |X (u|x)fX (x)
fX|U (x|u) =
fU (u)
(u−x)2 x 2
√1 e− 2 √1 e− 2
2π 2π
= u 2
1
√
2 π
e− 4
1 (u−x)2 x2 u2
= √ e− 2 − 2 + 4
π
1 u 2
= √ e−(x− 2 )
π
2
(x− u
2)
1 −
2· 1
=√ √ e 2 ,
2π(1/ 2)
where I have used the “completing the square” trick in the exponential to make it more
Gaussian. We recognize this distribution as a normal, u/2, 1/2 distribution:
u 1
X|U = u ∼ N , .
2 2
u 1
(d) Since X|U = u ∼ N 2, 2 , we have:
u
E[X|U = u] = ,
2
and
1
V ar[X|U = u] = .
2
Problem 29. This problem can be solved using the method of transformations. Since X and Y
are independent, we have an axis-aligned 2D Gaussian for the joint distribution:
where I have used the Pythagorean trigonometric identity. Thus, we have that:
1 2
Thus, we see that the function re− 2 r , which only depends on r, is always positive (for r ≥ 0) and
integrates to 1. This is the marginal distribution of R. The function 1/(2π) is always positive and
integrates to 1 for θ ∈ [−π, π], and this is thus the marginal distribution of Θ. Therefore we see
that fRΘ (r, θ) can be factored into fR (r)fΘ (θ), and thus R and Θ are independent.
iid
Problem 30. If X, Y ∼ U nif (0, 1), then:
(
1 for x, y ∈ [0, 1]
FXY (x, y) =
0 otherwise.
The Jacobian has already been calculated in the previous problem (J = r), so that
where in Fig. 5.5, I have indicated where in the r − θ plane fRΘ (r, θ) is non-zero.
We can further examine the constraints r cos θ, r sin θ ∈ [0, 1] to gain more insight. Satisfying
these conditions is equivalent to simultaneously satisfying the following four conditions:
r cos θ ≤ 1
r sin θ ≤ 1
r cos θ ≥ 0
r sin θ ≥ 0.
Figure 5.5: The r − θ plane for Problem 30. The grey region denotes the region in the plane where
fRΘ (r, θ) is non-zero.
Figure 5.6: r sin θ (solid line) and r cos θ (dashed line) for 0 ≤ θ ≤ π/2
Since r is always positive, the last 2 conditions yield cos θ ≥ 0 and sin θ ≥ 0, which only happens
in the first quadrant, i.e., 0 ≤ θ ≤ π/2. If we plot the first 2 conditions for 0 ≤ θ ≤ π/2, as in
Fig. 5.6, we see that when 0 ≤ θ ≤ π/4, r cos θ ≤ 1 =⇒ r sin θ ≤ 1 and when π/4 ≤ θ ≤ π/2,
r sin θ ≤ 1 =⇒ r cos θ ≤ 1.
Thus, we can re-write fRΘ (r, θ) with the constraints specified a little more explicitly:
r for 0 ≤ θ ≤ π4 and r ≤ cos1 θ
fRΘ (r, θ) = r for π4 ≤ θ ≤ π2 and r ≤ sin1 θ
0 otherwise,
(where the inequalities did not flip when I divided by cos θ and sin θ because these are both positive
in the first quadrant). Note that these constraints imply that fRΘ (r, θ) > 0 in the unit square and
fRΘ (r, θ) = 0 outside of the unit square, as we would expect for X, Y ∼ U nif (0, 1) as shown in
Fig. 5.7. The figure shows the unit square in the x − y plane (where the probability is non-zero),
and it shows that for θ less than π/4, r is constrained by 0 ≤ r ≤ 1/ cos θ (for values of r within the
unit square). One can similarly show in this figure that for θ greater than π/4, 0 ≤ r ≤ 1/ sin θ.
Figure 5.7: The unit square in the x − y plane. In polar coordinates, to be within the unit square,
it can be seen geometrically that r is constrained by 0 ≤ r ≤ 1/ cos θ for 0 ≤ θ ≤ π/4, and by
0 ≤ r ≤ 1/ sin θ for π/4 < θ ≤ π/2.
= 1.
For this problem R and Θ are not independent, because there is no way to factor fRΘ (r, θ) into an
equation of just r times an equation of just θ, since the values of r over which the PDF is non-zero
explicitly depend on the values of θ.
Problem 31. The covariance can be computed straight from its definition:
1
X
2 2 1 1 1 11
E[X ] = x PX (x) = 1 · + + =
8 6 6 24
x=0
=⇒
V ar[X] = E[X 2 ] − E[X]2
2
11 11
= −
24 24
143
= ,
576
and for Y we have
2
X
1 1 1 1
E[Y ] = yPY (y) = 1 · + +2· + = 1,
4 6 8 6
y=0
2
X
2 2 1 1 2 1 1 19
E[Y ] = y PY (y) = 1 · + +2 · + =
4 6 8 6 12
y=0
=⇒
V ar[Y ] = E[Y 2 ] − E[Y ]2
19
= − 12
12
7
= .
12
Finally, the correlation is:
1
Cov[X, Y ]
ρXY = p = q 24 ≈ 0.11. (5.2)
V ar[X]V ar[Y ] 143
· 7
576 12
=⇒
Cov[X, Y ] = 1.
The correlation is now straightforward to calculate:
Cov[X, Y ] 1 1
ρXY = p =√ = .
V ar[X]V ar[Y ] 4·9 6
Problem 34. We know that X ∼ U nif (1, 3) (so that E[X] = 2) and Y |X = x ∼ Exp(x) (so
that E[Y |X] = 1/x). Since Cov[X, Y ] = E[XY ] − E[X]E[Y ], and since we know the distribution
of Y |X = x we can probably solve most of the expectations by conditioning on X and using the
law of iterated expectations. To solve for E[Y ] I use the law of iterated expectations (where the
subscript X denotes an expectation over the random variable X):
Cov[Z, W ] = Cov[7 + X + Y, 1 + Y ]
= Cov[X + Y, Y ]
= Cov[X, Y ] + V ar[Y ]
=0+1
= 1,
where I have used the fact that Cov[X, Y ] = 0 since X and Y are independent. Calculating the
variances is easy as well:
and
V ar[W ] = V ar[1 + Y ] = V ar[Y ] = 1.
Thus we have that the correlation is:
Cov[Z, W ] 1
ρZW = p = √ ≈ 0.71.
V ar[Z]V ar[W ] 2
Problem 36.
(a)
2
X + 2Y ∼ N (µX + 2µY , σX + 4σY2 + 2 · 2ρσX σY ) = N (1, 4)
=⇒
3−1
P (X + 2Y ≤ 3) = Φ = Φ(1) ≈ 0.84.
2
(b)
Cov[X − Y, X + 2Y ] = Cov[X, X] + 2Cov[X, Y ] − Cov[X, Y ] − 2Cov[Y, Y ]
2
= σX + ρσX σY − 2σY2
=1
Problem 37.
(a)
2
X + 2Y ∼ N (µX + 2µY , σX + 4σY2 + 2 · 2ρσX σY ) = N (3, 8)
=⇒
P (X + 2Y > 4) = 1 − P (X + 2Y ≤ 4)
4−3
=1−Φ √
8
1
=1−Φ √
2 2
≈ 0.36.
(b) Since X and Y are uncorrelated, jointly normal random variables they are independent, and
thus:
E[X 2 Y 2 ] = E[X 2 ]E[Y 2 ]
= (V ar[X] + E[X]2 )(V ar[Y ] + E[Y ]2 )
2
= (σX + µ2X )(σY2 + µ2Y )
= 10.
Problem 38.
(a) X and Y are jointly normal random variables, and thus by Theorem 5.4 in the book:
x − µX
Y |X = x ∼ N µY + ρσY , (1 − ρ2 )σY2
σX
3(x − 2) 27
=N 1− , .
4 4
We therefore can immediately read off that
3(3 − 2) 1
E[Y |X = 3] = 1 − = .
4 4
P (X + 2Y ≤ 5|X + Y = 3) = P (U ≤ 5|V = 3)
5−4
= Φ q
27
7
r !
7
=Φ
27
≈ 0.69.
Chapter 6
Methods for More Than Two Random Variables
Problem 1.
(a)
Z
fXY (x, y) = fXY Z (x, y, z)dz
(R 1
= 0 (x + y)dz for 0 ≤ x, y ≤ 1
0 otherwise
(
x + y for 0 ≤ x, y ≤ 1
=
0 otherwise
(b)
Z
fX (x) = fXY (x, y)dy
(R 1
= 0 (x + y)dy for 0 ≤ x ≤ 1
0 otherwise
(
x + 21 for 0 ≤ x ≤ 1
=
0 otherwise
so that
fXY Z (x, y, z)
fXY |Z (x, y|z) = = fXY Z (x, y, z).
fZ (z)
(d)
fXY |Z (x, y|z) = fXY Z (x, y, z) = fXY (x, y)
Problem 2.
Since X, Y , Z are independent, fXY |Z (x, y|1) = fXY (x, y) = fX (x)fY (y), so that:
and
E[X 2 Y 2 Z 2 |Z = 1] = E[X 2 Y 2 ] = E[X 2 ]E[Y 2 ] = 1.
Problem 3. To solve this problem, I first state a general result for a multivariate normal. Suppose
that
XA µA ΣAA ΣAB
∼N , ,
XB µB ΣBA ΣBB
where XA , µA ∈ Rm , XB , µB ∈ Rn , ΣAA ∈ Rm×m , ΣBB ∈ Rn×n , ΣAB ∈ Rm×n and ΣBA =
ΣTAB . (Note that, here, I have written this vector equation out in the so-called “partitioned form”
for convenience.) Then, it is not difficult to show that1 :
I now define the random variable, U = Y + Z, solve for the joint PDF of X, Y, U , then condition
on U using the formula above so that I may compute E[XY |Y + Z = 1]. Since U = Y + Z, and
Y and Z are 2 independent N (1, 1) distributions, then U ∼ N (2, 2). Recall that for 3 marginally
normal distributions, the joint distribution is:
X1 E[X1 ] V ar[X1 ] Cov[X1 , X2 ] Cov[X1 , X3 ]
X2 ∼ N E[X2 ] , Cov[X2 , X1 ] V ar[X2 ] Cov[X2 , X3 ] .
X3 E[X3 ] Cov[X3 , X1 ] Cov[X3 , X2 ] V ar[X3 ]
Thus, to solve for the joint distribution of X, Y, U , all that is left to do is to calculate the covariance
terms involving U , Cov[U, X] = Cov[Y + Z, X] = 0 and Cov[U, Y ] = Cov[Y + Z, Y ] = V ar[Y ], so
that the joint distribution is:
X 1 1 0 0
Y ∼ N 1 , 0 1 1 .
U 2 0 1 2
To solve for the distribution of X, Y |U = 1, it is not difficult to identify that, here, ΣAA = I2 (the
2 × 2 identity matrix), ΣAB = [0, 1]T , ΣBA = [0, 1], and ΣBB = 2, where I identify XA with
[X, Y ]T and XB with U . The mean of the conditional distribution is therefore given by
1 0 1 1
µA|B = + (1 − 2) = 1 ,
1 1 2 2
We would like to solve for E[XY |U = 1], which we can get easily from the covariance term of the
above distribution: E[XY |U = 1] = Cov[X, Y |U = 1] + E[X|U = 1]E[Y |U = 1] = 0 + (1) · (1/2) =
1/2.
Problem 4. Due to the symmetry of the problem, Y1 , Y2 , . . . Yn are all identically distributed,
and thus:
E[Y ] = E[Y1 + Y2 + . . . + Yn ] = E[Y1 ] + E[Y2 ] + . . . + E[Yn ] = nE[Y1 ],
1
for example, see: Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
and
n
X X
V ar[Y ] = V ar[Y1 + Y2 + . . . + Yn ] = V ar[Yi ] + 2 Cov[Yi , Yj ] = nV ar[Y1 ] + 2nCov[Y1 , Y2 ].
i=1 i<j
The reason there is a factor of n in front of the covariance term can be seen from the matrix below
(an n = 5 example). The summation over i < j means that we can only consider pairs of i, j
above the diagonal (shaded cells). Further, it is evident that Yi , Yj pairs that are 2 or more apart
are independent. For example, Y2 = X2 X3 , Y3 = X3 X4 , Y4 = X4 X5 , so that Y2 and Y3 are not
independent because they share the X3 random variable, but Y2 and Y4 are independent because
they share no random variables (and all the Xi s are independent). Since independence =⇒ no
covariance, the only Yi , Yj pairs that contribute to the sum are the n − 1 terms right above the
diagonal (as indicated by the spades in the figure). We also cannot forget that the Yn , Y1 pair is
not independent because they share the X1 random variable (the spade in the upper right hand
corner). We thus see that there are n pairs that contribute to the sum.
Y1 Y2 Y3 Y4 Y5
Y1 ♠ ♠
Y2 ♠
Y3 ♠
Y4 ♠
Y5
iid
It remains to compute E[Y1 ], V ar[Y1 ] and Cov[Y1 , Y2 ]. Since Y1 = X1 X2 , and X1 , X2 ∼ Bern(p),
the range of Y1 is {0, 1}, with probability p2 of obtaining 1 (X1 = 1 and X2 = 1). In other words,
Y1 ∼ Bern(p2 ), so that E[Y1 ] = p2 and V ar[Y1 ] = p2 (1 − p2 ). All that is left to do is to compute
the covariance:
where in the second line I have used the fact that all the Xs are independent, and in the fourth
line I have used the fact that for a Bern(p) distribution, p(1 − p) = E[X 2 ] − p2 .
Thus, we have that:
E[Y ] = np2
and
Problem 5.
(b) To solve for V ar[X], I already have E[X], so I just need to solve for E[X 2 ]:
E[X 2 ] = E[(X1 + X2 + . . . + Xk )2 ]
Xk X
=E Xi2 + Xi Xj
i=1 i,j:i6=j
k
X X
= E[Xi2 ] + E[Xi Xj ],
i=1 i,j:i6=j
P
where in the third line I have used the linearity of expectation, and where the notation i,j:i6=j
refers to a summation over all i, j pairs (i, j = 1, 2, . . . , k) except the pairs for which i = j
(which was accounted for in the first summation).
1
X
E[Xi2 ] = j 2 P (Xi = j)
j=0
= P (Xi = 1)
b
= ,
b+r
The second summation is slightly more difficult. The strategy I take is to condition on one of
the random variables and to use the law of total expectation (since Xi , Xj ∈ {0, 1}, so that
one of the terms will go to zero):
I thus need to solve for P (Xi = 1|Xj = 1), which I can do in a very similar combinatorial
fashion as I did for P (Xi = 1). For this probability, the total sample space is all sequences
of size k with b blue balls and r red balls, with a blue ball in the j th spot. The size of the
sample space is thus (b + r − 1)!/[(b − 1)!r!]. The number of unique sequences with a blue
ball in the j th spot and a blue ball in the ith spot is (b + r − 2)!/[r!(b − 2)!], so that
(b+r−2)!
r!(b−2)! b−1
P (Xi = 1|Xj = 1) = = .
(b+r−1)! b+r−1
(b−1)!r!
MX (s) = E[esX ]
∞
X
= p(1 − p)k−1 esk
k=1
(∞ )
p X
s k
= [(1 − p)e ] − 1 ,
1−p
k=0
where we recognize that the summation is a geometric series, and is finite provided (1 − p)es < 1.
Using the formula for a geometric series, and simplifying, I have that:
pes
MX (s) = ,
1 + (p − 1)es
Problem 7. We can solve this problem by realizing that the k th derivative of the MGF evaluated
at s = 0 gives the k th moment of the distribution:
dMX
E[X] =
ds s=0
d 1 1 s 1 2s
= + e + e
ds 4 2 4 s=0
= 1,
2 d2 MX
E[X ] =
ds2 s=0
d2 1 1 s 1 2s
= 2 + e + e
ds 4 2 4 s=0
3
= .
2
Problem 8. We already know from Problem 5 in section 6.1.6 of the book that the MGF for a
N (µ, σ 2 ) distribution is M (s) = exp(sµ + σ 2 s2 /2), and since the MGF of the sum of independent
random variables is the product of the MGFs of the random variables, we have that:
MX (s) = E[esX ]
Z
λ ∞ −λ|x|+sx
= e dx
2 −∞
Z 0 Z ∞
λ x(λ+s) x(s−λ)
= e dx + e dx
2 −∞ 0
Notice that, for both integrals to be finite, we have the conditions that λ + s > 0 (for the first
integral) and s − λ < 0 (for the second), or in other words |s| < λ. Assuming these two conditions,
the integral can easily be evaluated, and is:
λ2
MX (s) = .
λ2 − s2
Problem 10.
MX (s) = E[esX ]
Z ∞ sx α α−1 −λx
e λ x e
= dx
0 Γ(α)
Z ∞
λα
= xα−1 e−(λ−s)x dx
Γ(α) 0
λα Γ(α)
= for s < λ
Γ(α) (λ − s)α
α
λ
=
λ−s
Problem 11. For Xi ∼ Exp(λ), from Example 6.5 in the book, we have that MXi = λ/(λ − s)
for s < λ. Moreover since the MGF of the sum of independent random variables is the product of
the MGFs of the random variables, we have that:
MY (s) = MX1 (s)MX2 (s) . . . MXn (s)
n
λ
= ,
λ−s
which, from the previous problem, we notice is the MGF of a Gamma(n, λ) random variable. By
Theorem 6.1 in the book, the MGF of a random variable uniquely determines its distribution, so
Y ∼ Gamma(n, λ).
Problem 13.
(a) To solve for E[U ], I first find the marginal PDFs:
(R 1 (
1 3 1
0 2 (3x + y)dy for 0 ≤ x ≤ 1 x+ for 0 ≤ x ≤ 1
fX (x) = = 2 4 ,
0 otherwise 0 otherwise
and (R 1 (
1 1
0 2 (3x for 0 ≤ y ≤ 1
+ y)dx y + 34 for 0 ≤ y ≤ 1
fY (y) = = 2 .
0 otherwise 0 otherwise
R1 R1
Thus, E[X] = 0 x(3x/2 + 1/4)dx = 5/8 and E[Y ] = 0 y(y/2 + 3/4)dy = 13/24, so that
5
E[X] 8 .
E[U ] = = 13
E[Y ] 24
(b) In order to solve forRU , I will first need to compute E[X 2 ], E[Y 2 ] and E[XY ]:
Z 1
2 2 3 1 11
E[X ] = x x+ dx = ,
0 2 4 24
Z 1
1 3 3
E[Y 2 ] = y2 y+ dy = ,
0 2 4 8
and Z Z
1 1 1 1
E[XY ] = xy(3x + y)dxdy = .
2 0 0 3
I can now immediately write down the correlation matrix:
RU = E[U U T ]
E[X 2 ] E[XY ]
=
E[Y X] E[Y 2 ]
11 1
= 241
3 .
3
3 8
Problem 14.
(a) First note that the range of Y is [0, 1]. Since we know the distribution of Y |X = x and the
distribution of X, the law of total probability for PDFs will probably be useful. Note that
fY |X (y|x) = 1/x for 0 ≤ y ≤ x and 0 otherwise, which can be written as (1/x)1{0 ≤ y ≤ x},
which will be helpful in the integral to get the bounds of integration correct:
Z 1
fY (y) = fY |X (y|x)fX (x)dx
0
Z 1
1
= 1{0 ≤ y ≤ x}dx
0 x
Z 1
1
= 1{y ≤ x}dx for y > 0
0 x
Z 1
1
= dx
y x
= − ln y,
and thus (
− ln y for 0 ≤ y ≤ 1
fY (y) =
0 otherwise,
which I checked integrates to 1.
Finding the PDF of Z is very similar to that of finding the PDF for Y . Firstly, the range of
Z is [0, 2]. Note that in this case fZ|X (z|x) = 1/2x for 0 ≤ z ≤ 2x and 0 otherwise, which
can be written as (1/2x)1{0 ≤ z ≤ 2x}, so that the integral above becomes:
Z 1
1
fZ (z) = 1{z ≤ 2x}dx for z > 0
0 2x
Z 1
1
= dx
z/2 2x
ln 2 ln z
= − ,
2 2
and thus (
ln 2 ln z
2 − 2 for 0 ≤ z ≤ 2
fZ (z) =
0 otherwise,
which I checked integrates to 1.
Problem 15.
(a) As stated in the problem, we have that
X1 1 4 1
∼N , ,
X2 2 1 1
and for a bivariate normal, we know that
X1 E[X1 ] V ar[X1 ] Cov[X1 , X2 ]
∼N , .
X2 E[X2 ] Cov[X2 , X1 ] V ar[X2 ]
Thus, I have that X2 ∼ N (2, 1), so that:
P (X2 > 0) = 1 − P (X2 ≤ 0)
0−2
=1−Φ
1
= Φ(2)
≈ 0.98.
(b)
Y = AX + b
2 1 −1
X1
= −1 1 + 0
X2
1 3 1
2X1 + X2 − 1
= −X1 + X2
X1 + 3X2 + 1
=⇒
2E[X1 ] + E[X2 ] − 1
E[Y ] = −E[X1 ] + E[X2 ]
E[X1 ] + 3E[X2 ] + 1
2·1+2−1
= −1 + 2
1+3·2+1
3
= 1
8
(c) We know that a linear combination of a multivariate Gaussian random variable is also Gaus-
sian. Specifically, Y is distributed as Y ∼ N (AE[X] + b, ACX AT ), and thus the covariance
matrix of Y is
\[
C_Y = A C_X A^T = \begin{bmatrix} 2 & 1 \\ -1 & 1 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} 4 & 1 \\ 1 & 1 \end{bmatrix}\begin{bmatrix} 2 & -1 & 1 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 21 & -6 & 18 \\ -6 & 3 & -3 \\ 18 & -3 & 19 \end{bmatrix}.
\]
Since Y is a linear transformation of a Gaussian vector, Y1, Y2, Y3 are jointly normal:
\[
\begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} \sim N\!\left(\begin{bmatrix} E[Y_1] \\ E[Y_2] \\ E[Y_3] \end{bmatrix},\; \begin{bmatrix} Var[Y_1] & Cov[Y_1, Y_2] & Cov[Y_1, Y_3] \\ Cov[Y_2, Y_1] & Var[Y_2] & Cov[Y_2, Y_3] \\ Cov[Y_3, Y_1] & Cov[Y_3, Y_2] & Var[Y_3] \end{bmatrix}\right),
\]
with the mean vector and covariance matrix computed above.
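The matrix algebra is easy to double-check numerically (my addition):
\begin{verbatim}
# Compute E[Y] = A mu + b and C_Y = A C_X A^T for Problem 15.
import numpy as np

A = np.array([[2, 1], [-1, 1], [1, 3]])
b = np.array([-1, 0, 1])
CX = np.array([[4, 1], [1, 1]])
mX = np.array([1, 2])

print(A @ mX + b)        # [3, 1, 8]
print(A @ CX @ A.T)      # [[21, -6, 18], [-6, 3, -3], [18, -3, 19]]
\end{verbatim}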
Problem 16. To solve this problem, I first review how to “complete the square” for matrices. For
a ∈ R, x, b ∈ Rm and C ∈ Rm×m (and symmetric), a quadratic of the form
1
a + bT x + xT Cx
2
can be factored into the form
1
(x − m)T M (x − m) + v,
2
where
M = C,
m = −C −1 b,
and
1
v = a − bT C −1 b.
2
I now explicitly write out the MGF of X:
where xT ≡ [x1 , x2 , x3 ] and sT ≡ [s, t, r]. To make the exponent more Gaussian looking, I now
expand the exponent out and complete the square (note that since Σ is symmetric, then so too is
Σ−1 ):
1 1
− (x − µ)T Σ−1 (x − µ) + sT x = − xT Σ−1 x − xT Σ−1 µ − µT Σ−1 x + µT Σ−1 µ + sT x
2 2
1 1
= − xT Σ−1 x + µT Σ−1 x − µT Σ−1 µ + sT x
2 2
1 T −1 1
= − x Σ x + (s + µ Σ )x − µT Σ−1 µ,
T T −1
2 2
where I have used the fact that (xT Σ−1 µ)T = xT Σ−1 µ since this is just a real number. I can now
read off a, b and C:
1
a = − µT Σ−1 µ,
2
bT = sT + µT Σ−1
and
C = −Σ−1 ,
so that
b = (bT )T = s + Σ−1 µ
and
−1
C −1 = −Σ−1 = −Σ.
Finally, the exponent can be re-expressed as
1
− (x − m̃)T M̃ (x − m̃) + ṽ,
2
where
m̃ = Σ(s + Σ−1 µ),
M̃ = Σ−1
and
1 1
ṽ = − µT Σ−1 µ + (sT + µT Σ−1 )Σ(s + Σ−1 µ)
2 2
T
s Σs
= sT µ + ,
2
so that the integral becomes:
Z
1 1 T −1
MX (s, t, r) = exp(ṽ) exp − (x − m̃) Σ (x − m̃) d3 x
(2π)3/2 |Σ|1/2 R3 2
T
s Σs
= exp sT µ + ∀s ∈ R3 .
2
I have used the fact that the integral is that of a Gaussian integrated over its entire domain, so
that the integral evaluates to 1. Note that we probably could have guessed this form of the MGF
of X, since it is the vector analogue of the 1 dimensional case: MX (s) = exp{sµ + σ 2 s2 /2}, as
found in Problem 5 of 6.1.6 in the book.
The specified values of the mean vector and covariance matrix are:
1
µ = 2 ,
0
and
9 1 −1
Σ = 1 4 2 ,
−1 2 4
and plugging in these specific values into the equation I derived above, and multiplying the matrices,
I finally arrive at:
9 2
MX (s, t, r) = exp s s + 1 + 2t(t + 1) + 2r − rs + st + 2rt .
2
Problem 17. Let Ai, i = 1, 2, 3, 4, be the event that the ith component fails, so that a failure occurs under the event ∪_{i=1}^{4} Ai. We can thus obtain an upper bound on the probability that a failure occurs using the union bound:
4
! 4
[ X
P Ai ≤ P (Ai )
1=i i=1
≤ 4pf
1
= .
25
Problem 18.
iid
(a) First note that for the random position of a node (Xi , Yi ), X1 , X2 , . . . , Xn , Y1 , Y2 , . . . , Yn ∼
U nif (0, 1). Let us call the node under consideration node j, and let the set S be defined as
S ≡ {1, 2, . . . , n} − {j}. The probability that the node is isolated, pd , is:
!
\
pd = P (Xj − Xi )2 + (Yj − Yi )2 > r2
i∈S
Z !
1Z 1 \
= P (Xi − Xj )2 + (Yi − Yj )2 > r2 Xj = xj , Yj = yj fXj Y j (xj , yj )dxj dyj
0 0 i∈S
Z !
1Z 1 \
= P (Xi − xj )2 + (Yi − yj )2 > r2 fXj (xj )fYj (yj )dxj dyj
0 0 i∈S
Z 1Z 1 Y
= P (Xi − xj )2 + (Yi − yj )2 > r2 fXj (xj )fYj (yj )dxj dyj
0 0 i∈S
Z 1Z 1Y
= 1 − P (Xi − xj )2 + (Yi − yj )2 ≤ r2 fXj (xj )fYj (yj )dxj dyj
0 0 i∈S
Z Z 1
1 n−1
= 1 − P (X1 − xj )2 + (Y1 − yj )2 ≤ r2 fXj (xj )fYj (yj )dxj dyj ,
0 0
where in the third and fourth lines I have used the fact that the random variables are inde-
pendent, and in the last line, I have used symmetry. I now must compute P ((X1 − xj )2 +
Figure 6.1: The unit square in the x1–y1 plane showing two example node positions (xj, yj): one near the middle of the square and one at the corner (0, 0).
(Y1 − yj )2 ≤ r2 ). If the given point (xj , yj ) is near the middle of the square (as in the upper
point in Fig. 6.1) then the probability of this event is simply the area of this circle (shaded
grey region). However, we notice that if (xj , yj ) is near the edge of the square, part of the
shaded circle will get cutoff. In fact, if (xj , yj ) is exactly at one of the corners of the unit
square, for example, at (0, 0) as in Fig. 6.1, then the amount of shaded area is minimized at
πr2 /4. Thus, for any given (xj , yj ), P ((X1 − xj )2 + (Y1 − yj )2 ≤ r2 ) ≥ πr2 /4. Therefore,
Z 1Z 1 n−1
πr2
pd ≤ 1− fXj (xj )fYj (yj )dxj dyj
0 0 4
n−1
πr2
= 1−
4
(b) Let Ai be the event that the ith node is isolated. Then the probability we seek is:
n
! n
[ X
P Ai ≤ P (Ai )
i=1 i=1
n
X
= pd
i=1
n−1
πr2
=n 1−
4
Problem 19. For X ∼ Geom(p), E[X] = 1/p, so that the Markov inequality is:
E[X] 1
P (X ≥ a) ≤ = .
a pa
Figure 6.2: Comparison of Markov upper bound to exact probability for Problem 19.
The Markov upper bound is greater than or equal to the exact probability for a ≥ 1 and 0 < p < 1
as shown for a few values of a in Fig. 6.2
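The comparison in the figure is easy to reproduce (my sketch; p = 0.4 is an arbitrary choice):
\begin{verbatim}
# Markov bound 1/(p*a) versus the exact tail q**(a-1) for X ~ Geom(p).
p, q = 0.4, 0.6
for a in (1, 2, 5, 10):
    print(a, 1 / (p * a), q ** (a - 1))
\end{verbatim}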
Problem 20.
V ar[X] 1−p
P (|X − E[X]| ≥ b) ≤ =
b2 pb2
Problem 21.
σ2 σ2
P (X ≥ a) = P X + ≥a+
a a
2
2 2 !
σ σ2
=P X+ ≥ a+
a a
h 2
i
E (X + σa )2
≤ 2 Markov0 s Inequality
2
a + σa
σ4
σ2 + a2
= σ2 2
a2 (1 + a2
)
σ2
=
a2 + σ 2
Problem 22.
(a)
(b)
225 45
P (X ≥ 120) ≤ 2
=
225 + 120 293
iid
Problem 23. We know from Problem 11 that if X1 , X2 , . . . , Xn ∼ Exp(λ), then Y ≡ X1 + X2 +
. . . + Xn ∼ Gamma(n, λ). The relevant Chernoff bound is given by:
P (Y ≥ a) ≤ min{e−sa MX (s)},
s>0
where, in this case the MX (s) is the MGF for a Gamma(n, λ) distribution. This MGF was solved
for in Problem 10, and is given by
n
λ
MX (s) = for s < λ,
λ−s
and therefore we must minimize the objective function over 0 < s < λ. Let the optimal value
be called s? . I solve for s? in the standard calculus manner by setting the derivative equal 0. I
then check to make sure that this optimal value is within the interval (0, λ). The derivative of the
objective can be found easily with the chain rule:
n n n
d −sa λ −sa λ −sa λ 1
e = −ae + ne ,
ds λ−s λ−s λ−s λ−s
and setting this equal to zero and solve for s? results in
n
s? = λ − .
a
As stipulated in the problem a > n/λ, which means that n/a is positive and less than λ. Thus we
have that s? ∈ (0, λ) as required.
The desired bound is therefore
?
P (Y ≥ a) ≤ e−s MX (s? )
n
−λa+n λa
=e .
n
We can understand the behavior of this function as n → ∞ by expanding the exponential in
powers of 1/n:
n n
−λa+n λa −λa 1 1
e =e 1+O (λa)n
n n n
n n+1 !
λa 1
= e−λa +O ,
n n
and we thus see that the upper bound goes to 0 exponentially fast as n goes to infinity.
where in the third line I have used Hölder’s inequality, E[|U V |] ≤ E[|U |α ]1/α E[|V |β ]1/β , with
1 < α, β < ∞ and 1/α + 1/β = 1, and where I have specifically chosen α = p/(p − 1) and β = p.
Using the inequality provided in the book,
and multiplying both sides of this equation by E[|X + Y |p ](1−p/p) (which we can do without flipping
the inequality sign since we know that this quantity is positive) yields the desired result:
1 1 1
E[|X + Y |p ] p ≤ E[|X|p ] p + E[|Y |p ] p .
Problem 25.
(a)
d2
(x − x3 ) = −6x
dx2
=⇒ (
+ (convex) for x < 0
g 00 (x) =
− (concave) for x > 0,
Since we know X is a positive random variable, by Jensen’s inequality, we have:
(b)
d2 √ 1
2
(x ln x) =
dx 2x
00
=⇒ g (x) > 0 (convex) for x > 0 =⇒
√ p √
E[X ln X] ≥ E[X] ln E[X] = 10 ln 10
(c) The function is a typical absolute value function, an upward v shape hitting y = 0 at x = 2,
which is clearly convex, since a straight line drawn from any 2 points on the graph is always
above the graph. Therefore, I have that E[|2 − X|] ≥ |2 − E[X]| = 8.
Problem 26. Taking the second derivative, we have that d2 /dx2 (x3 − 6x2 ) = 6x − 12. Setting to
zero and solving for x, I find that the second derivative is negative for x < 2 and positive for x > 2.
Since the range of X is (0, 2), we have that g(x) = x3 − 6x2 is concave in this interval. By Jensen’s
inequality, this implies that E[Y ] = E[g(X)] ≤ g(E[X]) = E[X]3 − 6E[X]2 = 1 − 6 = −5.
Chapter 7
Limit Theorems and Convergence of Random Variables
Problem 1.
(a)
1
E[Mn ] = E[X1 + . . . + Xn ]
n
1
= nE[X1 ]
n
1
=
2
1
V ar[Mn ] = V ar[X1 + . . . + Xn ]
n2
1
= 2 nV ar[X1 ] (independence)
n
1
=
12n
(b)
1 1 1
P Mn − ≥ = P |M − E[M ]| ≥
2 100
n n
100
V ar[Mn ]
≤
1 2
100
2500
=
3n
(c)
1 2500
lim P |Mn − E[Mn ]| ≥ ≤ lim =0
n→∞ 100 n→∞ 3n
=⇒
1 1
lim P Mn − ≥ =0
n→∞ 2 100
Problem 2. Let X1 , X2 , . . . , X365 be the number of accidents on day 1, day 2, . . ., day 365, so that
iid
the total number of accidents in the year is Y = X1 + . . . + X365 . We know that p 365 ∼
X1 , X2 , . . . , X√
P oiss(λ), where λ = 10 accidents/day, so that µ = E[Xi ] = λ = 10 and σ = V ar[Xi ] = λ =
√
10. Using the central limit theorem, I have
Y − 365 · 10 3800 − 365 · 10
P (Y > 3800) = P √ > √
365 · 10 365 · 10
3800 − 365 · 10
= P Z365 > √
365 · 10
3800 − 365 · 10
= 1 − P Z365 ≤ √
365 · 10
3800 − 365 · 10
≈1−Φ √
365 · 10
−3
≈ 6.5 × 10 .
Problem 3. Let the random variable, Xi be 0 if the ith bit is not received in error and 1 if it
iid
is. Notice that X1 , X2 , . . . , X1000 ∼ Bern(0.1), and let theptotal number
p of errors, Y√, be Y =
X1 + . . . + X1000 . Note that µ = E[Xi ] = p = 0.1 and σ = V ar[Xi ] = p(1 − p) = 0.09. We
seek the probability of decoding failure, in other words P (Y > 125):
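Under the CLT, this probability can be evaluated numerically; a quick sketch in Python, using the $\mu$ and $\sigma$ above, is:
from scipy.stats import norm
import numpy as np

n, p = 1000, 0.1
mu, sigma = p, np.sqrt(p*(1 - p))
z = (125 - n*mu)/(sigma*np.sqrt(n))        #standardize the threshold of 125 errors
print(1 - norm.cdf(z))                     #CLT approximation of P(Y > 125)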
Problem 4. Let the random variable, Xi be 0 if the ith student does not have a car and 1 if the ith
student does have a car. Notice that $X_1, X_2, \ldots, X_{50} \stackrel{iid}{\sim} Bern(0.5)$, and let the total number of cars, $Y$, be $Y = X_1 + \ldots + X_{50}$. Note that $\mu = E[X_i] = p = 0.5$ and $\sigma = \sqrt{Var[X_i]} = \sqrt{p(1-p)} = 0.5$.
We seek the probability that there are not enough car spaces, in other words P (Y > 30):
Problem 5. Let N , a random variable, be the number of jobs processed in 7 hours (420 mins).
We seek the probability that the number of jobs processed in 7 hours is less than or equal to 40,
P (N ≤ 40). This can be rephrased as the probability that the total time to processes 40 jobs is
greater than or equal to 7 hours:
Problem 6. Let Xi be the number of heads flipped on toss i, so that the total proportion of
heads out of $n$ tosses, $X$, is $X = (X_1 + \ldots + X_n)/n$. Notice that $X_1, \ldots, X_n \stackrel{iid}{\sim} Bern(0.5)$, so
that $\mu = E[X_i] = p = 0.5$ and $\sigma = \sqrt{Var[X_i]} = \sqrt{0.5^2} = 0.5$. To be at least 95% sure that
$0.45 \le X \le 0.55$, we have that:
$$0.95 \le P(0.45 \le X \le 0.55) = P\left(\frac{0.45n - 0.5n}{0.5\sqrt{n}} \le \frac{X_1 + \ldots + X_n - 0.5n}{0.5\sqrt{n}} \le \frac{0.55n - 0.5n}{0.5\sqrt{n}}\right)$$
$$= P\left(-0.1\sqrt{n} \le Z_n \le 0.1\sqrt{n}\right) \approx \Phi(0.1\sqrt{n}) - \Phi(-0.1\sqrt{n}) = 2\Phi(0.1\sqrt{n}) - 1.$$
Thus, I have that $0.95 \lesssim 2\Phi(0.1\sqrt{n}) - 1$. Applying the inverse normal CDF to this inequality, I arrive at:
$$n \gtrsim \left\lceil 100\left(\Phi^{-1}\left(\frac{1.95}{2}\right)\right)^2\right\rceil = 385. \tag{7.1}$$
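The required sample size in Eq. (7.1) can be reproduced directly in Python (the analogous computation applies to the next problem):
from scipy.stats import norm
import numpy as np

n_min = int(np.ceil(100*norm.ppf(1.95/2)**2))
print(n_min)   #385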
Problem 7. Note that $X_1, X_2, \ldots, X_n$ are iid with $\mu = E[X_i] = 0$ and $\sigma = \sqrt{Var[X_i]} = 2$, so we
can use the CLT. To be at least 95% sure that the final estimate is within 0.1 units of $q$, we require:
$$0.95 \le P(q - 0.1 \le M_n \le q + 0.1) = P\left(q - 0.1 \le \frac{X_1 + \ldots + X_n + nq}{n} \le q + 0.1\right)$$
$$= P\left((q-0.1)n - nq \le X_1 + \ldots + X_n \le (q+0.1)n - nq\right) = P\left(\frac{-0.1n}{2\sqrt{n}} \le \frac{X_1 + \ldots + X_n}{2\sqrt{n}} \le \frac{0.1n}{2\sqrt{n}}\right)$$
$$= P\left(-\frac{0.1\sqrt{n}}{2} \le Z_n \le \frac{0.1\sqrt{n}}{2}\right) \approx \Phi\left(\frac{0.1\sqrt{n}}{2}\right) - \Phi\left(-\frac{0.1\sqrt{n}}{2}\right) = 2\Phi\left(\frac{0.1\sqrt{n}}{2}\right) - 1.$$
We therefore have that $0.95 \lesssim 2\Phi(0.1\sqrt{n}/2) - 1$. Applying the inverse normal CDF to this inequality, I arrive at:
$$n \gtrsim \left\lceil 400\left(\Phi^{-1}\left(\frac{1.95}{2}\right)\right)^2\right\rceil = 1537. \tag{7.2}$$
Problem 8. To solve this problem, I first compute the limit of exp[n(x − 1)]/{1 + exp[n(x − 1)]}
for x > 0 as n goes to ∞. Notice that this function has different behavior for x = 1 (in which case
the limit evaluates easily to 1/2), 0 < x < 1 (in which case the limit evaluates easily to 0) and for
x > 1 (in which case the numerator and denominator evaluate to infinity). Using L’hopital’s rule
in this case I find that the limit evaluates to 1. Therefore, I have that:
$$\lim_{n\to\infty} F_{X_n}(x) = \begin{cases} \lim_{n\to\infty}\frac{e^{n(x-1)}}{1+e^{n(x-1)}} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 0 & \text{for } -\infty < x < 1 \\ \frac{1}{2} & \text{for } x = 1 \\ 1 & \text{for } x > 1. \end{cases}$$
For the "random variable", $X$, that takes on a value of 1 with probability 1, the CDF is:
$$F_X(x) = \begin{cases} 0 & \text{for } -\infty < x < 1 \\ 1 & \text{for } x \ge 1. \end{cases}$$
Thus, we see that $\lim_{n\to\infty} F_{X_n}(x) = F_X(x)$ everywhere $F_X(x)$ is continuous (i.e., on $\mathbb{R} - \{1\}$), and
hence $X_n \xrightarrow{d} X$.
Problem 9. To solve this problem, I first state without proof the following two limits for the expression defining $F_{X_n}$:
$$\lim_{n\to\infty} F_{X_n}(x) = x \quad \text{for } 0 \le x \le 1, \qquad \lim_{n\to\infty} F_{X_n}(x) = 1 \quad \text{for } x > 1.$$
I therefore have that:
$$\lim_{n\to\infty} F_{X_n}(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } 0 \le x \le 1 \\ 1 & \text{for } x > 1, \end{cases}$$
which is the same CDF as a $Unif(0,1)$ distribution. Hence, $X_n \xrightarrow{d} X$ for $X \sim Unif(0,1)$.
Problem 10.
(a)
$$\lim_{n\to\infty} P(|X_n - 0| \ge \epsilon) = \lim_{n\to\infty} P(X_n \ge \epsilon) \quad (\text{since } X_n \ge 0) = \lim_{n\to\infty}\begin{cases} \frac{1}{n^2} & \text{for } \epsilon \le n \\ 0 & \text{for } \epsilon > n \end{cases} = \lim_{n\to\infty}\frac{1}{n^2} = 0$$
$$\implies X_n \xrightarrow{p} 0$$
(b)
$$\lim_{n\to\infty} E[|X_n - 0|^r] = \lim_{n\to\infty} E[X_n^r] \quad (\text{since } X_n \ge 0) = \lim_{n\to\infty}\frac{1}{n^2}n^r = \lim_{n\to\infty} n^{r-2} = 0 \quad (\text{for } 1 \le r < 2)$$
$$\implies X_n \xrightarrow{L^r} 0 \quad (\text{for } 1 \le r < 2)$$
(c) For $r \ge 2$,
$$\lim_{n\to\infty} E[|X_n - 0|^r] = \lim_{n\to\infty} E[X_n^r] = \lim_{n\to\infty}\frac{1}{n^2}n^r = \lim_{n\to\infty} n^{r-2},$$
which converges to 1 for $r = 2$ and diverges for $r > 2$. Therefore, $X_n$ does not converge to 0
in the $r$th mean for $r \ge 2$.
(d) To solve this problem I use Theorem 7.5 in the book, and must thus show that $\sum_{n=1}^{\infty} P(|X_n| > \epsilon)$ (for all $\epsilon > 0$) is finite:
$$\sum_{n=1}^{\infty} P(|X_n| > \epsilon) = \sum_{n=1}^{\infty} P(X_n > \epsilon) = \sum_{n=\lceil\epsilon\rceil}^{\infty}\frac{1}{n^2} \le \sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6} < \infty,$$
where in the first line I have used the fact that $X_n$ is always greater than or equal to zero. Therefore, by Theorem 7.5, $X_n \xrightarrow{a.s.} 0$.
Problem 11. The PMF of $X_n$ is
$$P_{X_n}(x) = \frac{\binom{n+\delta}{x}\binom{n}{10-x}}{\binom{2n+\delta}{10}},$$
for $x = 0, 1, \ldots, 10$ (and 0 otherwise). Since $X, X_{10}, X_{11}, \ldots$ are non-negative random integers (for
$X \sim Bin(10, 0.5)$), by Theorem 7.1 in the book, we need only prove that $\lim_{n\to\infty} P_{X_n}(x) = P_X(x)$
to prove convergence in distribution. As for the RHS of this equation, for $X \sim Bin(10, 0.5)$, the
PMF is given by:
$$P_X(x) = \binom{10}{x}(0.5)^{10},$$
while for the LHS:
$$\lim_{n\to\infty} P_{X_n}(x) = \lim_{n\to\infty}\frac{\binom{n+\delta}{x}\binom{n}{10-x}}{\binom{2n+\delta}{10}} = \lim_{n\to\infty}\binom{n+\delta}{x}\cdot\lim_{n\to\infty}\binom{n}{10-x}\cdot\left[\lim_{n\to\infty}\binom{2n+\delta}{10}\right]^{-1}.$$
The first limit can be found easily by expanding the factorial in the numerator:
$$\lim_{n\to\infty}\binom{n+\delta}{x} = \lim_{n\to\infty}\frac{1}{x!}(n+\delta)(n+\delta-1)\cdots(n+\delta-x+1) = \lim_{n\to\infty}\frac{1}{x!}\left(n^x + O(n^{x-1})\right) = \lim_{n\to\infty}\frac{n^x}{x!},$$
and the remaining 2 limits can be worked out similarly. Plugging these limits in, I have that:
$$\lim_{n\to\infty} P_{X_n}(x) = \lim_{n\to\infty}\frac{n^x}{x!}\cdot\lim_{n\to\infty}\frac{n^{10-x}}{(10-x)!}\cdot\left[\lim_{n\to\infty}\frac{(2n)^{10}}{10!}\right]^{-1} = \lim_{n\to\infty}\left\{\frac{n^x}{x!}\cdot\frac{n^{10-x}}{(10-x)!}\cdot\left[\frac{(2n)^{10}}{10!}\right]^{-1}\right\} = \binom{10}{x}(0.5)^{10}.$$
Thus $\lim_{n\to\infty} P_{X_n}(x) = P_X(x)$, and by Theorem 7.1, $X_n \xrightarrow{d} X$.
Problem 12. As a counterexample, let $X_n$ be the standardized sum of $n$ iid random variables (with finite mean and variance), exactly as in the CLT, and let $Y_n$ be defined analogously (and let all $X_i$s be independent from all $Y_i$s, so that $X_n$ is independent of $Y_n$). Moreover, let $X \sim N(0,1)$ and let $Y = X$. Now, from the CLT, we know that $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$.
From the CLT, we also know that in the limit that n → ∞, Xn + Yn is simply the sum of
two independent standard normal random variables, so that in this limit Xn + Yn ∼ N (0, 2).
Also, since X + Y = 2X, we have that X + Y ∼ N (0, 4), since for X ∼ N (E[X ], V ar[X ]),
Y = aX + b ∼ N (aE[X ] + b, a2 V ar[X ]) (see Sec. 6.1.5 from the book). Thus, in this example, I
have that $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$, but $X_n + Y_n$ does not converge in distribution to $X + Y$.
Problem 13. $X_n \xrightarrow{d} 0$ since:
$$\lim_{n\to\infty} P(|X_n| \ge \epsilon) = \lim_{n\to\infty} 2\int_{\epsilon}^{\infty}\frac{n}{2}e^{-nx}\,dx \quad (\text{by symmetry}) = \lim_{n\to\infty} e^{-n\epsilon} = 0.$$
Problem 14. This can easily be proven by realizing that $X_n$ is never negative (so $|X_n| = X_n$),
and by re-expressing the integral over the PDF in terms of an indicator function depending on
whether $\epsilon > 1/n$ (in which case the lower bound of the integral is $\epsilon$) or whether $\epsilon \le 1/n$ (in which
case the integral evaluates to 1):
Problem 15. Note that
$$X_n = \frac{1}{n}\left(\sum_{i=1}^{n-1} Y_i Y_{i+1} + Y_n Y_1\right).$$
To solve this problem, I will use Chebyshev's inequality, and will thus need to compute $E[X_n]$ and
$Var[X_n]$. Computing $E[X_n]$:
$$E[X_n] = \frac{1}{n}\left(\sum_{i=1}^{n-1} E[Y_i Y_{i+1}] + E[Y_n Y_1]\right) = \frac{1}{n}\left(\sum_{i=1}^{n-1} E[Y_i]E[Y_{i+1}] + E[Y_n]E[Y_1]\right) = \frac{1}{n}\left(\sum_{i=1}^{n-1}\mu^2 + \mu^2\right) = \mu^2,$$
where in the first line I have used the linearity of expectation and in the second I have used the
fact that all Y s are independent.
Solving for V ar[Xn ] is slightly more tricky. To do this, I will first need to compute Cov[Yi Yi+1 , Yi+1 Yi+2 ]
for i = 1, 2, . . . n − 2 (I will also need to compute Cov[Yn−1 Yn , Yn Y1 ] and Cov[Yn Y1 , Y1 Y2 ], but it
is not difficult to show that the following computation gives the same answer for these 2 covari-
ances) and V ar[Yi Yi+1 ] for i = 1, 2, . . . n − 1 (I will also need to compute V ar[Yn Y1 ] but, again,
it is not difficult to show that the following computation gives the same answer for this variance).
Computing the covariance:
$$Cov[Y_i Y_{i+1}, Y_{i+1}Y_{i+2}] = E[Y_i Y_{i+1}^2 Y_{i+2}] - E[Y_i Y_{i+1}]E[Y_{i+1}Y_{i+2}] = E[Y_i]E[Y_{i+1}^2]E[Y_{i+2}] - E[Y_i]E[Y_{i+1}]^2E[Y_{i+2}]$$
$$= \mu^2\left(E[Y_{i+1}^2] - E[Y_{i+1}]^2\right) = \mu^2\sigma^2,$$
where in the second line I have used independence. Now I compute the variance:
$$Var[Y_i Y_{i+1}] = E[Y_i^2 Y_{i+1}^2] - E[Y_i Y_{i+1}]^2 = E[Y_i^2]E[Y_{i+1}^2] - E[Y_i]^2E[Y_{i+1}]^2 = (\sigma^2 + \mu^2)^2 - \mu^4 = \sigma^4 + 2\sigma^2\mu^2,$$
where in the second and third lines I have used independence. I now compute $Var[X_n]$:
"n−1 #
1 X
V ar[Xn ] = 2 V ar Yi Y1+1 + Yn Y1
n
i=1
n−1
X n−2
X
1
= V ar[Yi Y1+1 ] + V ar[Yn Y1 ] + 2 Cov[Yi Yi+1 , Yi+1 Yi+2 ] + 2Cov[Yn−1 Yn , Yn Y1 ]
n2
i=1 i=1
!
+ 2Cov[Yn Y1 , Y1 Y2 ]
1
= [n(σ 4 + 2σ 2 µ2 ) + 2n(µ2 σ 2 )]
n2
σ2 2
= (σ + 4µ2 ).
n
For the summation of the covariances, I have only summed over the covariances of adjacent pairs
of Yi Yi+1 , since pairs that are 2 or more away from each other have zero covariance since they
are independent (since they do not share any Y random variables). To see why this is the proper
summation over the covariances, I illustrate the summation in a matrix form for X5 below. We must
sum all off diagonal terms, however, only adjacent pairs contribute non zero covariance, indicated
by the spades in the figure. It is not difficult to see that my summation corresponds exactly to
adding the cells containing spades in this figure.
Y1 Y2 Y2 Y3 Y3 Y4 Y4 Y5 Y5 Y1
Y1 Y2 ♠ ♠
Y2 Y3 ♠ ♠
Y3 Y4 ♠ ♠
Y4 Y5 ♠ ♠
Y5 Y1 ♠ ♠
By Chebyshev's inequality, $P(|X_n - \mu^2| \ge \epsilon) \le Var[X_n]/\epsilon^2 \to 0$ as $n \to \infty$. Since probabilities cannot be less than 0, I conclude that $\lim_{n\to\infty} P(|X_n - \mu^2| \ge \epsilon) = 0$, so that $X_n \xrightarrow{p} \mu^2$.
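A quick simulation illustrates this convergence in probability; here I take the $Y_i$ to be iid $Unif(0,1)$ purely for illustration (so that $\mu^2 = 1/4$) and watch the circular average concentrate around $\mu^2$ as $n$ grows.
import numpy as np

np.random.seed(0)
for n in [10, 100, 1000, 10000]:
    Y = np.random.uniform(0, 1, size=n)
    X_n = np.mean(Y*np.roll(Y, -1))   #(Y_1Y_2 + ... + Y_{n-1}Y_n + Y_nY_1)/n
    print(n, X_n)                     #values approach mu^2 = 0.25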
Problem 16. Using some simple algebra, since $X_n = \left(\prod_{i=1}^n Y_i\right)^{1/n}$, I have that $\ln X_n = \frac{1}{n}\sum_{i=1}^n \ln Y_i$.
This therefore implies that $\ln X_n \xrightarrow{p} \gamma$. Now, by the Continuous Mapping Theorem (Theorem 7.7 in
the book), since $\exp(\cdot)$ is a continuous function, $\exp(\ln X_n) \xrightarrow{p} \exp(\gamma)$, or in other words: $X_n \xrightarrow{p} e^\gamma$.
Problem 17. To solve this problem, I compute E[|Yn − λ|2 ], keeping in mind that for a P oiss(λ)
distribution, E[X] = V ar[X] = λ:
Problem 18. By Minkowski's inequality,
$$E[|X_n + Y_n - (X+Y)|^r]^{1/r} \le E[|X_n - X|^r]^{1/r} + E[|Y_n - Y|^r]^{1/r},$$
so that:
$$\lim_{n\to\infty} E[|X_n + Y_n - (X+Y)|^r]^{1/r} \le \lim_{n\to\infty} E[|X_n - X|^r]^{1/r} + \lim_{n\to\infty} E[|Y_n - Y|^r]^{1/r} = 0,$$
where the last line follows since $X_n \xrightarrow{L^r} X$ and $Y_n \xrightarrow{L^r} Y$. Since $|X_n + Y_n - (X+Y)|^r \ge 0$,
$E[|X_n + Y_n - (X+Y)|^r] \ge 0$, so that the inequality must hold with equality, and thus $X_n + Y_n \xrightarrow{L^r} X + Y$.
P∞
Problem 19. To solve this problem, I utilize Theorem 7.5 in the book and show that n=1 P (|Xn | >
) (for all > 0) is finite:
$$\sum_{n=1}^{\infty} P(|X_n| > \epsilon) = \sum_{n=1}^{\infty} P(X_n > \epsilon) = \sum_{n=1}^{\infty}\int_{\epsilon}^{\infty} n^2 x\, e^{-\frac{n^2x^2}{2}}\,dx = \sum_{n=1}^{\infty} e^{-\frac{n^2\epsilon^2}{2}},$$
where in the first line I have used the fact that the random variable $X_n$ is never negative and in
the third line I have solved the integral with the substitution $u = n^2x^2/2$.
Now, it is not difficult to show that for positive $\mu$ and $n \ge 1$, as we have in this case,
$e^{-n^2\mu} \le e^{-n\mu}$, and therefore each term in the above summation is $\le e^{-n\epsilon^2/2}$:
$$\sum_{n=1}^{\infty} P(|X_n| > \epsilon) \le \sum_{n=1}^{\infty} e^{-\frac{n\epsilon^2}{2}} = \frac{e^{-\epsilon^2/2}}{1 - e^{-\epsilon^2/2}} < \infty \quad (\text{for } \epsilon > 0).$$
Therefore, by Theorem 7.5, $X_n \xrightarrow{a.s.} 0$.
Therefore, we cannot simply appeal to Theorem 7.5 and must thus use Theorem 7.6. That is, we
must show that for any $\epsilon > 0$, $\lim_{m\to\infty} P(A_m) = 1$, where the set $A_m$ is defined in the book. I
show this for $0 < \epsilon < 1$. For this interval, from the definition of $A_m$:
$$A_m = \{|X_n| < \epsilon,\ \forall n \ge m\} = \{X_n < \epsilon,\ \forall n \ge m\} = \{X_n = 0,\ \forall n \ge m\}.$$
where in the fifth line I have used the fact that if Ym−1 Ym−2 . . . Y1 = 0, then at least 1 Yi (i =
1, . . . , m − 1) is 0. Since the random variables Ym Ym−1 . . . Y1 , Ym+1 Ym . . . Y1 , . . . are all products
of Ym−1 Ym−2 . . . Y1 , given that Ym−1 Ym−2 . . . Y1 = 0, we know for sure that all random variables
Ym Ym−1 . . . Y1 , Ym+1 Ym . . . Y1 , . . . are 0. In the seventh line I have used De Morgan’s law, and in
the eighth I have used independence.
The product is easily solved:
$$\prod_{k=1}^{m-1}\frac{k}{k+1} = \frac{1}{2}\cdot\frac{2}{3}\cdots\frac{m-2}{m-1}\cdot\frac{m-1}{m} = \frac{1}{m},$$
k=1
where all denominators have cancelled out with the next numerator except for the last one. I
therefore have that
$$\lim_{m\to\infty} P(A_m) = \lim_{m\to\infty}\left(1 - \frac{1}{m}\right) = 1,$$
and therefore, by Theorem 7.6, $X_n \xrightarrow{a.s.} 0$.
Chapter 8
Statistical Inference I: Classical Methods
Problem 1.
(a) Using the formulas for the sample mean, sample variance and sample standard deviation, I
find that:
X̄ ≈ 164.3 lbs,
S 2 ≈ 383.7 lbs2 ,
and
S ≈ 19.59 lbs.
Problem 2. To calculate the bias of this estimator, I first compute its expectation:
$$E[\hat{\Theta}] = E\left[\left(\frac{1}{n}\sum_{k=1}^n X_k\right)^2\right] = \frac{1}{n^2}E\left[\sum_{k=1}^n X_k^2 + \sum_{i,j:\,i\ne j} X_i X_j\right] = \frac{1}{n^2}\left(\sum_{k=1}^n E[X_k^2] + \sum_{i,j:\,i\ne j} E[X_i]E[X_j]\right)$$
$$= \frac{1}{n^2}\left[n(\sigma^2 + E[X_k]^2) + (n^2 - n)E[X_k]^2\right] = \frac{1}{n^2}\left[n(\sigma^2 + \mu^2) + (n^2 - n)\mu^2\right] = \frac{\sigma^2}{n} + \mu^2,$$
where the notation $\sum_{i,j:\,i\ne j}$ refers to a sum over all pairs of $i, j$ ($i, j = 1, \ldots, n$) except for the pairs
where $i = j$. In the third line I have used the linearity of expectation and independence. The bias
is thus:
$$B(\hat{\Theta}) = E[\hat{\Theta}] - \theta = E[\hat{\Theta}] - \mu^2 = \frac{\sigma^2}{n}.$$
Problem 3.
(a)
$$E[\hat{\Theta}_n] = E[12\bar{X} - 6] = E\left[\frac{12}{n}\sum_{i=1}^n X_i - 6\right] = \frac{12}{n}\sum_{i=1}^n E[X_i] - 6 = \frac{12}{n}\sum_{i=1}^n\left(\frac{\theta}{3} - \frac{\theta}{4} + \frac{1}{2}\right) - 6 = 12\left(\frac{\theta}{3} - \frac{\theta}{4} + \frac{1}{2}\right) - 6 = \theta.$$
(b) I will use Chebyshev’s inequality to show that this is a consistent estimator and I will therefore
need to compute V ar[Θ̂n ]. To do this, I first compute E[Xi2 ]:
$$E[X_i^2] = \int_0^1 \left[\theta\left(x - \frac{1}{2}\right) + 1\right]x^2\,dx = \frac{\theta}{4} - \frac{\theta}{6} + \frac{1}{3}.$$
Now I compute E[Θ̂n ]:
E[Θ̂2n ] = E (12X̄ − 6)2
!2
2 Xn Xn
12 12
=E 2 Xi − 6 · 2 · Xi + 36
n n
i=1 i=1
Xn X n
144 2 144 X
= 2 E[Xi ] + E[Xi ]E[Xj ] − E[Xi ] + 36
n n
i=1 i,j:i6=j i=1
144
= 2 nE[Xi2 ] + (n2 − n)E[Xi ]2 − 144E[Xi ] + 36,
n
P
where the notation i,j:i6=j refers to a sum over all pairs of i, j (i, j = 1, . . . , n) except
for the pairs where i = j. In this derivation, I have used the linearity of expectation and
independence. Plugging in E[Xi2 ] and E[Xi ] and simplifying, I find that
$$E[\hat{\Theta}_n^2] = \frac{12}{n} + \theta^2\left(1 - \frac{1}{n}\right),$$
so that
$$Var[\hat{\Theta}_n] = E[\hat{\Theta}_n^2] - E[\hat{\Theta}_n]^2 = \frac{12 - \theta^2}{n}.$$
To show that $\hat{\Theta}_n$ is a consistent estimator, I must show that $\hat{\Theta}_n \xrightarrow{p} \theta$. Using Chebyshev's
inequality, I have:
$$\lim_{n\to\infty} P(|\hat{\Theta}_n - \theta| \ge \epsilon) = \lim_{n\to\infty} P(|\hat{\Theta}_n - E[\hat{\Theta}_n]| \ge \epsilon) \le \lim_{n\to\infty}\frac{Var[\hat{\Theta}_n]}{\epsilon^2} = \lim_{n\to\infty}\frac{12 - \theta^2}{n\epsilon^2} = 0.$$
Since probabilities cannot be negative, I have that $\lim_{n\to\infty} P(|\hat{\Theta}_n - \theta| \ge \epsilon) = 0$ for all $\epsilon > 0$,
and thus $\hat{\Theta}_n$ is consistent.
(c) Since I have already computed the variance and bias of Θ̂n , computing the mean squared
error is easy:
$$MSE(\hat{\Theta}_n) = Var[\hat{\Theta}_n] + B[\hat{\Theta}_n]^2 = \frac{12 - \theta^2}{n}.$$
Problem 4.
$$L(x_1, \ldots, x_4; p) = \prod_{i=1}^4 P_{X_i}(x_i; p) = \prod_{i=1}^4 p(1-p)^{x_i - 1} = p^4(1-p)^{2-1}(1-p)^{3-1}(1-p)^{3-1}(1-p)^{5-1} = p^4(1-p)^9$$
Problem 5.
$$L(x_1, \ldots, x_4; \theta) = \prod_{i=1}^4 f_{X_i}(x_i; \theta) = \prod_{i=1}^4 \theta e^{-\theta x_i} = \theta^4 e^{-2.35\theta}e^{-1.55\theta}e^{-3.25\theta}e^{-2.65\theta} = \theta^4 e^{-9.8\theta}$$
Problem 6. Since log(·) is a monotonic increasing function on R, argmaxθ∈R L(x; θ) = argmaxθ∈R log L(x; θ).
This can easily be proven by considering the definition of a strictly monotonic function.
Problem 7.
(a) For a single data point, X, and our estimator Θ̂ (which is a function, f , of X) of σ 2 , we have
that E[Θ̂] = E[f (X)] = σ 2 , where the last equality follows because we want the estimator to
be unbiased. Therefore, we are searching for a function such that:
Z ∞
1 x2
√ f (x)e− 2σ2 dx = σ 2 .
−∞ 2πσ
Since the PDF is that of N (0, σ 2 ), (i.e., it has mean zero), it is clear that the function that
satisfies this equation is f (x) = x2 , and therefore Θ̂ = X 2 .
(b)
1 x2
ln L(x; σ 2 ) = − ln 2π − ln σ − 2
2 2σ
(c) Taking the derivative of the above equation with respect to σ,
∂ ln L 1 x2
= − + 3,
∂σ σ σ
setting equal to zero,
1 x2
0=− + 3 ,
σ̂M L σ̂M L
Problem 8.
(a)
n
Y
L(x1 , . . . , xn ; λ) = PXi (xi ; λ)
i=1
Yn
e−λ λxi
=
xi !
i=1
n
Y
Pn 1
= e−nλ λ i=1 xi
xi !
i=1
`(λ) = ln L(x1 , . . . , xn ; λ)
Pn n
X
−λn
= ln e + ln λ i=1 xi
+ ln xi !−1
i=1
n
X n
X
= −λn + xi ln λ − ln xi !.
i=1 i=1
that is, the maximum likelihood estimate of λ is simply the sample mean.
Problem 9. To solve for the CDF for the ith order statistic, let us assume that X1 , X2 , . . . , Xn
are a random sample from a continuous distribution with CDF, FX (x). I fix a value x ∈ R, and
define the indicator random variable, Ij , by
(
1 if Xj ≤ x
Ij (Xj ) =
0 if Xj > x,
where Ij = 1 is a “success” and Ij = 0 is a “failure.” Note that, since all Xj s are iid, the probability
of a success, P (Xj ≤ x), is the same for each trial and is given by FX (x). Therefore, I have that
iid P
Ij ∼ Bern(FX (x)). I now define the random variable, Y = nj=1 Ij , and since this is the sum of n
independent Bernoulli random trials, it has a distribution: Y ∼ Bin(n, FX (x)).
Now, given that Y ∼ Bin(n, FX (x)), the quantity, P (Y ≥ i) is therefore the probability that
there are at least i successes out of n trials. Given our definition of “success”, and given that the
number of trials n is simply the number of observations, this can be re-phrased as the probability
that there are at least i observations out of n with values less than or equal to x.
We desire to find P (X(i) ≤ x), the probability that the ith biggest observation out of n obser-
vations has a value less than or equal to x. In other words, we desire to find the probability that
there are at least i observations out of n with a value less than or equal to x. Notice that this is
exactly P (Y ≥ i), so that:
Problem 10. Let region 1 be defined as the interval (−∞, x], region 2 as the interval (x, x + δ],
(where δ is a small positive number) and region 3 as the interval (x + δ, ∞). By the definition of
the PDF, the probability that the ith order statistic is in region 2 is given by P (x < X(i) ≤ x + δ) ≈
fX(i) (x)δ. In other words, for δ small enough, P (x < X(i) ≤ x + δ) is the probability that, out of n
samples, there are i − 1 samples in region 1, one in region 2 and n − i in region 3.
Now, since all samples are iid from a distribution with PDF fX (x) and CDF FX (x), the
probability that a sample lands in region 1, is
p1 = P (X ≤ x) = FX (x),
in region 2 is
p2 = P (x ≤ X ≤ x + δ) ≈ fX (x)δ,
and in region 3 is
p3 = P (X > x + δ) = 1 − FX (x + δ) ≈ 1 − FX (x).
Notice that if we define si as the event that a sample, out of n samples, lands in region i (with
associated probability, pi ), this is precisely a multinomial experiment with 3 possible outcomes.
Thus the probability that out of n samples (trials), there are i − 1 in region 1, one in region 2 and
n − i in region 3 is given by:
n!
pi−1 p2 p3n−1 .
(i − 1)!(n − 1!) 1
However, this is precisely the probability fX(i) (x)δ. Therefore, I have that:
n!
fX(i) (x)δ = pi−1 p2 pn−1
(i − 1)!(n − 1!) 1 3
n!
= [FX (x)]i−1 fX (x)δ[1 − FX (x)]n−1 .
(i − 1)!(n − 1!)
Canceling the δ from both sides of the equation gives the desired result.
Problem 11. Since n is relatively large, the variance is known, and we would like an approximate
confidence interval for θ = E[Xi ], we can calculate the confidence interval by employing the CLT
√
and by using n(X̄ − θ) as the pivotal quantity. This computation is done in the book and the
interval is given by:
$$\left[\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right].$$
The quantities in this interval are $\bar{X} = 50.1$, $\sigma = 9$, $n = 100$ and $z_{\alpha/2} = z_{0.025} = 1.96$. Using
these values, I find that the 95% confidence interval is given by: [48.3, 51.9]. Note that $z_{\alpha/2}$ can be
computed in Python with scipy.stats.norm.ppf(1-alpha/2).
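For completeness, here is a short computation of this interval in Python:
from scipy.stats import norm
import numpy as np

x_bar, sigma, n, alpha = 50.1, 9, 100, 0.05
z = norm.ppf(1 - alpha/2)
print([x_bar - z*sigma/np.sqrt(n), x_bar + z*sigma/np.sqrt(n)])   #approximately [48.3, 51.9]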
Problem 12. In this problem, we choose a random sample of size n from a population, X1 , X2 , . . . , Xn ,
where these random variables are iid Bern(θ), where Xi is 1 if the ith voter intends to vote for
Candidate A, and 0 otherwise.
(a) We require there to be at least a 90% probability that the sample proportion, X̄ is within 3
percentage points of the actual proportion, θ. In math, this is:
and algebraically manipulating the argument, it is easy to show that the 90% confidence
interval we require is:
[X̄ − 0.03, X̄ + 0.03].
Following along Example 8.18 in the book, utilizing the CLT, and obtaining a conservative
estimate for the interval by using σmax , which for a Bernoulli distribution is 1/2 (since we do
not actually know σ), the proper interval is given by:
$$\left[\bar{X} - \frac{z_{\alpha/2}}{2\sqrt{n}},\; \bar{X} + \frac{z_{\alpha/2}}{2\sqrt{n}}\right].$$
Comparing this interval with the one above, we see that:
$$\frac{z_{\alpha/2}}{2\sqrt{n}} = 0.03.$$
Thus, we require $n$ to be at least:
$$\left\lceil\left(\frac{z_{\alpha/2}}{2\cdot 0.03}\right)^2\right\rceil = 748.$$
(b) Using the same formula as above, but with $z_{\alpha/2} = z_{0.005} \approx 2.58$, I find that $n$ must be at least 1849.
Problem 13. For this problem, since n is relatively large, I use the standard approximate con-
fidence interval derived using the CLT. The variance, however, is unknown, but since n is large
should be well approximated by the sample variance, S 2 . The proper confidence interval is thus:
$$\left[\bar{X} - z_{\alpha/2}\frac{S}{\sqrt{n}},\; \bar{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\right],$$
and using $n = 100$, $\bar{X} = 110.5$, $S^2 = 45.6$, and $z_{0.025} = 1.96$, I find the 95% confidence interval for
the distribution mean to be approximately [109.2, 111.8].
Problem 14.
(a) For an n = 36 random sample from N (µ, σ 2 ), with µ and σ 2 unknown, the proper pivotal
quantity to use to estimate $\mu$ is $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$, which, because it has a $t$ distribution,
results in a confidence interval of:
$$\left[\bar{X} - t_{\alpha/2,n-1}\frac{S}{\sqrt{n}},\; \bar{X} + t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}\right],$$
as shown in the book. For the desired confidence levels (90%, 95%, 99%), the appropriate
t values are: t0.1,35 ≈ 1.69, t0.05,35 ≈ 2.03, t0.01,35 ≈ 2.72, and the corresponding confidence
intervals are: [34.8, 36.8], [34.6, 37.0] and [34.2, 37.4]. We see that as the confidence level
increases, the width of the interval gets wider since we desire more confidence that the actual
value of µ is encompassed by that random interval. Note that t α2 ,n−1 can be computed in
Python with scipy.stats.t.ppf(1-alpha/2, n-1).
(b) The proper pivotal quantity to use to estimate $\sigma^2$ is $Q = (n-1)S^2/\sigma^2$, which, because it has
a $\chi^2$ distribution, results in a confidence interval of:
$$\left[\frac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}},\; \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}}\right],$$
as shown in the book. Computing the proper $\chi^2_{\alpha/2,n-1}$ and $\chi^2_{1-\alpha/2,n-1}$ values, I find the following
90%, 95% and 99% confidence intervals for $\sigma^2$: [8.78, 19.47], [8.22, 21.3] and [7.26, 25.4]. Again,
we see that as the confidence level increases, the width of the interval gets wider since we desire
more confidence that the actual value of $\sigma^2$ is encompassed by that random interval. Note
that $\chi^2_{\alpha/2,n-1}$ can be computed in Python with scipy.stats.chi2.ppf(1-alpha/2, n-1).
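Both intervals are straightforward to compute programmatically; the sketch below uses placeholder values xbar and s2 for the sample mean and sample variance (these are not the values from the problem data, which should be substituted in):
from scipy.stats import t, chi2
import numpy as np

def normal_mean_var_cis(xbar, s2, n, alpha):
    """Confidence intervals for mu and sigma^2 from a normal random sample."""
    half_width = t.ppf(1 - alpha/2, n - 1)*np.sqrt(s2/n)
    mu_ci = (xbar - half_width, xbar + half_width)
    var_ci = ((n - 1)*s2/chi2.ppf(1 - alpha/2, n - 1),
              (n - 1)*s2/chi2.ppf(alpha/2, n - 1))
    return mu_ci, var_ci

xbar, s2 = 35.0, 12.0                       #placeholder sample statistics
for alpha in [0.1, 0.05, 0.01]:
    print(normal_mean_var_cis(xbar, s2, 36, alpha))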
Problem 15.
(a) We recognize that since the data are drawn iid from a normal distribution, since σ 2 is known,
and since the hypotheses are of the form Ho : µ = µo and HA : µ 6= µo , this is a 2-sided
√
z-test, as outlined in Table 8.2 in the book. Thus, if the statistic W = (X̄ − µo )/(σ/ n)
satisfies |W | ≤ z α2 then we fail to reject the null hypothesis, otherwise we reject it in favor of
the alternative hypothesis
Computing X̄ and W , I find:
X̄ ≈ 5.96,
and
W ≈ 2.15.
At a level of α = 0.05, the proper threshold is z0.025 ≈ 1.96. Since W > z α2 , we reject Ho in
favor of HA at a significance level of 0.05.
(b) In this case, since the data are drawn iid from a normal distribution with known variance,
the proper (1 − α)100% confidence interval to use as shown in Section 8.3.3 of the book is:
σ σ
X̄ − z 2 √ , X̄ + z 2 √ ,
α α
n n
which, when plugging in the particular values for this problem results in a 95% confidence
interval of approximately [5.08, 6.84].
The value µo = 5 is not within this interval. As shown in Section 8.4.3 of this book, for this
√
type of hypothesis test, since we accept Ho at a level of α if |(X̄ − µo )/(σ/ n)| ≤ z α2 , this
results in the condition that we accept Ho if:
σ σ
µo ∈ X̄ − z 2
α √ , X̄ + z 2
α √ .
n n
I.e., for this test, if µo is in the (1 − α)100% confidence interval, we accept Ho at a level
of α, otherwise we do not. Since µo = 5 is not in the calculated confidence interval, this
corresponds to rejecting Ho in favor of HA , which is indeed what we found above.
Problem 16.
(a) As with the previous problem, since the data are drawn iid from a normal distribution with
known variance, the proper (1 − α)100% confidence interval to use as shown in Section 8.3.3
of the book is:
σ σ
X̄ − z α2 √ , X̄ + z α2 √ .
n n
For this problem X̄ ≈ 17.0, z α2 = z0.05 ≈ 1.64, so that the 90% confidence interval is approx-
imately [16.45, 17.55]. The value of µo is not included in this interval, which, as explained
above, means that we reject the null hypothesis at a significance level of α = 0.1.
√
(b) As shown in Section 8.4.3 of the book, the proper test statistic to use is W = (X̄ −µo )/(σ/ n),
and if |W | ≤ z α2 (see Table 8.2) we cannot reject the null hypothesis. For this problem, W ≈ 3,
z α2 = z0.05 ≈ 1.64, and therefore we reject Ho at a significance level of α = 0.1.
Problem 17. In this problem, the random sample comes from an unknown distribution with
unknown variance and with a rather large n (n = 150). Since the hypotheses we would like to test
correspond to Ho : µ = 50 and HA : µ > 50, this will most likely be a 1-sided z-test (using the
√
sample variance). Here I work out the test explicitly. Using W as my statistic, (X̄ − µo )/(S/ n),
If Ho is true, then we would expect X̄ ≈ µo and W ≈ 0. On the other hand, if HA is true we expect
X̄ > µo and W > 0. Therefore I employ the following test: if W ≤ c I fail to reject Ho , while if
W > c I reject Ho in favor of HA .
To solve for c I must bound the probability of making a Type I error:
Therefore, the critical value, c, occurs at equality: 1 − Φ(c) = α, or in other words c = zα . Thus,
if W ≤ zα I fail to reject Ho , while if W > zα I reject Ho in favor of HA .
For this problem X̄ = 52.28, S 2 = 30.9, and so W ≈ 5.02, while zα = z0.05 ≈ 1.64, so that I
reject Ho in favor of HA at a significance level of 0.05.
Problem 18. In this problem, the random sample comes from a normal distribution with unknown
variance, and the hypotheses we are testing are of the form Ho : µ ≥ µo and HA : µ < µo , and I
therefore use a 1-sided t-test. As indicated in Table 8.4, we fail to reject Ho if W ≥ −tα,n−1 . For
this problem,
27.72 + 22.24 + 32.86 + 19.66 + 35.34
X̄ = ≈ 27.56,
5
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 \approx 44.84,$$
so that
$$W = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \approx \frac{27.56 - 30}{\sqrt{44.84}/\sqrt{5}} \approx -0.81.$$
Also, for this problem −tα,n−1 = −t0.05,4 ≈ −2.13. Since W ≥ −tα,n−1 , we fail to reject the null
hypothesis at a level of 0.05.
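This test is also available directly in SciPy; the statistic below should match the value of $W$ computed above (the one-sided alternative argument requires a reasonably recent SciPy version):
from scipy.stats import ttest_1samp

data = [27.72, 22.24, 32.86, 19.66, 35.34]
result = ttest_1samp(data, popmean=30, alternative='less')
print(result.statistic, result.pvalue)   #statistic approximately -0.81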
Problem 19. Since the random sample is drawn from an unknown distribution with unknown
variance, but the sample number is relatively large (n = 121), I can use a z-test with the sample
variance. Moreover, the hypotheses we are testing are of the form Ho : µ = µo and HA : µ < µo .
√
I therefore use a 1-sided z-test, where if W < −zα with W = (X̄ − µo )/(S/ n), then I reject the
null hypothesis at a significance level of α (see table 8.4).
The p-value is the probability of making a Type I error when the statistic threshold is set to
that which was observed (w1 ), in this case w1 ≈ −0.81. Thus, the p-value for this problem is:
Problem 20.
(a) We would like to test the hypotheses that Ho : θ ≥ 0.1 and HA : θ < 0.1, which, since equality
gives us a worse case scenario (as shown in Section 8.4.3 of the book), can be simplified to:
Ho : θ = θo = 0.1
HA : θ < θo .
(b) If we let Xi = 1 if the ith student has allergies, and 0 otherwise, we see that Xi ∼ Bern(θ),
so that E[Xi ] = θ and V ar[Xi ] = θ(1 − θ). Now, under the null hypothesis, and under the
CLT (since n is large), I have that
X̄n − θo n
p ∼ N (0, 1).
nθo (1 − θo )
This is a convenient test statistic to use since I have its distribution, and since if the alternative
hypothesis is true, X̄ will be small (and so will the statistic), while if the null hypothesis is
true, X̄ will be large (and so will the statistic). This suggests the following test: if
X̄n − θo n
p <c
nθo (1 − θo )
then reject the null hypothesis in favor of the alternative hypothesis, while if
X̄n − θo n
p ≥ c,
nθo (1 − θo )
(c) Since the p-value is the lowest significance level α that results in rejecting the null hypothesis,
at a level of α = 0.05, we cannot reject the null hypothesis.
Problem 21.
(a) Using the equations for simple linear regression I have the following:
n
1X −5 − 3 + 0 + 2 + 1
x̄ = xi = = −1,
n 5
i=1
n
1X −2 + 1 + 4 + 6 + 3
ȳ = yi = = 2.4,
n 5
i=1
n
X
sxx = (xi − x̄)2 = (−5 + 1)2 + (−3 + 1)2 + (0 + 1)2 + (2 + 1)2 + (1 + 1)2 = 34,
i=1
n
X
sxy = (xi − x̄)(yi − ȳ) = (−5 + 1)(−2 − 2.4) + (−3 + 1)(1 − 2.4) + (0 + 1)(4 − 2.4)
i=1
+ (2 + 1)(6 − 2.4) + (1 + 1)(3 − 2.4) = 34,
sxy 34
β̂1 = = = 1,
sxx 34
and
β̂0 = ȳ − β̂1 x̄ = 2.4 − 1(−1) = 3.4.
The regression line is given by ŷ = β̂0 + β̂1 x, and therefore
ŷ = 3.4 + x.
With $s_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2 = 37.2$, I have that
$$r^2 = \frac{s_{xy}^2}{s_{xx}s_{yy}} = \frac{34^2}{34\cdot 37.2} \approx 0.91.$$
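The same coefficients and $r^2$ can be obtained with NumPy, as a quick check of the arithmetic above:
import numpy as np

x = np.array([-5, -3, 0, 2, 1])
y = np.array([-2, 1, 4, 6, 3])
beta1, beta0 = np.polyfit(x, y, 1)   #slope and intercept of the least-squares line
r2 = np.corrcoef(x, y)[0, 1]**2
print(beta0, beta1, r2)              #approximately 3.4, 1.0 and 0.91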
Problem 22.
(a) Using the equations for simple linear regression I have the following:
n
1X 1+3
x̄ = xi = = 2,
n 2
i=1
n
1X 3+7
ȳ = yi = = 5,
n 2
i=1
n
X
sxx = (xi − x̄)2 = (1 − 2)2 + (3 − 2)2 = 2,
i=1
n
X
sxy = (xi − x̄)(yi − ȳ) = (1 − 2)(3 − 5) + (3 − 2)(7 − 5) = 4
i=1
sxy 4
β̂1 = = = 2,
sxx 2
and
β̂0 = ȳ − β̂1 x̄ = 5 − 2 · 2 = 1.
The regression line is given by ŷ = β̂0 + β̂1 x, and therefore
ŷ = 1 + 2x.
ŷ1 = 1 + 2 · 1 = 3
ŷ2 = 1 + 2 · 3 = 7.
e1 = y1 − ŷ1 = 3 − 3 = 0
e2 = y2 − ŷ2 = 7 − 7 = 0.
Pn
As a check, we know that i=1 ei = 0, which is indeed the case.
n
X
syy = (yi − ȳ)2 = (3 − 5)2 + (7 − 5)2 = 8,
i=1
so that
s2xy 42
r2 = = = 1.
sxx syy 2·8
(e) Since there are only 2 data points in the training set, the regression line that minimizes the
sum of squared errors goes exactly through those 2 points, and thus r2 = 0. This is a good fit
to the training data, however, it will probably not generalize well to new, unseen data, and
is probably therefore a poor predictive model.
Problem 23.
(a) According to this model, Yi ∼ N (βo + β1 xi , σ 2 ). To solve for the distribution of β̂1 , note that
it is fairly easy to find the distribution of a sum (or a linear combination) of independent
normal random variables. However, due to the fact that each Yi is in the sum that comprises
Ȳ , clearly the term, Sxy is not a sum of independent random variables. In order to express
the formula for β̂1 as a linear combination of the Yi s (which are independent), I expand the
formula for β̂1 and group each Yi term. With ci ≡ (xi − x̄), I have:
Sxy
β̂1 =
sxx
n
1 X
= ci (Yi − Ȳ )
sxx
i=1
(
1 1 1
= c1 Y1 − (Y1 + Y2 + . . . + Yn ) + c2 Y2 − (Y1 + Y2 + . . . + Yn )
sxx n n
)
1
+ . . . + cn Yn − (Y1 + Y2 + . . . + Yn )
n
(
1 h c1 c2 cn i h c1 c2 cn i
= Y1 c1 − − − ... − + Y2 c2 − − − ... −
sxx n n n n n n
)
h c1 c2 cn i
+ . . . + Yn cn − − − ... −
n n n
1 Xn Xn
= Y1 (x1 − x̄) − 1 (xj − x̄) + Y2
1
(x1 − x̄) −
1
(xj − x̄)
sxx n sxx n
j=1 j=1
1 1X
n
+ . . . + Yn (x1 − x̄) − (xj − x̄)
sxx n
j=1
X n 1 X n
= Yi (xi − x̄) − 1 (xj − x̄)
sxx n
i=1 j=1
n
X 1
= Yi (xi − x̄)
sxx
i=1
n
X
= Ui .
i=1
Now, since each Ui is a normal random variable (Yi ) multiplied by a constant, the distribution
for each Ui is given by:
1 2 1 2
Ui ∼ N (βo + β1 xi ) (xi − x̄), σ 2 (xi − x̄) .
sxx sxx
Note that now, β̂1 is a sum of independent, normal random variables, so the distribution for
β̂1 is simply normal, where the mean is the sum of the means and where the variance is the
sum of the variances:
n n
!
X 1 X 1
β̂1 ∼ N (βo + β1 xi ) (xi − x̄), σ 2 2 (xi − x̄)2 .
sxx sxx
i=1 i=1
(b) Before I show that β̂1 is unbiased, first note the following:
n
X
sxx = (xi − x̄)2
i=1
n
X
= (x2i − 2x̄xi + x̄2 )
i=1
n
X n n
1X X
= x2i − 2x̄n xi + x̄2
n
i=1 i=1 i=1
n
X
= x2i − nx̄2
i=1
n
X n
1X
= x2i − nx̄ xi
n
i=1 i=1
n
X
= (x2i − x̄xi )
i=1
n
X
= xi (xi − x̄).
i=1
Now, it can be shown that β̂1 is unbiased by simplifying the expectation of β̂1 as given above:
n
X 1
E[β̂1 ] = (βo + β1 xi ) (xi − x̄)
sxx
i=1
n n
βo X β1 X
= (xi − x̄) + xi (xi − x̄)
sxx sxx
i=1 i=1
βo
= (nx̄ − nx̄) + β1
sxx
= β1 .
(c) The variance can be further simplified by canceling out a factor of sxx :
n
σ2 X σ2
V ar[β̂1 ] = 2 (xi − x̄)2 = .
sxx sxx
i=1
Problem 24.
Yi term:
β̂o = Ȳ − β̂1 x̄
1 1 1 1
= (Y1 + Y2 + . . . + Yn ) − x̄ Y1 (x1 − x̄) + Y2 (x2 − x̄) + . . . + Yn (xn − x̄)
n sxx sxx sxx
1 1 1 1 1 1
= Y1 − x̄ (x1 − x̄) + Y2 − x̄ (x2 − x̄) + . . . + Yn − x̄ (xn − x̄)
n sxx n sxx n sxx
Xn
1 1
= Yi − x̄ (xi − x̄)
n sxx
i=1
n
X
= Ui .
i=1
As in the previous problem, since each Ui is a normal random variable (Yi ) multiplied by a
constant, the distribution for each Ui is given by:
2 !
1 1 1 1
Ui ∼ N (βo + β1 xi ) − x̄ (xi − x̄) , σ 2 − x̄ (xi − x̄) .
n sxx n sxx
Also, as above, β̂o is now a sum of independent, normal random variables, so the distribution
for β̂o is simply normal, where the mean is the sum of the means and where the variance is
the sum of the variances:
n
X Xn 2 !
1 1 2 1 1
β̂o ∼ N (βo + β1 xi ) − x̄ (xi − x̄) , σ − x̄ (xi − x̄) .
n sxx n sxx
i=1 i=1
Pn
(b) To show that β̂o is unbiased, I can simplify E[β̂o ] as found above (using sxx = i=1 xi (xi − x̄)
as found in the previous problem):
n
X
1 1
E[β̂o ] = (βo + β1 xi ) − x̄ (xi − x̄)
n sxx
i=1
n
X n n n
βo βo x̄ X β1 X β1 x̄ X
= − (xi − x̄) + xi − xi (xi − x̄)
n sxx n sxx
i=1 i=1 i=1 i=1
βo x̄
= βo − (nx̄ − nx̄) + β1 x̄ − β1 x̄
sxx
= βo .
Pn 1
(c) For any i = 1, 2, . . . , n, using β̂1 = i=1 Yi sxx (xi − x̄) (as derived in the previous problem),
I have that:
h i X n
1
Cov β̂1 , Yi = Cov Yj (xj − x̄), Yi
sxx
j=1
Xn
1
= Cov Yj (xj − x̄), Yi
sxx
j=1
n
X 1
= (xj − x̄)Cov [Yj , Yi ]
sxx
j=1
1
= (xi − x̄)V ar [Yi , Yi ]
sxx
(xi − x̄)σ 2
= ,
sxx
where in the second to last line I have used the fact that all Yi s are independent, so that
Cov[Yi , Yj ] = 0 for i 6= j.
Pn 1
(d) Again, using β̂1 = i=1 Yi sxx (xi − x̄), I have that:
h i X n
1 1 Xn
Cov β̂1 , Ȳ = Cov Yi (xi − x̄), Yj
sxx n
i=1 j=1
X
1 1
= Cov Yi (xi − x̄), Yj
sxx n
i,j
X (xi − x̄)
= Cov [Yi , Yj ]
nsxx
i,j
Xn
(xi − x̄)
= V ar[Yi ]
nsxx
i=1
n n
!
σ2 X X
= xi − x̄
nsxx
i=1 i=1
σ2
= (nx̄ − nx̄)
nsxx
= 0.
Again, I have used the fact that all Yi s are independent, so that Cov[Yi , Yj ] = 0 for i 6= j.
(e) The variance of β̂o can be further simplified to give the desired result:
n
X 2
1 1
V ar[β̂o ] = σ2 − x̄ (xi − x̄)
n sxx
i=1
Xn
2 1 2x̄ x̄2 2
=σ − (xi − x̄) + 2 (xi − x̄)
n2 nsxx sxx
i=1
" n n
#
1 2x̄ X x̄ 2 X
= σ2 − (xi − x̄) + 2 (xi − x̄)2
n nsxx sxx
i=1 i=1
2
1 2x̄ x̄
= σ2 − (nx̄ − nx̄) +
n nsxx sxx
2
2 sxx + nx̄
=σ
nsxx
Pn 2 2
2 i=1 (xi − x̄) + nx̄
=σ
nsxx
Pn 2 2 2
2 i=1 (xi − 2x̄xi + x̄ ) + nx̄
=σ
nsxx
Pn 2
i=1 xi
= σ2 .
nsxx
Chapter 9
Statistical Inference II: Bayesian Inference
Problem 1. By Bayes' rule, the posterior density is
$$f_{X|Y}(x|2) = \frac{P_{Y|X}(2|x)f_X(x)}{\int P_{Y|X}(2|x)f_X(x)\,dx},$$
where
$$P_{Y|X}(2|x) = x(1-x).$$
I therefore have that:
$$f_{X|Y}(x|2) = \frac{x^2(1-x)^2}{\int_0^1 x^2(1-x)^2\,dx} = 30x^2(1-x)^2$$
for 0 ≤ x ≤ 1 and 0 otherwise. As a sanity check, I made sure the posterior integrates to 1.
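In fact this posterior is just a $Beta(3,3)$ density (the constant 30 is $1/B(3,3)$), which makes the sanity check a one-liner with SciPy:
from scipy.stats import beta
from scipy.integrate import quad

print(beta.pdf(0.3, 3, 3), 30*0.3**2*(1 - 0.3)**2)    #identical values
print(quad(lambda x: 30*x**2*(1 - x)**2, 0, 1)[0])    #the posterior integrates to 1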
Problem 2. From Baye’s rule we know that the posterior density for 0 ≤ x ≤ 1 is:
x
fX|Y (x|5) ∝ PY |X (5|x)fX (x) = 3x3 (1 − x)4 ,
x
(and 0 otherwise) where the symbol ∝ means proportional to as a function of x. Therefore, the
MAP estimate is given by
x̂M AP = arg max{3x3 (1 − x)4 },
x∈[0,1]
which can be found setting the derivative of the argument equal to zero and solving for x:
fXY (x, y) x
fX|Y (x|y) = ∝ fXY (x, y),
fY (y)
x + 23 y 2
fY |X (y|x) = R
x + 32 y 2 dy
2x + 3y 2
=
2x + 1
3y 2 − 1
=1+ ,
2x + 1
and thus we see that, to maximize this expression, if 3y 2 − 1 ≤ 0 we need to minimize the second
term, while if 3y 2 − 1 > 0 we need to maximize it. Therefore:
(
1 for y ≤ √13
x̂M L =
0 otherwise.
fY |X (y|x)fX (x)
fX|Y (x|y) = R
fY |X (y|x)fX (x)dx
x
(xy − 2 + 1)(2x
2 + 13 )
= R1 x
0 (xy − 2 + 1)(2x
2 + 13 )dx
2x3 y + xy 3
3 −x − 6
x
+ 2x2 + 1
3
= 2 2 ,
3y + 3
so that
x̂M = E[X|Y = y]
Z 1
1 4 x2 y 4 x2 3 x
= 2 2 2x y + − x − + 2x + dx
3y + 3 0
3 6 3
46y − 37
= .
60(y + 1)
Problem 5.
(a) First note that since X ∼ N (0, 1), W ∼ N (0, 1) (where X and W are independent), we have
that Y = 2X + W ∼ N (0, 5). Now, aY + bX = a(2X + W ) + bX = (2a + b)X + aW which is
normal for all a, b ∈ R, and thus X and Y are jointly normal. Thus, by Theorem 5.4 in the
book, I have that:
Y − µY
X̂M = E[X|Y ] = µX + ρσX .
σY
√
From the distributions of X and Y , I have that µX = µY = 0, σY = 5, σX = 1 and:
Cov[X, Y ] = Cov[X, 2X + W ]
= 2V ar[X] + Cov[X, W ]
= 2V ar[X] (since X, W are independent)
= 2.
To find Cov[X, Y ], I first solve for E[XY ], again using the law of iterated expectation:
M SE = (1 − ρ2 )V ar[X]
Cov[X, Y ]2
= 1− V ar[X]
V ar[X]V ar[Y ]
!
1 2
6 1
= 1− 1 5 ·
12 · 3
12
1
= .
15
(c) First, note that E[Y 2 ] = V ar[Y ] + (E[Y ])2 = 5/3 + 1 = 8/3. Now, I have that:
Problem 7.
(a) First note that Y ∼ (0, σX 2 + σ 2 ). Also note that aX + bY = (a + b)X + bW which is normally
W
distributed for all a, b ∈ R, and thus X and Y are jointly normal, and so by Theorem 5.4 in
the book:
Y − µY
X̂M = E[X|Y ] = µX + ρσX .
σY
Cov[X, Y ] Y − µY
X̂M = µX + σX
σX σY σY
σ 2
Y
= q X · σX · q
2 + σ2
σX σX σX2 + σ2
W W
2
σX
= 2 2 Y.
σX + σW
where in the second to last line I have used that X and W are independent, and where I have
used that: E[X 2 ] = σX
2 , E[W 2 ] = σ 2 and E[X] = E[W ] = 0.
W
Problem 8. Following along Example 9.9 of the book, I use the principle of orthogonality to solve
for X̂L . I first note that since E[X] = E[W1 ] = E[W2 ] = 0, E[Y1 ] = E[Y2 ] = 0. We would like the
linear MMSE estimator to be of the form:
Now, using the first of the two orthogonality principle equations I can solve for c:
so that c = 0.
Using the second of the two orthogonality principle equations I may solve for the remaining
constants by noting that
and thus:
Cov[X, Yj ] = Cov[X̂L , Yj ] for j = 1, 2.
I now compute the four covariances, and plug them into the above equations (one for j = 1 and
one for j = 2) to obtain a coupled set of equations in a and b.
where I have used the fact that since X, W1 are independent, their covariance is 0. Also,
and
to solve for the linear MMSE estimator, I need to compute the vectors/matrices that go into this
formula. Firstly, it can easily be shown that
E[Y1 ] 0
E[Y ] = = .
E[Y2 ] 0
Problem 10. To solve this problem, I use the vector formula approach as in the previous problem.
It can easily be shown that
E[Y1 ] 0
E[Y ] = E[Y2 ] = 0 .
E[Y3 ] 0
Also, since E[Yi ] = 0 (for i = 1, 2, 3), the covariance matrix of Y is
E[Y12 ] E[Y1 Y2 ] E[Y1 Y3 ]
CY = E[Y2 Y1 ] E[Y22 ] E[Y2 Y3 ]
E[Y3 Y1 ] E[Y3 Y2 ] E[Y32 ].
Therefore, I need to compute all of the pairwise expectations along the diagonal and in the upper
right triangle of the matrix. Notice that all of the pairwise expectations are of the form E[Yi Yj ],
where Yi = ai X + bi Wi , so I derive a general formula for all 6 pairs for easy computation of the
expectations:
where δi,j is the Kronecker-delta function, and where I used independence several times. Using this
formula, it is easy to find that
$$C_Y = \begin{bmatrix} 22 & 10 & 10 \\ 10 & 10 & 5 \\ 10 & 5 & 17 \end{bmatrix}.$$
Since $E[X] = 0$ and $E[Y_i] = 0$ (for $i = 1, 2, 3$), the cross-covariance matrix is also easy to compute:
$$C_{XY} = \begin{bmatrix} E[XY_1] & E[XY_2] & E[XY_3] \end{bmatrix} = \begin{bmatrix} E[X(2X+W_1)] & E[X(X+W_2)] & E[X(X+W_3)] \end{bmatrix} = \begin{bmatrix} 2E[X^2] & E[X^2] & E[X^2] \end{bmatrix} = \begin{bmatrix} 10 & 5 & 5 \end{bmatrix}.$$
Plugging these matrices into the vector formula for $\hat{X}_L$, I finally have:
$$\hat{X}_L = \begin{bmatrix} 10 & 5 & 5 \end{bmatrix}\begin{bmatrix} 22 & 10 & 10 \\ 10 & 10 & 5 \\ 10 & 5 & 17 \end{bmatrix}^{-1}\mathbf{Y} = \frac{60}{149}Y_1 + \frac{12}{149}Y_2 + \frac{5}{149}Y_3.$$
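The coefficients $60/149$, $12/149$ and $5/149$ can be double-checked numerically by solving the corresponding linear system with NumPy:
import numpy as np

C_Y = np.array([[22, 10, 10],
                [10, 10,  5],
                [10,  5, 17]])
C_XY = np.array([10, 5, 5])
a = np.linalg.solve(C_Y, C_XY)   #coefficients of Y1, Y2, Y3 in the linear MMSE estimator
print(a, np.array([60, 12, 5])/149)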
Problem 11.
Cov[X, Y ]
X̂L = (Y − E[Y ]) + E[X],
V ar[Y ]
and so I must compute the various terms in this equation.
The expectations are
3
E[X] = E[Y ] = ,
7
while the covariance is
Plugging these values into the above equation for X̂L , I find that:
3 3
X̂L = − Y + .
4 4
and (
1 for x = 0
PX|Y (x|1) =
0 for x = 1.
This results in the conditional expectations, E[X|Y = 0] = 3/4 and E[X|Y = 1] = 0, which
can be combined into one expression using an indicator random variable:
3
X̂M = 1{Y = 0}.
4
Problem 12.
(a) Following along with the previous problem, I must calculate the terms that go into the formula
for X̂L . The expectations are:
1 1 1
E[X] = + =
3 6 2
and
1 1 2
E[Y ] = + 2 · = ,
3 6 3
1
X̂L = E[X] = .
2
(
1
3 for x = 0
PX|Y (x|0) = 2
3 for x = 1,
(
1 for x = 0
PX|Y (x|1) =
0 for x = 1.
and
(
0 for x = 0
PX|Y (x|1) =
1 for x = 1.
Thus, I have the conditional expectations, E[X|Y = 0] = 2/3, E[X|Y = 1] = 0, and E[X|Y =
2] = 1, which can be combined into one expression using an indicator random variable:
X̂M = E[X|Y ]
2
= 1{Y = 0} + 1{Y = 2}.
3
Ho : X = 1
H1 : X = −1,
where the priors are P (Ho ) = p and P (H1 ) = 1 − p. Note that given Ho , Y = 2 + W so that
Y |Ho ∼ N (2, σ 2 ), and that given H1 , Y = −2 + W so that Y |H1 ∼ N (−2, σ 2 ). The posterior
probability of Ho is thus:
x 1 (y−2)2
P (Ho |Y = y) ∝ fY (y|Ho )P (Ho ) = √ e− 2σ2 p,
2πσ
x 1 (y+2)2
P (H1 |Y = y) ∝ fY (y|H1 )P (H1 ) = √ e− 2σ2 (1 − p).
2πσ
The MAP decision rule for this problem is to accept Ho if the posterior probability under Ho is
greater than or equal to the posterior probability under H1 . In other words, we accept Ho if:
1 (y−2)2 1 (y+2)2
√ e− 2σ2 p ≥ √ e− 2σ2 (1 − p),
2πσ 2πσ
otherwise we accept H1 .
Using some algebra, this rule can be re-written as: if
σ2 1−p
y≥ ln ,
2 p
We can replace “choose H1 ” and “choose Ho ” with our decision rule to compute the conditional
probabilities:
σ2 1 − p
P (choose H1 |Ho ) = P Y < ln X = 1
2 p
σ2 1 − p
= P 2X + W < ln X = 1
2 p
σ2 1−p
=P 2+W < ln
2 p
σ 1−p 2
=Φ ln − .
2 p σ
Likewise, the other conditional probability is given by
σ2 1 − p
P (choose Ho |H1 ) = P Y ≥ ln X = −1
2 p
σ2 1 − p
= P 2X + W ≥ ln X = −1
2 p
σ2 1−p
= P −2 + W ≥ ln
2 p
σ 1−p 2
=1−Φ ln + .
2 p σ
Plugging these conditional probabilities into the formula for the error probability, I find that
σ 1−p 2 σ 1−p 2
Pe = Φ ln − p+ 1−Φ ln + (1 − p).
2 p σ 2 p σ
Problem 15. Using the minimum cost decision method, we should accept Ho if P (Ho |y)C10 ≥
P (H1 |y)C01 . Note that C01 is the cost of choosing Ho (there is no malfunction) given H1 is true
(there is a malfunction). That is, C01 is the cost of missing a malfunction, so that, as specified in
the problem C01 = 30C10 .
The left hand side of the inequality decision rule is therefore:
Since the costs are usually not negative, we see that P (H1 |y)C01 > P (Ho |y)C10 and we thus should
accept H1 , the hypothesis that there is a malfunction.
Problem 16. For X, Y jointly normal, we know from Theorem 5.4 in the book that:
$$X|Y = y \sim N\left(\mu_X + \rho\sigma_X\frac{y - \mu_Y}{\sigma_Y},\; (1-\rho^2)\sigma_X^2\right).$$
Since X ∼ N (2, 1) and Y ∼ N (1, 5), using the above formula, I have that the posterior distribution
is X|Y = 1 ∼ N (2, 15/16).
When choosing a confidence interval, to keep things symmetric, we usually choose the confidence
interval such that α/2 of the probability is in the left tail of the distribution (i.e., P (X < a|Y = 1) =
α/2) and α/2 of the probability is in the right tail of the distribution (i.e., P (X > b|Y = 1) = α/2).
Therefore to find the 90% credible interval, [a, b], I have the following equations:
$$0.05 = \Phi\left(\frac{a-2}{\sqrt{15}/4}\right), \qquad 1 - 0.05 = \Phi\left(\frac{b-2}{\sqrt{15}/4}\right).$$
Solving for a and b by using the inverse Gaussian CDF, I find that the 90% credible interval is
approximately [0.41, 3.6].
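The same interval follows from SciPy's normal quantiles, as a quick numerical check:
from scipy.stats import norm
import numpy as np

print(norm.interval(0.9, loc=2, scale=np.sqrt(15)/4))   #approximately (0.41, 3.6)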
Problem 17.
(a) The posterior distribution can be found using Bayes’ rule. For x > 0 :
x
fX|Y (x|y) ∝ PY |X (y|x)fX (x)
x
∝ e−x xy xα−1 e−βx
= e−x(β+1) xα+y−1
x
∝ (β + 1)α+y e−x(β+1) xα+y−1 ,
x
while for x ≤ 0, fX|Y (x|y) = 0. The notation ∝ means proportional to as a function of x.
Notice that since α > 0 and y ≥ 0, y + α is therefore greater than 0. Also, since β > 0,
1+β > 0. Therefore, this function is exactly of the form of a Gamma(α+y, β+1) distribution,
and so we know that the normalizing constant is Γ(α + y).
(b) In the previous part of the problem, I showed that X|Y = y ∼ Gamma(α + y, β + 1), and
therefore: (
(β+1)α+y xα+y−1 e−x(β+1)
Γ(α+y) for x > 0
fX|Y (x|y) =
0 otherwise.
(c) For U ∼ Gamma(α, λ), as shown in Section 4.2.4 of the book, E[U ] = α/λ and V ar[X] =
α/λ2 . Therefore, I have
α+Y
E[X|Y ] =
β+1
and
α+Y
V ar[X|Y ] = .
(β + 1)2
Problem 18.
(a) The posterior distribution can be found using Bayes’ rule. For 0 ≤ x ≤ 1 :
x
fX|Y (x|y) ∝ PY |X (y|x)fX (x)
x
∝ xy (1 − x)n−y xα−1 (1 − x)β−1
= xα+y−1 (1 − x)β+n−y−1 ,
while for x < 0 and x > 1, fX|Y (x|y) = 0. Now, since α > 0 and y ≥ 0, α + y > 0. Also, since
β > 0 and n ≥ y, β + n − y > 0. Thus, since this equation, up to a normalization constant,
has the exact same functional form as a Beta(α + y, β + n − y) distribution, the posterior is
given by this distribution.
(c) Plugging in my values for α and β into the formulas for the expectation and variance of a
Beta distribution, I find:
α+Y
E[X|Y ] =
α+β+n
and
(α + Y )(β + n − Y )
V ar[X|Y ] = .
(α + β + n)2 (α + β + n + 1)
Problem 19.
Now, the posterior distribution can be found using Bayes’ rule. For 0 ≤ x ≤ 1 :
x
fX|Y (x|y) ∝ PY |X (y|x)fX (x)
x
∝ x(1 − x)y−1 xα−1 (1 − x)β−1
= x(α+1)−1 (1 − x)(β+y−1)−1 ,
while for x < 0 and x > 1, fX|Y (x|y) = 0. Since α > 0, α + 1 > 0. Also, since β > 0,
y − 1 ≥ 0, β + y − 1 > 0. We thus see that up to a normalizing constant, this is the PDF for
a Beta(α + 1, β + y − 1) distribution, and hence X|Y = y ∼ Beta(α + 1, β + y − 1).
(c) Plugging in my values for α and β into the formulas for the expectation and variance of a
Beta distribution, I find:
α+1
E[X|Y ] =
α+β+Y
and
(α + 1)(β + Y − 1)
V ar[X|Y ] = .
(α + β + Y )2 (α + β + Y + 1)
Problem 20.
iid
(a) Since Yi |X = x ∼ Exp(x), I have that
(
xe−xy for y > 0
fYi |X (y|x) =
0 otherwise.
x
(b) Since X ∼ Gamma(α, β), I have that fX (x) ∝ xα−1 e−βx for x > 0 and fX (x) = 0 otherwise.
Therefore, for x > 0,the posterior is:
x
fX|Y1 ,Y2 ,...,Yn (x|y1 , y2 , . . . , yn ) ∝ fY1 ,Y2 ,...,Yn |X (y1 , y2 , . . . , yn |x)fX (x)
= L(Y ; x)fX (x)
x Pn
∝ xn e−x i=1 yi α−1 −βx
x e
Pn
= xα+n−1 e−x( i=1 yi +β )
,
while for x ≤ 0, fX|Y1 ,Y2 ,...,Yn (x|y1 , y2 , . . . , yn ) = 0. Now, since α and n are greater than 0,
so too is α + n. Further since it is assumed that P Yi |X ∼ Exp(X), it is implicit that all yi s
are greater than 0, and since β > 0, so too is ni=1 yi + β. Therefore, we see that, Pn up to a
normalizing constant, the posterior has the functional Pn form of a Gamma(α + n, i=1 yi + β)
distribution, so that X|Y = y ∼ Gamma(α + n, i=1 yi + β).
(d) For U ∼ Gamma(α, λ), as shown in Section 4.2.4 of the book, E[U ] = α/λ and V ar[X] =
α/λ2 . Therefore, I have
α+n
E[X|Y ] = Pn
i=1 Yi + β
and
α+n
V ar[X|Y ] = Pn .
( i=1 Yi + β)2
Chapter 10
Introduction to Simulation Using Python
The Python programming language is one of the most popular languages in both academia and
industry. It is heavily used in data science for simple data analysis and complex machine learning.
By most accounts, in the last few years, Python has eclipsed the R programming language in
popularity for scientific/statistical computation. Its popularity is due to intuitive and readable
syntax that can be implemented in a powerful object oriented programming paradigm, if so desired,
as well as being open source. It is for these reasons that I decided to transcribe the Introduction to
Simulation chapter in Pishro-Nik’s Introduction to Probability, Statistics and Random Processes
book into Python.
This entire chapter was written in a Jupyter notebook, an interactive programming environment,
primarily for Python, that can be run locally in a web browser. Jupyter notebooks are ideal for
quick and interactive data analysis, incorporating markdown functionality for clean presentations
and code sharing. If you are a fan of RStudio, you will most likely be fond of Jupyter notebooks.
This entire notebook is available freely at https://github.com/dsrub/solutions_to_probability_
statistics.
Additionally, much of this code was written using the Numpy/SciPy library, Python’s main library
for scientific computation and numerical methods. Numpy has a relatively clear and well docu-
mented API (https://docs.scipy.org/doc/numpy/reference/index.html), a reference which I utilize
almost daily.
I start with a few basic imports, and define several functions I will use throughout the rest of this
chapter.
#define html style element for notebook formatting
from IPython.core.display import HTML
HTML(notebook_style)
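Throughout this chapter I rely on NumPy and Matplotlib, plus a small helper, print_vals, that draws from a given pseudo-RNG a few times and prints the values; a minimal version of these imports and of the helper (the exact original definitions may differ slightly) is:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

def print_vals(rng_func, *args, n_print=5):
    """
    Draw from the pseudo-RNG rng_func(*args) several times and print the values
    (a minimal sketch of the helper used throughout this chapter)
    """
    for i in range(n_print):
        print('X_{} = {}'.format(i, rng_func(*args)))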
#define a few functions I will be using throughout the rest of the notebook
#plotting function
def plot_results(x, y, xlim=None, ylim=None, xlabel=None, ylabel=None, \
title=None, labels=None):
if labels:
plt.plot(x[0], y[0], label=labels[0], linewidth = 2)
plt.plot(x[1], y[1], label=labels[1], linewidth = 2)
plt.legend(loc='upper right')
else:
plt.plot(x, y, linewidth = 2)
if xlim:
plt.xlim(xlim)
if ylim:
plt.ylim(ylim)
if xlabel:
plt.xlabel(xlabel, size = 15)
if ylabel:
plt.ylabel(ylabel, size = 15)
if title:
plt.title(title, size=15)
plt.xticks(fontsize = 15)
plt.yticks(fontsize = 15);
Example 1. (Bernoulli) Generate a Bern(p) random variable.
Solution: I draw uniform random variables and set X = 1 when U < p and X = 0 otherwise.
def draw_bern(p, N):
    """
    A Bern(p) pseudo-RNG built from uniform draws (a minimal reconstruction
    consistent with how draw_bern is called throughout this chapter)
    """
    U = np.random.uniform(0, 1, size = N)
    X = (U < p).astype(int)
    if N == 1:
        return X[0]
    return X
#print a few examples of the RGNs to the screen
p = 0.5
print_vals(draw_bern, p, 1)
X_0 = 0
X_1 = 0
X_2 = 0
X_3 = 0
X_4 = 1
Note that we can directly sample from a Bern(p) distribution with Numpy’s binomial random
number generator (RNG) by setting n = 1 with: np.random.binomial(1, p).
Example 2. (Coin Toss Simulation) Write code to simulate tossing a fair coin to see how the law
of large numbers works.
Solution: I draw 1000 Bern(0.5) random variables and compute the cumulative average.
#generate data, compute proportion of heads and plot
X = draw_bern(0.5, 1000)
avg = np.cumsum(X)/(np.arange(1000) + 1)
plot_results(np.arange(1000) + 1, avg, xlabel='$N$', ylabel='Proportion of heads')
#reset seed
np.random.seed(0)
Example 3. (Binomial) Generate a Bin(n, p) random variable.
def draw_bin(n, p, N):
    """
    A Bin(n, p) pseudo-RNG: count how many of n uniform draws fall below p
    """
    if N > 1:
        U = np.random.uniform(0, 1, (N, n))
        X = np.sum(U < p, axis = 1)
    else:
        U = np.random.uniform(0, 1, n)
        X = np.sum(U < p)
    return X
#print a few examples of the RGNs to the screen
n = 50
p = 0.2
print_vals(draw_bin, n, p, 1)
X_0 = 8
X_1 = 17
X_2 = 3
X_3 = 13
X_4 = 10
Note that we can directly sample from a Bin(n, p) distribution with Numpy’s binomial RNG with:
np.random.binomial(n, p).
Example 4. Write an algorithm to simulate the value of a random variable X such that:
0.35 for x=1
0.15 for x=2
PX (x) =
0.4 for x=3
0.1 for x = 4.
Solution: We can utilize the algorithm presented in the book which divides the unit
interval into 4 partitioned sets and uses a uniformly drawn random variable.
def draw_general_discrete(P, R_X, N):
"""
A pseudo-RNG for any arbitrary discrete PMF specified by R_X and
corresponding probabilities P
"""
F_X = np.cumsum([0] + P)
X_arr = []
U_arr = np.random.uniform(0, 1, size = N)
for U in U_arr:
X = R_X[np.sum(U > F_X)-1]   #find which partition of [0, 1] the uniform draw falls into
X_arr.append(X)
return X_arr
X_0 = 2
X_1 = 4
X_2 = 3
X_3 = 3
X_4 = 4
Note that we can directly sample from a discrete PMF using Numpy’s multinomial RNG. A multi-
nomial distribution is the k dimensional analogue of a binomial distribution, where k > 2. The
multinomial distribution is a distribution over random vectors, X (of size k), where the entries
in the vectors can take on values from 0, 1, . . . n, subject to X1 + X2 + . . . + Xk = n, where Xi
represents the ith component of X.
If a binomial random variable represents the number of heads we flip out of n coin tosses (where
the probability of heads is p), then a multinomial random variable represents the number of times
we roll a 1, the number of times we roll a 2, . . ., the number of times we roll a k, when rolling
a k sided die n times. For each roll, the probability of rolling the ith face of the die is pi (where
$\sum_{i=1}^k p_i = 1$). We store the number of times we roll the $i$th face of the die in $X_i$. To
denote a random vector drawn from a multinomial distribution, the notation $X \sim Mult(n, p)$ is
typical, where $p$ denotes the $k$-dimensional vector with the $i$th component of $p$ given by $p_i$.
To directly sample from a discrete PMF with (ordered) range array R_X and associated prob-
ability array P we can use Numpy’s multinomial RNG function by setting n = 1 (one roll).
To sample one time we can use the code: X = R_X[np.argmax(np.random.multinomial(1,
pvals=P))], and to sample N times, we can use the code: X = [R_X[np.argmax(x)] for x in
np.random.multinomial(1, pvals=P, size=N)].
Additionally, to sample from an arbitrary discrete PMF, we can also use Numpy’s choice function,
which samples randomly from a specified list, where each entry in the list is sampled according to a
specified probability. To sample N values from an array R_X, with corresponding probability array
P, we can use the code: X = np.random.choice(R_X, size=N, replace=True, p=P). Make sure
to specify replace=True to sample with replacement.
Example 5. (Exponential) Generate an Exp(1) random variable.
Solution: Using the method of inverse transformation, as shown in the book, for a
strictly increasing CDF, F , the random variable X = F −1 (U ), where U ∼ U nif (0, 1),
has distribution X ∼ F . Therefore, it is not difficult to show that,
$$-\frac{1}{\lambda}\ln(U) \sim Exp(\lambda),$$
def draw_exp(lam, N):
    """
    An Exp(lambda) pseudo-RNG using the method of inverse transformation
    (a minimal reconstruction consistent with its usage below)
    """
    U = np.random.uniform(0, 1, size = N)
    X = (-1/lam)*np.log(U)
    if N == 1:
        return X[0]
    return X
#print a few examples of the RGNs to the screen
lam = 1
print_vals(draw_exp, lam, 1)
X_0 = 2.4838379957
X_1 = 0.593858616083
X_2 = 0.53703944167
X_3 = 0.0388069650697
X_4 = 1.23049637556
Note that we can directly sample from an Exp(λ) distribution with Numpy’s exponential RNG
with: np.random.exponential(lam).
Example 6. (Gamma) Generate a Gamma(20, 1) random variable.
Solution: If X1 , X2 , . . . , Xn are drawn iid from an Exp(λ) distribution, then Y = X1 +
X2 + . . . + Xn ∼ Gamma(n, λ). Therefore, to generate a Gamma(n, λ) random variable,
we need only to generate n independent Exp(λ) random variables and add them.
def draw_gamma(alpha, lam, N):
"""
A Gamma(n, lambda) pseudo-RNG using the method of inverse transformation
"""
n = alpha
if N > 1:
U = np.random.uniform(0, 1, size = (N, n))
X = np.sum((-1/lam)*np.log(U), axis = 1)
else:
U = np.random.uniform(0, 1, size = n)
X = np.sum((-1/lam)*np.log(U))
return X
#print a few examples of the RGNs to the screen
alpha = 20
lam = 1
print_vals(draw_gamma, alpha, lam, 1)
X_0 = 17.4925086879
X_1 = 20.6155480241
X_2 = 26.9115218192
X_3 = 22.3654600391
X_4 = 22.331744631
Note that we can directly sample from a Gamma(n, λ) distribution with Numpy’s gamma RNG
with: np.random.gamma(shape, scale).
Example 7. (Poisson) Generate a Poisson random variable. Hint: In this example, use the fact
that the number of events in the interval [0, t] has Poisson distribution when the elapsed times
between the events are Exponential.
Solution: As shown in the book, we need only to continuously generate Exp(λ) variables
and count the number of draws it takes for the sum to be greater than 1. The Poisson
random variable is then the count minus 1.
def draw_poiss(lam, N):
"""
A Poiss(lambda) pseudo-RNG
"""
X_list = []
for _ in range(N):
summ = 0
count = 0
while summ <= 1:
summ += draw_exp(lam, 1)
count += 1
X_list.append(count-1)
if N == 1:
return X_list[0]
else:
return X_list
X_0 = 0
X_1 = 2
X_2 = 2
X_3 = 1
X_4 = 2
Note that we can directly sample from a P oiss(λ) distributions with Numpy’s: np.random.poisson(lam)
function.
Example 8. (Box-Muller) Generate 5000 pairs of normal random variables and plot both his-
tograms.
Solution: Using the Box-Muller transformation as described in the book:
def draw_gaus_pairs(N):
"""
An N(0, 1) pseudo-RNG to draw N pairs of indepedent using the Box-Muller
transformation
"""
U1 = np.random.uniform(size = N)
U2 = np.random.uniform(size = N)
Z1 = np.sqrt(-2*np.log(U1))*np.cos(2*np.pi*U2)
Z2 = np.sqrt(-2*np.log(U1))*np.sin(2*np.pi*U2)
return Z1, Z2
#generate data
Z1_arr, Z2_arr = draw_gaus_pairs(5000)
#plot histograms
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(Z1_arr, bins = 50)
ax2.hist(Z2_arr, bins = 50);
#reset seed
np.random.seed(0)
Note that we can directly sample from a N (0, 1) distribution with Numpy’s normal RNG with:
np.random.randn(d0, d1, ..., dn), where d0, d1, …, dn are the dimensions of the desired output
array.
Exercise 1. Write Python programs to generate Geom(p) and P ascal(m, p) random variables.
Solution: As in the book, I generate Bern(p) random variables until the first success
and count the number of draws to generate a Geom(p) random variable. To generate
a P ascal(m, p) random variable, I generate Bern(p) random variables until I obtain m
successes and count the number of draws.
def draw_geom(p, N):
"""
A Geom(p) pseudo-RNG
"""
X_list = []
for _ in range(N):
count = 0
X = 0
while X == 0:
X = draw_bern(p, 1)
count += 1
X_list.append(count)
if N == 1:
return X_list[0]
else:
return X_list
X_0 = 15
X_1 = 1
X_2 = 1
X_3 = 8
X_4 = 2
def draw_pascal(m, p, N):
"""
A Pascal(m, p) pseudo-RNG
"""
X_list = []
for _ in range(N):
count_succ = 0
count = 0
while count_succ < m:
X = draw_bern(p, 1)
count_succ += X
count += 1
X_list.append(count)
if N == 1:
return X_list[0]
else:
return X_list
X_0 = 17
X_1 = 10
X_2 = 7
X_3 = 3
X_4 = 4
Note that we can directly sample from Geom(p) and P ascal(m, p) distributions with Numpy’s
np.random.geometric(p) and np.random.negative_binomial(n, p) functions respectively.
Exercise 2. (Poisson) Use the algorithm for generating discrete random variables to obtain a
Poisson random variable with parameter λ = 2.
Solution:
from scipy.special import factorial   #scipy.misc.factorial was removed in newer SciPy versions
def draw_poiss_discrete(lam, N):
"""
A Poiss(lambda) pseudo-RNG using the algorithm for general discrete random variables
"""
X_list = []
for _ in range(N):
P = np.exp(-lam)
i = 0
U = np.random.uniform()
while U >= P:
i += 1
P += np.exp(-lam)*lam**i/(factorial(i)+0)
X_list.append(i)
if N == 1:
return X_list[0]
else:
return X_list
X_0 = 3
X_1 = 0
X_2 = 5
X_3 = 2
X_4 = 5
Exercise 3. Explain how to generate a random variable with the density
$$f(x) = 2.5\,x\sqrt{x}.$$
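Assuming the density is supported on [0, 1] (so that it integrates to 1), the CDF is $F(x) = x^{5/2}$ and inverse transformation gives $X = U^{2/5}$; a sketch of such a generator (the name draw_dist3 is arbitrary) is:
def draw_dist3(N):
    """
    A pseudo-RNG for the density f(x) = 2.5*x*sqrt(x) on [0, 1], via inverse
    transformation: F(x) = x^(5/2), so X = U^(2/5)
    """
    U = np.random.uniform(size = N)
    X = U**(2/5)
    if N == 1:
        return X[0]
    return X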
X_0 = 0.8178201131579468
X_1 = 0.8861754700680049
X_2 = 0.27369087549414306
X_3 = 0.6033871249144047
X_4 = 0.4285059109745954
Exercise 4. Use the inverse transformation method to generate a random variable having distribution function
F_X(x) = (x^2 + x)/2,
for 0 ≤ x ≤ 1.
Solution: Setting u = F_X(x) = (x^2 + x)/2 and solving the quadratic x^2 + x − 2u = 0 for the root in [0, 1], we find
F_X^{-1}(u) = −1/2 + √(1/4 + 2u),
for 0 ≤ u ≤ 1.
def draw_dist4():
    """
    A pseudo-RNG for the distribution in Exercise 4
    """
    U = np.random.uniform()
    return -0.5 + np.sqrt(0.25 + 2*U)
X_0 = 0.417758353296
X_1 = 0.198180089883
X_2 = 0.441257859881
X_3 = 0.538521058539
X_4 = 0.115056902
Exercise 5. Let X have a standard Cauchy distribution, i.e., with distribution function
F_X(x) = (1/π) arctan x + 1/2.
Assuming you have U ∼ Unif(0, 1), explain how to generate X. Then, use this result to produce
1000 samples of X and compute the sample mean. Repeat the experiment 100 times. What do you
observe and why?
Solution: The inverse CDF is given by F_X^{-1}(x) = tan[π(x − 1/2)].
def draw_stand_cauchy(N):
    """
    A standard Cauchy pseudo-RNG using the method of inverse transformation
    """
    U = np.random.uniform(size = N)
    X = np.tan(np.pi*(U - 1/2))
    if N == 1: return X[0]
    else: return X
X_0 = 0.691013110859
X_1 = 0.212342443875
X_2 = -0.907695727473
X_3 = 0.0731660554841
X_4 = -3.28946953204
#plot means for 100 trials
means = [np.mean(draw_stand_cauchy(1000)) for _ in range(100)]
plt.plot(means, 'o')   #(plotting details assumed; the original cell is only partially shown)
#reset seed
np.random.seed(0)
We see that the means vary wildly from trial to trial. This is because the Cauchy distribution
has no mean: its expectation is undefined, so the sample means have nothing to converge to.
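To make the last claim concrete (this short calculation is mine, not part of the original solution), the defining integral for E[X] diverges:
∫_{−∞}^{∞} |x| / (π(1 + x^2)) dx = (2/π) ∫_0^{∞} x / (1 + x^2) dx = (1/π) lim_{t→∞} ln(1 + t^2) = ∞,
so E[X] does not exist for the Cauchy distribution.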
Exercise 6. (The Rejection Method) When we use the Inverse Transformation Method, we need
a simple form of the CDF, F(x), that allows direct computation of X = F^{-1}(U). When F(x)
doesn't have a simple form but the PDF, f(x), is available, random variables with density f(x)
can be generated by the rejection method. Suppose you have a method for generating a random
variable having density function g(x). Now, assume you want to generate a random variable having
density function f(x). Let c be a constant such that f(y)/g(y) ≤ c (for all y). Show that the
following method generates a random variable, X, with density function f(x).
1) Initialize U and Y such that U > f(Y)/(cg(Y)).
2) Generate Y with density g(y).
3) Generate U ∼ Unif(0, 1), independently of Y; repeat steps 2)–3) until U ≤ f(Y)/(cg(Y)).
4) Set X = Y.
Solution:
Firstly, as a technical matter, note that c ≥ 1, which can be shown by integrating both sides of
f(y) ≤ cg(y): doing so gives 1 = ∫ f(y) dy ≤ c ∫ g(y) dy = c.
We see that this algorithm keeps iterating until it outputs a random variable Y for which we know
that U ≤ f(Y)/(cg(Y)). Therefore, the goal is to show that the random variable Y | U ≤ f(Y)/(cg(Y))
has PDF f(y) (or, equivalently, CDF F(y)). In other words, we must show that
P(Y ≤ y | U ≤ f(Y)/(cg(Y))) = F(y).
I show this with Bayes' rule:
P(Y ≤ y | U ≤ f(Y)/(cg(Y))) = P(U ≤ f(Y)/(cg(Y)) | Y ≤ y) P(Y ≤ y) / P(U ≤ f(Y)/(cg(Y)))
                            = P(U ≤ f(Y)/(cg(Y)) | Y ≤ y) G(y) / P(U ≤ f(Y)/(cg(Y))).
Thus, we must calculate the quantities P(U ≤ f(Y)/(cg(Y)) | Y ≤ y) and P(U ≤ f(Y)/(cg(Y))). I start
by conditioning on an exact value of Y:
P(U ≤ f(Y)/(cg(Y)) | Y = y) = P(U ≤ f(y)/(cg(y)) | Y = y)
                            = P(U ≤ f(y)/(cg(y)))
                            = F_U(f(y)/(cg(y)))
                            = f(y)/(cg(y)),
where in the second line I have used the fact that U and Y are independent and in the fourth I have used
the fact that for a uniform distribution F_U(u) = u on [0, 1]. Notice that the requirement that f(y)/g(y) ≤ c
(for all y) is crucial at this step. This is because f(y)/g(y) ≤ c implies c > 0 (since f(y) and g(y)
are positive), so that 0 < f(y)/(cg(y)) ≤ 1. If this condition did not hold, then the above expression
would be min{1, f(y)/(cg(y))} for positive c and 0 for negative c, which would interfere with the rest of
the derivation.
I may now calculate P(U ≤ f(Y)/(cg(Y))):
P(U ≤ f(Y)/(cg(Y))) = ∫_{−∞}^{∞} P(U ≤ f(Y)/(cg(Y)) | Y = y) g(y) dy
                    = ∫_{−∞}^{∞} [f(y)/(cg(y))] g(y) dy
                    = (1/c) ∫_{−∞}^{∞} f(y) dy
                    = 1/c.
Next, I calculate P(U ≤ f(Y)/(cg(Y)) | Y ≤ y):
P(U ≤ f(Y)/(cg(Y)) | Y ≤ y) = P(U ≤ f(Y)/(cg(Y)), Y ≤ y) / G(y)
                            = [∫_{−∞}^{∞} P(U ≤ f(Y)/(cg(Y)), Y ≤ y | Y = v) g(v) dv] / G(y)
                            = [∫_{−∞}^{∞} P(U ≤ f(Y)/(cg(Y)) | Y ≤ y, Y = v) P(Y ≤ y | Y = v) g(v) dv] / G(y),
where in the second line I have used the law of total probability, and in the third line I have used
the definition of conditional probability. Note that
P(Y ≤ y | Y = v) = 1 for v ≤ y and 0 for v > y,
and thus
P(U ≤ f(Y)/(cg(Y)) | Y ≤ y) = [∫_{−∞}^{y} P(U ≤ f(Y)/(cg(Y)) | Y ≤ y, Y = v) g(v) dv] / G(y)
                            = [∫_{−∞}^{y} P(U ≤ f(Y)/(cg(Y)) | Y = v) g(v) dv] / G(y)
                            = [∫_{−∞}^{y} (f(v)/(cg(v))) g(v) dv] / G(y)
                            = (1/c) F(y) / G(y),
where in the second line I have used the fact that conditioning on Y = v already implies that Y ≤ y,
since we only consider values of v less than or equal to y in the integration, and in the third line I have
used the expression for P(U ≤ f(Y)/(cg(Y)) | Y = y) derived above.
Putting everything together:
P(Y ≤ y | U ≤ f(Y)/(cg(Y))) = P(U ≤ f(Y)/(cg(Y)) | Y ≤ y) G(y) / P(U ≤ f(Y)/(cg(Y)))
                            = [(1/c) F(y) / G(y)] G(y) / (1/c)
                            = F(y),
which is exactly what we wanted to show.
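To connect the proof to code, here is a minimal generic sketch of the accept/reject loop just analyzed (the helper rejection_sample is my own illustration, not part of the original solutions; the concrete samplers below follow the same pattern):
import numpy as np

def rejection_sample(f, g_pdf, g_sampler, c):
    """
    Draw one sample with density f using the rejection method:
    propose Y with density g and accept when U <= f(Y)/(c*g(Y))
    """
    while True:
        Y = g_sampler()                     #step 2: propose Y with density g
        U = np.random.uniform()             #step 3: uniform acceptance variable
        if U <= f(Y)/(c*g_pdf(Y)):          #accept with probability f(Y)/(c*g(Y))
            return Y                        #step 4: X = Y
Since P(U ≤ f(Y)/(cg(Y))) = 1/c, each pass through the loop accepts with probability 1/c, so the expected number of proposals per accepted sample is c; this is why a smaller c makes the sampler faster.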
The rejection method is used next to generate a Beta(2, 4) random variable, taking g(x) to be the
Unif(0, 1) PDF. Since f(x)/g(x) (where f(x) is the PDF of the Beta and g(x) is the PDF of the uniform)
needs to be smaller than c for all x in the support of these distributions, a fine value of c to use would be
2.5, since it is evident from a plot of the two densities that this value satisfies the requirement. The book
uses the smallest possible value of c, i.e., the maximum of the Beta(2, 4) density, which it derives
analytically and finds to be 135/64 ≈ 2.11. It is not necessary to use the smallest value of c, but doing so
will certainly help the speed of the algorithm, since the algorithm only stops when U ≤ f(Y)/(cg(Y)). I
will stick with the value of 2.5 just to illustrate that the algorithm works for this value as well.
def draw_beta_2_4(N):
    """
    A Beta(2, 4) pseudo-RNG using the rejection method
    """
    c = 2.5
    X_list = []
    for _ in range(N):
        U = 1
        f_Y = 0
        g_Y = 1
        #propose Y from Unif(0, 1) and accept when U <= f(Y)/(c*g(Y))
        while U > f_Y/(c*g_Y):
            Y = np.random.uniform()
            U = np.random.uniform()
            f_Y = 20*Y*(1-Y)**3
            g_Y = 1
        X_list.append(Y)
    if N == 1:
        return X_list[0]
    else:
        return X_list
X_0 = 0.4236547993389047
X_1 = 0.07103605819788694
X_2 = 0.11827442586893322
X_3 = 0.5218483217500717
X_4 = 0.26455561210462697
Note that we can directly sample from a Beta(α, β) distribution with Numpy’s beta RNG with:
np.random.beta(a, b).
Exercise 8. Use the rejection method to generate a random variable having the Gamma(5/2, 1)
density function. Hint: Assume g(x) is the PDF of the Gamma(1, 2/5).
Solution: Note that there is a mistake in the phrasing of the question in the book:
the PDF for g(x) should be Gamma(1, 2/5), not Gamma(5/2, 1). Also note that we
cannot use the method of Example 6 here, since in this case α is not an integer
(however, we can use that method to draw from g(x)). I first visualize these
distributions to get a handle on what we are dealing with.
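The visualization cell is not reproduced in this extraction; a minimal sketch of what it could look like, together with a numerical approximation of max f(x)/g(x) (comparable to the value printed below), is:
import numpy as np
import matplotlib.pyplot as plt

#sketch: plot f (Gamma(5/2, 1)) and g (Gamma(1, 2/5)) and bound f/g numerically
x = np.linspace(0.01, 20, 2000)
f_x = (4/(3*np.sqrt(np.pi)))*x**1.5*np.exp(-x)   #Gamma(5/2, 1) PDF
g_x = 0.4*np.exp(-0.4*x)                         #Gamma(1, 2/5) = Exp(0.4) PDF
plt.plot(x, f_x, label='f: Gamma(5/2, 1)')
plt.plot(x, g_x, label='g: Gamma(1, 2/5)')
plt.legend()
print(np.max(f_x/g_x))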
1.6587150033103788
As a sanity check, this numerically computed maximum of f(x)/g(x) is very close to the analytically
derived value in the book, which is
(10/(3√π)) (5/2)^{3/2} e^{−3/2} ≈ 1.6587162.
Therefore, I set the value of c to 1.7 and use the function I wrote in Example 6,
draw_gamma(alpha, lam, N), to draw from g(x).
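For completeness, the maximization behind that analytic constant (a short derivation consistent with the two densities above; it is not copied verbatim from the book) is
f(x)/g(x) = [(4/(3√π)) x^{3/2} e^{−x}] / [0.4 e^{−0.4x}] = (10/(3√π)) x^{3/2} e^{−0.6x},
and setting the derivative of x^{3/2} e^{−0.6x} to zero gives (3/2) x^{1/2} = 0.6 x^{3/2}, i.e., x = 5/2, so the smallest admissible constant is c = (10/(3√π)) (5/2)^{3/2} e^{−3/2} ≈ 1.6587.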
def draw_gamma_2(alpha, lam, N):
    """
    A Gamma(5/2, 1) pseudo-RNG using the rejection method
    (alpha and lam are unused and kept only to match the original signature)
    """
    c = 1.7
    X_list = []
    for _ in range(N):
        U = 1
        f_Y = 0
        g_Y = 1
        #propose Y from g = Gamma(1, 2/5) = Exp(0.4) and accept when U <= f(Y)/(c*g(Y))
        while U > f_Y/(c*g_Y):
            Y = draw_gamma(1, 0.4, 1)
            U = np.random.uniform()
            f_Y = (4/(3*np.sqrt(np.pi)))*(Y**1.5)*np.exp(-Y)
            g_Y = 0.4*np.exp(-0.4*Y)
        X_list.append(Y)
    if N == 1:
        return X_list[0]
    else:
        return X_list
X_0 = 1.96233211971
X_1 = 1.22716649756
X_2 = 2.55754781375
X_3 = 0.900161721137
X_4 = 3.89706921546
Exercise 9. Use the rejection method to generate a standard normal random variable. Hint:
Assume g(x) is the PDF of the exponential distribution with λ = 1.
Solution: As in the book, to solve this problem, I use the rejection method (with
g(x) = e^{−x}, the Exp(1) PDF) to sample from the half-Gaussian density
f(x) = (2/√(2π)) e^{−x^2/2},  x ≥ 0,
and then attach a random sign: with Q ∼ Bern(1/2), the quantity Y(1 − 2Q) equals ±Y
with equal probability, which turns a half-Gaussian Y into a standard normal random variable.
def draw_stand_norm(N):
    """
    An N(0, 1) pseudo-RNG using the rejection method
    (the function name and the value of c are assumptions; the corresponding
    lines are missing from the extracted source)
    """
    c = 1.32   #any c >= max f(y)/g(y) = sqrt(2e/pi) ~ 1.3155 works
    X_list = []
    for _ in range(N):
        U = 1
        f_Y = 0
        g_Y = 1
        #propose Y from Exp(1) and accept when U <= f(Y)/(c*g(Y))
        while U > f_Y/(c*g_Y):
            Y = draw_exp(1, 1)
            U = np.random.uniform()
            f_Y = (2/np.sqrt(2*np.pi))*np.exp(-(Y**2)/2)
            g_Y = np.exp(-Y)
        Q = draw_bern(0.5, 1)   #random sign
        X_list.append(Y*(1-2*Q))
    if N == 1:
        return X_list[0]
    else:
        return X_list
X_0 = 1.1538237197
X_1 = -2.28234324111
X_2 = -0.426012274543
X_3 = -1.40884434358
X_4 = -0.421092193245
Exercise 10. Use the rejection method to generate a Gamma(2, 1) random variable conditional on
its value being greater than 5. Hint: Assume g(x) is the density function of the exponential distribution.
Solution: As in the book, I use an Exp(0.5) distribution conditioned on X > 5 as the
distribution for g(x). It is not difficult to show, by integrating the PDF of this distribution,
that G^{-1}(x) = 5 − 2 ln(1 − x) (where G is the CDF). I therefore use the method of inverse
transformation to first draw a random variable from this distribution (Y). Note that
for U ∼ Unif(0, 1), 1 − U ∼ Unif(0, 1), and therefore the formula for G^{-1}(U) can be
simplified to 5 − 2 ln(U). I then use the rejection method to sample from the desired
distribution. By maximizing f(x)/g(x), the book shows that c must be greater than
5/3, and I therefore use c = 1.7.
def draw_gamma_2_1_cond_5(N):
    """
    A Gamma(2, 1) conditional on X>5 pseudo-RNG using the rejection method
    """
    c = 1.7
    X_list = []
    for _ in range(N):
        U = 1
        f_Y = 0
        g_Y = 1
        #propose Y from Exp(0.5) conditioned on Y>5, via Y = 5 - 2*ln(U),
        #and accept when U <= f(Y)/(c*g(Y))
        while U > f_Y/(c*g_Y):
            Y = 5 - 2*np.log(np.random.uniform())
            U = np.random.uniform()
            f_Y = Y*np.exp(5-Y)/6
            g_Y = np.exp((5-Y)/2)/2
        X_list.append(Y)
    if N == 1:
        return X_list[0]
    else:
        return X_list
X_0 = 6.76250850879
X_1 = 5.73497460514
X_2 = 5.14665551227
X_3 = 5.8087003199
X_4 = 5.66723645483
Notice that, as required, the random variables are all > 5.
As a final check to close this chapter, I draw samples from most of the RNG functions that I
implemented above, compute the corresponding PMFs/PDFs, and compare to the theoretical dis-
tributions. I first check the discrete distributions, and I start by writing a function that will compute
the empirical PMFs. Note that the phrase, “empirical PMF”, (and “empirical PDF”) is standard
terminology to refer to the probability distribution associated with a sample of data. Formally, for
a collection of data, {x_i}_{i=1}^N, they are given by
P_X(x) = (1/N) Σ_{i=1}^N I(x = x_i)
for the empirical PMF, and
f_X(x) = (1/N) Σ_{i=1}^N δ(x − x_i)
for the empirical PDF (where I(·) is the indicator function, and δ(·) is the Dirac delta function).
def compute_PMFs(counts, xrange):
    """
    Compute empirical PMFs from a specified array of random variables,
    and a specified range
    """
    count_arr = []
    xrange2 = range(np.max([np.max(xrange), np.max(counts)])+1)
    for i in xrange2:
        count_arr.append(np.sum(counts==i))
    pmf = np.array(count_arr)/np.sum(np.array(count_arr))
    return pmf[np.min(xrange):np.max(xrange)+1]
I now compute the theoretical distributions, generate the data and compute the empirical distribu-
tions.
from scipy.stats import bernoulli, binom, poisson, geom, nbinom
#set seed for reproducibility
np.random.seed(1984)
#compare the theoretical and empirical PMFs (assumes ax_arr, x_ranges, numpy_y,
#my_y and legend_loc were defined in an earlier setup cell)
for i, ax in enumerate(ax_arr):
    ax.plot(x_ranges[i], numpy_y[i], 'bo', ms=8, label='Theoretical Dist.', \
            alpha=.8)
    ax.vlines(x_ranges[i], 0, numpy_y[i], colors='b', lw=5, alpha=0.5)
    ax.plot(x_ranges[i], my_y[i], 'o', ms=8, label='Empirical Dist.', \
            color='green', alpha=.8)
    ax.legend(loc=legend_loc[i], fontsize=11)
    if i in [3, 4, 5]:
        ax.set_xlabel('$X$', size = 15)
    if i in [0, 3]:
        ax.set_ylabel('PMF', size = 15)
We see that the empirical distributions match almost perfectly with the theoretical distributions,
with even better correspondence for larger N .
I now check some of the continuous RNG functions that I implemented in this chapter. I first start
by computing the theoretical distributions and generating the data.
from scipy.stats import expon, gamma, cauchy, beta, norm
#reset seed
np.random.seed(0)
I now plot normalized histograms of the data and compare to the theoretical distributions. Again,
we see almost perfect correspondence between the empirical and theoretical distributions. The
correspondence becomes even better with larger values of N .
#plot theoretical and empirical PDFs
names = ['Exp(1) (inverse trans.)', 'Gamma(20, 1) (inverse trans.)', \
'Cauchy(0, 1) (inverse trans.)', 'Beta(2, 4) (rejection)', \
'Gamma(5/2, 1) (rejection)', 'N(0, 1) (rejection)']
bin_arr = [50, 35, 60, 45, 45, 35]
xlims=[(0, 8), (0, 50), (-20, 20), (0, 1), (0, 15), (-5, 5)]
range_arr = [None]*6
range_arr[2] = (-20, 20)
#plot the theoretical PDFs against normalized histograms of the generated data
#(assumes ax_arr, x_ranges, numpy_y and my_rvs were defined in an earlier setup cell;
#density=True makes each histogram integrate to one)
for i, ax in enumerate(ax_arr):
    ax.plot(x_ranges[i], numpy_y[i], label='Theoretical Dist.', color='black', \
            linewidth=3, alpha=.7)
    ax.hist(my_rvs[i], bins=bin_arr[i], alpha=.5, edgecolor='black', density=True, \
            label='Empirical Dist.', range=range_arr[i])
    if i in [3, 4, 5]:
        ax.set_xlabel('$X$', size = 15)
    if i in [0, 3]:
        ax.set_ylabel('PDF', size = 15)
Chapter 11
Recursive Methods
Problem 1.
(a)
x^2 − 2x + 3/4 = 0,
which has roots 1/2 and 3/2, and therefore:
a_n = α(3/2)^n + β(1/2)^n.
(b)
x^2 − 4x + 4 = 0,
which can be factored into (x − 2)^2 = 0. The polynomial thus has one root, x = 2, with a
multiplicity of 2, and therefore:
a_n = α2^n + βn2^n.
Using the initial conditions a_0 = 2 and a_1 = 6 leads to α = 2 and β = 1. Thus, the solution
to the recurrence equation is:
a_n = 2^{n+1} + n2^n.
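As a quick numerical check of this closed form (my own sketch, not part of the original solution): the characteristic equation x^2 − 4x + 4 = 0 corresponds to the recurrence a_n = 4a_{n−1} − 4a_{n−2}, so we can compare the two directly:
#check a_n = 2**(n+1) + n*2**n against the recurrence with a_0 = 2, a_1 = 6
a = [2, 6]
for n in range(2, 10):
    a.append(4*a[-1] - 4*a[-2])
print(all(a[n] == 2**(n+1) + n*2**n for n in range(10)))   #prints True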
Problem 2.
(a) Let A_{n,k} be the event of observing exactly k heads out of n coin tosses, and let H denote the
event that the last coin toss is a heads. By conditioning on the last coin toss I obtain:
P(A_{n,k}) = P(A_{n,k} | H) p + P(A_{n,k} | H^c)(1 − p) = P(A_{n−1,k−1}) p + P(A_{n−1,k})(1 − p),
where the equality follows because if the last coin toss is heads, then we need exactly k − 1
heads from the first n − 1 tosses, and if the last coin toss is tails, then we need exactly k
heads from the first n − 1 tosses. Converting this to the notation used in the problem:
a_{n+1,k+1} = a_{n,k} p + a_{n,k+1}(1 − p).
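As a small sketch (mine, not part of the original solutions), the recursion can be checked against the binomial PMF P(A_{n,k}) = C(n, k) p^k (1 − p)^{n−k}, starting from a_{0,0} = 1 and treating out-of-range entries as 0:
import numpy as np
from scipy.stats import binom

p = 0.3
N = 10
a = [[1.0]]   #a_{0,0} = 1: zero heads in zero tosses
for n in range(1, N+1):
    prev = a[-1]
    row = []
    for k in range(n+1):
        heads_last = prev[k-1] if k >= 1 else 0.0    #last toss is heads
        tails_last = prev[k] if k <= n-1 else 0.0    #last toss is tails
        row.append(heads_last*p + tails_last*(1-p))
    a.append(row)
print(np.allclose(a[N], binom.pmf(range(N+1), N, p)))   #prints True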
Problem 3. Let A be the desired event and let q = 1 − p be the probability of tails. To solve this
problem, I first condition on whether the first toss is a heads or tails, and then condition the heads
branch on the second toss:
P(A) = P(A|H) p + P(A|T) q,    P(A|H) = P(A|HH) p + P(A|HT) q,
where P(A|HH) = 1 since if you flip 2 consecutive heads, the experiment is done, and where
P(A|HT) = P(A|T), since the first heads does not matter because we are interested in 2 consecutive
heads, and 1 isolated heads does not get us any closer to the event A.
To help solve for P(A|T), I also condition on the second toss:
P(A|T) = P(A|TH) p + P(A|TT) q,
where P(A|TT) = 0 since if you flip 2 consecutive tails, the experiment is done (and the desired event
did not occur) and where P(A|TH) = P(A|H), for essentially the same reason that P(A|HT) =
P(A|T) as described above.
I now re-express these 3 equations in slightly more readable notation:
a = a^H p + a^T q
a^H = p + a^T q
a^T = a^H p,
and we see that we have a system of 3 equations with 3 unknowns. Solving for a and plugging
q = 1 − p back in I find that:
a = p^2(2 − p) / (1 − p(1 − p)).
As a check, we know that in the limit that p goes to 1, this expression should evaluate to unity (we
definitely get HH before TT), and in the limit that p goes to 0 this expression should evaluate to
0 (we definitely get TT before HH). Indeed, it is easy to check that this expression satisfies these
2 limits.
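A brief simulation of "two consecutive heads occur before two consecutive tails" (my own sketch, not part of the original solution) agrees with the formula:
import numpy as np

def hh_before_tt(p, n_trials=100000, seed=0):
    """Estimate the probability that HH appears before TT in i.i.d. coin flips."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_trials):
        prev = None
        while True:
            toss = 'H' if rng.random() < p else 'T'
            if toss == prev:          #two consecutive equal tosses end the experiment
                wins += (toss == 'H')
                break
            prev = toss
    return wins/n_trials

p = 0.6
print(hh_before_tt(p), p**2*(2-p)/(1 - p*(1-p)))   #the two numbers should be close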
Problem 4. Let A_n be the event that the number of heads out of n tosses is divisible by 3, let p be
the probability of heads and q = 1 − p be the probability of tails. To solve this problem recursively,
I first condition on whether the first toss is a heads or tails:
P(A_n) = P(A_n|H) p + P(A_n|T) q.
Here, P(A_n|T) = P(A_{n−1}) since if the first toss is a tails, as in the sequence below, we have observed
no heads,
T  . . .
n  n−1  n−2  . . .  1
so that the experiment just starts over with one less total flip (n − 1), and is equivalent to the sequence
below:
. . .
n−1  n−2  . . .  1
The numbers below each sequence (n, n − 1, n − 2, . . .) show the number of flips remaining before you
make that particular flip.
To solve for P(A_n|H), I condition on the second toss:
P(A_n|H) = P(A_n|HH) p + P(A_n|HT) q.
Here, P(A_n|HT) = P(A_{n−1}|H) since the probability of A_n for the sequence below,
H  T  . . .
n  n−1  n−2  . . .  1
is the same as the probability of A_n for a sequence that starts with 1 heads and has n − 1 flips:
H  . . .
n−1  n−2  . . .  1
Similarly, to solve for P(A_n|HH), I condition on the third toss:
P(A_n|HH) = P(A_n|HHH) p + P(A_n|HHT) q.
Here, P(A_n|HHT) = P(A_{n−1}|HH) since the probability of A_n for the sequence below,
H  H  T  . . .
n  n−1  n−2  n−3  . . .  1
is the same as the probability of A_n for a sequence that starts with 2 heads and has n − 1 flips:
H  H  . . .
n−1  n−2  n−3  . . .  1
Also, P(A_n|HHH) = P(A_{n−3}) since, if we have already gotten 3 heads in the first 3 flips, then the
probability that the number of heads flipped in the whole sequence is divisible by 3 is the same as the
probability that the number of heads in the remaining n − 3 flips is divisible by 3.
I summarize this set of recursive equations in somewhat more readable notation:
a_n^HH = a_{n−3} p + a_{n−1}^HH q
a_n^H = a_n^HH p + a_{n−1}^H q
a_n = a_n^H p + a_{n−1} q.
We see that the equations are a coupled set of recursive equations, so they must be solved simultaneously:
first solve for a_n^HH, then use this to solve for a_n^H, then use this to solve for a_n, iterating
until we reach the desired value of n. In order to do this, we will need several initial conditions,
which we can easily compute by hand. For a sequence with n = 1, the number of heads is
divisible by 3 if we throw one tails (probability q). For a sequence with n = 2, the number of heads
is divisible by 3 if we throw two tails (probability q^2). For a sequence with n = 3, the number of
heads is divisible by 3 if we throw 3 tails or 3 heads (probability p^3 + q^3):
a_1 = q
a_2 = q^2
a_3 = p^3 + q^3.
For a sequence that starts with 1 head and with n = 1, the number of heads is never divisible by
3 (probability 0). For a sequence that starts with 1 head and with n = 2, the number of heads is
never divisible by 3 (probability 0). For a sequence that starts with 1 head and with n = 3, the
number of heads is only divisible by 3 if we throw 2 heads after the first (probability p^2):
a_1^H = 0
a_2^H = 0
a_3^H = p^2.
Finally, for a sequence that starts with 2 heads and with n = 1, the number of heads is never
divisible by 3 (probability 0). For a sequence that starts with 2 heads and with n = 2, the number
of heads is never divisible by 3 (probability 0). For a sequence that starts with 2 heads and with
n = 3, the number of heads is only divisible by 3 if we throw 1 head after the first 2 (probability
p):
a_1^HH = 0
a_2^HH = 0
a_3^HH = p.
I can check this coupled set of recursive equations by recognizing that we can compute P (An )
directly using the binomial distribution and only summing over the number of successes which are
divisible by 3. This can be written as:
P(A_n) = Σ_{k=0}^{⌊n/3⌋} (n choose 3k) p^{3k} q^{n−3k}.
I wrote a python function (below) to compute P (An ) using both methods. I compute P (An ) for a
range of n, for several values of p, and plot P (An ) calculated recursively against P (An ) calculated
with the binomial distribution as well as the 45 degree line in Fig. 11.1. If there is perfect agreement
between the 2 methods, the points should lie along this line, and indeed this is exactly what we
see.
import numpy as np
from scipy.special import binom

def compute_binom_recur(p, N):
    """
    Compute probability that the number of heads out of N
    coin flips (each of probability p) is divisible by 3.
    Returns the probability computed from the binomial
    distribution and the probability computed from recursion.
    """
    #compute the probability from the binomial
    P_arr = []
    for n in range(1, N):
        P = np.sum(np.array([binom(n, 3*k)*p**(3*k)*(1-p)**(n-3*k) \
                             for k in range(0, int(np.floor(n/3)+1))]))
        P_arr.append(P)
    #compute the probability from recursion
    q = 1-p
    #initialize recursion
    an_HH = [0, 0, p]
    an_H = [0, 0, p**2]
    an = [q, q**2, p**3+q**3]
    for _ in range(N-4):
        #recursion update equations
        an_HH_new = an[-3]*p + an_HH[-1]*q
        an_H_new = an_HH_new*p + an_H[-1]*q
        an_new = an_H_new*p + an[-1]*q
        an_HH.append(an_HH_new)
        an_H.append(an_H_new)
        an.append(an_new)
    return (P_arr, an)

Figure 11.1: Comparison of P(A_n) calculated recursively against P(A_n) calculated with the binomial
distribution (Problem 4). (Three panels, for p = 0.2, 0.5 and 0.8, each plotting a_n from the recursion
against a_n from the binomial formula, together with the 45 degree line.)
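A short usage sketch (not part of the original solutions) that performs the comparison of Fig. 11.1 numerically rather than graphically:
#compare the two methods for a few values of p (uses compute_binom_recur above)
for p in [0.2, 0.5, 0.8]:
    P_arr, an = compute_binom_recur(p, 50)
    print(p, np.allclose(P_arr, an))   #prints True when the recursion matches the binomial sum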