
ST202: Probability and Distribution Theory
Tay Meshkinyar
Dr. Milt Mavrakakis

The London School of Economics and Political Science

2021-2022
Contents

1 Introduction 4

2 Probability 5
2.1 Week 1: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 A Pair of Dice . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 [a bit of] Measure Theory . . . . . . . . . . . . . . . . . . 5
2.2 Week 1: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 The Probability Measure . . . . . . . . . . . . . . . . . . 7
2.2.2 More Properties of Probability Measures . . . . . . . . . . 8
2.2.3 Sample Problems . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Week 2: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Discrete Tools . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Conditional Probability . . . . . . . . . . . . . . . . . . . 12
2.3.3 Bayes’ Rule . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.4 The Law of Total Probability . . . . . . . . . . . . . . . . 13
2.4 Week 2: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.1 Independence . . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Random Variables & Univariate Distributions 16


3.1 Week 2: Lecture 2 (continued) . . . . . . . . . . . . . . . . . . . 16
3.1.1 The Random Variable . . . . . . . . . . . . . . . . . . . . 16
3.2 Week 3: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Examples of Random Variables . . . . . . . . . . . . . . . 17
3.2.2 The Cumulative Distribution Function . . . . . . . . . . . 17
3.3 Week 3: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.3.1 Types of Random Variables . . . . . . . . . . . . . . . . . 20
3.3.2 Some Distributions . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Week 4: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.4.1 A Distribution of Emails . . . . . . . . . . . . . . . . . . . 22
3.4.2 Discrete Uniform Distribution . . . . . . . . . . . . . . . . 23
3.4.3 Continuous Random Variables . . . . . . . . . . . . . . . 23
3.5 Week 4: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 26


3.5.1 Some Continuous Distributions . . . . . . . . . . . . . . . 26


3.5.2 Expectation, Variance, and Moments . . . . . . . . . . . . 27
3.6 Week 5: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.6.1 Markov Inequality . . . . . . . . . . . . . . . . . . . . . . 29
3.6.2 Jensen Inequality . . . . . . . . . . . . . . . . . . . . . . . 31
3.6.3 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Week 5: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7.1 Moment-Generating Function . . . . . . . . . . . . . . . . 33
3.7.2 Cumulant-Generating function . . . . . . . . . . . . . . . 36
3.8 Week 6: Reading Week . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9 Week 7: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9.1 Functions of Random Variables . . . . . . . . . . . . . . . 38
3.10 Week 7: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.10.1 Location-scale transformation . . . . . . . . . . . . . . . . 42
3.10.2 Sequences of Random Variables & Convergence . . . . . . 42
3.10.3 The Borel-Cantelli Lemmas . . . . . . . . . . . . . . . . . 44

4 Multivariate Distributions 46
4.1 Week 8: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.1 Joint CDFs and PDFs . . . . . . . . . . . . . . . . . . . . 46
4.2 Week 8: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Bivariate Density . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Multiple Random Variables . . . . . . . . . . . . . . . . . 49
4.2.3 Covariance and Correlation . . . . . . . . . . . . . . . . . 50
4.3 Week 9: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.1 Joint Moments . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.2 Joint MGFs . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.3 Joint CGFs . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4 Week 9: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4.1 Independent Random Variables . . . . . . . . . . . . . . . 55
4.4.2 Random Vectors & Random Matrices . . . . . . . . . . . 56
4.4.3 Transformations of Random Variables . . . . . . . . . . . 58
4.5 Week 10: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.1 Sums of Random Variables . . . . . . . . . . . . . . . . . 58
4.5.2 Multivariate Normal Distributions . . . . . . . . . . . . . 61

5 Conditional Distributions 63
5.1 Week 10: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.1 Another Deck of Cards . . . . . . . . . . . . . . . . . . . . 63
5.1.2 Conditional Mass and Density . . . . . . . . . . . . . . . 63
5.2 Week 11: Lecture 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2.1 Conditional Expectation . . . . . . . . . . . . . . . . . . . 66


5.2.2 Law of Iterated Expectations . . . . . . . . . . . . . . . . . 66


5.2.3 Properties of Conditional Expectation . . . . . . . . . . . 68
5.2.4 Law of Iterated Variance . . . . . . . . . . . . . . . . . . . 68
5.3 Week 11: Lecture 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.3.1 Conditional Moment Generating Function . . . . . . . . . 69
5.3.2 Some Practical Applications . . . . . . . . . . . . . . . . . 70

Conclusion 73

Chapter 1

Introduction

Welcome to my transcribed set of lecture notes for ST202: Probability, Distribution Theory, and Inference. This document uses an edited version of the theme used in Gilles Castel's differential geometry notes. Much of the workflow used to write these notes was ported from his lightning-fast, elegant setup on Linux. Check out his GitHub here, as well as his personal website. You can also find the most up-to-date version of these notes here. This chapter serves mainly for theme consistency, and to match the numbering of the course textbook. The course thus begins with Chapter 2.

Chapter 2

Probability

2.1 Week 1: Lecture 1


2.1.1 A Pair of Dice Tue 28 Sep 14:00

Example 2.1.1. Roll two dice. What is the probability that the sum is > 10? There are three
favourable outcomes:
(5, 6), (6, 5), (6, 6).

There are 36 total outcomes. Then the probability is 3/36 = 1/12. ⋄
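As a quick sanity check (a short Python enumeration, not part of the original notes), we can list all 36 outcomes and count the favourable ones:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two dice.
outcomes = list(product(range(1, 7), repeat=2))
# Outcomes whose sum exceeds 10.
favourable = [o for o in outcomes if sum(o) > 10]
prob = Fraction(len(favourable), len(outcomes))
print(favourable)  # [(5, 6), (6, 5), (6, 6)]
print(prob)        # 1/12
```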

Definition 2.1.2. The sample space Ω is the collection of every possible


outcome. An outcome ω is an element of the sample space (ω ∈ Ω).

Definition 2.1.3. An event A is a set of possible outcomes in Ω (A ⊆ Ω).

2.1.2 [a bit of ] Measure Theory


Let ψ be a set and G be a collection of subsets of ψ. Note that if A ∈ G, then
A ⊆ ψ.

Definition 2.1.4. A measure is a function m : G → R+ such that

i. m(A) ≥ 0 for all A ∈ G,

ii. m(∅) = 0,
iii. if A1, A2, . . . ∈ G are disjoint, then m(⋃_{i=1}^∞ Ai) = ∑_{i=1}^∞ m(Ai).


Definition 2.1.5. A set G is a σ-algebra on ψ if

i. ∅ ∈ G,

ii. if A ∈ G then Ac ∈ G,

iii. if A1, A2, A3, . . . ∈ G then

⋃_{i=1}^∞ Ai = A1 ∪ A2 ∪ A3 ∪ · · · ∈ G.

Definition 2.1.6. Let ψ be a set, G a σ-algebra on ψ, and m a measure on G. The space (ψ, G) is a measurable space. The space (ψ, G, m) is a measure space.

Example 2.1.7. Let ψ be a set. The set {∅, ψ} is the smallest σ-algebra on ψ.

Suppose that |ψ| > 1, and let A be a non-empty proper subset of ψ (∅ ≠ A ⊂ ψ). Then {∅, A, Ac, ψ} is the smallest non-trivial σ-algebra. ⋄


Example 2.1.8. The σ-algebra G = {A : A ⊆ ψ} = P(ψ) is the power set of ψ. Hence, if

ψ = {ω1, ω2, . . . , ωk},

then |G| = 2^|ψ| = 2^k. ⋄


Example 2.1.9. Is m(A) = |A|, i.e., the number of elements of A, a well-defined
measure?

i.) m(A) ≥ 0? ✓

ii.) m(∅) = 0? ✓

iii.) if A1, A2, . . . are disjoint,

m(⋃_{i=1}^∞ Ai) = |⋃_{i=1}^∞ Ai| = ∑_{i=1}^∞ |Ai| = ∑_{i=1}^∞ m(Ai). ✓

2.2 Week 1: Lecture 2


Wed 29 Sep 10:00


Let (ψ, G) be a measurable space, with ω ∈ ψ and A ∈ G. Consider

m(A) = 1_A(ω) = 1 if ω ∈ A, and 0 if ω ∉ A.

Exercise. Check that this is a measure! (on the problem set).

2.2.1 The Probability Measure

Definition 2.2.1. Consider the measurable space (Ω, F). Define (Ω, F, P )
as a probability space. The function P is a probability measure that
satisfies P (A) ∈ [0, 1] for all A ∈ F and P (Ω) = 1.

Since P is a measure,

• P (A) ≥ 0 for all A ∈ F,

• P (∅) = 0,

• P (A ∪ B) = P (A) + P (B) if A ∩ B = ∅ (mutually exclusive).

In general, if A, B ∈ F, do we have A ∩ B ∈ F? Observe that, by De Morgan,

A ∩ B = (Ac ∪ B c)^c,

and a σ-algebra is closed under complements and countable unions. So yes! We do.

In general, if A1, A2, A3, · · · ∈ F are mutually exclusive, then

P(⋃_{i=1}^∞ Ai) = ∑_{i=1}^∞ P(Ai).

Basic Properties of Probability Measures

i. P (Ac ) = 1 − P (A)

ii. If A ⊆ B, then P (B \ A) = P (B) − P (A)

iii. P (A ∪ B) = P (A) + P (B) − P (A ∩ B)

Example proof (the remaining proofs are left as an exercise):

Proof. i. A, Ac are disjoint, and thus A ∪ Ac = Ω, but P (A ∪ Ac ) =


P (A) + P (Ac ).

Corollary 2.2.2. If A ⊆ B, then P (A) ≤ P (B).


General Addition Rule:

P(⋃_{i=1}^n Ai) = ∑_{i=1}^n P(Ai) − ∑_{i<j} P(Ai ∩ Aj) + ∑_{i<j<k} P(Ai ∩ Aj ∩ Ak) − · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ · · · ∩ An).
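The general addition rule can be verified mechanically for concrete events under a uniform measure on a finite Ω (a small Python sketch; the particular sets below are arbitrary):

```python
from fractions import Fraction
from itertools import combinations

omega = set(range(1, 13))  # a finite sample space

def P(A):
    # Uniform probability measure: P(A) = |A| / |Omega|.
    return Fraction(len(A), len(omega))

events = [{1, 2, 3, 4}, {3, 4, 5, 6}, {4, 6, 7, 8}]

# Inclusion-exclusion: alternating sum over all non-empty subcollections.
rhs = Fraction(0)
for k in range(1, len(events) + 1):
    for sub in combinations(events, k):
        rhs += (-1) ** (k + 1) * P(set.intersection(*sub))

print(P(set.union(*events)) == rhs)  # True
```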

2.2.2 More Properties of Probability Measures

Theorem 2.2.3 (Boole’s Inequality). If (Ω, F, P) is a probability space and A1, A2, A3, · · · ∈ F, then:

P(⋃_{i=1}^∞ Ai) ≤ ∑_{i=1}^∞ P(Ai).

Proof. Define

B1 = A1
B2 = A2 \ B1
B3 = A3 \ (B1 ∪ B2)
. . .
Bi = Ai \ (B1 ∪ · · · ∪ Bi−1).

Then B1, B2, B3, · · · ∈ F (confirm this!). They are disjoint, ⋃_{i=1}^∞ Bi = ⋃_{i=1}^∞ Ai, and Bi ⊆ Ai, so P(Bi) ≤ P(Ai). So,

P(⋃_{i=1}^∞ Ai) = P(⋃_{i=1}^∞ Bi) = ∑_{i=1}^∞ P(Bi) ≤ ∑_{i=1}^∞ P(Ai).

Proposition 2.2.4. If A1, A2, A3, . . . is an increasing sequence of sets, A1 ⊆ A2 ⊆ A3 ⊆ . . . , then lim_{n→∞} P(An) = P(⋃_{i=1}^∞ Ai).

The following figure may come in handy:


Figure 2.1: Visual representation of the increasing sets A1 ⊆ A2 ⊆ A3 ⊆ A4, drawn as nested regions.

Proof. Define

B1 = A1
B2 = A2 \ A1
. . .
Bi = Ai \ Ai−1
. . .

Note that these events are mutually exclusive, and so An = ⋃_{i=1}^n Bi. Moreover, ⋃_{i=1}^∞ Bi = ⋃_{i=1}^∞ Ai. Hence,

lim_{n→∞} P(An) = lim_{n→∞} P(⋃_{i=1}^n Bi)
= lim_{n→∞} ∑_{i=1}^n P(Bi)
= ∑_{i=1}^∞ P(Bi)
= P(⋃_{i=1}^∞ Bi)
= P(⋃_{i=1}^∞ Ai).

We will use P(A) = |A|/|Ω| for A ∈ F, under the assumptions that each outcome is equally likely and that the sample space is finite.


2.2.3 Sample Problems


1) Lottery: choose 6 numbers from {1, 2, . . . , 59}. What is the probability of
matching 6 numbers?

2) Birthdays: 100 people in this lecture. What is the probability that at least
two share a birthday?

Note. Read how the multiplication rule applies to permutations and combinations.
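Both sample problems reduce to counting; a short Python computation (assuming 365 equally likely birthdays and independence, assumptions not spelled out in the notes) gives the numbers:

```python
import math

# 1) Lottery: one winning combination out of C(59, 6).
p_lottery = 1 / math.comb(59, 6)

# 2) Birthdays: complement of "all 100 birthdays distinct",
#    assuming 365 equally likely, independent birthdays.
p_shared = 1 - math.perm(365, 100) / 365**100

print(p_lottery)  # about 2.2e-08
print(p_shared)   # very close to 1
```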

2.3 Week 2: Lecture 1


2.3.1 Discrete Tools Tue 4 Oct 14:00

Let A be an event in a σ-algebra F, and let P(A) = |A|/|Ω| be a probability measure. Note that if we can break the experiment we are interested in into k subexperiments Ωi ⊆ Ω, then the multiplication rule dictates

|Ω| = |Ω1| × |Ω2| × · · · × |Ωk|.

Permutations

Definition 2.3.1. Take n distinct objects, and choose k of them to be put


in a specific order. A permutation refers to one such ordering.

We can find the number of possible permutations of size k using the multiplication rule (n choices for the 1st position, n − 1 for the 2nd, . . . , n − k + 1 for the kth):

n × (n − 1) × · · · × (n − k + 1) = n(n − 1) · · · 1 / [(n − k)(n − k − 1) · · · 1] = n!/(n − k)! = nPk.

Combinations

Definition 2.3.2. Take n distinct objects, and choose k of them, but do not put them in order. A combination refers to one such group. The number of combinations of size k is denoted nCk.

Note that a permutation can be represented as

(choose k objects out of n) × (put these k in order) = nCk × k!.

Hence,

nPk = nCk × k!  ⟹  nCk = nPk/k! = n!/((n − k)! k!).

Notation. We also denote combinations by (n k) (read “n choose k”), also referred to as the binomial coefficient.

In general,

(a + b)^n = ∑_{j=0}^n (n j) a^j b^{n−j}.

But why?
Remark 2.3.3. Take n objects, k of type I and n − k of type II. Put these n objects in order. How many possible ways of ordering them? We have (n k).

Again, why? Think of it this way. Suppose there are n slots. We can put an object of type I or II in each slot. The order within each type doesn’t matter, so there are (n k) ways to choose the k slots that receive type I.

But how does this relate to the binomial theorem? First note that

(a + b)^n = (a + b)(a + b)(a + b) · · · (a + b)   (n times).

Now consider the form of each term of the expansion: a^j b^{n−j}. Each term can be thought of as one combination of slots. We "choose" a or b from each factor of the product, and multiply the choices together to get a term of the form a^j b^{n−j}. By the above, (n j) is how many such terms are constructed for each j.
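The slot-counting argument above is exactly the binomial theorem, which we can confirm numerically for sample values (the particular a, b, n are arbitrary):

```python
from math import comb

a, b, n = 3, 5, 7
# Sum of C(n, j) * a^j * b^(n-j) over j = 0, ..., n.
expansion = sum(comb(n, j) * a**j * b ** (n - j) for j in range(n + 1))
print(expansion == (a + b) ** n)  # True
```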


Example 2.3.4. Let {1, 2, . . . , 59} be a set of numbers. Choose 6 without replacement. What is the probability that I match all 6 if I draw at random? Two ways to solve this:

1. Order Matters: suppose that we consider every permutation of drawings. Then

|Ω| = 59P6 = 59!/53!.

Now let A be the event that we match all 6 numbers, regardless of order. Then |A| = 6!, and

P(A) = |A|/|Ω| = 6! 53!/59!.


2. Order Doesn’t Matter: suppose that we consider every combination of drawings. Then

|Ω| = 59C6 = 59!/(53! 6!).

Let A be the event that we match all 6 numbers. Then |A| = 1, because there is only one combination that fits the criteria of A. Then

P(A) = 1/59C6 = 6! 53!/59!,

the same answer as before. ⋄
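Both counting arguments can be checked in a couple of lines with exact rational arithmetic:

```python
from fractions import Fraction
from math import comb, factorial, perm

# Order matters: |A| = 6! matching orderings, |Omega| = 59 P 6.
p_ordered = Fraction(factorial(6), perm(59, 6))
# Order doesn't matter: |A| = 1, |Omega| = 59 C 6.
p_unordered = Fraction(1, comb(59, 6))

print(p_ordered == p_unordered)  # True
```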

2.3.2 Conditional Probability


Let (Ω, F, P) be a probability space. Let B ∈ F with P(B) > 0. Define a new probability measure PB such that, for A ∈ F,

PB(A) = P(A | B) = P(A ∩ B)/P(B).

Note that if P(A) = |A|/|Ω|, then

P(A | B) = (|A ∩ B|/|Ω|)/(|B|/|Ω|) = |A ∩ B|/|B|.

So,

P(Ac | B) = 1 − P(A | B).

If A1, A2, · · · ∈ F with P(A1 ∩ · · · ∩ An−1) > 0, then

P(An ∩ · · · ∩ A1) = P(An | An−1 ∩ · · · ∩ A1) P(An−1 | An−2 ∩ · · · ∩ A1) · · · P(A1).

2.3.3 Bayes’ Rule


Let F be a σ-algebra. For two events A, B ∈ F with P(B) > 0,

P(A | B) = P(A ∩ B)/P(B)
= P(B | A) P(A)/P(B)
= P(A) × P(B | A)/P(B).

We refer to P(A) as the prior and the ratio P(B | A)/P(B) as the Bayes factor.


2.3.4 The Law of Total Probability

Definition 2.3.5. A partition of Ω is a collection of events {B1, B2, . . . } such that

i. P(Bi) > 0 for all i,

ii. ⋃_{i=1}^∞ Bi = Ω (collectively exhaustive),

iii. Bi ∩ Bj = ∅ for all i ≠ j (pairwise mutually exclusive).

Theorem 2.3.6 (Law of Total Probability). Let the set {B1, B2, . . . } be a partition of Ω. Then for any A ∈ F,

P(A) = ∑_{i=1}^∞ P(A ∩ Bi) = ∑_{i=1}^∞ P(A | Bi) P(Bi).

2.4 Week 2: Lecture 2


Wed 6 Oct 10:00
Example 2.4.1. Suppose that 1.2% of live births lead to twins. Further suppose that 1/3 of twin births are identical twins, and 2/3 are fraternal. We can describe each of these events with the outcomes and their associated probabilities below:

identical (prob. 1/3): BB, GG, each with probability 1/2;
fraternal (prob. 2/3): BB, GG, BG, GB, each with probability 1/4.

Define the events T : twins, I : identical twins, F : fraternal twins, M : twin boys. We now work out each of their associated probabilities.


Figure 2.2: A probability tree representing this situation: from the root, T occurs with probability 0.012; T splits into I (1/3) and F (2/3); within I, M and M c each have probability 1/2, and within F, M has probability 1/4.

By multiplying along the paths of each event, we can obtain the probabilities of the events I, F, M, and P(F | M):

P(I) = P(I | T) P(T) = (1/3) × 0.012 = 0.004
P(F) = P(F | T) P(T) = (2/3) × 0.012 = 0.008
P(M) = (1/4) × (2/3) × 0.012 + (1/2) × (1/3) × 0.012 = 0.004
P(F | M) = P(M | F) P(F)/P(M) = (1/4 × 0.008)/0.004 = 1/2.
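The same path-multiplication can be reproduced with exact fractions (a small Python check of the tree computations):

```python
from fractions import Fraction as F

p_T = F(12, 1000)          # P(T) = 0.012
p_I = F(1, 3) * p_T        # identical twins
p_Fr = F(2, 3) * p_T       # fraternal twins
# Twin boys: 1/2 of identical pairs, 1/4 of fraternal pairs.
p_M = F(1, 2) * p_I + F(1, 4) * p_Fr
p_Fr_given_M = (F(1, 4) * p_Fr) / p_M

print(p_I, p_Fr, p_M)      # 1/250 1/125 1/250
print(p_Fr_given_M)        # 1/2
```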

2.4.1 Independence

Let A, B ∈ F. If A and B are independent, then

P(A | B) = P(A)  ⟹  P(A ∩ B)/P(B) = P(A),

which, in turn, implies our definition of independence:

Definition 2.4.2. A and B are independent, written A ⊥⊥ B, if P(A ∩ B) = P(A)P(B).

What if B = ∅? Then P(B) = 0, but also P(A ∩ B) = 0, so the definition still holds. The definition also implies the following:

(i.) if A ⊥⊥ B and P(B) > 0, then P(A | B) = P(A);

(ii.) if A ⊥⊥ B, then Ac ⊥⊥ B, A ⊥⊥ B c, and Ac ⊥⊥ B c.

Let A1, A2, A3, . . . , An ∈ F. When do we say that these are independent?

Definition 2.4.3. (1) The events {A1, . . . , An} are pairwise independent if

P(Ai ∩ Aj) = P(Ai)P(Aj) for all i ≠ j.

(2) The events {A1, . . . , An} are (mutually) independent if, for every subcollection {Ai1, . . . , Aik} of at least two events,

P(Ai1 ∩ · · · ∩ Aik) = P(Ai1) · · · P(Aik).

Chapter 3

Random Variables &


Univariate Distributions

3.1 Week 2: Lecture 2 (continued)


3.1.1 The Random Variable Wed 6 Oct 10:00

What is a random variable? Informally, it is a numerical quantity that takes different values with different probabilities. Its value is determined by the outcome of experiments.
Example 3.1.1. Consider the twin example from before. Let X represent the number of girls from a given birth. We can map each outcome to some value of X:

BB ↦ 0,  BG ↦ 1,  GB ↦ 1,  GG ↦ 2.
More formally we can say that X is a function, that is, X : Ω → R, where


for ω ∈ Ω, X(ω) ∈ R. Then

P(X = 1) = P({ω ∈ Ω : X(ω) = 1}) = P({BG, GB}) = 2/4 = 1/2
P(X > 0) = P({ω ∈ Ω : X(ω) > 0}) = P({BG, GB, GG}) = 3/4.

Definition 3.1.2. Let Ω be a sample space and E be a measurable space.


For our purposes, let E = R. A random variable is a function X : Ω → E
with the property that, if Ax = {ω ∈ Ω : X(ω) ≤ x}, then Ax ∈ F for all
x ∈ R.

3.2 Week 3: Lecture 1


3.2.1 Examples of Random Variables Tue 12 Oct 14:00

Example 3.2.1. Let X be a random variable. For x = 2, we have A2 ∈ F, so


we can write P (A2 ) = P (X ≤ 2). ⋄
Example 3.2.2. Suppose there is a family with two children. Let X be the number of girls. Then

P(A0) = P({BB}) = 1/4
P(A1) = P({BB, BG, GB}) = 3/4
P(A3/2) = P(A1) = 3/4
P(A2) = P(Ω) = 1
P(A−1) = P({}) = 0
P(Aπ) = P(Ω) = 1.

3.2.2 The Cumulative Distribution Function

Definition 3.2.3. A random variable X is positive if X(ω) ≥ 0 for all


ω ∈ Ω.


Definition 3.2.4. The cumulative distribution function (CDF) of a random variable X is the function FX : R → [0, 1] given by FX(x) = P(X ≤ x) = P(Ax).

Example 3.2.5. Let X be a random variable and FX be a valid CDF. Then

P (A1 ) = P (X ≤ 1) = FX (1).


Example 3.2.6. In our two-child example,

FX(0) = 1/4, FX(1) = 3/4, FX(3/2) = 3/4, FX(2) = 1, FX(−1) = 0, . . .

Moreover, note that the CDF in this case is a step function, as seen in the figure below.

Figure 3.1: The cumulative distribution function of the two-child example: a step function equal to 0 for x < 0, 1/4 on [0, 1), 3/4 on [1, 2), and 1 for x ≥ 2.

Definition 3.2.7. A function g : R → R is right-continuous if g(x+) = g(x) for all x ∈ R, where

g(x+) = lim_{h↓0} g(x + h),  and  g(x−) = lim_{h↓0} g(x − h).


Proposition 3.2.8. If FX is a CDF, then

i. FX is increasing, i.e., if x < y then FX(x) ≤ FX(y).

ii. FX is right-continuous, i.e., FX(x+) = FX(x) for all x ∈ R.

iii. lim_{x→−∞} FX(x) = 0 and lim_{x→∞} FX(x) = 1.

Proof. i. If x < y, then Ax ⊆ Ay, so

FX(x) = P(Ax) ≤ P(Ay) = FX(y).

ii. Take a decreasing sequence {xn} such that xn ↓ x as n → ∞ (x1 ≥ x2 ≥ x3 ≥ · · · ). We have

Ax1 ⊇ Ax2 ⊇ · · ·

and Ax ⊆ Axn for every n. So

Ax = ⋂_{n=1}^∞ Axn.

Then

lim_{n→∞} FX(xn) = lim_{n→∞} P(Axn) = P(⋂_{n∈N} Axn) = P(Ax) = FX(x),

so lim_{h↓0} FX(x + h) = FX(x).

iii. In the M&P textbook.

Some basic properties of CDFs

Observe that

• P (X > x) = 1 − P (X ≤ x) = 1 − FX (x)

• P (x < X ≤ y) = FX (y) − FX (x)

• P (X < x) = FX (x−)

• P (X = x) = FX (x) − FX (x−).


3.3 Week 3: Lecture 2


3.3.1 Types of Random Variables Wed 13 Oct 10:00

Some examples of random variable types:

• Discrete: hurricanes (0,1,2,3,. . . )

• Continuous: javelin throw distance

• Continuous model for discrete situation: average salary

• Neither discrete nor continuous: queuing time

• Neither discrete nor continuous for discrete situation: yearly income

These types are not as clear cut as we may believe.

Definition 3.3.1. The support of a non-negative function g : R → [0, ∞)


is the subset of R where g is strictly positive.

Notation.

(i) X ∼ Poisson(λ), X ∼ N(µ, σ²)

(ii) X ∼ FX (a CDF)

Example 3.3.2. Recall the prior two child example. Our discrete random vari-
able was in the form of a step function. ⋄

Definition 3.3.3. X is a discrete random variable if and only if it takes


values in {x1 , x2 , x3 , . . . } ⊂ R.

Definition 3.3.4. The probability mass function (PMF) of a discrete random variable X is the function fX : R → [0, 1] where fX(x) = P(X = x).

In our example: fX(0) = 1/4, fX(1) = 1/2, fX(2) = 1/4, and fX(x) = 0 for all other x. Hence, {0, 1, 2} is the support.



Proposition 3.3.5. For valid PMF fX and valid CDF FX,

(i.) fX(x) = FX(x) − FX(x−), or

P(X = x) = P(X ≤ x) − P(X < x).

(ii.) FX(x) = ∑_{u∈R, u≤x} fX(u), or

P(X ≤ x) = ∑_{u:u≤x} P(X = u).

3.3.2 Some Distributions


Binomial Distribution

We obtain a binomial distribution with the following:

• Repeat an experiment n times.

• Each time, declare one of two outcomes: success or failure.

• Every trial is independent.

• P(“success”) = p on every repetition.

Define X as the number of successes. Let X ∼ Bin(n, p), where n is the number of trials and p is the probability of success. The PMF of X is

fX(x) = (n x) p^x (1 − p)^{n−x} for x = 0, 1, . . . , n,

and fX(x) = 0 otherwise. Check that fX is a valid PMF:

∑_{x=0}^n fX(x) = (p + (1 − p))^n = 1^n = 1. ✓

Example 3.3.6. For our previous example, where X is the number of girls, X ∼ Bin(2, 1/2). ⋄
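A short numerical check that the Bin(n, p) PMF really sums to 1, and that Bin(2, 1/2) reproduces the two-child probabilities (the sample parameters n = 10, p = 0.3 are arbitrary):

```python
from math import comb

def binom_pmf(x, n, p):
    # PMF of Bin(n, p) at x.
    return comb(n, x) * p**x * (1 - p) ** (n - x)

total = sum(binom_pmf(x, 10, 0.3) for x in range(11))
print(total)                                     # 1.0 up to rounding
print([binom_pmf(x, 2, 0.5) for x in range(3)])  # [0.25, 0.5, 0.25]
```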


Bernoulli Distribution

X ∼ Bernoulli(p) is the same as Bin(1, p).

Geometric Distribution

Same (Bernoulli) setup as Binomial, but:

• the number of trials is not fixed


• we repeat the experiment until first success

Let Y be the number of trials required. Then Y ∼ Geo(p), with

fY(y) = P(Y = y) = (1 − p)^{y−1} p, for y = 1, 2, . . .

Check the validity of fY:

∑_{y=1}^∞ fY(y) = p/(1 − (1 − p)) = p/p = 1.

Sometimes we instead consider Y*, the number of failures before the first success. Note that Y* = Y − 1.
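The geometric-series computation can be confirmed numerically: the partial sums of the Geo(p) PMF approach 1 (p = 0.3 is an arbitrary choice):

```python
p = 0.3
# Partial sum of (1 - p)^(y - 1) * p for y = 1, ..., 199.
partial = sum((1 - p) ** (y - 1) * p for y in range(1, 200))
print(partial)  # very close to 1
```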

Negative Binomial Distribution

Same setup as Geometric, but we stop when we obtain the rth success for some
given r ∈ Z+ .
Let X : number of trials required to obtain r successes. Then X ∼ NegBin(r, p),
and
fX (x) = pr (1 − p)x−r for x = r, r + 1, r + 2, . . .

A more common iteration of the Negative Binomial Distribution:

• X ∗ : number of failures before r successes. (support is {0, 1, 2, . . . }).

• X∗ = X − r

Note that NegBin(1, p) is the same as Geo(p).

3.4 Week 4: Lecture 1


3.4.1 A Distribution of Emails Tue 19 Oct 14:00

Example 3.4.1. Suppose that we want to create a probability distribution of how many emails are sent at each point of time on the LSE server between 10AM and 11AM. There aren’t exactly any Bernoulli trials here by default, so we need to create them. Split the 1-hour period into 60 one-minute intervals. If an email is sent during a given interval, we count that as a success. Call this random variable X60. Note that X60 ∼ Bin(60, p). Here’s the problem: this model will systematically undercount the number of emails, since there are a maximum of 60 trials, and each trial counts for only 1. Now increase the number of Bernoulli trials, and assume that the probability remains constant over any given time interval. Hence, if we have 120 30-second intervals, then X120 ∼ Bin(120, p/2).


What if we continue increasing the number of trials? Or, what is X∞, i.e. lim Bin(n, p) as n → ∞, p → 0 with np = λ remaining constant?

fX(x) = lim_{n→∞, p→0, np=λ} [n!/(x!(n − x)!)] p^x (1 − p)^{n−x}
= lim_{n→∞} [n!/(x!(n − x)!)] (λ^x/n^x) (1 − λ/n)^{n−x}
= lim_{n→∞} [n(n − 1) · · · (n − x + 1)/(n × n × · · · × n)] (1 − λ/n)^n (1 − λ/n)^{−x} (λ^x/x!)
= 1 × e^{−λ} × 1 × λ^x/x!
= e^{−λ} λ^x/x!,  x = 0, 1, 2, . . . ,

which is the PMF of the Poisson distribution. Hence, X ∼ Poisson(λ).
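This limit can also be observed numerically: for fixed λ = np, the Bin(n, λ/n) PMF approaches the Poisson(λ) PMF as n grows (λ = 4 is an arbitrary choice):

```python
from math import comb, exp, factorial

lam = 4.0

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x):
    return exp(-lam) * lam**x / factorial(x)

for n in (60, 600, 6000):
    # Largest pointwise gap between the two PMFs on x = 0, ..., 19.
    err = max(abs(binom_pmf(x, n, lam / n) - poisson_pmf(x)) for x in range(20))
    print(n, err)  # the gap shrinks as n grows
```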

3.4.2 Discrete Uniform Distribution


Definition 3.4.2. A discrete random variable X is uniformly distributed if it has PMF

fX(x) = 1/n for x ∈ {x1, x2, . . . , xn}, and fX(x) = 0 otherwise.

3.4.3 Continuous Random Variables


Note that a discrete CDF is a step function. A continuous CDF, however, is
continuous everywhere.
Figure 3.2: Discrete vs. continuous CDF: the discrete CDF jumps by P(X = x*) at each atom x*, while the continuous CDF satisfies P(Y = y) = 0 for all y ∈ R.


Definition 3.4.3. A random variable X is continuous if its CDF can be written as

FX(x) = ∫_{−∞}^x fX(u) du

for some integrable real-valued function fX.

Definition 3.4.4. We write fX (x) to denote the probability density


function (PDF) of X.

Proposition 3.4.5. For all x ∈ R,

i. fX(x) = (d/dx) FX(x) = F′X(x)

ii. fX(x) ≥ 0 for all x ∈ R

iii. ∫_R fX(x) dx = 1.

iv. Let a, b ∈ R, with a < b. Then

FX(b) − FX(a) = P(a < X ≤ b) = ∫_a^b fX(u) du.

v. For any B ⊆ R,

P(X ∈ B) = ∫_B fX(x) dx.

Example 3.4.6. A continuous random variable X is uniformly distributed with parameters a, b if

fX(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.

More compactly, X ∼ Unif[a, b]. For a ≤ c ≤ c + h ≤ b, we can say

P(c ≤ X ≤ c + h) = h/(b − a).


Figure 3.3: The continuous uniform distribution with parameters a, b: the density is constant at 1/(b − a) on [a, b], so P(c ≤ X ≤ c + h) is the area h/(b − a).


Example 3.4.7. Let X be the number of email arrivals in an hour, and suppose X ∼ Poisson(λ). Note that we can scale this: the number of arrivals in t hours is X(t) ∼ Poisson(λt). Let Y be the time of the first arrival. Note that

FY(y) = P(Y ≤ y)
= 1 − P(Y > y)
= 1 − P(X(y) = 0)
= 1 − e^{−λy}(λy)^0/0!
= 1 − e^{−λy},  y ≥ 0.

Differentiate FY(y) to find the density function:

fY(y) = (d/dy) FY(y) = (d/dy)(1 − e^{−λy}) = λe^{−λy},  y ≥ 0;

which is the PDF of the exponential distribution with rate λ. Hence, Y ∼ Exp(λ). ⋄


Figure 3.4: A simple graphical representation of the PDF of the exponential distribution: a curve decaying from λ at y = 0.

3.5 Week 4: Lecture 2


3.5.1 Some Continuous Distributions Wed 20 Oct 10:00

Exponential Distribution

An Exponential Distribution can be described with either a rate parameter or


a scale parameter.

Definition 3.5.1. We say X ∼ Exp(λ) with rate parameter λ > 0 if fX(x) = λe^{−λx} for x > 0. We say X ∼ Exp(θ) with scale parameter θ if fX(x) = (1/θ)e^{−x/θ} for x > 0.

Note that θ = 1/λ.

Normal Distributions

Definition 3.5.2. We say that X is normally distributed, or X ∼ N(µ, σ²) with mean µ and variance σ², if

fX(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)} for x ∈ R.

Example 3.5.3. The standard normal distribution is Normal(0, 1). ⋄

Some properties of the normal distribution:

• If X ∼ N(µ, σ²), then (X − µ)/σ ∼ N(0, 1).

• If Z ∼ N(0, 1), then µ + σZ ∼ N(µ, σ²).


Remark 3.5.4. The normal CDF has no closed form. It can be written as an
infinite sum, but it cannot be written in a finite number of operations.
Remark 3.5.5. If Z ∼ N (0, 1) we write Φ(z) for FZ (z). The Φ function is the
CDF for the standard normal.

Gamma Distribution

Definition 3.5.6. The PDF of the Gamma distribution is

fX(x) = (λ^α/Γ(α)) x^{α−1} e^{−λx} for x > 0, and 0 otherwise.

We denote the gamma distribution as Gamma(α, λ), where α is the shape parameter, and λ is the rate parameter.

Definition 3.5.7. The gamma function is as follows:

Γ(k) = ∫_0^∞ x^{k−1} e^{−x} dx.

If k ∈ Z⁺, then Γ(k) = (k − 1)!. Since Γ(1) = 1, Gamma(1, λ) is Exp(λ).


Remark 3.5.8. Beware! The scale parameter θ is commonly in use too, with
θ = λ1 .

Figure 3.5: PDF of the Gamma distribution for α = 1 (monotone decreasing) and α = 3 (unimodal).

3.5.2 Expectation, Variance, and Moments


We can now characterize some properties of random variables:

• mode(X) = arg max_x fX(x)

• median(X) = m, where FX(m) = 1/2.

• mean(X)? See below:

Definition 3.5.9. The mean, or expected value, of X is

µ = E(X) = ∑_x x fX(x) if X is discrete, or ∫_{−∞}^∞ x fX(x) dx if X is continuous.

Note. We generally ask that

∑_x |x| fX(x) < ∞,

or, for continuous random variables,

∫_R |x| fX(x) dx < ∞.

Example 3.5.10. Let X ∼ Uniform[a, b]. Then

E(X) = ∫_R x fX(x) dx
= ∫_a^b x/(b − a) dx
= [x²/2]_a^b / (b − a)
= (b² − a²)/(2(b − a))
= (b − a)(b + a)/(2(b − a))
= (a + b)/2.
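A Monte Carlo estimate agrees with E(X) = (a + b)/2 (a quick simulation sketch; the endpoints a = 2, b = 10 are arbitrary):

```python
import random

random.seed(0)  # reproducibility
a, b = 2.0, 10.0
n = 200_000
mean_est = sum(random.uniform(a, b) for _ in range(n)) / n
print(mean_est)  # close to (a + b) / 2 = 6.0
```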

Remark 3.5.11. Note that

E(g(X)) = ∑_x g(x) fX(x) (discrete), or ∫_R g(x) fX(x) dx (continuous),

as long as

∑_x |g(x)| fX(x) < ∞,

and similarly when X is continuous.
Example 3.5.12. For a random variable X and a0, a1, a2, · · · ∈ R,

E(a0 + a1X + a2X² + · · · ) = a0 + a1E(X) + a2E(X²) + · · ·


Remark 3.5.13. A quick aside, recalling linearity of integration:

∫(e^x + 2 sin x) dx = ∫e^x dx + 2∫sin x dx.

Definition 3.5.14. The variance (σ²) of a random variable X is

Var(X) = E[(X − E(X))²].

Observe that

σ² = ∑_x (x − µ)² fX(x) (discrete), or ∫_R (x − µ)² fX(x) dx (continuous).

Some properties of variance:

i. Var(X) ≥ 0,

ii. Var(a0 + a1 X) = a21 Var(X) (prove this!),

iii. Var(X) = E(X 2 ) − E(X)2 (this too!).

Definition 3.5.15. The standard deviation σ of a random variable X is √Var(X).

3.6 Week 5: Lecture 1


3.6.1 Markov Inequality Tue 26 Oct 14:00

Theorem 3.6.1 (Markov Inequality). If Y is a positive random variable, and E(Y) < ∞, then

P(Y ≥ a) ≤ E(Y)/a,

for any a > 0.


Figure 3.6: The Markov Inequality shows that the shaded area under fY to the right of a (the survival function of Y evaluated at a) is always less than or equal to E(Y)/a.

Proof. Observe that

P(Y ≥ a) = ∫_a^∞ fY(y) dy
≤ ∫_a^∞ (y/a) fY(y) dy   (since y/a ≥ 1 on [a, ∞))
= (1/a) ∫_a^∞ y fY(y) dy
≤ (1/a) ∫_0^∞ y fY(y) dy
= (1/a) E(Y).

Example 3.6.2. Let Y be the random variable representing a person’s lifespan. Say that E(Y) = 80. Note that

P(Y ≥ 160) ≤ 80/160 = 1/2. ⋄

Theorem 3.6.3 (Chebyshev Inequality). If X is a random variable with Var(X) < ∞, then

P(|X − E(X)| ≥ a) ≤ Var(X)/a²,

for any a > 0.


Proof. Let Y = (X − E(X))². By the Markov inequality,

P(|X − E(X)| ≥ a) = P((X − E(X))² ≥ a²)
                  ≤ E(Y)/a²
                  = E[(X − E(X))²]/a²
                  = Var(X)/a².

Alternatively, writing µ = E(X) and σ² = Var(X),

P(X ≥ µ + λσ or X ≤ µ − λσ) = P(|X − µ|/σ ≥ λ)
                             = P(|X − µ| ≥ λσ)
                             ≤ σ²/(λ²σ²)
                             = 1/λ².

Example 3.6.4. Let X ∼ Normal(µ, σ²). Then

P(|X − µ|/σ ≥ 2) ≤ 1/4.

Note that the exact probability is ≈ 0.05. ⋄
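A Monte Carlo sketch comparing the Chebyshev bound with the actual normal tail (the sample size is an arbitrary choice):

```python
import random

# Compare the Chebyshev bound P(|X - mu| >= 2*sigma) <= 1/4 with the actual
# two-sided tail of a Normal(0, 1), which is about 0.0455.
random.seed(3)
n = 100_000
tail = sum(abs(random.gauss(0, 1)) >= 2 for _ in range(n)) / n
print(round(tail, 4))                   # ~0.0455, well below the bound 0.25
```

Chebyshev is often far from tight, as here: the bound 0.25 is roughly five times the true tail probability.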

Definition 3.6.5. A function g : R → R is convex if for any a ∈ R, we can find λ such that

g(x) ≥ g(a) + λ(x − a) for all x ∈ R.

A function is concave if instead

g(x) ≤ g(a) + λ(x − a) for all x ∈ R.

3.6.2 Jensen Inequality

Theorem 3.6.6 (Jensen Inequality). If X is a random variable (with E(X) defined) and g : R → R is convex (with E(g(X)) < ∞), then

E(g(X)) ≥ g(E(X)).

31
Week 5: Lecture 1 3 ⧸ Random Variables & Univariate Distributions

Proof. Using the definition of a convex function with a = E(X), we have

E(g(X)) = ∫_R g(x) f_X(x) dx
        ≥ ∫_R [g(E(X)) + λ(x − E(X))] f_X(x) dx
        = g(E(X)) ∫_R f_X(x) dx + λ ∫_R (x − E(X)) f_X(x) dx
        = g(E(X)) + λ E(X − E(X))
        = g(E(X)).

If h : R → R is concave, then E(h(X)) ≤ h(E(X)).


Example 3.6.7. A special case:

E(aX + b) = aE(X) + b.


Example 3.6.8. Note that

E(X 2 ) ≥ (E(X))2 .


Example 3.6.9. If Y > 0, then since y ↦ 1/y is convex on (0, ∞),

E(1/Y) ≥ 1/E(Y).
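This instance of Jensen also holds exactly for the empirical distribution of any positive sample (the Uniform[0.5, 3] sample below is an arbitrary choice of mine):

```python
import random

# Jensen on an empirical distribution: since 1/y is convex on (0, inf), the
# average of 1/y is always at least 1/(average of y).
random.seed(4)
ys = [random.uniform(0.5, 3.0) for _ in range(5_000)]
mean_of_reciprocal = sum(1 / y for y in ys) / len(ys)
reciprocal_of_mean = 1 / (sum(ys) / len(ys))
print(mean_of_reciprocal, ">=", reciprocal_of_mean)
```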

3.6.3 Moments
Definition 3.6.10. The rth moment of a random variable X is

µ′r = E(X r ), for r = 1, 2, 3, . . .

Definition 3.6.11. The rth central moment of X is

µr = E[(X − E(X))r ], for r = 1, 2, 3, . . .

Example 3.6.12. Some moments:

µ′1 = E(X),
µ1 = 0,
µ2 = Var(X) = E(X 2 ) − E(X)2
⇒ µ2 = µ′2 − (µ′1 )2 .

32
Week 5: Lecture 2 3 ⧸ Random Variables & Univariate Distributions


Example 3.6.13. Let X ∼ Exp(λ). Then, integrating by parts,

µ′_r = E(X^r) = ∫_R x^r f_X(x) dx
     = ∫_0^∞ x^r λe^{−λx} dx
     = ∫_0^∞ x^r (d/dx)(−e^{−λx}) dx
     = [x^r(−e^{−λx})]_0^∞ − ∫_0^∞ r x^{r−1}(−e^{−λx}) dx
     = (r/λ) ∫_0^∞ x^{r−1} λe^{−λx} dx     (the boundary term vanishes)
     = (r/λ) µ′_{r−1}.

Observe that

µ′_r = (r/λ) µ′_{r−1}
     = (r/λ)((r − 1)/λ) µ′_{r−2}
     = . . .
     = (r/λ)((r − 1)/λ) · · · (1/λ) µ′_0
     = r!/λ^r.

So E(X) = 1/λ, E(X²) = 2/λ², and so on. Further note that

Var(X) = 2/λ² − (1/λ)² = 1/λ².
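A Monte Carlo sketch of µ′_r = r!/λ^r (the choices λ = 2, r = 1, 2, 3 and the sample size are mine, for illustration):

```python
import math
import random

# Monte Carlo check of mu'_r = E(X^r) = r!/lam^r for X ~ Exp(lam).
random.seed(5)
lam = 2.0
xs = [random.expovariate(lam) for _ in range(500_000)]
results = {r: sum(x ** r for x in xs) / len(xs) for r in (1, 2, 3)}
exact = {r: math.factorial(r) / lam ** r for r in (1, 2, 3)}
print(exact)                            # {1: 0.5, 2: 0.5, 3: 0.75}
print(results)                          # each entry close to the exact value
```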

3.7 Week 5: Lecture 2


3.7.1 Moment-Generating Function Wed 28 Oct 10:00

33
Week 5: Lecture 2 3 ⧸ Random Variables & Univariate Distributions

Definition 3.7.1. The moment-generating function (MGF) of a random variable X is a function M_X : R → R⁺₀ given by

M_X(t) = E(e^{tX}) = ∑_x e^{tx} f_X(x) (discrete), or ∫_R e^{tx} f_X(x) dx (continuous),

where we require that M_X(t) < ∞ for all t ∈ [−h, h] for some h > 0 (a neighborhood of 0).

Remark 3.7.2. Note that

e^y = 1 + y + y²/2! + y³/3! + · · · = ∑_{j=0}^∞ y^j/j!.

And so

M_X(t) = E(e^{tX})
       = E(1 + tX + (tX)²/2! + · · ·)
       = E(∑_{j=0}^∞ (tX)^j/j!)
       = ∑_{j=0}^∞ (t^j/j!) E(X^j)
       = 1 + tµ′_1 + (t²/2!)µ′_2 + (t³/3!)µ′_3 + · · ·

The coefficient of t^r/r! is µ′_r = E(X^r).

Proposition 3.7.3. The rth derivative of M_X(t) at t = 0 is µ′_r.

Proof.

M_X^{(r)}(t) = (d^r/dt^r) M_X(t)
            = (d^r/dt^r)(1 + tµ′_1 + (t²/2!)µ′_2 + (t³/3!)µ′_3 + · · ·)
            = µ′_r + tµ′_{r+1} + (t²/2!)µ′_{r+2} + · · ·


This implies

M_X^{(r)}(0) = µ′_r = E(X^r).

Proposition 3.7.4. If X, Y are random variables and we can find h > 0


such that MX (t) = MY (t) for all |t| < h, i.e., t ∈ (−h, h), then

FX (x) = FY (x) for all x ∈ R.

Proof. Omitted.

Example 3.7.5. Let X ∼ Poisson(λ). Observe that

M_X(t) = E(e^{tX}) = ∑_x e^{tx} f_X(x)
       = ∑_{x=0}^∞ e^{tx} e^{−λ}λ^x/x!
       = e^{−λ} ∑_{x=0}^∞ (λe^t)^x/x!
       = e^{−λ} e^{λe^t}
       = exp(λ(e^t − 1)), for t ∈ R.

Expanding,

M_X(t) = exp(λ(t + t²/2 + t³/6 + · · ·))
       = 1 + λ(t + t²/2 + · · ·) + λ²(t + t²/2 + · · ·)²/2 + · · ·
       = 1 + λt + λt²/2 + λ²t²/2 + · · ·
       = 1 + λt + (λ + λ²)t²/2 + · · · .

From this, E(X) = λ, and E(X²) = λ + λ². Moreover,

Var(X) = λ + λ² − λ² = λ.

Or, differentiating directly,

M′_X(t) = exp(λ(e^t − 1))λe^t ⇒ µ′_1 = M′_X(0) = λ.
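The closed form can be checked numerically against the defining sum E(e^{tX}) (the values λ = 3, t = 0.5 are arbitrary illustration choices of mine; the sum is truncated since its terms die off quickly):

```python
import math

# Check M_X(t) = sum_x e^{tx} f_X(x) against exp(lam*(e^t - 1)) for Poisson.
lam, t = 3.0, 0.5
mgf_sum, term = 0.0, math.exp(-lam)     # term = e^{tx} e^{-lam} lam^x / x! at x = 0
for x in range(150):
    mgf_sum += term
    term *= lam * math.exp(t) / (x + 1)
mgf_closed = math.exp(lam * (math.exp(t) - 1))
print(mgf_sum, mgf_closed)              # the two values agree closely
```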


Example 3.7.6. Let Y ∼ Γ(α, λ). Then

M_Y(t) = E(e^{tY}) = ∫_0^∞ e^{ty} (λ^α/Γ(α)) y^{α−1} e^{−λy} dy
       = (λ^α/(λ − t)^α) ∫_0^∞ ((λ − t)^α/Γ(α)) y^{α−1} e^{−(λ−t)y} dy
       = (λ/(λ − t))^α
       = (1 − t/λ)^{−α}, for |t| < λ,

since the remaining integrand is a Γ(α, λ − t) density and so integrates to 1.

Negative Binomial Expansion

M_Y(t) = (1 − t/λ)^{−α} = ∑_{j=0}^∞ ((j + α − 1) choose (α − 1)) (t/λ)^j.

So, for example, the coefficient of t^j/j! is ((j + α − 1)!/(α − 1)!) λ^{−j}. Then

E(Y) = ((1 + α − 1)!/(α − 1)!) λ^{−1}
     = (α!/(α − 1)!)(1/λ)
     = α/λ.

3.7.2 Cumulant-Generating function

Definition 3.7.7. The cumulant-generating function (CGF) of a random variable X is K_X(t) = ln(M_X(t)).

We can write

K_X(t) = κ_1 t + (κ_2/2!) t² + (κ_3/3!) t³ + · · ·

The rth cumulant, κ_r, is the coefficient of t^r/r! in the power series expansion of K_X(t) about 0.
Example 3.7.8. Let X ∼ Poisson(λ). Then


K_X(t) = ln M_X(t)
       = ln(exp(λ(e^t − 1)))
       = λ(e^t − 1)
       = λt + λt²/2 + λt³/3! + · · · .

So, κ_1 = κ_2 = κ_3 = · · · = λ. ⋄

Proposition 3.7.9. If X is a random variable, then

i. κ1 = µ′1 = E(X)

ii. κ2 = µ′2 − (µ′1 )2 = µ2 = Var(X)

iii. κ3 = µ3 = E[(X − E(X))3 ].

Proof.

i. Observe that

K_X(t) = ln(M_X(t))
⇒ K′_X(t) = M′_X(t)/M_X(t)
⇒ κ_1 = K′_X(0) = M′_X(0)/M_X(0) = µ′_1,

since M_X(0) = 1.

ii. By the quotient rule,

K″_X(t) = (M′_X(t)/M_X(t))′ = (M″_X(t)M_X(t) − (M′_X(t))²)/(M_X(t))²
⇒ κ_2 = K″_X(0) = µ′_2 − (µ′_1)².

iii. Left as an exercise.

3.8 Week 6: Reading Week


:)

3.9 Week 7: Lecture 1


Tue 9 Nov 14:00


3.9.1 Functions of Random Variables


Let X be a random variable, and g : R → R be a well-behaved function. We’re
interested in

Y = g(X), E(g(X))

When we first encountered functions of random variables, we started with


the CDF, and we worked from there. But observe that

F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y), which in general is not P(X ≤ g^{−1}(y)).

Hence, naively "inverting" g doesn't work, e.g., for g(x) = x².

Definition 3.9.1. If B ⊆ R and g : R → R, the inverse image of B is


defined as
g −1 (B) = {x ∈ R : g(x) ∈ B}.

Example 3.9.2. If g(x) = x2 ,

g −1 ({4}) = {−2, 2}
g −1 ([0, 1]) = [−1, 1]

Example 3.9.3. For a random variable Y and some set B,

P (Y ∈ B) = P (g(X) ∈ B)
= P ({ω ∈ Ω : g(X(ω)) ∈ B})
= P ({ω ∈ Ω : X(ω) ∈ g −1 (B)})
= P (X ∈ g −1 (B))


Remark 3.9.4. Note that

F_Y(y) = P(Y ≤ y)
       = P(Y ∈ (−∞, y])
       = P(X ∈ g^{−1}((−∞, y]))
       = ∑_{x : g(x) ≤ y} f_X(x) (discrete), or ∫_{x : g(x) ≤ y} f_X(x) dx (continuous).


Further, in the discrete case,

f_Y(y) = ∑_{x : g(x) = y} f_X(x).

Example 3.9.5. Let X be a continuous random variable and Y = g(X) = X 2 .


For y ≥ 0:

F_Y(y) = P(Y ≤ y)
       = P(X² ≤ y)
       = P(−√y ≤ X ≤ √y)
       = F_X(√y) − F_X(−√y)

⇒ f_Y(y) = (d/dy) F_Y(y)
         = (1/(2√y))(f_X(√y) + f_X(−√y)) for y ≥ 0, and 0 for y < 0.

If X ∼ Normal(0, 1), then

f_Y(y) = (1/(2√y)) ((1/√(2π)) e^{−(√y)²/2} + (1/√(2π)) e^{−(√y)²/2})
       = (1/√(2π)) y^{−1/2} e^{−y/2}
       = ((1/2)^{1/2}/√π) y^{−1/2} e^{−y/2}, y ≥ 0.

Note that Γ(1/2) = √π. Hence, Y ∼ Γ(1/2, 1/2).
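A Monte Carlo sketch: Γ(1/2, 1/2) is the chi-squared distribution with 1 degree of freedom, so Y = X² should have mean α/λ = 1 and variance α/λ² = 2 (the sample size is an arbitrary choice of mine):

```python
import random

# Check that Y = X^2 with X ~ Normal(0, 1) has the Gamma(1/2, 1/2) moments
# E(Y) = 1 and Var(Y) = 2.
random.seed(6)
ys = [random.gauss(0, 1) ** 2 for _ in range(200_000)]
mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
print(round(mean_y, 3), round(var_y, 3))    # close to 1 and 2
```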


Monotonicity

Definition 3.9.6. A function is monotone if it is strictly increasing or


strictly decreasing.

Remark 3.9.7. If a function g is monotone increasing and maps (a, b) onto (c, d), then

y ∈ (c, d) ⇐⇒ x ∈ (a, b),

and hence, g^{−1}((c, d)) = (a, b).


Figure 3.7: A monotone increasing function y = g(x).

Similarly, if a function g is monotone decreasing and maps (a, b) onto (c, d), then

y ∈ (c, d) ⇐⇒ x ∈ (a, b),

and hence, g^{−1}((c, d)) = (a, b).

Figure 3.8: A monotone decreasing function y = g(x).

In general, if g is monotone (increasing or decreasing),


g^{−1}((−∞, y]) = {x ∈ R : g(x) ≤ y}
               = (−∞, g^{−1}(y)], if g is increasing,
               = [g^{−1}(y), ∞), if g is decreasing.

F_Y(y) = P(X ∈ g^{−1}((−∞, y]))
       = P(X ∈ (−∞, g^{−1}(y)]) = F_X(g^{−1}(y)), g ↑,
       = P(X ∈ [g^{−1}(y), ∞)) = 1 − F_X(g^{−1}(y)−), g ↓.

Note. F_X(x−) = lim_{h↓0} F_X(x − h) = P(X < x).


Remark 3.9.8. If X is continuous, then

f_Y(y) = (d/dy) F_X(g^{−1}(y)), g ↑, or (d/dy)(1 − F_X(g^{−1}(y))), g ↓
       = ((d/dy) g^{−1}(y)) f_X(g^{−1}(y)), g ↑, or −((d/dy) g^{−1}(y)) f_X(g^{−1}(y)), g ↓
       = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)|, g ↑ or ↓.

Example 3.9.9. Let Y = e^X. Note that

g(x) = e^x ⇐⇒ g^{−1}(y) = log y,

so

f_Y(y) = f_X(g^{−1}(y)) |(d/dy) g^{−1}(y)| = f_X(log y) · (1/y), y > 0.

If we define y = g(x), x = g^{−1}(y), we can write

f_Y(y) = f_X(x) |dx/dy|.


3.10 Week 7: Lecture 2


3.10.1 Location-scale transformation Wed 10 Nov 10:00

Let X be a continuous random variable, and Y = µ + σX, with σ > 0. Then

f_Y(y) = f_X(x) |dx/dy|.

Note that

y = µ + σx ⇐⇒ x = (y − µ)/σ,

so

f_Y(y) = f_X((y − µ)/σ) · (1/σ).

What about the MGF/CGF?

M_Y(t) = E(e^{tY}) = E(e^{t(µ+σX)}) = E(e^{tµ} e^{tσX}) = e^{tµ} M_X(tσ)
⇒ K_Y(t) = ln M_Y(t) = tµ + K_X(tσ)
         = tµ + tσκ_{X,1} + ((tσ)²/2)κ_{X,2} + ((tσ)³/3!)κ_{X,3} + · · ·

So

κ_{Y,1} = µ + σκ_{X,1},
κ_{Y,r} = σ^r κ_{X,r}, for r = 2, 3, 4, . . .

3.10.2 Sequences of Random Variables & Convergence

Definition 3.10.1. A sequence (xn) converges to x, written (xn) → x, if for all ε > 0 there exists some N ∈ N such that |xn − x| < ε for all n ≥ N.

Say we have a sequence of random variables (Xn). What does it mean to say that (Xn) "converges"?

Convergence in...


Definition 3.10.2. We say that a sequence of random variables (Xn) converges in

• probability, if lim_{n→∞} P(|Xn − X| < ε) = 1 for all ε > 0; we then write Xn →P X.

• distribution, if lim_{n→∞} F_{Xn}(x) = F_X(x) for all x ∈ R (at which F_X is continuous); we then write Xn →d X. Convergence in distribution is a milder form of convergence than convergence in probability.

• mean square, if lim_{n→∞} E[(Xn − X)²] = 0; we then write Xn →m.s. X. Convergence in mean square is stronger than convergence in probability.

Remark 3.10.3. Note that, by the Chebyshev inequality,

P(|Xn − X| ≥ ε) ≤ E[(Xn − X)²]/ε² → 0.

So

lim_{n→∞} P(|Xn − X| < ε) = 1, i.e., Xn →P X.

And thus,

convergence in m.s. ⇒ c. in probability ⇒ c. in distribution.

Convergence almost surely

Definition 3.10.4. We say that Xn converges to X almost surely if

P(lim_{n→∞} Xn = X) = 1.

More compactly, we say Xn →a.s. X.

Remark 3.10.5. Alternatively, if

A = {ω ∈ Ω : Xn(ω) → X(ω) as n → ∞},

then we want P(A) = 1. Now consider A^c: for ω ∈ A^c there exists ε > 0 such that for every n we can find m ≥ n with |Xm(ω) − X(ω)| > ε. Equivalently: there are infinitely many m with |Xm(ω) − X(ω)| > ε.

If

An = {|Xn − 0| > ε}

for some ε ∈ R, then Xn →a.s. 0 iff P(only finitely many An occur) = 1. Or,

"There's going to be a last one" - Milt Mavrakakis.


Remark 3.10.6. Note that convergence...

Almost Surely ⇒ in Probability ⇒ in Distribution

and, again, that convergence in

Mean Square ⇒ in Probability ⇒ in Distribution.

3.10.3 The Borel-Cantelli Lemmas


Definition 3.10.7. The limit superior is defined as

lim sup_{n→∞} En = ⋂_{n∈N} ⋃_{m=n}^∞ Em.

Note that ⋃_{m=n}^∞ Em occurs when at least one Em (m ≥ n) occurs.

Theorem 3.10.8 (First Borel-Cantelli Lemma). Let (Ω, F, P) be a probability space and E1, E2, E3, . . . ∈ F with ∑_{n∈N} P(En) < ∞. Then P(lim sup_{n→∞} En) = 0.

Proof. Let Bn = ⋃_{m=n}^∞ Em, so that (Bn) is a decreasing sequence of events. Observe that

P(lim sup_{n→∞} En) = P(⋂_{n=1}^∞ ⋃_{m=n}^∞ Em)
                    = P(⋂_{n=1}^∞ Bn)
                    = lim_{n→∞} P(Bn)
                    = lim_{n→∞} P(⋃_{m=n}^∞ Em)
                    ≤ lim_{n→∞} ∑_{m=n}^∞ P(Em).

Writing S_{n−1} = ∑_{m=1}^{n−1} P(Em) and S_∞ = ∑_{m=1}^∞ P(Em), we have

∑_{m=n}^∞ P(Em) = S_∞ − S_{n−1} → S_∞ − S_∞ = 0,

as long as S_∞ < ∞.

Example 3.10.9. For a coin, suppose the probability of tails at toss m is P(Em) = 1/2^m. Then

∑_{m∈N} P(Em) = ∑_{m=1}^∞ 1/2^m = 1 < ∞,


so P ("infinitely many tails") = 0. ⋄

We can show that Xn →a.s. X by showing that

∑_{n∈N} P(|Xn − X| > ε)

converges (for every ε > 0).

Theorem 3.10.10 (Second Borel-Cantelli Lemma). Suppose that E1, E2, E3, . . . are mutually independent and

∑_{n∈N} P(En) = ∞.

Then P(lim sup_{n→∞} En) = 1.

Proof. Omitted. See here if you are still curious.

Chapter 4

Multivariate Distributions

4.1 Week 8: Lecture 1


4.1.1 Joint CDFs and PDFs Tue 16 Nov 14:00

Recall. Note that

FX : R → [0, 1] FX1 ,...,Xn : Rn → [0, 1].

Definition 4.1.1. The joint cumulative distribution function of


X1 , . . . , Xn is the function

FX1 ,...,Xn (x1 , . . . , xn ) = P (X1 ≤ x1 , X2 ≤ x2 , . . . , Xn ≤ xn ).

Note that the commas in the last expression indicate ∩.

Bivariate CDFs

Notation. We write a bivariate CDF as

FX,Y (x, y) = P (X ≤ x, Y ≤ y).

Note that

P (x1 < X ≤ x2 , y1 < Y ≤ y2 ) = FX,Y (x2 , y2 ) − FX,Y (x1 , y2 )


−FX,Y (x2 , y1 ) + FX,Y (x1 , y1 ).

Moreover,
FX,Y (−∞, y) = 0 = FX,Y (x, −∞)

Similarly,
FX,Y (∞, ∞) = 1.


Lastly,

FX,Y (x, ∞) = lim FX,Y (x, y)


y→∞

= P (X ≤ x, Y ≤ ∞)
= P (X ≤ x)
= FX (x),

which is defined as the marginal CDF of X. Naturally,

lim FX,Y (x, y) = FY (y).


x→∞

If X, Y are both discrete, the joint PMF is fX,Y (x, y) = P (X = x, Y = y).


So XX
FX,Y (x, y) = fX,Y (u, v).
u≤x v≤y

Example 4.1.2. Draw 2 cards from a deck of 52 cards. Let X : number of


kings drawn, and Y : the number of aces drawn. Note that
f_{X,Y}(0, 0) = (44/52)(43/51) ≈ 0.713.
We can represent the probabilities of each event using an array:

x↓y→ 0 1 2 fX (x)
0 0.713 0.133 0.004 0.850
1 0.133 0.012 0 0.145
2 0.004 0 0 0.004
fY (y) 0.85 0.145 0.004 1

Figure 4.1: An array representing the probabilities for values of x and y.

It follows that

∑_x ∑_y f_{X,Y}(x, y) = 1,

and that

f_X(x) = ∑_y f_{X,Y}(x, y).
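The table in Figure 4.1 can be recomputed exactly by counting hands (a sketch; the helper name f_xy is mine):

```python
from fractions import Fraction
from math import comb

# Joint PMF of Figure 4.1: draw 2 cards without replacement from 52,
# X = number of kings drawn, Y = number of aces drawn.
def f_xy(x, y):
    """P(X = x, Y = y), counting unordered 2-card hands."""
    if x + y > 2:
        return Fraction(0)
    other = 52 - 8                      # cards that are neither kings nor aces
    return Fraction(comb(4, x) * comb(4, y) * comb(other, 2 - x - y), comb(52, 2))

total = sum(f_xy(x, y) for x in range(3) for y in range(3))
print(float(f_xy(0, 0)))                # ~0.7134, the 0.713 entry of the table
print(float(f_xy(1, 1)))                # ~0.0121, the 0.012 entry
print(total)                            # 1, as a joint PMF must sum to 1
```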

Definition 4.1.3. Random variables X, Y are jointly continuous if

F_{X,Y}(x, y) = ∫_{−∞}^y ∫_{−∞}^x f_{X,Y}(u, v) du dv

for all x, y ∈ R.


So
∂2
fX,Y (x, y) = FX,Y (x, y).
∂x∂y
Now, we have
Z
fX,Y (x, y) dx dy = 1,
R2
and Z Z
fX (x) = fX,Y (x, y) dy, fY (y) = fX,Y (x, y) dx,
R R
and Z Z
P ((X, Y ) ∈ B) = fX,Y (x, y) dx dy.
B

Remark 4.1.4. Aside: Note that


X
fX (x) = fX,Y (x, y),
y

so
fX (0) = fX,Y (0, 0) + fX,Y (0, 1) + fX,Y (0, 2) + · · · .

4.2 Week 8: Lecture 2


Wed 17 Nov 10:00
Note. Sometimes when you have jointly continuous random variables, you need
to be careful about the support.

4.2.1 Bivariate Density


Example 4.2.1 (Bivariate Density).

 8xy, 0 < x < y < 1
fX,Y (x, y) =
 0, otherwise.


Figure 4.2: The support of Example 4.2.1: the triangle 0 < x < y < 1. The support of a bivariate PDF is a region in R². Finding the limits of integration for each axis can be tricky.

Z ∞ Z ∞ Z 1 Z y Z 1 Z 1
fX,Y (x, y) dxdy = 8xy dx dy = 8xy dy dx.
−∞ −∞ 0 0 0 x

Note that

f_X(x) = ∫_R f_{X,Y}(x, y) dy = ∫_x^1 8xy dy = 4x(1 − x²), 0 < x < 1.


4.2.2 Multiple Random Variables

Recall. If X is a random variable and g : R → R is a well-behaved function, then

E(g(X)) = ∑_x g(x) f_X(x) (discrete), or ∫_{−∞}^∞ g(x) f_X(x) dx (continuous).
Example 4.2.2. X1, X2, . . . , Xn : daily max temperatures. Say n = 365. You might want to take the average:

(X1 + X2 + · · · + Xn)/365.

Or the maximum, or the median, etc. These are all functions g : Rⁿ → R. ⋄

If X1, X2, . . . , Xn are random variables, and g : Rⁿ → R is a well-behaved function, then

E(g(X1, X2, . . . , Xn)) = ∑_{x1} · · · ∑_{xn} g(x1, . . . , xn) f_{X1,...,Xn}(x1, . . . , xn) (discrete),
or ∫_R · · · ∫_R g(x1, . . . , xn) f_{X1,...,Xn}(x1, . . . , xn) dx1 · · · dxn (continuous).

In the previous example:

E(X + 2Y) = ∫_{R²} (x + 2y) f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_0^y (x + 2y) 8xy dx dy.

But we can split them:

E(X + 2Y) = ∫_{R²} x f_{X,Y}(x, y) dx dy + 2 ∫_{R²} y f_{X,Y}(x, y) dx dy = E(X) + 2E(Y).

4.2.3 Covariance and Correlation

Definition 4.2.3. Let X, Y be random variables. The covariance of X


and Y is defined as

Cov(X, Y) = E[(X − E(X))(Y − E(Y))]
          = E(XY) − E(X)E(Y).

Some helpful properties of covariance:

• Cov(aX, bY) = ab Cov(X, Y)

• Cov(X + c, Y + d) = Cov(X, Y )

• Cov(X, X) = Var(X)

• Cov(X + Y, U + V ) = Cov(X, U ) + Cov(X, V ) + Cov(Y, U ) + Cov(Y, V )
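The first two properties can be checked on the covariance of the empirical distribution of any sample (the sample and the constants 2, −3, 5, −7 below are arbitrary choices of mine):

```python
import random

# Check Cov(aX, bY) = ab*Cov(X, Y) and Cov(X + c, Y + d) = Cov(X, Y)
# on a sample's empirical distribution.
random.seed(7)
n = 1_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [x + random.gauss(0, 1) for x in xs]       # correlated with xs

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

base = cov(xs, ys)
scaled = cov([2 * x for x in xs], [-3 * y for y in ys])
shifted = cov([x + 5 for x in xs], [y - 7 for y in ys])
print(round(scaled / base, 6))          # ~ -6 = 2 * (-3)
print(round(shifted - base, 6))         # ~ 0
```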

Definition 4.2.4. Let X, Y be random variables. The correlation coef-


ficient of X and Y is

Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) = ρ,

with −1 ≤ ρ ≤ 1.


4.3 Week 9: Lecture 1


Proposition 4.3.1. Let X, Y be random variables. Then Tue 23 Nov 14:00

−1 ≤ Corr(X, Y ) ≤ 1,

Moreover, | Corr(X, Y )| = 1 iff Y = rX + k, for constants r ̸= 0 and k.

Proof. Define Z = Y − rX, where r ∈ R. Observe that

0 ≤ Var(Z)
= Var(Y − rX)
= Var(Y ) + Var(−rX) + 2 Cov(Y, −rX)
= Var(Y ) + r2 Var(X) − 2r Cov(X, Y ).

Let h(r) = Var(Y ) + r2 Var(X) − 2r Cov(X, Y ). Note that h(r) is a


quadratic equation. Let ∆ be the discriminant of h(r). Then

∆ = b2 − 4ac
= (−2 Cov(X, Y ))2 − 4 Var(X) Var(Y )
= 4(Cov(X, Y )2 − Var(X) Var(Y )).

Since 0 ≤ h(r), h(r) has at most one root. Then ∆ ≤ 0, and hence

Cov(X, Y )2 ≤ Var(X) Var(Y ).

Thus,

(Cov(X, Y)/√(Var(X) Var(Y)))² ≤ 1,

which implies that −1 ≤ Corr(X, Y) ≤ 1. If ∆ = 0, i.e., Corr(X, Y)² = 1, then h(r) has a double root, i.e., h(r*) = 0 for some r* ∈ R. Moreover,

h(r*) = 0 ⇐⇒ Var(Y − r*X) = 0,

so

Y − r*X = k ⇐⇒ Y = r*X + k.

We can show that r* = −b/(2a) = Cov(X, Y)/Var(X).


Now suppose that Y = rX + k. Then

Cov(X, Y) = Cov(X, rX + k) = r Cov(X, X) = r Var(X),

and

Var(Y) = Var(rX + k) = r² Var(X).

So

Corr(X, Y) = r Var(X)/√(Var(X) · r² Var(X))
           = r/√(r²)
           = r/|r|
           = 1 if r > 0, and −1 if r < 0.

4.3.1 Joint Moments


Definition 4.3.2. If X, Y are random variables, the (r, s)th joint moment
of X and Y is
µ′r,s = E(X r Y s ).

Definition 4.3.3. The (r, s)th joint central moment is

µr,s = E[(X − E(X))r (Y − E(Y ))s ].

Example 4.3.4. Note that

µ′1,0 = E(X)
µ′r,0 = E(X r )
µ′0,3 = E(Y 3 ).


Example 4.3.5. Note that

Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) = µ_{1,1}/√(µ_{2,0} µ_{0,2}).



Example 4.3.6. Let

x + y, 0 ≤ x, y ≤ 1
fX,Y (x, y) =
0, otherwise.

Then

µ′r,s = E(X r Y s )
Z
= xr y s fX,Y (x, y) dx dy
R2
Z 1 Z 1
= xr y s (x + y) dx dy
0 0
Z 1 Z 1
= (xr+1 y s + xr y s+1 ) dx dy
0 0

= ...

4.3.2 Joint MGFs


Definition 4.3.7. The joint MGF of X and Y is

M_{X,Y}(t, u) = E(e^{tX+uY}) = E(e^{tX} e^{uY})
             = E[(∑_{i∈N0} (tX)^i/i!)(∑_{j∈N0} (uY)^j/j!)]
             = E[∑_{i∈N0} ∑_{j∈N0} X^i Y^j t^i u^j/(i! j!)]
             = ∑_{i∈N0} ∑_{j∈N0} E(X^i Y^j) t^i u^j/(i! j!).

Note that E(X^i Y^j) = µ′_{i,j}.


tr us
The (r, s)th joint moment of X, Y is the coefficient of r!s! in the power series
expansion of MX,Y (t, u). Moreover,

(r,s) ∂ r+s
MX,Y (0, 0) = MX,Y (t, u)
∂tr ∂us t=0, u=0

= µ′r,s
= E(X r Y s ).


4.3.3 Joint CGFs


Definition 4.3.8. Define

KX,Y (t, u) = log MX,Y (t, u).

Then KX,Y (t, u) is the joint cumulant generating function of X and


Y. Let
X X ti uj
KX,Y (t, u) = κi,j .
i!j!
i∈N0 j∈N0

Then κ_{i,j} is the (i, j)th joint cumulant.

Example 4.3.9. Let X, Y be random variables. Then κ1,1 = Cov(X, Y ). ⋄

Proof. Observe that

M_{X,Y}(t, u) = 1 + µ′_{1,0} t + µ′_{0,1} u + µ′_{1,1} tu + · · ·
⇒ K_{X,Y}(t, u) = log M_{X,Y}(t, u).

This implies that

(∂/∂t) K_{X,Y}(t, u) = ((∂/∂t) M_{X,Y}(t, u))/M_{X,Y}(t, u) = (µ′_{1,0} + µ′_{1,1} u + · · ·)/M_{X,Y}(t, u)

⇒ (∂²/∂u∂t) K_{X,Y}(t, u) = (µ′_{1,1} + · · ·)/M_{X,Y}(t, u) − (µ′_{1,0} + µ′_{1,1} u + · · ·)((∂/∂u) M_{X,Y}(t, u))/(M_{X,Y}(t, u))².

Thus, since M_{X,Y}(0, 0) = 1 and (∂/∂u) M_{X,Y}(0, 0) = µ′_{0,1},

κ_{1,1} = K^{(1,1)}_{X,Y}(0, 0) = µ′_{1,1} − µ′_{1,0} µ′_{0,1}
        = E(XY) − E(X)E(Y)
        = Cov(X, Y).

By Example 4.3.9, we can write


κ1,1
Corr(X, Y ) = √ .
κ2,0 κ0,2


4.4 Week 9: Lecture 2


4.4.1 Independent Random Variables Wed 24 Nov 10:00

Definition 4.4.1. Two random variables X and Y are independent (X ⊥ ⊥


Y ) iff {X ≤ x} and {Y ≤ y} are independent events for all x, y ∈ R, i.e.:

FX,Y (x, y) = P (X ≤ x, Y ≤ y) = P (X ≤ x)P (Y ≤ y) = FX (x)FY (y).

If X, Y are independent and jointly continuous, then

fX,Y (x, y) = fX (x)fY (y).

If (X, Y ) are independent and discrete, then

fX,Y (x, y) = P (X = x, Y = y)
= P (X = x)P (Y = y)
= fX (x)fY (y).

Let X, Y be jointly continuous. If X ⊥⊥ Y then

E(XY) = ∫_{−∞}^∞ ∫_{−∞}^∞ xy f_{X,Y}(x, y) dx dy
      = (∫_{−∞}^∞ x f_X(x) dx)(∫_{−∞}^∞ y f_Y(y) dy)
      = E(X)E(Y).

Hence, X ⊥
⊥ Y ⇒ X, Y are uncorrelated, i.e., Cov(X, Y ) = 0.

Proposition 4.4.2. If X ⊥
⊥ Y and g, h : R → R are well-behaved functions,
then g(X) ⊥
⊥ h(Y ) and E(g(X)h(Y )) = E(g(X))E(h(Y )).

Proof. Omitted. Left as an exercise.

Example 4.4.3. For random variables X, Y with X ⊥⊥ Y,

M_{X,Y}(t, u) = E(e^{tX} e^{uY}) = E(e^{tX})E(e^{uY}) = M_X(t)M_Y(u),

and thus, K_{X,Y}(t, u) = K_X(t) + K_Y(u).


Example 4.4.4. Let X, Y be continuous random variables with joint density



x + y, 0 < x, y < 1
fX,Y (x, y) =
0, otherwise.
Note that

fX (x) = . . . = x + 1/2, 0<x<1


fY (y) = . . . = y + 1/2, 0 < y < 1,

and thus,
fX,Y (x, y) ̸= fX (x)fY (y),
so X ̸⊥
⊥Y. ⋄
Example 4.4.5. Let

kxy, 0<x<y<1
fX,Y (x, y) =
0, otherwise.
Two functions that don’t have the same support cannot be the same function.
Hence, X ̸⊥
⊥ Y because of the support. ⋄
Notation. We write that X1, X2, . . . , Xn are independent iff {X1 ≤ x1}, . . . , {Xn ≤ xn} are mutually independent events. Hence,

F_{X1,...,Xn}(x1, . . . , xn) = ∏_{i=1}^n F_{Xi}(xi).

Also

E(X1 X2 · · · Xn) = E(X1) · · · E(Xn).

4.4.2 Random Vectors & Random Matrices


Definition 4.4.6. We say that X is a random vector if

X = (X1, X2, . . . , Xn)ᵀ

for random variables (Xi). We say that W is a random matrix if W is an m × n array (W_{i,j}) of random variables.


Let X = (X1, . . . , Xn)ᵀ, x = (x1, . . . , xn)ᵀ. So

F_X(x) = F_{X1,...,Xn}(x1, . . . , xn),

and similarly for f_X(x) and M_X(t). The expectation of a random vector X is given by

E(X) = (E(X1), . . . , E(Xn))ᵀ,

and the expectation of a random matrix W is given by the m × n matrix with entries E(W)_{i,j} = E(W_{i,j}).
What is the variance of a random vector?

Recall. For a random vector X and well-behaved g : Rⁿ → R,

E(g(X)) = ∫_{Rⁿ} g(x) f_X(x) dx.

Then

Var(X) = E[(X − E(X))(X − E(X))ᵀ],

the matrix with diagonal entries Var(X1), . . . , Var(Xn) and off-diagonal entries Cov(Xi, Xj). Note that this is a symmetric n × n matrix. If X1, . . . , Xn are independent & identically distributed (IID), i.e., f_X(x) = ∏_{i=1}^n f_{X1}(xi), then Var(X) = σ² In, where σ² = Var(X1).

Definition 4.4.7. An n × n matrix A is positive semidefinite (or non-negative definite) if, for any b ∈ Rⁿ, it holds that

bᵀAb ≥ 0.

Let X be an n × 1 random vector and let b ∈ Rⁿ (a vector of constants). Then bᵀX is a scalar (1 × 1), and

0 ≤ Var(bᵀX) = E[(bᵀX − E(bᵀX))(bᵀX − E(bᵀX))ᵀ]
             = E[bᵀ(X − E(X))(X − E(X))ᵀ b]
             = bᵀ E[(X − E(X))(X − E(X))ᵀ] b
             = bᵀ Var(X) b.

We often write Var(X) ≥ 0 in place of writing that the variance matrix is positive semidefinite.


4.4.3 Transformations of Random Variables


Recall. Univariate case: Let X be a random variable, and Y = g(X) where g : R → R is monotonic. Then

f_Y(y) = f_X(x) |dx/dy|,

where x = g^{−1}(y).

Remark 4.4.8. We now want to transform (U, V ) into (X, Y ). We have

X = g1 (U, V )
Y = g2 (U, V ),

where (X, Y ) = g(U, V ). Assume that the transformation is a bijective function.


The inverse is (U, V ) = h(X, Y ) = g−1 (X, Y ). Then

fX,Y (x, y) = fU,V (u, v)|Jh (x, y)|,

where Jh (x, y) is the Jacobian of h.

4.5 Week 10: Lecture 1


4.5.1 Sums of Random Variables Tue 30 Nov 14:00

Recall. For random variables X, Y,

E(X + Y) = E(X) + E(Y),

Var(X + Y) = Var(X) + 2 Cov(X, Y) + Var(Y),

E[(X + Y)^r] = ∑_{j=0}^r (r choose j) E(X^j Y^{r−j}) = ∑_{j=0}^r (r choose j) µ′_{j,r−j}.

Proposition 4.5.1. If Z = X + Y, then

f_Z(z) = ∑_u f_{X,Y}(u, z − u) (discrete), or ∫_R f_{X,Y}(u, z − u) du (continuous).


Proof. For the discrete case, note that

fZ (z) = P (Z = z)
= P (X + Y = z)
X
= P (X = u, Y = z − u)
u
X
= fX,Y (u, z − u).
u

By the Law of Total Probability,

{X + Y = z} = ⋃_u {X = u, Y = z − u}.

For the continuous case, note that

Z = X + Y, U = X ⇐⇒ X = U, Y = Z − U.

Let

(U, Z) = g(X, Y), (X, Y) = h(U, Z).

Then

J_h(u, z) = det( ∂x/∂u  ∂x/∂z ; ∂y/∂u  ∂y/∂z ) = det( 1  0 ; −1  1 ) = 1.

Then

f_{U,Z}(u, z) = f_{X,Y}(u, z − u)|J_h| = f_{X,Y}(u, z − u),

which implies that

f_Z(z) = ∫_R f_{U,Z}(u, z) du = ∫_R f_{X,Y}(u, z − u) du.

Definition 4.5.2. Let f and g be functions. The convolution of f and g is the function

(f ∗ g)(y) = ∫_R f(x) g(y − x) dx.

Notation. If f and g are functions, their convolution is denoted by f ∗ g.


Remark 4.5.3. If X ⊥⊥ Y, then

f_Z(z) = ∑_u f_X(u) f_Y(z − u) (discrete), or ∫_R f_X(u) f_Y(z − u) du (continuous).

Hence,

f_Z = f_X ∗ f_Y = f_Y ∗ f_X.

Assume X ⊥⊥ Y. To work out the distribution of Z = X + Y, either work out the convolution of f_X, f_Y, or use their MGFs/CGFs:

M_Z(t) = M_X(t)M_Y(t) ⇐⇒ K_Z(t) = K_X(t) + K_Y(t).

Example 4.5.4. Let

X ∼ N(µ_X, σ²_X), Y ∼ N(µ_Y, σ²_Y), with X ⊥⊥ Y, Z = X + Y.

Recall that K_X(t) = µ_X t + σ²_X t²/2, so

K_Z(t) = K_X(t) + K_Y(t)
       = µ_X t + σ²_X t²/2 + µ_Y t + σ²_Y t²/2
       = (µ_X + µ_Y)t + (σ²_X + σ²_Y) t²/2
⇒ Z ∼ Normal(µ_X + µ_Y, σ²_X + σ²_Y).


Example 4.5.5. Let

X ∼ Exp(λ), Y ∼ Exp(θ), X ⊥⊥ Y, Z = X + Y, with λ ≠ θ.

Observe that

f_Z(z) = ∫_R f_X(u) f_Y(z − u) du
       = ∫_0^z λe^{−λu} θe^{−θ(z−u)} du
       = λθe^{−θz} ∫_0^z e^{−(λ−θ)u} du
       = λθe^{−θz} [−e^{−(λ−θ)u}/(λ − θ)]_0^z
       = (λθ/(λ − θ)) e^{−θz}(1 − e^{−(λ−θ)z})
       = λθ(e^{−θz} − e^{−λz})/(λ − θ), for z > 0, λ ≠ θ.
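Two numerical sanity checks on this density: it should integrate to 1, and it should give E(Z) = E(X) + E(Y) = 1/λ + 1/θ (the rates λ = 2, θ = 3 and the quadrature grid are arbitrary choices of mine):

```python
import math

# Riemann-sum checks on f_Z(z) = lam*theta*(e^{-theta z} - e^{-lam z})/(lam - theta).
lam, theta = 2.0, 3.0

def f_z(z):
    return lam * theta * (math.exp(-theta * z) - math.exp(-lam * z)) / (lam - theta)

dz = 0.001
grid = [k * dz for k in range(1, 40_000)]       # grid on (0, 40); tail is negligible
total = sum(f_z(z) for z in grid) * dz
mean = sum(z * f_z(z) for z in grid) * dz
print(round(total, 4), round(mean, 4))          # ~1.0 and ~1/2 + 1/3 = 0.8333
```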


Example 4.5.6. Consider

X1 , X2 , . . . , Xn .
Pn
Let S = i=1 Xi . Suppose that (Xi ) are mutually independent. Then
n
Y
fS = fX1 ∗ fX2 ∗ . . . fXn , MS (t) = MXi (t).
i=1


If X1, . . . , Xn are IID (independent and identically distributed), then

M_S(t) = ∏_{i=1}^n M_{Xi}(t) = (M_{X1}(t))ⁿ,

which implies that K_S(t) = nK_{X1}(t).


Example 4.5.7. If X1 , . . . , Xn ∼ Bernoulli(p), then

MS (t) = (MX1 (t))n = (1 − p + pet )n ,

which implies that S ∼ Bin(n, p). ⋄

4.5.2 Multivariate Normal Distributions


Bivariate Normal Distribution

How can we derive a bivariate normal distribution? Starting point: take U, V ∼ Normal(0, 1), with U ⊥⊥ V. Then

f_{U,V}(u, v) = f_U(u)f_V(v) = (1/(2π)) e^{−(u²+v²)/2}, u, v ∈ R.

Moreover,

M_{U,V}(s, t) = e^{(s²+t²)/2}, s, t ∈ R.

Let U, V be IID N(0, 1). Define

X = U, Y = ρU + √(1 − ρ²) V, where |ρ| ≤ 1.

Some quick properties of the bivariate standard normal:

(1) X ∼ N(0, 1) by definition. Moreover, Y is normal, as it is a sum of independent Normals:

E(Y) = E(ρU + √(1 − ρ²) V) = 0,

and

Var(Y) = ρ² Var(U) + (√(1 − ρ²))² Var(V) = ρ² + 1 − ρ² = 1,

which implies that Y ∼ Normal(0, 1).


(2) Corr(X, Y) = ρ. Observe that

Cov(X, Y) = Cov(U, ρU + √(1 − ρ²) V)
          = Cov(U, ρU) + Cov(U, √(1 − ρ²) V)
          = ρ Cov(U, U)
          = ρ.

Thus,

Corr(X, Y) = Cov(X, Y)/√(Var(X) Var(Y)) = ρ.

(3) Any linear combination of X and Y is normally distributed:

aX + bY + c = aU + b(ρU + √(1 − ρ²) V) + c
            = (a + bρ)U + b√(1 − ρ²) V + c,

which is Normal, as U ⊥⊥ V.

Example 4.5.8. Let U, V be IID Normal(0, 1), and

X = U, Y = ρU + √(1 − ρ²) V.

We have

f_{X,Y}(x, y) = f_{U,V}(u, v)|J_h(x, y)| = . . . (try this!)
             = (1/(2π√(1 − ρ²))) exp(−(x² − 2ρxy + y²)/(2(1 − ρ²))), x, y ∈ R.

Example 4.5.9. Let U, V be IID Normal(0, 1), and

X = U, Y = ρU + √(1 − ρ²) V.

Then

K_{X,Y}(s, t) = (s² + 2ρst + t²)/2.

Proof. Try this!

Finally, to obtain the general bivariate normal from X and Y, we take

X* = µ_X + σ_X X, Y* = µ_Y + σ_Y Y.

Then X* ∼ N(µ_X, σ²_X), Y* ∼ N(µ_Y, σ²_Y), and Corr(X*, Y*) = ρ.

Chapter 5

Conditional Distributions

5.1 Week 10: Lecture 2


5.1.1 Another Deck of Cards Wed 1 Dec 10:00

Example 5.1.1. Draw 2 cards from full deck. Define Y : # of aces, X : # of


kings (see Figure 4.1).

P(one Ace | one King) = P(Y = 1 | X = 1)
                      = P(Y = 1, X = 1)/P(X = 1)
                      = f_{X,Y}(1, 1)/f_X(1).

5.1.2 Conditional Mass and Density


In general,

P(Y = y | X = x) = f_{X,Y}(x, y)/f_X(x).

Definition 5.1.2. The conditional probability mass function of Y given X = x is

f_{Y|X}(y | x) = f_{X,Y}(x, y)/f_X(x).


Question: Does it sum to 1?

∑_y f_{Y|X}(y | x) = ∑_y f_{X,Y}(x, y)/f_X(x)
                  = (1/f_X(x)) ∑_y f_{X,Y}(x, y)
                  = f_X(x)/f_X(x)
                  = 1. ✓

Definition 5.1.3. The conditional cumulative distribution function


of Y given X = x is
X
FY |X (y | x) = fY |X (u | x)
u≤y

Definition 5.1.4. If X, Y are jointly continuous, we define the condi-


tional probability density function of Y given X = x as

fX,Y (x, y)
fY |X (y | x) = .
fX (x)

Example 5.1.5. Let X, Y be jointly continuous random variables with



8xy, 0 < x < y < 1
fX,Y (x, y) =
0, otherwise.

Recall that

f_X(x) = 4x(1 − x²), 0 < x < 1.

Then

f_{Y|X}(y | x) = 8xy/(4x(1 − x²)) = 2y/(1 − x²), x < y < 1.

Furthermore,

F_{Y|X}(y | x) = ∫_{−∞}^y f_{Y|X}(u | x) du
              = ∫_x^y 2u/(1 − x²) du
              = (y² − x²)/(1 − x²), x < y < 1.

Plug in y = x to check if this is plausible.
Recall. P (A ∩ B ∩ C) = P (A | B ∩ C)P (B | C)P (C). Similarly,


fX,Y (x, y) = fY |X (y | x)fX (x),

and
fX,Y,Z (x, y, z) = fZ|X,Y (z | x, y)fY |X (y | x)fX (x).

A Simple Model

Example 5.1.6. X : # of hurricanes formed, Y : # of hurricanes making


landfall
Suppose that X ∼ Poisson(λ) and (Y | X = x) ∼ Bin(x, p). Then

f_{X,Y}(x, y) = f_{Y|X}(y | x)f_X(x) = (x choose y) p^y (1 − p)^{x−y} · e^{−λ}λ^x/x!,

supported on x, y = 0, 1, 2, . . . , y ≤ x. Then

f_Y(y) = ∑_x f_{X,Y}(x, y)
       = ∑_{x=y}^∞ (x!/(y!(x − y)!)) p^y (1 − p)^{x−y} e^{−λ}λ^x/x!
       = (e^{−λ} p^y/y!) ∑_{x=y}^∞ (1 − p)^{x−y} λ^x/(x − y)!.

Let z = x − y. Then

f_Y(y) = (e^{−λ} p^y λ^y/y!) ∑_{z=0}^∞ ((1 − p)λ)^z/z!
       = (e^{−λ}(λp)^y/y!) e^{λ(1−p)}
       = e^{−λp}(λp)^y/y!, y = 0, 1, 2, . . .

This implies that Y ∼ Poisson(λp). ⋄
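A simulation sketch of this "Poisson thinning" result (λ = 5, p = 0.3 and the sample size are arbitrary illustration choices of mine; the Poisson sampler uses Knuth's product-of-uniforms method):

```python
import math
import random

# Hurricane model: X ~ Poisson(lam) storms form, each makes landfall
# independently with probability p, so (Y | X = x) ~ Bin(x, p).
# Marginally Y ~ Poisson(lam*p): mean and variance both ~ lam*p.
random.seed(9)
lam, p = 5.0, 0.3
n = 100_000

def poisson(mean_):
    """Sample from Poisson(mean_) via Knuth's product-of-uniforms method."""
    threshold = math.exp(-mean_)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

ys = []
for _ in range(n):
    x = poisson(lam)
    ys.append(sum(random.random() < p for _ in range(x)))  # Bin(x, p) landfalls

mean_y = sum(ys) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
print(round(mean_y, 2), round(var_y, 2))   # both close to lam*p = 1.5
```

That the sample mean and variance agree is itself a hint that Y is Poisson, since a Poisson distribution has equal mean and variance.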

In general, if X is discrete and Y is continuous,

f_{X,Y}(x, y) = f_{Y|X}(y | x) × f_X(x),

where f_{X,Y} is a joint mass/density, f_{Y|X} a conditional density, and f_X a marginal mass function. Moreover,

∫_R ∑_x f_{X,Y}(x, y) dy = 1.


Insurance Example

Example 5.1.7. Define Z : total value of claims, Y : # of claims submitted,


and X : average # of claims. Suppose that

X ∼ Γ(α, λ), (Y | X = x) ∼ Poisson(x).

Then
(Z | Y = y) ∼ some continuous model.

5.2 Week 11: Lecture 1


5.2.1 Conditional Expectation Tue 7 Dec 14:00

Definition 5.2.1. The conditional expectation of Y given X is E(Y | X) = ψ(X), where ψ(x) = E(Y | X = x).

Example 5.2.2 (Hurricanes).

(Y | X = x) ∼ Bin(x, p),

so E(Y | X = x) = xp ⇒ E(Y | X) = Xp.

Important difference: E(Y | X) gives a random variable, E(Y | X = x) gives a number, ψ(x). ⋄

5.2.2 Law of Iterated Expectations

Proposition 5.2.3. For random variables X and Y, we have

E(Y) = E[E(Y | X)].


Proof.

E[E(Y | X)] = E[ψ(X)]
            = ∫_R ψ(x) f_X(x) dx
            = ∫_R E(Y | X = x) f_X(x) dx
            = ∫_R (∫_R y f_{Y|X}(y | x) dy) f_X(x) dx
            = ∫_{R²} y f_{Y|X}(y | x) f_X(x) dy dx
            = ∫_{R²} y f_{X,Y}(x, y) dy dx
            = E(Y).
Example 5.2.4 (More Hurricanes). With X ∼ Poisson(λ) and (Y | X = x) ∼ Bin(x, p), we get E(Y) = E[E(Y | X)] = E(Xp) = λp. ⋄

Note that the Law of Iterated Expectations is conceptually similar to The


Law of Total Probability, which states
X
P (A) = P (A | Bi )P (Bi ).
i∈N

Example 5.2.5. Let X, Y be random variables with joint density



xe−xy e−x , x, y > 0,
fX,Y (x, y) =
0, otherwise.

Find E(Y | X):

f_X(x) = ∫_R f_{X,Y}(x, y) dy = ∫_0^∞ xe^{−xy} e^{−x} dy
       = [−e^{−xy} e^{−x}]_{y=0}^{y→∞}
       = e^{−x}, x > 0.

Hence, X ∼ Exp(1). It is often helpful to write this explicitly. Now,

f_{Y|X}(y | x) = f_{X,Y}(x, y)/f_X(x) = xe^{−xy} e^{−x}/e^{−x} = xe^{−xy}.

Hence, (Y | X = x) ∼ Exp(x). This implies that

E(Y | X = x) = 1/x and E(Y | X) = 1/X.


If g : R → R is a well-behaved function, and we define

h(x) = E[g(Y) | X = x] = ∑_y g(y) f_{Y|X}(y | x) (discrete), or ∫_R g(y) f_{Y|X}(y | x) dy (continuous),

then the conditional expectation of g(Y) given X is E(g(Y) | X) = h(X).

5.2.3 Properties of Conditional Expectation


For any two random variables X and Y ,

• E((aX + b) | Y ) = aE(X | Y ) + b,

• E(XY | X) = XE(Y | X).

• Think about: E(XY | X = x) = E(xY | X = x) = xE(Y | X = x).

• E[E(Y | X)Y | X] = E(Y | X)E(Y | X) = E(Y | X)², since E(Y | X) is a function of X.

Definition 5.2.6. The rth conditional moment of Y given X is E(Y^r | X), and the rth conditional central moment is

E[(Y − E(Y | X))^r | X].

Example 5.2.7 (Conditional Variance). Let X, Y be random variables. Then

Var(Y | X) = E[(Y − E(Y | X))² | X] = E(Y² | X) − E(Y | X)².

Proof. Prove this!

5.2.4 Law of Iterated Variance


Proposition 5.2.8. Let X and Y be random variables. Then

Var(Y ) = E[Var(Y | X)] + Var[E(Y | X)].


Proof. We have

Var(Y) = E(Y²) − E(Y)²
       = E[E(Y² | X)] − (E[E(Y | X)])²
       = E[Var(Y | X) + E(Y | X)²] − E[E(Y | X)]²
       = E[Var(Y | X)] + E[ψ(X)²] − (E[ψ(X)])²     (where ψ(X) = E(Y | X))
       = E[Var(Y | X)] + Var[E(Y | X)].

Hurricanes Again

Example 5.2.9. Let (Y | X = x) ∼ Bin(x, p), and X ∼ Poisson(λ). Then

Var(Y ) = E[Var(Y | X)] + Var[E(Y | X)]


= E(Xp(1 − p)) + Var(Xp)
= λp(1 − p) + λp2
= λp.
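The decomposition can be checked by simulation (an illustrative sketch, not part of the original notes; λ = 6 and p = 0.4 are arbitrary test values):

```python
import math
import random

random.seed(7)
lam, p, n = 6.0, 0.4, 100_000

def poisson(lam):
    # Knuth's method: multiply uniforms until the product drops below e^{-lam}
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= limit:
            return k
        k += 1

ys = []
for _ in range(n):
    x = poisson(lam)                                   # X ~ Poisson(lam)
    ys.append(sum(random.random() < p for _ in range(x)))  # Y | X=x ~ Bin(x, p)

mean_y = sum(ys) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
# The Law of Iterated Variance predicts Var(Y) = lam * p = 2.4
```

The sample variance should agree with λp to within ordinary Monte Carlo error.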

5.3 Week 11: Lecture 2


5.3.1 Conditional Moment Generating Function Wed 8 Dec 10:00

Definition 5.3.1. If

M_{Y|X}(u | x) = E[e^{uY} | X = x] = ϕ(u, x),

then the conditional moment generating function is ϕ(u, X). Hence,

M_{Y|X}(u | X) = E[e^{uY} | X].

Observe that by the Law of Iterated Expectations,

M_Y(u) = E[M_{Y|X}(u | X)] = E(e^{uY}).

Example 5.3.2. Let

X ∼ Poisson(λ), (Y | X = x) ∼ Bin(x, p).

Then

M_{Y|X}(u | x) = (1 − p + pe^u)^x  ⇒  M_{Y|X}(u | X) = (1 − p + pe^u)^X.


So

M_Y(u) = E[M_{Y|X}(u | X)]
       = E[(1 − p + pe^u)^X]
       = E[e^{X ln(1−p+pe^u)}]
       = M_X(ln(1 − p + pe^u))
       = exp(λ(e^{ln(1−p+pe^u)} − 1))
       = e^{λp(e^u − 1)},

so Y ∼ Poisson(λp).

Remark 5.3.3. Aside: M_X(t) = e^{λ(e^t − 1)}.

More generally, for the joint MGF,

M_{X,Y}(t, u) = E[e^{tX} e^{uY}]
             = E[E[e^{tX} e^{uY} | X]]
             = E[e^{tX} E[e^{uY} | X]]
             = E[e^{tX} M_{Y|X}(u | X)].
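This joint-MGF identity can be verified numerically for the hurricanes example: both sides are estimated from the same simulated draws of X, with the left side also using the simulated Y (a sketch added to the notes; t = 0.2, u = 0.3, λ = 6, p = 0.4 are arbitrary test values):

```python
import math
import random

random.seed(5)
lam, p, t, u, n = 6.0, 0.4, 0.2, 0.3, 200_000

def poisson(lam):
    # Knuth's method: multiply uniforms until the product drops below e^{-lam}
    limit, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= limit:
            return k
        k += 1

lhs = rhs = 0.0
for _ in range(n):
    x = poisson(lam)
    y = sum(random.random() < p for _ in range(x))          # Y | X=x ~ Bin(x, p)
    lhs += math.exp(t * x + u * y)                          # e^{tX + uY}
    rhs += math.exp(t * x) * (1 - p + p * math.exp(u)) ** x # e^{tX} M_{Y|X}(u|X)

lhs, rhs = lhs / n, rhs / n
# Both quantities estimate M_{X,Y}(t, u) and should agree closely.
```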

5.3.2 Some Practical Applications


Example 5.3.4 (Height). Suppose that you know the mean height and variance of the male and female population. Let

X : height of a student,
W : male or female (male = 0, female = 1).

Then

(X | W = 1) ∼ Normal(µ_W, σ²_W),
(X | W = 0) ∼ Normal(µ_M, σ²_M),
W ∼ Bernoulli(p).

Moreover,

f_X(x) = Σ_w f_{X|W}(x | w) f_W(w)      (the summand is the joint density f_{X,W}(x, w))
       = p f_{X|W}(x | 1) + (1 − p) f_{X|W}(x | 0).

Note that

f_{X|W}(x | 1) = (1/√(2πσ²_W)) e^{−(x−µ_W)²/(2σ²_W)}.


[Figure: the densities of X | W = 1 (centred at µ_W) and X | W = 0 (centred at µ_M), with the mixture density f_X(x) lying between them.]

Figure 5.1: The distribution of X from Example 5.3.4 is somewhere in between the distributions of X | W = 1 and X | W = 0.
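The mixture and the iterated-expectation formula E(X) = p µ_W + (1 − p) µ_M can be checked by simulation (a sketch; the heights in cm below are made-up illustrative values, not taken from the notes):

```python
import random

random.seed(3)
p = 0.5                      # P(W = 1), the proportion of women (hypothetical)
mu_w, sd_w = 165.0, 6.0      # hypothetical female mean and sd (cm)
mu_m, sd_m = 178.0, 7.0      # hypothetical male mean and sd (cm)
n = 100_000

total = 0.0
for _ in range(n):
    if random.random() < p:              # W = 1: draw from the female component
        total += random.gauss(mu_w, sd_w)
    else:                                # W = 0: draw from the male component
        total += random.gauss(mu_m, sd_m)

mean_x = total / n
expected = p * mu_w + (1 - p) * mu_m     # iterated expectation: E[E(X | W)]
```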


Example 5.3.5 (Household Insurance).

X_i ∼ Exp(λ) : amount claimed in year i,
N ∼ Geo(p) : years the policy is held,
Y : total amount claimed.

Then

Y = Σ_{i=1}^{N} X_i,  a random sum.

We assume that N is independent of the X_i. This is a basic assumption that may or may not be reasonable, depending on the situation. We further assume that N ≥ 0, with Y = 0 if N = 0. Then

E(Y | N = n) = E( Σ_{i=1}^{n} X_i ) = n E(X_1).

So

E(Y) = E[E(Y | N)] = E[N E(X_1)] = E(X_1) E(N).


Moreover,

Var(Y | N = n) = Var( Σ_{i=1}^{n} X_i ) = n Var(X_1).

Now, how do we iterate variances? We use the Law of Iterated Variance:

Var(Y) = E[Var(Y | N)] + Var[E(Y | N)]
       = E[N Var(X_1)] + Var(N E(X_1))
       = Var(X_1) E(N) + E(X_1)² Var(N).

Moreover,

M_{Y|N}(u | n) = E(e^{uY} | N = n) = E(e^{u Σ_{i=1}^{n} X_i}) = (M_{X_1}(u))^n.

Finally,

M_Y(u) = E[M_{Y|N}(u | N)]
       = E[(M_{X_1}(u))^N]
       = E[exp(N ln M_{X_1}(u))]
       = M_N(ln M_{X_1}(u)).

This implies that, for the cumulant generating functions,

K_Y(u) = K_N(K_{X_1}(u)).
Back to the insurance example. Note that X_1, X_2, . . . ∼ Exp(λ) and N ∼ Geo(p). We have

E(Y) = E(N) E(X_1) = (1/p)(1/λ) = 1/(λp).

Then

M_Y(u) = M_N(ln M_{X_1}(u))
       = M_N(ln (1 − u/λ)^{−1})
       = (1 − 1/p + (1/p)(1 − u/λ))^{−1}
       = (1 − 1/p + 1/p − u/(λp))^{−1}
       = (1 − u/(λp))^{−1},

so Y ∼ Exp(λp). ⋄
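A simulation of the random sum confirms this compound-geometric result (a sketch added to the notes; λ = 2 and p = 0.25 are illustrative values):

```python
import math
import random

random.seed(11)
lam, p, n = 2.0, 0.25, 100_000

total = 0.0
for _ in range(n):
    # N ~ Geo(p) on {1, 2, ...}: count trials until the first success
    N = 1
    while random.random() >= p:
        N += 1
    # Y = X_1 + ... + X_N with X_i ~ Exp(lam), sampled by inverse transform
    total += sum(-math.log(1.0 - random.random()) / lam for _ in range(N))

mean_y = total / n  # Y ~ Exp(lam * p) predicts E(Y) = 1/(lam * p) = 2.0
```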

Conclusion

Any issues with the lecture notes can be reported on the git repository, either by submitting a pull request or by opening an issue. I am happy to fix any typos or inaccuracies in the content. In addition, feel free to edit my work; just keep my name on it if you’re going to publish it somewhere else. The figures can be edited with Inkscape, the software I used to create them. When editing the figures, make sure to save to PDF and choose the option that exports the text directly to LaTeX. I hope these notes helped!
