
LECTURE NOTES

FUNDAMENTALS OF PROBABILITY THEORY

IGOR WIGMAN

Contents
0. Preliminaries: countable sets
I. Measures
I.1. Motivation
I.2. Discrete measures
I.3. Sets generating σ-algebras
I.4. Extending to measures
I.5. Independence
I.6. The Borel-Cantelli lemma
II. Measurable functions and random variables
II.1. Measurable functions
II.2. Random variables
II.3. Modes of convergence
III. Lebesgue theory of integration
III.1. Lebesgue integrals: definition and fundamental properties
III.2. Density functions
III.3. Transformation of random variables
III.4. Product measures
IV. Limit laws and Gaussian random variables
IV.1. Strong and weak laws of large numbers
IV.2. Characteristic function
IV.3. Gaussian random variables
IV.4. Convergence in distribution and Central Limit Theorem

0. Preliminaries: countable sets


The notion of countably infinite (or finite) sets is instrumental for the rest of the module. To
define these we will need the notion of bijectivity of maps.
Definition 0.1 (Bijectivity of maps). Let A and B be two arbitrary sets, finite or infinite, and a
map f : A → B.
(1) We say that f is injective (injection), if for a, b ∈ A, f (a) = f (b) implies that a = b. In
other words, f is injective, if every element c of B has at most one element in A mapping to
c by f .

This course was inspired to a great extent by the lecture notes of the course given by Kolyan Ray at King’s College London during the academic year 2019–2020.

(2) We say that f is surjective (surjection), if for every c ∈ B there exists a ∈ A so that
f (a) = c. In other words, f is surjective, if every element c of B has at least one element in
A mapping to c by f .
(3) We say that f is bijective (bijection), if f is injective and surjective.
If f : A → B is bijective, then it is invertible, i.e. there exists a map g = f −1 : B → A so that
g ◦ f = idA and f ◦ g = idB . If f : A → B is a map, C ⊆ A is a subset of A, and D ⊆ B is a subset
of B, then the (direct) image f (C) = {f (a) : a ∈ C} ⊆ B of C is the collection of all the images of
elements in C, and the inverse image f −1 (D) = {a ∈ A : f (a) ∈ D} ⊆ A of D is the collection of
all the elements of A mapping into D, regardless of whether f is invertible or not. If f : A → B is
injective, then, restricting its range, the map f : A → f (A) is bijective.
For A a finite set, its cardinality |A| is the number of elements in A. It is easy to check that if
A, B are both finite, then if there exists an injection f : A → B, then |A| ≤ |B|; if there exists a
surjection f : A → B then |A| ≥ |B|. In light of the above, if there exists a bijection f : A → B,
then |A| = |B|. We then extend the definition of cardinality to arbitrary sets (finite or infinite), by
making the latter into a definition.
Definition 0.2 (Cardinality, countability). (1) For two arbitrary sets A, B, |A| = |B| (we say
that A, B have the same cardinality), if there exists a bijection f : A → B.
(2) A set A is countably infinite if there exists a bijection f : A → N, with N = {1, 2, 3, . . .}
the set of positive integers. In this case we denote the cardinality of A by ℵ0 = |N|.
(3) We say that A is countable if either A is countably infinite or A is finite.
(4) A set A is uncountable if A is not countable. In particular, A has to be infinite.
The above defines an equivalence relation A ∼ B if |A| = |B|, with corresponding equivalence
classes being all the possible different cardinalities of sets. The countably infinite sets form the smallest infinite cardinality, in the sense that every infinite set contains a countably infinite set (see also Lemma 0.7 below); it is the equivalence class of the positive integers. If A is countably infinite, and
if f : A → N is the bijection postulated by the definition of countability, then we may list all the
elements of A sequentially:
A = {a1 = f −1 (1), a2 = f −1 (2), . . .};
if A is finite, then the same makes sense, except that the corresponding sequence is finite. In other
words, A is countable if it could be enumerated as a sequence (finite or infinite).
Example 0.3. Let Z be the collection of all integers. We define f : Z → N by
f (n) = 2n if n > 0, and f (n) = 1 − 2n if n ≤ 0.
That is, we enumerate Z = {0, 1, −1, 2, −2, . . .} (here 0 = f −1 (1), 1 = f −1 (2), −1 = f −1 (3), . . .). Since f is a bijection (i.e., our enumeration indeed covers each element of Z precisely once), the set Z is countable. It is a bit counter-intuitive, as Z properly contains N and is “twice as big”, but, nevertheless, has the same cardinality as N.
Example 0.4. Any infinite subset of N is countably infinite, since it can be enumerated (say, in increasing order).
Example 0.5. The set N × N = {(n, m) : n, m ∈ N} is countable. To see that, we can enumerate
N × N in the following way: (1, 1), (2, 1), (1, 2), (3, 1), (2, 2), (1, 3), (4, 1), (3, 2), (2, 3), (1, 4), . . ., i.e. if
one draws a picture (which is instructive), this corresponds to listing all elements of each consecutive
diagonal, starting from the bottom right one. This corresponds to the function h : N × N → N,
h((a, b)) = (a + b)(a + b − 1)/2 − a + 1.

In general, for k ≥ 1, the set N^k = {(n1 , . . . , nk ) : ni ∈ N} is countable. To see that, we use induction, having proved the statement for k = 1, 2. If k ≥ 2, we use the induction hypothesis to obtain a bijection f : N^{k−1} → N, and then define the function g : N^k = N^{k−1} × N → N × N given by
g(n1 , . . . , nk ) = (f (n1 , . . . , nk−1 ), nk ),
which, it is easy to check, is a bijection.
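To make the diagonal enumeration concrete, here is a small Python sketch (not part of the notes; the function name h follows Example 0.5, the rest is ours) implementing h((a, b)) = (a + b)(a + b − 1)/2 − a + 1 and checking by brute force that the first few diagonals are enumerated bijectively, in the order listed above.

```python
# Sketch (not from the notes): the diagonal enumeration of N x N.
# h((a, b)) = (a + b)(a + b - 1)/2 - a + 1 lists each anti-diagonal
# a + b = const from (d, 1) down to (1, d).

def h(a: int, b: int) -> int:
    """Index of the pair (a, b), a, b >= 1, in the diagonal enumeration."""
    return (a + b) * (a + b - 1) // 2 - a + 1

# Check bijectivity on the first d diagonals: their h-values must be
# exactly 1, 2, ..., d(d+1)/2, each attained once.
d = 20
values = sorted(h(a, b) for a in range(1, d + 1)
                        for b in range(1, d + 2 - a))
assert values == list(range(1, d * (d + 1) // 2 + 1))

# The first few pairs, in the order listed in Example 0.5:
order = sorted(((a, b) for a in range(1, 5) for b in range(1, 6 - a)),
               key=lambda p: h(*p))
print(order)  # [(1, 1), (2, 1), (1, 2), (3, 1), (2, 2), (1, 3), ...]
```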
Problem 0.6. (1) Prove that a countable union of straight lines in R² cannot cover the whole of R².
(2) Prove that a countable union of ellipses in R² cannot cover the whole of R².
(Hint: for (1) take a circle and intersect it with the given straight lines. How would you modify your argument to treat (2)?)
Lemma 0.7. Let A be a set. Then the following are equivalent:
(1) A is countable (allowed to be finite).
(2) There exists an injection h : A → N.
(3) A = ∅ or there exists a surjection g : N → A.
Criterion (2) in Lemma 0.7 is the easiest criterion for countability to check, as it allows one to infer that A is countable (finite or infinite) directly.
Proof. (1)⇒(2): If A is finite, then there is a bijection f : A → {1, . . . , k} ⊆ N, where k = |A| is the
number of the elements of A. Otherwise, if A is countably infinite, then there exists a bijection (in
particular, injection) f : A → N.
(2)⇒(3): If h : A → N is an injection, then every element n ∈ N has at most one pre-image a = an ∈ h−1 (n). We may then set unambiguously for n ∈ h(A): g(n) := an , and otherwise set g(n) arbitrarily. Then g is surjective, since for every a ∈ A, g(h(a)) = a.
(3)⇒(1): Suppose that g : N → A is surjective, and assume further that A is infinite (as otherwise we are done). Then, by the surjectivity assumption, for every a ∈ A there exists some pre-image n = n(a) ∈ g −1 (a). Set h : A → N to be h(a) := n(a), an arbitrary element of g −1 (a). Then, since the pre-images of distinct a, b ∈ A are disjoint by the definition, the map h is injective, and hence h : A → h(A) ⊆ N is bijective. Mind that, by the assumption that A is infinite, so is h(A), i.e. h(A) is an infinite subset of N, and therefore countably infinite by Example 0.4. All in all, h : A → h(A) is a bijection, hence |A| = |h(A)| = ℵ0 .

Example 0.8. The collection Q of rational numbers is countable. For r ∈ Q we take an arbitrary representation r = ε · (p/q) with p, q ∈ N and ε = ±1, and map f : Q → N³ as f (r) = (ε + 2, p, q), which is clearly injective, since any given triple (ε, p, q) recovers the number r.
Example 0.9. The collection R of real numbers is uncountable. We argue by Cantor’s diagonal
argument. Assume by contradiction that x1 , x2 , . . . is an enumeration of R, and let xi = ni .di1 di2 di3 . . . be the decimal expansion of xi , i.e. xi = ni + Σ_{j=1}^∞ dij · 10^{−j} . Now define x0 via its decimal expansion x0 = 0.d1 d2 d3 . . ., where dj = 0 if djj ≠ 0, and dj = 1 if djj = 0.
Then x0 can’t be equal to x1 , since d1 differs from the first digit d11 of the decimal expansion of x1 .
Reasoning similarly, this time invoking the i’th digit in place of the first one, x0 ≠ xi , and therefore
x0 ∈ R can’t possibly appear in the postulated enumeration of R, which is a contradiction to the
countability assumption of R.

We can infer from examples 0.8 and 0.9 that “most” of the real numbers are irrational.
Lemma 0.10. A countable union of countable sets is countable. Namely, if {Aα}α∈I is a family of sets, where I is countable, and for all α ∈ I, Aα is countable, then A := ∪_{α∈I} Aα is a countable set.

Proof. Since N × N is countable (Example 0.5), it is sufficient to find an injection h : A → N × N (Lemma 0.7(2)). By Lemma 0.7(2), there exist an injection f : I → N and, for every α ∈ I, an injection gα : Aα → N. For x ∈ A, the set of indexes α ∈ I so that x ∈ Aα is non-empty; we denote it by I′ ⊆ I, let m = m(x) := min_{α∈I′} f (α) (the minimal such index, when identifying the indexes with a subset of N under f ), and let α ∈ f −1 (m) be the uniquely determined corresponding index. Then h(x) := (m, gα (x)) is an injection A → N², since, given the values of m and gα (x), it is easy to recover x.

Example 0.11. For k ≥ 0 let Ak = {A ⊆ N : |A| = k} be the collection of all subsets of N with precisely k elements. (Note that an element of Ak is itself a set of positive integer numbers.) Then Ak injects into N^k , via the simple map
A = {a1 , . . . , ak } ↦ (a1 , . . . , ak ) ∈ N^k ,
where the order of the elements of A is arbitrary. Therefore, for every k ≥ 0, Ak is countable, and thanks to Lemma 0.10, the set
A = ∪_{k=0}^∞ Ak    (0.1)
of all finite subsets of N is itself countable.
The power set of a set X is P(X) = {A : A ⊆ X}, the collection of all subsets of X, also denoted 2^X . It is easy to show using Cantor’s diagonal method that P(N) is not countable. (If, by contradiction, S1 , S2 , . . . is an enumeration of P(N), then the set S = {n ∈ N : n ∉ Sn } ∈ P(N) is not equal to any of the Sn .) Hence, if we decompose
P(N) = A ∪ A∞ ,
where A is as in (0.1) and
A∞ = {A ⊆ N : |A| = ∞}
is the collection of all infinite subsets of N, then necessarily A∞ is also uncountable.

I. Measures
I.1. Motivation. One can recall that a fair die falls on each of its 6 sides with probability 1/6. In this case, the sample space of all possible outcomes of a die toss is the finite set Ω = {1, 2, 3, 4, 5, 6}, and one may ask what is the probability of the outcome lying in a given set (event) A ⊆ Ω, equalling P (A) = (1/6) · |A|, i.e. the number of elements of A times the probability 1/6 of the outcome being each of the elements. Alternatively, one may consider an arbitrary finite sample space Ω = {x1 , . . . , xk } with k = |Ω|, endow each element of Ω with a given probability pi =: P (xi ) ≥ 0, subject to the constraint Σ_{i=1}^k pi = 1, and set for an event A ⊆ Ω:
P (A) = Σ_{x∈A} P (x).

In some cases, this construction could be generalized, more or less as is, namely for countable sample
spaces (see §I.2). However, what if the sample space is uncountably infinite?

Suppose, for example, that a random number x ∈ [0, 1] is drawn, for a moment without specifying what is meant by “random number in [0, 1]”. We could ask questions like “what is the probability that 0 < x < 1/2” or, more generally, for an interval (a, b) ⊆ [0, 1], “what is the probability that x ∈ (a, b)”. It would make perfect sense that, if indeed x is drawn randomly with probability “spread equally” in [0, 1], then the probability of x belonging to an interval is proportional to its length, so that the probability of x ∈ (a, b) is b − a, and, in particular, the probability that 0 < x < 1/2 is 1/2.
One could ask more elaborate questions, like “what is the probability that x is rational”, or “what is the probability that in the binary expansion x = Σ_{i≥1} bi /2^i , bi ∈ {0, 1}, asymptotically half of the bits {bi}i≥1 are bi = 0”, or “what is the probability that in the ternary expansion x = Σ_{i≥1} ai /3^i , ai ∈ {0, 1, 2}, only the digits ai = 0, 2 appear” (i.e., ai = 1 never occurs). Since, as we have seen above, the “vast majority” of real numbers are irrational, it would be reasonable to assume that, if the answer to the former question makes sense, then this probability is precisely 0, whereas the answers to the latter two questions are less obvious, if they make sense at all.
But how do we formulate (and answer) all the above questions in a rigorous way, and which of the
above make sense, i.e. what is the repertoire of all possible questions whose answer makes sense in
this theory? It turns out that the “correct” answer for the repertoire of questions is a “σ-algebra” (“sigma-algebra”), whose elements are, in this language, the events, whereas the requested probability is a measure (in our case, a probability measure). The σ-algebra is going to be the domain of the measure.
Definition I.1 (σ-algebra). Let E be a set (finite or infinite, countable or uncountable). A σ-algebra E on E is a collection of subsets of E, subject to the following axioms:
(1) The empty set is in E , i.e. ∅ ∈ E .
(2) For all A ∈ E , its complement E \ A ∈ E .
(3) For every sequence {An}n≥1 ⊆ E of elements of E , the union ∪_{n=1}^∞ An ∈ E . I.e. countable unions of elements of E are in E .
The tuple (E, E ) is called a measurable space, and each A ∈ E is called a measurable set
(that later, in probabilistic contexts will be referred to as “event”).
The “σ” in “σ-algebra” designates the possible countably infinite unions in (3) of Definition I.1.
Otherwise, if these are only allowed to be finite, the corresponding collection of subsets of E is an
algebra (which we will not use in this module).
Example I.2. The following examples are σ-algebras for E arbitrary. It is easy to check directly
that they satisfy (1)-(3) of Definition I.1.
(1) The trivial σ-algebra E = {∅, E}.
(2) Take A ⊆ E arbitrary, and E = EA = {∅, A, E \ A, E}.
(3) Assuming E 6= ∅, take E the power set of E:
E = P(E) = {A : A ⊆ E}.
This σ-algebra is commonly used for E countable (e.g. E = Z).
Lemma I.3. If (E, E ) is a measurable space, then:
(1) E ∈ E .
(2) A countable intersection of elements of E is in E , i.e. if {An}n≥1 ⊆ E is a family of elements of E , then ∩_{n=1}^∞ An ∈ E .

(3) If A, B ∈ E , then B \ A ∈ E .
Proof. (1) Follows directly from (1) and (2) of Definition I.1.
(2) By De Morgan’s law:
∩_{n=1}^∞ An = E \ ∪_{n=1}^∞ (E \ An ),
and the statement follows directly from (2) and (3) of Definition I.1.
(3) We have B \ A = B ∩ (E \ A) ∈ E , by (2) of Definition I.1 and (2) of Lemma I.3.

Having the stage set, we are now going to discuss the central notion of measure, whose domain
is going to be the postulated σ-algebra.
Definition I.4 (Measure). Let (E, E ) be a measurable space. A measure on (E, E ) is a function µ : E → [0, ∞] assigning a non-negative real number (or infinity) to every element of E , satisfying the following properties: µ(∅) = 0, and for every sequence {An}n≥1 of disjoint elements in E (i.e. for every n ≠ m, An ∩ Am = ∅),
µ(∪_{n=1}^∞ An ) = Σ_{n=1}^∞ µ(An ).
(We say that µ is σ-additive, i.e. additive w.r.t. a countable family of elements of E . The
disjointness of {An} is crucial, as otherwise we could have taken all the An to be a fixed set A, which would force µ(A) ∈ {0, ∞}.)
The triplet (E, E , µ) is called a measure space.
Definition I.5 (Probability measure). If (E, E , µ) is a measure space so that µ(E) = 1, then we say
that µ is a probability measure, and (E, E , µ) is a probability space. In this case the notation
(Ω, F , P ) is often used.
I.2. Discrete measures. Let E = Z, E = P(Z) the power set, and suppose that µ is an arbitrary measure on (Z, P(Z)). Then, since any A ⊆ Z is countable, we may write A = ∪_{k∈A} {k}, and thus, by the σ-additivity in Definition I.4,
µ(A) = Σ_{k∈A} µ({k})    (I.1)
(mind that {k} is a singleton, the set containing one element, k, which is not the same as that element). Denote the function m : Z → [0, ∞] given by m(k) := µ({k}), so that (I.1) reads
µ(A) = Σ_{k∈A} m(k).    (I.2)
Conversely, any function m : Z → [0, ∞] defines a measure µ via (I.2). The function m(·) is called the mass function of µ. The measure µ is a probability measure, if and only if m(·) satisfies
Σ_{k∈Z} m(k) = 1,
in which case m is the probability mass function. For example, a Poisson random variable X with parameter λ > 0 has the probability mass function
m(k) = P (X = k) = (λ^k /k!) e^{−λ} for k ≥ 0, and m(k) = 0 otherwise.

Similarly, the Binomial distribution Bin(n, p), n ≥ 0, p ∈ [0, 1], has m(k) = \binom{n}{k} p^k (1 − p)^{n−k} for k = 0, 1, . . . , n, and otherwise m(k) = 0.
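As a small illustration of (I.2) (a Python sketch, not part of the notes; the helper names are ours), the measure of a set of integers is obtained by summing a mass function over it, here the Poisson mass function with parameter λ.

```python
# Sketch (not from the notes): a discrete probability measure defined by its
# mass function, mu(A) = sum of m(k) over k in A (Poisson with parameter lam).
from math import exp, factorial

def poisson_mass(k: int, lam: float) -> float:
    """Probability mass function of a Poisson(lam) random variable."""
    return exp(-lam) * lam**k / factorial(k) if k >= 0 else 0.0

def mu(A, lam: float) -> float:
    """The measure of a (finite) set A of integers, mu(A) = sum_{k in A} m(k)."""
    return sum(poisson_mass(k, lam) for k in A)

lam = 3.0
# sigma-additivity on a finite disjoint decomposition: evens and odds up to 50.
evens = set(range(0, 51, 2))
odds = set(range(1, 51, 2))
total = mu(evens | odds, lam)
assert abs(total - (mu(evens, lam) + mu(odds, lam))) < 1e-12
print(total)  # close to 1: the tail beyond 50 is negligible for lam = 3
```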

I.3. Sets generating σ-algebras. For E = [0, 1], we would like to define a (probability) measure that would model x ∈ E drawn randomly (uniformly), as discussed in §I.1. If such a measure µ exists, then for an interval (a, b) ⊆ E, it would be equal to µ((a, b)) = b − a. In anticipation of extending µ to a measure, it is essential to first discuss its domain, i.e. a σ-algebra that contains all the intervals (a, b) ⊆ E, the primary concern of this section. This is a particular case of the following
general situation:
Definition I.6 (σ-algebra generated by a set.). Let E be an arbitrary set (finite or infinite, countable
or uncountable), and A some collection of subsets of E. The σ-algebra generated by A is the
intersection of all σ-algebras containing A :
σ(A ) := ∩_{E a σ-algebra on E, A ⊆ E} E .    (I.3)

Lemma I.7. Let E be an arbitrary set, and A a collection of subsets of E.


(1) F := σ(A ) is a σ-algebra on E containing A .
(2) F is the smallest σ-algebra on E containing A , in the sense that every σ-algebra E on E
containing A satisfies F ⊆ E .
Proof. (1) First we check that (1)-(3) of Definition I.1 hold with F :
(1): By the same property holding for all the σ-algebras E in (I.3), ∅ ∈ E , and hence ∅ is in the intersection on the r.h.s. of (I.3).
(2): If A ∈ F , then for every E on the r.h.s. of (I.3) we have A ∈ E , hence by (2),
E \ A ∈ E , and then also E \ A is in the intersection on the r.h.s. of (I.3).
(3): If {An}n≥1 ⊆ F , then for every E on the r.h.s. of (I.3) and every n ≥ 1, we have An ∈ E , and then by (3), ∪_{n=1}^∞ An ∈ E , and so also ∪_{n=1}^∞ An is in the intersection on the r.h.s. of (I.3).
That shows that F is a σ-algebra on E. That F contains A follows directly from the fact
that (I.3) is an intersection of sets all of which contain A .
(2) That follows directly from the fact that if E is a postulated σ-algebra on E, then in the
definition (I.3) of F , E is one of the sets in the intersection.

Example I.8. (1) If E is an arbitrary set and A ⊆ E, then σ({A}) = {∅, A, E \ A, E}.
(2) If E is a σ-algebra on E, then σ(E ) = E , which follows by the definition (or with the help of
Lemma I.7).
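When E is finite, σ(A ) can be computed explicitly by closing A under complements and (finite) unions until nothing new appears; the following Python sketch (not part of the notes; the function name is ours) does this and recovers Example I.8(1).

```python
# Sketch (not from the notes): the sigma-algebra generated by a collection of
# subsets of a *finite* set E, computed by closing under complement and union.
from itertools import combinations

def generated_sigma_algebra(E: frozenset, A: set) -> set:
    sigma = {frozenset(), E} | set(A)
    while True:
        new = {E - S for S in sigma}                       # complements
        new |= {S | T for S, T in combinations(sigma, 2)}  # pairwise unions
        if new <= sigma:                                   # nothing new: done
            return sigma
        sigma |= new

E = frozenset({1, 2, 3, 4})
A = frozenset({1, 2})
sigma = generated_sigma_algebra(E, {A})
# Example I.8(1): sigma({A}) = {empty set, A, E \ A, E}
assert sigma == {frozenset(), A, E - A, E}
```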
In general, given E and a collection A of subsets of E, it might be very difficult (or impossible)
to list all the elements of σ(A ), since one has to take unions and intersections of arbitrary families
of countably many elements in A (and superimpose these). A fundamentally important σ-algebra
is the Borel σ-algebra on R.
Definition I.9 (Borel σ-algebra). The Borel σ-algebra B = B(R) on R is the σ-algebra generated
by the collection of the intervals
{(−∞, a] : a ∈ R}.
A set A ∈ B is called a Borel set.

Equivalently, B is the σ-algebra generated by all open sets in R (those sets A ⊆ R that contain
every point together with some neighbourhood). The σ-algebra B is properly contained in the power
set P(R), i.e. there exist subsets of R that are not Borel, but constructing these is nontrivial.
Problem I.10. Show that B is generated by the open intervals {(a, b) : a < b}, or, alternatively,
by the semi-closed intervals {(a, b] : a < b}.
I.4. Extending to measures. In this section we are concerned with the situation when we are
given a collection A of subsets of E, and the values of µ(A) for all A ∈ A , and are interested in
extending µ to the whole σ-algebra σ(A ). Both the existence of such a measure and its uniqueness
are of fundamental importance.
Definition I.11 (π-system). Let E be a set, and A a collection of subsets of E. We say that A
is a π-system, if (i) ∅ ∈ A , and (ii) for all A, B ∈ A , A ∩ B ∈ A .
Example I.12. The collections {(−∞, a] : a ∈ R} ∪ {∅}, {(a, b) : a < b} ∪ {∅} and {(a, b] : a <
b} ∪ {∅} are all π-systems that generate the Borel σ-algebra.
From this point on, we will not care about ∅, so, for instance, we treat the collection {(−∞, a] : a ∈ R} as if it were a π-system. In what follows we will be concerned with proving the following important result, asserting the uniqueness of the extension to a measure from π-systems to the generated σ-algebra.
Theorem I.13 (Uniqueness of extension to measure). Let A be a π-system on E, E = σ(A ), and
µ1 , µ2 be two measures on (E, E ). Then, if µ1 = µ2 on A (i.e. for every A ∈ A , µ1 (A) = µ2 (A)),
and, further, µ1 (E) = µ2 (E) < ∞, it follows that µ1 = µ2 on E .
That is, prescribing the values of a measure on a π-system A , implies uniqueness of the measure
on σ(A ) with the given values. For example, prescribing the values {µ((−∞, x]) : x ∈ R} (thinking
of these as P (X ≤ x)), prescribes uniquely a probability measure on the Borel σ-algebra B(R) on
R. However, we didn’t assert that such a measure exists at all, and, in general, unless we impose further assumptions, it might not exist.
Most of the rest of this section is dedicated to proving Theorem I.13. The following notion of
d-systems is important in the course of the proof of the uniqueness of the extension from π-systems
to the full σ-algebra.
Definition I.14 (d-system (λ-system)). Let E be a set, and A a collection of subsets of E. We
say that A is a d-system (also called λ-system), if it satisfies the following axioms:
i. E ∈ A .
ii. For all A ∈ A , E \ A ∈ A , i.e. the complements of elements of A are in A .

iii. For all disjoint sequences {An}n≥1 ⊆ A of elements in A , the union ∪_{n=1}^∞ An is in A .

Lemma I.15. A collection A of subsets of E is a σ-algebra, if it is both a π-system and a d-system.


Proof. Homework assignment.

Proposition I.16 (“Dynkin’s Lemma”). Let A be a π-system of subsets of some set E. If D is a d-system of subsets of E so that A ⊆ D, then also σ(A ) ⊆ D.
Proof. In what follows we show that
σ(A ) = ∩_{D a d-system, A ⊆ D} D,    (I.4)

i.e. σ(A ) is the intersection of all d-systems containing A . This certainly implies the statement of Proposition I.16, since the postulated D is one of those d-systems on the r.h.s. of (I.4). Denote by D0 the set on the r.h.s. of (I.4), i.e. D0 := ∩_{D a d-system, A ⊆ D} D, that we claim to be, in fact, a σ-algebra.
This is sufficient to yield σ(A ) ⊆ D0 , since σ(A ) is the minimal σ-algebra containing A (see Lemma
I.7), whereas the inclusion σ(A ) ⊇ D0 is trivial from (I.4), since σ(A ) is one of those D on the r.h.s.
of (I.4).
It then remains to prove that D0 is a σ-algebra. On one hand, it is easy to check that it is a
d-system as an intersection of d-systems. In what follows we will establish that D0 is a π-system,
which will conclude the proof of Proposition I.16, thanks to Lemma I.15. First, since ∅ ∈ A ⊆ D0 ,
it follows that ∅ ∈ D0 , which is (i) of Definition I.11. We are then reduced to proving that for all
A, B ∈ D0 , A ∩ B ∈ D0 . We prove that in 3 steps:
(1) Step 1: If A, B ∈ A , then A ∩ B ∈ A ⊆ D0 , since A is a π-system.
(2) Step 2: We fix B ∈ D0 and consider the collection
CB := {A ⊆ E : A ∩ B ∈ D0 }
of subsets of E, eventually (after Step 3) aiming to prove that D0 ⊆ CB ; this is merely a rephrasing of the statement of Proposition I.16. By Step 1, for every B ∈ A (rather than B ∈ D0 , unfortunately, something we will have to relax during Step 3), A ⊆ CB .
Therefore, for B ∈ A to prove that D0 ⊆ CB , it is sufficient to prove that CB is a d-system,
since then it is one of the sets in the intersection on the r.h.s. of (I.4) (which is, by definition,
D0 ).
First, clearly, E ∈ CB , since E ∩ B = B, which is (i) of Definition I.14. Next, if A ∈ CB ,
then
(E \ A) ∩ B = E \ ((A ∩ B) ∪ (E \ B)) ∈ D0 ,
since D0 is a d-system, and (A ∩ B) ∪ (E \ B) is a union of disjoint sets, so that E \ A ∈ CB ;
this is (ii) of Definition I.14. Finally, if {An }n≥1 is a disjoint sequence of elements in CB , then

(∪_{n=1}^∞ An ) ∩ B = ∪_{n=1}^∞ (An ∩ B) ∈ D0 ,
since {An ∩ B}n≥1 is a disjoint sequence in the d-system D0 , so that ∪_{n=1}^∞ An ∈ CB , which is
(iii) of Definition I.14. The above shows that CB is a d-system, for every B ∈ A , which, as it
was mentioned above, implies that CB ⊇ D0 , i.e. for every A ∈ D0 and B ∈ A , A ∩ B ∈ D0 .
(3) Step 3: This time we fix B ∈ D0 , and define again CB := {A ⊆ E : A ∩ B ∈ D0 }; now we
can infer by Step 2, that CB ⊇ A , and we already proved in Step 2 that CB is a d-system
(regardless of whether B ∈ A or not). Hence, by the above logic, D0 ⊆ CB , i.e. for every
A, B ∈ D0 , A ∩ B ∈ D0 , hence D0 is a π-system, as claimed.

Proof of Theorem I.13. Let D := {A ∈ σ(A ) : µ1 (A) = µ2 (A)}. We know by assumption that
A ⊆ D, and wish to show that D is a d-system, for then it contains σ(A ) by Proposition I.16,
which is a restatement of Theorem I.13.
We are then to check properties (i)-(iii) of Definition I.14. (i) E ∈ D by the assumptions of
Theorem I.13. (ii) Let A ∈ D. For i = 1, 2, we have
µi (A) + µi (E \ A) = µi (E) < ∞,

by the additivity of µi (particular case of σ-additivity). Hence if µ1 (A) = µ2 (A), that forces µ1 (E \
A) = µ2 (E \ A), hence E \ A ∈ D, which concludes (ii) of Definition I.14. (iii) If {An }n≥1 is a disjoint

sequence of elements of D, and A = ∪_{n=1}^∞ An , then by the σ-additivity of both µ1 and µ2 , it follows that
µ1 (A) = Σ_{n=1}^∞ µ1 (An ) = Σ_{n=1}^∞ µ2 (An ) = µ2 (A),
which demonstrates that also A ∈ D.

Having dealt with the uniqueness of extension of a set function to a measure, we are now concerned
with its existence.
Definition I.17 (Ring). Let E be a set, and A a collection of subsets of E. We say that A is a
ring on E, if (i) ∅ ∈ A and (ii) for all A, B ∈ A , both B \ A ∈ A and A ∪ B ∈ A .
Theorem I.18 (Carathéodory’s Extension Theorem). Let A be a ring on E, and µ : A → [0, ∞] a function satisfying: (i) µ(∅) = 0 and (ii) for all sequences {An}n≥1 ⊆ A of disjoint elements of A so that also ∪_{n=1}^∞ An ∈ A , one has
µ(∪_{n=1}^∞ An ) = Σ_{n=1}^∞ µ(An ).

Then µ extends to a measure on σ(A ).


Hence one can obtain a measure by prescribing it on a ring, via Carathéodory’s Extension Theorem. The proof of Carathéodory’s Extension Theorem is outside the scope of this course. Of particular interest to us is the measure endowing sets of real numbers with a notion of length (if it makes sense),
the Borel measure constructed from the natural π-system of sets {(a, b] : a < b}, extended to the
generated Borel σ-algebra B(R).
Theorem I.19. There exists a unique measure µ on (R, B(R)), so that for all a, b ∈ R with a < b,
µ((a, b]) = b − a.
This measure µ is called the Borel measure.
It is possible to extend (“complete”) the Borel measure µ to an even bigger σ-algebra on R; the resulting measure is called the Lebesgue measure, and its restriction to B(R) equals µ; we will not deal with that in this course. Unlike the Borel measure, the Lebesgue measure is complete, i.e. every subset of a zero measure set is measurable.
Proof. Let A be the ring on R consisting of all finite unions of disjoint intervals of the form
A = (a1 , b1 ] ∪ . . . ∪ (am , bm ]. (I.5)
Note that such a representation is not unique, e.g. (0, 1] = (0, 1/2] ∪ (1/2, 1]. Then, by Problem I.10,
A generates the Borel σ-algebra B(R).
For an A of the form (I.5) define
µ(A) := Σ_{i=1}^m (bi − ai );

note that, despite the lack of uniqueness of a representation (I.5) of the given A, the corresponding
number µ(A) is well-defined, e.g. for A = (0, 1] = (0, 1/2] ∪ (1/2, 1], we have µ(A) = 1 = 1/2 + 1/2. Now

we wish to apply Carathéodory’s Extension Theorem I.18 to µ, to which end we need to check the
countable additivity of µ, the main difficulty of this theorem.
First, µ(∅) = 0, and also the (finite) additivity µ(A ∪ B) = µ(A) + µ(B) for all A, B ∈ A
disjoint is easy. Now suppose that {Ai}i≥1 ⊆ A is a disjoint sequence of elements of A so that also A := ∪_{i=1}^∞ Ai ∈ A , and define the increasing sequence Bn := ∪_{i=1}^n Ai , so that Bn ↑ A (i.e. Bn ⊆ Bn+1 and ∪_{n=1}^∞ Bn = A). Since, by the additivity of µ,
µ(Bn ) = Σ_{i=1}^n µ(Ai ) −→ Σ_{i=1}^∞ µ(Ai ) as n → ∞,

for the σ-additivity it is sufficient to show that µ(Bn ) → µ(A).


Now, take Cn := A \ Bn . Then Cn ∈ A , since A is a ring, and Cn ↓ ∅ (i.e. C1 ⊇ C2 ⊇ C3 ⊇ . . . and ∩_{n=1}^∞ Cn = ∅), and, from the above, it is sufficient to prove that µ(Cn ) → 0. Suppose, by contradiction, otherwise, i.e. there exists some ε0 > 0 so that for every n ≥ 1, µ(Cn ) > 2ε0 (recall that µ(Cn ) is monotone decreasing). Since Cn ∈ A , we may write
Cn = (a^n_1 , b^n_1 ] ∪ . . . ∪ (a^n_{mn} , b^n_{mn} ],
and set
Dn := (a^n_1 + δn , b^n_1 ] ∪ . . . ∪ (a^n_{mn} + δn , b^n_{mn} ],
with some small δn > 0 that we are free to choose.
Note that Dn ∈ A , provided that δn is sufficiently small. We have that
D̄n = [a^n_1 + δn , b^n_1 ] ∪ . . . ∪ [a^n_{mn} + δn , b^n_{mn} ] ⊆ Cn    (I.6)
(this is the closure of Dn , a closed set, which, by definition, is a complement of an open set, covered in 1st year analysis), and for δn < ε0 · 2^{−n}/mn ,
µ(Cn \ Dn ) < mn · δn < ε0 · 2^{−n} ,    (I.7)
by our choice of δn . Then, since Cn ↓ ∅,
µ(Cn \ (D1 ∩ . . . ∩ Dn )) = µ((Cn \ D1 ) ∪ . . . ∪ (Cn \ Dn )) ≤ µ((C1 \ D1 ) ∪ . . . ∪ (Cn \ Dn ))
≤ Σ_{i=1}^n µ(Ci \ Di ) ≤ Σ_{i=1}^n ε0 · 2^{−i} < ε0 ,

thanks to (I.7). Hence, this, together with the assumption that µ(Cn ) ≥ 2ε0 , implies that µ(D1 ∩ . . . ∩ Dn ) ≥ ε0 , in particular, that for every n ≥ 1, D1 ∩ . . . ∩ Dn ≠ ∅, and also
Kn := D̄1 ∩ . . . ∩ D̄n ≠ ∅.
To summarize, the sequence {Kn}n≥1 is a decreasing sequence of bounded non-empty closed sets in R. Then, by Cantor’s Lemma, ∩_{n=1}^∞ Kn ≠ ∅, but, upon recalling that, by construction (and (I.6)), Kn ⊆ Cn , hence
∅ ≠ ∩_{n=1}^∞ Kn ⊆ ∩_{n=1}^∞ Cn ,
contradicting the assumption ∩_{n=1}^∞ Cn = ∅. All in all, this proves the σ-additivity of µ defined on A , hence, in light of Theorem I.18, the existence of a measure µ on σ(A ) = B(R), as required.

For the uniqueness we wish to apply Theorem I.13 on µ. However, here the assumption µ(E) < ∞
is violated. We bypass this setback by intersecting the sets with finite intervals, as follows. Let λ, µ be two measures on (R, B(R)) with λ((a, b]) = µ((a, b]) = b − a for all a < b (e.g. µ is the
Borel measure whose existence was established above, and λ any other measure with the defining
properties on the π-system). Take a number n ∈ Z, and define the measures µn and λn on (R, B(R))
as µn (A) = µ((n, n + 1] ∩ A) and λn (A) = λ((n, n + 1] ∩ A). Then, for every n ∈ Z, λn and µn
are probability measures on (R, B(R)) so that λn = µn on the π-system {(a, b] : a < b} generating
B(R).
Since λn and µn satisfy the conditions of Theorem I.13 (unlike λ and µ), it follows that λn = µn
on B(R). But then for every A ∈ B,
µ(A) = Σ_{n∈Z} µ(A ∩ (n, n + 1]) = Σ_{n∈Z} µn (A) = Σ_{n∈Z} λn (A) = Σ_{n∈Z} λ(A ∩ (n, n + 1]) = λ(A),
by the σ-additivity of both µ and λ. That is, µ = λ.



For future reference we will need the following lemma dealing with an increasing sequence of
sets. A sequence of sets {An }n≥1 is increasing if for every n ≥ 1, one has An ⊆ An+1 , so that
A1 ⊆ A2 ⊆ A3 ⊆ . . .
Lemma I.20. Let (E, E , µ) be a measure space, {An}n≥1 ⊆ E be an increasing sequence of measurable sets, and A = ∪_{n=1}^∞ An . Then
µ(A) = lim_{n→∞} µ(An ).

Proof. To use the σ-additivity of µ, we are going to decompose the sequence {An } into disjoint sets
in the following way. Set B1 := A1 ∈ E , and, for n ≥ 2, Bn := An \ An−1 ∈ E . Then the {Bi }i≥1 are
disjoint sets in E , with
An = ∪_{i=1}^n Bi
for all n ≥ 1, and, in particular,
A = ∪_{n=1}^∞ An = ∪_{i=1}^∞ Bi .
Therefore, by the σ-additivity of µ,
µ(A) = µ(∪_{i=1}^∞ Bi ) = Σ_{i=1}^∞ µ(Bi ) = lim_{n→∞} Σ_{i=1}^n µ(Bi ) = lim_{n→∞} µ(∪_{i=1}^n Bi ) = lim_{n→∞} µ(An ).

I.5. Independence. Recall that a probability space is the triple (Ω, F , P ), where Ω is the sample
space, i.e. the set of all possible outcomes of some random experiment, F is the collection of
events, i.e. the set of all observable sets of outcomes (that are subsets of Ω), and P is a probability
measure on Ω, i.e. for an event A ∈ F , P (A) is the probability of the occurrence of A.
Definition I.21 (Almost sure events). An event A ∈ F occurs almost surely (A is an almost
sure event), abbreviated a.s., if P (A) = 1.
Example I.22. Let Ω = [0, 2π), F the Borel σ-algebra on Ω (this is B(R) restricted to Ω), and P = (1/(2π)) µ, with µ the Borel measure on Ω (so that P (Ω) = 1, as it should be). In this case (Ω, F , P ) corresponds to a random point drawn uniformly on the circle, i.e. θ ∈ [0, 2π) is a random uniform

angle. Let A = [0, 2π) \ Q be the event of drawing an irrational angle. Then P (A) = 1, since, by the
σ-additivity of µ, the probability of its complement is
P (Q ∩ [0, 2π)) = Σ_{x∈Q∩[0,2π)} P ({x}) = 0.

Therefore, with probability 1, the randomly drawn angle is irrational, i.e. the event A of drawing an
irrational angle is almost sure.
Example I.23 (Infinite toss of a fair coin). Here we assume that a fair coin (i.e. the probability of a head or a tail is 1/2) is tossed infinitely (countably) many times. The corresponding sample space is
Ω = {0, 1}^N = {(ω1 , ω2 , . . .) : ∀i ≥ 1, ωi ∈ {0, 1}},
where ωi = 1 means that the i’th coin toss is a head. As a σ-algebra we take F to be generated by
the events of the form
A_(ε1 ,...,εn ) = {ω1 = ε1 , . . . , ωn = εn }
as n ∈ N and (ε1 , . . . , εn ) ∈ {0, 1}^n (“cylinder sets”). As a probability measure, we take the one defined on the cylinder sets as P (A_(ε1 ,...,εn ) ) = 2^{−n} .
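The cylinder-set probabilities can be illustrated numerically. The following Monte Carlo sketch (a Python illustration, not part of the notes; it only ever simulates finitely many tosses) estimates P (A_(ε1 ,...,εn )) and compares it with 2^{−n}.

```python
# Sketch (not from the notes): estimating the probability of a cylinder set
# A_(eps_1, ..., eps_n) = {omega_1 = eps_1, ..., omega_n = eps_n} for a fair coin.
import random

def estimate_cylinder_probability(eps, trials: int = 200_000) -> float:
    n = len(eps)
    hits = 0
    for _ in range(trials):
        omega_prefix = [random.randint(0, 1) for _ in range(n)]  # first n tosses
        hits += (omega_prefix == list(eps))
    return hits / trials

eps = (1, 0, 1)  # "head, tail, head"
print(estimate_cylinder_probability(eps), 2 ** -len(eps))  # both close to 0.125
```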
Definition I.24 (Independence). Let (Ω, F , P ) be a probability space, and I be a countable index
set.
(1) The events {Ai }i∈I are independent, if for all finite J ⊆ I,
P (∩_{i∈J} Ai ) = Π_{i∈J} P (Ai ).    (I.8)

(2) Let {Ei }i∈I be a family of σ-algebras so that for all i ∈ I, Ei ⊆ F (i.e. it is a family of
sub-σ-algebras of F ). We say that the {Ei }i∈I are independent, if every sequence {Ai }i∈I
so that for all i ∈ I, Ai ∈ Ei , is independent (as a sequence of events).
Intuitively, independence means that one random experiment does not convey any information
on another one. One might think that the logic behind the use of (I.8) is circular, as in order to
exploit independence, it is necessary to check its definition (I.8) in the first place, but then it gains
nothing. But, usually, it is clear from the context, when two events (or σ-algebras) are independent.
For example, if a fair coin is tossed several times, then the outcomes of these tosses are independent
(unless the coin is deformed during these). The following example shows that for a sequence of events
to be independent, it is not sufficient that every pair of events is independent (i.e. independence is
strictly stronger than pairwise independence).
Example I.25. A fair coin is tossed twice, so that the sample space is
Ω = {HH, HT, T H, T T },
where each outcome has probability 1/4. Take
A1 = {T T, T H} = “T first”
A2 = {T T, HT } = “T second”
A3 = {T H, HT } = “exactly one T”
Here we have
P (A1 ) = P (A2 ) = P (A3 ) = 1/2,
P (A1 ∩ A2 ) = P (A2 ∩ A3 ) = P (A1 ∩ A3 ) = 1/4,

but
P (A1 ∩ A2 ∩ A3 ) = 0 ≠ 1/8 = P (A1 ) · P (A2 ) · P (A3 ).
So the events {A1 , A2 , A3 } are not independent, even though every pair of these events are indepen-
dent. The reason for this is that the outcome of any pair of these events determines the outcome of
the third one.
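Example I.25 can be verified mechanically by enumerating the four equally likely outcomes; the following Python sketch (not part of the notes) does exactly that.

```python
# Sketch (not from the notes): checking Example I.25 by enumerating the
# four equally likely outcomes of two fair coin tosses.
from itertools import product

omega = list(product("HT", repeat=2))          # {HH, HT, TH, TT}
P = lambda A: len(A) / len(omega)              # uniform probability

A1 = {w for w in omega if w[0] == "T"}         # T first
A2 = {w for w in omega if w[1] == "T"}         # T second
A3 = {w for w in omega if w.count("T") == 1}   # exactly one T

# pairwise independence holds:
for X, Y in [(A1, A2), (A2, A3), (A1, A3)]:
    assert P(X & Y) == P(X) * P(Y)
# ... but the triple is not independent:
assert P(A1 & A2 & A3) == 0 != P(A1) * P(A2) * P(A3)
```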
Suppose that one needs to test the independence of two σ-algebras. Of course one can check
whether the equality (I.8) is satisfied for any events lying in the respective σ-algebras. However, it
is sufficient to apply it on a generating set only, as asserted in the following theorem.
Theorem I.26. Let (Ω, F , P ) be a probability space, and A1 , A2 ⊆ F two π-systems contained in
F . If the equality
P (A1 ∩ A2 ) = P (A1 ) · P (A2 )
is satisfied for all A1 ∈ A1 , A2 ∈ A2 , then σ(A1 ) and σ(A2 ) are independent.
Proof. Fix a set A1 ∈ A1 , whence it is easy to check that the set functions µ(A), ν(A) defined on
A ∈ F as
µ(A) := P (A1 ∩ A); ν(A) := P (A1 ) · P (A)
are measures. Moreover, by the assumptions of Theorem I.26, one has µ(A) = ν(A) on the π-system
A2 . Hence, by Theorem I.13, for all A2 ∈ σ(A2 ),
µ(A2 ) = ν(A2 ),
i.e.
P (A1 ∩ A2 ) = P (A1 ) · P (A2 )
for all A1 ∈ A1 , A2 ∈ σ(A2 ). Now fix A2 ∈ σ(A2 ) (not restricted to A2 ∈ A2 ), and apply the same
argument with
µ(A) := P (A ∩ A2 ); ν(A) := P (A) · P (A2 ),
to yield that for all A1 ∈ σ(A1 ),
P (A1 ∩ A2 ) = µ(A1 ) = ν(A1 ) = P (A1 ) · P (A2 ),
as asserted by Theorem I.26. 
Example I.27. For a roll of two dice, the corresponding sample space is
Ω = {(ω1 , ω2 ) : ωi ∈ {1, 2, . . . , 6}} = {1, 2, . . . , 6}²,
the σ-algebra of events is the whole power set F = P(Ω), and P is the uniform distribution on Ω (i.e. each atomic outcome gets probability 1/36). We take the events Ak = {first die has score k}, Ck = {second die has score k}, k = 1, . . . , 6, and Bk = {sum of the two scores is k}, k = 2, . . . , 12. Then P (A6 ) = P (B7 ) = 1/6,
and
P (A6 ∩ B7 ) = 1/36 = P (A6 ) · P (B7 ),
since the only way that both A6 and B7 occur is (6, 1). Hence the events A6 and B7 are independent. But clearly, the outcome of the first die has an effect on the total sum, so Ak can’t be independent of Bj for all possible values of k, j, e.g.
P (A2 ∩ B11 ) = 0 ≠ P (A2 ) · P (B11 ) = (1/6) · (2/36).
On the other hand, clearly, all the events Ak are independent of all the events Cj .

I.6. The Borel-Cantelli lemma. Let (Ω, F , P ) be a probability (or measure) space, and {An}n≥1 be a sequence of events. If for some n ≥ 1 and some sample point x ∈ Ω, we have x ∈ ∪_{m≥n} Am , that means that x is in some Am with m ≥ n, i.e. x is in some tail of the sequence {An}n≥1 . Therefore, if x ∈ ∩_{n=1}^∞ ∪_{m≥n} Am , then x ∈ An with n arbitrarily large, or, equivalently, x is an element of an infinite subsequence of {An}n≥1 . We then define
{An infinitely often} = {An i.o.} = ∩_{n=1}^∞ ∪_{m≥n} Am ,
also denoted
lim sup_{n→∞} An := ∩_{n=1}^∞ ∪_{m≥n} Am .    (I.9)
The set lim sup_{n→∞} An is an event, by (I.9), and the usual operations inside a σ-algebra F .
Similarly, if for some n ≥ 1, x ∈ ∩_{m≥n} Am , then it means that x ∈ Am for all m ≥ n, and then if we allow n to be arbitrary by taking x ∈ ∪_{n=1}^∞ ∩_{m≥n} Am , that means that x ∈ An for n sufficiently large (“eventually”). We then define
lim inf_{n→∞} An = {An eventually} = {An ev.} := ∪_{n=1}^∞ ∩_{m≥n} Am .    (I.10)

Similarly to lim sup_{n→∞} An , lim inf_{n→∞} An is an event, by (I.10).
Since, clearly, belonging to An eventually is (strictly) stronger than for infinitely many n, one has
lim inf_{n→∞} An ⊆ lim sup_{n→∞} An .
Moreover, by De Morgan’s law, if we denote for an event A ∈ F its complement w.r.t. Ω, A^c := Ω \ A, we may write
(lim sup_{n→∞} An )^c = (∩_{n=1}^∞ ∪_{m≥n} Am )^c = ∪_{n=1}^∞ (∪_{m≥n} Am )^c = ∪_{n=1}^∞ ∩_{m≥n} A^c_m = lim inf_{n→∞} A^c_n .    (I.11)

Lemma I.28 (Borel-Cantelli). Let (Ω, F , P ) be a probability space, and {An }n≥1 ⊆ F be a sequence
of events. Then:

(1) If Σ_{n=1}^∞ P (An ) < ∞, then P ({An i.o.}) = 0.
(2) If the events {An}n≥1 are independent and Σ_{n=1}^∞ P (An ) = ∞, then P ({An i.o.}) = 1.

Intuitively, the Borel-Cantelli Lemma I.28 asserts that, if the probabilities P (An ) decay sufficiently rapidly, then, almost surely, An does not occur infinitely often. Its counterpart, valid only under the independence assumption, states that if these probabilities do not decay sufficiently rapidly, then, almost surely, An occurs infinitely often. In this case (i.e. if {An}n≥1 are assumed to be independent), An occurs infinitely often almost surely, if and only if the series Σ_{n=1}^∞ P (An ) diverges. The analogue

of the Borel-Cantelli lemma, properly rephrased, holds for measure spaces in place of probability
spaces.
Proof. (1) Recall that
{An i.o.} = ∩_{n=1}^∞ ∪_{m≥n} Am .
Therefore, for every n0 ≥ 1, {An i.o.} ⊆ ∪_{m≥n0} Am , so that its probability is bounded above by
P ({An i.o.}) ≤ P (∪_{m≥n0} Am ) ≤ Σ_{n=n0}^∞ P (An ),
that could be made arbitrarily small by taking n0 large, as a tail of a convergent series. Since P ({An i.o.}) does not depend on n0 , it follows that it is of zero probability.
(2) We set an = P (An ), and will use the inequality 1 − a ≤ e^{−a} , valid for a ∈ R, for a = an . Since the events {An}n≥1 are independent, so are {A^c_n}n≥1 . Hence we have for n ≥ 1 and N ≥ n that
P (∩_{m=n}^N A^c_m ) = Π_{m=n}^N (1 − am ) ≤ Π_{m=n}^N e^{−am} = exp(− Σ_{m=n}^N am ) → 0
as N → ∞ (n is fixed). Hence, since ∩_{m≥n} A^c_m ⊆ ∩_{m=n}^N A^c_m , its probability is P (∩_{m≥n} A^c_m ) = 0, valid for all n ≥ 1, and so, by the σ-additivity of P ,
P ({A^c_n ev.}) = P (∪_{n=1}^∞ ∩_{m≥n} A^c_m ) ≤ Σ_{n=1}^∞ P (∩_{m≥n} A^c_m ) = 0.
Finally, the duality (I.11) implies that
P ({An i.o.}) = 1 − P ({A^c_n ev.}) = 1 − 0 = 1.

Example I.29. A fair coin is tossed independently infinitely many times. Let
An = {Head at the n’th toss}.
Then the {An}n≥1 are independent, and
Σ_{n=1}^∞ P (An ) = Σ_{n=1}^∞ 1/2 = ∞,
hence by the Borel-Cantelli Lemma I.28, P (get a head i.o.) = 1. This is a rather crude result, and
various laws of large numbers discussed later will establish more precise, quantitative results holding
either with high probability or almost surely on the number of heads after N → ∞ tosses.
Example I.30. Let Xn , n ≥ 1, be i.i.d. standard exponential random variables, i.e. whose probability density function is f (x) = e^{−x} , x ≥ 0. We are interested in the asymptotic behaviour of max_{1≤i≤n} Xi as n → ∞; it is reasonable to assume that max_{1≤i≤n} Xi grows to infinity, the question being how fast. Let α > 0 be a fixed number, and take the events An := {Xn ≥ α · log n}. Then {An}n≥1 are independent, and
P (An ) = ∫_{α·log n}^∞ e^{−x} dx = e^{−α·log n} = n^{−α} .


It then follows that the series Σ_{n=1}^∞ P (An ) is convergent, if and only if α > 1. Hence, by the Borel-Cantelli lemma, for every ε > 0,
P (Xn / log n ≥ 1 + ε i.o.) = 0,
whereas
P (Xn / log n ≥ 1 i.o.) = 1.
It then follows that max_{1≤i≤n} Xi grows like log n precisely, in the following sense. Recall that for a sequence an of real numbers, lim sup_{n→∞} an and lim inf_{n→∞} an were defined. These always exist, unlike a limit that might or might not exist for a given sequence, and lim_{n→∞} an exists, if and only if
lim sup_{n→∞} an = lim inf_{n→∞} an .
In this language, we may rewrite the conclusions of the above discussion as
P (lim sup_{n→∞} Xn / log n = 1) = 1;
mind that lim_{n→∞} Xn / log n does not exist almost surely, as the sequence fluctuates, and P (Xn ≤ 1 i.o.) = 1 can be similarly proved.
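The conclusion of Example I.30 can also be observed in simulation. The following Python sketch (not part of the notes) tracks the running maximum of i.i.d. standard exponentials and prints its ratio to log n, which approaches 1.

```python
# Sketch (not from the notes): Monte Carlo illustration of Example I.30 --
# the running maximum of i.i.d. standard exponentials grows like log(n).
import math
import random

random.seed(0)
n_max = 10**5
running_max = 0.0
for n in range(1, n_max + 1):
    running_max = max(running_max, random.expovariate(1.0))  # standard exponential
    if n in (10**2, 10**3, 10**4, 10**5):
        print(n, running_max / math.log(n))  # ratio should be close to 1
```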

II. Measurable functions and random variables


II.1. Measurable functions.
Definition II.1 (Measurable functions). Let (E, E ) and (G, G ) be two measurable spaces, and f : E → G a function. The function f is measurable (or E − G -measurable), if for every A ∈ G , its inverse image f −1 (A) ∈ E .
The most important class of measurable functions of interest to us is f : E → R, where the range R is equipped with the Borel σ-algebra G = B(R).
Example II.2. Let A ⊆ E be a set, and consider the indicator function χA : E → R defined as χA (x) = 1 if x ∈ A, and χA (x) = 0 if x ∉ A. Then for a Borel set C ∈ B(R), the inverse image is
χA^{−1} (C) = ∅ if 0, 1 ∉ C; E \ A if 0 ∈ C, 1 ∉ C; A if 0 ∉ C, 1 ∈ C; and E if 0, 1 ∈ C.
Therefore, χA is measurable, if and only if ∅, A, E \ A, E ∈ E , which, by the properties of a σ-algebra, is equivalent to A ∈ E .
Definition II.3. Let (Ω, F , P ) be a probability space, and (G, G ) a measurable space. A measur-
able function X : Ω → G is called a (G-valued) random variable. By default, X : Ω → R, where
R is equipped with the Borel σ-algebra.

Given a random variable X : Ω → R and a Borel set A ⊆ R, we define the set


{X ∈ A} := X −1 (A) = {ω ∈ Ω : X(ω) ∈ A};
it is an event (i.e. {X ∈ A} ∈ F ), by the measurability of X; thereupon its probability
P (X ∈ A) = P ({X ∈ A}) := P (X −1 (A))
makes sense. The following criterion for measurability is easier to validate than the definition:
Lemma II.4. A function f : E → R from the measurable space (E, E ) to (R, B(R)) is measurable,
if and only if for every y ∈ R,
f −1 ((−∞, y]) = {x ∈ E : f (x) ≤ y} ∈ E .
The proof of Lemma II.4 will be proposed as a homework exercise. Suppose now that f : R → R
is a continuous function. Then, for every y ∈ R, the set f −1 ((−∞, y]) is closed, so that, in particular,
f −1 ((−∞, y]) ∈ B(R). Therefore we obtain the following conclusion.
Lemma II.5. Every continuous function f : R → R is B(R)-measurable (where both the domain
and the range are equipped with the Borel σ-algebra).
The class of measurable functions is much wider than that of continuous functions. Measurable functions are closed under the following operations:
Proposition II.6. Let fn : E → R, n ≥ 1 be a sequence of measurable functions. Then:
i. f1 + f2 is measurable.
ii. f1 · f2 is measurable.
iii. sup_{n≥1} fn is measurable.
iv. inf_{n≥1} fn is measurable.
v. lim sup_{n→∞} fn is measurable. In particular, lim_{n→∞} fn is measurable, if it exists.
vi. lim inf_{n→∞} fn is measurable.

Proof. i. By Lemma II.4, it is sufficient to show that for all y ∈ R,


{f1 + f2 ≤ y} = {x ∈ E : f1 (x) + f2 (x) ≤ y} ∈ E ,
and since E is a σ-algebra, it is equivalent to
E \ {f1 + f2 ≤ y} = {x ∈ E : f1 (x) + f2 (x) > y} ∈ E .
However, if f1 (x) + f2 (x) > y, then f2 (x) > y − f1 (x), and for ε > 0 sufficiently small, if |q − f1 (x)| < ε, then still f2 (x) > y − q, and we may take such q ∈ Q with f1 (x) > q. Conversely, if such a q ∈ Q (satisfying f1 (x) > q and f2 (x) > y − q) exists, then, evidently, f1 (x) + f2 (x) > y.
We conclude that
{f1 + f2 > y} = ∪_{q∈Q} ({f1 > q} ∩ {f2 > y − q}) ∈ E ,

since f1 , f2 are measurable, and E is a σ-algebra.


ii. Similar to the above, left to homework assignment.
iii. We have
{sup_{n≥1} fn ≤ y} = ∩_{n≥1} {fn ≤ y} ∈ E .
iv. Similar to the above.

v. We recall that for a sequence {an}n≥1 ⊆ R,
lim sup_{n→∞} an = lim_{n→∞} sup_{m≥n} am = inf_{n≥1} sup_{m≥n} am ,
since the sequence {sup_{m≥n} am}n≥1 is monotone decreasing. The conclusion then follows upon applying (iii)-(iv) of this proposition sequentially. Alternatively, one can reuse Lemma II.4.
vi. Similarly to the above, this time with the use of
lim inf_{n→∞} an = lim_{n→∞} inf_{m≥n} am = sup_{n≥1} inf_{m≥n} am .


It follows directly from the definitions that the composition of measurable functions is measurable.
Lemma II.7. If f : E → G and g : G → F are measurable functions (w.r.t. the corresponding
measurable spaces), then g ◦ f is measurable.
II.2. Random variables. A random variable X : Ω → R induces, from the probability measure on Ω, a new measure on R, the probability distribution, which is very informative about X.
Definition II.8 (Probability distribution). The probability distribution of a random variable
X : Ω → R on a probability space (Ω, F , P ) is a probability measure PX : B(R) → [0, 1] on
(R, B(R)), defined as
PX (A) = P (X ∈ A) = P (X −1 (A)).
This makes sense, since X is measurable, hence X −1 (A) ∈ F .
We also define the distribution function FX : R → [0, 1] as
FX (x) = P (X ≤ x) = PX ((−∞, x]) = P (X −1 ((−∞, x])).
By the definition, FX is non-decreasing, satisfying
lim_{x→−∞} FX (x) = 0; lim_{x→+∞} FX (x) = 1.
Moreover, we claim that FX is right-continuous. To see that, we use (the decreasing-sequence variant of) Lemma I.20 with
∅ = ∩_{n=1}^∞ (x0 , x0 + 1/n],
so that
lim_{n→∞} FX (x0 + 1/n) − FX (x0 ) = lim_{n→∞} PX ((x0 , x0 + 1/n]) = 0,
sufficient for the right-continuity.
Further, since A = {(−∞, x] : x ∈ R} is a π-system of subsets of R generating B(R), Theorem
I.13 demonstrates that prescribing FX , defined on R, uniquely determines the whole of PX defined on B(R). That is, FX encodes all the information about the full probability distribution PX of X.
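As a small numerical illustration (a Python sketch, not part of the notes; the exponential distribution is chosen only as an example), the distribution function determines the probabilities of half-open intervals via P (a < X ≤ b) = FX (b) − FX (a), which can be compared with a Monte Carlo estimate.

```python
# Sketch (not from the notes): the distribution function F_X determines the
# probabilities of intervals, P(a < X <= b) = F_X(b) - F_X(a); here X ~ Exp(1).
import math
import random

def F(x: float) -> float:
    """Distribution function of a standard exponential random variable."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

a, b = 0.5, 2.0
exact = F(b) - F(a)

random.seed(1)
trials = 200_000
hits = sum(a < random.expovariate(1.0) <= b for _ in range(trials))
print(exact, hits / trials)  # the two numbers should be close
```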
Example II.9. Set Ω = N, F = P(N), λ > 0, and
P (A) = Σ_{k∈A} (e^{−λ} /k!) λ^k ,
the Poisson probability measure. Then the identity map X : Ω → R, X(ω) = ω, is a Poisson random variable. Here PX = P , i.e. the probability distribution of X (restricted to N) equals the probability measure on Ω.

Given a random variable X : Ω → R, the collection AX = {X −1 ((−∞, y]) : y ∈ R} of all pre-images under X of sets (−∞, y] is a π-system, which is a subset of the σ-algebra F . Hence AX generates a σ-algebra FX which is a sub-σ-algebra of F . Different random variables define different
σ-algebras, and a collection of random variables defined on the same probability space is independent,
if the corresponding σ-algebras are independent, in the sense of Definition I.24. This boils down to
the following definition:
Definition II.10 (Independence of random variables). A sequence {Xn }n≥1 of real-valued random
variables, all defined on the same probability space (Ω, F , P ) are independent, if for all n ≥ 1,
and x1 , x2 , . . . , xn ∈ R,
P (X1 ≤ x1 , . . . Xn ≤ xn ) = P (X1 ≤ x1 ) · . . . · P (Xn ≤ xn ).
By Theorem I.26, and in light of the fact that {(−∞, x] : x ∈ R} is a π-system generating B(R),
it follows that an independent sequence {Xn }n≥1 also satisfies
P (X1 ∈ B1 , . . . Xn ∈ Bn ) = P (X1 ∈ B1 ) · . . . · P (Xn ∈ Bn ),
where B1 , . . . Bn ∈ B(R) are arbitrary Borel sets.
Example II.11 (Infinite coin toss, cf. Example I.23). We take
Ω = {0, 1}^N = {(ω1 , ω2 , . . .) : ∀i ≥ 1, ωi ∈ {0, 1}},
and the σ-algebra F generated by all events of the form
A_(ε1 ,...,εn ) = {ω1 = ε1 , . . . , ωn = εn }
with n ≥ 1, (ε1 , . . . , εn ) ∈ {0, 1}^n . The corresponding probability measure is induced by setting P (A_(ε1 ,...,εn ) ) = 2^{−n} , making the coin tosses independent.
We can take a number m ≥ 1, and define Xk : Ω → R, k = 1, 2, . . ., by
Xk (ω) = Σ_{i=(k−1)m+1}^{km} ωi ,

that counts the number of heads in the m consecutive tosses between (k − 1)m + 1 and km. Since the tosses entering Xk and Xk′ , k ≠ k′ , are disjoint, all the {Xk}k≥1 are i.i.d., distributed ∼ Bin(m, 1/2), i.e.
PXk (A) = P (Xk ∈ A) = Σ_{i∈A∩{0,1,...,m}} \binom{m}{i} 2^{−m} ,

A ∈ B(R). This shows that PXk (on B(R)) is much simpler than the underlying measure P on Ω.
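The distribution PXk of Example II.11 can be checked empirically; the following Python sketch (not part of the notes) simulates one block of m tosses repeatedly and compares the empirical frequencies with the binomial mass function.

```python
# Sketch (not from the notes): the head-count in a block of m fair tosses is
# Bin(m, 1/2); empirical frequencies vs the exact mass function of Example II.11.
import random
from math import comb

random.seed(2)
m, trials = 5, 100_000
counts = [0] * (m + 1)
for _ in range(trials):
    X = sum(random.randint(0, 1) for _ in range(m))  # one block of m tosses
    counts[X] += 1

for i in range(m + 1):
    exact = comb(m, i) * 2.0 ** (-m)     # P(X_k = i) = C(m, i) 2^{-m}
    print(i, counts[i] / trials, exact)  # empirical vs exact, should be close
```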
Example II.12. Take Ω = (0, 1], F = B((0, 1]), and P the Borel measure on (0, 1]. Then the identity map X(ω) = ω is a uniform random variable on (0, 1], i.e. X ∼ U(0, 1].
As we have seen, the same probability space is not exclusively restricted to one random variable,
but can accommodate several random variables with varying distribution measures, depending on
how “rich” this probability space is. For example, the trivial σ-algebra F = {∅, Ω} can only admit
the constant random variables (that have to be measurable w.r.t. F ). The following theorem,
whose proof is outside the scope of this course, shows that ((0, 1], B((0, 1])) is sufficiently rich to
accommodate random variables with unrestricted distributions, and, further, these can be taken to be independent.

Theorem II.13. Let (Ω, F , P ) be the probability space Ω = (0, 1], F = B((0, 1]) and P the Borel
measure on B((0, 1]). Suppose (Fn )n≥1 is an arbitrary sequence of distribution functions (i.e. for
every n ≥ 1, Fn is a distribution function of some random variable). Then there exists a sequence {Xn}n≥1 of independent random variables on (Ω, F , P ) so that, for every n ≥ 1, the distribution function of Xn is Fn .
II.3. Modes of convergence. Let {Xn }n≥1 be a sequence of random variables on a probability
space (Ω, F , P ), and X another random variable on the same space. Here we are interested in
whether Xn converges to X. A number of different modes of convergence could be of relevance; the most immediate (but not necessarily the most “correct”) one is pointwise convergence, i.e., for every ω ∈ Ω,
lim_{n→∞} Xn (ω) = X(ω).
This condition seems to be too strong to impose, as the following example shows.
Example II.14. Return to Example I.23 of infinite toss of a fair coin, and set Xn (ω) = ωn ∈ {0, 1}.
It is reasonable to request that
lim_{n→∞} (1/n) Σ_{i=1}^n Xi (ω) = 1/2,
i.e. the proportion of heads should asymptotically approach 1/2 as the number of tosses grows. However, this does not hold pointwise, for example for ω = (0, 0, . . .), occurring with probability 0.
Recall Definition I.21 of almost sure events and their defining properties (a.s.). Similarly, let (E, E , µ) be a measure space. We say that A (or its defining property) holds almost everywhere (a.e.), if µ(E \ A) = 0.
Definition II.15 (Convergence a.e. and in measure). Let fn : E → R be a sequence of measurable
functions and f : E → R another measurable function.
(1) We have fn → f almost everywhere (almost surely), denoted a.e. (a.s.), if
µ ({x ∈ E : fn (x) ↛ f (x)}) = 0.
(2) We have fn → f in measure (in probability), if for every ε > 0,
µ ({x ∈ E : |fn (x) − f (x)| > ε}) → 0.
Convergence in measure (resp. in probability) is denoted fn −→^µ f (resp. fn −→^P f ).
In practice, we neglect events of measure 0, so the pointwise convergence will be irrelevant for
us, and instead, the a.e. (a.s.) convergence will be of interest. Unrolling the definition of a.e. convergence, it means that there exists a set A, with the measure of its complement vanishing, µ(E \ A) = 0, so that for every x ∈ A and every ε > 0, there exists a number N = N (x, ε) ∈ N sufficiently large, so that for all n > N , |fn (x) − f (x)| < ε. Importantly, N is allowed to depend on the point x. Otherwise, if N only depends on ε, the resulting notion is the much stronger one of uniform convergence. In general, uniform convergence implies pointwise convergence (everywhere), which, in its turn, implies a.e. convergence.
Example II.16. Let fn : R → R, fn (x) = χ(n,n+1] (x), where R is equipped with the Borel measure. Then for every x ∈ R, if n > x + 1, then fn (x) = 0, hence fn → 0 a.e. (in fact, everywhere). However, for 0 < ε < 1, µ({x : |fn | ≥ ε}) = µ((n, n + 1]) = 1, so that fn ↛ 0 in measure.
Example II.17. On the same space as above, take fn (x) = 2n − 2n²x for x ∈ [0, 1/n], and fn (x) = 0 otherwise.

Then, for every x ≠ 0, fn (x) = 0 with n sufficiently large (depending on x). Hence fn (x) → 0 for all x ≠ 0, and thereupon fn → 0 a.e. Further, for every ε > 0,
µ({|fn (x)| > ε}) ≤ 1/n → 0,
so also fn → 0 in measure.
Example II.18. Let fn : [0, 1] → R be the sequence of the characteristic functions of dyadic intervals: f1 = χ[0,1] , f2 = χ[0,1/2] , f3 = χ[1/2,1] , f4 = χ[0,1/4] , f5 = χ[1/4,1/2] , . . ., formally defined as
fn (x) = χ_{[(n−2^k)/2^k , (n−2^k+1)/2^k]} (x),
k ≥ 0, 2^k ≤ n < 2^{k+1} . Then for ε > 0, µ({x : |fn (x)| > ε}) ≤ 2^{−k} → 0 as n → ∞, i.e. fn → 0 in measure.
However, for a fixed x ∈ [0, 1], fn (x) attains both the values 0, 1 with n arbitrarily large, hence the limit lim_{n→∞} fn (x) does not exist, i.e. fn does not converge a.e.

The above examples show that neither convergence a.e. nor convergence in measure implies the other one. Since measure 0 sets are ignored, the limit of a sequence fn, either a.e. or in measure, is only defined up to a measure zero set. For example, fn ≡ 0 converges to f ≡ 0, but we can also take f = χQ, without affecting the convergence. In either case, to define convergence (a.e. or in measure, resp. a.s. or in probability), all the functions (resp. random variables) have to be defined on the same measure space (resp. probability space). In what follows we will mainly be concerned with a.s. convergence and convergence in probability of sequences of random variables Xn defined on the same probability space. A sufficient condition for almost sure convergence can easily be derived from the Borel–Cantelli lemma I.28; mind that Xn → X a.s. (resp. in probability), if and only if Xn − X → 0 a.s. (resp. in probability).
Corollary II.19. (1) If for every ε > 0,
$$\sum_{n=1}^{\infty} P(|X_n| > \varepsilon) < \infty,$$
then Xn → 0 a.s.
(2) Conversely, if the {Xn}n≥1 are independent, and Xn → 0 a.s., then, for every ε > 0,
$$\sum_{n=1}^{\infty} P(|X_n| > \varepsilon) < \infty.$$

Proof. (1) By the Borel–Cantelli Lemma, under the conditions of Corollary II.19(1), for every ε > 0, P(|Xn| > ε i.o.) = 0. The subtlety is that we need to find a set of ω ∈ Ω of full probability on which the above holds for all ε > 0. Let us denote the event
$$A_\varepsilon = \{|X_n| \le \varepsilon \text{ outside of finitely many } n\} = \{\omega \in \Omega : \exists N = N(\omega): \forall n > N,\ |X_n| \le \varepsilon\} \subseteq \left\{\limsup_{n\to\infty} |X_n| \le \varepsilon\right\}.$$
By the above, for all ε > 0, P(A_ε) = 1, and the sought after statement holds on the intersection $A := \bigcap_{\varepsilon > 0} A_\varepsilon$, i.e. on ω ∈ A, for every ε > 0, there exists N = N(ω, ε), so that for all n > N, |Xn| < ε.
How to show that P(A) = 1? By the definition, A is an intersection of a continuous family of events, of full probability each. Unfortunately, the σ-additivity of P is thereupon unable to yield the same about the probability of A. However, we may restrict to ε = 1/n, using the density of these numbers around 0, and then $A = \bigcap_{n=1}^{\infty} A_{1/n}$, whence the σ-additivity (alternatively, the Continuity Lemma I.20, or rather its variant for decreasing sequences) of P does yield P(A) = 1: indeed, $P(A) = 1 - P(A^c)$, and $P(A^c) \le \sum_{n=1}^{\infty} P(A_{1/n}^c) = 0$.
(2) Homework assignment.

Example II.20. Let Xn ∼ Bin(1, n^{−α}) be independent coin tosses with probability of heads $n^{-\alpha}$. Then for 0 < ε < 1,
$$P(|X_n| > \varepsilon) = P(X_n = 1) = n^{-\alpha}$$
(otherwise, if ε ≥ 1, P(|Xn| > ε) = 0). Then, by Corollary II.19, (1) and (2), Xn → 0 a.s., if and only if α > 1.
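The dichotomy in Example II.20 is easy to probe numerically. The following short Python sketch (not part of the formal development; the sample size and the choices of α are illustrative) simulates independent Bernoulli(n^{-α}) variables and records the index of the last observed 1 among the first N tosses: for α > 1 this index stabilises at a small value, in line with the Borel–Cantelli lemma, while for α ≤ 1 new 1's keep appearing.

```python
import numpy as np

rng = np.random.default_rng(0)

def last_success(alpha, N=200_000):
    """Simulate X_n ~ Bin(1, n^(-alpha)) for n = 1..N and return the
    largest n with X_n = 1 (0 if no successes occurred)."""
    n = np.arange(1, N + 1)
    x = rng.random(N) < n ** (-alpha)   # X_n = 1 with probability n^(-alpha)
    hits = n[x]
    return int(hits[-1]) if hits.size else 0

for alpha in (0.5, 1.5):
    print(alpha, [last_success(alpha) for _ in range(3)])
# For alpha = 1.5 the last success tends to be small (finitely many 1's a.s.),
# whereas for alpha = 0.5 it is typically close to N (infinitely many 1's a.s.).
```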
Example II.21. Let X1, X2, . . . be Gaussian random variables (that need not be independent), Xn ∼ N(0, σn²), with
$$\sigma_n^2 = \frac{1}{(\log n)^2} \to 0.$$
We need to estimate the probability P(|Xn| > ε), first by evaluating the probability of a deviation for the standard Gaussian Z ∼ N(0, 1):
$$P(|Z| > t) = 2\int_t^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \le \sqrt{\frac{2}{\pi}} \int_t^{\infty} \frac{x}{t}\, e^{-x^2/2}\,dx = \frac{1}{t}\sqrt{\frac{2}{\pi}} \left[-e^{-x^2/2}\right]_t^{\infty} = \frac{1}{t}\sqrt{\frac{2}{\pi}}\, e^{-t^2/2}.$$
Hence, for every ε > 0,
$$P(|X_n| > \varepsilon) = P\left(|Z| > \frac{\varepsilon}{\sigma_n}\right) \le \sqrt{\frac{2}{\pi}} \cdot \frac{\sigma_n}{\varepsilon} \cdot e^{-\varepsilon^2/(2\sigma_n^2)} = \sqrt{\frac{2}{\pi}} \cdot \frac{1}{\varepsilon \log n} \cdot n^{-\varepsilon^2 \log n/2} \to 0$$
as n → ∞, i.e. $X_n \xrightarrow{P} 0$.
We further claim that for every ε > 0,
$$\sum_{n=1}^{\infty} P(|X_n| > \varepsilon) < \infty,$$
whence Xn → 0 a.s. by Corollary II.19. To this end, given ε > 0 we set n_ε to be the minimal integer satisfying ε² log(n_ε)/2 > 1, and set δ > 0 so that ε² log(n_ε)/2 = 1 + δ. Then
$$\sum_{n=1}^{\infty} P(|X_n| > \varepsilon) \le \sum_{n \le n_\varepsilon} 1 + \sum_{n > n_\varepsilon} \sqrt{\frac{2}{\pi}}\,\frac{1}{\varepsilon \log n}\, n^{-1-\delta} < \infty.$$

Note that the sufficient condition of Corollary II.19(1) in particular implies the convergence Xn → 0
in probability (since the summands of a convergent series vanish). The following result shows that,
under the assumption µ(E) < ∞, the convergence a.e. is stronger than convergence in measure.
Conversely, from convergence in measure one can infer a.e. convergence along an infinite subsequence.
Theorem II.22. Let (E, E , µ) be a measure space and {fn }n≥1 a sequence of measurable functions
on E, and f a measurable function on E.
(1) Assume µ(E) < ∞. If fn → f a.e., then fn → f in measure.
(2) If fn → f in measure, then there exists a subsequence fnk , so that fnk → f a.e.
In probabilistic language, Theorem II.22 states that (1) $X_n \to X$ a.s. implies $X_n \xrightarrow{P} X$ (the finite measure condition is automatically satisfied), and (2) $X_n \xrightarrow{P} X$ implies $X_{n_k} \to X$ a.s. for some subsequence $X_{n_k}$.
Proof. (1) It is convenient to work with gn := fn − f, so that the postulated convergence is gn → 0 a.e. Now take an arbitrary ε > 0, and consider the sets $A_n := \{x \in E : |g_n(x)| \le \varepsilon\}$ and $B_n = \bigcap_{m=n}^{\infty} A_m$. Then the Bn are increasing, and $\bigcup_{n=1}^{\infty} B_n = \{x \in E : |g_n(x)| \le \varepsilon \text{ ev.}\}$, see (I.10). Then, thanks to Lemma I.20,
$$\mu(B_n) \to \mu(\{x \in E : |g_n(x)| \le \varepsilon \text{ ev.}\}) = \mu(E)$$
by assumption (i.e. that gn → 0 a.e.). Since, by the definition of Bn, Bn ⊆ An, we have that
$$\mu(E) \ge \mu(\{x : |g_n(x)| \le \varepsilon\}) = \mu(A_n) \ge \mu(B_n) \to \mu(E),$$
and so, µ({x : |gn(x)| ≤ ε}) → µ(E). Using the finiteness of µ(E) (and this is the only place this assumption is used), we finally have
$$\mu(\{x : |g_n(x)| > \varepsilon\}) = \mu(E) - \mu(\{x : |g_n(x)| \le \varepsilon\}) \to 0,$$
as required.
(2) Again, we denote gn := fn − f, and assume that gn → 0 in measure. Given a number k ≥ 1, $\mu(|g_n| > \frac{1}{k}) \to 0$ as $n \to \infty$, hence there exists a number nk sufficiently large, so that $\mu(|g_{n_k}| > \frac{1}{k}) < 2^{-k}$. We claim that the sequence {nk} thus obtained satisfies the postulated properties. Indeed,
$$\sum_{k=1}^{\infty} \mu(|g_{n_k}| > 1/k) \le \sum_{k=1}^{\infty} 2^{-k} < \infty,$$
hence, by the Borel–Cantelli lemma (1) (which is valid on general measure spaces, not only probability spaces),
$$\mu(|g_{n_k}| > 1/k \text{ i.o.}) = 0.$$
However, if, for some x ∈ E, $|g_{n_k}(x)| \le \frac{1}{k}$ for k sufficiently large, then $g_{n_k}(x) \to 0$, hence $E \setminus \{|g_{n_k}| > 1/k \text{ i.o.}\} \subseteq \{g_{n_k} \to 0\}$, and
$$\mu(\{g_{n_k} \to 0\}^c) \le \mu(|g_{n_k}| > 1/k \text{ i.o.}) = 0,$$
i.e. $g_{n_k} \to 0$ a.e.


III. Lebesgue theory of integration


III.1. Lebesgue integrals: definition and fundamental properties. It turns out that the Riemann definition of the integral of a function has several major flaws. Its basic idea was that the points of a (graph of a) function f : I → R, with I ⊆ R some interval, were divided according to the proximity of their x-coordinate. It had the unfortunate by-product that the definition of the integral is tailor-made for continuous functions (though not being exclusively defined for them). The breakthrough idea due to H. Lebesgue is that it is more beneficial to distribute the points according to their proximity w.r.t. their y-coordinate.
Let (E, E, µ) be a measure space. A simple function is a (finite) linear combination of measurable indicator functions with non-negative coefficients, i.e. a function of the form
$$f = \sum_{i=1}^{m} a_i \chi_{A_i}, \qquad (III.1)$$
where for m ≥ 1 one has 0 ≤ ai < ∞, and {Ai}_{1≤i≤m} ⊆ E. We denote the class of simple functions (III.1) by S(E), and the corresponding integral to be
$$\int f\,d\mu = \int_E f\,d\mu = \int_E f(x)\,\mu(dx) := \sum_{i=1}^{m} a_i \cdot \mu(A_i), \qquad (III.2)$$
under the convention 0 · ∞ := 0, in case some of the indicator sets Ai are of infinite measure (we can also allow for ai = ∞, under the same convention, i.e. f to attain the infinite value on a measure zero set, which we will ignore from this point on; but strictly speaking, f is allowed to attain +∞, and later on −∞, on measure 0 sets). We observe that the representation (III.1) of a simple function is not unique, but nevertheless, by the properties of µ, their integral is well-defined. In particular, for A ∈ E,
$$\int_E \chi_A\,d\mu = \mu(A),$$
which, in probabilistic language, states
$$E[\chi_A] = P(A),$$
with
$$E[\,\cdot\,] := \int \cdot\ dP$$
denoting the expectation of a random variable.
The following properties of the integral of simple functions are immediate from the definition.
Lemma III.1. For f, g ∈ S(E) and α, β ≥ 0, one has
(1) Linearity:
$$\int_E (\alpha f + \beta g)\,d\mu = \alpha \int_E f\,d\mu + \beta \int_E g\,d\mu.$$
(2) If for (a.e.) x ∈ E, f(x) ≤ g(x), then
$$\int_E f\,d\mu \le \int_E g\,d\mu.$$
(3) We have $\int_E f\,d\mu = 0$ if and only if f = 0 a.e. (recall that for f ∈ S(E) we require f ≥ 0).

The following definition extends the definition of the integral to measurable non-negative functions.
Definition III.2. Let f : E → R≥0 be a (E − B(R))-measurable function. The integral of f is
$$\int_E f\,d\mu := \sup\left\{ \int_E g\,d\mu : g \in S(\mathcal{E}),\ 0 \le g \le f \right\},$$
finite or infinite.
Thus, if f is measurable, it is approximated from below by simple functions, and the resulting "optimal" value of the area covered by those simple functions is the value of the integral of f. This is manifested by the following lemma (in combination with the follow-up theorem). Recall that for a sequence of functions fn, we write fn ↑ f (a.e.), if for every (resp. a.e.) x ∈ E, fn(x) is monotone increasing, and fn(x) → f(x).
Lemma III.3. Let f : E → R≥0 be a (E − B(R))-measurable function. Then there exists a sequence {fn} of (non-negative) simple functions, so that fn ↑ f as n → ∞.
Proof. Take the functions
$$f_n(x) = \begin{cases} n & f(x) \ge n \\ k \cdot 2^{-n} & f(x) \in \left[\frac{k}{2^n}, \frac{k+1}{2^n}\right),\ k = 0, 1, \dots, n \cdot 2^n - 1. \end{cases}$$
First, it is easy to check that fn(x) ↑ f(x). That the fn are simple follows from the measurability of f, which ensures that each set
$$A_{n,k}(f) := \left\{ x \in E : f(x) \in \left[\frac{k}{2^n}, \frac{k+1}{2^n}\right) \right\}$$
is measurable.


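The dyadic construction in the proof is easy to implement. The short Python sketch below (a numerical illustration only; the choice of the function f and of the grid is arbitrary) evaluates the approximants f_n of Lemma III.3 on a grid of points and checks that they increase monotonically towards f while staying below it.

```python
import numpy as np

def f_n(f_vals, n):
    """Dyadic approximation of Lemma III.3: values >= n are capped at n,
    values in [k/2^n, (k+1)/2^n) are rounded down to k/2^n."""
    return np.minimum(np.floor(f_vals * 2**n) / 2**n, n)

x = np.linspace(0, 10, 1001)
f = np.sqrt(x)                         # an arbitrary nonnegative measurable function

prev = np.zeros_like(f)
for n in range(1, 8):
    cur = f_n(f, n)
    assert np.all(cur >= prev)         # monotonicity: f_n <= f_{n+1}
    assert np.all(cur <= f)            # f_n <= f
    prev = cur
print("max error at n = 7:", np.max(f - prev))   # bounded by 2^{-7} here
```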
The following theorem shows, in particular, that the approximation quality of the simple functions as in Lemma III.3 is sufficiently fine, so that the corresponding integrals are well-approximated. In general, it is important to find sufficient conditions under which convergence a.e. of a sequence of functions ensures the same for the integrals, i.e. we can exchange the limits and the integration. One such sufficient condition is given by the following Monotone Convergence Theorem, which we cite without proof.
Theorem III.4 (Monotone Convergence Theorem). Let (E, E, µ) be a measure space, {fn}n≥1 be a sequence of measurable non-negative functions fn : E → R, and f : E → R a measurable non-negative function, so that fn ↑ f a.e. on E. Then
$$\int_E f_n\,d\mu \uparrow \int_E f\,d\mu.$$

We now aim to extend the notion of integral to arbitrary measurable functions, thus no longer
imposing the nonnegativity. Given a measurable function f : E → R, we decompose f into its
positive and negative parts, f + and f − respectively, in the following way
Definition III.5 (Positive and negative parts of a measurable function). Let f : E → R be a
measurable function. Then
$$f^+(x) := \max(f(x), 0), \qquad f^-(x) := -\min(f(x), 0).$$
Note that both f ± ≥ 0 are nonnegative and measurable, and, further, we have
|f | = f + + f − ,
and
f = f + − f −.
It can be easily seen that $\int_E |f|\,d\mu < \infty$ if and only if both $\int_E f^+\,d\mu < \infty$ and $\int_E f^-\,d\mu < \infty$, whence
$$\int_E |f|\,d\mu = \int_E f^+\,d\mu + \int_E f^-\,d\mu.$$
Definition III.6 (Integral of arbitrary measurable functions). A measurable function f : E → R is integrable if both $\int_E f^+\,d\mu < \infty$ and $\int_E f^-\,d\mu < \infty$ in the sense of Definition III.2. In this case the integral of f is
$$\int_E f\,d\mu := \int_E f^+\,d\mu - \int_E f^-\,d\mu.$$
Within the context of a probability space (Ω, F, P) and a random variable X : Ω → R, the integral is referred to as the expectation
$$E[X] := \int_\Omega X\,dP = \int_\Omega X(\omega)\,dP(\omega).$$

The following lemma extends the basic properties of the integral from simple functions to arbitrary
ones (cf. Lemma III.1).
Lemma III.7. For f, g integrable functions, and α, β ∈ R, one has:
(1) Linearity: Z Z Z
(α · f + β · g)dµ = α f dµ + β gdµ.
E E E
(2) If for (a.e.) x ∈ E, f (x) ≤ g(x) then
Z Z
f (x)dµ ≤ g(x)dx.
E E
R
(3) If f = 0 a.e., then f dµ = 0.
E R
(4) Conversely, if f ≥ 0 and f dµ = 0, then f = 0 a.e.
E

Proof. First we prove the statements under the extra assumption that f, g ≥ 0 are nonnegative, and α, β ≥ 0.
(1) We apply Lemma III.3 on f, g to yield two sequences fn, gn ∈ S(E) of simple functions so that fn ↑ f and gn ↑ g. Then, under our assumptions, αfn + βgn ↑ αf + βg, and
$$\int (\alpha f_n + \beta g_n)\,d\mu = \alpha \int f_n\,d\mu + \beta \int g_n\,d\mu,$$
whence the result follows upon an application of the Monotone Convergence Theorem III.4 on both the l.h.s. and the r.h.s.
(2) Follows directly from the definition of the integral of nonnegative functions.
(3) If f = 0 a.e., then f ∈ S(E), and $\int f\,d\mu = 0$ by the definition.
(4) Let fn ↑ f be the sequence of simple functions prescribed by Lemma III.3. Then, since fn ≤ f, it forces $\int_E f_n\,d\mu = 0$ by (2) of this lemma, and then fn = 0 a.e. by Lemma III.1(3), so that f = 0 a.e.
For general f, g, decompose f = f^+ − f^- and g = g^+ − g^-, and apply the same to the nonnegative functions f^±, g^± separately.
Example III.8. Let f : R → R be defined by $f(x) = \frac{\sin x}{x}$, with R equipped with the Borel σ-algebra and the Borel measure, and we are interested in whether f is integrable on [0, ∞). It is possible to
check that both $\int_0^\infty f^+\,d\mu(x)$ and $\int_0^\infty f^-\,d\mu(x)$ are infinite, so f is not integrable. However, the Riemann integral $\int_0^\infty \frac{\sin x}{x}\,dx$ makes sense as an improper integral.

Definition III.9 (Dirac delta). Let (E, E) be a measurable space, and x0 ∈ E an arbitrary point. We denote by δx0 : E → {0, 1} the measure defined by
$$\delta_{x_0}(A) = \begin{cases} 1 & x_0 \in A \\ 0 & x_0 \notin A \end{cases};$$
it is usually called the "Dirac delta measure" at x0, or "Dirac delta function" at x0, even though it is not a function.
The Dirac delta measure at x0 is the probability distribution of the constant random variable X : Ω → E, X(ω) = x0, so that P(X = x0) = 1. In the important case (E, E) = (R, B(R)), the corresponding distribution function is
$$F_X(x) = P(X \le x) = \begin{cases} 1 & x \ge x_0 \\ 0 & x < x_0 \end{cases}.$$
Note that δx0 is not the indicator function of the singleton {x0}, since the former is a measure (so, a set function), whereas the latter is a function. Rather, intuitively, δx0 can be thought of as a function with an infinitesimally thin, infinitely high peak at x0, of unit mass, vanishing outside of x0. Using the Dirac delta notation it is convenient to represent discrete distribution measures.
Example III.10. Let (Ω, F, P) = ({0, 1}, P({0, 1}), P) with
$$P = \frac{1}{2}\delta_0 + \frac{1}{2}\delta_1;$$
then
$$P(A) = \begin{cases} 0 & A = \emptyset \\ \frac{1}{2} & A = \{0\} \text{ or } A = \{1\} \\ 1 & A = \{0, 1\} \end{cases},$$
corresponding to a single toss of a fair coin.
Example III.11. Binomial Bin(n, p) distribution. Then the probability measure is
$$P = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} \cdot \delta_k : \mathcal{P}(\{0, 1, \dots, n\}) \to [0, 1],$$
endowing a sample point k, 0 ≤ k ≤ n, with the probability $\binom{n}{k} p^k (1-p)^{n-k}$. For example, for A = {0, 1},
$$P(A) = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} \cdot \delta_k(A) = (1-p)^n + np(1-p)^{n-1},$$
since $\delta_k(A) = 1$ for k = 0, 1 and $\delta_k(A) = 0$ otherwise.
Example III.12. Poisson distribution Pois(λ): the probability measure is
$$P = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} \cdot \delta_k \qquad (III.3)$$
on Ω = Z≥0 , F = P(Z≥0 ).
Let us now consider the integral of some functions (expectation of random variables).
Example III.13. Let µ = δy for some y ∈ E. Then if g : E → R is a simple function of the form $g = \sum_{i=1}^{m} a_i \chi_{A_i}$, then, by the definition (III.2) of the integral in this case,
$$\int_E g\,d\delta_y = \sum_{i=1}^{m} a_i \cdot \delta_y(A_i) = \sum_{i=1}^{m} a_i \cdot \chi_{A_i}(y) = g(y).$$
For nonnegative measurable functions f : E → [0, ∞), we may take an increasing sequence fn ↑ f of simple functions, so that it follows from the Monotone Convergence theorem that
$$\int_E f\,d\delta_y = \lim_{n\to\infty} \int_E f_n\,d\delta_y = \lim_{n\to\infty} f_n(y) = f(y).$$
Finally, if f is an arbitrary measurable function f : E → R, then we decompose f = f^+ − f^-, apply the above result on f^±, and deduce that $\int_E f\,d\delta_y = f(y)$, which is consistent with our intuition that P(X = y) = 1, so that E[f(X)] = f(y).
Example III.14. Poisson probability measure
$$P = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} \delta_k.$$
Then, formally exchanging the order of summation and integration, one has
$$\int_E f\,dP = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} \int_E f\,d\delta_k = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} \cdot f(k), \qquad (III.4)$$
upon using the result $\int_E f\,d\delta_k = f(k)$ obtained in Example III.13; this is the well-known formula yielding the expectation E[f(X)], where X is a Pois(λ)-distributed random variable. To rigorously justify (III.4), one needs to repeat the lines of the proof within Example III.13.

Example III.15. If Ω is countable, e.g. Ω = Z, then we may write any probability measure P on Z as $P = \sum_{k \in \mathbb{Z}} p_k \delta_k$ with $p_k = P(\{k\})$ and $\sum_{k \in \mathbb{Z}} p_k = 1$ (no matter what F is). Conversely, any such collection {pk}k∈Z defines a probability measure on Z. Then, for measurable f : Z → R,
$$\int_{\mathbb{Z}} f\,dP = \sum_{k \in \mathbb{Z}} p_k \cdot f(k).$$


Example III.16. Let (E, E) = (N, P(N)), and µ the counting measure $\mu = \sum_{k=1}^{\infty} \delta_k$ (so that, in particular, µ(N) = ∞, and the same for any infinite set). Then for f : N → R,
$$\int f\,d\mu = \sum_{k=1}^{\infty} \int f\,d\delta_k = \sum_{k=1}^{\infty} f(k).$$
For example, if $f(k) = \frac{(-1)^k}{k}$,
$$\int_{\mathbb{N}} (f^+ + f^-)\,d\mu = \sum_{k=1}^{\infty} \frac{1}{k} = \infty,$$
so in the Lebesgue sense, f is not integrable w.r.t. µ, despite the fact that, as an alternating series, it converges conditionally:
$$\sum_{k=1}^{\infty} \frac{(-1)^k}{k} = -\log 2,$$
since its positive part partially cancels the negative one, which is forbidden by the Lebesgue theory (cf. Example III.8).
In all previous examples, the integrals over discrete domains reduced to a summation (with weights
according to the probability measure P ). The simple functions and monotone convergence were
invoked to evaluate the integral of arbitrary measurable functions.
Let us compare the Riemann integrals and the Lebesgue integrals on R. The Riemann integral allows for integration on intervals. However, whenever we are to Lebesgue integrate a function f : E → R over a measurable subset B ∈ E (as opposed to E), we may restrict the domain by multiplying by the characteristic function:
$$\int_B f\,d\mu := \int_E f \cdot \chi_B\,d\mu.$$
Whenever E = R equipped with the Borel σ-algebra and the Borel measure, we usually write $\int f\,dx$ in place of $\int f\,d\mu$. To support this notation there is the following relation between the Riemann and the Lebesgue integrals (whose proof is outside the scope of this module):
Theorem III.17. (1) If f : [a, b] → R is Riemann integrable and measurable, then f is also Lebesgue integrable, and
$$\int_a^b f(x)\,dx = \int_{[a,b]} f(x)\,\mu(dx),$$
where the l.h.s. and the r.h.s. are the Riemann integral and the Lebesgue integral respectively.
(2) A function f : [a, b] → R is Riemann integrable, if and only if f is bounded, and the set of points of discontinuity of f is of Lebesgue measure 0. (Lebesgue measure extends the Borel measure to a bigger σ-algebra than B(R).)
For example, the Dirichlet function χQ : R → R is not Riemann integrable (being discontinuous
everywhere), but easily Lebesgue integrable, with integral 0. On the other hand, an improper
Riemann integral might exist, even though the function is not Lebesgue integrable, cf. Example
III.8.
Recall that the Monotone Convergence Theorem III.4 allows to switch order of limit and integral
for a sequence of measurable functions fn ↑ f , where the monotonicity assumption is essential. The
following theorem, whose proof is outside the scope of this module, relaxes this condition.
Theorem III.18 (Dominated Convergence theorem). Let {fn}n≥1, f : E → R be measurable functions, so that fn(x) → f(x) a.e. on E. Suppose that there exists an integrable function g : E → R≥0, so that for all n ≥ 1, x ∈ E, |fn(x)| ≤ g(x). Then the fn and f are integrable, and
$$\int_E f_n\,d\mu \to \int_E f\,d\mu.$$

The function g in Theorem III.18 is called the “dominating function”. It is important to stress that
what has to be dominated is the integrand fn , and not the outcome of the integration. In particular,
the conditions of Theorem III.18 hold if µ(E) < ∞ and fn are all uniformly bounded, i.e. there
exists a K > 0 so that for all n ≥ 1, x ∈ E, |fn | ≤ K (since in this case g ≡ K is integrable, so could
be chosen as a dominating function).
Example III.19. Let fn : [0, ∞) → R with fn(x) = e^{−nx}. Then fn → 0 a.e. (outside of x = 0), and for all n ≥ 1, fn(x) ≤ f1(x) = e^{−x}, with $\int_0^\infty e^{-x}\,dx = 1 < \infty$. Hence the dominated convergence yields
$$\int_0^\infty e^{-nx}\,dx \to \int_0^\infty 0\,dx = 0,$$
something that can be checked explicitly.
Example III.20. Let fn : [0, π] → R, $f_n(x) = \sin\!\left(x + \frac{e^{\cos x}}{n}\right)$. Then fn(x) → sin(x) on x ∈ [0, π], and |fn(x)| ≤ 1 with µ([0, π]) < ∞, so
$$\int_0^\pi f_n(x)\,dx \to \int_0^\pi \sin(x)\,dx = 2.$$

Example III.21. Let (Ω, F, P) be a probability space, and An ⊆ Ω a sequence of events, so that, a.s., χAn → χA for some event A ∈ F, i.e. χAn(ω) → χA(ω) for ω ∈ Ω outside of a probability zero set. Then, since |χAn| ≤ 1 (and in light of P(Ω) = 1 < ∞),
$$P(A_n) = \int_\Omega \chi_{A_n}\,dP \to \int_\Omega \chi_A\,dP = P(A).$$

Example III.22. Take on (R, B(R), µ) the functions $f_n = n^2 \cdot \chi_{(0,1/n)}$. Then fn → 0, but $\int_{\mathbb{R}} f_n\,d\mu = n \to \infty$, showing that the domination assumption, violated in this example, is essential for the application of the Dominated Convergence theorem.
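A quick numerical sanity check of the last two phenomena (an illustrative sketch only, using crude numerical quadrature on a truncated domain rather than genuine Lebesgue integration): the dominated family of Example III.19 has integrals tending to 0, whereas the undominated family of Example III.22 has integrals blowing up even though f_n → 0 pointwise.

```python
import numpy as np

def integral(f, a, b, m=200_000):
    """Crude midpoint-rule approximation of the integral of f over [a, b]."""
    x = np.linspace(a, b, m, endpoint=False) + (b - a) / (2 * m)
    return np.sum(f(x)) * (b - a) / m

for n in (1, 10, 100):
    dominated = integral(lambda x: np.exp(-n * x), 0, 50)          # Example III.19
    undominated = integral(lambda x: n**2 * (x < 1 / n), 0, 1)     # Example III.22
    print(n, round(dominated, 4), round(undominated, 1))
# The first column decays like 1/n, the second grows like n.
```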
Example III.23. Let (Ω, F, P) be an arbitrary probability space, and Xn : Ω → R any sequence of (a.s.) uniformly bounded random variables |Xn| ≤ M, so that a.s. Xn → X. Then, by the dominated convergence,
$$E[X_n] = \int_\Omega X_n(\omega)\,dP \to \int_\Omega X(\omega)\,dP = E[X].$$
For instance, for the Poisson distribution of Example III.14, if Xn : Z≥0 → Z are uniformly bounded, i.e. for some M > 0, for all n ≥ 1 and k ≥ 0, |Xn(k)| ≤ M, and for all k ≥ 0, Xn(k) → X(k), then
$$E[X_n] = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} X_n(k) \to \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} X(k) = E[X].$$
For future reference we will need to differentiate under the integral sign, w.r.t. a parameter.
Theorem III.24 (Differentiation under the integral sign). Let (E, E, µ) be a measure space and U ⊆ R open, and suppose that f : U × E → R is a function satisfying: (i) for all t ∈ U, the function x ↦ f(t, x) is integrable on E; (ii) for all x ∈ E, the function t ↦ f(t, x) is differentiable (on U); and (iii) there exists an integrable function g : E → R so that for all x ∈ E, t ∈ U,
$$\left| \frac{\partial f}{\partial t}(t, x) \right| \le g(x).$$
Then, for all t ∈ U, the function $x \mapsto \frac{\partial f}{\partial t}(t, x)$ is integrable, and
$$\frac{d}{dt} \int_E f(t, x)\,d\mu(x) = \int_E \frac{\partial f}{\partial t}(t, x)\,d\mu(x).$$

Proof. Denote
$$p(t) := \int_E f(t, x)\,d\mu(x).$$
Now fix t0 ∈ U, and let {hn}n≥1 be an arbitrary sequence of numbers, hn ≠ 0, so that hn → 0. Then
$$\frac{1}{h_n}\left(p(t_0 + h_n) - p(t_0)\right) = \int_E g_n(x)\,d\mu(x),$$
where $g_n(x) := \frac{f(t_0 + h_n, x) - f(t_0, x)}{h_n}$, so that, as n → ∞, $g_n(x) \to \frac{\partial f}{\partial t}(t_0, x)$, and, moreover, by the Mean Value theorem, for every x ∈ E, $g_n(x) = \frac{\partial f}{\partial t}(t_0 + h_n', x)$ with some $0 < |h_n'| < |h_n|$. Hence, by assumption, |gn(x)| ≤ g(x), and we may apply the Dominated Convergence theorem to yield
$$\frac{1}{h_n}\left(p(t_0 + h_n) - p(t_0)\right) = \int_E g_n(x)\,d\mu(x) \to \int_E \frac{\partial f}{\partial t}(t_0, x)\,d\mu(x).$$
Since the vanishing sequence hn is arbitrary, the above implies that the limit
$$\lim_{h\to 0} \frac{1}{h}\left(p(t_0 + h) - p(t_0)\right)$$
exists, and equals
$$\frac{d}{dt} \int_E f(t, x)\,d\mu(x)\Big|_{t=t_0} = \frac{d}{dt} p(t)\Big|_{t=t_0} = \int_E \frac{\partial f}{\partial t}(t_0, x)\,d\mu(x).$$


III.2. Density functions. One way to prescribe a distribution is via the density function, defined
below.
Proposition III.25. Let (E, E, µ) be a measure space, and f : E → R≥0 a non-negative measurable function. Define the set function ν : E → [0, ∞] as
$$\nu(A) = \int_E f \chi_A\,d\mu = \int_A f\,d\mu. \qquad (III.5)$$
Then ν is a measure on (E, E), and, further, for every g : E → R integrable,
$$\int_E g\,d\nu = \int_E f \cdot g\,d\mu. \qquad (III.6)$$

Proof. First, $\nu(\emptyset) = \int_E f \cdot \chi_\emptyset\,d\mu = 0$. Now, let {Ai}i≥1 be a sequence of disjoint sets in E. Then, by the definition (III.5), and since $\chi_{\bigcup_{i=1}^\infty A_i} = \sum_{i=1}^\infty \chi_{A_i}$,
$$\nu\left(\bigcup_{i=1}^{\infty} A_i\right) = \int_E f \cdot \chi_{\bigcup_{i=1}^\infty A_i}\,d\mu = \int_E f \cdot \sum_{i=1}^{\infty} \chi_{A_i}\,d\mu = \sum_{i=1}^{\infty} \int_E f \cdot \chi_{A_i}\,d\mu = \sum_{i=1}^{\infty} \nu(A_i),$$

where we justify the exchange of order of summation and integration using the Monotone Convergence
theorem (or Dominated Convergence theorem, with f the dominating function). That shows that ν
is a measure. m
P
For the second assertion (III.6) of Proposition III.25, let g = ai χAi ∈ S (E ) be a simple function.
i=1
Then, by the definition of the integral for simple functions, and (III.5),
Z m
X m
X Z Z
gdν = ai ν(Ai ) = ai f χAi dµ = f gdµ. (III.7)
E i=1 i=1 E E

For g : E → R≥0 nonnegative, let {gn }n≥1 be a sequence of simple functions so that gn ↑ g,
prescribed by Lemma III.4. Then, using (III.7) and monotone convergence (for both gn ↑ g and
gn f ↑ gf ), we write Z Z Z Z
gdν = lim gn dν = lim f gn dµ = f gdµ.
n→∞ n→∞
E E E E
Finally, for g arbitrary integrable, decompose g = g − g , and apply the assertion on g + and g −
+ −

separately.

Definition III.26 (Density function). If a measure ν on (E, E) is given by
$$\nu(A) = \int_E f \chi_A\,d\mu, \qquad (III.8)$$
A ∈ E, then the function f : E → R is called the density of ν w.r.t. µ. If $\int_E f\,d\mu = 1$, it is called the probability density (or probability density function, abbreviated pdf) of ν (w.r.t. µ).
We observe that if the values of the density f are amended on a set of µ-measure 0, then the defining property (III.8) is still satisfied; hence a density function is not unique. Below it will be asserted that the said degree of freedom is the only one, i.e. the density is unique up to µ-measure 0 sets (cf. Lemma III.35).
Recall that for a probability space (Ω, F, P) and a random variable X : Ω → R, the distribution of X is the probability measure PX : B(R) → [0, 1], given by
$$P_X(A) = P(X \in A) = P(X^{-1}(A)).$$
If PX admits a density fX w.r.t. a measure µ, then its defining property is
$$P(X \in A) = P_X(A) = \int_A f_X(x)\,\mu(dx) = \int_{\mathbb{R}} f_X \chi_A\,d\mu.$$
In case µ is the Borel measure, the function fX coincides with the usual notion of probability density function discussed earlier on. In this case, the distribution function of X is
$$F_X(x) := P(X \le x) = \int_{\mathbb{R}} f_X \chi_{(-\infty,x]}\,dx = \int_{-\infty}^{x} f_X(t)\,dt. \qquad (III.9)$$

Example III.27. The function $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is the probability density function of the standard Gaussian distribution. That is, for Z ∼ N(0, 1) and A ∈ B(R),
$$P(Z \in A) = P_Z(A) = \int_{\mathbb{R}} f(z) \chi_A\,dz = \int_A \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz.$$
Recall that FX : R → [0, 1] is right-continuous. If FX is continuous and, in addition, continuously differentiable except perhaps at a finite set of points, then, by the fundamental theorem of calculus,
$$F_X(x) = \int_{-\infty}^{x} F_X'(u)\,du, \qquad (III.10)$$
where $F_X'$ is defined arbitrarily wherever FX is not continuously differentiable. Comparing (III.9) to (III.10) shows that $F_X'$ is the density function of PX w.r.t. the Borel measure (the formula (III.10) extends to A ∈ B(R), since {(−∞, x] : x ∈ R} is a π-system generating B(R)).
Example III.28. Let FX : R → [0, 1] be the distribution function
$$F_X(x) = \begin{cases} 0 & x < 0 \\ x & 0 \le x \le 1 \\ 1 & x > 1 \end{cases}.$$
Then FX is continuously differentiable except for x = 0, 1, so it has the density $f_X(x) = F_X'(x) = \chi_{[0,1]}(x)$, i.e. corresponding to the U([0, 1]) distribution. The distribution is: for A ∈ B(R),
$$P_X(A) = P(X \in A) = \int \chi_{[0,1]} \chi_A\,dx = \mu([0, 1] \cap A).$$
Example III.29. Let FX : R → [0, 1] be the distribution function
$$F_X(x) = \begin{cases} 0 & x < 0 \\ 1 - e^{-x} & x \ge 0 \end{cases}.$$
The corresponding density is
$$f_X(x) = F_X'(x) = \begin{cases} 0 & x < 0 \\ e^{-x} & x \ge 0 \end{cases} = e^{-x} \chi_{[0,\infty)}(x),$$
and then
$$P(X \in A) = \int_A e^{-x} \chi_{[0,\infty)}(x)\,dx = \int_{A \cap [0,\infty)} e^{-x}\,dx.$$
Example III.30. Let FX be the distribution function with FX(x) = 0 for x < 0 and FX(x) = 1 for x ≥ 0, corresponding to the constant random variable P(X = 0) = 1. Then, since FX has a jump at x = 0, it does not admit a density, at least in the above sense. However, in a case with jumps, like this one, there is a way to endow the density with a proper meaning as a generalised function, in this case, as the Dirac δ0; this is outside the scope of this module.
Lemma III.31. Let (Ω, F , P ) be a probability space, X : Ω → R a random variable, and PX its
probability distribution (PX (·) = P (X −1 (·))). Then, for every measurable g : R → R, g(X) : Ω → R
is a random variable, and if, further, g(X) is integrable, then
$$E[g(X)] = \int_{\mathbb{R}} g(x)\,P_X(dx). \qquad (III.11)$$

Proof. First, g(X) is a random variable, since it is a composition of two measurable functions on (Ω, F). In what follows we prove the identity (III.11). First, if $g = \sum_{i=1}^{m} a_i \chi_{A_i} \in S(\mathcal{B}(\mathbb{R}))$ is a simple
function, then
$$g(X) = \sum_{i=1}^{m} a_i \chi_{X^{-1}(A_i)} \in S(\mathcal{F}).$$
We thereupon have
$$E[g(X)] = \sum_{i=1}^{m} a_i P(X^{-1}(A_i)) = \sum_{i=1}^{m} a_i P_X(A_i) = \int_{\mathbb{R}} g(x)\,P_X(dx),$$

by the definition of PX. We can then use our usual strategy of extending the result to nonnegative measurable functions using Lemma III.3 and monotone convergence, and then to all integrable g = g^+ − g^- by invoking the result for g^± separately, and linearity (see e.g. the proof of Proposition III.25). The details are left out.


Lemma III.31 shows that in order to evaluate the expectation of g(X), one can perform the computation directly in terms of the distribution of X, without invoking the "original" space Ω and the corresponding probability measure (though g(X) is a random variable on (Ω, F)). It is a generalisation of the usual transformation of coordinates
$$\int_a^b g(f(x)) \cdot f'(x)\,dx = \int_{f(a)}^{f(b)} g(y)\,dy,$$
valid under suitable conditions on f(·). If the probability distribution PX has a density fX w.r.t. µ (e.g. µ is the Borel measure on R), then one may combine Lemma III.31 with Proposition III.25 to yield the usual formula
$$E[g(X)] = \int_{\mathbb{R}} g(x)\,P_X(dx) = \int_{\mathbb{R}} g(x) \cdot f_X(x)\,dx. \qquad (III.12)$$

Example III.32. Let X be an exponential random variable with parameter 1, i.e. its probability density w.r.t. the Borel measure is $f_X(x) = e^{-x} \chi_{[0,\infty)}$. Then, by (III.12), for every g : R → R,
$$E[g(X)] = \int g(x) f_X(x)\,dx = \int_0^\infty g(x) e^{-x}\,dx,$$
provided that the latter integral makes sense.


Example III.33. Let U be a uniform random variable on (0, 1]. Then the corresponding density function w.r.t. the Borel measure is χ(0,1], so that
$$E[g(U)] = \int_0^1 g(u)\,du.$$

Example III.34. Let X ∼ Pois(1) be a Poisson random variable with parameter 1. Recall that its probability measure P is given by (III.3) with λ = 1 (see Example III.12). Let µ0 be a measure on the same measure space (Z≥0, P(Z≥0)) given by $\mu_0 = \sum_{k=0}^{\infty} \delta_k$. Then P has the density function
$f_X(k) = \frac{e^{-1}}{k!}$. That is, for every A ∈ P(Z≥0),
$$P(X \in A) = \int_{\mathbb{Z}_{\ge 0}} f_X \chi_A\,d\mu_0 = \sum_{k \in A} f_X(k) = \sum_{k \in A} \frac{e^{-1}}{k!}.$$
For every g : Z≥0 → R such that E[g(X)] < ∞ (equivalently, the summation on the r.h.s. below is finite),
$$E[g(X)] = \int_{\mathbb{Z}_{\ge 0}} f_X g\,d\mu_0 = \sum_{k=0}^{\infty} f_X(k) \cdot g(k) = \sum_{k=0}^{\infty} \frac{e^{-1}}{k!} \cdot g(k).$$

Now we assert the aforementioned uniqueness of the density function, up to sets of µ-measure 0.
Lemma III.35. Let (E, E ) be a measurable space, and µ and ν two measures on (E, E ). Then if f
and g are two densities of ν w.r.t. µ, then f = g µ-a.e.
Proof. Recall the defining property (III.8) of a density, satisfied by both f and g in place of f. Denote A = {x ∈ E : f(x) > g(x)}; applying (III.8) with this given A (with f or with g) yields that
$$\nu(A) = \int f \cdot \chi_{f>g}\,d\mu = \int g \cdot \chi_{f>g}\,d\mu.$$
Using the linearity of the integral, we obtain
$$\int (f - g) \cdot \chi_{f>g}\,d\mu = 0.$$
Then, by Lemma III.7(4), and in light of the fact that the integrand $(f - g) \cdot \chi_{f>g} \ge 0$ is nonnegative, this forces $(f - g) \cdot \chi_{f>g} = 0$ outside of a set of µ-measure 0, i.e. f ≤ g µ-a.e. An identical argument gives that g ≤ f µ-a.e., so all in all, f = g µ-a.e.

Example III.36. The functions χ(0,1] , χ(0,1) and χ[0,1] are equal outside of sets of 0 Borel measure,
therefore are probability densities corresponding to the same distribution. Hence the uniform random
variables U((0, 1]), U((0, 1)) and U([0, 1]) have the same distribution given by PU (A) = P (U ∈ A) =
µ((0, 1) ∩ A).
Example III.37. Recall that the exponential distribution has probability density function $f_X(x) = e^{-x} \cdot \chi_{[0,\infty)}(x)$. As an alternative, one can also take $\widetilde{f}_X(x) = e^{-x} \cdot \chi_{[0,\infty)}(x) \cdot \chi_{\mathbb{R}\setminus\mathbb{Q}}(x)$, since $f_X = \widetilde{f}_X$ a.e. w.r.t. the Borel measure.
Example III.38. Let X ∼ Pois(λ) with λ > 0, Y ∼ U((0, 1]) and N ∼ Bin(1, 1/2) be three independent random variables. Set a new random variable
$$Z = \begin{cases} X & N = 0 \\ Y & N = 1 \end{cases},$$
i.e. Z is Pois(λ) or U((0, 1]) with probability 1/2 each. Its distribution is
$$P_Z = \frac{1}{2} P_X + \frac{1}{2} P_Y,$$
where PX and PY are the distributions of X (discrete) and Y (continuous) respectively, hence the distribution of Z is neither continuous nor discrete. Recall that $\mu_0 = \sum_{k=0}^{\infty} \delta_k$ is as in Example III.34,
and using the densities of X and Y, for A ∈ B(R),
$$P_Z(A) = P(Z \in A) = \frac{1}{2} P(X \in A) + \frac{1}{2} P(Y \in A) = \frac{1}{2} \int f_X \chi_A\,d\mu_0 + \frac{1}{2} \int f_Y \chi_A\,dx = \frac{1}{2} \sum_{k \in \mathbb{Z}_{\ge 0} \cap A} \frac{\lambda^k}{k!} e^{-\lambda} + \frac{1}{2} \int_{(0,1] \cap A} dx.$$
It can also be seen from the above that Z has no probability density function w.r.t. either µ0 or the Borel measure.
In general, a measure ν has a density w.r.t. another measure µ on the same measurable space (E, E) if and only if ν is absolutely continuous w.r.t. µ, but this is beyond the scope of this course (see the Radon–Nikodym Theorem).
III.3. Transformation of random variables. In this section we are concerned with what happens to the density function of a random variable X : Ω → R as X is transformed to Y = φ(X) : Ω → R for some measurable φ : R → R. In this case Y is a random variable, and its distribution is
$$P_Y(A) = P(Y \in A) = P(\varphi(X) \in A) = P(X \in \varphi^{-1}(A)) = P_X(\varphi^{-1}(A)),$$
A ∈ B(R). The distribution function of Y is then
$$F_Y(t) = P(Y \le t) = P(\varphi(X) \le t) = P(X \in \varphi^{-1}((-\infty, t])) = P_X(\varphi^{-1}((-\infty, t])); \qquad (III.13)$$
in general it cannot be expressed in terms of FX only; it depends on the properties of φ, and on what φ^{-1}((−∞, t]) is.
Suppose that φ is continuous and strictly increasing, i.e. for x < y, φ(x) < φ(y). Then, in
particular, φ is injective, and φ−1 : φ(R) → R is a continuous and strictly increasing injection on the
image of φ. Therefore, for t ∈ φ(R),
φ−1 ((−∞, t]) = (−∞, φ−1 (t)],
(mind that the φ−1 (·) on the l.h.s. is a pre-image, whereas φ−1 (·) on the r.h.s. is the inverse function).
In this light, if φ is continuous and strictly increasing, then, for t ∈ φ(R) in the image of φ, (III.13) reads
$$F_Y(t) = P_X(\varphi^{-1}((-\infty, t])) = P_X((-\infty, \varphi^{-1}(t)]) = F_X(\varphi^{-1}(t)). \qquad (III.14)$$
Assuming that FX is continuously differentiable with density $f_X = F_X'$, and, further, that φ is differentiable at φ^{-1}(t), we have that
$$f_Y(t) = F_Y'(t) = \frac{f_X(\varphi^{-1}(t))}{\varphi'(x)|_{x=\varphi^{-1}(t)}}, \qquad (III.15)$$
where we used
$$\frac{d}{dt} \varphi^{-1}(t) = \frac{1}{\varphi'(x)|_{x=\varphi^{-1}(t)}}.$$
In case φ is not injective, the r.h.s. of (III.15) should be replaced by a summation over x ∈ φ−1 (t)
of expressions of this type, under suitable assumptions on φ (and FX ).
Example III.39. Let U ∼ U((0, 1]), and Y = e^U (i.e. φ(u) = e^u). Then Y ∈ (1, e] a.s., since U ∈ (0, 1] a.s. Here φ(u) is strictly increasing, so that in this case both (III.14) and (III.15) are valid. We have directly
$$F_Y(t) = P(e^U \le t) = P(U \le \log t) = \log t$$
on t ∈ (1, e], and
$$f_Y(t) = F_Y'(t) = \frac{1}{t} \cdot \chi_{(1,e]}(t).$$
On the other hand, using (III.14) and (III.15), we obtain
$$F_Y(t) = F_U(\varphi^{-1}(t)) = F_U(\log t) = \log t,$$
and
$$f_Y(t) = \frac{F_U'(\varphi^{-1}(t))}{\varphi'(u)|_{u=\varphi^{-1}(t)}} = \frac{\chi_{(0,1]}(\log t)}{e^{\log t}} = \frac{1}{t} \cdot \chi_{(1,e]}(t), \qquad (III.16)$$
consistent with our direct computations.
Example III.40. Let U ∼ U((0, 1]) as above, and Y = − log U. Then, since U ∈ (0, 1] a.s., we have that Y ∈ [0, ∞) a.s. Mind that ψ(u) = − log u is strictly decreasing, so (III.14) (and (III.15)) are not, strictly speaking, applicable without proper adjustments. Since $\psi^{-1}(t) = e^{-t}$ is also strictly decreasing on t ∈ [0, ∞), we have that
$$F_Y(t) = P(\psi(U) \le t) = P(U \ge \psi^{-1}(t)) = P(U \ge e^{-t}) = 1 - P(U < e^{-t}) = 1 - e^{-t},$$
and differentiation gives
$$f_Y(t) = F_Y'(t) = e^{-t} \cdot \chi_{[0,\infty)}(t).$$
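These transformation formulas are easy to check by simulation. The Python sketch below (illustrative only; the sample size is an arbitrary choice) draws U ∼ U((0, 1]), forms Y = −log U, and compares the empirical distribution function with the derived F_Y(t) = 1 − e^{−t}.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(0.0, 1.0, size=1_000_000)
u = np.where(u == 0.0, 1.0, u)          # force U into (0, 1]
y = -np.log(u)                          # Y = -log U, claimed to be Exp(1)

for t in (0.5, 1.0, 2.0):
    empirical = np.mean(y <= t)         # empirical distribution function at t
    exact = 1 - np.exp(-t)              # F_Y(t) = 1 - e^{-t}
    print(t, round(empirical, 4), round(exact, 4))
```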
III.4. Product measures. In this section we deal with the following situation. Let (E1 , E1 , µ1 )
and (E2 , E2 , µ2 ) be two measure spaces. We are interested in constructing a measure on E1 × E2 ,
denoted µ = µ1 ⊗ µ2 so that for all A1 ∈ E1 and A2 ∈ E2 , (µ1 ⊗ µ2 )(A1 × A2 ) = µ1 (A1 ) · µ2 (A2 )
(recall that A1 × A2 = {(a1 , a2 ) : a1 ∈ A1 , a2 ∈ A2 } is the Cartesian product of A1 , A2 ). If µ1 ,
µ2 are the distributions of random variables X1 and X2 , then the new measure µ1 ⊗ µ2 is supposed
to represent the joint distribution of (X1 , X2 ) (and be defined on an appropriate σ-algebra), under
the independence assumption on X1, X2. Of course, one can use the abstract Theorem II.13, but the following ad hoc method is a concrete way to construct such a measure.
Definition III.41 (σ-finite measures). A measure µ on a measurable space (E, E) is σ-finite, if E is a countable union $E = \bigcup_{n=1}^{\infty} E_n$ of sets of finite measure, i.e. for every n ≥ 1, $E_n \in \mathcal{E}$ and $\mu(E_n) < \infty$.

Example III.42. Any measure space (E, E , µ) so that µ(E) < ∞ is, in particular, σ-finite. That
includes all the probability spaces.
Example III.43. The Borel measure on (R, B(R)) is σ-finite, since
$$\mathbb{R} = \bigcup_{n=1}^{\infty} [-n, n]$$
(say).

Example III.44. Consider the measure space (N, P(N), µ0), with $\mu_0 = \sum_{n=1}^{\infty} \delta_n$ the counting measure,
$$\mu_0(A) = \begin{cases} |A| & A \text{ finite} \\ \infty & A \text{ infinite} \end{cases}.$$
Then µ0 is σ-finite, since $\mathbb{N} = \bigcup_{n=1}^{\infty} \{n\}$.

Example III.45. On the other hand, consider the measure space (R, B(R), ν), with
$$\nu(A) = \begin{cases} |A| & A \text{ finite} \\ \infty & A \text{ infinite} \end{cases}$$
the counting measure, this time the set R being uncountable. Then ν is not σ-finite, since otherwise $\mathbb{R} = \bigcup_{i=1}^{\infty} E_i$ would be a countable union of finite sets Ei, contradicting R being uncountable.

Now let (E1 , E1 , µ1 ) and (E2 , E2 , µ2 ) be two σ-finite measure spaces (for example, probability
spaces). It is easy to check that the collection
A := {A1 × A2 : A1 ∈ E1 , A2 ∈ E2 }
is a π-system of subsets of E1 × E2 . We define the product σ-algebra, denoted E1 ⊗ E2 to be the
σ-algebra generated by A , i.e.
E1 ⊗ E2 := σ(A ).
We now define the measure µ = µ1 ⊗ µ2 by prescribing it on the elements of A to be given by the
product of measures µ(A1 × A2 ) = µ1 (A1 ) · µ2 (A2 ). The following theorem, whose proof is outside
of the scope of this course, asserts the existence and the uniqueness of such a measure under the
σ-finiteness assumption on both µ1 and µ2 . While the existence of such a measure µ = µ1 ⊗ µ2
does not require such a σ-finiteness assumption, for its uniqueness the σ-finiteness of both µ1 , µ2
is essential. Note that the uniqueness does not follow in this case from Theorem I.13, since the
finiteness of µ(E1 × E2 ) is not assumed.
Theorem III.46. Let µ1 and µ2 be two σ-finite measures on the measurable spaces (E1 , E1 ) and
(E2 , E2 ) respectively. Then there exists a unique measure µ = µ1 ⊗ µ2 on (E1 × E2 , E1 ⊗ E2 ) so that
for all A1 ∈ E1 , A2 ∈ E2 ,
µ(A1 × A2 ) = µ1 (A1 ) · µ2 (A2 ).
Example III.47. Taking (E1, E1, µ1) = (E2, E2, µ2) = (R, B(R), µB) with µ1 = µ2 = µB the Borel measure on R, the resulting σ-algebra B(R²) = B(R) ⊗ B(R) on R² is generated by either of the π-systems {(−∞, s] × (−∞, t] : s, t ∈ R} or {(a, b] × (c, d] : a < b, c < d}. The product measure µ = µB ⊗ µB on (R², B(R²)), uniquely prescribed by
$$\mu((a, b] \times (c, d]) = (b - a) \cdot (d - c),$$
is the Borel measure on R². It is also possible to define inductively the Borel measure on Rⁿ and the Borel σ-algebra B(Rⁿ) for all n ≥ 1.


Using the Borel measure on Rn (Example III.47), it is possible to define the Lebesgue integration on
Rn without performing any extra work. Fortunately, our definitions were developed for any abstract
measure space, and did not depend on the particularities of the Borel measure on R.
Let X and Y be real-valued random variables with distributions PX and PY respectively. Then,
by Theorem III.46, there exists a unique probability distribution µ on (R2 , B(R2 )) = (R × R, B(R) ⊗
B(R)), such that
µ(A1 × A2 ) = PX (A1 ) · PY (A2 )
for all A1, A2 ∈ B(R). Thus, under the joint distribution µ of (X, Y), the random variables X and Y are independent. In this way, products of measures allow for the construction of probability spaces carrying independent random variables.
The following result is a generalization of Fubini’s theorem from standard calculus (relating be-
tween multiple and repeated integrals), valid for abstract measure spaces.
Theorem III.48 (Fubini’s theorem). Let µ1 , µ2 be σ-finite measures on measurable spaces (E1 , E1 )
and (E2 , E2 ) respectively, and µ = µ1 ⊗ µ2 .
(1) If f : E1 × E2 → [0, ∞) is nonnegative and E1 ⊗ E2-measurable, then
$$\int_{E_1 \times E_2} f\,d\mu = \int_{E_1} \int_{E_2} f(x_1, x_2)\,\mu_2(dx_2)\,\mu_1(dx_1) = \int_{E_2} \int_{E_1} f(x_1, x_2)\,\mu_1(dx_1)\,\mu_2(dx_2), \qquad (III.17)$$
equality understood in the finite or infinite sense (i.e. if one of the three integrals is infinite, then so are the other two).
(2) Let f : E1 × E2 → R be E1 ⊗ E2-measurable. Then if one of the integrals
$$\int_{E_1 \times E_2} |f|\,d\mu; \qquad \int_{E_1} \int_{E_2} |f(x_1, x_2)|\,\mu_2(dx_2)\,\mu_1(dx_1); \qquad \int_{E_2} \int_{E_1} |f(x_1, x_2)|\,\mu_1(dx_1)\,\mu_2(dx_2)$$
is finite, then so are the other two, f is integrable, and the equality (III.17) holds.
The l.h.s. of (III.17) is called the "double integral", whereas the other two integrals are "repeated integrals" (in two different orders). Oftentimes it is easier to test the finiteness of the repeated integrals than that of the double integral. Without the σ-finiteness condition, Theorem III.48 fails decisively (see counter-examples in the home assignment).
The integrability condition in Theorem III.48(2) is that $\int |f|\,d\mu < \infty$; unlike what is asserted in (1) for nonnegative functions, if the integrability condition fails for some f, without the nonnegativity assumption, some of the integrals in (III.17) might be finite while the equality (III.17) fails decisively (or some are finite and others are infinite).
Example III.49. Let I be the important Gaussian integral $I = \int_{\mathbb{R}} e^{-x^2/2}\,dx$. Then, by Fubini, and working in polar coordinates,
$$I^2 = \int_{\mathbb{R}^2} e^{-(x^2+y^2)/2}\,dx\,dy = \int_0^\infty \int_0^{2\pi} r e^{-r^2/2}\,d\theta\,dr = 2\pi \left[ -e^{-r^2/2} \right]_0^\infty = 2\pi,$$
so that $I = \sqrt{2\pi}$.
Example III.50. If (E, E, µ) is a σ-finite measure space, and f : E → [0, ∞) is E-measurable and nonnegative, then, observing the obvious identity $f(x) = \int_0^{f(x)} dt$, we change the order of the integration:
$$\int_E f\,d\mu = \int_E \int_0^{f(x)} dt\,d\mu(x) = \int_0^\infty \int_{\{x :\, f(x) \ge t\}} d\mu(x)\,dt = \int_0^\infty \mu(\{x : f(x) \ge t\})\,dt.$$
For instance, if X is a nonnegative random variable with distribution PX, then this recovers the useful formula (with f(x) = x):
$$E[X] = \int x\,P_X(dx) = \int_0^\infty P_X(\{x : x \ge t\})\,dt = \int_0^\infty P(X \ge t)\,dt,$$
which, if the distribution function FX is continuous, can be expressed in terms of FX only.
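As a small numerical illustration (a sketch only, with an arbitrary choice of distribution): for X ∼ Exp(1) one has E[X] = 1, and the "layer-cake" integral of the tail probability P(X ≥ t) recovers the same value when the tail is replaced by an empirical estimate from samples.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(1.0, size=100_000)          # X ~ Exp(1), so E[X] = 1

t = np.linspace(0, 20, 2001)
tail = np.array([np.mean(x >= s) for s in t])   # empirical P(X >= t)
# trapezoid rule for the integral of the tail over [0, 20]
layer_cake = np.sum((tail[1:] + tail[:-1]) / 2 * np.diff(t))

print(round(np.mean(x), 4), round(layer_cake, 4))   # both close to 1
```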
Example III.51. If fn : R → R is a sequence of (Borel-)measurable functions, so that
$$\sum_{n=1}^{\infty} \int_{\mathbb{R}} |f_n(x)|\,dx < \infty,$$
then
$$\int_{\mathbb{R}} \sum_{n=1}^{\infty} f_n(x)\,dx = \sum_{n=1}^{\infty} \int_{\mathbb{R}} f_n(x)\,dx.$$
This result follows from applying Fubini's theorem equating the repeated integrals w.r.t. the measure spaces (R, B(R), µ) with µ the Borel measure and (N, P(N), µ0) with $\mu_0 = \sum_{n=1}^{\infty} \delta_n$ the counting
measure, and f(x, n) := fn(x). Here, by the definition, $\int g\,d\mu = \int_{\mathbb{R}} g(x)\,dx$, whereas $\int h\,d\mu_0 = \sum_{n=1}^{\infty} h(n)$, whenever these make sense.

Proposition III.52. Let X1, . . . , Xn : Ω → R be random variables on a probability space (Ω, F, P). Set, as above, B(Rⁿ) := B(R) ⊗ . . . ⊗ B(R), and define X : Ω → Rⁿ by X(ω) := (X1(ω), . . . , Xn(ω)). Then X is B(Rⁿ)-measurable, and, further, the following are equivalent:
a. The random variables X1, . . . , Xn are independent.
b. PX = PX1 ⊗ . . . ⊗ PXn.
c. For all bounded measurable functions f1, . . . , fn : R → R,
$$E\left[ \prod_{i=1}^{n} f_i(X_i) \right] = \prod_{i=1}^{n} E[f_i(X_i)].$$

Proof. (a) ⇒ (b) : Let A be the π-system

A = {A1 × . . . × An : A1 , . . . An ∈ B(R)},

generating B(Rn ), by the definition of B(Rn ). Then, since by assumption, X1 , . . . , Xn are indepen-
dent, for every A = A1 × . . . × An ,
n
!
\
PX (A) = P (X ∈ A) = P (X1 ∈ A1 , . . . , Xn ∈ An ) = P {Xi ∈ Ai }
i=1
n n
(III.18)
Y Y
= P (Xi ∈ Ai ) = PXi (Ai ) = (PX1 ⊗ . . . ⊗ PXn )(A),
i=1 i=1

by the defining property of the product measure (see Theorem III.46). It then forces PX = PX1 ⊗
. . . ⊗ PXn on the whole of B(Rn ), by the uniqueness part of Theorem III.46, i.e. PX1 ⊗ . . . ⊗ PXn
is the only measure satisfying (III.18) for all A1 × . . . × An ∈ A (alternatively, by Theorem I.13,
whose assumptions are valid, since B(Rn ) is generated by the π-system A , and are dealing with
probability measures).
(b) ⇒ (c): By using part (b) and Fubini, we get that
$$E\left[ \prod_{i=1}^{n} f_i(X_i) \right] = \int_{\mathbb{R}^n} \left( \prod_{i=1}^{n} f_i(x_i) \right) dP_X(x_1, \dots, x_n) = \prod_{i=1}^{n} \int_{\mathbb{R}} f_i(x_i)\,P_{X_i}(dx_i) = \prod_{i=1}^{n} E[f_i(X_i)].$$

(c) ⇒ (a): Given A1, . . . , An ∈ B(R), we set fi = χAi, and employ part (c), while observing that $\chi_{A_1 \times \dots \times A_n}(x_1, \dots, x_n) = \prod_{i=1}^{n} \chi_{A_i}(x_i)$, to yield:
$$P(X_1 \in A_1, \dots, X_n \in A_n) = E[\chi_{A_1 \times \dots \times A_n}(X_1, \dots, X_n)] = E\left[ \prod_{i=1}^{n} \chi_{A_i}(X_i) \right] = \prod_{i=1}^{n} E[\chi_{A_i}(X_i)] = \prod_{i=1}^{n} P(X_i \in A_i).$$


IV. Limit laws and Gaussian random variables


IV.1. Strong and weak laws of large numbers. In this section we are given a sequence X1 , X2 , . . .
of independent random variables, and are concerned with the asymptotic law of X1 + X2 + . . . + Xn as
n → ∞. Implicitly, it is assumed that all the random variables are defined on a common probability
space, to make sense of the said sum X1 + X2 + . . . + Xn as a random variable, which is what will
be assumed without further notice. Recall that the variance of a random variable is given by
Var(X) = E[(X − E[X])2 ] = E[X 2 ] − (E[X])2 ,
whenever it makes sense. We aim to prove the following two results, the weak and the strong laws of large numbers respectively:
Theorem IV.1 (Weak law of large numbers). Let X1 , X2 , . . . be a sequence of independent random
variables, so that for all i ≥ 1, E[Xi ] = µ < ∞, and so that there exists some number σ > 0 such
that, for all i ≥ 1, the variance of Xi is bounded
Var(Xi ) ≤ σ 2 . (IV.1)
Then the sample mean
$$\overline{X}_n := \frac{X_1 + \dots + X_n}{n} \xrightarrow{P} \mu$$
converges in probability to the mean µ as n → ∞.
That the mean E[Xi ] = µ is equal for all i is unimportant; by subtracting the mean from each
variable (and, accordingly, from X n ), we can assume that E[Xi ] = 0, whatever was the mean in the
first place. We do, however, require that all the Xi are integrable, i.e. E[|Xi |] < ∞ (which follows
implicitly by writing E[Xi] < ∞). As will be clear from the proof below, the condition (IV.1) is not used in its full strength. The significantly milder condition
$$\frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) \to 0$$
will be sufficient for all our needs. The conditions of Theorem IV.1 are satisfied, for example, in case
the random variables Xi are i.i.d., so long as Var(X1) < ∞; in this case a much stronger result
is applicable (cf. Theorem II.22), that will be proved under a superfluous extra assumption on the
boundedness of the 4th moment.
Theorem IV.2 (Strong law of large numbers (SLLN)). Let X1, X2, . . . be i.i.d. random variables so that for all i ≥ 1, E[Xi] = µ < ∞, and
$$M := E[X_1^4] < \infty. \qquad (IV.2)$$
Then
$$\overline{X}_n := \frac{X_1 + \dots + X_n}{n} \to \mu$$
a.s. as n → ∞.
In what follows we prove theorems IV.1-IV.2. To this end we require a couple of lemmas.
Lemma IV.3 (Markov’s inequality). Let X ≥ 0 be a a.s. nonnegative random variable so that
E[X] < ∞. Then, for t > 0,
E[X]
P (X ≥ t) ≤ .
t
Proof. Fix a number t > 0, and write the inequality $t \cdot \chi_{X \ge t} \le X$, which is satisfied trivially. Now, integrating both sides w.r.t. the measure P gives
$$t\,P(X \ge t) \le E[X],$$
which yields the stated inequality.

Lemma IV.4 (Chebyshev’s inequality). Let X be a random variable, with finite mean and variance:
E[X] = µ < ∞, and σ 2 = Var(X) < ∞. Then for t > 0,
1
P (|X − µ| ≥ tσ) ≤ 2 .
t
(Chebyshev’s inequality is only nontrivial for t > 1.)
Proof. Set Y = X − µ, so that E[Y 2 ] = σ 2 , and, by Markov’s inequality (note that Y 2 ≥ 0),
E[Y 2 ] Var(X) 1
P (|X − µ| ≥ tσ) = P (|Y | ≥ tσ) = P (Y 2 ≥ t2 σ 2 ) ≤ 2 2
= 2 2 = 2.
tσ tσ t

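Both inequalities are crude but easy to see in action. The following sketch (illustrative only; the exponential example is an arbitrary choice) compares the exact tail probabilities of X ∼ Exp(1), which has mean 1 and variance 1, with the Markov and Chebyshev bounds.

```python
import numpy as np

# X ~ Exp(1): E[X] = 1, Var(X) = 1, and P(X >= t) = e^{-t} for t >= 0.
for t in (2.0, 3.0, 5.0):
    exact_markov = np.exp(-t)             # P(X >= t)
    markov = 1.0 / t                      # Markov bound E[X]/t
    exact_cheb = np.exp(-(1 + t))         # P(|X - 1| >= t) = P(X >= 1 + t) for t >= 1
    chebyshev = 1.0 / t**2                # Chebyshev bound Var(X)/t^2 (sigma = 1)
    print(t, round(exact_markov, 4), round(markov, 4),
          round(exact_cheb, 4), round(chebyshev, 4))
```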
We are finally in a position to prove theorems IV.1-IV.2.
Proof of Theorem IV.1. Using the independence of the {Xi}i≥1, we obtain
$$\mathrm{Var}(\overline{X}_n) = \frac{1}{n^2} \mathrm{Var}(X_1 + \dots + X_n) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) \le \frac{\sigma^2 \cdot n}{n^2} = \frac{\sigma^2}{n},$$
by (IV.1); importantly, it shows that, as n → ∞, $\mathrm{Var}(\overline{X}_n) \to 0$, which will be sufficient for the convergence in probability (via Chebyshev's inequality). Now, employing Chebyshev's inequality, for every ε > 0, we have
$$P\left( |\overline{X}_n - \mu| \ge \varepsilon \right) \le \frac{\mathrm{Var}(\overline{X}_n)}{\varepsilon^2} \le \frac{\sigma^2}{n \varepsilon^2} \to 0,$$
which is the defining property of convergence in probability of $\overline{X}_n$ to µ.

Proof of Theorem IV.2. We first aim to reduce to the case when E[Xi] = 0. Indeed, the random variables Yi := Xi − µ satisfy
$$Y_i^4 \le (|X_i| + |\mu|)^4 \le (2\max\{|X_i|, |\mu|\})^4 \le 2^4 \max\{X_i^4, \mu^4\} \le 2^4 (X_i^4 + \mu^4),$$
so that $E[Y_i^4] \le 2^4(M + \mu^4) < \infty$, and $\frac{Y_1 + \dots + Y_n}{n} \to 0$ a.s., if and only if $\overline{X}_n \to \mu$ a.s. Hence all the assumptions of Theorem IV.2 hold with Yi in place of Xi (with M substituted by another constant). In light of the above, we may assume, as we do from this point on, that for every i ≥ 1, E[Xi] = 0.
Now, since the existence (and finiteness) of the higher moments implies the existence of the lower moments (HW8, Q1(b)), our assumption E[Xi⁴] ≤ M < ∞ also implies that Xi, Xi² and Xi³ are all integrable. Moreover, by the Cauchy–Schwarz inequality, for all i, j ≥ 1,
$$E[X_i^2 \cdot X_j^2] \le E[X_i^4]^{1/2} \cdot E[X_j^4]^{1/2} \le M, \qquad (IV.3)$$
thanks to (IV.2) (recall that the Xi are i.i.d.). Since the {Xi}i≥1 are all independent with mean E[Xi] = µ = 0, it follows that for all pairwise distinct indices i, j, k, ℓ,
$$E[X_i \cdot X_j^3] = E[X_i \cdot X_j \cdot X_k^2] = E[X_i \cdot X_j \cdot X_k \cdot X_\ell] = 0 \qquad (IV.4)$$
(cf. Proposition III.52(c) and HW7 Q5).
Denote the sum Sn := X1 + . . . + Xn, and expand out its 4th moment:
$$E[S_n^4] = E[(X_1 + \dots + X_n)^4] = E\left[ \sum_{i=1}^{n} X_i^4 + 6 \sum_{1 \le i < j \le n} X_i^2 X_j^2 \right],$$
since all the other terms will vanish, by (IV.4). It then follows from (IV.2) and (IV.3), that
$$E[S_n^4] \le n \cdot M + 6 \cdot \frac{n(n-1)}{2} \cdot M \le 3n^2 M.$$
Hence,
$$E\left[ \sum_{n=1}^{\infty} \left( \frac{S_n}{n} \right)^4 \right] \le 3M \cdot \sum_{n=1}^{\infty} \frac{1}{n^2} < \infty.$$
Denote the random variable
$$Y := \sum_{n=1}^{\infty} \left( \frac{S_n}{n} \right)^4 \ge 0.$$
The above argument demonstrates that E[Y] < ∞, hence P(Y < ∞) = 1 (as otherwise E[|Y|] would be infinite). That is, the defining series of Y is convergent a.s., hence the corresponding summands vanish at infinity a.s., i.e. $\frac{S_n^4}{n^4} \to 0$ as n → ∞ a.s., so $\overline{X}_n = \frac{S_n}{n} \to 0$ a.s., as asserted by Theorem IV.2.
Example IV.5. In statistics, oftentimes one has independent observations X1, X2, . . . of the same random variable X ∼ Xi, whose mean µ = E[X] = E[Xi] or variance σ² = Var(X) = Var(Xi) we aim to estimate. For example, one can approximate µ with the sample average, µ ≈ X̄n. Indeed, by the SLLN, the sample mean and sample variance satisfy
$$\overline{X}_n \to \mu, \qquad S_n^2 := \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2 \to \sigma^2 \quad \text{a.s.}$$
Thus the SLLN provides a means to estimate the mean and the variance of the distribution, but not an estimate for the confidence interval. A 95%-confidence interval for µ, based on the Central Limit Theorem below, is of the form
$$[\overline{X}_n - 1.96 \cdot \sigma/\sqrt{n},\ \overline{X}_n + 1.96 \cdot \sigma/\sqrt{n}],$$
requiring a priori knowledge of the variance σ² = Var(X), see Example IV.32 below. A confidence interval depends on the variance of the underlying distribution, whereas the SLLN does not involve the variance, nor does it give quantitative estimates of how good the approximations are.
Example IV.6. Let X be a random variable, A ∈ B(R), and suppose we would like to estimate the probability P(X ∈ A) based on sample copies X1, X2, . . . of X (which are themselves i.i.d. random variables). Then, by the SLLN,
$$\frac{\#\{i \le n : X_i \in A\}}{n} = \frac{1}{n} \sum_{i=1}^{n} \chi_A(X_i) \to E[\chi_{X_1 \in A}] = P(X \in A) \quad \text{a.s.}$$
This situation is typical for Bayesian statistics, where the probabilities are not given, and are based on the samples.
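A minimal numerical illustration of this estimator (a sketch only; the choices of the distribution and of the set A are arbitrary): for X standard Gaussian and A = [1, ∞), the empirical frequency of samples falling in A approaches P(X ≥ 1) ≈ 0.1587 as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)          # i.i.d. copies of X ~ N(0, 1)

for n in (100, 10_000, 1_000_000):
    freq = np.mean(x[:n] >= 1.0)            # (1/n) * #{i <= n : X_i in A}, A = [1, oo)
    print(n, round(freq, 4))
# The frequencies approach P(X >= 1) ≈ 0.1587.
```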
LECTURE NOTES FUNDAMENTALS OF PROBABILITY THEORY 45

Example IV.7 (“Monte-Carlo”). Let f : [0, 1] → R, and suppose we would like to numerically
R1
approximate the integral f (x)dx. One would then discretize the interval [0, 1] into a large number
0
N of sub-intervals of [0, 1] of length N1 , and use the function values in each sub-interval to approximate
the integral (e.g. Riemann sums). There are some more sophisticated methods that can improve the
rate of convergence (such as the trapezoid method etc.), but the ideas are similar. Alternatively, let
U1 , U2 , . . . ∼ U([0, 1]) be i.i.d. uniform random variables, and set Xi = f (Ui ). Then
Z1
E[Xi ] = E[f (U )] = f (x)dx,
0

and by the SLLN,


n Z1
1X
Xn = f (Ui ) → f (x)dx a.s. as n → ∞.
n i=1
0
R
For a multidimensional integral f (x)dx, d ≥ 1, the numerical methods would divide the d-cube
[0,1]d
[0, 1]d into N d sub-cubes with spacing 1/N , and approximate the values of the function in each cube;
this requires N d points, which might be very large. On the other hand, the approach with uniform
random variables is the same, except that now the i.i.d. are U1 , U2 . . . ∼ U([0, 1]d ).
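The sketch below (illustrative only; the integrand is an arbitrary choice) carries out this Monte-Carlo scheme on [0, 1] and compares the estimates with the exact value of the integral.

```python
import numpy as np

rng = np.random.default_rng(4)

def mc_integral(f, n):
    """Monte-Carlo estimate of the integral of f over [0, 1] from n uniform samples."""
    u = rng.uniform(0.0, 1.0, size=n)
    return np.mean(f(u))

f = lambda x: np.sin(np.pi * x)          # exact integral over [0, 1] is 2/pi
exact = 2 / np.pi
for n in (100, 10_000, 1_000_000):
    print(n, round(mc_integral(f, n), 5), "exact:", round(exact, 5))
# The error decays roughly like 1/sqrt(n), independently of the dimension.
```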
IV.2. Characteristic function. We will now define the Fourier transform of a measure on Rⁿ, which is going to be a complex-valued function on Rⁿ, given in terms of a certain integral of a complex-valued function on Rⁿ w.r.t. that measure. Let g : Rⁿ → C be a complex-valued function, and ℜ(g) and ℑ(g) its real and imaginary parts. The function g is measurable, if both ℜ(g), ℑ(g) : Rⁿ → R are B(Rⁿ)−B(R) measurable; it is integrable if $\int_{\mathbb{R}^n} |\Re(g)|\,d\mu,\ \int_{\mathbb{R}^n} |\Im(g)|\,d\mu < \infty$, whence
$$\int_{\mathbb{R}^n} g\,d\mu := \int_{\mathbb{R}^n} \Re(g)\,d\mu + i \cdot \int_{\mathbb{R}^n} \Im(g)\,d\mu.$$
Note that the domain of g is real, and the definition of the integral is thereupon different than for complex functions (defined on C). The function
$$g(x) = g_u(x) = e^{i\langle u, x\rangle} = \cos(\langle u, x\rangle) + i \sin(\langle u, x\rangle)$$
is integrable for any finite measure µ (i.e. µ(Rⁿ) < ∞) on (Rⁿ, B(Rⁿ)), for every u = (u1, . . . , un) ∈ Rⁿ, where ⟨u, x⟩ = u1x1 + . . . + unxn is the standard inner product on Rⁿ.
Definition IV.8. For a finite measure µ on the measurable space (Rⁿ, B(Rⁿ)) (i.e. µ(Rⁿ) < ∞) define its Fourier transform $\widehat{\mu} : \mathbb{R}^n \to \mathbb{C}$ by
$$\widehat{\mu}(u) = \int_{\mathbb{R}^n} e^{i\langle u, x\rangle}\,\mu(dx),$$
u ∈ Rⁿ.
By the above, the Fourier transform is well-defined as a function $\widehat{\mu} : \mathbb{R}^n \to \mathbb{C}$, for every finite measure µ. It enjoys the following properties:
(1) We have $\widehat{\mu}(-u) = \overline{\widehat{\mu}(u)}$, where $\overline{\,\cdot\,}$ is the complex conjugate of a number. This follows from the identity $\overline{e^{it}} = e^{-it}$, t ∈ R.
(2) The function $\widehat{\mu}(\cdot)$ is continuous. This follows from the Dominated Convergence theorem (as $|e^{it}| = 1$).
Definition IV.9 (Characteristic function). The characteristic function of a random variable X = (X1, X2, . . . , Xn) : Ω → Rⁿ is the function φX : Rⁿ → C, defined by
$$\varphi_X(u) := E\left[ e^{i\langle u, X\rangle} \right] = \int e^{i\langle u, X(\omega)\rangle}\,P(d\omega). \qquad (IV.5)$$
By Lemma III.31,
$$\varphi_X(u) = \int_{\mathbb{R}^n} e^{i\langle u, x\rangle}\,P_X(dx) = \widehat{P_X}(u),$$
i.e. the characteristic function is the Fourier transform of the distribution PX of X. If, further, PX has a density fX w.r.t. the Borel measure, then, by (III.12),
$$\varphi_X(u) = \int_{\mathbb{R}^n} e^{i\langle u, x\rangle} f_X(x)\,dx = \widehat{f_X}(u),$$
the "usual" Fourier transform of the function fX.
Example IV.10 (Standard Gaussian random variable). A random variable X on Rⁿ is standard Gaussian, if for every A ∈ B(Rⁿ),
$$P(X \in A) = \frac{1}{(2\pi)^{n/2}} \int_A e^{-\|x\|^2/2}\,dx,$$
with $\|x\| = \|(x_1, \dots, x_n)\| = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}$ the standard Euclidean norm. Let us compute its characteristic function for n = 1; for higher n, the variables separate, and the characteristic function is merely the product of the ones in each variable.
Since E[|X|] < ∞, Theorem III.24 implies that φX is differentiable, and we can differentiate (IV.5) under the integral sign in this case, to yield:
$$\frac{d}{du} \varphi_X(u) = \frac{d}{du} E[e^{iuX}] = E[iX e^{iuX}] = \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} (i e^{iux}) \cdot x e^{-x^2/2}\,dx.$$
Now we use integration by parts to write
$$\frac{d}{du} \varphi_X(u) = \frac{1}{\sqrt{2\pi}} \left[ i e^{iux} \cdot \left( -e^{-x^2/2} \right) \right]_{-\infty}^{\infty} - \frac{1}{\sqrt{2\pi}}\, u \int_{\mathbb{R}} e^{iux} e^{-x^2/2}\,dx = -u\,\varphi_X(u),$$
i.e. the function φX satisfies the differential equation $\varphi_X'(u) = -u\,\varphi_X(u)$. This in turn implies
$$\frac{d}{du} \left( e^{u^2/2} \varphi_X(u) \right) = u e^{u^2/2} \varphi_X(u) + e^{u^2/2} \varphi_X'(u) = e^{u^2/2} \left( u \varphi_X(u) + \varphi_X'(u) \right) = 0$$
by the above. So, the function $e^{u^2/2} \varphi_X(u) \equiv \varphi_X(0)$ is constant, and
$$\varphi_X(u) = \varphi_X(0) \cdot e^{-u^2/2} = e^{-u^2/2},$$
as
$$\varphi_X(0) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2}\,dx = 1,$$
by Example III.49.
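The identity $\varphi_X(u) = e^{-u^2/2}$ can also be checked numerically by averaging $e^{iuX}$ over Gaussian samples. The following sketch (illustrative only; sample size and test points are arbitrary) does so for a few values of u.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.standard_normal(1_000_000)            # samples of X ~ N(0, 1)

for u in (0.5, 1.0, 2.0):
    phi_mc = np.mean(np.exp(1j * u * x))      # Monte-Carlo estimate of E[e^{iuX}]
    print(u, round(phi_mc.real, 4), round(phi_mc.imag, 4),
          "exact:", round(np.exp(-u**2 / 2), 4))
# The real parts match e^{-u^2/2}; the imaginary parts are close to 0.
```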
Hence the characteristic function of the standard Gaussian is equal, up to a constant, to its density.
This is the underlying reason for the frequent occurrence of the Gaussian distribution in nature and
many aspects of life. The characteristic function of a random variable uniquely determines the
distribution, and hence, φX encodes the full information on PX . If φX is integrable, then we can
obtain an explicit inversion formula.
Theorem IV.11. The distribution PX of an Rⁿ-valued random variable X is uniquely determined by its characteristic function φX. Moreover, if φX is integrable (this means that $\int_{\mathbb{R}^n} |\Re(\varphi_X)|\,du < \infty$ and $\int_{\mathbb{R}^n} |\Im(\varphi_X)|\,du < \infty$), then X has a density function fX, given by
$$f_X(x) = (\mathcal{F}^* \varphi_X)(x) := \frac{1}{(2\pi)^n} \int_{\mathbb{R}^n} \varphi_X(u) e^{-i\langle u, x\rangle}\,du. \qquad (IV.6)$$
The operator $\mathcal{F}^*$ is called the inverse Fourier transform.
Before giving a proof of Theorem IV.11 we will require a lemma. For t > 0, x, y ∈ Rⁿ define the heat kernel
$$p(t, x, y) := \frac{1}{(2\pi t)^{n/2}} e^{-\frac{\|x-y\|^2}{2t}}.$$
For t > 0, x ∈ Rⁿ fixed, p(t, x, ·) is the density of N(x, tIn), the Gaussian random variable on Rⁿ with mean x and covariance matrix t · In, as asserted by the following lemma.
Lemma IV.12. Let Z be a standard Gaussian random variable on Rⁿ, x ∈ Rⁿ, and t > 0. Then:
(1) The random variable $x + \sqrt{t} \cdot Z$ has density function p(t, x, ·) on Rⁿ.
(2) For all y ∈ Rⁿ,
$$p(t, x, y) = \frac{1}{(2\pi)^n} \int_{\mathbb{R}^n} e^{i\langle u, x\rangle} e^{-t\|u\|^2/2} e^{-i\langle u, y\rangle}\,du. \qquad (IV.7)$$
The identity (IV.7) is a particular application of Theorem IV.11 in this case.



Proof. (1): Writing Z = (Z_1, . . . , Z_n), for 1 ≤ k ≤ n, the random variables Y_k = x_k + √t · Z_k are independent Gaussians with mean x_k and variance t. Thus Y_k has density f_{Y_k}(y_k) = (1/√(2πt)) e^{−(y_k−x_k)²/(2t)} on R, and, by the independence, the density of x + √t Z is the product
Π_{k=1}^n f_{Y_k}(y_k) = p(t, x, y).

(2): Let X be a standard Gaussian random variable, and for w ∈ R and t > 0, we write
∫_R e^{iwv} (1/√(2πt)) e^{−v²/(2t)} dv = E[ e^{iw√t X} ] = φ_X(w√t) = e^{−w²t/2},    (IV.8)
since φ_X(u) = e^{−u²/2} by Example IV.10. We substitute w = (x_k − y_k)/t, and transform the variables u_k = v/t, so that (IV.8) reads
∫_R e^{i(x_k−y_k)u_k} (√t/√(2π)) e^{−t u_k²/2} du_k = ∫_R e^{i(x_k−y_k)v/t} (1/√(2πt)) e^{−v²/(2t)} dv = e^{−(x_k−y_k)²/(2t)}.
By a standard manipulation, we can rearrange
(1/(2π)) ∫_R e^{iu_k x_k} e^{−t u_k²/2} e^{−iu_k y_k} du_k = (1/√(2πt)) e^{−(x_k−y_k)²/(2t)},
and multiply for k = 1, . . . , n so that
(1/(2π)^n) ∫_{R^n} e^{i⟨u,x⟩} e^{−t‖u‖²/2} e^{−i⟨u,y⟩} du = Π_{k=1}^n (1/(2π)) ∫_R e^{iu_k x_k} e^{−t u_k²/2} e^{−iu_k y_k} du_k = Π_{k=1}^n (1/√(2πt)) e^{−(x_k−y_k)²/(2t)} = p(t, x, y).
Proof of Theorem IV.11. Let X be the given random variable. In what follows we are going to argue
that one can express E[g(X)] in terms of φX , for every continuous bounded function g : Rn → R.
We will then approximate from below the indicator functions χ_A of the sets
A = Π_{k=1}^n (a_k, b_k],    (IV.9)
elements of the π-system generating B(Rn ), by such functions g, and that will be sufficient to
prescribe PX (A) in terms of φX , which, in turn, is sufficient to prescribe PX , by the Uniqueness
Theorem I.13.
Now, let Z be a standard Gaussian random variable on R^n, independent of X. Then, for every g : R^n → R continuous and bounded, and every t > 0, we have by Fubini's theorem (and Lemma III.31, which requires the integrability of g(X + √t Z); this is why we assume that g is bounded in the first place, as boundedness certainly yields the integrability of g(X + √t Z)):
E[g(X + √t Z)] = ∫_{R^n×R^n} g(x + √t z) (dP_X ⊗ dP_Z)(x, z) = ∫_{R^n} ∫_{R^n} g(x + √t z) (1/(2π)^{n/2}) e^{−‖z‖²/2} dz P_X(dx)
             = ∫_{R^n} E[g(x + √t Z)] P_X(dx),    (IV.10)
and, by Lemma IV.12, for every fixed x ∈ Rn ,
E[g(x + √t Z)] = ∫_{R^n} g(y) p(t, x, y) dy = (1/(2π)^n) ∫_{R^n} g(y) · ∫_{R^n} e^{i⟨u,x⟩} e^{−t‖u‖²/2} e^{−i⟨u,y⟩} du dy.    (IV.11)

Hence, by substituting (IV.11) into (IV.10), we obtain
E[g(X + √t Z)] = ∫_{R^n} ∫_{R^n} g(y) · (1/(2π)^n) ∫_{R^n} e^{i⟨u,x⟩} e^{−t‖u‖²/2} e^{−i⟨u,y⟩} du dy P_X(dx),
which, upon using Fubini again, reads
E[g(X + √t Z)] = (1/(2π)^n) ∫_{R^n} g(y) · ∫_{R^n} e^{−t‖u‖²/2} e^{−i⟨u,y⟩} ∫_{R^n} e^{i⟨u,x⟩} P_X(dx) du dy
             = ∫_{R^n} g(y) ( (1/(2π)^n) ∫_{R^n} e^{−t‖u‖²/2} e^{−i⟨u,y⟩} φ_X(u) du ) dy.    (IV.12)
That demonstrates that one can express E[g(X + √t Z)] in terms of φ_X (and g) for every g continuous and bounded. Hence, using the Dominated Convergence theorem with (IV.12), as t → 0, E[g(X + √t Z)] → E[g(X)], so we can also express E[g(X)] in terms of φ_X (as a dominating function we can use any constant bounding |g(·)|, as it is integrable against any probability measure). Finally,
given A of the form (IV.9), it is possible to find a sequence {g_n} of continuous functions so that g_n(x) → χ_A(x) for a.e. x ∈ R^n, and, in addition, |g_n| ≤ 1. Then, by Dominated Convergence,
E[g_n(X)] → E[χ_A(X)] = P_X(A),
which is sufficient for determining the full distribution PX .
Now we aim to prove (IV.6). If, as we assume, φX is integrable, then for every continuous g that
is, in addition, compactly supported (i.e. g vanishes outside of [−R, R]n for R sufficiently large; in
particular, g is bounded),
∫_{R^n} ∫_{R^n} |φ_X(u)| · |g(y)| du dy = ∫_{R^n} |φ_X(u)| du · ∫_{R^n} |g(y)| dy < ∞,

and the function f (u, y) := |φX (u)| · |g(y)| dominates the integrand on the r.h.s. of (IV.12), for every
t > 0. Thus, by Fubini, and the Dominated Convergence theorem (applied to the double integral
w.r.t. the product measure du ⊗ dy), we may take t → 0 on the r.h.s. of (IV.12), which (bearing in mind that we already know that the l.h.s. converges to E[g(X)] = ∫_{R^n} g(y) f_X(y) dy by the above and (III.12)) then reads
∫_{R^n} g(y) f_X(y) dy = E[g(X)] = ∫_{R^n} g(y) ( (1/(2π)^n) ∫_{R^n} e^{−i⟨u,y⟩} φ_X(u) du ) dy,

valid for all g continuous and compactly supported. Finally, we can extend this for g = χA for A
as in (IV.9), and this is certainly sufficient to yield that (IV.6) is a density function of X, again by
appealing to the Uniqueness Theorem I.13.

As a consequence of Theorem IV.11, characteristic functions play an important role in identifying distributions. For instance, if X and Y are independent random variables, then
φ_{X+Y}(u) = E[ e^{i⟨u,X+Y⟩} ] = E[ e^{i⟨u,X⟩} · e^{i⟨u,Y⟩} ] = E[ e^{i⟨u,X⟩} ] · E[ e^{i⟨u,Y⟩} ] = φ_X(u) · φ_Y(u),
where the third equality uses the independence of X and Y; this is a simple rule for computing the characteristic function of a sum of independent random variables (which is far less obvious at the level of distributions or density functions).
Example IV.13 (Poisson random variables). Let X ∼ Pois(λ) for some λ ≥ 0. Then
φ_X(u) = E[e^{iuX}] = Σ_{k=0}^∞ e^{iuk} · P(X = k) = Σ_{k=0}^∞ e^{iuk} · (λ^k/k!) e^{−λ} = e^{−λ} Σ_{k=0}^∞ (e^{iu}λ)^k / k!
       = e^{−λ} · exp(e^{iu}λ) = exp(e^{iu}λ − λ) = exp(λ(e^{iu} − 1)).

By the above, if X ∼ Pois(λ) and Y ∼ Pois(µ) are independent, then
φ_{X+Y}(u) = φ_X(u) · φ_Y(u) = exp(λ(e^{iu} − 1)) · exp(µ(e^{iu} − 1)) = exp((λ + µ)(e^{iu} − 1)),
the characteristic function of a random variable distributed ∼ Pois(λ + µ). Hence, by the uniqueness of the distribution with a given characteristic function (Theorem IV.11), X + Y ∼ Pois(λ + µ).
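This additivity can also be checked numerically (an illustrative sketch; the values of λ, µ, the sample size and the u-values are arbitrary choices): the empirical characteristic function of X + Y matches exp((λ + µ)(e^{iu} − 1)).

```python
# Sketch: empirical check that X + Y ~ Pois(lambda + mu) for independent Poissons,
# via the characteristic function. All parameters below are arbitrary.
import numpy as np

rng = np.random.default_rng(1)
lam, mu, N = 2.0, 3.5, 200_000
s = rng.poisson(lam, N) + rng.poisson(mu, N)       # samples of X + Y

for u in [0.3, 1.0, 2.0]:
    empirical = np.mean(np.exp(1j * u * s))        # empirical characteristic function
    exact = np.exp((lam + mu) * (np.exp(1j * u) - 1))
    print(u, empirical, exact)
```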
Another way to prove the result above is via the moment generating function m_X(t) = E[e^{tX}] (which, if one extends the characteristic function to a complex variable, is its restriction to the imaginary line). However, one advantage of the characteristic function over the moment generating function is that it is always defined. The following result shows that the factorization of the characteristic function is equivalent to the independence of the random variables involved.
Theorem IV.14. Let X = (X_1, . . . , X_n) be a random variable on R^n. Then X_1, . . . , X_n are independent, if and only if, for all u = (u_1, . . . , u_n) ∈ R^n,
φ_X(u) = φ_{X_1}(u_1) · . . . · φ_{X_n}(u_n).    (IV.13)
Proof. "Only if": By Proposition III.52, if, as we assumed, X_1, . . . , X_n are independent, then
E[ Π_{k=1}^n f_k(X_k) ] = Π_{k=1}^n E[f_k(X_k)].
The equality (IV.13) follows upon substituting f_k(x) = e^{iu_k x}.


"If": Let X̃_1, . . . , X̃_n be independent random variables with P_{X̃_k} = P_{X_k}, k = 1, 2, . . . , n, i.e. they have the same marginal distributions. Then, for all k, φ_{X̃_k} = φ_{X_k}, and by the "only if" statement above,
φ_{X̃}(u) = Π_{k=1}^n φ_{X̃_k}(u_k) = Π_{k=1}^n φ_{X_k}(u_k) = φ_X(u)
for all u = (u_1, . . . , u_n) ∈ R^n, by the assumptions of the "if" statement. But then P_{X̃} = P_X by Theorem IV.11, and then X_1, . . . , X_n are independent, since X̃_1, . . . , X̃_n are.

Characteristic functions also allow for the computation of the moments of the distribution, in a
way similar to the moment generating functions.
Proposition IV.15. Let X be a real-valued random variable, and k ≥ 1 so that E[|X|k ] < ∞.
Then
i^k E[X^k] = (d^k/du^k) φ_X(u) |_{u=0}.

Proof. By the finiteness of E[|X|k ], we can differentiate k times under the integral (or expectation)
sign, thanks to Theorem III.24:
(d^k/du^k) φ_X(u) = (d^k/du^k) E[ e^{iuX} ] = E[ (iX)^k · e^{iuX} ],
which at u = 0 reads
(d^k/du^k) φ_X(u) |_{u=0} = i^k E[X^k],
as required.

Example IV.16. Recall that we computed the characteristic function of the standard Gaussian
X ∼ N(0, 1) to be φ_X(u) = e^{−u²/2}, see Example IV.10. Then φ'_X(u) = −u e^{−u²/2}, so that i E[X] = φ'_X(0) = 0 and E[X] = 0, and φ''_X(u) = u² e^{−u²/2} − e^{−u²/2}, so that −E[X²] = φ''_X(0) = −1, and Var(X) = E[X²] = 1.
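Proposition IV.15 can be checked symbolically for the standard Gaussian (a sketch using sympy): differentiating e^{−u²/2} k times at u = 0 and dividing by i^k reproduces the Gaussian moments 0, 1, 0, 3, 0, 15, …

```python
# Sketch: symbolic check of Proposition IV.15 for X ~ N(0, 1), where phi(u) = exp(-u^2/2).
import sympy as sp

u = sp.symbols('u', real=True)
phi = sp.exp(-u**2 / 2)

for k in range(1, 7):
    moment = sp.simplify(sp.diff(phi, u, k).subs(u, 0) / sp.I**k)   # E[X^k] = i^{-k} phi^{(k)}(0)
    print(k, moment)    # expected: 0, 1, 0, 3, 0, 15
```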
IV.3. Gaussian random variables. A Gaussian random variable X ∼ N (µ, σ 2 ) with mean µ ∈ R
and variance σ 2 > 0 has the density function
f_X(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)},
x ∈ R. The degenerate case X = µ a.s. corresponds to σ² = 0.
Lemma IV.17 (Homework assignment). Let X ∼ N (µ, σ 2 ). Then:
(1) The expectation of X is E[X] = µ.
(2) The variance of X is Var(X) = σ 2 .
(3) For every a, b ∈ R, aX + b ∼ N(aµ + b, a²σ²).
(4) The characteristic function is φ_X(u) = e^{iuµ − u²σ²/2}.
A random variable X = (X1 , . . . , Xn ) ∈ Rn is (multivariate) Gaussian, if for every u ∈ Rn , the
random variable
⟨u, X⟩ = u_1 X_1 + . . . + u_n X_n
is Gaussian (on R).
Example IV.18. If X = (X_1, . . . , X_n) is Gaussian, then, for every 1 ≤ i ≤ n, X_i = ⟨e_i, X⟩ is Gaussian, where e_i = (0, . . . , 0, 1, 0, . . . , 0) has a unit i'th entry. The converse is not true, i.e. if all the marginal distributions of X are Gaussian, it does not follow that X is Gaussian (see further Example IV.21).
Let (X1 , . . . , Xn ) be independent standard N (0, 1) Gaussians, and X = (X1 , . . . , Xn ). Then, for
every u ∈ Rn ,
φ_{⟨u,X⟩}(t) = φ_{u_1 X_1 + . . . + u_n X_n}(t) = Π_{k=1}^n φ_{u_k X_k}(t) = Π_{k=1}^n E[e^{it u_k X_k}] = Π_{k=1}^n φ_{X_k}(t u_k) = Π_{k=1}^n e^{−t² u_k²/2} = e^{−t²‖u‖²/2},
which we identify as the characteristic function of N(0, ‖u‖²). Hence, in light of the uniqueness of the characteristic function, ⟨u, X⟩ ∼ N(0, ‖u‖²), and thus X is Gaussian.


Recall that for two real-valued random variables X and Y of means µX and µY respectively, the
variance and the covariance are
Var(X) = E[(X − µX )2 ] = E[X 2 ] − µ2X ,
and
Cov(X, Y ) = E[(X − µX ) · (Y − µY )] = E[X · Y ] − µX · µY .
We have that Var(X) = 0, if and only if X = µX a.s. by Lemma III.7(3)-(4). If X and Y are
independent, then X and Y are uncorrelated, i.e. Cov(X, Y ) = 0, with the converse being false. If
X is an R^n-valued random variable, then the covariance matrix of X is the symmetric n × n matrix
Cov(X) = (Cov (Xi , Xj ))ni,j=1
(also denoted Var(X)). In the matrix notation, we can write concisely
Cov(X) = E[(X − µX ) · (X − µX )T ],
where X − µ_X is a column vector and (X − µ_X)^T (the transpose of X − µ_X) is a row vector, so that Cov(X) is an n × n matrix. (More generally, if X ∈ R^n and Y ∈ R^m, Cov(X, Y) is the n × m matrix
Cov(X, Y ) = E[(X − µX ) · (Y − µY )T ].)
The covariance is bilinear: for all a, b ∈ R and X, Y, Z real-valued random variables, we have
Cov(aX + bY, Z) = a Cov(X, Z) + b Cov(Y, Z).
Recall that a symmetric n × n matrix A is positive semi-definite, if for all x ∈ Rn , xT Ax ≥ 0.
Proposition IV.19. A covariance matrix Cov(X) of a Rn -valued random variable is positive semi-
definite.
Proof. By the bi-linearity of the covariance, for a ∈ Rn ,
a^T Cov(X) a = Σ_{i=1}^n Σ_{j=1}^n a_i Cov(X_i, X_j) a_j = Cov( Σ_{i=1}^n a_i X_i , Σ_{j=1}^n a_j X_j ) = Var(a^T X) ≥ 0.


If X ∼ N(µ, σ²), then X = µ + σZ with Z := (X − µ)/σ ∼ N(0, 1) standard Gaussian (assuming σ > 0,
otherwise X = µ a.s.). To generalise this for multivariate Gaussian random variables we will have to
make sense of the “square root of the covariance matrix”. Let V = (Vij ) be the positive semi-definite
covariance matrix of a Gaussian random variable. Since V is symmetric, V could be diagonalized
by an orthogonal matrix, i.e. it has n real-valued eigenvalues {λi }ni=1 and eigenvectors {vi }ni=1 that
are an orthonormal basis of R^n. Since V is positive semi-definite, all the λ_i are nonnegative. A vector x ∈ R^n could be expressed as a linear combination of the {v_i} as x = Σ_{i=1}^n ⟨x, v_i⟩ v_i, so that, applying V, we have
V x = Σ_{i=1}^n ⟨x, v_i⟩ V v_i = Σ_{i=1}^n ⟨x, v_i⟩ λ_i v_i = Σ_{i=1}^n λ_i v_i (v_i^T x) = ( Σ_{i=1}^n λ_i v_i v_i^T ) x.
It follows that
V = Σ_{i=1}^n λ_i v_i v_i^T,
and we define the square root of V (or, in the same way, of any positive semi-definite symmetric matrix) to be the symmetric, positive semi-definite matrix
V^{1/2} := Σ_{i=1}^n √λ_i v_i v_i^T.
Since v_i^T v_j = δ_{ij} (recall that {v_i}_{i=1}^n is an orthonormal basis of R^n), the matrix V^{1/2} satisfies
V^{1/2} · V^{1/2} = Σ_{i=1}^n Σ_{j=1}^n √(λ_i λ_j) v_i v_i^T v_j v_j^T = Σ_{i=1}^n λ_i v_i v_i^T = V,    (IV.14)
as expected.
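This construction is easy to carry out numerically (an illustrative sketch; the matrix V below is an arbitrary positive semi-definite example): build V^{1/2} from the eigendecomposition and check (IV.14).

```python
# Sketch: square root of a positive semi-definite matrix via its eigendecomposition,
# following the construction above. The matrix V is an arbitrary example.
import numpy as np

V = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.5],
              [0.0, 0.5, 1.0]])             # symmetric, positive definite

lam, vecs = np.linalg.eigh(V)               # eigenvalues lam[i], orthonormal columns vecs[:, i]
V_half = sum(np.sqrt(l) * np.outer(v, v) for l, v in zip(lam, vecs.T))

print(np.allclose(V_half @ V_half, V))      # True: V^{1/2} V^{1/2} = V, as in (IV.14)
```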
Theorem IV.20. Let X be a Gaussian random variable on Rn . Then:
(1) For every n × n matrix A and b ∈ Rn , AX + b is Gaussian.
(2) The distribution of X is fully determined by µ = E[X] and V = Cov(X), denoted X ∼
N (µ, V ).
(3) The characteristic function of X is
φ_X(u) = exp( i⟨u, µ⟩ − (1/2)⟨u, V u⟩ ),    u ∈ R^n.
(4) If V is nonsingular, then X has a density on R^n given by
f_X(x) = (1/((2π)^{n/2} √(det V))) exp( −(1/2)(x − µ)^T V^{−1}(x − µ) ),    x ∈ R^n.
(5) If X = (Y, Z) ∈ R^n, with Y ∈ R^m and Z ∈ R^p, m + p = n, and the covariance matrix of X has the block structure
Cov(X) = ( Cov(Y)  0 ; 0  Cov(Z) ),
then Y and Z are independent.
Proof. Recall that for u, v ∈ R^n and A an n × n real-valued matrix we have ⟨u, v⟩ = u^T v and (Av)^T = v^T A^T.
(1) For u ∈ R^n,
⟨u, AX + b⟩ = u^T A X + u^T b = ⟨A^T u, X⟩ + ⟨u, b⟩,
which is Gaussian by Lemma IV.17(3), since ⟨A^T u, X⟩ is Gaussian by assumption.
(2) Follows from (3) below and Theorem IV.11.
(3) For u ∈ R^n,
E[⟨u, X⟩] = u^T E[X] = u^T µ = ⟨u, µ⟩,
and
Var(⟨u, X⟩) = E[(u^T X − u^T µ)²] = E[(u^T X − u^T µ) · (u^T X − u^T µ)^T]
            = E[(u^T X − u^T µ) · (X^T u − µ^T u)] = E[u^T (X − µ) · (X^T − µ^T) u]
            = u^T E[(X − µ) · (X^T − µ^T)] u = u^T V u = ⟨u, V u⟩.
Hence, by the Gaussianity assumption, ⟨u, X⟩ ∼ N(⟨u, µ⟩, ⟨u, V u⟩), and
φ_X(u) = E[e^{i⟨u,X⟩}] = φ_{⟨u,X⟩}(1) = exp( i⟨u, µ⟩ − (1/2)⟨u, V u⟩ ),
thanks to Lemma IV.17(4).
(4) Let Y_1, . . . , Y_n be i.i.d. standard Gaussians, and Y = (Y_1, . . . , Y_n), an R^n-valued Gaussian random variable. Then Y has the density
f_Y(y) = Π_{k=1}^n f_{Y_k}(y_k) = (1/(2π)^{n/2}) e^{−‖y‖²/2},    (IV.15)
y ∈ R^n, and the R^n-valued random variable X̃ := µ + V^{1/2} Y is Gaussian by (1). Moreover, its mean is E[X̃] = µ + V^{1/2} E[Y] = µ, and its covariance matrix is V, since
Cov(X̃) = E[(V^{1/2} Y) · (V^{1/2} Y)^T] = E[V^{1/2} Y · Y^T (V^{1/2})^T] = V^{1/2} E[Y · Y^T] V^{1/2} = V^{1/2} · V^{1/2} = V,
since V^{1/2} is symmetric and E[Y · Y^T] = I_n, in light of (IV.14).
Therefore, the distributions of X and X̃ are equal by (2), and, since V is nonsingular, we can invert the dependence of X̃ on Y to write
Y = V^{−1/2}(X̃ − µ).
Now we use the transformation rule (generalization of (III.16)), upon substituting
‖y‖² = ‖V^{−1/2}(x − µ)‖² = ⟨x − µ, V^{−1}(x − µ)⟩
into (IV.15), and take into account the Jacobian |det(V^{−1/2})| = 1/√(det V), to obtain the asserted density for X̃, and a fortiori for X.
(5) Let V = Cov(X), U = Cov(Y), W = Cov(Z). Hence, by the block structure of V, for all vectors u ∈ R^m, w ∈ R^p,
⟨(u, w), V(u, w)⟩ = ⟨u, U u⟩ + ⟨w, W w⟩.
We also separate the mean µ_X = (µ_Y, µ_Z). Hence the characteristic function φ_X = φ_{(Y,Z)} splits:
φ_X((u, w)) = e^{i⟨u,µ_Y⟩ − ⟨u,Uu⟩/2} · e^{i⟨w,µ_Z⟩ − ⟨w,Ww⟩/2} = φ_Y(u) · φ_Z(w),
and therefore Y and Z are independent by Theorem IV.14.
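The construction X̃ = µ + V^{1/2}Y from the proof of (4) is also how multivariate Gaussians are commonly simulated; a minimal numerical sketch (µ and V below are arbitrary choices, not taken from the notes):

```python
# Sketch: simulate X = mu + V^{1/2} Z with Z a vector of i.i.d. standard Gaussians,
# as in the proof of Theorem IV.20(4). The choices of mu and V are arbitrary.
import numpy as np

mu = np.array([1.0, -2.0])
V = np.array([[2.0, 0.8],
              [0.8, 1.0]])

lam, vecs = np.linalg.eigh(V)
V_half = (vecs * np.sqrt(lam)) @ vecs.T          # V^{1/2} = sum_i sqrt(lam_i) v_i v_i^T

rng = np.random.default_rng(2)
Z = rng.standard_normal((100_000, 2))
X = mu + Z @ V_half.T                            # each row is a sample of N(mu, V)

print(X.mean(axis=0))                            # ~ mu
print(np.cov(X, rowvar=False))                   # ~ V
```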


Theorem IV.20 shows, by induction, that if the R^n-valued random variable X = (X_1, . . . , X_n) is (jointly) Gaussian, then the random variables X_1, . . . , X_n are independent if and only if the covariance matrix V = Cov(X) is diagonal. The joint Gaussianity assumption of this result is essential: merely having Gaussian marginal distributions is not sufficient for the independence conclusion to hold, as illustrated in the following example.
Example IV.21. Let X ∼ N (0, 1) be a standard Gaussian random variable, fix a number a > 0,
and define
Y = X if |X| < a,    Y = −X if |X| ≥ a.
On the one hand, by the symmetry of the Gaussian distribution around the origin, −X ∼ N (0, 1).
Therefore, for every Borel set A ∈ B(R),
P (Y ∈ A) = P (Y ∈ A ∩ (−a, a)) + P (Y ∈ A ∩ (R \ (−a, a)))
= P (X ∈ A ∩ (−a, a)) + P (−X ∈ A ∩ (R \ (−a, a)))
= P (X ∈ A ∩ (−a, a)) + P (X ∈ A ∩ (R \ (−a, a))) = P (X ∈ A),
and therefore Y ∼ N (0, 1) too.
But, on the other hand
P (X + Y = 0) = P (X = −Y ) = P (|X| ≥ a) ∈ (0, 1),
and therefore X +Y = h(1, 1), (X, Y )i is not a Gaussian random variable (as otherwise P (X +Y = 0)
would either vanish or be equal to 1 in case it is degenerate), which means that the R2 -valued random
variable (X, Y) is not Gaussian. It is even possible to choose a > 0 so that Cov(X, Y) = 0, but this does not contradict Theorem IV.20(5), as that statement assumes that (X, Y) is (jointly) Gaussian, which we know it is not.
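A simulation sketch of this construction (the value of a and the sample size below are arbitrary; the particular a is chosen numerically so that Cov(X, Y) is approximately 0, which is not a value given in the notes): the marginal of Y looks standard Gaussian, yet X + Y has an atom at 0, so (X, Y) cannot be jointly Gaussian.

```python
# Sketch: simulate Example IV.21 and observe the atom of X + Y at 0.
import numpy as np

rng = np.random.default_rng(3)
a = 1.54                                   # roughly where Cov(X, Y) ~ 0 (numerical approximation)
X = rng.standard_normal(200_000)
Y = np.where(np.abs(X) < a, X, -X)         # Y = X on {|X| < a}, Y = -X on {|X| >= a}

print(Y.mean(), Y.var())                   # ~0 and ~1: Y looks standard Gaussian
print(np.mean(X + Y == 0))                 # = P(|X| >= a) > 0: an atom at 0
print(np.mean(X * Y))                      # ~ Cov(X, Y), close to 0 for this a
```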
Example IV.22. Let X ∼ N (0, 1) and Y = 0 a.s. (it is a degenerate Gaussian with Var(Y ) = 0
and E[Y ] = 0). Then, for every u = (u1 , u2 ) ∈ R2 , u1 X +u2 Y = u1 X ∼ N (0, u21 ), so that Z = (X, Y )
is Gaussian with mean µ = (0, 0), and singular covariance matrix
 
Cov(Z) = ( 1  0 ; 0  0 ).
In this light (since the covariance is singular), (X, Y ) has no density w.r.t. the Borel measure on
R2 . Finally, X and Y are independent, manifested by Cov(X, Y ) = 0 in the top right corner of the
covariance matrix.
IV.4. Convergence in distribution and Central Limit Theorem. Suppose that X and Y = −X are two random variables defined on the same probability space (Ω, F, P). If, further, X ∼ N(0, 1), i.e., X has the standard Gaussian distribution, then so does Y. However, for ε > 0, the probability P(|X − Y| > ε) is rather large, so despite having identical distributions, X and Y are not close in probability, nor is a.s. convergence applicable here. We therefore need a notion that detects the proximity of the distributions of X and Y regardless of their realizations on the probability space, even if X and Y are defined on different probability spaces.
Definition IV.23 (Convergence in distribution). Let X1 , X2 , . . . be a sequence of real-valued ran-
dom variables with distribution functions FX1 , FX2 , . . ., and let X be another real-valued random
variable, with distribution function F_X. The random variables X_n → X in distribution (denoted X_n →^d X), if
F_{X_n}(x) → F_X(x)
for every continuity point x ∈ R of F_X (i.e. if F_X is continuous at x ∈ R, then F_{X_n}(x) → F_X(x)).
In particular, the definition of convergence in distribution does not require the random variables to
be defined on the same probability space, and only involves their distribution functions (unlike con-
vergence a.s. or in probability). Even though we speak about convergence of the random variables,
what is meant is the convergence of the associated probability distributions PXn to a limit distribu-
tion PX . The requirement of convergence for continuity points only is important or non-trivial, as
demonstrated in Example IV.25 below.
Example IV.24. Let Xn have the distribution function
F_{X_n}(x) = 0 for x < 0,    F_{X_n}(x) = 1 − (1 − x/n)^n for x ∈ [0, n),    F_{X_n}(x) = 1 for x ≥ n.
Then, for all x ∈ R, as n → ∞,
lim_{n→∞} F_{X_n}(x) = 1 − e^{−x} for x ∈ [0, ∞), and 0 for x < 0,
which is identified as the distribution function of the exponential random variable X ∼ exp(1).
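A quick numerical check of this limit (an illustrative sketch; the chosen x and values of n are arbitrary):

```python
# Sketch: F_{X_n}(x) = 1 - (1 - x/n)^n approaches 1 - exp(-x) as n grows.
import numpy as np

x = 1.5
for n in [10, 100, 1000, 10**6]:
    print(n, 1 - (1 - x / n)**n, 1 - np.exp(-x))
```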
Example IV.25. Let U_n ∼ U(0, 1/n) be a sequence of random variables uniformly distributed in the interval (0, 1/n). Their corresponding distribution functions are
F_{U_n}(x) = 0 for x ≤ 0,    F_{U_n}(x) = nx for 0 < x < 1/n,    F_{U_n}(x) = 1 for x ≥ 1/n.
Then for all x ≠ 0,
lim_{n→∞} F_{U_n}(x) = F(x) := 0 for x < 0, and 1 for x ≥ 0,
that is, the distribution function of the constant random variable U = 0 a.s. At x = 0,
lim_{n→∞} F_{U_n}(0) = 0 ≠ 1 = F_U(0).
However, since F_U is not continuous at x = 0, the convergence is not imposed at x = 0, and therefore, U_n → U in distribution indeed.
Example IV.26. Let X_n ∼ δ_n, i.e. X_n = n a.s. Then
F_{X_n}(x) = 0 for x < n, and F_{X_n}(x) = 1 for x ≥ n.
The limit F_{X_n}(x) → 0 for every x ∈ R, so F(x) ≡ 0 is a candidate for a limit distribution function. However, since F ≡ 0 is not a distribution function of any random variable, X_n does not converge in distribution to a limit.
Example IV.27. Let X_n ∼ Bin(1, 1/2 + 1/n), i.e. X_n is a single coin toss with probability of heads 1/2 + 1/n. Then
F_{X_n}(x) = 0 for x < 0,    F_{X_n}(x) = 1/2 − 1/n for 0 ≤ x < 1,    F_{X_n}(x) = 1 for x ≥ 1.
Then, for every x ∈ R, F_{X_n}(x) → F_X(x), where F_X is the distribution function of X ∼ Bin(1, 1/2), so X_n → X in distribution.
Proposition IV.28. If X and {Xn }n≥1 are random variables, all defined on the same probability
space (Ω, F , P ), and Xn → X in probability, then Xn → X in distribution.
Proof. Let x ∈ R be a continuity point of F_X, and let ε > 0. Then, by the said continuity, there exists a δ > 0, so that
F_X(x − δ) > F_X(x) − ε/2    (IV.16)
and
F_X(x + δ) < F_X(x) + ε/2    (IV.17)
(recall that F_X is monotone increasing).
Since X_n → X in probability, there exists a number N ≥ 1 sufficiently large, so that for all n ≥ N,
P(|X_n − X| > δ) < ε/2.    (IV.18)
Therefore,
F_{X_n}(x) = P(X_n ≤ x) = P(X_n ≤ x, X ≤ x + δ) + P(X_n ≤ x, X > x + δ)
           ≤ P(X ≤ x + δ) + P(|X_n − X| > δ) ≤ (F_X(x) + ε/2) + ε/2 = F_X(x) + ε,    (IV.19)
by (IV.17) and (IV.18), and
F_X(x − δ) = P(X ≤ x − δ) = P(X ≤ x − δ, X_n ≤ x) + P(X ≤ x − δ, X_n > x)
           ≤ P(X_n ≤ x) + P(|X_n − X| > δ) < F_{X_n}(x) + ε/2,
by (IV.18).
Combining this with (IV.16), we have
F_{X_n}(x) > F_X(x − δ) − ε/2 > F_X(x) − ε.    (IV.20)
Finally, (IV.19) together with (IV.20) yield
F_{X_n}(x) − ε ≤ F_X(x) < F_{X_n}(x) + ε,
i.e. |F_{X_n}(x) − F_X(x)| ≤ ε, holding for all n ≥ N, which, since ε > 0 was arbitrary, implies that F_{X_n}(x) → F_X(x), as asserted.


Hence, for random variables defined on the same probability space,
X_n → X a.s.  ⇒  X_n →^P X  ⇒  X_n →^d X.
Convergence in distribution is strictly weaker than convergence in probability, as can be inferred from Example IV.27. Indeed, assume that in that example {X_n}_{n≥1} and X are independent and defined on the same probability space. Then, for 0 < ε < 1,
P(|X_n − X| > ε) = P(X_n ≠ X) = (1/2)(1/2 + 1/n) + (1/2)(1/2 − 1/n) = 1/2,
which does not vanish, so that X_n does not converge in probability to X. However, if the limit random variable is a.s. constant, then convergence in probability and in distribution are equivalent.
Proposition IV.29 (Homework assignment). If {Xn }n≥1 are defined on the same probability space
(Ω, F, P) and, as n → ∞, X_n →^d c ∈ R a constant, then X_n →^P c.
If X_n →^d X, and the distribution function F_X of X is continuous everywhere, then it means that for all x ∈ R, F_{X_n}(x) → F_X(x). It then follows that, for every −∞ ≤ a < b ≤ +∞,
P(a < X_n ≤ b) = F_{X_n}(b) − F_{X_n}(a) → F_X(b) − F_X(a) = P(a < X ≤ b).    (IV.21)
In fact, the convergence F_{X_n}(x) → F_X(x) could be made uniform w.r.t. x ∈ R (again, under the continuity assumption of F_X on R), i.e.
sup_{x∈R} |F_{X_n}(x) − F_X(x)| → 0,
known as Pólya's theorem.


The following result, whose proof is outside the scope of this module, relates convergence in distribution, expectations of test functions, and the characteristic function.
Theorem IV.30. Let X1 , X2 , . . . be a sequence of real-valued random variables with characteristic
functions φX1 , φX2 , . . ., and let X be another real-valued random variable, with characteristic function
φX . Then the following are equivalent:
(1) X_n →^d X.
(2) For all continuous bounded functions f : R → R, E[f(X_n)] → E[f(X)].
(3) For all u ∈ R, φ_{X_n}(u) → φ_X(u).
The condition (2) of Theorem IV.30 is equivalent to
∫ f dP_{X_n} → ∫ f dP_X
for all bounded continuous f : R → R, known as weak convergence of the (probability) measures P_{X_n} to P_X (equivalent to X_n →^d X), a notion of convergence of distributions rather than of the random variables themselves. The condition (3) of Theorem IV.30 is merely pointwise convergence of the characteristic functions.
Theorem IV.31 (Central Limit theorem (CLT)). Let {X_n}_{n≥1} be a sequence of i.i.d. random variables defined on the same probability space, of mean zero and variance 1. Then (X_1 + . . . + X_n)/√n →^d Z with Z ∼ N(0, 1), i.e., for all −∞ ≤ a < b ≤ +∞,
P( a ≤ (X_1 + . . . + X_n)/√n ≤ b ) → (1/√(2π)) ∫_a^b e^{−y²/2} dy,
as n → ∞.
Proof. Let
φ(u) = φ_{X_1}(u) = E[e^{iuX_1}]    (IV.22)
be the characteristic function of X_1 (and thus of all the X_n). Since, by assumption, E[X_1²] = Var(X_1) = 1 < ∞, we can differentiate (IV.22) under the integral (expectation) sign twice to obtain
φ(0) = E[e^0] = 1,    φ'(0) = E[iX_1] = 0,    φ''(0) = E[(iX_1)²] = −1.
Hence, by Taylor expanding φ(·) at the origin, we have
φ(u) = 1 − u²/2 + o_{u→0}(u²)    (IV.23)
(that means that (φ(u) − (1 − u²/2))/u² → 0 as u → 0).
Now denote the random variable Y_n := (X_1 + . . . + X_n)/√n, and let φ_n(u) = E[e^{iuY_n}] be its characteristic function. Then, by the independence of the {X_n}, and substituting (IV.23), we have
φ_n(u) = E[ e^{i(u/√n)(X_1 + . . . + X_n)} ] = ( E[ e^{i(u/√n)X_1} ] )^n = ( φ(u/√n) )^n = ( 1 − u²/(2n) + o(u²/n) )^n.    (IV.24)
We would like to show that the latter expression converges, for every fixed u, as n → ∞, to the characteristic function e^{−u²/2} of the standard Gaussian random variable, which would then conclude the proof of the Central Limit theorem via Theorem IV.30(3). While it is possible to see it directly, let us first take the logarithm of the expression on the l.h.s. and the r.h.s. of (IV.24). A certain care should be observed though, since, in general, φ_n(u) attains complex values.
However, in this case, for every fixed u, the expression u²/(2n) is arbitrarily small for n large, and log(1 + z) = z + o_{z→0}(z), so there is no problem with the logarithm. Indeed, we write
log φ_n(u) = n log( 1 − u²/(2n) + o(u²/n) ) = −u²/2 + o_{n→∞}(1).
Taking an exponential of both sides of this result, we have
φ_n(u) = e^{−u²/2} · e^{o(1)} → e^{−u²/2}
as n → ∞, which, as was mentioned above, is sufficient to yield the convergence in distribution to the standard Gaussian.


A few remarks are due. If X_1, X_2, . . . are i.i.d. with mean E[X_1] = µ and Var(X_1) = σ² (instead of mean 0 and unit variance), then the Central Limit theorem is applicable to Z_n = (X_n − µ)/σ, so that
(1/√n) Σ_{i=1}^n Z_i = ( (1/n) Σ_{i=1}^n X_i − µ ) / (σ/√n) →^d N(0, 1).
Hence, for large n, the distribution of (1/n) Σ_{i=1}^n X_i ≈ µ + (σ/√n) Z, with Z ∼ N(0, 1).
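A simulation sketch of the CLT in this normalised form (the choice of Exp(1) summands, the value of n, the number of repetitions, and the grid of b's are all arbitrary; the fact that a sum of n independent Exp(1) variables is Gamma(n, 1) is used only to sample the sums directly):

```python
# Sketch: CLT for i.i.d. Exp(1) summands (mu = sigma = 1). The standardised sample
# mean is compared with the standard Gaussian distribution function.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n, reps = 2_000, 100_000
S = rng.gamma(shape=n, scale=1.0, size=reps)   # S = X_1 + ... + X_n for Exp(1) summands
z = (S - n) / np.sqrt(n)                       # sqrt(n) (X_bar - mu) / sigma

for b in [-1.0, 0.0, 1.0, 1.96]:
    print(b, np.mean(z <= b), norm.cdf(b))     # empirical probability vs. Phi(b)
```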
Here the integrability conditions E[|X_1|] < ∞ and E[X_1²] < ∞ are important for the CLT to hold. For Cauchy distributed i.i.d. random variables with the density function f(x) = 1/(π(1 + x²)), one can compute (Homework assignment) the characteristic function φ_{X_1}(u) = e^{−|u|}, and then φ_{X̄_n}(u) = e^{−|u|} for the sample mean X̄_n = (X_1 + . . . + X_n)/n. Hence, the sample mean X̄_n is Cauchy distributed for every n ≥ 1, and the Central Limit theorem does not hold in this case (which is not a contradiction, since the sufficient conditions are not satisfied).
Theorem IV.31 is only the simplest version of the Central Limit theorem. There are many results that hold for dependent random variables (where some form of approximate independence must be assumed) or for random variables that are not identically distributed.
Example IV.32 (Confidence interval). Suppose X1 , X2 , . . . are i.i.d. random variables, whose vari-
ance Var(X1 ) = σ 2 < ∞ is known, and whose mean we wish to estimate. (For example, these could
be independent samples of the same random variable.) By the strong law of large numbers,
X̄_n := (1/n) Σ_{i=1}^n X_i → µ
a.s., which is a natural estimate, and we would like to find a confidence interval for this estimate for a given large n. By the CLT,
√n(X̄_n − µ)/σ →^d Z ∼ N(0, 1).
Then, for a < b,
P( X̄_n − bσ/√n ≤ µ < X̄_n − aσ/√n ) = P( a < √n(X̄_n − µ)/σ ≤ b ) → P(a < Z ≤ b)
as n → ∞. One can then observe from the Gaussian distribution tables that, for b = −a = 1.96, P(−1.96 < Z ≤ 1.96) = 0.95 . . ., so that
P( X̄_n − 1.96σ/√n ≤ µ < X̄_n + 1.96σ/√n ) = P( −1.96 < √n(X̄_n − µ)/σ ≤ 1.96 ) → 0.95 . . . .
Hence, µ ∈ [X̄_n − 1.96σ/√n, X̄_n + 1.96σ/√n] with confidence asymptotic to 95%, i.e. the said interval is a 95% confidence interval for the mean µ.
In case σ is unknown, one usually replaces σ² with the sample variance
S² := (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)².
(Here the division is by n − 1 and not by n as this way the estimator is unbiased, i.e. E[S²] = σ².)
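A sketch of the resulting recipe on simulated data (the true mean, σ, and n below are arbitrary stand-ins; in practice the X_i are the observed data):

```python
# Sketch: asymptotic 95% confidence interval for the mean, with known sigma.
# Simulated data and parameters are arbitrary placeholders for real observations.
import numpy as np

rng = np.random.default_rng(5)
mu_true, sigma, n = 3.0, 2.0, 400
X = rng.normal(mu_true, sigma, n)               # pretend these are the observations

X_bar = X.mean()
half_width = 1.96 * sigma / np.sqrt(n)
print(X_bar - half_width, X_bar + half_width)   # interval covering mu with ~95% confidence

# If sigma is unknown, replace it by the sample standard deviation:
S = X.std(ddof=1)                               # divides by n - 1, so S^2 is unbiased
print(X_bar - 1.96 * S / np.sqrt(n), X_bar + 1.96 * S / np.sqrt(n))
```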
Example IV.33 (Hypothesis testing). Consider the same situation of X1 , X2 , . . . i.i.d. with mean
µ unknown, and variance σ 2 known. Let H0 be the hypothesis that µ = µ0 for some µ0 ∈ R, and
the alternative hypothesis H_A, that µ ≠ µ_0. Under H_0, √n(X̄_n − µ_0)/σ →^d N(0, 1), so for n large, X̄_n ≈ N(µ_0, σ²/n), and we can test the latter based on the observed data.
For example, suppose µ_0 = 0, σ = 1, and for n = 100 we observed X̄_n = 0.16. Then, under H_0, X̄_n ∼ N(0, 1/100) approximately, so the probability of observing X̄_n = 0.16 or a larger deviation is
P(|X̄_n| ≥ 0.16 | H_0) ≈ P(|Z| ≥ 0.16 · 10) ≈ 0.11.
If we take the significance level of our test to be, for example, 5% (this is the threshold fixed in advance: if the probability above falls below it, the null hypothesis is rejected), then, since 0.11 > 0.05, the null hypothesis H_0 is not rejected.
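The numbers in this example can be reproduced with a short computation (an illustrative sketch; scipy's normal distribution function plays the role of the Gaussian tables):

```python
# Sketch: two-sided p-value for the example above (mu_0 = 0, sigma = 1, n = 100, X_bar = 0.16).
import numpy as np
from scipy.stats import norm

n, sigma, x_bar, mu_0 = 100, 1.0, 0.16, 0.0
z = np.sqrt(n) * (x_bar - mu_0) / sigma          # observed value of the test statistic, 1.6
p_value = 2 * (1 - norm.cdf(abs(z)))             # P(|Z| >= 1.6) ~ 0.11
print(z, p_value)                                # 0.11 > 0.05, so H_0 is not rejected at 5%
```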
Example IV.34 (Binomial distribution). Let Y_n ∼ Bin(n, p), where n is large. Then Y_n has the same distribution as Σ_{i=1}^n X_i, where the X_i ∼ Bin(1, p) are i.i.d., i.e. it measures the number of heads in n successive tosses of the same p-coin. Then, since E[X_1] = p and Var(X_1) = p(1 − p), the CLT gives
√n(X̄_n − p)/√(p(1 − p)) = ( Σ_{i=1}^n X_i − np ) / √(np(1 − p)) →^d Z ∼ N(0, 1),
as n → ∞. Hence, for large n, Y_n ≈ np + √(np(1 − p)) Z ∼ N(np, np(1 − p)), that is, the Gaussian with the same mean and variance as Y_n. Since the limit Gaussian distribution is continuous everywhere, it follows from Pólya's theorem that
sup_{y∈R} | P( (Y_n − np)/√(np(1 − p)) ≤ y ) − P(Z ≤ y) | = sup_{t∈R} | P(Y_n ≤ t) − P( Z ≤ (t − np)/√(np(1 − p)) ) | → 0.
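A numerical sketch of this Gaussian approximation (the values of n, p, and the evaluation points are arbitrary choices):

```python
# Sketch: compare the Bin(n, p) distribution function with its Gaussian approximation.
import numpy as np
from scipy.stats import binom, norm

n, p = 400, 0.3
mean, sd = n * p, np.sqrt(n * p * (1 - p))       # np and sqrt(np(1-p))

for t in [100, 110, 120, 130]:
    print(t, binom.cdf(t, n, p), norm.cdf((t - mean) / sd))
```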
