Lecture Notes: Fundamentals of Probability Theory
IGOR WIGMAN
Contents
0. Preliminaries: countable sets
I. Measures
I.1. Motivation
I.2. Discrete measures
I.3. Sets generating σ-algebras
I.4. Extending to measures
I.5. Independence
I.6. The Borel-Cantelli lemma
II. Measurable functions and random variables
II.1. Measurable functions
II.2. Random variables
II.3. Modes of convergence
III. Lebesgue theory of integration
III.1. Lebesgue integrals: definition and fundamental properties
III.2. Density functions
III.3. Transformation of random variables
III.4. Product measures
IV. Limit laws and Gaussian random variables
IV.1. Strong and weak laws of large numbers
IV.2. Characteristic function
IV.3. Gaussian random variables
IV.4. Convergence in distribution and Central Limit Theorem
This course was inspired to a large extent by the lecture notes of the course given by Kolyan Ray at King's College London during the academic year 2019–2020.
(2) We say that f is surjective (surjection), if for every c ∈ B there exists a ∈ A so that
f (a) = c. In other words, f is surjective, if every element c of B has at least one element in
A mapping to c by f .
(3) We say that f is bijective (bijection), if f is injective and surjective.
If f : A → B is bijective, then it is invertible, i.e. there exists a map g = f −1 : B → A so that
g ◦ f = idA and f ◦ g = idB . If f : A → B is a map, C ⊆ A is a subset of A, and D ⊆ B is a subset
of B, then the (direct) image f (C) = {f (a) : a ∈ C} ⊆ B of C is the collection of all the images of
elements in C, and the inverse image f −1 (D) = {a ∈ A : f (a) ∈ D} ⊆ B of D is the collection of
all the elements of A mapping into D, regardless of whether f is invertible or not. If f : A → B is
injective, then the restriction of its range f : A → f (A) is bijective.
For A a finite set, its cardinality |A| is the number of elements in A. It is easy to check that if A, B are both finite and there exists an injection f : A → B, then |A| ≤ |B|; if there exists a surjection f : A → B, then |A| ≥ |B|. In light of the above, if there exists a bijection f : A → B, then |A| = |B|. We then extend the definition of cardinality to arbitrary sets (finite or infinite) by turning the latter property into a definition.
Definition 0.2 (Cardinality, countability). (1) For two arbitrary sets A, B, |A| = |B| (we say
that A, B have the same cardinality), if there exists a bijection f : A → B.
(2) A set A is countably infinite if there exists a bijection f : A → N, with N = {1, 2, 3, . . .}
the set of positive integers. In this case we denote the cardinality of A: ℵ0 = |N|.
(3) We say that A is countable if either A is countably infinite or A is finite.
(4) A set A is uncountable if A is not countable. In particular, A has to be infinite.
The above defines an equivalence relation A ∼ B if |A| = |B|, with the corresponding equivalence classes being all the possible different cardinalities of sets. The countably infinite sets form the smallest infinite cardinality, in the sense that every infinite set contains a countably infinite subset (see also Lemma 0.7 below); it is the equivalence class of the positive integers. If A is countably infinite, and
if f : A → N is the bijection postulated by the definition of countability, then we may list all the
elements of A sequentially:
A = {a1 = f −1 (1), a2 = f −1 (2), . . .};
if A is finite, then the same makes sense, except that the corresponding sequence is finite. In other
words, A is countable if it could be enumerated as a sequence (finite or infinite).
Example 0.3. Let Z be the collection of all integers. We define f : Z → N by
f(n) = 2n for n > 0, and f(n) = 1 − 2n for n ≤ 0.
That is, we enumerate Z = {0, 1, −1, 2, −2, . . .} (here 0 = f^{−1}(1), 1 = f^{−1}(2), −1 = f^{−1}(3), . . .). Since f is a bijection (i.e., our enumeration indeed covers each element of Z precisely once), the set Z is countable. It is a bit counter-intuitive, as Z properly contains N and is, informally, "twice as big", but it nevertheless has the same cardinality as N.
Example 0.4. Any infinite subset of N is countably infinite, since it could be enumerated.
Example 0.5. The set N × N = {(n, m) : n, m ∈ N} is countable. To see that, we can enumerate
N × N in the following way: (1, 1), (2, 1), (1, 2), (3, 1), (2, 2), (1, 3), (4, 1), (3, 2), (2, 3), (1, 4), . . ., i.e. if
one draws a picture (which is instructive), this corresponds to listing all elements of each consecutive
diagonal, starting from the bottom right one. This corresponds to the function h : N × N → N,
h((a, b)) = (a + b − 1)(a + b − 2)/2 + b,
which one can check agrees with the enumeration above.
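The following short Python sketch (an illustration, not part of the original notes; the function names are ad hoc) enumerates N × N diagonal by diagonal as above and checks, for the first few pairs, that the formula for h reproduces the position in the enumeration.

```python
from itertools import islice

def diagonal_enumeration():
    """Yield the pairs (a, b) of positive integers diagonal by diagonal:
    (1,1), (2,1), (1,2), (3,1), (2,2), (1,3), ..."""
    s = 2  # s = a + b labels the diagonal
    while True:
        for b in range(1, s):      # within a diagonal, b = 1, 2, ..., s-1
            yield (s - b, b)       # a = s - b decreases as b increases
        s += 1

def h(a, b):
    """Conjectured position of (a, b) in the enumeration above."""
    return (a + b - 1) * (a + b - 2) // 2 + b

# check the first 20 pairs: h((a, b)) should equal 1, 2, 3, ...
for n, (a, b) in enumerate(islice(diagonal_enumeration(), 20), start=1):
    assert h(a, b) == n
print("h matches the diagonal enumeration on the first 20 pairs")
```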
We can infer from examples 0.8 and 0.9 that “most” of the real numbers are irrational.
Lemma 0.10. A countable union of countable sets is countable. Namely, if {A_α}_{α∈I} is a family of sets, where I is countable, and for all α ∈ I, A_α is countable, then A := ⋃_{α∈I} A_α is a countable set.
I. Measures
I.1. Motivation. One can recall that a fair die falls on each of its 6 sides with probability 1/6. In this case, the sample space of all possible outcomes of a die toss is the finite set Ω = {1, 2, 3, 4, 5, 6}, and one may ask what is the probability that the outcome lies in a given set (event) A ⊆ Ω, equalling P(A) = (1/6) · |A|, i.e. the number of elements of A times the probability 1/6 of the outcome being each of the elements. Alternatively, one may consider an arbitrary finite sample space Ω = {x_1, . . . , x_k} with k = |Ω|, endow each element of Ω with a given probability p_i =: P(x_i) ≥ 0, subject to the constraint Σ_{i=1}^k p_i = 1, and set, for an event A ⊆ Ω,
P(A) = Σ_{x∈A} P(x).
In some cases, this construction could be generalized, more or less as is, namely for countable sample
spaces (see §I.2). However, what if the sample space is uncountably infinite?
Suppose, for example, that a random number x ∈ [0, 1] is drawn, for a moment without specifying what is meant by "random number in [0, 1]". We could ask questions like "what is the probability that 0 < x < 1/2", or, more generally, for an interval (a, b) ⊆ [0, 1], "what is the probability that x ∈ (a, b)". It would make perfect sense that, if indeed x is drawn randomly with probability "spread equally" in [0, 1], then the probability of x belonging to an interval is proportional to its length, so that the probability of x ∈ (a, b) is b − a, and, in particular, the probability that 0 < x < 1/2 is 1/2.
One could ask more elaborate questions, like "what is the probability that x is rational", or "what is the probability that in the binary expansion x = Σ_{i=1}^∞ b_i/2^i, b_i ∈ {0, 1}, asymptotically half of the bits {b_i}_{i≥1} are b_i = 0", or "what is the probability that in the ternary expansion x = Σ_{i=1}^∞ a_i/3^i, a_i ∈ {0, 1, 2}, only the digits a_i = 0, 2 appear" (i.e., a_i = 1 never occurs). Since, as we have seen above, the "vast majority" of real numbers are irrational, it would be reasonable to expect that, if the answer to the former question makes sense, then this probability is precisely 0, whereas the answers to the latter two questions are less obvious, if they make sense at all.
But how do we formulate (and answer) all the above questions in a rigorous way, and which of the above make sense, i.e. what is the repertoire of all possible questions whose answer makes sense in this theory? It turns out that the "correct" answer for the repertoire of questions is a "σ-algebra" ("sigma-algebra"), whose elements are, in this language, the events, whereas the requested probability is a measure (in our case, a probability measure). The σ-algebra is going to be the domain of the measure.
Definition I.1 (σ-algebra). Let E be a set (finite or infinite, countable or uncountable). A σ-algebra E on E is a collection of subsets of E, subject to the following axioms:
(1) The empty set is in E, i.e. ∅ ∈ E.
(2) For all A ∈ E, its complement E \ A ∈ E.
(3) For every sequence {A_n}_{n≥1} ⊆ E of elements of E, the union ⋃_{n=1}^∞ A_n ∈ E, i.e. countable unions of elements of E are in E.
The tuple (E, E ) is called a measurable space, and each A ∈ E is called a measurable set
(that later, in probabilistic contexts will be referred to as “event”).
The “σ” in “σ-algebra” designates the possible countably infinite unions in (3) of Definition I.1.
Otherwise, if these are only allowed to be finite, the corresponding collection of subsets of E is an
algebra (which we will not use in this module).
Example I.2. The following examples are σ-algebras for E arbitrary. It is easy to check directly
that they satisfy (1)-(3) of Definition I.1.
(1) The trivial σ-algebra E = {∅, E}.
(2) Take A ⊆ E arbitrary, and E = EA = {∅, A, E \ A, E}.
(3) Assuming E ≠ ∅, take E the power set of E:
E = P(E) = {A : A ⊆ E}.
This σ-algebra is commonly used for E countable (e.g. E = Z).
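For a finite set E one can check the axioms of Definition I.1 by brute force. The following Python sketch (an illustration, not part of the notes; the helper names are ad hoc) verifies them for the σ-algebra E_A = {∅, A, E \ A, E} of item (2); since E is finite, closure under countable unions reduces to closure under pairwise unions.

```python
def is_sigma_algebra(E, family):
    """Check Definition I.1 for a finite set E and a family of subsets of E.
    For finite E, closure under countable unions reduces to pairwise unions."""
    family = {frozenset(S) for S in family}
    if frozenset() not in family:                                   # axiom (1)
        return False
    if any(frozenset(E) - S not in family for S in family):         # axiom (2)
        return False
    return all(S | T in family for S in family for T in family)     # axiom (3)

E = {1, 2, 3, 4, 5, 6}
A = {1, 2}
print(is_sigma_algebra(E, [set(), A, E - A, E]))   # True: this is E_A
print(is_sigma_algebra(E, [set(), A, E]))          # False: the complement of A is missing
```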
Lemma I.3. If (E, E ) is a measurable space, then:
(1) E ∈ E .
(2) A countable intersection of elements of E is in E, i.e. if {A_n}_{n≥1} ⊆ E is a family of elements of E, then ⋂_{n=1}^∞ A_n ∈ E.
(3) If A, B ∈ E , then B \ A ∈ E .
Proof. (1) Follows directly from (1) and (2) of Definition I.1.
(2) By De Morgan's law,
⋂_{n=1}^∞ A_n = E \ ⋃_{n=1}^∞ (E \ A_n),
and the statement follows directly from (2) and (3) of Definition I.1.
(3) We have B \ A = B ∩ (E \ A) ∈ E , by (2) of Definition I.1 and (2) of Lemma I.3.
With the stage now set, we are going to discuss the central notion of a measure, whose domain is going to be the postulated σ-algebra.
Definition I.4 (Measure). Let (E, E ) be a measurable space. A measure on (E, E ) is a function
µ : E → [0, ∞] assigning a non-negative real number (or infinity) to every element of E , satisfying
the following properties: µ(∅) = 0, and for every sequence {A_n}_{n≥1} of disjoint elements in E (i.e. for every n ≠ m, A_n ∩ A_m = ∅),
µ(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ µ(A_n).
(We say that µ is σ-additive, i.e. additive w.r.t. countable families of disjoint elements of E. The disjointness of the {A_n} is crucial: otherwise we could take all the A_n to be a fixed set A, and the identity would force µ(A) = Σ_{n≥1} µ(A), i.e. µ(A) ∈ {0, ∞}.)
The triplet (E, E , µ) is called a measure space.
Definition I.5 (Probability measure). If (E, E , µ) is a measure space so that µ(E) = 1, then we say
that µ is a probability measure, and (E, E , µ) is a probability space. In this case the notation
(Ω, F , P ) is often used.
I.2. Discrete measures. Let E = Z, E = P(Z) the power set, and suppose that µ is an arbitrary measure on (Z, P(Z)). Then, since any A ⊆ Z is countable, we may write A = ⋃_{k∈A} {k}, and thus, by the σ-additivity in Definition I.4,
µ(A) = Σ_{k∈A} µ({k})    (I.1)
(mind that {k} is a singleton, the set containing the one element k, which is not the same as that element). Define the function m : Z → [0, ∞] by m(k) := µ({k}), so that (I.1) reads
µ(A) = Σ_{k∈A} m(k).    (I.2)
Conversely, any function m : Z → [0, ∞] defines a measure µ via (I.2). The function m(·) is called
the mass function of µ. The measure µ is a probability measure, if and only if m(·) satisfies
Σ_{k∈Z} m(k) = 1,
in which case m is the probability mass function. For example, a Poisson random variable X
with parameter λ > 0 has the probability mass function
m(k) = P(X = k) = (λ^k / k!) e^{−λ} for k ≥ 0, and m(k) = 0 otherwise.
Similarly, the Binomial distribution Bin(n, p), n ≥ 0, p ∈ [0, 1], has m(k) = \binom{n}{k} p^k (1 − p)^{n−k} for k = 0, 1, . . . , n, and otherwise m(k) = 0.
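As a quick illustration of (I.2) (a Python sketch, not part of the notes; the helper names are ad hoc): a discrete probability measure is determined by its mass function, and P(A) is recovered by summing m over A. Here we use the Poisson mass function above.

```python
from math import exp, factorial

def poisson_pmf(lam):
    """Probability mass function m(k) of Pois(lam) on the non-negative integers."""
    return lambda k: exp(-lam) * lam**k / factorial(k) if k >= 0 else 0.0

def measure_of(A, m):
    """P(A) = sum of m(k) over k in A, as in (I.2)."""
    return sum(m(k) for k in A)

m = poisson_pmf(2.0)
print(measure_of({0, 1, 2}, m))        # P(X <= 2) for X ~ Pois(2)
print(measure_of(range(0, 100), m))    # ~1: the total mass of the probability measure
```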
I.3. Sets generating σ-algebras. For E = [0, 1], we would like to define a (probability) measure modelling a point x ∈ E drawn randomly (uniformly), as discussed in §I.1. If such a measure µ exists, then for an interval (a, b) ⊆ E it must satisfy µ((a, b)) = b − a. In anticipation of extending µ to a measure, it is essential to first discuss its domain, i.e. a σ-algebra that contains all the intervals (a, b) ⊆ E, the primary concern of this section. This is a particular case of the following general situation:
Definition I.6 (σ-algebra generated by a set.). Let E be an arbitrary set (finite or infinite, countable
or uncountable), and A some collection of subsets of E. The σ-algebra generated by A is the
intersection of all σ-algebras containing A:
σ(A) := ⋂ { E′ : E′ a σ-algebra on E with A ⊆ E′ }.    (I.3)
Equivalently, B is the σ-algebra generated by all open sets in R (those sets A ⊆ R that contain
every point together with some neighbourhood). The σ-algebra B is properly contained in the power
set P(R), i.e. there exist subsets of R that are not Borel, but constructing these is nontrivial.
Problem I.10. Show that B is generated by the open intervals {(a, b) : a < b}, or, alternatively,
by the semi-closed intervals {(a, b] : a < b}.
I.4. Extending to measures. In this section we are concerned with the situation when we are
given a collection A of subsets of E, and the values of µ(A) for all A ∈ A , and are interested in
extending µ to the whole σ-algebra σ(A ). Both the existence of such a measure and its uniqueness
are of fundamental importance.
Definition I.11 (π-system). Let E be a set, and A a collection of subsets of E. We say that A
is a π-system, if (i) ∅ ∈ A , and (ii) for all A, B ∈ A , A ∩ B ∈ A .
Example I.12. The collections {(−∞, a] : a ∈ R} ∪ {∅}, {(a, b) : a < b} ∪ {∅} and {(a, b] : a <
b} ∪ {∅} are all π-systems that generate the Borel σ-algebra.
From this point on, we will not care about ∅, so, for instance, we treat the collection {(−∞, a] : a ∈ R} as if it were a π-system. In what follows we will be concerned with proving the following important result, asserting the uniqueness of the extension of a measure from a π-system to the generated σ-algebra.
Theorem I.13 (Uniqueness of extension to measure). Let A be a π-system on E, E = σ(A ), and
µ1 , µ2 be two measures on (E, E ). Then, if µ1 = µ2 on A (i.e. for every A ∈ A , µ1 (A) = µ2 (A)),
and, further, µ1 (E) = µ2 (E) < ∞, it follows that µ1 = µ2 on E .
That is, prescribing the values of a measure on a π-system A , implies uniqueness of the measure
on σ(A ) with the given values. For example, prescribing the values {µ((−∞, x]) : x ∈ R} (thinking
of these as P (X ≤ x)), prescribes uniquely a probability measure on the Borel σ-algebra B(R) on
R. However, we did not assert that such a measure exists at all, and, in general, unless we impose further assumptions, it might not exist.
Most of the rest of this section is dedicated to proving Theorem I.13. The following notion of
d-systems is important in the course of the proof of the uniqueness of the extension from π-systems
to the full σ-algebra.
Definition I.14 (d-system (λ-system)). Let E be a set, and A a collection of subsets of E. We
say that A is a d-system (also called λ-system), if it satisfies the following axioms:
i. E ∈ A .
ii. For all A ∈ A , E \ A ∈ A , i.e. the complements of elements of A are in A .
iii. For every disjoint sequence {A_n}_{n≥1} ⊆ A of elements in A, the union ⋃_{n=1}^∞ A_n is in A.
i.e. σ(A) is the intersection of all d-systems containing A. This certainly implies the statement of Proposition I.16, since the postulated D is one of those d-systems on the r.h.s. of (I.4). Denote
D_0 := ⋂_{D a d-system, A ⊆ D} D,
i.e. the set on the r.h.s. of (I.4), which we claim to be, in fact, a σ-algebra.
This is sufficient to yield σ(A ) ⊆ D0 , since σ(A ) is the minimal σ-algebra containing A (see Lemma
I.7), whereas the inclusion σ(A ) ⊇ D0 is trivial from (I.4), since σ(A ) is one of those D on the r.h.s.
of (I.4).
It then remains to prove that D0 is a σ-algebra. On one hand, it is easy to check that it is a
d-system as an intersection of d-systems. In what follows we will establish that D0 is a π-system,
which will conclude the proof of Proposition I.16, thanks to Lemma I.15. First, since ∅ ∈ A ⊆ D0 ,
it follows that ∅ ∈ D0 , which is (i) of Definition I.11. We are then reduced to proving that for all
A, B ∈ D0 , A ∩ B ∈ D0 . We prove that in 3 steps:
(1) Step 1: If A, B ∈ A , then A ∩ B ∈ A ⊆ D0 , since A is a π-system.
(2) Step 2: We fix B ∈ D0 and consider the collection
CB := {A ⊆ E : A ∩ B ∈ D0 }
of subsets of E, eventually (after Step 3) aiming to prove that D_0 ⊆ C_B; this is merely a rephrasing of the statement of Proposition I.16. By Step 1, for every B ∈ A (rather than B ∈ D_0, unfortunately, something we will have to relax during Step 3), A ⊆ C_B.
Therefore, for B ∈ A to prove that D0 ⊆ CB , it is sufficient to prove that CB is a d-system,
since then it is one of the sets in the intersection on the r.h.s. of (I.4) (which is, by definition,
D0 ).
First, clearly, E ∈ CB , since E ∩ B = B, which is (i) of Definition I.14. Next, if A ∈ CB ,
then
(E \ A) ∩ B = E \ ((A ∩ B) ∪ (E \ B)) ∈ D0 ,
since D0 is a d-system, and (A ∩ B) ∪ (E \ B) is a union of disjoint sets, so that E \ A ∈ CB ;
this is (ii) of Definition I.14. Finally, if {An }n≥1 is a disjoint sequence of elements in CB , then
(⋃_{n=1}^∞ A_n) ∩ B = ⋃_{n=1}^∞ (A_n ∩ B) ∈ D_0,
since {A_n ∩ B}_{n≥1} is a disjoint sequence in the d-system D_0, so that ⋃_{n=1}^∞ A_n ∈ C_B, which is
(iii) of Definition I.14. The above shows that CB is a d-system, for every B ∈ A , which, as it
was mentioned above, implies that CB ⊇ D0 , i.e. for every A ∈ D0 and B ∈ A , A ∩ B ∈ D0 .
(3) Step 3: This time we fix B ∈ D0 , and define again CB := {A ⊆ E : A ∩ B ∈ D0 }; now we
can infer by Step 2, that CB ⊇ A , and we already proved in Step 2 that CB is a d-system
(regardless of whether B ∈ A or not). Hence, by the above logic, D0 ⊆ CB , i.e. for every
A, B ∈ D0 , A ∩ B ∈ D0 , hence D0 is a π-system, as claimed.
Proof of Theorem I.13. Let D := {A ∈ σ(A ) : µ1 (A) = µ2 (A)}. We know by assumption that
A ⊆ D, and wish to show that D is a d-system, for then it contains σ(A ) by Proposition I.16,
which is a restatement of Theorem I.13.
We are then to check properties (i)-(iii) of Definition I.14. (i) E ∈ D by the assumptions of
Theorem I.13. (ii) Let A ∈ D, then E \ A ∈ D, and for i = 1, 2, we have
µi (A) + µi (E \ A) = µi (E) < ∞,
by the additivity of µ_i (a particular case of σ-additivity). Hence if µ_1(A) = µ_2(A), that forces µ_1(E \ A) = µ_2(E \ A), hence E \ A ∈ D, which concludes (ii) of Definition I.14. (iii) If {A_n}_{n≥1} is a disjoint sequence of elements of D, and A = ⋃_{n=1}^∞ A_n, then by the σ-additivity of both µ_1 and µ_2, it follows that
µ_1(A) = Σ_{n=1}^∞ µ_1(A_n) = Σ_{n=1}^∞ µ_2(A_n) = µ_2(A),
which demonstrates that also A ∈ D.
Having dealt with the uniqueness of extension of a set function to a measure, we are now concerned
with its existence.
Definition I.17 (Ring). Let E be a set, and A a collection of subsets of E. We say that A is a
ring on E, if (i) ∅ ∈ A and (ii) for all A, B ∈ A , both B \ A ∈ A and A ∪ B ∈ A .
Theorem I.18 (Carathéodory's Extension Theorem). Let A be a ring on E, and µ : A → [0, ∞] a function satisfying: (i) µ(∅) = 0, and (ii) for all sequences {A_n}_{n≥1} ⊆ A of disjoint elements of A so that also ⋃_{n=1}^∞ A_n ∈ A, one has
µ(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ µ(A_n).
Then µ extends to a measure on σ(A), i.e. there exists a measure µ̃ on (E, σ(A)) with µ̃ = µ on A.
Note that, despite the lack of uniqueness of a representation (I.5) of the given A, the corresponding number µ(A) is well-defined, e.g. for A = (0, 1] = (0, 1/2] ∪ (1/2, 1], we have µ(A) = 1 = 1/2 + 1/2. Now
we wish to apply Carathéodory's Extension Theorem I.18 to µ, to which end we need to check the
countable additivity of µ, the main difficulty of this theorem.
First, µ(∅) = 0, and also the (finite) additivity µ(A ∪ B) = µ(A) + µ(B) for all A, B ∈ A
disjoint is easy. Now suppose that {A_i}_{i≥1} ⊆ A is a disjoint sequence of elements of A so that also A := ⋃_{i=1}^∞ A_i ∈ A, and define the increasing sequence B_n := ⋃_{i=1}^n A_i, so that B_n ↑ A (i.e. B_n ⊆ B_{n+1}, and ⋃_{n=1}^∞ B_n = A). By the additivity of µ,
µ(B_n) = Σ_{i=1}^n µ(A_i) → Σ_{i=1}^∞ µ(A_i) as n → ∞,
thanks to (I.7). Hence, this, together with the assumption that µ(C_n) ≥ 2ε_0, implies that µ(D_1 ∩ · · · ∩ D_n) ≥ ε_0; in particular, for every n ≥ 1, D_1 ∩ · · · ∩ D_n ≠ ∅, and also
K_n := D̄_1 ∩ · · · ∩ D̄_n ≠ ∅
(with D̄ denoting the closure of D).
To summarize, the sequence {K_n}_{n≥1} is a decreasing sequence of bounded non-empty closed sets in R. Then, by Cantor's Lemma, ⋂_{n=1}^∞ K_n ≠ ∅; but, upon recalling that, by construction (and (I.6)), K_n ⊆ C_n, we get
∅ ≠ ⋂_{n=1}^∞ K_n ⊆ ⋂_{n=1}^∞ C_n,
contradicting the assumption ⋂_{n=1}^∞ C_n = ∅. All in all, this proves the σ-additivity of µ defined on A, hence, in light of Theorem I.18, the existence of a measure µ on σ(A) = B(R), as required.
For the uniqueness we wish to apply Theorem I.13 to µ. However, here the assumption µ(E) < ∞ is violated. We bypass this setback by intersecting the sets with finite intervals, as follows. Let λ, µ be two measures on (R, B(R)) with λ((a, b]) = µ((a, b]) = b − a for all a < b (e.g. µ is the Borel measure whose existence was established above, and λ any other measure with the defining properties on the π-system). Take a number n ∈ Z, and define the measures µ_n and λ_n on (R, B(R)) as µ_n(A) = µ((n, n + 1] ∩ A) and λ_n(A) = λ((n, n + 1] ∩ A). Then, for every n ∈ Z, λ_n and µ_n are probability measures on (R, B(R)) so that λ_n = µ_n on the π-system {(a, b] : a < b} generating B(R).
Since λ_n and µ_n satisfy the conditions of Theorem I.13 (unlike λ and µ), it follows that λ_n = µ_n on B(R). But then for every A ∈ B(R),
µ(A) = Σ_{n∈Z} µ(A ∩ (n, n + 1]) = Σ_{n∈Z} µ_n(A) = Σ_{n∈Z} λ_n(A) = Σ_{n∈Z} λ(A ∩ (n, n + 1]) = λ(A),
i.e. λ = µ on B(R).
Proof. To use the σ-additivity of µ, we are going to decompose the sequence {An } into disjoint sets
in the following way. Set B1 := A1 ∈ E , and, for n ≥ 2, Bn := An \ An−1 ∈ E . Then the {Bi }i≥1 are
disjoint sets in E, with
A_n = ⋃_{i=1}^n B_i
for all n ≥ 1, and, in particular,
A = ⋃_{n=1}^∞ A_n = ⋃_{i=1}^∞ B_i.
Therefore, by the σ-additivity of µ,
µ(A) = µ(⋃_{i=1}^∞ B_i) = Σ_{i=1}^∞ µ(B_i) = lim_{n→∞} Σ_{i=1}^n µ(B_i) = lim_{n→∞} µ(⋃_{i=1}^n B_i) = lim_{n→∞} µ(A_n).
I.5. Independence. Recall that a probability space is a triple (Ω, F, P), where Ω is the sample space, i.e. the set of all possible outcomes of some random experiment, F is the collection of events, i.e. the set of all observable sets of outcomes (which are subsets of Ω), and P is a probability measure on Ω, i.e. for an event A ∈ F, P(A) is the probability of the occurrence of A.
Definition I.21 (Almost sure events). An event A ∈ F occurs almost surely (A is an almost
sure event), abbreviated a.s., if P (A) = 1.
Example I.22. Let Ω = [0, 2π), F the Borel σ-algebra on Ω (this is B(R) restricted to Ω), and P = (1/(2π)) µ with µ the Borel measure on Ω (so that P(Ω) = 1, as it should be). In this case (Ω, F, P)
corresponds to a random point drawn uniformly on the circle, i.e. θ ∈ [0, 2π) is a random uniform
angle. Let A = [0, 2π) \ Q be the event of drawing an irrational angle. Then P (A) = 1, since, by the
σ-additivity of µ, the probability of its complement is
P(Q ∩ [0, 2π)) = Σ_{x∈Q∩[0,2π)} P({x}) = 0.
Therefore, with probability 1, the randomly drawn angle is irrational, i.e. the event A of drawing an
irrational angle is almost sure.
Example I.23 (Infinite toss of a fair coin). Here we assume that a fair coin (i.e. the probability of a head or a tail is 1/2) is tossed infinitely (countably) many times. The corresponding sample space is
Ω = {0, 1}^N = {(ω_1, ω_2, . . .) : ∀i ≥ 1, ω_i ∈ {0, 1}},
where ω_i = 1 means that the i'th coin toss is a head. As a σ-algebra we take F to be generated by the events of the form
A_{(ε_1,...,ε_n)} = {ω_1 = ε_1, . . . , ω_n = ε_n},
as n ∈ N and (ε_1, . . . , ε_n) ∈ {0, 1}^n ("cylinder sets"). As a probability measure, we take the one defined on the cylinder sets as P(A_{(ε_1,...,ε_n)}) = 2^{−n}.
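A small simulation makes the cylinder-set probabilities concrete. The Python sketch below (an illustration under the stated model, not part of the notes) samples finitely many fair coin tosses and estimates P(A_{(ε_1,...,ε_n)}), which should be close to 2^{−n}.

```python
import random

def estimate_cylinder_probability(eps, trials=200_000, seed=0):
    """Estimate P(omega_1 = eps_1, ..., omega_n = eps_n) for i.i.d. fair coin tosses.
    Only the first n coordinates of omega matter for a cylinder set."""
    rng = random.Random(seed)
    n = len(eps)
    hits = 0
    for _ in range(trials):
        omega = [rng.randint(0, 1) for _ in range(n)]
        if omega == list(eps):
            hits += 1
    return hits / trials

eps = (1, 0, 1)
print(estimate_cylinder_probability(eps))   # approximately 2**-3 = 0.125
```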
Definition I.24 (Independence). Let (Ω, F , P ) be a probability space, and I be a countable index
set.
(1) The events {Ai }i∈I are independent, if for all finite J ⊆ I,
P(⋂_{i∈J} A_i) = ∏_{i∈J} P(A_i).    (I.8)
(2) Let {Ei }i∈I be a family of σ-algebras so that for all i ∈ I, Ei ⊆ F (i.e. it is a family of
sub-σ-algebras of F ). We say that the {Ei }i∈I are independent, if every sequence {Ai }i∈I
so that for all i ∈ I, Ai ∈ Ei , is independent (as a sequence of events).
Intuitively, independence means that one random experiment does not convey any information about another one. One might think that the logic behind the use of (I.8) is circular: in order to exploit independence, it is necessary to verify its definition (I.8) in the first place, so seemingly nothing is gained. But, usually, it is clear from the context when two events (or σ-algebras) are independent. For example, if a fair coin is tossed several times, then the outcomes of these tosses are independent (unless the coin is deformed during the tosses). The following example shows that for a sequence of events to be independent, it is not sufficient that every pair of events is independent (i.e. independence is strictly stronger than pairwise independence).
Example I.25. A fair coin is tossed twice, so that the sample space is
Ω = {HH, HT, TH, TT},
where each outcome has probability 1/4. Take
A_1 = {TT, TH} = "T first",
A_2 = {TT, HT} = "T second",
A_3 = {TH, HT} = "exactly one T".
Here we have
P(A_1) = P(A_2) = P(A_3) = 1/2,
P(A_1 ∩ A_2) = P(A_2 ∩ A_3) = P(A_1 ∩ A_3) = 1/4,
but
P(A_1 ∩ A_2 ∩ A_3) = 0 ≠ 1/8 = P(A_1) · P(A_2) · P(A_3).
So the events {A_1, A_2, A_3} are not independent, even though every pair of these events is independent. The reason is that the occurrence of any two of these events determines whether the third one occurs.
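The probabilities in Example I.25 can be verified by exhaustive enumeration of the four outcomes; the following Python sketch (illustration only, not part of the notes) checks pairwise independence and the failure of (I.8) for the triple.

```python
from itertools import product, combinations

Omega = [a + b for a, b in product("HT", repeat=2)]   # ['HH', 'HT', 'TH', 'TT']
P = {omega: 0.25 for omega in Omega}                  # uniform probability measure

def prob(event):
    return sum(P[omega] for omega in event)

A1 = {"TT", "TH"}          # T first
A2 = {"TT", "HT"}          # T second
A3 = {"TH", "HT"}          # exactly one T
events = [A1, A2, A3]

# every pair is independent ...
for E, F in combinations(events, 2):
    assert abs(prob(E & F) - prob(E) * prob(F)) < 1e-12

# ... but the triple is not: P(A1 n A2 n A3) = 0 != 1/8
print(prob(A1 & A2 & A3), prob(A1) * prob(A2) * prob(A3))
```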
Suppose that one needs to test the independence of two σ-algebras. Of course one can check
whether the equality (I.8) is satisfied for any events lying in the respective σ-algebras. However, it
is sufficient to apply it on a generating set only, as asserted in the following theorem.
Theorem I.26. Let (Ω, F , P ) be a probability space, and A1 , A2 ⊆ F two π-systems contained in
F . If the equality
P (A1 ∩ A2 ) = P (A1 ) · P (A2 )
is satisfied for all A1 ∈ A1 , A2 ∈ A2 , then σ(A1 ) and σ(A2 ) are independent.
Proof. Fix a set A_1 ∈ A_1. It is easy to check that the set functions µ, ν defined for A ∈ F as
µ(A) := P (A1 ∩ A); ν(A) := P (A1 ) · P (A)
are measures. Moreover, by the assumptions of Theorem I.26, one has µ(A) = ν(A) on the π-system
A2 . Hence, by Theorem I.13, for all A2 ∈ σ(A2 ),
µ(A2 ) = ν(A2 ),
i.e.
P (A1 ∩ A2 ) = P (A1 ) · P (A2 )
for all A1 ∈ A1 , A2 ∈ σ(A2 ). Now fix A2 ∈ σ(A2 ) (not restricted to A2 ∈ A2 ), and apply the same
argument with
µ(A) := P (A ∩ A2 ); ν(A) := P (A) · P (A2 ),
to yield that for all A1 ∈ σ(A1 ),
P (A1 ∩ A2 ) = µ(A1 ) = ν(A1 ) = P (A1 ) · P (A2 ),
as asserted by Theorem I.26.
Example I.27. For two dice roll, the corresponding sample space is
Ω = {(ω1 , ω2 ) : ωi ∈ {1, 2, . . . , 6}} = {1, 2, . . . , 6}2 ,
the σ-algebra of events is the whole power set F = P(Ω), and P is the uniform distribution on Ω (i.e. each atomic outcome gets probability 1/36). We take the events A_k = {first die has score k}, B_k = {sum of the two scores is k}, C_k = {second die has score k}, k = 1, . . . , 6. Then P(A_6) = P(B_7) = 1/6, and
P(A_6 ∩ B_7) = 1/36 = P(A_6) · P(B_7),
since the only way that both A6 and B7 occur is (6, 1). Hence the events A6 and B7 are independent.
But clearly, the outcome of the first die has an effect on the total sum, so Ak can’t be independent
of Bj for all possible values of k, j, e.g.
P(A_2 ∩ B_11) = 0 ≠ P(A_2) · P(B_11) = (1/6) · (1/18).
On the other hand, clearly, all the events Ak are independent of all the events Cj .
I.6. The Borel-Cantelli lemma. Let (Ω, F, P) be a probability (or measure) space, and {A_n}_{n≥1} be a sequence of events. If for some n ≥ 1 and some sample point x ∈ Ω we have x ∈ ⋃_{m≥n} A_m, that means that x is in some A_m with m ≥ n, i.e. x is in some tail of the sequence {A_n}_{n≥1}. Therefore, if x ∈ ⋂_{n=1}^∞ ⋃_{m≥n} A_m, then x ∈ A_n with n arbitrarily large, or, equivalently, x is an element of an infinite subsequence of {A_n}_{n≥1}. We then define
{A_n infinitely often} = {A_n i.o.} = ⋂_{n=1}^∞ ⋃_{m≥n} A_m,
also denoted
lim sup_{n→∞} A_n := ⋂_{n=1}^∞ ⋃_{m≥n} A_m.    (I.9)
The set lim sup_{n→∞} A_n is an event, by (I.9) and the usual operations inside a σ-algebra F.
Similarly, if for some n ≥ 1, x ∈ ⋂_{m≥n} A_m, then x ∈ A_m for all m ≥ n, and if we allow n to be arbitrary by taking x ∈ ⋃_{n=1}^∞ ⋂_{m≥n} A_m, that means that x ∈ A_n for all n sufficiently large ("eventually"). We then define
lim inf_{n→∞} A_n = {A_n eventually} = {A_n ev.} := ⋃_{n=1}^∞ ⋂_{m≥n} A_m.    (I.10)
Moreover, by De Morgan's law, if we denote for an event A ∈ F its complement w.r.t. Ω, A^c := Ω \ A, we may write
(lim sup_{n→∞} A_n)^c = (⋂_{n=1}^∞ ⋃_{m≥n} A_m)^c = ⋃_{n=1}^∞ (⋃_{m≥n} A_m)^c = ⋃_{n=1}^∞ ⋂_{m≥n} A_m^c = lim inf_{n→∞} A_n^c.    (I.11)
Lemma I.28 (Borel-Cantelli). Let (Ω, F , P ) be a probability space, and {An }n≥1 ⊆ F be a sequence
of events. Then:
(1) If Σ_{n=1}^∞ P(A_n) < ∞, then P({A_n i.o.}) = 0.
(2) If the events {A_n}_{n≥1} are independent and Σ_{n=1}^∞ P(A_n) = ∞, then P({A_n i.o.}) = 1.
Intuitively, the Borel-Cantelli Lemma I.28 asserts that, if the probabilities P(A_n) decay sufficiently rapidly, then, almost surely, A_n does not occur infinitely often. Its counterpart, valid only under the independence assumption, states that if these probabilities do not decay sufficiently rapidly, then, almost surely, A_n occurs infinitely often. In this case (i.e. if the {A_n}_{n≥1} are assumed to be independent), A_n occurs infinitely often almost surely if and only if the series Σ_{n=1}^∞ P(A_n) diverges. The analogue
of the Borel-Cantelli lemma, properly rephrased, holds for measure spaces in place of probability
spaces.
Proof. (1) Recall that
{A_n i.o.} = ⋂_{n=1}^∞ ⋃_{m≥n} A_m.
Therefore, for every n_0 ≥ 1, {A_n i.o.} ⊆ ⋃_{m≥n_0} A_m, so that its probability is bounded above by
P({A_n i.o.}) ≤ P(⋃_{m≥n_0} A_m) ≤ Σ_{m=n_0}^∞ P(A_m),
which can be made arbitrarily small by taking n_0 large, being a tail of a convergent series. Since P({A_n i.o.}) does not depend on n_0, it follows that P({A_n i.o.}) = 0.
(2) We set a_n = P(A_n), and will use the inequality 1 − a ≤ e^{−a}, valid for a ∈ R, with a = a_n. Since the events {A_n}_{n≥1} are independent, so are {A_n^c}_{n≥1}. Hence we have, for n ≥ 1 and N ≥ n,
P(⋂_{m=n}^N A_m^c) = ∏_{m=n}^N (1 − a_m) ≤ ∏_{m=n}^N e^{−a_m} = exp(−Σ_{m=n}^N a_m) → 0
as N → ∞ (n is fixed). Hence, since ⋂_{m≥n} A_m^c ⊆ ⋂_{m=n}^N A_m^c, its probability is P(⋂_{m≥n} A_m^c) = 0, valid for all n ≥ 1, and so, by the σ-additivity of P,
P({A_n^c ev.}) = P(⋃_{n=1}^∞ ⋂_{m≥n} A_m^c) ≤ Σ_{n=1}^∞ P(⋂_{m≥n} A_m^c) = 0.
By (I.11), {A_n i.o.}^c = {A_n^c ev.}, whence P({A_n i.o.}) = 1.
It then follows that the series Σ_{n=1}^∞ P(A_n) converges if and only if α > 1. Hence, by the Borel-Cantelli lemma, for every ε > 0,
P(X_n / log n ≥ 1 + ε i.o.) = 0,
whereas
P(X_n / log n ≥ 1 i.o.) = 1.
It then follows that max_{1≤i≤n} X_i grows precisely like log n, in the following sense. Recall that for a sequence a_n of real numbers, lim sup_{n→∞} a_n and lim inf_{n→∞} a_n were defined. These always exist, unlike a limit, which might or might not exist for a given sequence, and lim_{n→∞} a_n exists if and only if lim sup_{n→∞} a_n = lim inf_{n→∞} a_n, in which case the limit equals their common value. (Mind that lim_{n→∞} X_n / log n does not exist almost surely, as the sequence fluctuates, and P(X_n ≤ 1 i.o.) = 1 can be similarly proved.)
since the sequence {sup_{m≥n} a_m}_{n≥1} is monotone decreasing. The conclusion then follows upon applying (iii)-(iv) of this proposition sequentially. Alternatively, one can reuse Lemma II.4.
vi. Similarly to the above, this time with the use of
lim inf_{n→∞} a_n = lim_{n→∞} inf_{m≥n} a_m = sup_{n≥1} inf_{m≥n} a_m.
One can deduce directly from the definitions that the composition of measurable functions is measurable.
Lemma II.7. If f : E → G and g : G → F are measurable functions (w.r.t. the corresponding
measurable spaces), then g ◦ f is measurable.
II.2. Random variables. A random variable X : Ω → R induces, from the measure on Ω, a new measure on R, its probability distribution, which is very informative about X.
Definition II.8 (Probability distribution). The probability distribution of a random variable
X : Ω → R on a probability space (Ω, F , P ) is a probability measure PX : B(R) → [0, 1] on
(R, B(R)), defined as
PX (A) = P (X ∈ A) = P (X −1 (A)).
This makes sense, since X is measurable, hence X −1 (A) ∈ F .
We also define the distribution function FX : R → [0, 1] as
FX (x) = P (X ≤ x) = PX ((−∞, x]) = P (X −1 ((−∞, x])).
By the definition, FX is non-decreasing, satisfying
lim_{x→−∞} F_X(x) = 0,  lim_{x→+∞} F_X(x) = 1.
Moreover, we claim that F_X is right-continuous. To see that, we use Lemma I.20 with
∅ = ⋂_{n=1}^∞ (x_0, x_0 + 1/n],
so that
lim_{n→∞} (F_X(x_0 + 1/n) − F_X(x_0)) = lim_{n→∞} P_X((x_0, x_0 + 1/n]) = 0,
which is sufficient for right-continuity.
Further, since A = {(−∞, x] : x ∈ R} is a π-system of subsets of R generating B(R), Theorem I.13 demonstrates that prescribing F_X, defined on R, uniquely determines the whole of P_X defined on B(R). That is, F_X encodes all the information about the full probability distribution P_X of X.
Example II.9. Set Ω = Z_{≥0}, F = P(Z_{≥0}), λ > 0, and
P(A) = Σ_{k∈A} (λ^k / k!) e^{−λ},
the Poisson probability measure. Then the identity map X : Ω → R, X(ω) = ω, is a Poisson random variable. Here P_X = P, i.e. the probability distribution of X (restricted to Z_{≥0}) equals the probability measure on Ω.
that counts the number of heads in the m consecutive tosses between (k − 1)m + 1 and km. Since the tosses determining X_k and X_{k′}, k ≠ k′, are disjoint, all the {X_k}_{k≥1} are i.i.d., distributed ∼ Bin(m, 1/2), i.e.
P_{X_k}(A) = P(X_k ∈ A) = Σ_{i∈A∩{0,1,...,m}} \binom{m}{i} 2^{−m},
A ∈ B(R). This shows that P_{X_k} (on B(R)) is much simpler than the underlying measure P on Ω.
Example II.12. Take Ω = (0, 1], F = B((0, 1]), and P the Borel measure on (0, 1]. Then the identity map X(ω) = ω is a uniform random variable on (0, 1], i.e. X ∼ U(0, 1].
As we have seen, the same probability space is not restricted to carrying exclusively one random variable, but can accommodate several random variables with varying distribution measures, depending on how "rich" this probability space is. For example, the trivial σ-algebra F = {∅, Ω} can only admit the constant random variables (which have to be measurable w.r.t. F). The following theorem, whose proof is outside the scope of this course, shows that ((0, 1], B((0, 1])) is sufficiently rich to accommodate random variables with unrestricted distributions, and, further, these can be taken to be independent.
Theorem II.13. Let (Ω, F, P) be the probability space with Ω = (0, 1], F = B((0, 1]) and P the Borel measure on B((0, 1]). Suppose (F_n)_{n≥1} is an arbitrary sequence of distribution functions (i.e. for every n ≥ 1, F_n is a distribution function of some random variable). Then there exists a sequence {X_n}_{n≥1} of independent random variables on (Ω, F, P) so that, for every n ≥ 1, the distribution function of X_n is F_n.
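One standard construction behind results of this type (a sketch, not necessarily the proof intended in the notes) is inverse-transform sampling: if U ∼ U(0, 1] and F is a distribution function, then X := inf{x : F(x) ≥ U} has distribution function F. The Python sketch below applies this with a generalized inverse computed by bisection; all helper names and the truncation bounds are ad hoc assumptions of the illustration.

```python
import random
from math import exp

def generalized_inverse(F, u, lo=-1e6, hi=1e6, iters=100):
    """Approximate inf{x : F(x) >= u} by bisection (F non-decreasing)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid) >= u:
            hi = mid
        else:
            lo = mid
    return hi

def sample_from_cdf(F, rng):
    """If U ~ U(0, 1], then X = inf{x : F(x) >= U} has distribution function F."""
    u = 1.0 - rng.random()          # uniform on (0, 1]
    return generalized_inverse(F, u)

rng = random.Random(0)
exp_cdf = lambda x: 1.0 - exp(-x) if x > 0 else 0.0   # distribution function of Exp(1)
samples = [sample_from_cdf(exp_cdf, rng) for _ in range(10_000)]
print(sum(samples) / len(samples))   # empirical mean, should be close to 1
```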
II.3. Modes of convergence. Let {Xn }n≥1 be a sequence of random variables on a probability
space (Ω, F , P ), and X another random variable on the same space. Here we are interested in
whether Xn converges to X. A number of different modes of convergence could be of relevance, the
most immediate (but not necessarily the most “correct”) one is the pointwise convergence, i.e., for
every ω ∈ Ω,
lim_{n→∞} X_n(ω) = X(ω).
This condition seems to be too strong to impose, as the following example shows.
Example II.14. Return to Example I.23 of infinite toss of a fair coin, and set Xn (ω) = ωn ∈ {0, 1}.
It is reasonable to request that
lim_{n→∞} (1/n) Σ_{i=1}^n X_i(ω) = 1/2,
i.e. the proportion of heads should asymptotically approach 1/2 as the number of tosses grows. However, this does not hold pointwise, for example for ω = (0, 0, . . .), occurring with probability 0.
Recall Definition I.21 of almost sure events and their defining properties (a.s.). Similarly, let (E, E, µ) be a measure space. We say that A ∈ E (or its defining property) holds almost everywhere (a.e.) if µ(E \ A) = 0.
Definition II.15 (Convergence a.e. and in measure). Let fn : E → R be a sequence of measurable
functions and f : E → R another measurable function.
(1) We have fn → f almost everywhere (almost surely), denoted a.e. (a.s.), if
µ ({x ∈ E : fn (x) 6→ f (x)}) = 0.
(2) We have f_n → f in measure (in probability), if for every ε > 0,
µ({x ∈ E : |f_n(x) − f(x)| > ε}) → 0.
Convergence in measure (resp. in probability) is denoted f_n →^µ f (resp. f_n →^P f).
In practice, we neglect events of measure 0, so pointwise convergence will be irrelevant for us; instead, the a.e. (a.s.) convergence will be of interest. Unrolling the definition of a.e. convergence: there exists a set A whose complement has vanishing measure, µ(E \ A) = 0, so that for every x ∈ A and every ε > 0, there exists a number N = N(x, ε) ∈ N sufficiently large, so that for all n > N, |f_n(x) − f(x)| < ε. Importantly, N is allowed to depend on the point x. Otherwise, if N only depends on ε, the resulting notion is the much stronger one of uniform convergence. In general, uniform convergence implies pointwise convergence (everywhere), which, in its turn, implies a.e. convergence.
Example II.16. Let f_n : R → R, f_n(x) = χ_{(n,n+1]}(x), where R is equipped with the Borel measure. Then for every x ∈ R, if n > x + 1, then f_n(x) = 0, hence f_n → 0 a.e. (in fact, everywhere). However, for 0 < ε < 1, µ({x : |f_n| ≥ ε}) = µ((n, n + 1]) = 1, so that f_n ↛ 0 in measure.
Example II.17. On the same space as above, take
f_n(x) = 2n − 2n²x for x ∈ [0, 1/n], and f_n(x) = 0 otherwise.
Then, for every x ≠ 0, f_n(x) = 0 for n sufficiently large (depending on x). Hence f_n(x) → 0 for all x ≠ 0, and thereupon f_n → 0 a.e. Further, for every ε > 0,
µ({|f_n(x)| > ε}) ≤ 1/n → 0,
so also f_n → 0 in measure.
Example II.18. Let f_n : [0, 1] → R be the sequence of the characteristic functions of dyadic intervals: f_1 = χ_{[0,1]}, f_2 = χ_{[0,1/2]}, f_3 = χ_{[1/2,1]}, f_4 = χ_{[0,1/4]}, f_5 = χ_{[1/4,1/2]}, . . ., formally defined as
f_n(x) = χ_{[(n−2^k)/2^k, (n−2^k+1)/2^k]}(x),
k ≥ 0, 2^k ≤ n < 2^{k+1}. Then for ε > 0, µ({x : |f_n(x)| > ε}) ≤ 2^{−k} → 0 as n → ∞, i.e. f_n → 0 in measure.
However, for a fixed x ∈ [0, 1], f_n(x) attains both the values 0 and 1 for n arbitrarily large, hence the limit lim_{n→∞} f_n(x) does not exist, i.e. f_n does not converge a.e.
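The "typewriter" sequence of Example II.18 is easy to explore numerically; the Python sketch below (illustration only, not part of the notes) prints, for a fixed x, the values f_n(x), which keep returning to both 0 and 1, while the length of the support of f_n shrinks like 2^{−k}.

```python
def f(n, x):
    """Indicator of the n-th dyadic interval, n >= 1, x in [0, 1]."""
    k = n.bit_length() - 1                  # 2**k <= n < 2**(k+1)
    left = (n - 2**k) / 2**k
    right = (n - 2**k + 1) / 2**k
    return 1 if left <= x <= right else 0

x = 0.3
values = [f(n, x) for n in range(1, 65)]
print(values)                               # both 0's and 1's keep occurring
print([1 / 2 ** (n.bit_length() - 1) for n in (1, 2, 4, 8, 16, 32)])  # support lengths -> 0
```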
The above examples show that neither convergence a.e. nor convergence in measure implies the other. Since measure-0 sets are ignored, the limit of a sequence f_n, either a.e. or in measure, is only defined up to a measure zero set. For example, f_n ≡ 0 converges to f ≡ 0, but we can also take f = χ_Q without impairing the convergence. In either case, to define convergence (a.e. or in
measure, resp. a.s. or in probability), all the functions (resp. random variables) have to be defined
on the same measure space (resp. probability space). In what follows we will mainly be concerned
with a.s. convergence and convergence in probability of sequence of random variables Xn defined
on the same probability space. A sufficient condition for almost sure convergence could be easily
derived from the Borel-Cantelli lemma I.28; mind that Xn → X a.s. (resp. in probability), if and
only if Xn − X → 0 a.s. (resp. in probability).
Corollary II.19. (1) If for every ε > 0,
Σ_{n=1}^∞ P(|X_n| > ε) < ∞,
then X_n → 0 a.s.
(2) Conversely, if the {X_n}_{n≥1} are independent, and X_n → 0 a.s., then, for every ε > 0,
Σ_{n=1}^∞ P(|X_n| > ε) < ∞.
Proof. (1) By the Borel-Cantelli Lemma, under the conditions of Corollary II.19(1), for every ε > 0, P(|X_n| > ε i.o.) = 0. The subtlety is that we need to find a set of ω ∈ Ω of full probability, so that the above holds for all ε > 0. Let us denote the event
A_ε = {|X_n| ≤ ε outside of finitely many n} = {ω ∈ Ω : ∃N = N(ω) : ∀n > N, |X_n| ≤ ε} ⊆ {lim sup_{n→∞} |X_n| ≤ ε}.
By the above, for all ε > 0, P(A_ε) = 1, and the sought-after statement holds on the intersection A := ⋂_{ε>0} A_ε, i.e. for ω ∈ A and every ε > 0, there exists N = N(ω, ε), so that for all n > N, |X_n| < ε.
How to show that P(A) = 1? By definition, A is an intersection of a continuum of events, each of full probability. Unfortunately, the σ-additivity of P is thereupon unable to yield the same about the probability of A. However, we may restrict to ε = 1/n, so as to use the density of these numbers around 0, and then A = ⋂_{n=1}^∞ A_{1/n}, whence the σ-additivity (alternatively, the Continuity Lemma I.20, or, rather, its variant for decreasing sequences) of P does yield P(A) = 1 (P(A) = 1 − P(A^c), and P(A^c) ≤ Σ_{n=1}^∞ P(A^c_{1/n}) = 0).
(2) Homework assignment.
Example II.20. Let X_n ∼ Bin(1, n^{−α}) be independent coin tosses with probability of heads 1/n^α. Then for 0 < ε < 1,
P(|X_n| > ε) = P(X_n = 1) = n^{−α}
(otherwise, if ε ≥ 1, P(|X_n| > ε) = 0). Then, by Corollary II.19, (1) and (2), X_n → 0 a.s. if and only if α > 1.
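Example II.20 is pleasant to simulate: for α > 1 the index of the last head in a long run is typically small, while for α ≤ 1 heads keep occurring. The Python sketch below (illustration only; a finite simulation can of course only hint at the almost-sure statement) records the largest n ≤ N with X_n = 1.

```python
import random

def last_head_index(alpha, N=100_000, seed=1):
    """Simulate independent X_n ~ Bin(1, n^-alpha), n = 1..N, and return
    the largest n with X_n = 1 (0 if no head occurred)."""
    rng = random.Random(seed)
    last = 0
    for n in range(1, N + 1):
        if rng.random() < n ** (-alpha):
            last = n
    return last

for alpha in (0.5, 1.0, 1.5, 2.0):
    print(alpha, last_head_index(alpha))
# for alpha > 1 the last head tends to appear early (consistent with X_n -> 0 a.s.);
# for alpha <= 1, heads occur all the way up to N (X_n = 1 infinitely often a.s.)
```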
Example II.21. Let X_1, X_2, . . . be Gaussian random variables (that need not be independent), X_n ∼ N(0, σ_n²), with
σ_n² = 1/(log n)² → 0.
We need to estimate the probability P(|X_n| > ε), first by evaluating the probability of a deviation for the standard Gaussian Z ∼ N(0, 1): for t > 0,
P(|Z| > t) = 2 ∫_t^∞ (1/√(2π)) e^{−x²/2} dx ≤ √(2/π) ∫_t^∞ (x/t) e^{−x²/2} dx = (1/t) √(2/π) [−e^{−x²/2}]_t^∞ = (1/t) √(2/π) e^{−t²/2}.
We claim that, for every ε > 0, Σ_{n=1}^∞ P(|X_n| > ε) < ∞, whence X_n → 0 a.s. by Corollary II.19. To this end, given ε > 0 we set n_ε to be the minimal integer satisfying ε² log(n_ε)/2 > 1, and set δ > 0 so that ε² log(n_ε)/2 = 1 + δ. Then, since P(|X_n| > ε) = P(|Z| > ε log n),
Σ_{n=1}^∞ P(|X_n| > ε) ≤ Σ_{n≤n_ε} 1 + √(2/π) Σ_{n>n_ε} (1/(ε log n)) n^{−1−δ} < ∞.
Note that the sufficient condition of Corollary II.19(1) in particular implies the convergence Xn → 0
in probability (since the summands of a convergent series vanish). The following result shows that,
under the assumption µ(E) < ∞, the convergence a.e. is stronger than convergence in measure.
Conversely, from convergence in measure one can infer a.e. convergence along an infinite subsequence.
Theorem II.22. Let (E, E , µ) be a measure space and {fn }n≥1 a sequence of measurable functions
on E, and f a measurable function on E.
(1) Assume µ(E) < ∞. If fn → f a.e., then fn → f in measure.
(2) If fn → f in measure, then there exists a subsequence fnk , so that fnk → f a.e.
In probabilistic language, Theorem II.22 states that (1) X_n → X a.s. implies X_n →^P X (the finite measure condition is automatically satisfied), and (2) X_n →^P X implies X_{n_k} → X a.s. for some subsequence X_{n_k}.
Proof. (1) It is convenient to work with g_n := f_n − f, so that the postulated convergence is g_n → 0 a.e. Now take an arbitrary ε > 0, and consider the sets A_n := {x ∈ E : |g_n(x)| ≤ ε} and B_n := ⋂_{m=n}^∞ A_m. Then the B_n are increasing, and ⋃_{n=1}^∞ B_n = {x ∈ E : |g_n(x)| ≤ ε ev.}, see (I.10). Then, thanks to Lemma I.20,
µ(B_n) → µ({x ∈ E : |g_n(x)| ≤ ε ev.}) = µ(E)
by assumption (i.e. that g_n → 0 a.e.). Since, by the definition of B_n, B_n ⊆ A_n, we have that
µ(E) ≥ µ({x : |g_n(x)| ≤ ε}) = µ(A_n) ≥ µ(B_n) → µ(E),
and so µ({x : |g_n(x)| ≤ ε}) → µ(E). Using the finiteness of µ(E) (and this is the only place this assumption is used), we finally have
µ({x : |g_n(x)| > ε}) = µ(E) − µ({x : |g_n(x)| ≤ ε}) → 0,
as required.
(2) Again, we denote g_n := f_n − f, and assume that g_n → 0 in measure. Given a number k ≥ 1, µ(|g_n| > 1/k) → 0 as n → ∞, hence there exists a number n_k sufficiently large, so that µ(|g_{n_k}| > 1/k) < 2^{−k}. We claim that the thus obtained sequence {n_k} satisfies the postulated properties. Indeed,
Σ_{k=1}^∞ µ(|g_{n_k}| > 1/k) ≤ Σ_{k=1}^∞ 2^{−k} < ∞,
hence, by the Borel-Cantelli lemma (1) (which is valid on general measure spaces, not only probability spaces),
µ(|g_{n_k}| > 1/k i.o.) = 0.
However, if, for some x ∈ E, |g_{n_k}(x)| ≤ 1/k for all k sufficiently large, then g_{n_k}(x) → 0, hence E \ {|g_{n_k}| > 1/k i.o.} ⊆ {g_{n_k} → 0}, and
µ({g_{n_k} → 0}^c) ≤ µ(|g_{n_k}| > 1/k i.o.) = 0,
i.e. g_{n_k} → 0 a.e.
where for m ≥ 1 one has 0 ≤ a_i < ∞, and {A_i}_{1≤i≤m} ⊆ E. We denote the class of simple functions (III.1) by S(E), and define the corresponding integral to be
∫ f dµ = ∫_E f dµ = ∫_E f(x) µ(dx) := Σ_{i=1}^m a_i · µ(A_i),    (III.2)
under the convention 0 · ∞ := 0, in case some of the indicator sets A_i are of infinite measure (we can also allow for a_i = ∞, under the same convention, i.e. f may attain the infinite value on a measure zero set, which we will ignore from this point on; but strictly speaking, f is allowed to attain +∞, and later on −∞, on measure 0 sets). We observe that the representation (III.1) of a simple function is not unique, but nevertheless, by the properties of µ, the integral is well-defined. In particular, for A ∈ E,
∫_E χ_A dµ = µ(A),
which, in probabilistic language, states
E[χ_A] = P(A),
with
E[·] := ∫_Ω · dP
denoting the expectation of a random variable.
The following properties of the integral of simple functions are immediate from the definition.
Lemma III.1. For f, g ∈ S(E) and α, β ≥ 0, one has:
(1) Linearity:
∫_E (α·f + β·g) dµ = α ∫_E f dµ + β ∫_E g dµ.
(2) If f(x) ≤ g(x) for (a.e.) x ∈ E, then
∫_E f dµ ≤ ∫_E g dµ.
(3) We have ∫_E f dµ = 0, if and only if f = 0 a.e. (recall that for f ∈ S(E) we require f ≥ 0).
The following definition extends the definition of the integral to measurable non-negative functions.
Definition III.2. Let f : E → R_{≥0} be an (E − B(R))-measurable function. The integral of f is
∫_E f dµ := sup { ∫_E g dµ : g ∈ S(E), 0 ≤ g ≤ f },
finite or infinite.
Thus, if f is measurable, it is approximated from below by simple functions, and the resulting "optimal" value of the area covered by those simple functions is the value of the integral of f. This is manifested by the following lemma (in combination with the follow-up theorem). Recall that for a sequence of functions f_n, we write f_n ↑ f (a.e.), if for every (resp. a.e.) x ∈ E, f_n(x) is monotone increasing, and f_n(x) → f(x).
Lemma III.3. Let f : E → R≥0 be a (E − B(R))-measurable function. Then there exists a sequence
{fn } of (non-negative) simple functions, so that fn ↑ f as n → ∞.
Proof. Take the functions
f_n(x) = n if f(x) ≥ n, and f_n(x) = k · 2^{−n} if f(x) ∈ [k/2^n, (k+1)/2^n), k = 0, 1, . . . , n·2^n − 1.
First, it is easy to check that f_n(x) ↑ f(x). That the f_n are simple follows from the measurability of f, which guarantees that each set
A_{n,k}(f) := {x ∈ E : f(x) ∈ [k/2^n, (k+1)/2^n)}
is measurable.
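The approximating sequence from the proof of Lemma III.3 is completely explicit and can be explored numerically. The following Python sketch (illustration only, not part of the notes) implements f_n for a given f and shows the pointwise monotone convergence f_n(x) ↑ f(x).

```python
from math import floor, exp

def dyadic_approximation(f, n):
    """The simple function f_n of Lemma III.3: f_n = n where f >= n,
    and f_n = k * 2^-n where f lies in [k/2^n, (k+1)/2^n)."""
    def fn(x):
        value = f(x)
        if value >= n:
            return float(n)
        return floor(value * 2 ** n) / 2 ** n
    return fn

f = lambda x: exp(x)            # a non-negative measurable function on R
x = 1.234
for n in range(1, 8):
    print(n, dyadic_approximation(f, n)(x))   # increases towards f(x) = e^1.234
print(f(x))
```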
The following theorem shows, in particular, that the approximation quality of the simple functions
as in Lemma III.3 is sufficiently fine, so that the corresponding integrals are well-approximated. In
general, it is important to find sufficient conditions, when convergence a.e. of a sequence of functions
ensures the same for the integrals, i.e. we can exchange the limits and the integration. One of the
sufficient conditions is given by the following Monotone Convergence Theorem, which we cite without proof.
Theorem III.4 (Monotone Convergence Theorem). Let (E, E , µ) be a measure space, {fn }n≥1 be
a sequence of measurable non-negative functions fn : E → R, and f : E → R a measurable non-
negative function, so that fn ↑ f a.e. on E. Then
∫_E f_n dµ ↑ ∫_E f dµ.
We now aim to extend the notion of integral to arbitrary measurable functions, thus no longer
imposing the nonnegativity. Given a measurable function f : E → R, we decompose f into its
positive and negative parts, f + and f − respectively, in the following way
Definition III.5 (Positive and negative parts of a measurable function). Let f : E → R be a
measurable function. Then
f^+(x) := max(f(x), 0),  f^−(x) := −min(f(x), 0).
Note that both f ± ≥ 0 are nonnegative and measurable, and, further, we have
|f | = f + + f − ,
and
f = f + − f −.
It can be easily seen that ∫_E |f| dµ < ∞ if and only if both ∫_E f^+ dµ < ∞ and ∫_E f^− dµ < ∞, whence
∫_E |f| dµ = ∫_E f^+ dµ + ∫_E f^− dµ.
Within the context of a probability space (Ω, F , P ) and a random variable X : Ω → R, the
integral is referred to as expectation
E[X] := ∫_Ω X dP = ∫_Ω X(ω) dP(ω).
The following lemma extends the basic properties of the integral from simple functions to arbitrary
ones (cf. Lemma III.1).
Lemma III.7. For f, g integrable functions, and α, β ∈ R, one has:
(1) Linearity:
∫_E (α·f + β·g) dµ = α ∫_E f dµ + β ∫_E g dµ.
(2) If f(x) ≤ g(x) for (a.e.) x ∈ E, then
∫_E f dµ ≤ ∫_E g dµ.
(3) If f = 0 a.e., then ∫_E f dµ = 0.
(4) Conversely, if f ≥ 0 and ∫_E f dµ = 0, then f = 0 a.e.
Proof. First we prove the statements under the extra assumption that f, g ≥ 0 are nonnegative, and α, β ≥ 0.
(1) We apply Lemma III.3 to f, g to yield two sequences f_n, g_n ∈ S(E) of simple functions so that f_n ↑ f and g_n ↑ g. Then, under our assumptions, αf_n + βg_n ↑ αf + βg, and
∫_E (αf_n + βg_n) dµ = α ∫_E f_n dµ + β ∫_E g_n dµ,
whence the result follows upon an application of the Monotone Convergence Theorem III.4 to both the l.h.s. and the r.h.s.
(2) Follows directly from the definition of the integral of nonnegative functions.
(3) If f = 0 a.e., then f ∈ S(E), and ∫ f dµ = 0 by the definition.
(4) Let f_n ↑ f be the sequence of simple functions prescribed by Lemma III.3. Then, since f_n ≤ f, it forces ∫_E f_n dµ = 0 by (2) of this lemma, and then f_n = 0 a.e. by Lemma III.1(3), so that f = 0 a.e.
For general f, g, decompose f = f^+ − f^− and g = g^+ − g^−, and apply the above to the nonnegative functions f^±, g^± separately.
Example III.8. Let f : R → R be defined by f(x) = sin(x)/x, with R equipped with the Borel σ-algebra and the Borel measure, and suppose we are interested in whether f is integrable on [0, ∞). It is possible to check that both ∫_0^∞ f^+ dµ and ∫_0^∞ f^− dµ are infinite, so f is not integrable. However, the Riemann integral ∫_0^∞ (sin x / x) dx makes sense as an improper integral.
Definition III.9 (Dirac delta). Let (E, E ) be a measurable space, and x0 ∈ E an arbitrary point.
We denote by δ_{x_0} : E → {0, 1} the measure defined by
δ_{x_0}(A) = 1 if x_0 ∈ A, and δ_{x_0}(A) = 0 if x_0 ∉ A;
it is usually called the “Dirac delta measure” at x0 , or “Dirac delta function” at x0 , even
though it is not a function.
The Dirac delta measure at x0 is the probability distribution of the constant random variable
X : Ω → E, X(ω) = x0 , so that P (X = x0 ) = 1. In the important case (E, E ) = (R, B(R)), the
corresponding distribution function is
F_X(x) = P(X ≤ x) = 1 for x ≥ x_0, and F_X(x) = 0 for x < x_0.
Note that δx0 is not the indicator function of the singleton {x0 }, since the former is a measure (so,
a set function), whereas the latter is a function. Rather, intuitively, δ_{x_0} could be thought of as a function with an infinitesimally thin, infinitely high peak at x_0, of unit mass, vanishing outside of x_0. Using the Dirac delta notation it is convenient to represent discrete distribution measures.
Example III.10. Let (Ω, F , P ) = ({0, 1}, P({0, 1}), P ) with
P = (1/2) δ_0 + (1/2) δ_1.
Then
P(A) = 0 for A = ∅,  P(A) = 1/2 for A = {0} or A = {1},  P(A) = 1 for A = {0, 1},
corresponding to a single toss of a fair coin.
Example III.11. Binomial Bin(n, p) distribution. Then the probability measure is
P = Σ_{k=0}^n \binom{n}{k} p^k (1 − p)^{n−k} · δ_k : P({0, 1, . . . , n}) → [0, 1],
endowing a sample point k, 0 ≤ k ≤ n, with the probability \binom{n}{k} p^k (1 − p)^{n−k}. For example, for A = {0, 1},
P(A) = Σ_{k=0}^n \binom{n}{k} p^k (1 − p)^{n−k} · δ_k(A) = (1 − p)^n + np(1 − p)^{n−1},
since δ_k(A) = 1 for k = 0, 1, and δ_k(A) = 0 otherwise.
Example III.12. Poisson distribution Pois(λ): the probability measure is
P = Σ_{k=0}^∞ (λ^k / k!) e^{−λ} · δ_k    (III.3)
on Ω = Z_{≥0}, F = P(Z_{≥0}).
Let us now consider the integral of some functions (expectation of random variables).
Example III.13. Let µ = δ_y for some y ∈ E. Then if g : E → R is a simple function of the form g = Σ_{i=1}^m a_i χ_{A_i}, then, by the definition (III.2) of the integral, in this case
∫_E g dδ_y = Σ_{i=1}^m a_i · δ_y(A_i) = Σ_{i=1}^m a_i · χ_{A_i}(y) = g(y).
For nonnegative measurable functions f : E → [0, ∞), we may take an increasing sequence f_n ↑ f of simple functions, so that it follows from the Monotone Convergence theorem that
∫_E f dδ_y = lim_{n→∞} ∫_E f_n dδ_y = lim_{n→∞} f_n(y) = f(y).
Example III.16. Let (E, E) = (N, P(N)), and µ the counting measure µ = Σ_{k=1}^∞ δ_k (so that, in particular, µ(N) = ∞, and the same for any infinite set). Then for f : N → R,
∫ f dµ = Σ_{k=1}^∞ ∫ f dδ_k = Σ_{k=1}^∞ f(k).
For example, if f(k) = (−1)^k / k,
∫_N (f^+ + f^−) dµ = Σ_{k=1}^∞ 1/k = ∞,
so in the Lebesgue sense, f is not integrable w.r.t. µ, despite the fact that, as an alternating series, conditionally
Σ_{k=1}^∞ (−1)^k / k = −log 2,
since its positive part cancels against the negative one, which is forbidden by the Lebesgue theory (cf. Example III.8).
In all previous examples, the integrals over discrete domains reduced to a summation (with weights
according to the probability measure P ). The simple functions and monotone convergence were
invoked to evaluate the integral of arbitrary measurable functions.
Let us compare the Riemann integrals and the Lebesgue integrals on R. The Riemann integral
allows for integration on intervals. However, whenever we are to Lebesgue integrate a function
f : E → R over a measurable subset B ∈ E (as opposed to E), we may restrict the domain by
multiplying by the characteristic function:
∫_B f dµ := ∫_E f · χ_B dµ.
Whenever E = R, equipped with the Borel σ-algebra and the Borel measure, we usually write ∫ f dx in place of ∫ f dµ. To support this notation there is the following relation between the Riemann and the Lebesgue integrals (whose proof is outside the scope of this module):
Theorem III.17. (1) If f : [a, b] → R is Riemann integrable and measurable, then f is also Lebesgue integrable, and
∫_a^b f(x) dx = ∫_{[a,b]} f(x) µ(dx),
where the l.h.s. and the r.h.s. are the Riemann integral and the Lebesgue integral respectively.
(2) A function f : [a, b] → R is Riemann integrable, if and only if f is bounded, and the set of points of discontinuity of f is of Lebesgue measure 0. (Lebesgue measure extends the Borel measure to a bigger σ-algebra than B(R).)
For example, the Dirichlet function χQ : R → R is not Riemann integrable (being discontinuous
everywhere), but easily Lebesgue integrable, with integral 0. On the other hand, an improper
Riemann integral might exist, even though the function is not Lebesgue integrable, cf. Example
III.8.
Recall that the Monotone Convergence Theorem III.4 allows one to switch the order of limit and integral for a sequence of measurable functions f_n ↑ f, where the monotonicity assumption is essential. The following theorem, whose proof is outside the scope of this module, relaxes this condition.
Theorem III.18 (Dominated Convergence theorem). Let {fn }n≥1 , f : E → R be measurable func-
tions, so that fn (x) → f (x) a.e. on E. Suppose that there exists an integrable function g : E → R≥0 ,
so that for all n ≥ 1, x ∈ E, |fn (x)| ≤ g(x). Then the fn and f are integrable, and
∫_E f_n dµ → ∫_E f dµ.
The function g in Theorem III.18 is called the “dominating function”. It is important to stress that
what has to be dominated is the integrand fn , and not the outcome of the integration. In particular,
the conditions of Theorem III.18 hold if µ(E) < ∞ and fn are all uniformly bounded, i.e. there
exists a K > 0 so that for all n ≥ 1, x ∈ E, |fn | ≤ K (since in this case g ≡ K is integrable, so could
be chosen as a dominating function).
Example III.19. Let f_n : [0, ∞) → R with f_n(x) = e^{−nx}. Then f_n → 0 a.e. (outside of x = 0), and for all n ≥ 1, f_n(x) ≤ f_1(x) = e^{−x}, with ∫_0^∞ e^{−x} dx = 1 < ∞. Hence the dominated convergence yields
∫_0^∞ e^{−nx} dx → ∫_0^∞ 0 dx = 0,
something that can be checked explicitly (the integral equals 1/n).
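A crude numerical check of Example III.19 (Python sketch, illustration only, not part of the notes): the integrals ∫_0^∞ e^{−nx} dx = 1/n indeed tend to 0, while the dominating function e^{−x} has finite integral. The Riemann-sum routine and the truncation point are ad hoc choices of the illustration.

```python
from math import exp

def integrate(f, a, b, steps=100_000):
    """Very naive midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# truncate [0, infinity) at T = 50; the tails are negligible for these integrands
T = 50.0
print(integrate(lambda x: exp(-x), 0.0, T))                  # ~1, the dominating function
for n in (1, 2, 5, 10, 50):
    print(n, integrate(lambda x, n=n: exp(-n * x), 0.0, T))  # ~1/n -> 0
```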
Example III.20. Let f_n : [0, π] → R, f_n(x) = sin x + (e^x cos x)/n. Then f_n(x) → sin(x) for x ∈ [0, π], and |f_n(x)| ≤ 1 + e^π with µ([0, π]) < ∞, so
∫_0^π f_n(x) dx → ∫_0^π sin(x) dx = 2.
Example III.21. Let (Ω, F, P) be a probability space, and A_n ⊆ Ω a sequence of events, so that, a.s., χ_{A_n} → χ_A for some event A ∈ F, i.e. χ_{A_n}(ω) → χ_A(ω) for ω ∈ Ω outside of a probability zero set. Then, since |χ_{A_n}| ≤ 1 (and in light of P(Ω) = 1 < ∞),
P(A_n) = ∫_Ω χ_{A_n} dP → ∫_Ω χ_A dP = P(A).
Example III.22. Take on (R, B(R), µ) the functions f_n = n² · χ_{(0,1/n)}. Then f_n → 0, but ∫_R f_n dµ = n → ∞, showing that domination, violated in this example, is important for the application of the Dominated Convergence theorem.
Example III.23. Let (Ω, F, P) be an arbitrary probability space, and X_n : Ω → R any sequence of (a.s.) uniformly bounded random variables, |X_n| ≤ M, so that a.s. X_n → X. Then, by the dominated convergence,
E[X_n] = ∫_Ω X_n(ω) dP → ∫_Ω X(ω) dP = E[X].
For instance, for the Poisson distribution of Example III.14, if X_n : Z_{≥0} → Z are uniformly bounded, i.e. for some M > 0, for all n ≥ 1 and k ≥ 0, |X_n(k)| ≤ M, and for all k ≥ 0, X_n(k) → X(k), then
E[X_n] = Σ_{k=0}^∞ (λ^k / k!) e^{−λ} X_n(k) → Σ_{k=0}^∞ (λ^k / k!) e^{−λ} X(k) = E[X].
For future reference we will need to differentiate under the integral sign w.r.t. a parameter.
Theorem III.24 (Differentiation under the integral sign). Let (E, E, µ) be a measure space and U ⊆ R open, and suppose that f : U × E → R is a function satisfying: (i) for all t ∈ U, the function x ↦ f(t, x) is integrable on E; (ii) for all x ∈ E, the function t ↦ f(t, x) is differentiable (on U); and (iii) there exists an integrable function g : E → R so that for all x ∈ E, t ∈ U,
|∂f/∂t (t, x)| ≤ g(x).
Then t ↦ ∫_E f(t, x) dµ(x) is differentiable on U, with
d/dt ∫_E f(t, x) dµ(x) = ∫_E ∂f/∂t (t, x) dµ(x).
Proof. Denote
p(t) := ∫_E f(t, x) dµ(x).
Now fix t0 ∈ U, and let {hn}n≥1 be an arbitrary sequence of numbers, hn ≠ 0, so that hn → 0. Then
(1/hn) (p(t0 + hn) − p(t0)) = ∫_E gn(x) dµ(x),
where gn(x) := (f(t0 + hn, x) − f(t0, x))/hn. By assumption (ii), gn(x) → ∂f/∂t (t0, x) for every x ∈ E, and, by the Mean Value theorem together with assumption (iii), |gn(x)| ≤ g(x), and we may apply the Dominated Convergence theorem to yield
(1/hn) (p(t0 + hn) − p(t0)) = ∫_E gn(x) dµ(x) → ∫_E ∂f/∂t (t0, x) dµ(x).
Since the vanishing sequence hn is arbitrary, the above implies that the limit
lim_{h→0} (1/h) (p(t0 + h) − p(t0))
exists and equals ∫_E ∂f/∂t (t0, x) dµ(x), i.e. p is differentiable at t0 with the asserted derivative.
III.2. Density functions. One way to prescribe a distribution is via the density function, defined
below.
Proposition III.25. Let (E, E , µ) be a measure space, and f : E → R≥0 a non-negative measurable
function. Define the set function ν : E → [0, ∞] as
ν(A) = ∫_E f χ_A dµ = ∫_A f dµ.   (III.5)
Then ν is a measure on (E, E). Moreover, for every measurable g : E → R that is either nonnegative or ν-integrable,
∫_E g dν = ∫_E f g dµ.   (III.6)
Proof. Clearly ν(∅) = 0. If {Ai}i≥1 ⊆ E are pairwise disjoint, then, since χ_{∪_{i=1}^∞ Ai} = Σ_{i=1}^∞ χ_{Ai},
ν(∪_{i=1}^∞ Ai) = ∫_E f · χ_{∪_{i=1}^∞ Ai} dµ = ∫_E f · Σ_{i=1}^∞ χ_{Ai} dµ = Σ_{i=1}^∞ ∫_E f · χ_{Ai} dµ = Σ_{i=1}^∞ ν(Ai),
where we justify the exchange of order of summation and integration using the Monotone Convergence
theorem (or Dominated Convergence theorem, with f the dominating function). That shows that ν is a measure.
For the second assertion (III.6) of Proposition III.25, let g = Σ_{i=1}^m ai χ_{Ai} ∈ S(E) be a simple function.
Then, by the definition of the integral for simple functions, and (III.5),
∫_E g dν = Σ_{i=1}^m ai ν(Ai) = Σ_{i=1}^m ai ∫_E f χ_{Ai} dµ = ∫_E f g dµ.   (III.7)
For g : E → R≥0 nonnegative, let {gn }n≥1 be a sequence of simple functions so that gn ↑ g,
prescribed by Lemma III.4. Then, using (III.7) and monotone convergence (for both gn ↑ g and
gn f ↑ gf), we write
∫_E g dν = lim_{n→∞} ∫_E gn dν = lim_{n→∞} ∫_E f gn dµ = ∫_E f g dµ.
Finally, for g arbitrary integrable, decompose g = g⁺ − g⁻, and apply the assertion to g⁺ and g⁻ separately.
Definition III.26 (Density function). If a measure ν on (E, E) is given by
ν(A) = ∫_E f χ_A dµ   (III.8)
for all A ∈ E, for some nonnegative measurable f : E → R≥0, then f is called a density (function) of ν w.r.t. µ.
Example III.27. The function f(x) = (1/√(2π)) e^{−x²/2} is the probability density function of the standard Gaussian distribution. That is, for Z ∼ N(0, 1) and A ∈ B(R),
P(Z ∈ A) = P_Z(A) = ∫_R f(z) χ_A(z) dz = (1/√(2π)) ∫_A e^{−z²/2} dz.
where FX′ is defined arbitrarily wherever FX is not continuously differentiable. Comparing (III.9) to (III.10) shows that FX′ is the density function of PX w.r.t. the Borel measure (the formula (III.10) extends to A ∈ B(R), since {(−∞, x] : x ∈ R} is a π-system generating B(R)).
Example III.28. Let FX : R → [0, 1] be the distribution function
FX(x) = 0 for x < 0,   FX(x) = x for 0 ≤ x ≤ 1,   FX(x) = 1 for x > 1.
Then FX is continuously differentiable except for x = 0, 1, so it has the density fX(x) = FX′(x) = χ_{[0,1]}(x), i.e. corresponding to the U([0, 1]) distribution. The distribution is: for A ∈ B(R),
PX(A) = P(X ∈ A) = ∫ χ_{[0,1]} χ_A dx = µ([0, 1] ∩ A).
Lemma III.31. Let X be a random variable on a probability space (Ω, F, P) with distribution PX, and let g : R → R be a Borel measurable function which is either nonnegative or such that g(X) is integrable. Then
E[g(X)] = ∫_R g(x) PX(dx).   (III.11)
Proof. First, g(X) is a random variable, since it is a composition of two measurable functions on (Ω, F). In what follows we prove the identity (III.11). First, if g = Σ_{i=1}^m ai χ_{Ai} ∈ S(B(R)) is a simple function, then
g(X) = Σ_{i=1}^m ai χ_{X^{−1}(Ai)} ∈ S(F).
We thereupon have
E[g(X)] = Σ_{i=1}^m ai P(X^{−1}(Ai)) = Σ_{i=1}^m ai PX(Ai) = ∫_R g(x) PX(dx),
by the definition of PX . We can then use our usual strategy of extending the results for nonnegative
measurable functions using Lemma III.3 and monotone convergence, and then to all integrable g = g⁺ − g⁻ by invoking the result for g± separately, and linearity (see e.g. the proof of Proposition III.25). The details are left out.
Lemma III.31 shows that in order to evaluate the expectation of g(X), one can perform the
computation directly in terms of the distribution of X, without invoking the “original” space Ω
and the corresponding probability measure (though g(X) is a random variable on (Ω, F )). It is a
generalisation of the usual transformation of coordinates
∫_a^b g(f(x)) · f′(x) dx = ∫_{f(a)}^{f(b)} g(y) dy,
valid under suitable conditions on f (·). If the probability distribution PX has a density fX w.r.t. µ
(e.g. µ is the Borel measure on R), then one may combine Lemma III.31 with Proposition III.25 to
yield the usual formula
E[g(X)] = ∫_R g(x) PX(dx) = ∫_R g(x) · fX(x) dx.   (III.12)
Example III.32. Let X be an exponential random variable with parameter 1, i.e. its probability density w.r.t. the Borel measure is fX(x) = e^{−x} χ_{[0,∞)}(x). Then, by (III.12), for every measurable g : R → R so that g(X) is integrable,
E[g(X)] = ∫_R g(x) fX(x) dx = ∫_0^∞ g(x) e^{−x} dx.
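A hedged numerical illustration of (III.12) (Python; the choice g(x) = x², the truncation point and the sample size are arbitrary): comparing the integral ∫_0^∞ x² e^{−x} dx = 2 with a Monte Carlo average over exponential samples.

import numpy as np

rng = np.random.default_rng(0)

g = lambda x: x**2                      # illustrative choice of g
x = np.linspace(0.0, 60.0, 600_001)     # truncate the integral at 60 (the tail is negligible)
dx = x[1] - x[0]
integral = np.sum(g(x[:-1]) * np.exp(-x[:-1])) * dx   # ≈ E[g(X)] via (III.12)

samples = rng.exponential(scale=1.0, size=1_000_000)  # X ~ Exp(1)
mc = g(samples).mean()                                # ≈ E[g(X)] by averaging

print(integral, mc)   # both should be close to 2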
Example III.34. Let X ∼ Pois(1) be a Poisson random variable with parameter 1. Recall that its probability measure P is given by (III.3) with λ = 1 (see Example III.12). Let µ0 be a measure on the same measure space (Z≥0, P(Z≥0)) given by µ0 = Σ_{k=0}^∞ δk. Then P has a density function fX(k) = e^{−1}/k!. That is, for every A ∈ P(Z≥0),
P(X ∈ A) = ∫_{Z≥0} fX χ_A dµ0 = Σ_{k∈A} fX(k) = Σ_{k∈A} e^{−1}/k!.
For every g : Z≥0 → R so that E[|g(X)|] < ∞ (equivalently, the summation on the r.h.s. below converges absolutely),
E[g(X)] = ∫_{Z≥0} fX g dµ0 = Σ_{k=0}^∞ fX(k) · g(k) = Σ_{k=0}^∞ (e^{−1}/k!) · g(k).
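A similar sketch for the discrete case (Python; the choice g(k) = k² and the sample size are illustrative assumptions): the truncated series of Example III.34 against a Monte Carlo average over Poisson samples, both close to E[X²] = 2 for X ∼ Pois(1).

import math
import numpy as np

rng = np.random.default_rng(1)
g = lambda k: k**2                      # illustrative choice of g

# Truncated series from Example III.34; the tail beyond k = 60 is negligible.
series = sum(math.exp(-1.0) / math.factorial(k) * g(k) for k in range(60))

# Monte Carlo average over Poisson(1) samples.
mc = g(rng.poisson(lam=1.0, size=1_000_000)).mean()

print(series, mc)     # both close to E[X^2] = 2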
Now we assert the aforementioned uniqueness of the density function, up to sets of µ-measure 0.
Lemma III.35. Let (E, E) be a measurable space, and µ and ν two measures on (E, E). If f and g are two densities of ν w.r.t. µ, then f = g µ-a.e.
Proof. Recall the defining property (III.8) of a density, satisfied by both f and g in place of f. Denote A = {x ∈ E : f(x) > g(x)}, and, applying (III.8) with this given A (and with f or g) yields that
ν(A) = ∫ f · χ_{f>g} dµ = ∫ g · χ_{f>g} dµ.
Then, by Lemma III.7(4), and in light of the fact that the integrand (f − g) · χ_{f>g} ≥ 0 is nonnegative, that forces (f − g) · χ_{f>g} = 0 outside of a set of µ-measure 0, i.e. f ≤ g µ-a.e. An identical argument gives that g ≤ f µ-a.e., so all in all, f = g µ-a.e.
Example III.36. The functions χ_{(0,1]}, χ_{(0,1)} and χ_{[0,1]} are equal outside of sets of Borel measure 0, and are therefore probability densities corresponding to the same distribution. Hence the uniform random variables U((0, 1]), U((0, 1)) and U([0, 1]) have the same distribution given by PU(A) = P(U ∈ A) =
µ((0, 1) ∩ A).
Example III.37. Recall that the exponential distribution has probability density function fX(x) = e^{−x} · χ_{[0,∞)}(x). As an alternative, one can also take f̃X(x) = e^{−x} · χ_{[0,∞)}(x) · χ_{R\Q}(x), since fX = f̃X a.e. w.r.t. the Borel measure.
Example III.38. Let X ∼ Pois(λ) with λ > 0, Y ∼ U((0, 1]) and N ∼ Bin(1, 1/2) be three independent random variables. Set a new random variable
Z = X if N = 0, and Z = Y if N = 1,
i.e. Z is Pois(λ) or U((0, 1]), each with probability 1/2. Its distribution is
PZ = (1/2) PX + (1/2) PY,
where PX and PY are the distributions of X (discrete) and Y (continuous) respectively, hence the distribution of Z is neither continuous nor discrete. Recall that µ0 = Σ_{k=0}^∞ δk is as in Example III.34. It can also be seen from the above that Z has no probability density function w.r.t. either µ0 or the Borel measure.
In general, a measure ν has a density w.r.t. another measure µ on the same measurable space (E, E) if and only if ν is absolutely continuous w.r.t. µ; the proof of this fact (the Radon-Nikodym Theorem) is beyond the scope of this course.
III.3. Transformation of random variables. In this section we are concerned with what happens to the density function of a random variable X : Ω → R as X is transformed to Y = φ(X) : Ω → R for some φ : R → R measurable. In this case Y is a random variable, and its distribution is
PY (A) = P (Y ∈ A) = P (φ(X) ∈ A) = P (X ∈ φ−1 (A)) = PX (φ−1 (A)),
A ∈ B(R). The distribution function of Y is then
FY (t) = P (Y ≤ t) = P (φ(X) ≤ t) = P (X ∈ φ−1 ((−∞, t])) = PX (φ−1 ((−∞, t])); (III.13)
in general it cannot be expressed in terms of FX only, depending on the properties of φ, and what is
φ−1 ((−∞, t]).
Suppose that φ is continuous and strictly increasing, i.e. for x < y, φ(x) < φ(y). Then, in
particular, φ is injective, and φ−1 : φ(R) → R is a continuous and strictly increasing injection on the
image of φ. Therefore, for t ∈ φ(R),
φ−1 ((−∞, t]) = (−∞, φ−1 (t)],
(mind that the φ−1 (·) on the l.h.s. is a pre-image, whereas φ−1 (·) on the r.h.s. is the inverse function).
In this light, if φ is continuous and strictly increasing, then, for t ∈ φ(R) in the image of φ, (III.13)
reads
FY (t) = PX (φ−1 ((−∞, t])) = PX ((−∞, φ−1 (t)]) = FX (φ−1 (t)). (III.14)
Assuming that FX is continuously differentiable with density fX = FX′, and, further, that φ is differentiable at φ^{−1}(t), we have that
fY(t) = FY′(t) = fX(φ^{−1}(t)) / φ′(φ^{−1}(t)),   (III.15)
where we used
(d/dt) φ^{−1}(t) = 1 / φ′(φ^{−1}(t)).
In case φ is not injective, the r.h.s. of (III.15) should be replaced by a summation over x ∈ φ−1 (t)
of expressions of this type, under suitable assumptions on φ (and FX ).
Example III.39. Let U ∼ U(0, 1], and Y = eU (i.e. φ(u) = eu ). Then Y ∈ (1, e] a.s., since U ∈ (0, 1]
a.s. Here φ(u) is strictly increasing, so that in this case both (III.14) and (III.15) are valid. We have
directly
FY(t) = P(e^U ≤ t) = P(U ≤ log t) = log t
on t ∈ (1, e], and
fY(t) = FY′(t) = (1/t) · χ_{(1,e]}(t).
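A quick simulation sketch (Python; the sample size and the test points are arbitrary choices) comparing the empirical distribution function of Y = e^U with FY(t) = log t on (1, e]:

import numpy as np

rng = np.random.default_rng(2)
u = rng.uniform(0.0, 1.0, size=1_000_000)  # U ~ U(0,1) (the endpoints carry probability 0)
y = np.exp(u)                              # Y = e^U

for t in (1.2, 1.5, 2.0, 2.5):
    print(t, np.mean(y <= t), np.log(t))   # empirical F_Y(t) vs log t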
III.4. Product measures. Recall that a measure space (E, E, µ) (or the measure µ) is called σ-finite if E can be written as a countable union E = ∪_{n=1}^∞ En of sets En ∈ E with µ(En) < ∞ for all n.
Example III.42. Any measure space (E, E , µ) so that µ(E) < ∞ is, in particular, σ-finite. That
includes all the probability spaces.
Example III.43. The Borel measure on (R, B(R)) is σ-finite, since
R = ∪_{n=1}^∞ [−n, n]
(say).
∞
Example III.44. The measure space (N, P(N), µ0 ), with µ0 =
P
δn the counting measure µ0 (A) =
( n=1
|A| A finite ∞
S
. Then µ0 is σ-finite, since N = {n}.
∞ A infinite n=1
Example III.45. On the other hand, consider the measure space (R, B(R), ν), with
ν(A) = |A| if A is finite, and ν(A) = ∞ if A is infinite,
the counting measure, this time with the underlying set R being uncountable. Then ν is not σ-finite, since otherwise R = ∪_{i=1}^∞ Ei would be a countable union of finite sets Ei, contradicting R being uncountable.
Now let (E1 , E1 , µ1 ) and (E2 , E2 , µ2 ) be two σ-finite measure spaces (for example, probability
spaces). It is easy to check that the collection
A := {A1 × A2 : A1 ∈ E1 , A2 ∈ E2 }
is a π-system of subsets of E1 × E2 . We define the product σ-algebra, denoted E1 ⊗ E2 to be the
σ-algebra generated by A , i.e.
E1 ⊗ E2 := σ(A ).
We now define the measure µ = µ1 ⊗ µ2 by prescribing it on the elements of A to be given by the
product of measures µ(A1 × A2 ) = µ1 (A1 ) · µ2 (A2 ). The following theorem, whose proof is outside
of the scope of this course, asserts the existence and the uniqueness of such a measure under the
σ-finiteness assumption on both µ1 and µ2 . While the existence of such a measure µ = µ1 ⊗ µ2
does not require such a σ-finiteness assumption, for its uniqueness the σ-finiteness of both µ1 , µ2
is essential. Note that the uniqueness does not follow in this case from Theorem I.13, since the
finiteness of µ(E1 × E2 ) is not assumed.
Theorem III.46. Let µ1 and µ2 be two σ-finite measures on the measurable spaces (E1 , E1 ) and
(E2 , E2 ) respectively. Then there exists a unique measure µ = µ1 ⊗ µ2 on (E1 × E2 , E1 ⊗ E2 ) so that
for all A1 ∈ E1 , A2 ∈ E2 ,
µ(A1 × A2 ) = µ1 (A1 ) · µ2 (A2 ).
Example III.47. Taking (E1 , E1 , µ1 ) = (E2 , E2 , µ2 ) = (R, B(R), µB ) with µ1 = µ2 = µB the Borel
measure on R, the resulting σ-algebra B(R²) = B(R) ⊗ B(R) on R² is generated by either of the π-systems {(−∞, s] × (−∞, t] : s, t ∈ R} or {(a, b] × (c, d] : a < b, c < d}. The product measure µ = µB ⊗ µB on (R², B(R²)), uniquely prescribed by
µ((a, b] × (c, d]) = (b − a) · (d − c),
is the Borel measure on R². It is also possible to define inductively the Borel measure on R^n and the corresponding product σ-algebra B(R^n).
Theorem III.48 (Fubini's theorem). Let (E1, E1, µ1) and (E2, E2, µ2) be two σ-finite measure spaces, and let f : E1 × E2 → R be E1 ⊗ E2-measurable. Then:
(1) If f ≥ 0, then
∫_{E1×E2} f d(µ1 ⊗ µ2) = ∫_{E1} ( ∫_{E2} f(x1, x2) dµ2(x2) ) dµ1(x1) = ∫_{E2} ( ∫_{E1} f(x1, x2) dµ1(x1) ) dµ2(x2),   (III.17)
with the equality understood in the finite or infinite sense (i.e. if one of the three integrals is infinite, then so are the other two).
(2) If one of the three integrals in (III.17), with |f| in place of f, is finite, then so are the other two, f is integrable, and the equality (III.17) holds.
The l.h.s. of (III.17) is called the "double integral", whereas the other two integrals are the "repeated integrals" (in the two different orders). Oftentimes it is easier to test the finiteness of the repeated integrals than that of the double integral. Without the σ-finiteness condition, Theorem III.48 fails decisively (see counter-examples in the home assignment).
The integrability condition in Theorem III.48(2) is that ∫ |f| d(µ1 ⊗ µ2) < ∞; unlike what is asserted in (1) for nonnegative functions, if the integrability condition fails for some f, without the nonnegativity assumption, some of the integrals in (III.17) might be finite while the equality (III.17) fails decisively (or some might be finite while others are infinite).
Example III.49. Let I = ∫_R e^{−x²/2} dx be the important Gaussian integral. Then, by Fubini, and working in polar coordinates,
I² = ∫_{R²} e^{−(x²+y²)/2} dx dy = ∫_0^∞ ∫_0^{2π} r e^{−r²/2} dθ dr = 2π [−e^{−r²/2}]_0^∞ = 2π,
so that I = √(2π).
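A short numerical check (Python; the truncation at |x| = 10 and the grid are arbitrary) that the Gaussian integral is indeed √(2π) ≈ 2.5066:

import math
import numpy as np

x = np.linspace(-10.0, 10.0, 200_001)   # the tails beyond |x| = 10 are negligible
dx = x[1] - x[0]
I = np.sum(np.exp(-x**2 / 2.0)) * dx    # Riemann sum for the Gaussian integral
print(I, math.sqrt(2.0 * math.pi))      # both ≈ 2.5066...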
Example III.50. If (E, E, µ) is a σ-finite measure space, and f : E → [0, ∞) is E-measurable and nonnegative, then, observing the obvious identity f(x) = ∫_0^{f(x)} dt, we change the order of the integration:
∫_E f dµ = ∫_E ∫_0^{f(x)} dt dµ(x) = ∫_0^∞ ∫_{{x : f(x) ≥ t}} dµ(x) dt = ∫_0^∞ µ({x : f(x) ≥ t}) dt.
For instance, if X is a nonnegative random variable with distribution PX, then this recovers the useful formula (with f(x) = x):
E[X] = ∫ x PX(dx) = ∫_0^∞ PX({x : x ≥ t}) dt = ∫_0^∞ P(X ≥ t) dt,
that, if the distribution function FX is continuous, can be expressed in terms of FX only.
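A hedged simulation sketch (Python) of the tail formula for X ∼ Exp(1), where E[X] = 1; the sample size, grid and truncation point are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
samples = np.sort(rng.exponential(scale=1.0, size=200_000))   # X ~ Exp(1), E[X] = 1

t = np.linspace(0.0, 25.0, 2501)
dt = t[1] - t[0]
tail = 1.0 - np.searchsorted(samples, t) / samples.size       # empirical P(X >= t)
print(np.sum(tail[:-1]) * dt, samples.mean())                 # both ≈ E[X] = 1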
Example III.51. If fn : R → R is a sequence of (Borel-)measurable functions, so that
Σ_{n=1}^∞ ∫_R |fn(x)| dx < ∞,
then
∫_R Σ_{n=1}^∞ fn(x) dx = Σ_{n=1}^∞ ∫_R fn(x) dx.
This result follows from applying Fubini's theorem, equating the repeated integrals w.r.t. the measure spaces (R, B(R), µ), with µ the Borel measure, and (N, P(N), µ0), with µ0 = Σ_{n=1}^∞ δn the counting measure, and f(x, n) := fn(x). Here, by the definition, ∫_R g dµ = ∫_R g(x) dx, whereas ∫_N h dµ0 = Σ_{n=1}^∞ h(n), whenever these make sense.
A = {A1 × . . . × An : A1, . . . , An ∈ B(R)},
generating B(R^n), by the definition of B(R^n). Then, since by assumption, X1, . . . , Xn are independent, for every A = A1 × . . . × An,
PX(A) = P(X ∈ A) = P(X1 ∈ A1, . . . , Xn ∈ An) = P(∩_{i=1}^n {Xi ∈ Ai})
= Π_{i=1}^n P(Xi ∈ Ai) = Π_{i=1}^n P_{Xi}(Ai) = (P_{X1} ⊗ . . . ⊗ P_{Xn})(A),   (III.18)
by the defining property of the product measure (see Theorem III.46). It then forces PX = PX1 ⊗
. . . ⊗ PXn on the whole of B(Rn ), by the uniqueness part of Theorem III.46, i.e. PX1 ⊗ . . . ⊗ PXn
is the only measure satisfying (III.18) for all A1 × . . . × An ∈ A (alternatively, by Theorem I.13, whose assumptions are valid, since B(R^n) is generated by the π-system A, and we are dealing with probability measures).
(b) ⇒ (c): By using part (b) and Fubini, we get that
E[Π_{i=1}^n fi(Xi)] = ∫_{R^n} ( Π_{i=1}^n fi(xi) ) dPX(x1, . . . , xn) = Π_{i=1}^n ∫_R fi(xi) P_{Xi}(dxi) = Π_{i=1}^n E[fi(Xi)].
(c) ⇒ (a): Given A1, . . . , An ∈ B(R), we set fi = χ_{Ai}, and employ part (c), while observing that χ_{A1×...×An}(x1, . . . , xn) = Π_{i=1}^n χ_{Ai}(xi), to yield:
P(X1 ∈ A1, . . . , Xn ∈ An) = E[χ_{A1×...×An}(X1, . . . , Xn)] = E[Π_{i=1}^n χ_{Ai}(Xi)]
= Π_{i=1}^n E[χ_{Ai}(Xi)] = Π_{i=1}^n P(Xi ∈ Ai).
will be sufficient for all our needs. The conditions of Theorem IV.1 are satisfied, for example, in case the random variables Xi are i.i.d., so long as Var(X1) < ∞; in this case a much stronger result is applicable (cf. Theorem II.22), which will be proved under an extra (superfluous) assumption on the boundedness of the 4th moment.
Theorem IV.2 (Strong law of large numbers (SLLN)). Let X1, X2, . . . be i.i.d. random variables so that for all i ≥ 1, E[Xi] = µ < ∞, and
M := E[X1⁴] < ∞.   (IV.2)
Then
X̄n := (X1 + . . . + Xn)/n → µ
a.s. as n → ∞.
In what follows we prove theorems IV.1-IV.2. To this end we require a couple of lemmas.
Lemma IV.3 (Markov's inequality). Let X ≥ 0 be an a.s. nonnegative random variable so that E[X] < ∞. Then, for t > 0,
P(X ≥ t) ≤ E[X]/t.
Proof. Fix a number t > 0, and write the inequality t · χX≥t ≤ X, which is satisfied trivially. Now,
integrating both sides w.r.t. the measure P gives
tP (X ≥ t) ≤ E[X],
which yields the stated inequality.
Lemma IV.4 (Chebyshev's inequality). Let X be a random variable, with finite mean and variance: E[X] = µ < ∞, and σ² = Var(X) < ∞. Then for t > 0,
P(|X − µ| ≥ tσ) ≤ 1/t².
(Chebyshev's inequality is only nontrivial for t > 1.)
Proof. Set Y = X − µ, so that E[Y²] = σ², and, by Markov's inequality (note that Y² ≥ 0),
P(|X − µ| ≥ tσ) = P(|Y| ≥ tσ) = P(Y² ≥ t²σ²) ≤ E[Y²]/(t²σ²) = Var(X)/(t²σ²) = 1/t².
We are finally in a position to prove Theorems IV.1-IV.2.
Proof of Theorem IV.1. Using the independence of the {Xi}i≥1, we obtain the inequality
Var(X̄n) = (1/n²) Var(X1 + . . . + Xn) = (1/n²) Σ_{i=1}^n Var(Xi) ≤ σ²·n/n² = σ²/n,
by (IV.1); importantly, it shows that, as n → ∞,
Var(X̄n) → 0,
which will be sufficient for the convergence in probability (via Chebyshev's inequality). Now, employing Chebyshev's inequality, for every ε > 0, we have
P(|X̄n − µ| ≥ ε) ≤ Var(X̄n)/ε² ≤ σ²/(n · ε²) → 0,
which is the defining property of convergence in probability of X̄n to µ.
Proof of Theorem IV.2. We first aim to reduce to the case when E[Xi ] = 0. Indeed, the random
variables Yi := Xi − µ satisfy that
Yi⁴ ≤ (|Xi| + |µ|)⁴ ≤ (2 max{|Xi|, |µ|})⁴ ≤ 2⁴ max{Xi⁴, µ⁴} ≤ 2⁴ (Xi⁴ + µ⁴),
so that E[Yi⁴] ≤ 2⁴ (M + µ⁴) < ∞, and (Y1 + . . . + Yn)/n → 0 a.s. if and only if X̄n → µ a.s. Hence all the
assumptions of Theorem IV.2 hold with Yi in place of Xi (with M substituted by another constant).
In light of the above, we may assume, as we do from this point on, that for every i ≥ 1, E[Xi ] = 0.
Now, since the existence (and finiteness) of the higher moments implies the existence of the lower moments (HW8, Q1(b)), our assumption E[Xi⁴] ≤ M < ∞ also implies that Xi, Xi² and Xi³ are all integrable. Moreover, by the Cauchy-Schwarz inequality, for all i, j ≥ 1,
E[Xi² · Xj²] ≤ E[Xi⁴]^{1/2} · E[Xj⁴]^{1/2} ≤ M,   (IV.3)
thanks to (IV.2) (recall that the Xi are i.i.d.). Since the {Xi}i≥1 are all independent with mean E[Xi] = µ = 0, it follows that for all distinct (in pairs) indexes i, j, k, ℓ,
E[Xi · Xj³] = E[Xi · Xj · Xk²] = E[Xi · Xj · Xk · Xℓ] = 0.   (IV.4)
Now denote Sn := X1 + . . . + Xn, and expand E[Sn⁴] as the sum of the expectations of all products of four of the Xi; only the n terms E[Xi⁴] and the 6 · n(n−1)/2 terms E[Xi² · Xj²] with i < j survive, since all the other terms will vanish, by (IV.4). It then follows from (IV.2) and (IV.3), that
E[Sn⁴] ≤ n · M + 6 · (n(n−1)/2) · M ≤ 3n²M.
Hence,
E[Σ_{n=1}^∞ (Sn/n)⁴] ≤ 3M · Σ_{n=1}^∞ 1/n² < ∞.
Denote the random variable
Y := Σ_{n=1}^∞ (Sn/n)⁴ ≥ 0.
The above argument demonstrates that E[Y] < ∞, hence P(Y < ∞) = 1 (as otherwise E[|Y|] would be infinite). That is, the defining series of Y is convergent a.s., hence the corresponding summands vanish at infinity a.s., i.e. Sn⁴/n⁴ → 0 as n → ∞ a.s., so X̄n = Sn/n → 0 a.s., as asserted by Theorem IV.2.
Example IV.5. In statistics, oftentimes one has independent observations X1, X2, . . . of the same random variable X ∼ Xi, whose mean µ = E[X] = E[Xi] or variance σ² = Var(X) = Var(Xi) we aim to estimate. For example, one can approximate µ with the sample average, µ ≈ X̄n. By the SLLN, the sample mean and sample variance satisfy
X̄n → µ,   Sn² := (1/(n−1)) Σ_{i=1}^n (Xi − X̄n)² → σ²   a.s.
Thus the SLLN provides a means to estimate the mean and the variance of the distribution, but not an estimate for the confidence interval. A 95%-confidence interval for µ, based on the Central Limit Theorem below, is of the form
[X̄n − 1.96 · σ/√n, X̄n + 1.96 · σ/√n],
requiring an a priori knowledge of the variance σ² = Var(X), see Example IV.32 below. A confidence interval depends on the variance of the underlying distribution, whereas the SLLN neither depends on the variance nor gives quantitative estimates of how good the approximations are.
Example IV.6. Let X be a random variable, A ∈ B(R), and suppose we would like to estimate the probability P(X ∈ A) based on sample copies X1, X2, . . . of X (which are themselves i.i.d. random variables). Then, by the SLLN,
#{1 ≤ i ≤ n : Xi ∈ A}/n = (1/n) Σ_{i=1}^n χ_A(Xi) → E[χ_A(X1)] = P(X ∈ A)   a.s.
This situation is typical in Bayesian statistics, where the probabilities are not given, but are estimated from the samples.
Example IV.7 ("Monte-Carlo"). Let f : [0, 1] → R, and suppose we would like to numerically approximate the integral ∫_0^1 f(x) dx. One would then discretize the interval [0, 1] into a large number N of sub-intervals of length 1/N, and use the function values in each sub-interval to approximate the integral (e.g. Riemann sums). There are some more sophisticated methods that can improve the rate of convergence (such as the trapezoid method etc.), but the ideas are similar. Alternatively, let U1, U2, . . . ∼ U([0, 1]) be i.i.d. uniform random variables, and set Xi = f(Ui). Then
E[Xi] = E[f(U)] = ∫_0^1 f(x) dx,
so that, by the SLLN (for f measurable with E[|f(U)|] < ∞), the sample averages (1/n) Σ_{i=1}^n f(Ui) converge a.s. to the integral.
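A minimal Monte Carlo sketch (Python); the integrand f(x) = sin(πx) and the sample size are illustrative assumptions, with exact integral 2/π:

import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(np.pi * x)            # illustrative integrand; exact integral is 2/pi

u = rng.uniform(0.0, 1.0, size=1_000_000)  # U_i ~ U([0,1]) i.i.d.
print(f(u).mean(), 2.0 / np.pi)            # Monte Carlo estimate vs exact value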
IV.2. Characteristic function. Given a finite measure µ on (R^n, B(R^n)), its Fourier transform is defined as
µ̂(u) := ∫_{R^n} e^{i⟨u,x⟩} µ(dx),
u ∈ R^n.
By the above, the Fourier transform is well-defined as a function µ b : Rn → C, for every finite
measure µ. It enjoys the following properties:
(1) The value µ̂(−u) equals the complex conjugate of µ̂(u). This follows from the identity that the complex conjugate of e^{it} is e^{−it}, t ∈ R.
(2) Given an R^n-valued random variable X, its characteristic function is defined as φX(u) := E[e^{i⟨u,X⟩}], u ∈ R^n.
By Lemma III.31,
φX(u) = ∫_{R^n} e^{i⟨u,x⟩} PX(dx) = P̂X(u),
i.e. the characteristic function is the Fourier transform of the distribution PX of X. If, further, PX has a density fX w.r.t. the Borel measure, then, by (III.12),
φX(u) = ∫_{R^n} e^{i⟨u,x⟩} fX(x) dx = f̂X(u),
the "usual" Fourier transform of the function fX.
Example IV.10 (Standard Gaussian random variable). A random variable X on R^n is standard Gaussian, if for every A ∈ B(R^n),
P(X ∈ A) = (1/(2π)^{n/2}) ∫_A e^{−‖x‖²/2} dx,
with ‖x‖ = ‖(x1, . . . , xn)‖ = (Σ_{i=1}^n xi²)^{1/2} the standard Euclidean norm. Let us compute its characteristic function for n = 1; for higher n, the variables separate, and the characteristic function is merely the product of the ones in each variable.
Since E[|X|] < ∞, Theorem III.24 implies that φX is differentiable, and we can differentiate (IV.5) under the integral sign in this case, to yield:
(d/du) φX(u) = (d/du) E[e^{iuX}] = E[iX e^{iuX}] = (1/√(2π)) ∫_R (i e^{iux}) · x e^{−x²/2} dx.
Now we use integration by parts to write
(d/du) φX(u) = (1/√(2π)) [i e^{iux} · (−e^{−x²/2})]_{x=−∞}^{∞} − (1/√(2π)) u ∫_R e^{iux} e^{−x²/2} dx = −u φX(u).
Solving this differential equation with the initial condition φX(0) = 1 yields φX(u) = e^{−u²/2}.
Hence the characteristic function of the standard Gaussian is equal, up to a constant, to its density.
This is the underlying reason for the frequent occurrence of the Gaussian distribution in nature and
many aspects of life. The characteristic function of a random variable uniquely determines the
distribution, and hence, φX encodes the full information on PX . If φX is integrable, then we can
obtain an explicit inversion formula.
Theorem IV.11. The distribution PX of an R^n-valued random variable X is uniquely determined by its characteristic function φX. Moreover, if φX is integrable (this means that ∫_{R^n} |ℜ(φX)| du < ∞ and ∫_{R^n} |ℑ(φX)| du < ∞), then X has a density function fX, given by
fX(x) = (F*φX)(x) := (1/(2π)^n) ∫_{R^n} φX(u) e^{−i⟨u,x⟩} du.   (IV.6)
The operator F* is called the inverse Fourier transform.
Before giving a proof of Theorem IV.11 we will require a lemma. For t > 0, x, y ∈ R^n define the heat kernel
p(t, x, y) := (1/(2πt)^{n/2}) e^{−‖x−y‖²/(2t)}.
For t > 0, x ∈ R^n fixed, p(t, x, ·) is the density of N(x, tIn), the Gaussian random n-variable with mean x and covariance matrix t · In, as asserted by the following lemma.
Lemma IV.12. Let Z be a standard Gaussian random variable on R^n, x ∈ R^n, and t > 0. Then:
(1) The random variable x + √t · Z has density function p(t, x, ·) on R^n.
(2) For all y ∈ R^n,
p(t, x, y) = (1/(2π)^n) ∫_{R^n} e^{i⟨u,x⟩} e^{−t‖u‖²/2} e^{−i⟨u,y⟩} du.   (IV.7)
(2): Let X be a standard Gaussian random variable on R, and for w ∈ R and t > 0, we write
∫_R e^{iwv} (1/√(2πt)) e^{−v²/(2t)} dv = E[e^{iw√t X}] = φX(w√t) = e^{−w²t/2},   (IV.8)
since φX(u) = e^{−u²/2} by Example IV.10. We substitute w = (xk − yk)/t, and transform the variables uk = v/t, so that (IV.8) reads
∫_R e^{i(xk−yk)uk} (√t/√(2π)) e^{−t uk²/2} duk = ∫_R e^{i(xk−yk)v/t} (1/√(2πt)) e^{−v²/(2t)} dv = e^{−(xk−yk)²/(2t)}.
By a standard manipulation, we can rearrange
(1/(2π)) ∫_R e^{i uk xk} e^{−t uk²/2} e^{−i uk yk} duk = (1/√(2πt)) e^{−(xk−yk)²/(2t)},
and taking the product over k = 1, . . . , n yields (IV.7).
Proof of Theorem IV.11. Let X be the given random variable. In what follows we are going to argue that one can express E[g(X)] in terms of φX, for every continuous bounded function g : R^n → R. We will then approximate from below the indicator χA of any
A = Π_{k=1}^n (ak, bk],   (IV.9)
i.e. of the elements of the π-system generating B(R^n), by such functions g, and that will be sufficient to prescribe PX(A) in terms of φX, which, in turn, is sufficient to prescribe PX, by the Uniqueness Theorem I.13.
Now, let Z be a standard Gaussian random variable on R^n, independent of X. Then, for every g : R^n → R continuous and bounded, and every t > 0, we have by Fubini's theorem (and Lemma III.31, which requires the integrability of g(X + √t Z); this is why we assume that g is bounded in the first place, as boundedness certainly yields the said integrability):
E[g(X + √t Z)] = ∫_{R^n×R^n} g(x + √t z) (dPX ⊗ dPZ)(x, z) = ∫_{R^n} ( ∫_{R^n} g(x + √t z) (1/(2π)^{n/2}) e^{−‖z‖²/2} dz ) PX(dx)
= ∫_{R^n} E[g(x + √t Z)] PX(dx),   (IV.10)
and, by Lemma IV.12, for every fixed x ∈ R^n,
E[g(x + √t Z)] = ∫_{R^n} g(y) p(t, x, y) dy = ∫_{R^n} g(y) · (1/(2π)^n) ∫_{R^n} e^{i⟨u,x⟩} e^{−t‖u‖²/2} e^{−i⟨u,y⟩} du dy.   (IV.11)
Combining (IV.10) with (IV.11), and exchanging the order of integration (Fubini again, with the x-integral performed first, which produces φX), we obtain
E[g(X + √t Z)] = (1/(2π)^n) ∫_{R^n} ∫_{R^n} g(y) φX(u) e^{−t‖u‖²/2} e^{−i⟨u,y⟩} du dy.   (IV.12)
Since g is continuous and bounded, g(X + √t Z) → g(X) a.s. along any sequence t → 0, so, by Dominated Convergence, the l.h.s. of (IV.12) converges to E[g(X)]; in particular, E[g(X)] is determined by φX for every continuous bounded g. Now, given A of the form (IV.9), it is possible to find a sequence {gn} of continuous functions gn, so that gn(x) → χA(x) for a.e. x ∈ R^n, and, in addition, |gn| ≤ 1. Then, by Dominated Convergence,
E[gn(X)] → E[χA(X)] = PX(A),
which is sufficient for determining the full distribution PX.
Now we aim to prove (IV.6). If, as we assume, φX is integrable, then for every continuous g that is, in addition, compactly supported (i.e. g vanishes outside of [−R, R]^n for R sufficiently large; in particular, g is bounded),
∫_{R^n} ∫_{R^n} |φX(u)| · |g(y)| du dy = ∫_{R^n} |φX(u)| du · ∫_{R^n} |g(y)| dy < ∞,
and the function f(u, y) := |φX(u)| · |g(y)| dominates the integrand on the r.h.s. of (IV.12), for every t > 0. Thus, by Fubini, and the Dominated Convergence theorem (applied to the double integral w.r.t. the product measure du ⊗ dy), we may take t → 0 on the r.h.s. of (IV.12), which (bearing in mind that, by the above, the l.h.s. converges to E[g(X)], cf. (III.12)) then reads
E[g(X)] = ∫_{R^n} g(y) ( (1/(2π)^n) ∫_{R^n} e^{−i⟨u,y⟩} φX(u) du ) dy,
valid for all g continuous and compactly supported. Finally, we can extend this for g = χA for A
as in (IV.9), and this is certainly sufficient to yield that (IV.6) is a density function of X, again by
appealing to the Uniqueness Theorem I.13.
As a consequence of Theorem IV.11, the characteristic functions play an important role in identifying distributions. For instance, if X and Y are independent random variables, then
φ_{X+Y}(u) = E[e^{i⟨u,X+Y⟩}] = E[e^{i⟨u,X⟩} · e^{i⟨u,Y⟩}] = φX(u) · φY(u),
a simple rule for computing the characteristic function of a sum of independent random variables (which is far less obvious at the level of distributions or density functions).
Example IV.13 (Poisson random variables). Let X ∼ Pois(λ) for some λ ≥ 0. Then
φX(u) = E[e^{iuX}] = Σ_{k=0}^∞ e^{iuk} · P(X = k) = Σ_{k=0}^∞ e^{iuk} · (λ^k/k!) e^{−λ} = e^{−λ} Σ_{k=0}^∞ (e^{iu}λ)^k/k!
= e^{−λ} · exp(e^{iu}λ) = exp(e^{iu}λ − λ) = exp(λ(e^{iu} − 1)).
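A hedged numerical check (Python; λ = 2.5, the test points u and the sample size are arbitrary choices) of the formula of Example IV.13:

import numpy as np

rng = np.random.default_rng(5)
lam = 2.5
x = rng.poisson(lam=lam, size=1_000_000)          # X ~ Pois(lambda)

for u in (0.3, 1.0, 2.0):
    empirical = np.mean(np.exp(1j * u * x))        # Monte Carlo estimate of E[e^{iuX}]
    exact = np.exp(lam * (np.exp(1j * u) - 1.0))   # formula from Example IV.13
    print(u, empirical, exact)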
Moments, whenever finite, can be read off the derivatives of the characteristic function at the origin: if E[|X|^k] < ∞ for some k ≥ 1, then φX is k times differentiable, and φX^{(k)}(0) = i^k E[X^k].
Proof. By the finiteness of E[|X|^k], we can differentiate k times under the integral (or expectation) sign, thanks to Theorem III.24:
(d^k/du^k) φX(u) = (d^k/du^k) E[e^{iuX}] = E[(iX)^k · e^{iuX}],
which at u = 0 reads
(d^k/du^k) φX(u) |_{u=0} = i^k E[X^k],
as required.
Example IV.16. Recall that we computed the characteristic function of the standard Gaussian X ∼ N(0, 1) to be φX(u) = e^{−u²/2}, see Example IV.10. Then φX′(u) = −u e^{−u²/2}, so that φX′(0) = 0 = i E[X] and E[X] = 0, and φX″(u) = u² e^{−u²/2} − e^{−u²/2}, so that φX″(0) = −1 = −E[X²], and Var(X) = E[X²] = 1.
IV.3. Gaussian random variables. A Gaussian random variable X ∼ N(µ, σ²) with mean µ ∈ R and variance σ² > 0 has the density function
fX(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)}
on x ∈ R. The degenerate case X = µ a.s. corresponds to σ² = 0.
Lemma IV.17 (Homework assignment). Let X ∼ N (µ, σ 2 ). Then:
(1) The expectation of X is E[X] = µ.
(2) The variance of X is Var(X) = σ 2 .
which we identify as the characteristic function of N(0, ‖u‖²). Hence, in light of the uniqueness of
Since vi^T vj = δij (recall that {vi}_{i=1}^n is an orthonormal basis of R^n), the matrix V^{1/2} satisfies
V^{1/2} · V^{1/2} = Σ_{i=1}^n Σ_{j=1}^n √(λi λj) vi vi^T vj vj^T = Σ_{i=1}^n λi vi vi^T = V,   (IV.14)
as expected.
Theorem IV.20. Let X be a Gaussian random variable on Rn . Then:
(1) For every n × n matrix A and b ∈ Rn , AX + b is Gaussian.
(2) The distribution of X is fully determined by µ = E[X] and V = Cov(X), denoted X ∼
N (µ, V ).
(3) The characteristic function of X is
φX(u) = exp( i⟨u, µ⟩ − (1/2)⟨u, V u⟩ ),
u ∈ R^n.
(4) If V is nonsingular, then X has a density on R^n given by
fX(x) = (1/((2π)^{n/2} · √(det V))) exp( −(1/2) (x − µ)^T V^{−1} (x − µ) ),
x ∈ R^n.
(5) If X = (Y, Z) ∈ R^n, with Y ∈ R^m and Z ∈ R^p, m + p = n, and the covariance matrix of X has the block-diagonal structure
Cov(X) = ( Cov(Y)  0 ; 0  Cov(Z) ),
then Y and Z are independent.
Proof. Recall that for u, v ∈ R^n and A an n × n real-valued matrix we have ⟨u, v⟩ = u^T v and (Av)^T = v^T A^T.
(1) For u ∈ R^n,
⟨u, AX + b⟩ = u^T A X + u^T b = ⟨A^T u, X⟩ + ⟨u, b⟩,
which is Gaussian by Lemma IV.17(3), since ⟨A^T u, X⟩ is Gaussian by assumption.
(2) Follows from (3) below and Theorem IV.11.
(3) For u ∈ R^n,
E[⟨u, X⟩] = u^T E[X] = u^T µ = ⟨u, µ⟩,
and
Var(⟨u, X⟩) = E[(u^T X − u^T µ)²] = E[(u^T X − u^T µ) · (u^T X − u^T µ)^T]
= E[(u^T X − u^T µ) · (X^T u − µ^T u)] = E[u^T (X − µ) · (X^T − µ^T) u]
= u^T E[(X − µ) · (X^T − µ^T)] u = u^T V u = ⟨u, V u⟩.
Hence, by the Gaussianity assumption, ⟨u, X⟩ ∼ N(⟨u, µ⟩, ⟨u, V u⟩), and
φX(u) = E[e^{i⟨u,X⟩}] = φ_{⟨u,X⟩}(1) = exp( i⟨u, µ⟩ − (1/2)⟨u, V u⟩ ),
thanks to Lemma IV.17(4).
(4) Let Y1, . . . , Yn be i.i.d. standard Gaussians, and Y = (Y1, . . . , Yn), an R^n-valued Gaussian random variable. Then Y has the density
fY(y) = Π_{k=1}^n f_{Yk}(yk) = (1/(2π)^{n/2}) e^{−‖y‖²/2},   (IV.15)
Y = V^{−1/2}(X̃ − µ).
Theorem IV.20 shows, by induction, that if the R^n-valued random variable X = (X1, . . . , Xn) is (jointly) Gaussian, then the random variables X1, . . . , Xn are independent if and only if the covariance matrix V = Cov(X) is diagonal. The joint Gaussianity assumption of this result is essential, as merely the marginal distributions being Gaussian is not sufficient for the independence conclusion to hold, as illustrated in the following example.
Example IV.21. Let X ∼ N(0, 1) be a standard Gaussian random variable, fix a number a > 0, and define
Y = X if |X| < a, and Y = −X if |X| ≥ a.
On the one hand, by the symmetry of the Gaussian distribution around the origin, −X ∼ N (0, 1).
Therefore, for every Borel set A ∈ B(R),
P (Y ∈ A) = P (Y ∈ A ∩ (−a, a)) + P (Y ∈ A ∩ (R \ (−a, a)))
= P (X ∈ A ∩ (−a, a)) + P (−X ∈ A ∩ (R \ (−a, a)))
= P (X ∈ A ∩ (−a, a)) + P (X ∈ A ∩ (R \ (−a, a))) = P (X ∈ A),
and therefore Y ∼ N (0, 1) too.
But, on the other hand,
P(X + Y = 0) = P(X = −Y) = P(|X| ≥ a) ∈ (0, 1),
and therefore X + Y = ⟨(1, 1), (X, Y)⟩ is not a Gaussian random variable (as otherwise P(X + Y = 0) would either vanish, or be equal to 1 in case it is degenerate), which means that the R²-valued random variable (X, Y) is not Gaussian. It is even possible to choose a > 0 so that Cov(X, Y) = 0, but this does not contradict Theorem IV.20(5), as that result assumes that (X, Y) is (jointly) Gaussian, which we know it is not.
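A small simulation sketch (Python; the threshold a = 1 and the sample size are arbitrary choices) of the construction of Example IV.21: Y is marginally standard Gaussian, yet X + Y has an atom at 0, so (X, Y) cannot be jointly Gaussian.

import numpy as np

rng = np.random.default_rng(6)
a = 1.0                                   # any fixed threshold a > 0
x = rng.standard_normal(2_000_000)
y = np.where(np.abs(x) < a, x, -x)        # the construction of Example IV.21

print(y.mean(), y.var())                  # ≈ 0 and 1: Y is (marginally) standard Gaussian
print(np.mean(x + y == 0.0))              # ≈ P(|X| >= a) > 0: X + Y has an atom at 0,
                                          # so (X, Y) is not jointly Gaussian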
Example IV.22. Let X ∼ N(0, 1) and Y = 0 a.s. (a degenerate Gaussian with Var(Y) = 0 and E[Y] = 0). Then, for every u = (u1, u2) ∈ R², u1 X + u2 Y = u1 X ∼ N(0, u1²), so that Z = (X, Y) is Gaussian with mean µ = (0, 0), and singular covariance matrix
Cov(Z) = ( 1  0 ; 0  0 ).
In this light (since the covariance is singular), (X, Y) has no density w.r.t. the Borel measure on R². Finally, X and Y are independent, manifested by Cov(X, Y) = 0 in the top right corner of the covariance matrix.
IV.4. Convergence in distribution and Central Limit Theorem. Suppose that X and Y = −X are two random variables defined on the same probability space (Ω, F, P). If, further, X ∼ N(0, 1), i.e., X has the standard Gaussian distribution, then so does Y. However, for ε > 0, the probability P(|X − Y| > ε) is rather large, so, despite having identical distributions, X and Y are not close in probability, nor is a.s. convergence applicable here. We therefore need a notion that detects the proximity of the distributions of X and Y regardless of their realizations on the probability space, even if X and Y are defined on different probability spaces.
Definition IV.23 (Convergence in distribution). Let X1 , X2 , . . . be a sequence of real-valued ran-
dom variables with distribution functions FX1 , FX2 , . . ., and let X be another real-valued random
variable, with distribution function FX. The random variables Xn → X in distribution (denoted Xn →d X), if
FXn(x) → FX(x)
for every continuity point x ∈ R of FX (i.e. if FX is continuous at x ∈ R, then FXn(x) → FX(x)).
In particular, the definition of convergence in distribution does not require the random variables to
be defined on the same probability space, and only involves their distribution functions (unlike con-
vergence a.s. or in probability). Even though we speak about convergence of the random variables,
what is meant is the convergence of the associated probability distributions PXn to a limit distribu-
tion PX . The requirement of convergence for continuity points only is important or non-trivial, as
demonstrated in Example IV.25 below.
Example IV.24. Let Xn have the distribution function
FXn(x) = 0 for x < 0,   FXn(x) = 1 − (1 − x/n)^n for 0 ≤ x < n,   FXn(x) = 1 for x ≥ n.
Then, for all x ∈ R, as n → ∞,
lim_{n→∞} FXn(x) = 1 − e^{−x} for x ∈ [0, ∞), and lim_{n→∞} FXn(x) = 0 otherwise,
which is identified as the distribution function of the exponential random variable X ∼ exp(1).
Example IV.25. Let Un ∼ U((0, 1/n)) be a sequence of random variables uniformly distributed in the interval (0, 1/n). Their corresponding distribution functions are
FUn(x) = 0 for x ≤ 0,   FUn(x) = nx for 0 < x < 1/n,   FUn(x) = 1 for x ≥ 1/n.
For every x ≠ 0, FUn(x) → FU(x), where U = 0 a.s. is the constant random variable with distribution function FU = χ_{[0,∞)}, whereas at x = 0, FUn(0) = 0 does not converge to FU(0) = 1. However, since FU is not continuous at x = 0, the convergence is not imposed at x = 0, and therefore, Un → U in distribution indeed.
Example IV.26. If Xn ∼ δn, i.e. Xn = n a.s., then
FXn(x) = 0 for x < n, and FXn(x) = 1 for x ≥ n.
The limit is FXn(x) → 0 for every x ∈ R, so F(x) ≡ 0 is a candidate for a limit distribution function. However, since F ≡ 0 is not a distribution function of any random variable, Xn does not converge in distribution to a limit.
Example IV.27. Let Xn ∼ Bin(1, 1/2 + 1/n) (n ≥ 2), i.e. a single coin toss with probability of heads 1/2 + 1/n. Then
FXn(x) = 0 for x < 0,   FXn(x) = 1/2 − 1/n for 0 ≤ x < 1,   FXn(x) = 1 for x ≥ 1.
Then, for every x ∈ R, FXn(x) → FX(x), where FX is the distribution function of X ∼ Bin(1, 1/2), so Xn → X in distribution.
Proposition IV.28. If X and {Xn }n≥1 are random variables, all defined on the same probability
space (Ω, F , P ), and Xn → X in probability, then Xn → X in distribution.
Proof. Let x ∈ R be a continuity point of FX, and let ε > 0. Then, by the said continuity, there exists a δ > 0, so that
FX(x − δ) > FX(x) − ε/2   (IV.16)
and
FX(x + δ) < FX(x) + ε/2   (IV.17)
(recall that FX is monotone increasing).
Since Xn → X in probability, there exists a number N ≥ 1 sufficiently large, so that for all n ≥ N,
P(|Xn − X| > δ) < ε/2.   (IV.18)
Therefore,
FXn(x) = P(Xn ≤ x) = P(Xn ≤ x, X ≤ x + δ) + P(Xn ≤ x, X > x + δ)
≤ P(X ≤ x + δ) + P(|Xn − X| > δ) ≤ (FX(x) + ε/2) + ε/2 = FX(x) + ε,   (IV.19)
by (IV.17) and (IV.18), and
FX(x − δ) = P(X ≤ x − δ) = P(X ≤ x − δ, Xn ≤ x) + P(X ≤ x − δ, Xn > x)
≤ P(Xn ≤ x) + P(|Xn − X| > δ) < FXn(x) + ε/2,
by (IV.18).
Combining this with (IV.16), we have
FXn(x) > FX(x − δ) − ε/2 > FX(x) − ε.   (IV.20)
Finally, (IV.19) together with (IV.20) yield
FXn(x) − ε ≤ FX(x) < FXn(x) + ε,
i.e. |FXn(x) − FX(x)| ≤ ε, holding for all n ≥ N, which, since ε > 0 was arbitrary, implies that FXn(x) → FX(x), as asserted.
If Xn →d X, and the distribution function FX of X is continuous everywhere, then it means that for all x ∈ R, FXn(x) → FX(x). It then follows that, for every −∞ ≤ a < b ≤ +∞,
P(a < Xn ≤ b) = FXn(b) − FXn(a) → FX(b) − FX(a) = P(a < X ≤ b).   (IV.21)
In fact, the limit (IV.21) could be made uniform w.r.t. x ∈ R (again, under the continuity assumption of FX on x ∈ R), i.e.
sup_{x∈R} |FXn(x) − FX(x)| → 0.
The condition E[f(Xn)] → E[f(X)] for all bounded continuous f : R → R is known as weak convergence of the (probability) measures PXn to PX (equivalent to Xn →d X), rather than convergence of the random variables themselves; the condition (3) of Theorem IV.30 is merely pointwise convergence.
Theorem IV.31 (Central Limit theorem (CLT)). Let {Xn}n≥1 be a sequence of i.i.d. random variables defined on the same probability space, of mean zero and variance 1. Then (X1 + . . . + Xn)/√n →d Z with Z ∼ N(0, 1), i.e., for all −∞ ≤ a < b ≤ +∞,
P( a ≤ (X1 + . . . + Xn)/√n ≤ b ) → (1/√(2π)) ∫_a^b e^{−y²/2} dy,
as n → ∞.
Proof. Let
φ(u) = φ_{X1}(u) = E[e^{iuX1}]   (IV.22)
be the characteristic function of X1 (and thus of all the Xn). Since, by assumption, E[X1²] = Var(X1) = 1 < ∞, we can differentiate (IV.22) under the integral (expectation) sign twice to obtain
φ(0) = E[e^0] = 1,   φ′(0) = E[iX1] = 0,   φ″(0) = E[(iX1)²] = −1.
Hence, by Taylor expanding φ(·) at the origin, we have
φ(u) = 1 − u²/2 + o_{u→0}(u²)   (IV.23)
(that means that (φ(u) − (1 − u²/2))/u² → 0 as u → 0).
A few remarks are due. If X1, X2, . . . are i.i.d. with mean E[X1] = µ and Var(X1) = σ² (instead of mean 0 and unit variance), then the Central Limit theorem is applicable to Zn = (Xn − µ)/σ, so that
(1/√n) Σ_{i=1}^n Zi = ( (1/n) Σ_{i=1}^n Xi − µ ) / (σ/√n) →d N(0, 1).
Hence, for large n, the distribution of (1/n) Σ_{i=1}^n Xi is approximately that of µ + (σ/√n) Z, with Z ∼ N(0, 1).
Here the integrability conditions E[|X1|] < ∞ and E[X1²] < ∞ are important for the CLT to hold. For i.i.d. Cauchy distributed random variables, with the density function f(x) = 1/(π(1 + x²)), one can compute (Homework assignment) the characteristic function φ_{X1}(u) = e^{−|u|}, and then φ_{X̄n}(u) = φ_{X1}(u/n)^n = e^{−|u|}. Hence, the sample mean X̄n is Cauchy distributed for every n ≥ 1, and the Central Limit theorem does not hold in this case (which is not a contradiction, since the sufficient conditions are not satisfied).
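A brief simulation sketch (Python; the repetition counts and sample sizes are arbitrary) contrasting the two cases: for Gaussian data the sample mean concentrates at rate 1/√n, whereas for Cauchy data the sample mean is again standard Cauchy and does not concentrate at all.

import numpy as np

rng = np.random.default_rng(7)
reps = 500
for n in (10, 100, 1000, 10000):
    normal_means = rng.standard_normal((reps, n)).mean(axis=1)
    cauchy_means = rng.standard_cauchy((reps, n)).mean(axis=1)
    # 90th percentile of |sample mean|: shrinks like 1/sqrt(n) for Gaussian data,
    # but stays roughly constant for Cauchy data.
    print(n, np.percentile(np.abs(normal_means), 90), np.percentile(np.abs(cauchy_means), 90))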
Theorem IV.31 is only the simplest version of the Central Limit theorem. There are many results holding for dependent random variables (where some quantitative measure of weak dependence should be assumed) or for independent but not identically distributed ones.
Example IV.32 (Confidence interval). Suppose X1, X2, . . . are i.i.d. random variables, whose variance Var(X1) = σ² < ∞ is known, and whose mean µ we wish to estimate. (For example, these could be independent samples of the same random variable.) By the strong law of large numbers,
X̄n := (1/n) Σ_{i=1}^n Xi → µ
a.s., which is a natural estimate, and we would like to find a confidence interval for these approximants for a given large n. By the CLT,
√n (X̄n − µ)/σ →d Z ∼ N(0, 1).
Then, for a < b,
P( X̄n − bσ/√n ≤ µ < X̄n − aσ/√n ) = P( a < √n (X̄n − µ)/σ ≤ b ) → P(a < Z ≤ b)
as n → ∞. One can then observe in the Gaussian distribution tables, that for b = −a = 1.96, P(−1.96 < Z ≤ 1.96) = 0.95 . . ., so that
P( X̄n − 1.96σ/√n ≤ µ < X̄n + 1.96σ/√n ) = P( −1.96 < √n (X̄n − µ)/σ ≤ 1.96 ) → 0.95 . . . .
Hence, µ ∈ [X̄n − 1.96σ/√n, X̄n + 1.96σ/√n] with confidence asymptotic to 95%, i.e. the said interval is a 95% confidence interval for the mean µ.
In case σ is unknown, one usually replaces σ² with the sample variance
S² := (1/(n−1)) Σ_{i=1}^n (Xi − X̄n)².
(Here the division is by n − 1 and not by n, as this way the estimator is unbiased, i.e. E[S²] = σ².)
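A minimal sketch (Python) of the 95% confidence interval; the "true" mean µ = 3, the known σ = 2 and the sample size n = 400 are purely illustrative assumptions.

import math
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, n = 3.0, 2.0, 400            # illustrative "unknown" mean, known variance
x = rng.normal(mu, sigma, size=n)       # pretend these are the observed samples

xbar = x.mean()
half = 1.96 * sigma / math.sqrt(n)
print((xbar - half, xbar + half))       # a 95% confidence interval for mu; it should
                                        # contain mu = 3 in roughly 95% of repetitions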
Example IV.33 (Hypothesis testing). Consider the same situation of X1, X2, . . . i.i.d. with mean µ unknown, and variance σ² known. Let H0 be the hypothesis that µ = µ0 for some µ0 ∈ R, and the alternative hypothesis HA, that µ ≠ µ0. Under H0, √n (X̄n − µ0)/σ →d N(0, 1), so for n large, X̄n ≈ N(µ0, σ²/n) in distribution, and we can test the latter based on the observed data.
For example, suppose that µ0 = 0, σ = 1, and for n = 100 we observed X̄n = 0.16. Then, under H0, X̄n ∼ N(0, 1/100) approximately, so the probability of observing X̄n = 0.16 or a larger deviation is
P(|X̄n| ≥ 0.16 | H0) ≈ P(|Z| ≥ 0.16 · 10) ≈ 0.11.
If we take the significance level of our test to be, for example, 5% (this is the threshold such that the null hypothesis is rejected whenever the above probability falls below it), then, since 0.11 > 0.05, the null hypothesis H0 is not rejected.
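The p-value of Example IV.33 can be reproduced in a couple of lines (Python; the standard normal CDF is expressed via the error function):

import math

# Under H0, Xbar_n ~ N(0, 1/100), so |Xbar_n| >= 0.16 corresponds to |Z| >= 1.6.
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
p_value = 2.0 * (1.0 - Phi(1.6))
print(p_value)   # ≈ 0.11 > 0.05, so H0 is not rejected at the 5% level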
Example IV.34 (Binomial distribution). Let Yn ∼ Bin(n, p), where n is large. Then Yn has the same distribution as Σ_{i=1}^n Xi, where the Xi ∼ Bin(1, p) are i.i.d., i.e. it measures the number of heads in n successive tosses of the same p-coin. Then, since E[X1] = p and Var(X1) = p(1 − p), the CLT gives
√n (X̄n − p)/√(p(1−p)) = ( Σ_{i=1}^n Xi − np ) / √(np(1−p)) →d Z ∼ N(0, 1),
as n → ∞. Hence, for large n, Yn ≈ np + √(np(1−p)) Z ∼ N(np, np(1−p)) in distribution, that is, the Gaussian with the same mean and variance as Yn. Since the limit Gaussian distribution is continuous everywhere, it follows from Pólya's theorem that
sup_{y∈R} | P( (Yn − np)/√(np(1−p)) ≤ y ) − P(Z ≤ y) | = sup_{t∈R} | P(Yn ≤ t) − P( Z ≤ (t − np)/√(np(1−p)) ) | → 0.
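A closing simulation sketch (Python; n = 1000, p = 0.3 and the test points are arbitrary choices) comparing the distribution function of Bin(n, p) with its Gaussian approximation:

import math
import numpy as np

rng = np.random.default_rng(9)
n, p = 1000, 0.3
y = rng.binomial(n, p, size=500_000)                          # samples of Y_n ~ Bin(n, p)

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))    # standard normal CDF
for t in (280, 300, 315):
    approx = Phi((t - n * p) / math.sqrt(n * p * (1 - p)))
    print(t, np.mean(y <= t), approx)                         # empirical CDF vs Gaussian approximation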