Probability Theory

P. Ouwehand
Chapter 1
Events and Probabilities

1.1 Motivation for Measure Theory
Example 1.1.1 A die is rolled once. The possible outcomes are the integers one through six.
Thus the sample space can be taken to be Ω = {1, 2, . . . , 6}. We may be interested in the following events:
(a) The outcome is a one;
(b) The outcome is an even number;
(c) The outcome is an odd number greater than one.
Each of these events can be described by a subset of the sample space. Thus if A, B, C are the
subsets corresponding to the events (a), (b), (c), then
A = {1}
B = {2, 4, 6}
C = {3, 5}
The probabilities of these events, by elementary reasoning, are P(A) = 1/6, P(B) = 1/2 and P(C) = 1/3,
provided that the die is fair. Every subset of Ω is a “permissible” event, and thus F = P(Ω).
Mathematically, an event is a set, i.e. events are just subsets of the sample space. The outcome
of any random experiment must be some element ω of the sample space Ω. Now Ω is itself a subset
of Ω and thus corresponds to some event. We call it the certain event, since we are certain that
ω ∈ Ω. We must always have P(Ω) = 1. The empty set ∅ is also a subset of Ω and thus corresponds
to some event. We call it the impossible event, since it is impossible that an outcome ω is in ∅.
We will always have P(∅) = 0.
Note that the sample space corresponding to a random experiment need not be unique. Consider, for example, the random experiment of rolling two dice. Then we can choose the sample space to be the 36–element set Ω1 = {(i, j) : i, j ∈ {1, 2, 3, 4, 5, 6}}. The probabilities for each outcome are then the same: P(ω) = 1/36 for each ω ∈ Ω1. This is a so–called uniform distribution.
On the other hand, we can choose the sample space to be the 11–element set Ω2 = {2, 3, 4, . . . , 12} corresponding to the total of the two dice. In this case, the probability distribution is non–uniform: P({7}) = 1/6 whereas P({2}) = 1/36. Choosing the sample space and the corresponding probability distribution for a particular situation is part of the art of probabilistic modelling.
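The two descriptions are easy to reconcile computationally. Here is a minimal Python sketch (an illustration, not part of the text) that recovers the non–uniform distribution on Ω2 from the uniform distribution on Ω1:

from itertools import product
from fractions import Fraction
from collections import Counter

# Omega1: all 36 ordered pairs (i, j), each with probability 1/36.
omega1 = list(product(range(1, 7), repeat=2))
p1 = Fraction(1, len(omega1))

# Omega2 = {2, ..., 12}: the induced, non-uniform distribution of the total.
p2 = Counter()
for i, j in omega1:
    p2[i + j] += p1

print(p2[7], p2[2])   # 1/6 and 1/36, as claimed above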
Example 1.1.2 A coin is flipped until the first head turns up. This may happen on the first toss or the second, or. . . or never. Thus the sample space is Ω = {ω1, ω2, . . . , ω∞}, where the outcome ωn denotes the event that the first head turns up on the nth toss, and ω∞ denotes the event of never flipping heads. It is clear from elementary probability that P({ωn}) = 1/2ⁿ (provided that the coin is fair). We may now consider various composite events, such as:
(a) Let A be the event that the first head appears on either the third or the fourth toss. Then A = {ω3} ∪ {ω4} = {ω3, ω4}. Clearly P(A) = 1/2³ + 1/2⁴.
(b) Let B be the event that the first head appears after an even number of tosses. Thus B = ⋃n∈N {ω2n} and P(B) = Σn=1..∞ 1/2²ⁿ = 1/3. Did you think that the probability that the first head appears after an even number of tosses is 1/2? If so, note that the probability that the first head appears on the first toss is 1/2, and the probability that the first head appears after an odd number of tosses is therefore greater than 1/2.
(c) Let C be the event that both events A and B occur. Clearly C = {ω4 } = A ∩ B.
(d) Let D be the event that a head does occur after a finite number of tosses. Thus D is the complement of the event that heads never occurs. Thus D = Ω − {ω∞} = {ω1, ω2, . . . }. Hence P(D) = Σn=1..∞ 1/2ⁿ = 1. This can also be seen from the fact that P({ω∞}) = 0.
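A small Python sketch (illustrative only) confirms these series numerically:

from fractions import Fraction

p = lambda n: Fraction(1, 2**n)   # P({omega_n}): first head on toss n

# B: first head on an even-numbered toss. Partial sums of 1/2^(2n)
# approach 1/3, in line with the geometric series computation above.
pB = sum(p(2 * n) for n in range(1, 40))
print(float(pB))          # 0.3333... ≈ 1/3

# D: a head eventually occurs. The partial sums of 1/2^n approach 1.
pD = sum(p(n) for n in range(1, 40))
print(float(pD))          # 0.9999... ≈ 1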
First, we explore the idea of counting. To say that A = {a, b, c} has 3 elements is equivalent
to saying that there is a one-to-one correspondence or bijection between the sets A and {1, 2, 3}:
When we count “One, two, three”, pointing our finger at a, b, c, we are defining a map
f : {1, 2, 3} → A : 1 ↦ a, 2 ↦ b, 3 ↦ c
Thus, when we count a finite collection of objects, we mentally form a list. To count how many
people there are in a room, we may form a list such as
1    2     3        4       . . .   27
↕    ↕     ↕        ↕       . . .   ↕
Bob  Mary  Stoffel  Sannie  . . .   Cyril

i.e. a bijection with {1, 2, 3, 4, . . . , 27}. In the same way, a bijection between arbitrary sets A and X pairs off their elements:

A = { a   b   c   . . . }
      ↕   ↕   ↕   . . .
X = { x   y   z   . . . }
Clearly, if there is a surjection from A onto B, then A has “more” elements than B, and if there is an injection from A into B, then A has “fewer” elements than B.
We can use these ideas to measure infinite sets. We say that a set A of objects is countable if we can make a finite or infinite list of all of its elements:
A = {a1, a2, a3, . . . , an}   or   A = {a1, a2, a3, a4, . . . }
Here a1 is the first element on the list, a2 the second, etc. If we allow lists with repetitions, then
we see that a set A is countable if and only if there is a list
A = {a1 , a2 , a3 , a4 . . . }
i.e. the elements of A can be listed 1st , 2nd , 3rd etc., without leaving any out.
Definition 1.1.3 A non–empty set A is countable if and only if there is a surjective map
f : N → A from the set of natural numbers onto A.
The empty set is also defined to be countable.
A set which is not countable is said to be uncountable.
Intuitively, a set is countable if its size is smaller than (or equal to) the size of the set of natural numbers.
Observe that we take 0 to be an element of N. Following common mathematical practice, a
listing of a set A may often start with a zeroth element: A = {a0 , a1 , a2 , . . . }: We begin counting
from 0.
Example 1.1.4 The Hilbert Hotel is a hotel with infinitely many rooms, numbered 0, 1, 2, 3, . . . .
Imagine that you are the manager of the Hilbert Hotel.
1. One day, someone arrives at your desk and requests a room. You look at the guest list, and
notice that every room is full. What do you do?
2. One day, a busload of infinitely many people (numbered 0, 1, 2 . . . , m, . . . ) arrives at your desk
with a request for accommodation. You look at the guest list, and notice that every room is
full. What do you do?
3. One day infinitely many buses (numbered 0, 1, 2, . . . , n, . . . ) arrive, each bus filled with infinitely
many people, so that the mth person on the nth bus is numbered (n, m). You look at the guest
list, and notice that every room is full. What do you do?
What you do is this:
1. Move every person in the hotel to the next room, i.e. the person in room n will move to room
n + 1. Now room 0 is empty!
2. Move every person in the hotel as follows: The person in room n goes to room 2n. Then all the
odd–numbered rooms are empty. Put bus-passenger m in room 2m + 1.
3. It can be done! Wait and see until after Proposition 1.1.6.
Examples 1.1.5 A non–empty set A is countable if there is a list of all its elements, possibly with
repetitions.
A = {a0 , a1 , a2 , . . . , an , . . . }
(a) Every finite set is countable: If A = {a0 , a1 , . . . , am }, pick an arbitrary a ∈ A, and let an := a
for n > m. Then A = {an : n ∈ N}.
(b) N = {0, 1, 2, 3, . . . } is countable: Take an = n
(c) The set {0, 2, 4, 6, . . . } of even natural numbers is countable: Take an = 2n.
(d) The set Z of integers is countable:
Z = {0, 1, −1, 2, −2, 3, −3, . . . }    an = −n/2 if n is even,  an = (n + 1)/2 if n is odd
If you think of the positive integers as the people already staying in the Hilbert Hotel, and the
negative integers as the people in an arriving bus, this shows how to accommodate everyone.
Proposition 1.1.6 (a) Every subset of a countable set is countable.
(b) Suppose that each of A0, A1, . . . , Ak, . . . is a countable set. Then their union
A := ⋃k∈N Ak = A0 ∪ A1 ∪ · · · ∪ Ak ∪ . . .
is also countable.
Proof: (a) Suppose that B := {b0, b1, b2, . . . } is a countable set, and let A ⊆ B. If A = ∅, then A is countable by definition. If A ≠ ∅, pick a ∈ A. Now define an (n ∈ N) as follows:
an := bn if bn ∈ A,  and  an := a else
Then A = {an : n ∈ N}, so A is countable.
(b) Write each Ak as a list: Ak = {ak,0, ak,1, ak,2, . . . }. To define the nth element of a single list enumerating A:
• Find the biggest k such that k(k + 1)/2 ≤ n.
• Let i = n − k(k + 1)/2 and j = k − i. Then an := ai,j.
In this way every pair (i, j), and hence every element of every Ai, appears in the list (an)n. □
Remarks 1.1.7 If you think of A0 as the people already staying in the Hilbert Hotel, and of An as
the nth bus (for n = 1, 2, 3, . . . ), then the proof of Proposition 1.1.6(b) shows how to accommodate
infinitely many busloads (cf. Example 1.1.4).
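The two bullet points in the proof amount to an explicit pairing between rooms and (bus, seat) pairs. Here is a minimal Python sketch (the function names are my own) of this diagonal enumeration:

def room_to_person(room):
    # Find the biggest k such that k(k+1)/2 <= room.
    k = 0
    while (k + 1) * (k + 2) // 2 <= room:
        k += 1
    i = room - k * (k + 1) // 2
    j = k - i
    return (i, j)           # room `room` goes to a_{i,j}: person j on bus i

def person_to_room(i, j):
    k = i + j
    return k * (k + 1) // 2 + i

# The two maps are inverse to each other, so every (bus, seat) pair
# gets exactly one room, and no room is used twice.
assert all(room_to_person(person_to_room(i, j)) == (i, j)
           for i in range(25) for j in range(25))
print([room_to_person(n) for n in range(6)])  # (0,0), (0,1), (1,0), (0,2), ...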
The set Q of rational numbers is countable:
Proof:
Q = A1 ∪ A2 ∪ · · · ∪ An ∪ . . .
is a union of countably many sets A1, A2, A3, . . . , where An := {0, 1/n, −1/n, 2/n, −2/n, 3/n, −3/n, . . . } is the set of rational numbers with denominator n. Clearly each An is countable, so Q is countable by Proposition 1.1.6(b). □
In light of the above, the following proposition may come as a complete surprise: the interval [0, 1] of real numbers is uncountable.
Proof: Suppose we are given an arbitrary list r0, r1, r2, . . . , rk, . . . (k ∈ N) of real numbers in [0, 1]. Write
r0 = 0.r00 r01 r02 r03 r04 r05 . . .
r1 = 0.r10 r11 r12 r13 r14 r15 . . .
r2 = 0.r20 r21 r22 r23 r24 r25 . . .
r3 = 0.r30 r31 r32 r33 r34 r35 . . .
⋮
where rkn is the nth digit in the decimal representation of rk. Pick a digit an ∈ {0, 1, . . . , 9} so that an ≠ rnn, and define a ∈ [0, 1] by
a := 0.a0 a1 a2 a3 a4 a5 . . .
(i.e. the nth digit in the decimal representation of a is an).
By construction, the nth digit in the decimal representation of a differs from the nth digit in the decimal representation of rn. Hence a ≠ rn (because a and rn do not have the same decimal expansion¹). But as this holds for any n, we are forced to conclude that the number a is not on the list!
Thus, given any list of real numbers in [0, 1], there is a number in [0, 1] which is not on the list.
It follows that there can be no list of all the numbers in [0, 1]. Thus [0, 1] is uncountable.
¹ For the purists: Because decimal expansions of numbers are not unique (e.g. 0.2499999 · · · = 0.250000 · · · = 0.25) we have to be a little more careful here. Every non-zero real number does have a unique non-terminating decimal expansion, i.e. one that does not terminate in all zeroes from some point onwards. So one way to get around the problem of non-unique decimal expansions is to insist that all the real numbers rn (except 0) be expressed in their non-terminating decimal expansion, and that each digit an is chosen so that an ≠ 0.
Thus the rational numbers can fit into the Hilbert Hotel, but the real numbers cannot!
From the basic properties of area it follows easily how to calculate the area of a triangle, and thus that of any polygon. But what about non–polygonal sets? For example, how do we justify that the area of a circle of radius r is πr²? Before we do this, do the following exercise.
Exercise 1.1.10 Use the properties (1)-(4) of the area function to show the following:
(a) |∅| = 0.
Exercise 1.1.11 We use the properties of the area function to show that the area of an open circle A of radius r is |A| = πr².
(a) For n = 1, 2, 3, . . . , let An be the regular open polygon with 2ⁿ⁺¹ sides, inscribed in a circle of radius r. Thus A1 is a square, A2 an octagon, etc. Note that A1 ⊆ A2 ⊆ A3 ⊆ . . . . Also note that ⋃n An = A. (This is why we need the sets An and A to be open subsets of R².)
(b) An consists of 2ⁿ⁺¹ congruent isosceles triangles, constructed by joining each of the sides of the polygon to the centre of the circle. Explain why each such triangle has area (1/2) r² sin(π/2ⁿ), and conclude that |An| = 2ⁿ r² sin(π/2ⁿ).
(c) Conclude that |A| = limn πr² · sin(π/2ⁿ)/(π/2ⁿ) = πr².
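A quick numerical check of the limit in (c), as a small Python sketch (illustrative only, with r = 1):

import math

r = 1.0
for n in [1, 2, 4, 8, 16]:
    # |A_n| = 2^n r^2 sin(pi / 2^n): the inscribed polygon with 2^(n+1) sides
    print(n, 2**n * r**2 * math.sin(math.pi / 2**n))
# The areas 2.0, 2.828..., ... increase towards pi * r^2 = 3.14159...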
The technique used to compute the area in Exercise 1.1.11 relies on the set A being approximated
from the inside by triangles. Not every set can be so approximated, however. Take for example the
set A := {(p, q) : p, q are rational numbers with 0 ≤ p, q ≤ 1}. It is not clear what the area of A
should be: On the one hand, the set A is dense inside the unit rectangle, so one might guess that
|A| = 1. If we approximate A from the outside however, we obtain a convincing argument that
|A| = 0:
Exercise 1.1.12 Recall that the set of rational numbers is countable. So we can write the elements of A in a list: A = {(pn, qn) : n ∈ N}. Fix ε > 0. For n ∈ N, let Rn be a square centred at (pn, qn) with area ε/2ⁿ. Let B = ⋃n Rn, and show that |B| ≤ Σn |Rn| ≤ 2ε. Also show that A ⊆ B, so that |A| ≤ 2ε. Since ε > 0 was arbitrary, we have |A| ≤ 2ε for all ε > 0, i.e. |A| = 0.
Observe that the technique used in Exercise 1.1.12 can be used to prove that every countable
subset of R2 has zero area. It is therefore necessary that the set of real numbers be uncountable
for the concept of area to have a useful meaning!
Not only intervals or unions of intervals have length. Some quite complicated sets can be
measured. Consider the following example:
Exercise 1.1.13 The Cantor set:
The Cantor set C is a subset of [0, 1] which is constructed as follows: Let C0 := [0, 1]. Now let C1 be C0 with its middle third removed, i.e. C1 := [0, 1/3] ∪ [2/3, 1]. Now remove the middle thirds of the two intervals that make up C1 to form C2, i.e. C2 := [0, 1/9] ∪ [2/9, 1/3] ∪ [2/3, 7/9] ∪ [8/9, 1]. Continue in this way, removing the middle thirds of each of the intervals comprising Cn to form Cn+1. Finally, let
C := ⋂n=0..∞ Cn
Then C is called the Cantor set.
(a) Show that λ(C) = 0, where λ(A) denotes the length of a subset A ⊆ R. [Hint: First calculate
λ(Cn )]
(b) Every real number a ∈ [0, 1] has a non–terminating ternary (base 3–) expansion a = (0.a1 a2 a3 . . . )3 := Σi=1..∞ ai/3ⁱ, where ai = 0, 1 or 2 — cf. remarks on base n–expansions in the box below. Show that the Cantor set is formed by removing all numbers which have a 1 occurring in their non–terminating ternary expansion.
[Hint: Show that C1 is formed by removing all numbers which have a 1 in the first digit, C2 is
formed by removing all numbers in C1 which have a 1 in the second digit, and so on.]
(c) Show that there are as many elements in C as there are real numbers. Conclude that C is
uncountable. [Hint: Define f : C → R so that f ((0.202202 . . . )3 ) = (0.101101 . . . )2 .]
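Both parts of the exercise can be explored numerically. The following Python sketch (illustrative; the helper in_cantor is my own) computes λ(Cn) = (2/3)ⁿ and tests ternary membership as in part (b):

from fractions import Fraction

# Part (a): each stage keeps 2 subintervals of length 1/3 per interval,
# so lambda(C_n) = (2/3)^n, which tends to 0.
for n in range(5):
    print(n, Fraction(2, 3) ** n)

def in_cantor(a, digits=30):
    # Approximate test: a in [0,1] lies in C iff some ternary expansion
    # of a avoids the digit 1 (part (b) of the exercise).
    x = Fraction(a)
    for _ in range(digits):
        x *= 3
        d = int(x)          # next digit of the greedy ternary expansion
        x -= d
        if d == 1 and x != 0:
            return False    # a 1 that is not final cannot be rewritten away
        if x == 0:
            return True     # a final 1 becomes 0222..., so a is in C
    return True

print(in_cantor(Fraction(1, 4)))   # True:  1/4 = (0.020202...)_3
print(in_cantor(Fraction(1, 2)))   # False: 1/2 = (0.111...)_3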
Base n–expansions: We are familiar with the fact that every real number has a base
10–expansion, i.e. a decimal expansion using digits in {0, 1, . . . , 9}, e.g.
π = 3.14159265 · · · = 3/10⁰ + 1/10¹ + 4/10² + 1/10³ + 5/10⁴ + . . .
Such an expansion is typically unique, except for one small problem:
(0.(n−1)(n−1)(n−1) . . . )n = Σk=1..∞ (n−1)/nᵏ = ((n−1)/n) · 1/(1 − 1/n) = 1 = (1.000 . . . )n,
i.e. every number whose base n–expansion ends in zeroes also has one that ends in (n − 1)’s. For example,
(0.9999 . . . )10 = (1.0000 . . . )10    (0.011111 . . . )2 = 1/2 = (0.1000 . . . )2    (0.102222 . . . )3 = 4/9 = (0.110000 . . . )3
Say that an expansion is terminating if it eventually ends in all zeroes. Then every real
number has a unique non–terminating base n–expansion.
Using an argument due to Vitali in 1905, it can be shown that it is impossible to assign a length
to every bounded subset of R, i.e. there is no function which satisfies each of the properties (1)-(4)
of L(·) above, and which is defined for every bounded subset of R. Thus there are subsets of R
which have no length. This does not mean that these sets have zero length; it means that there is
no number which can be called their length, and which is consistent with (1)-(4).
We present the proof, but it can be safely omitted:
Example 1.1.14 (∗) Define an equivalence relation ∼ on R by
x ∼ y ⟺ y − x ∈ Q
Let {Ei : i ∈ I} be an enumeration of the equivalence classes of ∼. Note that if x ∈ R, then there exists q ∈ Q such that 0 ≤ x + q ≤ 1. Now since x ∼ x + q, we see that for every x there is y ∈ [0, 1] such that x ∼ y. Thus [0, 1] ∩ Ei ≠ ∅ for every i ∈ I.
Now pick² for each i ∈ I one xi ∈ [0, 1] ∩ Ei, and define a Vitali set H by H := {xi : i ∈ I}. Thus for each y ∈ R there is a unique i ∈ I such that y ∼ xi.
For q ∈ Q, define H + q := {xi + q : i ∈ I}. First note that the H + q are mutually disjoint: For if y ∈ (H + q) ∩ (H + q′) for rational numbers q, q′, then y = xi + q = xj + q′ for some i, j ∈ I, and thus xi = xj + (q′ − q), i.e. xi ∼ xj. It follows that xi = xj, thus that q = q′, and thus that H + q = H + q′.
² This requires the Axiom of Choice.
Next, we claim that for each y ∈ R there is a unique q ∈ Q such that y ∈ H + q := {xi + q : i ∈ I}.
Indeed, existence follows from the fact that there is an i ∈ I such that y ∼ xi , so that q := y − xi has the
property that y ∈ H + q. Uniqueness follows from the disjointness of the H + q.
Now let {qn : n ∈ N} be an enumeration of Q ∩ [−1, 1]. Note that if x ∈ [0, 1], there is a unique i ∈ I such that x − xi ∈ Q ∩ [−1, 1], so that x ∈ ⋃n∈N (H + qn). Since H ⊆ [0, 1], we also have ⋃n∈N (H + qn) ⊆ [−1, 2]. Thus:
[0, 1] ⊆ ⋃n∈N (H + qn) ⊆ [−1, 2]
Now suppose that the Vitali set H has a length, i.e. that L(H) exists. Each H + qn is a translation of H, and thus L(H + qn) = L(H) for all n ∈ N. Now since [0, 1] ⊆ ⋃n (H + qn) ⊆ [−1, 2], it follows that
1 ≤ L(⋃n (H + qn)) ≤ 3
But the H + qn are mutually disjoint, so that L(⋃n (H + qn)) = Σn L(H + qn) = Σn L(H), which is 0 if L(H) = 0, and +∞ if L(H) > 0. Either way we contradict the inequalities above. Thus L(H) cannot exist. □
Similarly, there are subsets of R² which have no area, and subsets of R³ which fail to have a volume. This does not mean that these sets have zero area/volume; it means that there is no number which can be called their area/volume, and which is consistent with (1)-(4). In fact, for R³, things are very much worse: Look up the Banach–Tarski Paradox!
Here are some points to consider:
• Not every subset of R/R2 can be assigned a length/area. It will therefore be nec-
essary to exclude these non–measurable sets from consideration: They are not
“permissible”.
Now note that length and area may be considered as special cases of probability, namely uniform
probability.
Example 1.1.15 Consider the experiment of randomly choosing a number ω from the unit interval
[0, 1] with equal probability.
• The probability that 0 ≤ ω ≤ 1/2 is 1/2. The event that 0 ≤ ω ≤ 1/2 corresponds to the set [0, 1/2]. Its length is L([0, 1/2]) = 1/2.
• Similarly, if (a, b) ⊆ [0, 1], then the probability that ω ∈ (a, b) is P((a, b)) = L((a, b)) = b − a.
• In particular, if H is the Vitali set of Example 1.1.14, then the probability that ω ∈ H is
undefined — it is impossible to meaningfully assign a probability to H while maintaining
translation invariance. H is not “permissible”.
Similarly, consider the experiment of choosing a point ω from the unit square [0, 1] × [0, 1] with
equal probability. If E ⊆ [0, 1] × [0, 1], then the probability that ω ∈ E is just A(E).
• If A is an event, then the possibility of A not occurring should also be an event. Now if the outcome of a random experiment is ω ∈ Ω, then the event A occurs if and only if ω ∈ A (remember that we consider an event to be a subset of the sample space). Thus the event that A does not occur corresponds to ω ∉ A, i.e. to the set Aᶜ = Ω − A. We want the probabilities of these events to be related by P(Aᶜ) = 1 − P(A).
• If A, B are events, then the possibility of both A and B occurring should also be an event.
Now if the outcome of a random experiment is ω ∈ Ω, then both A and B occur if and only
if ω ∈ A and ω ∈ B, i.e. if and only if ω ∈ A ∩ B. Thus the event of both A and B occurring corresponds to the set A ∩ B.
• In the same way, if A and B are events, then the possibility of at least one of A or B occurring should be an event as well. This corresponds to the set A ∪ B. We say that events are disjoint or mutually exclusive if they cannot occur simultaneously. Thus if A, B are disjoint, then ω ∈ A implies ω ∉ B. Clearly, therefore, A and B are disjoint if and only if A ∩ B = ∅ (i.e. the event that both A and B occur is impossible). For disjoint events A and B, we want P(A ∪ B) = P(A) + P(B). For a finite sample space with equally likely outcomes, this is because P(A ∪ B) = NA∪B/N = (NA + NB)/N = P(A) + P(B) (where NA is the number of elements in the set A, etc.)
The concept of probability has rather a lot in common with that of length, area and volume:
• P(∅) = 0;  |∅| = 0.
• If An are disjoint events, then P(⋃n An) = Σn P(An);  if An are disjoint sets, then |⋃n An| = Σn |An|.
When we isolate and study the common features of probability and length/area/volume, we get the subject of measure theory. We shall show that we can develop a theory which allows us to form integrals ∫ f dµ of functions f with respect to measures µ (rather than variables). It will turn out that the integral with respect to Lebesgue measure (yet to be defined) is a more powerful generalization of the ordinary Riemann integral. It will also transpire that the integral with respect to a probability measure precisely captures the notion of probabilistic expectation.
Armed with the intuition and motivation provided by the above examples, we now proceed with
the formal theory.
• A family F of events.
An event is a (permissible/relevant) subset of Ω. If A is an event, we say that A occurs if the
outcome ω is an element of A.
We shall require F to be a σ–algebra (which we define below).
Intuitively, we think of F as a set of events E for which we can decide whether or not E
occurred at the termination of the experiment.
Note: . . . whether or not. . . .
This intuition imposes the following constraints on F:
(a) Ω ∈ F and ∅ ∈ F.
Indeed, every outcome ω belongs to Ω, and thus the event Ω always occurs — it’s the certain
event.
Similarly, no outcome ω belongs to ∅, and thus the event ∅ never occurs — it’s the impossible
event.
(b) If E ∈ F, then Eᶜ ∈ F, i.e. F is closed under complementation.
For if we can decide whether or not E occurred, then we can also decide whether or not Eᶜ occurred: Suppose that the outcome of the experiment is ω. If E occurred, then ω ∈ E, so ω ∉ Eᶜ, hence Eᶜ did not occur. Similarly, if E did not occur, then Eᶜ did occur.
(c) If E1 , E2 ∈ F, then E1 ∩ E2 ∈ F, i.e. F is closed under intersection.
For if we can decide whether or not E1 occurred, and also whether or not E2 occurred, then
we can decide whether or not E1 ∩ E2 occurred: E1 ∩ E2 occurred iff ω ∈ E1 ∩ E2 iff ω ∈ E1
and ω ∈ E2 iff both E1 and E2 occurred.
Thus if we can decide whether or not E1 , E2 occurred, we can also decide whether or not
E1 ∩ E2 occurred.
(d) Similarly, F is closed under union: The event E1 ∪ E2 occurs iff either E1 occurred, or E2
occurred (or both).
(e) We can generalize (c) and (d) somewhat: If E1, E2, E3, . . . , En, . . . is a countable sequence of members of F, then also ⋂n En ∈ F and ⋃n En ∈ F, i.e. F is closed under countable intersections and countable unions.
For ⋂n En occurred iff each of the En occurred, and ⋃n En occurred iff at least one of the En occurred. Thus if we can decide whether or not each En occurred, we can also decide whether or not ⋂n En and ⋃n En occurred.
This leads to the following definitions:
Definition 1.2.1 Let Ω be a set. A collection A of subsets of Ω is called an algebra (or
field) on Ω if
(i) ∅ ∈ A;
(ii) A ∈ A ⇒ Ac ∈ A;
(iii) A, B ∈ A ⇒ A ∪ B ∈ A.
A collection F of subsets of Ω is called a σ–algebra (or σ–field) if it satisfies (i), (ii) and
(iii)σ If An ∈ F (for n ∈ N), then ⋃n An ∈ F.
Thus a σ–algebra on Ω is a non–empty family of subsets of Ω which is closed under complementation
and countable unions.
Remarks 1.2.2 (i) By De Morgan’s rules, an algebra is closed under (finite) intersections, and
a σ–algebra is closed under countable intersections:
(⋂n An)ᶜ = ⋃n Anᶜ
(iii) If Ω is a set, then F0 := {∅, Ω} is the smallest σ–algebra on Ω, and F∞ := P(Ω) is the biggest
σ–algebra on Ω.
(iv) If {Fi : i ∈ I} is a family of σ–algebras on Ω, then F := ⋂i∈I Fi is also a σ–algebra on Ω.
Events are organized in σ–algebras. The set–theoretic operations ⋃, ⋂, ·ᶜ correspond to the logical combinations or, and, not of events.
Frequently, the events of interest form a collection C which is not a σ–algebra. Suppose that C
is a collection of events which can be decided, i.e. if E ∈ C, then we can decide whether or not E
occurred. We can then also decide whether or not E c occurred, but E c may not be an element of
C. The bigger set F of all events that can be decided, given that we can decide all the events in C, is a σ–algebra containing C; in fact it is the smallest σ–algebra containing C, which we denote by
F = σ(C)
Proof: Let 𝔽 = {G : G a σ–algebra with C ⊆ G}, and let F = ⋂𝔽. Then F ∈ 𝔽. (Why?) Moreover, if G is a σ–algebra which contains C, then G ∈ 𝔽, and so F ⊆ G. (Why?) □
σ(C) consists of all those events F for which we can decide whether or not F has occurred,
given that we know exactly which of the E ∈ C have occurred.
Exercise 1.2.4 Let Ω = (0, 1], and let A be the family of all those sets which can be written as a
union of finitely many intervals of the form (a, b], where 0 ≤ a < b ≤ 1. Show that A is an algebra,
but not a σ–algebra.
Exercise 1.2.5 (a) Suppose that Ω is a set, and that B := {Bn : n ∈ N} is a partition of Ω with countably many blocks. Show that the σ–algebra σ(B) generated by B is precisely the family of those sets that can be written as countable unions of the blocks Bn.
[Hint: Consider unions and complements of B2 ∪ B4 ∪ B6 ∪ . . . and B2 ∪ B3 ∪ B5 ∪ B7 ∪ B11 ∪ . . . .]
(b) Show that if Ω is a countable set, and if F is a σ–algebra on Ω, then there is a countable
partition B := {Bn : n ∈ N} which generates F.
[Hint: Define a relation ∼ on Ω by
ω ∼ ω′ ⇔ ∀F ∈ F [ω ∈ F ↔ ω′ ∈ F]
Show that ∼ is an equivalence relation, and consider the equivalence classes of ∼.]
(c) In a gambling game, a die is rolled. The sample space is Ω = {1, 2, . . . , 6}. I will tell you
whether or not the outcome is even, and whether or not the outcome is ≤ 4. What is the
σ–algebra on Ω which contains this information?
(d) Suppose that F = σ(B), where B is a partition consisting of n blocks. Explain why F has exactly 2ⁿ = 2^(no. of blocks) elements (see the sketch below).
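On a finite sample space the generated σ–algebra can be computed by brute force, which answers (c) and illustrates (d). A minimal Python sketch (illustrative only; on a finite set, closure under complements and pairwise unions suffices):

Omega = frozenset(range(1, 7))
# Generators for (c): "the outcome is even" and "the outcome is <= 4".
C = {frozenset({2, 4, 6}), frozenset({1, 2, 3, 4})}

F = set(C) | {Omega, frozenset()}
changed = True
while changed:               # close under complement and union
    changed = False
    for A in list(F):
        for B in [Omega - A] + [A | X for X in list(F)]:
            if B not in F:
                F.add(B)
                changed = True

print(len(F))                # 16 = 2^4, as in part (d)
# The blocks of the generating partition are the minimal nonempty sets:
atoms = [B for B in F if B and not any(X and X < B for X in F)]
print(sorted(map(sorted, atoms)))   # [[1, 3], [2, 4], [5], [6]]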
Definition 1.2.6 If the sample space Ω is a topological space, we define the Borel algebra
of Ω by
B(Ω) = σ(open sets of Ω)
In particular, B(R) is the smallest σ–algebra on R which contains all the open intervals of
R.
B(R) is one of the most important σ–algebras that we shall work with.
(ii) The half–open intervals (x, y] and [x, y), where x < y ∈ R.
Proof: Let C denote the collection of all intervals of the form (−∞, c], c ∈ R. As (−∞, x] = ⋂n=1..∞ (−∞, x + 1/n), we see that C ⊆ B(R), and thus σ(C) ⊆ B(R). (Why?)
Conversely, suppose first that I = (a, b) is an open interval. Then I = (−∞, a]ᶜ ∩ ⋃n (−∞, b − 1/n], so I ∈ σ(C), i.e. σ(C) contains every open interval.
Now since every open subset of R can be represented as a union of countably many open intervals³, we see that σ(C) contains every open subset of R. Thus B(R) ⊆ σ(C). □
1.3 Measures
The notion of measure generalizes the concepts of length, area, volume, mass, and probability.
(ii) µ(∅) = 0.
If F is a σ–algebra on a set Ω, then the pair (Ω, F) is called a measurable space. The elements
of F are called measurable sets, or events in the probabilistic framework. If, in addition, µ is
a measure on F, the triple (Ω, F, µ) is called a measure space. The symbols P, Q are used for
probability measures, and (Ω, F, P) will always denote a probability space.
Example 1.3.2 Important: Lebesgue Measure: We shall soon show that there is a unique measure λ on (R, B(R)) which assigns every interval its length, i.e.
λ(I) = |I| for every interval I ⊆ R
This measure is called Lebesgue measure, and provided the original impetus for the development of the subject of measure theory.
Example 1.1.15 makes clear that Lebesgue measure is also important in probability theory:
Consider the experiment of drawing a uniformly distributed random number from the unit interval
³ Let U ⊆ R be a bounded open set, and let {qn : n ∈ N} enumerate the rationals in U. Define rn := sup{r : (qn − r, qn + r) ⊆ U}, and let In = (qn − rn, qn + rn). Then ⋃n In ⊆ U. Conversely, if x ∈ U, there is ε > 0 so that (x − ε, x + ε) ⊆ U. Choose qn such that |x − qn| < ε/2. Then if |y − qn| < ε/2, we have |x − y| ≤ |x − qn| + |qn − y| < ε, and thus x ∈ (qn − ε/2, qn + ε/2) ⊆ (x − ε, x + ε) ⊆ U. It follows that rn ≥ ε/2, and thus that x ∈ In. Hence U ⊆ ⋃n In. If U is not bounded, note that U = ⋃n (U ∩ (−n, n)) is a union of bounded open sets.
[0, 1]. The probability of drawing a number between a and b (where 0 ≤ a ≤ b ≤ 1) is P([a, b]) =
b − a. Thus the appropriate measure P is just Lebesgue measure, restricted to [0, 1].
There are higher dimensional analogues of Lebesgue measure: There is a measure, also denoted
λ and called Lebesgue measure, on (Rn , B(Rn )), which assigns to every n–dimensional rectangle its
volume.
Example 1.3.3 Let Ω be a set. For A ⊆ Ω, define |A| = no. of elements of A. Then | · | defines a
measure on (Ω, P(Ω)), called counting measure.
Exercise 1.3.4 Let (X, F) be a measurable space, and let x0 ∈ X. Define δx0 : F → R by
δx0(F) = 1 if x0 ∈ F,  and  δx0(F) = 0 if x0 ∉ F
Show that δx0 is a measure on (X, F).
δx0 is called the Dirac measure, or point mass, at x0 .
Example 1.3.5 Suppose that F : R → R is an increasing right–continuous function, i.e.
F(s) ≤ F(t) when s ≤ t   and   F(t) = lim(s↓t) F(s)
We shall prove later that there is a unique measure µF on (R, B(R)) with the property that
µF (a, b] = F (b) − F (a) for all a < b ∈ R
µF is called the Lebesgue–Stieltjes measure associated with F . Note that if F (t) := t, then µF = λ
is Lebesgue measure.
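As a quick illustration (my own example, not the text's): taking F to be the exponential distribution function produces a Lebesgue–Stieltjes measure which happens to be a probability measure:

import math

def F(t):
    # An increasing, right-continuous function: the exponential CDF.
    return 1 - math.exp(-t) if t >= 0 else 0.0

def mu_F(a, b):
    # mu_F((a, b]) = F(b) - F(a), as in Example 1.3.5.
    return F(b) - F(a)

print(mu_F(0, 1))            # ≈ 0.632
print(mu_F(-5, 0))           # 0.0: no mass on the negative half-line
print(mu_F(0, math.inf))     # 1.0: total mass, so mu_F is a probability measure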
Note that, for general measures, we allow +∞ as a value. For example, the length of the real line is +∞, so λ(R) = +∞, where λ is Lebesgue measure on (R, B(R)). However, we often need to get a “handle” on infinity:
Definition 1.3.6 A measure µ on a measurable space (Ω, F) is called
(a) finite, if µ(Ω) < ∞;
(b) σ–finite, if Ω is the countable union of sets of finite measure, i.e. if there is a sequence A1, A2, · · · ∈ F of measurable sets such that each µ(An) < ∞, and such that Ω = ⋃n An.
Exercise 1.3.7 (a) Lebesgue measure on (R, B(R)) is σ–finite, but not finite.
(b) Counting measure on (R, B(R)) is not σ–finite.
The following exercise is often useful:
Exercise 1.3.8 Suppose that (Ω, F, µ) is a measure space, and that A ∈ F. Define
F ∩ A = {F ∩ A : F ∈ F}
(this is an abuse of notation), and let µA = µ|F ∩ A. Then (A, F ∩ A, µA ) is a measure space also
— the restriction of (Ω, F, µ) to A.
B1 := A1    Bn+1 := An+1 − (A1 ∪ · · · ∪ An)
Then Bn ⊆ An for all n, so that µ(Bn) ≤ µ(An). Moreover, the Bn are disjoint, ⋃k≤n Bk = ⋃k≤n Ak for all n, and ⋃n An = ⋃n Bn. Hence
µ(⋃n An) = µ(⋃n Bn) = Σn µ(Bn) ≤ Σn µ(An)
Exercise 1.4.2 Suppose that (Ω, F, P) is a probability space, and that A, A1, A2, · · · ∈ F. Show that if P(An) = 1 for n ∈ N, then P(⋂n An) = 1 also.
(a) We say that A is µ–null if there exists B ∈ F such that A ⊆ B and µ(B) = 0.
(b) We shall say that a statement ϕ holds µ–almost everywhere (or µ–almost surely in the
probabilistic framework), if the set of ω ∈ Ω where ϕ fails to hold is µ–null.
We abbreviate µ–almost everywhere and µ–almost surely by µ–a.e. and µ–a.s. respectively.
Remarks 1.4.4 Note that in (a) above, the set A might not belong to F so µ(A) might be
undefined. However, clearly µ(A) “ought” to be zero. Later, this insight will allow us to extend
measures to σ–algebras larger than the ones we start off with.
As an example of (b), consider the reals with Lebesgue measure: (R, B(R), λ). Every point is λ–null, since λ{x} = λ[x, x] = x − x = 0. Hence the set Q of rational numbers is λ–null: The set Q is countable, and has an enumeration Q = {qn : n ∈ N}. By countable additivity,
λ(Q) = Σn λ{qn} = 0
If, for example, f is the function which equals 1 at every rational and 0 elsewhere, and g := 0, then f and g differ only on the λ–null set Q. Then
f, g are equal λ–almost everywhere
Exercise 1.4.5 (a) Let N denote the family of all µ–null sets. Prove that N is closed under
countable unions.
2. Also note the following simple interpretations of the above limit operations:
x ∈ lim supn An ⇔ ∀n [x ∈ ⋃k=n..∞ Ak]
⇔ ∀n ∃k ≥ n [x ∈ Ak]
⇔ x belongs to infinitely many of the sets Ak
Similarly,
x ∈ lim infn An ⇔ ∃n [x ∈ ⋂k=n..∞ Ak]
⇔ ∃n ∀k ≥ n [x ∈ Ak]
⇔ x belongs to all the Ak from some n onwards
In particular, x ∈ lim inf n An iff x eventually belongs to all the An , i.e. belongs to all but
finitely many of the An .
Thus x ∈ (An , i.o.) iff x belongs to infinitely many of the sets An , etc.
Remarks 1.4.10 (∗) Recall the definitions of lim supn xn and lim infn xn of a sequence ⟨xn⟩n of real numbers — cf. Appendix B.3. How are these related to the definitions of lim supn An and lim infn An of a sequence ⟨An⟩n of sets?
For A ⊆ Ω, define the indicator function IA : Ω → {0, 1} of A by
IA(ω) = 1 if ω ∈ A,  and  IA(ω) = 0 else
(Elsewhere in the mathematical literature, indicator functions are often called characteristic func-
tions, but in probability theory, the term characteristic function has a different meaning.)
Suppose that (An : n ∈ N) is a countable sequence of subsets of Ω. For ω ∈ Ω, we have
I(lim supn An)(ω) = lim supn IAn(ω)   and   I(lim infn An)(ω) = lim infn IAn(ω)
B1 := A1    Bn+1 := An+1 − An
Then
µ(A) = µ(⋃n An) = µ(⋃n Bn) = Σn µ(Bn) = lim(n→∞) Σk=1..n µ(Bk) = lim(n→∞) µ(⋃k=1..n Bk) = lim(n→∞) µ(An)
Exercise 1.4.12 (a) Show that Propn. 1.4.11(b) may fail if we drop the assumption that at least
one of the µAn is finite.
(b) Suppose that µ is finitely additive on the measurable space (Ω, F). Show that µ is countably additive (i.e. a measure) if
µ(An) → 0 whenever An ↓ ∅
(c) Conclude that µ(lim supn An) ≥ limn sup(m≥n) µ(Am) = lim supn µ(An).
Ω̂ = {0, 1}N
i.e. Ω̂ is the set of N–indexed sequences of 0’s and 1’s. We take a slightly different view, however.
Every sequence of 0’s and 1’s can be regarded as the dyadic or binary expansion of a real number.
For example, the sequence
1 1 0 1 0 0 0 1 ...
can be thought of as the binary number
0.11010001 . . . (binary) = 1 · 1/2 + 1 · 1/2² + 0 · 1/2³ + 1 · 1/2⁴ + 0 · 1/2⁵ + 0 · 1/2⁶ + 0 · 1/2⁷ + 1 · 1/2⁸ + · · · = 0.816 . . . (decimal)
We thus have a correspondence between sequences (an : n ∈ N) of 0’s and 1’s, and real numbers between 0 and 1:
(an : n ∈ N) ↦ Σn=1..∞ an/2ⁿ
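The correspondence is easy to compute with. A minimal Python sketch (function names are my own):

def bits_to_real(bits):
    # (a_1, a_2, ...) -> sum a_n / 2^n
    return sum(b / 2**n for n, b in enumerate(bits, start=1))

print(bits_to_real([1, 1, 0, 1, 0, 0, 0, 1]))   # 0.81640625 = 0.816...

def real_to_bits(x, n):
    # First n digits of the NON-terminating binary expansion of x in (0, 1].
    bits = []
    for _ in range(n):
        b = 1 if x > 0.5 else 0
        bits.append(b)
        x = 2 * x - b
    return bits

print(real_to_bits(0.5, 6))    # [0, 1, 1, 1, 1, 1]: 1/2 = 0.0111... in binary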
Moreover, a little thought shows that if we chuck out the set T of terminating sequences from Ω̂, then the correspondence above is a bijection between Ω̂ − T and (0, 1]. How many such terminating dyadic
expansions are there? It is not hard to see that T is countable. Moreover, since each element of
Ω̂ has probability 0, the probability of the event T is also 0 (being a countable union of events of
probability 0). It is therefore practically certain that the event T won’t occur. The terminating
dyadic expansions are therefore, in a sense, redundant. Nothing is lost by chucking them out
(except a set of measure 0). We may therefore take the sample space to be the set
Ω = (0, 1]
F = all events that can be decided after only finitely many tosses
(a) A = the first toss is “Heads”. These are the dyadic numbers with a 1 in the first place, i.e. all the numbers from 0.1000 · · · = 1/2 to 0.1111 · · · = 1, but not including 0.1000 . . . , because it is terminating. Thus A = (1/2, 1].
(b) B = the third toss is “Tails”. This is the set of all dyadic numbers with a 0 in the third place.
These are the numbers in intervals
binary = decimal
(0.000000 . . . , 0.000111 . . . ] = (0, 0.125]
(0.010000 . . . , 0.010111 . . . ] = (0.25, 0.375]
(0.100000 . . . , 0.100111 . . . ] = (0.5, 0.625]
(0.110000 . . . , 0.110111 . . . ] = (0.75, 0.875]
Hence B = (0, 0.125] ∪ (0.25, 0.375] ∪ (0.5, 0.625] ∪ (0.75, 0.875].
(c) C = There are 2 “Heads” and 1 “Tail” in the first 3 tosses. A little thought shows that
C = (0.375, 0.5] ∪ (0.625, 0.75] ∪ (0.75, 0.875]
Reasoning along these lines, it is clear that F is the σ–algebra generated by all intervals of the form (k/2ⁿ, (k + 1)/2ⁿ], where n ∈ N and k < 2ⁿ. It is therefore not hard to see that F = B(0, 1] (since every real number can be approximated arbitrarily closely by a dyadic rational, i.e. a number of the form k/2ⁿ).
Having identified the “right” σ–algebra, we turn to the probability measure appropriate for this
experiment.
(a) P(A) = probability that first toss is “Heads”. Assuming a fair coin, this is clearly 1/2. Now the event A corresponds to the interval (1/2, 1], and λ(1/2, 1] = 1/2 (where λ is the Lebesgue measure introduced in Example 1.3.2). Thus P(A) = λ(A).
(b) P(B) = probability that third toss lands “Tails”. This is clearly also 1/2, as the third toss is just as likely to land “Heads” as it is “Tails”. Now λ(B) = λ(0, 0.125] + λ(0.25, 0.375] + λ(0.5, 0.625] + λ(0.75, 0.875] = 4 × 1/8 = 1/2, and thus P(B) = λ(B) in this case also.
(c) P(C) = probability that there are 2 heads and 1 tail in the first 3 tosses. This probability is clearly (3 choose 2) · 2⁻³ = 3/8. In this case we therefore also have P(C) = λ(C).
It therefore becomes apparent that the “right” probability space for the random experiment of tossing a coin infinitely many times is just the same as that of the random experiment of picking a number from (0, 1] (cf. Example 1.3.2).
All of this leads us to formulate the following principle:
Borel’s Principle:
Consider the random experiment of tossing a (fair) coin infinitely many times, and let E
be an event. Interpret E as a subset of (0, 1]. Then P(E) = λ(E), i.e. the probability
that the event E occurs is just the Lebesgue measure of the associated subset of the unit
interval.
(i) Ω ∈ C;
(ii) A, B ∈ C and A ⊆ B implies B − A ∈ C;
(iii) If A1 , A2 , · · · ∈ C and An ↑ A, then A ∈ C.
(c) We denote by π(C) and λ(C) the π–, respectively, λ–system generated by C, i.e. the
smallest π–, respectively, λ–system on Ω which contains C.
Why do π(C), λ(C) always exist? It follows from the easily proved fact that the intersection of
an arbitrary family of π–systems (resp. λ–systems) is again a π–system (resp. λ–system).
Then each Bn ∈ C, and Bn ↑ A. Hence A ∈ C, by Defn. 1.6.1(b)(iii), and thus C is closed under countable unions. □
The following technical result often allows us to work with “easy” π–systems, instead of the
“difficult” σ–algebras:
(b) Suppose that C is a π–system and that D is a λ–system (both on a set Ω), and also
that C ⊆ D. Then σ(C) ⊆ D.
Proof: (a) Let D = λ(C). By Propn. 1.6.2, it suffices to show that D is a π–system. We do this
in two steps.
STEP I: Fix C ∈ C, and define
DC = {A ∈ D : A ∩ C ∈ D}
STEP II: Now fix D ∈ D, and define
DD = {A ∈ D : A ∩ D ∈ D}
As an application, here is an easy but useful result: Two probability measures which agree on
a π–system agree on the σ–algebra generated by that π–system.
Proposition 1.6.4 Suppose that µ1, µ2 are finite measures on a measurable space (Ω, F), and let C be a π–system such that Ω ∈ C and σ(C) = F. Then if µ1 and µ2 agree on C, they agree on F.
• µ1(Ω) = 1 = µ2(Ω), so Ω ∈ D.
• If A, B ∈ D and A ⊆ B, then
µ1(B − A) = µ1(B) − µ1(A) = µ2(B) − µ2(A) = µ2(B − A)   (because A, B ∈ D)
so that B − A ∈ D also.
1. We first define the notion of outer measure on a set Ω, i.e. a monotone, countably subadditive map µ∗ : P(Ω) → R̄⁺.
2. We show that with each such outer measure µ∗ is associated the σ–algebra M(µ∗) of µ∗–measurable sets. Moreover, if we denote the restriction µ∗|M(µ∗) by µ, then µ is a measure, i.e. (Ω, M(µ∗), µ) is a measure space.
3. We define the Lebesgue outer measure λ∗ : P(R) → [0, +∞] as follows: Given an interval I ⊆ R, define |I| to be the length of the interval. Then, for A ⊆ R, we define
λ∗(A) := inf { Σn=1..∞ |In| : each In an interval, A ⊆ ⋃n In }
4. We show that λ∗ (I) = |I| for every interval I: λ∗ assigns to each interval its length.
(i) µ∗(∅) = 0;
(ii) µ∗ is monotone: A ⊆ B implies µ∗(A) ≤ µ∗(B);
(iii) µ∗ is countably sub–additive: If A1, A2, · · · ⊆ Ω, then µ∗(⋃n An) ≤ Σn µ∗(An).
Note that we require µ∗(A) to be defined for every subset A ⊆ Ω. We haven’t mentioned a base σ–algebra. But there is one: call a set A ⊆ Ω µ∗–measurable if
µ∗(E) = µ∗(E ∩ A) + µ∗(E ∩ Aᶜ)   for every E ⊆ Ω
(the Carathéodory criterion, which is used repeatedly in the proof below).
Theorem 1.7.2 Let µ∗ be an outer measure on Ω, let M(µ∗) be the family of all µ∗–measurable sets, and define µ := µ∗|M(µ∗). Then M(µ∗) is a σ–algebra, and µ is a measure on (Ω, M(µ∗)).
Proof: We must show that M(µ∗ ) is a σ–algebra, and that µ∗ is countably additive on M(µ∗ ).
Certainly ∅ is µ∗ –measurable. Also, it is obvious that A ∈ M(µ∗ ) implies Ac ∈ M(µ∗ ).
Next, we show that M(µ∗) is closed under finite intersections: Let A, B ∈ M(µ∗), and let E ⊆ Ω. Then
µ∗(E) = µ∗(E ∩ A) + µ∗(E ∩ Aᶜ)
= µ∗(E ∩ A ∩ B) + µ∗(E ∩ A ∩ Bᶜ) + µ∗(E ∩ Aᶜ)
= µ∗(E ∩ (A ∩ B)) + µ∗(E ∩ (A ∩ B)ᶜ)
≥ µ∗(E)
where the third line follows because µ∗(E ∩ (A ∩ B)ᶜ) = µ∗(E ∩ (A ∩ B)ᶜ ∩ A) + µ∗(E ∩ (A ∩ B)ᶜ ∩ Aᶜ) = µ∗(E ∩ A ∩ Bᶜ) + µ∗(E ∩ Aᶜ), and the final line holds because µ∗ is sub–additive. It follows that A ∩ B ∈ M(µ∗). Hence M(µ∗) is an algebra.
Next, let A, B ∈ M(µ∗) be disjoint, and let E ⊆ Ω. Then B ⊆ Aᶜ, and hence µ∗(E ∩ (A ∪ B)) = µ∗(E ∩ (A ∪ B) ∩ A) + µ∗(E ∩ (A ∪ B) ∩ Aᶜ) = µ∗(E ∩ A) + µ∗(E ∩ B); by induction, the same holds for finitely many disjoint sets.
Now let A1, A2, · · · ∈ M(µ∗) be disjoint, and put Bn := A1 ∪ · · · ∪ An and A := ⋃m Am. For any E ⊆ Ω,
µ∗(E) = µ∗(E ∩ Bn) + µ∗(E ∩ Bnᶜ) ≥ Σm≤n µ∗(E ∩ Am) + µ∗(E ∩ Aᶜ) → Σm µ∗(E ∩ Am) + µ∗(E ∩ Aᶜ) ≥ µ∗(E ∩ A) + µ∗(E ∩ Aᶜ) ≥ µ∗(E)
Hence equality holds throughout, and so A ∈ M(µ∗). This proves that M(µ∗) is a σ–algebra. Moreover, taking E = A in the above gives µ∗(A) = Σm µ∗(Am), so µ is countably additive. □
Proposition 1.7.4 λ∗ is an outer measure on R, and λ∗ (I) = |I| for every interval I.
Proof: It is clear that λ∗ is a monotone increasing non–negative function with λ∗ (∅) = 0. To prove
that λ∗ is also countably sub–additive, let A1 , A2 , · · · ⊆ R, and fix an arbitrary ε > 0. By definition
of λ∗ we may, for each n ∈ N, choose open intervals In,1 , In,2 , . . . such that
An ⊆ ⋃k In,k    and    Σk |In,k| ≤ λ∗(An) + ε2⁻ⁿ    (all n ∈ N)
Then ⋃n An ⊆ ⋃n ⋃k In,k, and
λ∗(⋃n An) ≤ Σn Σk |In,k| ≤ Σn λ∗(An) + ε
Since ε > 0 was arbitrary, λ∗ is countably sub–additive.
Without loss of generality (relabelling if necessary), we may assume that b ∈ (an, bn). Then (a1, b1), . . . , (an−1, bn−1) is a cover of the interval [a, an]. By induction hypothesis, we have an − a ≤ Σk=1..n−1 (bk − ak), and hence
b − a = (b − an) + (an − a) ≤ (bn − an) + Σk=1..n−1 (bk − ak) = Σk=1..n (bk − ak)
as required.
It follows that b − a ≤ Σk=1..∞ (bk − ak). Since the (an, bn) were an arbitrary open covering of [a, b], it follows that b − a ≤ λ∗[a, b], i.e. that |I| ≤ λ∗(I) for every compact interval I.
It remains to deal with intervals that are non–compact. If I is a bounded interval, then there
is a compact interval J such that J ⊆ I and |I| ≤ |J| + ε (where we fix ε > 0). It follows that
|I| ≤ λ∗ (J) + ε. Now since also λ∗ (J) ≤ λ∗ (I), we must have |I| ≤ λ∗ (J) + ε ≤ λ∗ (I) + ε. Letting
ε ↓ 0, we see that |I| ≤ λ∗ (I) if I is a bounded interval.
Finally, if I is an unbounded interval, then λ∗ (I) = +∞: Indeed, if K > 0 is arbitrary, there is
a compact interval J ⊆ I such that |J| ≥ K. Then
λ∗ (I) ≥ λ∗ (J) ≥ K
Now that we have constructed an outer measure λ∗ , it follows by Thm. 1.7.2 that there is
a σ–algebra L(R) on R such that λ = λ∗ |L(R) is a measure on (R, L(R)). Indeed, we have
L(R) := M(λ∗ ), the family of all λ∗ –measurable sets. The σ–algebra L(R) is called the σ–algebra
of all Lebesgue measurable sets, or the Lebesgue algebra, on R.
Our next aim is to show that B(R) ⊆ L(R), i.e. that every Borel set is Lebesgue measurable.
Proof: It suffices to prove that every interval of the form (−∞, a] is Lebesgue measurable, because the collection of intervals of this form generates B(R). Let E ⊆ R be arbitrary, and let I = (−∞, a]. Fix ε > 0, and choose intervals I1, I2, . . . such that E ⊆ ⋃n In and Σn |In| ≤ λ∗(E) + ε. Note that if J is an arbitrary interval, then so are J ∩ I and J ∩ Iᶜ, and |J| = |J ∩ I| + |J ∩ Iᶜ|.
Now
λ∗(E) + ε ≥ Σn |In| = Σn |In ∩ I| + Σn |In ∩ Iᶜ| ≥ λ∗(E ∩ I) + λ∗(E ∩ Iᶜ)
Letting ε ↓ 0, we see that λ∗(E) ≥ λ∗(E ∩ I) + λ∗(E ∩ Iᶜ), for all E ⊆ R. □
Theorem 1.7.6 There exists a unique measure λ on (R, B(R)) such that λ(I) = |I| for every interval I.
Proof: By Propn. 1.7.4, λ∗ is an outer measure with λ∗ (I) = |I| for every interval, and thus it
follows by Thm. 1.7.2 that there is a σ–algebra L(R) on R such that λ = λ∗ |L(R) is a measure on
(R, L(R)). By Propn. 1.7.5, B(R) ⊆ L(R).
It remains to prove uniqueness: Suppose that µ is another measure on (R, B(R)) with the property that µ(I) = |I| for all intervals I. Let In = [−n, n]. Then λn := λ|In, µn := µ|In are finite measures on (In, B(In)). Now Cn = {J : J an interval, J ⊆ In} is a π–system which generates B(In), and λn, µn agree on Cn. Hence, by Propn. 1.6.4, λn, µn agree on B(In).
Now, if B ∈ B(R), then by Propn. 1.4.11, λ(B) = limn λ(B ∩ In) = limn µ(B ∩ In) = µ(B). Hence λ = µ on B(R). □
One can construct Lebesgue measure on R² (and on Rⁿ) in an analogous way, defining an outer measure λ∗ via coverings by rectangles instead of intervals, and show that λ∗ is an outer measure, and that every Borel set is λ∗–measurable.
Instead of performing the construction outlined above, we elect to wait until we have constructed
products of measure spaces: Given measure spaces (Ωi , Fi , µi ), where i = 1, . . . , n, it is possible to
construct, in a canonical way, a measure space
(∏i≤n Ωi, ⊗i≤n Fi, ∏i≤n µi)
Chapter 2
Measurable Functions and Random Variables

2.1 Definition of Measurable Function

Given a map f : A → S and a subset T ⊆ S, recall the notation
f −1 [T] = {a ∈ A : f(a) ∈ T}
We call the set f −1 [T] the pullback (or inverse image) of T along f. See Section A.3 for more information.
Remarks 2.1.1 Here is some motivation for the definition of measurable function.
Suppose that X is a random variable on a probability space (Ω, F, P), i.e. a function X : Ω → R
which assigns a number X(ω) to every outcome ω ∈ Ω — we will make this notion more precise
shortly. We would like to be able to discuss the probability that X = 0, or that X lies between −1
and 1, etc. Thus we’d like to know
P(X = 0),   P(−1 ≤ X ≤ 1),   etc.
However, P(F) makes sense only if F ∈ F. Thus, in order to be able to discuss the above probabilities, it is necessary that the sets
{ω : X(ω) = 0},   {ω : −1 ≤ X(ω) ≤ 1}
belong to F.
More generally, given a Borel set B, we want to be able to discuss the probability that the outcome X(ω) belongs to B. For this we need X −1 [B] = {ω : X(ω) ∈ B} to belong to F; this is precisely the requirement of measurability defined below.
Proposition 2.1.2 Let f : A → S be a map, and let T, Tn ⊆ S. Then
(a) f −1 [Tᶜ] = (f −1 [T])ᶜ
(b) f −1 [⋃n Tn] = ⋃n f −1 [Tn]
(c) f −1 [⋂n Tn] = ⋂n f −1 [Tn]
Proposition 2.1.3 Suppose that (A, A), (S, S) are measurable spaces, and that f : A → S is a map. Then
(i) A′ = f −1 S := {f −1 [T] : T ∈ S} is a σ–algebra on A.
(ii) S′ = {T ⊆ S : f −1 [T] ∈ A} is a σ–algebra on S.
Exercise 2.1.4 Prove Propn. 2.1.2 and 2.1.3.
Definition 2.1.5 (1.) Let (A, A), (S, S) be measurable spaces. A map f : A → S is said to be A/S–measurable if and only if f −1 S ⊆ A (i.e. f −1 [T] ∈ A for all T ∈ S).
If the σ–algebras A, S are obvious from context, we simply call f a measurable func-
tion.
(2.) A measurable function X from a probability space (Ω, F) to (R, B(R)) is called a
random variable.
A measurable function X from a probability space (Ω, F) to (Rd , B(Rd )) is called a
random vector.
More generally, any measurable function from a probability space to a measurable
space is called a random element.
(3.) If S is a topological space, then a measurable function f : (S, B(S)) → (R, B(R)) is called a Borel function.
We are usually interested in the case where S = R or Rd .
Remark 2.1.6 (∗) Note the similarity with the definition of continuous function: A function f between
topological spaces X, Y is continuous iff f −1 [V ] is an open subset of X whenever V is an open subset of Y ,
i.e. iff pullbacks of open sets are open.
Remarks 2.1.7 (1) The notion of measure does not occur in the definition of measurable func-
tion/random variable: Only the measurable spaces ( = set + σ–algebra) play a role.
(2) If f : (A, A) → (S, S) is measurable, then f maps A to S, while the pullback f −1 maps (subsets of) S back to (subsets of) A:
f : A → S    f −1 : S → A
(4) We will also allow extended real–valued maps f : A → R̄, where R̄ := [−∞, +∞].
IA −1 [B] = ∅ if neither 0 nor 1 is in B;
IA −1 [B] = A if 1 ∈ B and 0 ∉ B;
IA −1 [B] = Aᶜ if 0 ∈ B and 1 ∉ B;
IA −1 [B] = S if both 0, 1 ∈ B.
It follows that IA −1 [B] ∈ S for every B ∈ B(R), so that IA is a measurable function.
Similarly, if IA is a measurable function, then IA −1 {1} = A ∈ S, since {1} is a Borel set. Thus:
Proposition 2.2.2 Suppose that (Ω, F) is a measurable space, and that A ⊆ Ω. Then
the indicator IA : Ω → R is a measurable function iff A is a measurable set (i.e. IA is
F/B(R)–measurable iff A ∈ F).
We shall soon see that all measurable functions can be represented as limits of linear combinations
of measurable indicator functions, a fact that will be extremely important when we develop the
notions of integration and expectation.
Exercise 2.2.3 Suppose that Ω is a set, that F := {∅, Ω} is the trivial σ–algebra on Ω, and that
G := P(Ω) is the powerset–algebra on Ω.
(a) Determine all F/B(R)–measurable functions f : Ω → R.
(b) Determine all G/B(R)–measurable functions f : Ω → R.
Exercise 2.2.4 Suppose that (Ω, F), (S, S) are measurable spaces, and that f : Ω → S is F/S–
measurable.
(2.) Show that if T is a σ–algebra on S such that T ⊆ S, then f is also F/T –measurable.
Corollary 2.2.5 A function f : (Ω, F) → (R, B(R)) is measurable iff one of the following conditions holds:
(a) {f ≤ c} ∈ F for all c ∈ R;   (b) {f < c} ∈ F for all c ∈ R;
(c) {f ≥ c} ∈ F for all c ∈ R;   (d) {f > c} ∈ F for all c ∈ R.
Proof: (a) In Propn. 2.1.8, take (S, S) = (R, B(R)) and C to be the collection of all intervals of the
form (−∞, c]. We already know that these intervals generate the Borel algebra on R — cf. Propn.
1.2.8.
(b), (c), (d) are proved similarly. □
Every continuous function f : R → R is a Borel function:
Proof: Suppose that f : R → R is continuous, and c ∈ R. Given x ∈ {f < c}, let 0 < εx < c − f(x), and choose δx > 0 so that |f(x) − f(y)| < εx whenever |x − y| < δx (by definition of continuity). Then (x − δx, x + δx) ⊆ {f < c}, and hence {f < c} = ⋃x∈{f<c} (x − δx, x + δx) is a union of open intervals. By footnote 3 of Chapter 1, it follows that {f < c} is a countable union of open intervals. Hence {f < c} ∈ B(R). □
Every monotone function f : R → R is a Borel function:
Proof: If f : R → R is monotone, then {f < c} is an interval (for all c ∈ R): Suppose, for example, that f is increasing, and let x0 = sup{x : f(x) < c}. Since f is increasing we have f(x) ≤ f(x0) whenever x ≤ x0. Hence
{f < c} = (−∞, x0] if f(x0) < c,   and   {f < c} = (−∞, x0) if f(x0) ≥ c
Hence {f < c} is an interval, and thus a member of B(R). The case where f is decreasing can be dealt with similarly. □
Proposition 2.2.8 Suppose that (Ω, F) is a measurable space, and that A = {An : n ∈ N}
is a partition of Ω which generates F, i.e. σ(A) = F. Then f : (Ω, F) → R̄ is measurable
if and only if f is constant on each block An .
Proof: Recall that every element of F is a union of some of the blocks An — see Exercise 1.2.5.
Further recall that each ω ∈ Ω belongs to exactly one block An .
Suppose first that f : (Ω, F) → R̄ is F/B(R̄)–measurable. Suppose that ω1 , ω2 belong to the
same block Ak . Let c := f (ω1 ). Then f −1 {c} ∈ F, and hence f −1 {c} is a union of blocks. Now
ω1 ∈ f −1 {c}, and hence Ak is one of the blocks in the union which makes up f −1 {c}, i.e. Ak ⊆
f −1 {c}. Since ω2 ∈ Ak, we also have ω2 ∈ f −1 {c}, and thus f(ω2) = c also. Thus f(ω1) = f(ω2). It
follows that f is constant (with value c) on the block Ak .
Conversely, suppose that f is constant on the blocks. Let f take the value cn on the block An ,
i.e. f (ω) = cn for all ω ∈ An . If B ∈ B(R̄), then
f −1 [B] = {ω : f(ω) ∈ B} = ⋃{An : cn ∈ B}
which is a union of blocks, and hence a member of F. Thus f is measurable. □
Proposition 2.3.1 (a) Suppose that f, g : (Ω, F) → R̄ are measurable functions and that α ∈ R. Then
f + g,   f²,   αf,   f · g,   f/g
are measurable functions, where we assume g ≠ 0 on Ω for the case f/g.
(b) If f1, f2, f3, . . . : (Ω, F) → R̄ are measurable, then supn fn, infn fn, lim supn fn and lim infn fn are measurable.
Proof: (a) Suppose that f, g are measurable. First, we show that f + g is measurable. By Propn.
2.1.8 it suffices to show that
Now f (s) + g(s) > c iff f (s) > c − g(s) iff f (s) > q > c − g(s) for some q ∈ Q. Thus
{f + g > c} = ⋃q∈Q ({f > q} ∩ {g > c − q})
Now {f > q}, {g > c − q} ∈ F because f, g are measurable. Since Q is a countable set, {f + g >
c} ∈ F also.
Next, we show that f² is measurable. This follows easily from Propn. 2.1.8 using the fact that
{f² ≤ c} = {−√c ≤ f ≤ √c} if c ≥ 0,   and   {f² ≤ c} = ∅ else
To see that αf is measurable is easy, e.g. if α > 0, then {αf < c} = {f < c/α}.
Next, to see that f g is measurable, use the polarization identity
f g = (1/4)[(f + g)² − (f − g)²]
Finally, to see that f/g is measurable, it suffices to see that 1/g is measurable. But if c > 0,
{1/g < c} = ({1/c < g} ∩ {g > 0}) ∪ ({1/c > g} ∩ {g < 0})
• |f |
Proof: If C ∈ T, then (g ◦ f)−1 [C] = f −1 [g −1 [C]]. Now g −1 [C] ∈ S because g is measurable. Thus f −1 [g −1 [C]] ∈ A, because f is measurable. □
2.4 Approximation by Simple Functions
Let (Ω, F) be a measurable space. Recall that a finite or countably infinite sequence (Fn)n of members of F is said to form a partition of Ω iff (i) the Fn are mutually disjoint (i.e. Fn ∩ Fm = ∅ when n ≠ m), and (ii) ⋃n Fn = Ω.
Proof: It is obvious that a function of the form f = Σi=1..n ci IFi (where each Fi ∈ F) is simple: f can only take on values which are sums of finitely many of the ci.
Suppose now that f is simple, i.e. that ran f = {c1, . . . , cn} is a finite set. Define Fi = f −1 {ci} for i = 1, . . . , n. Then the Fi form a partition of Ω, and f = Σi=1..n ci IFi. □
Simple functions play an important part in integration theory. Many important results are
proved first for simple functions, and then extended to arbitrary measurable functions by taking
limits. The next result is therefore extremely important:
Proposition 2.4.3 (a) For any non–negative measurable function f : (Ω, F) → R̄⁺ there exists a sequence of simple measurable functions fn, n ∈ N, such that 0 ≤ fn ↑ f. Moreover, if f is bounded, we can choose the fn so that fn → f uniformly.
(b) For any measurable function f : (Ω, F) → R̄, there is a sequence of simple measurable functions such that fn → f. Moreover, if f is bounded, we can choose the fn so that fn → f uniformly.
Proof: (a) For n ∈ N, define fn as follows: If f(s) ≥ n, put fn(s) := n; otherwise there is a unique integer k with k/2ⁿ ≤ f(s) < (k + 1)/2ⁿ, and we put fn(s) := k/2ⁿ. It follows that fn(s) ≤ fn+1(s), i.e. that ⟨fn(s)⟩n is increasing.
Next, we show that fn (s) → f (s) for all s ∈ S. If f (s) = +∞, then fn (s) = n for all
n ∈ N, so certainly fn (s) → f (s). If f (s) < ∞, choose N such that f (s) < N . If n ≥ N , then
0 ≤ f (s) − fn (s) ≤ 2−n , and thus |f (s) − fn (s)| ≤ 2−n . Thus fn (s) → f (s) in this case also.
Finally, if f is bounded, i.e. f ≤ N for some N ∈ N, then we see that |f (s) − fn (s)| ≤ 2−n for
all n ≥ N and all s ∈ S, i.e. fn → f uniformly.
(b) Now let f be an arbitrary measurable function to R̄. Then f is the difference of two non–
negative measurable functions f = f + − f − (cf. Remarks 2.3.3), and thus, as in (a), there exist
non–negative simple functions fn+ , fn− such that fn+ ↑ f + , fn− ↑ f − . Clearly then also (fn+ −fn− ) → f .
Now note that if f is bounded, so are f + , f − . If the fn+ and fn− converge uniformly, then also
(fn+ − fn−) → f uniformly. □
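The dyadic construction in (a) is easy to see numerically. A minimal Python sketch (illustrative only):

import math

def dyadic_approx(f, n):
    # The n-th simple function from the proof: f_n = n where f >= n,
    # and f_n = k/2^n on {k/2^n <= f < (k+1)/2^n} otherwise.
    def fn(s):
        y = f(s)
        return n if y >= n else math.floor(y * 2**n) / 2**n
    return fn

f = lambda s: s * s    # a non-negative (Borel) function
for n in (1, 2, 4, 8):
    print(n, dyadic_approx(f, n)(1.37))
# Prints 1, 1.75, 1.875, 1.875: the values increase towards f(1.37) = 1.8769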
Proposition 2.5.1 Suppose that f : (Ω, F) → (S, S) is measurable, and that µ is a (probability) measure on (Ω, F). Define a set function µf −1 on S by
µf −1 (T) = µ(f −1 [T])
Then µf −1 is a (probability) measure on (S, S). In particular, if X is a random variable on a probability space (Ω, F, P), then the law PX −1 of X satisfies
PX −1 (B) = P(X ∈ B)   for all B ∈ B(R)
Exercise 2.5.4 (a) Suppose that (Ω, F, P) is the die space, i.e. Ω = {1, 2, . . . , 6}, F = P(Ω) and P(ω) = 1/6 for all ω ∈ Ω. Define X : Ω → R : ω ↦ ω² − 5. Show that X is F/B(R)–measurable, and determine the law of X, i.e. the measure PX −1 on (R, B(R)).
(b) Suppose that F : R → R : x ↦ x². Show that F is a Borel function, and calculate λF −1 [−1, 3] (where λ is Lebesgue measure).
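For part (a), the law can be tabulated directly. A minimal Python sketch (illustrative; measurability is automatic here since F = P(Ω)):

from fractions import Fraction
from collections import defaultdict

Omega = range(1, 7)
P = Fraction(1, 6)                # P(omega) = 1/6 for each outcome

law = defaultdict(Fraction)       # the image measure P X^{-1}
for w in Omega:
    law[w * w - 5] += P           # X(omega) = omega^2 - 5

print(dict(law))   # mass 1/6 at each of -4, -1, 4, 11, 20, 31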
A measure µ on (R, B(R)) is said to be locally finite iff µ(I) < ∞ for every compact interval I. The next theorem states that there is a one–to–one correspondence between locally finite measures and increasing right–continuous functions.
Theorem 2.5.5 (∗)
(b) Conversely, given a locally finite measure µ on (R, B(R)), there is a unique right–continuous increasing function F with F(0) = 0 so that
µ(a, b] = F(b) − F(a)   for all a < b ∈ R
Chapter 3
Information and Independence

3.1 Conditional Probability and Independence of Events

Example 3.1.1 A die is rolled. Let A be the event that the outcome is a 6, let B be the event
that the outcome is an even number, and let C be the event that the outcome is an odd number.
Clearly P(A) = 1/6. However, if we know for sure that the outcome is an even number, then the probability of getting a 6 is 1/3, i.e. P(A|B) = 1/3. In the same way, if B occurs, then C cannot possibly occur, so although P(C) = 1/2, P(C|B) = 0.
Basically, what’s happening here is that we have to modify our probability measure to accom-
modate the “new” information that B has occurred. If P(· | B) is the new probability measure on
(Ω, F), then we must have P(B|B) = 1 and P(B c |B) = 0. If A is another event, then A occurs if
and only if A ∩ B occurs, since we know that B also occurs, and it makes sense to assume that the
new probability that A occurs is proportional to the old probability that A ∩ B occurs, i.e. that
P(A|B) = cP(A ∩ B) for some constant c. To ensure P(B|B) = 1, we must have c = P(B)−1 . We
therefore find that
P(A|B) = P(A ∩ B) / P(B),
the standard formula given in elementary probability theory texts.
Exercises 3.1.2 (1.) Prove that P(· | B) is a probability measure on (Ω, F).
Example 3.1.3 A couple has two children. Assuming that boys and girls are equally likely, and
given that one of the children is a girl, what is the probability that the other child is also a girl?
We can model this probability space as follows: take Ω = {GG, GB, BG, BB}, listing the sexes of the two children in birth order, with each outcome having probability 1/4. Let B = {GG, GB, BG} be the event that at least one child is a girl, and let A = {GG} be the event that both children are girls. Then
P(A|B) = P(A ∩ B)/P(B) = (1/4)/(3/4) = 1/3
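The computation can be checked by brute enumeration. A minimal Python sketch (illustrative only):

from fractions import Fraction

Omega = ["GG", "GB", "BG", "BB"]          # sexes in birth order
P = {w: Fraction(1, 4) for w in Omega}

B = {w for w in Omega if "G" in w}        # at least one girl
A = {"GG"}                                # both girls

P_B = sum(P[w] for w in B)
P_A_and_B = sum(P[w] for w in A & B)
print(P_A_and_B / P_B)                    # 1/3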
Two events A, B are said to be independent if knowledge of B tells us nothing about A, and
vice versa. By this we mean that our estimate of the probability that A occurs isn’t revised by the
knowledge that B has occurred. Thus:
P(A) = P(A|B) = P(A ∩ B) / P(B),   i.e.   P(A ∩ B) = P(A) P(B)
Definition 3.1.4 Let (Ω, F, P) be a probability space. A (possibly infinite) set A = {Ai : i ∈ I} of events is said to be an independent family provided that for any distinct i1, i2, . . . , in ∈ I,
P(Ai1 ∩ Ai2 ∩ · · · ∩ Ain) = P(Ai1) P(Ai2) · · · P(Ain)
Example 3.1.5 It’s worth pointing out that whether or not two events are independent depends on the probability measure, i.e. it is possible for events to be independent under one measure, but not under another. The notion of independence is therefore a genuinely probabilistic notion, which has no analogue in general measure theory.
(a) Consider the random trial of tossing a coin twice. The sample space Ω is the 4–element set {HH, HT, TH, TT} and the associated σ–algebra is just P(Ω). Intuitively, if the coin is fair, the outcome of the first coin should have no influence on the second. Thus knowing that the first coin has landed heads should make no difference to whether the second coin lands heads. Let B = {HH, HT} be the event that the first coin lands heads, and let A = {HH, TH} be the event that the second coin lands heads. Then P(A ∩ B) = P({HH}) = 1/4, and P(A) · P(B) = 1/2 · 1/2 = 1/4. Thus P(A ∩ B) = P(A) · P(B), i.e. the events A and B are indeed independent.
(b) Consider the same experiment as in (a), but with one important difference: Before the exper-
iment starts, we are told that the coin is unfair. It has either two heads, or two tails, but we
are not told which. Each possibility is equally likely.
To model this, we use a different probability measure Q, which has
Q({HH}) = 1/2 = Q({TT}), Q({HT}) = 0 = Q({TH})
In this case Q(A ∩ B) = 1/2, whereas Q(A)Q(B) = 1/4. Thus A and B are not independent under Q.
Remarks 3.1.6 Can an event be independent of itself, i.e. given an event A, can the events A, A
be independent? Here we have to be a little careful. From the intuitive point of view, the answer
would seem to be no, since the information that the event A has occurred will certainly make us re-
evaluate our estimation of the probability that A has occurred! However, if we look at the definition,
A and A will be independent provided P(A ∩ A) = P(A)P(A), i.e. provided P(A) = P(A)2 . This can
happen only if P(A) is either 0 or 1. That’s not too far removed from our intuition. If P(A) = 1,
for example, then A happens almost surely, so telling us that A has happened does not really give
us any information. We were practically certain that it would anyway.
Exercise 3.1.7 A gambling game involves the rolling of a fair die followed by the flipping of a fair
coin.
(b) Let A be the event that the die lands on an even number, and let B be the event that the coin
lands tails. Show that A and B are independent events.
Exercise 3.1.8 A breathalyser test for drinking and driving is 95% accurate, i.e. it gives the correct result 95% of the time. John lives in a small town with 1000 inhabitants, about 50 of whom are drunk on any given evening. One evening, John is stopped by the police, and tested. The test says that John is drunk. What is the probability that John is drunk?
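One way to sanity-check an answer to this exercise is Monte Carlo simulation. The sketch below assumes the natural reading of the numbers in the exercise (prior probability 50/1000 of being drunk; the test correct with probability 0.95 in both directions):

    import random

    random.seed(0)
    trials, p_drunk, accuracy = 10**6, 50/1000, 0.95

    positives = drunk_and_positive = 0
    for _ in range(trials):
        drunk = random.random() < p_drunk
        correct = random.random() < accuracy
        says_drunk = drunk if correct else not drunk
        if says_drunk:
            positives += 1
            drunk_and_positive += drunk
    # Estimate of P(drunk | test says drunk):
    print(drunk_and_positive / positives)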
Definition and Proposition 3.2.1 (a) Let (S, S) be a measurable space, and suppose that X is a collection of functions Ω → S. There is a smallest σ–algebra on Ω, denoted by
σ(X)
such that all X ∈ X are σ(X)/S–measurable. σ(X) is called the σ–algebra generated by X.
We also write σ(Xi : i ∈ I) for the σ–algebra generated by the family X = {Xi : i ∈ I}.
(b) For a single function X : Ω → S, we have
σ(X) = {X⁻¹[T] : T ∈ S}
In the probabilistic framework, σ–algebras play the role of carriers of information: Earlier, we
saw that if (Ω, F, P) is a probability space, then
• F is the set containing all those events for which it can be decided whether or not they
occurred.
• If C is a family of events, then σ(C) is the set containing all those events for which it can be
decided whether or not they occurred, given that we can decide all the events in C.
Similarly:
For a random variable X on a probability space Ω, the σ–algebra σ(X) can be interpreted in two ways (which are two sides of the same coin):
• σ(X) is the information carried by X: It is the set of all events that can be decided, given that we know the value of X.
• σ(X) is the smallest σ–algebra G on Ω such that X is G/B(R)–measurable.
Exercise 3.2.3 Suppose that X : (Ω, F) → (R, B(R)) is a function, and that ω1 , ω2 ∈ Ω are two
elements with the following property:
For all F ∈ F we have ω1 ∈ F ⇔ ω2 ∈ F
Show that if X is F–measurable, then X(ω1 ) = X(ω2 ). Thus if F cannot distinguish between ω1
and ω2 , neither can any F–measurable random variable.
[Hint: Define x := X(ω1) and consider X⁻¹[{x}].]
If X, Y are random variables such that σ(Y ) ⊆ σ(X), then the information needed to determine
the value of Y is a subset of the information required to determine the value of X. Hence, if we
know the value of X, we should also know the value of Y . This suggests that Y is a function of X.
The following theorem makes this precise.
Theorem 3.2.4 (Doob–Dynkin Lemma)
Suppose that Xi, Y : (Ω, F) → (R, B(R)) (i = 1, . . . , n) are measurable. Then Y is σ(X1, . . . , Xn)–measurable iff there is a Borel function h : Rⁿ → R such that Y = h(X1, . . . , Xn).
Proof: (⇐): We first show that the map X : Ω → Rⁿ : ω 7→ (X1(ω), . . . , Xn(ω)) is σ(X1, . . . , Xn)/B(Rⁿ)–measurable. By Propn. 2.1.8, it suffices to check that X⁻¹[(−∞, c1] × · · · × (−∞, cn]] ∈ σ(X1, . . . , Xn) for all (c1, . . . , cn) ∈ Rⁿ, because the family of these lower orthants generates B(Rⁿ). But
X⁻¹[(−∞, c1] × · · · × (−∞, cn]] = X1⁻¹(−∞, c1] ∩ · · · ∩ Xn⁻¹(−∞, cn] ∈ σ(X1, . . . , Xn)
h(x) = lim_k hk(x) if x ∈ M, and h(x) = 0 otherwise,
which implies two things: (i) (X1(ω), . . . , Xn(ω)) ∈ M, and (ii) Y = h(X1, . . . , Xn), as required. □
The basic idea is that two σ–algebras are independent if there is no information about an event
in one σ–algebra that would lead us to revise our estimate of the probability of any event in the
other σ–algebra.
Other variations (e.g. what it means for random variables X1, . . . , Xn to be independent) should be obvious.
We use the symbol ⊥ ⊥ to denote independence. Thus X ⊥⊥ G means that the random variable
X is independent of the σ–algebra G, etc.
Example 3.3.2 Suppose that A, B are events in some probability space (Ω, F, P). Then A =
{Ω, A, Ac , ∅} and B = {Ω, B, B c , ∅} are the σ–algebras of events that can be decided by knowledge
of A, B respectively. It is easy to show that A, B are independent events if and only if A, B are
independent σ–algebras. For example, P(A) = P(A ∩ B) + P(A ∩ Bᶜ), and thus, by independence of A, B, we have P(A ∩ Bᶜ) = P(A) − P(A)P(B) = P(A)[1 − P(B)] = P(A)P(Bᶜ). It follows that A, Bᶜ are independent if A, B are. The other combinations of events are similarly proven independent.
Recall that we decomposed the notion of σ–algebra into two parts, namely π–systems and λ–systems (cf. Proposition 1.6.2). The idea is that π–systems are easy to work with, whereas λ–systems mesh well with the properties of measures. As an example, we have the following result:
If two π–systems are independent, so are the σ–algebras generated by those π–systems:
Theorem 3.3.3 Let {Ct }t∈T be a collection of independent π–systems on (Ω, F, P). Then
{σ(Ct )}t∈T is a collection of independent σ–algebras.
Proof: We must show that if t1, . . . , tn ∈ T are distinct, then σ(Ct1), . . . , σ(Ctn) are independent. We proceed by recursion. Fix t1, . . . , tn ∈ T, define Ftk := σ(Ctk), and also fix Ct2 ∈ Ct2, . . . , Ctn ∈ Ctn. Let
D := {F ∈ Ft1 : P(F ∩ Ct2 ∩ · · · ∩ Ctn) = PF · PCt2 · . . . · PCtn}
By assumption, Ct1 ⊆ D. Using the additivity and continuity properties of measures, it is straightforward to check that D is a λ–system. Thus by Thm. 1.6.3 we have D = Ft1 for every selection of Ctk ∈ Ctk, k = 2, 3, . . . , n, and hence the families Ft1, Ct2, Ct3, . . . , Ctn are independent. Repeat: Fix Ft1 ∈ Ft1 and Ct3 ∈ Ct3, . . . , Ctn ∈ Ctn. Redefine
D := {F ∈ Ft2 : P(Ft1 ∩ F ∩ Ct3 ∩ · · · ∩ Ctn) = PFt1 · PF · PCt3 · . . . · PCtn}
Again, D is a λ–system containing Ct2, and hence by Thm. 1.6.3 D = Ft2. From this it follows that Ft1, Ft2, Ct3, . . . , Ctn are independent. Repeat the construction n − 2 more times to deduce that Ft1, . . . , Ftn are independent. □
If you're familiar with the elementary definition of independence for random variables, given in introductory courses on probability and statistics, you will want to know the following:
Exercise 3.3.4 Random variables X, Y are said to be independent in the elementary sense iff
P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y) for all x, y ∈ R
Show that this is equivalent to independence as defined above.
The following result is very easy to prove within the measure–theoretic framework, but very difficult to prove using the elementary definition of independence given in Exercise 3.3.4:
Theorem 3.3.5 Suppose that X1, . . . , Xn+m are independent random variables, and that f : Rⁿ → R and g : Rᵐ → R are Borel functions. Then Y = f(X1, . . . , Xn) and Z = g(Xn+1, . . . , Xn+m) are independent.
Proof: We see that σ(X1, . . . , Xn) and σ(Xn+1, . . . , Xm+n) are independent σ–algebras. Now Y is σ(X1, . . . , Xn)–measurable, i.e. σ(Y) ⊆ σ(X1, . . . , Xn). Similarly Z is σ(Xn+1, . . . , Xm+n)–measurable, i.e. σ(Z) ⊆ σ(Xn+1, . . . , Xm+n). Thus σ(Y) and σ(Z) are independent. □
In particular, if X ⊥⊥ Y then also f(X) ⊥⊥ g(Y) for any Borel functions f, g.
First Borel–Cantelli Lemma: Let (An)n∈N be a sequence of events (not necessarily independent). If ∑n P(An) < ∞, then
P(An, i.o.) = 0
Proof: Let Bn = ⋃_{k≥n} Ak. Then Bn ↓ lim sup An = (An, i.o.). Hence by countable subadditivity
P(An, i.o.) ≤ P(Bn) ≤ ∑_{k≥n} P(Ak)
for all n ∈ N. Now as n → ∞, the right–hand sum goes to zero, since ∑n P(An) converges. Hence P(An, i.o.) = 0. □
Second Borel–Cantelli Lemma: Let (An)n∈N be a sequence of independent events. If ∑n P(An) = ∞, then
P(An, i.o.) = 1
Proof: The proof depends on the fact that 1 − x ≤ e⁻ˣ for all x ∈ R, an inequality which is easily proved using first–year calculus. Now clearly (An, i.o.) = lim sup An = (lim inf Anᶜ)ᶜ = (Anᶜ, ev.)ᶜ, and thus it suffices to prove that P(Anᶜ, ev.) = 0. But (Anᶜ, ev.) = ⋃_{n≥1} ⋂_{k≥n} Akᶜ by definition, and so it suffices to show that P(⋂_{k≥n} Akᶜ) = 0 for all n. Now by independence of the An, and thus of the Anᶜ, we have
P(⋂_{k=n}^{n+m} Akᶜ) = ∏_{k=n}^{n+m} [1 − P(Ak)] ≤ ∏_{k=n}^{n+m} e^{−P(Ak)} = e^{−∑_{k=n}^{n+m} P(Ak)} → 0 as m → ∞
since ∑k P(Ak) diverges. By continuity of measure, P(⋂_{k≥n} Akᶜ) = lim_m P(⋂_{k=n}^{n+m} Akᶜ) = 0, as required. □
Remarks 3.4.3 The First Borel–Cantelli Lemma says that given events An , not necessarily in-
dependent, if the sum of the probabilities P(An ) converges, then (An , i.o.) is an event of zero
probability. The Second Borel–Cantelli Lemma says that if the An are independent and the sum
of the probabilities P(An ) diverges, then the event (An , i.o.) occurs almost surely, i.e. with prob-
ability 1. Thus for independent events An, there is no middle road: (An, i.o.) is either an event of probability 0 or an event of probability 1.
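The zero–one dichotomy can be glimpsed numerically. In the Python sketch below (an illustration, not part of the text), the independent events An have P(An) = 1/n (divergent sum) or P(An) = 1/n² (convergent sum); in the first case occurrences keep turning up arbitrarily late, in the second they stop early:

    import random

    random.seed(1)
    N = 10**6

    def last_occurrence(p):
        # Largest n <= N for which the independent event A_n occurs, P(A_n) = p(n).
        last = 0
        for n in range(1, N + 1):
            if random.random() < p(n):
                last = n
        return last

    print(last_occurrence(lambda n: 1 / n))       # often of the same order as N
    print(last_occurrence(lambda n: 1 / n**2))    # typically a small number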
Example 3.4.4 Suppose that Xn are independent exponentially distributed random variables with parameter λ for n = 0, 1, 2, . . . , i.e.
P(Xn ≤ x) = 1 − e^{−λx} if x ≥ 0, and P(Xn ≤ x) = 0 if x < 0
Exercise 3.4.5 It is sometimes asserted that if a monkey hit the keys of a typewriter at random, it would eventually produce, in one continuous stream, the complete works of William Shakespeare. Prove it.
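A hint at the intended argument: split the stream of keystrokes into disjoint blocks the length of the target text; the blocks are independent, and each one equals the target with the same positive probability, so the Second Borel–Cantelli Lemma applies. The Python sketch below (a toy stand-in: a 3–letter target on an assumed 26–key typewriter) estimates that per-block probability:

    import random, string

    random.seed(2)
    target = "abc"                        # toy stand-in for the complete works
    keys = string.ascii_lowercase         # assume a 26-key typewriter

    # Each disjoint block of len(target) keystrokes is an independent event A_n
    # with P(A_n) = (1/26)**3 > 0, so that sum_n P(A_n) diverges.
    trials = 10**6
    hits = sum(
        ''.join(random.choice(keys) for _ in range(len(target))) == target
        for _ in range(trials)
    )
    print(hits / trials, (1/26)**3)       # both approximately 5.7e-5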
Chapter 4
Expectation = Integration
E_P[X] = ∫ X dP
Throughout this section let (Ω, F, µ) be a measure space, and let mF be the set of all measurable functions from (Ω, F) to R̄. We will define a (partial) linear functional, also denoted by µ, or by ∫ · dµ, from mF to R̄, i.e.
µ = ∫ · dµ : mF → R̄, f 7→ µ(f) = ∫ f dµ
The quantity µ(f) = ∫ f dµ need not exist for every measurable function f. If it does exist, we say that f is integrable.
Below follows a wish list of properties that we would like the integral to possess. Bear in mind that not all wishes come true. In particular, we shall have to be content with a weaker version of wish (IV.).
WISH LIST:
I. ∫ I_A dµ = µ(A).
II. (Linearity) ∫ (αf + βg) dµ = α ∫ f dµ + β ∫ g dµ
III. (Monotonicity) If f ≤ g then ∫ f dµ ≤ ∫ g dµ
IV. (Continuity) Suppose that fn → f. Then ∫ fn dµ → ∫ f dµ.
From these wishes, we will be able to derive instructions for the definition of the integral.
Note that wish (I.) states that the integral µ is, in some sense, an extension of the measure µ: Every measurable set can be identified with a measurable function (the set A is identified with the indicator function I_A), and we require that µ(A) = µ(I_A). The integral ∫ f dµ = µ(f) can be thought of as extending the measure µ from measurable sets to measurable functions.
The definition of the integral proceeds in three steps:
Step 1. Define the integral for non–negative simple functions.
Step 2. Extend the definition to the set mF⁺ of all non–negative measurable functions.
Step 3. Extend the definition to arbitrary (integrable) measurable functions.
If ϕ is a non–negative simple function, there is only one way to define the integral to be consistent with wishes (I.) and (II.):
Proof: We may assume that the ak are all distinct from each other, and that the bj are all distinct from each other. Thus Ak, Bj ∈ σ(ϕ), the σ–algebra generated by ϕ. A little thought shows that there is a representation ϕ = ∑m cm I_{Cm} of ϕ such that (Cm)m forms a partition of S, and such that each Ak and Bj is a union of some Cm's — just let the Cm's be the blocks of the partition that generates the σ–algebra σ(ϕ). In particular, for each k, m, either Ak ∩ Cm = ∅, or Cm ⊆ Ak. Also, cm = ∑{ak : Cm ⊆ Ak}. A similar statement holds for the Bj.
(a) We have
∑k ak µ(Ak) = ∑k ∑m ak µ(Ak ∩ Cm) = ∑m ∑k ak µ(Ak ∩ Cm) = ∑m ∑{ak µ(Cm) : Cm ⊆ Ak} = ∑m cm µ(Cm)
(b) is obvious.
(c) Suppose that ϕ = ∑k ak I_{Ak}, ψ = ∑j bj I_{Bj}, where (Ak)k, (Bj)j are partitions of S. Then ϕ + ψ = ∑_{k,j} (ak + bj) I_{Ak∩Bj} and hence
µ(ϕ + ψ) = ∑_{k,j} (ak + bj) µ(Ak ∩ Bj) = ∑k ak µ(Ak) + ∑j bj µ(Bj)
(d) is obvious. □
If f is a non–negative measurable function, then wish (III.) requires that we must have ∫ f dµ ≥ ∫ ϕ dµ whenever ϕ is simple, with f ≥ ϕ. By Proposition 2.4.3, there is a sequence ⟨ψn⟩ of simple non–negative functions such that ψn ↑ f. Then
∫ f dµ = limn ∫ ψn dµ (by wish (IV.))
= supn ∫ ψn dµ (because if ⟨xn⟩ is increasing, then limn xn = supn xn)
≤ sup{∫ ϕ dµ : ϕ ∈ sF⁺, ϕ ≤ f}
≤ ∫ f dµ (by wish (III.))
The most parsimonious choice — and one that does not depend on the approximating sequence ⟨ψn⟩ — is therefore:
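Numerically, the supremum over simple functions below f is approached by the usual staircase functions. A Python sketch (Lebesgue measure on [0, 1] and f(x) = x² are illustrative choices; the true integral is 1/3):

    import numpy as np

    f = lambda x: x**2
    dx = 1e-5
    x = np.arange(0, 1, dx) + dx/2       # midpoints of a fine grid on [0, 1]

    for n in [1, 2, 4, 8]:
        # psi_n = floor(2^n f)/2^n is simple, psi_n <= f, and psi_n increases to f.
        psi_n = np.floor(2**n * f(x)) / 2**n
        print(n, (psi_n * dx).sum())     # integrals increase towards 1/3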
Proving that the integral defined in Step 2 is linear, i.e. that wish (II.) holds, is much more difficult, and requires a version of (IV.). In fact, a weak version of (IV.) forms the foundation for the whole edifice of integration theory:
This is true for any non–negative simple ϕ ≤ f and any ε > 0. Taking the supremum over those ϕ, we see that
limn ∫ fn dµ ≥ (1 − ε) sup{∫ ϕ dµ : ϕ ≤ f, ϕ ∈ sF⁺} = (1 − ε) ∫ f dµ
Letting ε → 0, we conclude that we also have limn ∫ fn dµ ≥ ∫ f dµ. □
In Propn. 4.1.2(b), Exercise 4.1.4(b) and Thm 4.1.5, we have seen that wishes (I.) and (III.),
and a weak version of wish (IV.) hold. We have also verified wish (II.) for non–negative simple
functions (cf. Proposition 4.1.2(c)). Now we can verify that wish (II.) holds for non–negative
measurable functions:
Proof: By Proposition 2.4.3, we may choose sequences ⟨ϕn⟩n, ⟨ψn⟩n of non–negative simple functions such that ϕn ↑ f, ψn ↑ g. Then each αϕn + βψn is non–negative simple, and (αϕn + βψn) ↑ (αf + βg). Since wish (II.) holds for simple functions, by the Monotone Convergence Theorem we see that
∫ (αf + βg) dµ = limn ∫ (αϕn + βψn) dµ = α limn ∫ ϕn dµ + β limn ∫ ψn dµ = α ∫ f dµ + β ∫ g dµ
It remains to define the integral for arbitrary measurable functions. Recall that if f ∈ mF, then
f = f⁺ − f⁻, |f| = f⁺ + f⁻
where f⁺ := max{f, 0} and f⁻ := max{−f, 0}. Since f⁺, f⁻, |f| ∈ mF⁺, the three integrals ∫ f⁺ dµ, ∫ f⁻ dµ, ∫ |f| dµ have already been defined in Step 2. Moreover, by wish (II.) for non–negative measurable functions (i.e. Proposition 4.1.6), we have ∫ f⁺ dµ + ∫ f⁻ dµ = ∫ |f| dµ.
If we want wish (II.) to hold for arbitrary measurable functions, i.e. if we want to preserve linearity, we have no choice but to define ∫ f dµ by
∫ f dµ = ∫ f⁺ dµ − ∫ f⁻ dµ
However, here we face a problem: If both ∫ f⁺ dµ, ∫ f⁻ dµ are equal to +∞, we have ∫ f dµ = ∞ − ∞, an indeterminate form. We therefore demand that both integrals ∫ f⁺ dµ, ∫ f⁻ dµ are finite. Since both integrals are non–negative, this is equivalent to demanding that the sum ∫ f⁺ dµ + ∫ f⁻ dµ = ∫ |f| dµ is finite.
Notation: If ∫ f dµ exists, we also write µ(f) (or simply µf) for ∫ f dµ, and L¹(Ω, F, µ) denotes the set of all integrable functions.
Remarks 4.1.8 Later, we will prove the following important fact: If the Riemann integral ∫ₐᵇ f(x) dx of a function f : R → R exists, then
∫ₐᵇ f(x) dx = ∫_{[a,b]} f(x) λ(dx)
where λ is Lebesgue measure on (R, B(R)). If the Riemann integral of a function exists, then so does the Lebesgue integral, and the two integrals coincide. This is obvious if f is a step function, as you can easily check, but the proof for general f is deferred to a later section. Note that the Lebesgue integral may exist even when the Riemann integral does not.
Proof: (a) is obvious. (b) follows from the fact that µ|f| ≤ µg < ∞ (because we have (III.), monotonicity, for non–negative measurable functions).
(c) Let A = {s ∈ S : |f(s)| = ∞}. Then nI_A ≤ |f| for all n ∈ N, and hence nµ(A) ≤ ∫ |f| dµ. Letting n → ∞, we see that we must have ∫ |f| dµ = +∞ if µ(A) > 0.
(d) follows because by the triangle inequality |∫ f dµ| = |∫ f⁺ dµ − ∫ f⁻ dµ| ≤ |∫ f⁺ dµ| + |∫ f⁻ dµ| = ∫ |f| dµ. □
Exercise 4.1.10 The decomposition f = f⁺ − f⁻ is but one of many ways that f can be decomposed as a difference of non–negative measurable functions. Show that if f = g − h is a difference of non–negative integrable functions, then µf = µg − µh. Thus the definition of the integral of f is independent of the representation of f as a difference of non–negative measurable functions.
[Hint: Apply Proposition 4.1.6 to f⁺ + h = g + f⁻.]
Looking at our wish list of properties, i.e. (I.)–(IV.), we see that (I.) holds automatically. (III.) (monotonicity) is easy: If f ≤ g, then f⁺ ≤ g⁺ and f⁻ ≥ g⁻, so ∫ f⁺ dµ ≤ ∫ g⁺ dµ and ∫ f⁻ dµ ≥ ∫ g⁻ dµ (because (III.) holds for non–negative measurable functions, cf. Exercise 4.1.4(b)). Subtracting, we obtain ∫ f dµ ≤ ∫ g dµ.
Proof: It suffices to prove that ∫ (f + g) dµ = ∫ f dµ + ∫ g dµ and that ∫ (αf) dµ = α ∫ f dµ (for f, g ∈ L¹ and α ∈ R). Now
f + g = (f⁺ + g⁺) − (f⁻ + g⁻)
is a representation of f + g as a difference of non–negative measurable functions. By Exercise 4.1.10, it follows that ∫ (f + g) dµ = ∫ (f⁺ + g⁺) dµ − ∫ (f⁻ + g⁻) dµ. Proposition 4.1.6 implies that ∫ (f + g) dµ = ∫ f dµ + ∫ g dµ.
Similarly, an application of Proposition 4.1.6 and Exercise 4.1.10 to αf = αf⁺ − αf⁻ (if α ≥ 0), or αf = (−α)f⁻ − (−α)f⁺ (if α < 0), yields the conclusion that ∫ (αf) dµ = α ∫ f dµ. □
Exercise 4.1.12 Show that L1 (Ω, F, µ) is a vector space. Also give an example to show that it
may not be closed under multiplication.
(b) REVERSE FATOU LEMMA: Suppose that fn ∈ mF⁺ for n ∈ N, and that there is g ∈ L¹(Ω, F, µ) such that {fn : n ∈ N} is dominated by g. Then
lim supn ∫ fn dµ ≤ ∫ lim supn fn dµ
Proof: (a) Let f = lim infn fn, and define gn = inf_{m≥n} fm. Then gn ↑ f (by definition of lim inf), and so the Monotone Convergence Theorem implies that ∫ gn dµ ↑ ∫ f dµ. Moreover, ∫ gn dµ ≤ inf_{m≥n} ∫ fm dµ (by monotonicity, (III.)), and so ∫ f dµ = limn ∫ gn dµ ≤ limn inf_{m≥n} ∫ fm dµ = lim infn ∫ fn dµ. □
Exercise 4.2.2 Prove the Reverse Fatou Lemma by applying Fatou’s Lemma to the sequence
g − fn . (Why do we require that g ∈ L1 ? Cancellation! For x + y = x + z does not imply that
y = z when x = ∞.)
This provides a useful mnemonic: The terms with the limits on the outside (of the integral) are on
the inside (of the string of inequalities).
The mnemonic Terms with limits on the inside are on the outside also works.
Theorem 4.2.4 (Dominated Convergence Theorem, DCT)
Suppose that f1, f2, f3, . . . is a sequence of measurable functions on (Ω, F, µ) such that fn → f pointwise, and that there is g ∈ L¹(Ω, F, µ) with |fn| ≤ g for all n ∈ N. Then f ∈ L¹(Ω, F, µ) and ∫ fn dµ → ∫ f dµ.
Proof: Since |fn| ≤ g, the functions g ± fn are non–negative measurable functions, and thus by Fatou's lemma, we see that
∫ g dµ + lim infn (± ∫ fn dµ) = lim infn ∫ (g ± fn) dµ ≥ ∫ lim infn (g ± fn) dµ = ∫ (g ± f) dµ = ∫ g dµ ± ∫ f dµ
Subtracting ∫ g dµ < ∞ from both sides, we see that lim infn ∫ fn dµ ≥ ∫ f dµ and that lim infn (− ∫ fn dµ) ≥ − ∫ f dµ, i.e. that lim supn ∫ fn dµ ≤ ∫ f dµ. Combining, we obtain
∫ f dµ ≤ lim infn ∫ fn dµ ≤ lim supn ∫ fn dµ ≤ ∫ f dµ □
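The domination hypothesis cannot be dropped. A standard counterexample, checked numerically below (Lebesgue measure on [0, 1]; the grid is only an approximation): fn = n·I_(0,1/n] converges to 0 pointwise, yet ∫ fn dλ = 1 for every n, and no integrable g dominates all the fn.

    import numpy as np

    dx = 1e-6
    x = np.arange(0, 1, dx) + dx/2       # grid on [0, 1]

    for n in [10, 100, 1000]:
        f_n = np.where(x <= 1/n, float(n), 0.0)
        # f_n -> 0 pointwise on (0, 1], yet the integrals stay at 1.
        print(n, (f_n * dx).sum())       # approximately 1 for every n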
If F ∈ F has µ(F) = 0, and if A ⊆ F, then “clearly” µ(A) = 0 also. However, if A ∉ F, then µ(A) is undefined! Yet, µ(A) clearly “ought” to be zero. By adding all those sets whose measure “ought” to be zero, we get a new σ–algebra F^µ, called the completion of F w.r.t. µ.
(a) We say that A is µ–null if there exists B ∈ F such that A ⊆ B and µ(B) = 0.
(It is not necessary that A ∈ F.)
(c) The measure space (Ω, F, µ) is said to be complete iff every µ–null set is measurable,
i.e. belongs to F.
Exercise 4.3.2 Show that the family N^µ of µ–null sets is closed under countable unions, i.e. that if Nn ∈ N^µ for n ∈ N, then also ⋃n Nn ∈ N^µ.
Definition and Proposition 4.3.3 Let (Ω, F, µ) be a measure space. Let N^µ be the family of µ–null sets.
(b) We have
F^µ = {F ∪ N : F ∈ F, N ∈ N^µ}
(c) We can extend the measure µ to a measure µ̄ on the σ–algebra F^µ in the obvious way:
µ̄(F ∪ N) := µ(F) for F ∈ F, N ∈ N^µ
Proof: (a) We first show that F^µ is a σ–algebra. That F^µ is closed under countable unions follows straightforwardly from the fact that both F and N^µ are closed under countable unions. To check that F^µ is closed under complementation, suppose that F ∪ N ∈ F^µ, where F ∈ F, N ∈ N^µ. We must show that (F ∪ N)ᶜ ∈ F^µ as well. Choose G ∈ F such that µ(G) = 0 and N ⊆ G. Then
(F ∪ N)ᶜ = (F ∪ G)ᶜ ∪ [G − (F ∪ N)]
where (F ∪ G)ᶜ ∈ F and G − (F ∪ N) ∈ N^µ, so that (F ∪ N)ᶜ ∈ F^µ. Finally, if N ⊆ G where G is µ–null, there exists F ∈ F such that G ⊆ F and µ(F) = 0. Putting all this together, we obtain N ⊆ F and µ(F) = 0. Thus N ∈ N^µ ⊆ F^µ. □
Unless it is likely to cause confusion, we shall usually refer to the measure µ̄ by the name µ.
Definition 4.3.4 We shall say that a statement Φ holds µ–almost everywhere (or µ–almost surely if µ is a probability measure), if the set {ω ∈ Ω : Φ(ω) is not true} where Φ fails to hold is a µ–null set.
First note that completing a measure space does not create any interesting new measurable
functions:
Proposition 4.3.5 Let (Ω, F^µ, µ) be the completion of (Ω, F, µ). Then a function f : Ω → R̄ is F^µ–measurable iff there is an F–measurable function g : Ω → R̄ such that f = g µ–a.e.
Proof: (⇒): First suppose that f = IA is an indicator function, where A ∈ F µ . Then there exist
F ∈ F and N ∈ N µ such that A = F ∪ N . Clearly {ω ∈ Ω : IA (ω) 6= IF (ω)} = N is µ–null, so
IA = IF µ–a.e., and IF is F–measurable.
It is now straightforward to see that the proposition holds for simple functions as well.
If f is an arbitrary F^µ–measurable function, we may (by Proposition 2.4.3) choose a sequence fn of simple F^µ–measurable functions such that fn → f. Then choose simple F–measurable functions gn such that fn = gn µ–a.e., for all n ∈ N. Let g = lim supn gn. Then g is F–measurable and f = g µ–a.e. (because {ω ∈ Ω : f(ω) ≠ g(ω)} ⊆ ⋃n {ω ∈ Ω : fn(ω) ≠ gn(ω)}, a countable union of null sets, and hence µ–null).
(⇐): Suppose that f = g µ–a.e. for some F–measurable function g. To show that f is F^µ–measurable, we must show that f⁻¹[B] ∈ F^µ for any Borel set B. Certainly g⁻¹[B] ∈ F, because g is F–measurable. Let N ∈ F be such that {f ≠ g} ⊆ N and µ(N) = 0. Then a little thought will show that
f⁻¹[B] = (g⁻¹[B] − N) ∪ (f⁻¹[B] ∩ N) (?)
where g⁻¹[B] − N ∈ F and f⁻¹[B] ∩ N ∈ N^µ.
Exercise 4.3.6 Show that if (Ω, F, µ) has completion (Ω, F^µ, µ), then
F^µ = {A ⊆ Ω : A∆F ∈ N^µ for some F ∈ F}
[Hint: Let N^µ be the family of null sets, and let G := {A ⊆ Ω : A∆F ∈ N^µ for some F ∈ F}. First show that F, N^µ ⊆ G, and conclude that F^µ ⊆ G. Next, note that if A ∈ G, say A∆F ∈ N^µ for some F ∈ F, then there is N ∈ F with µ(N) = 0 such that A∆F ⊆ N. Show that A = (F − N) ∪ (A ∩ N), where F − N ∈ F and A ∩ N ∈ N^µ, and conclude that A ∈ F^µ.]
Next, we shall show that two functions which are equal µ–a.e. have the same integrals. To do
so, the following lemma will be useful:
Lemma 4.3.8 On (Ω, F, µ), if h ≥ 0 is measurable, then ∫ h dµ = 0 iff h = 0 µ–a.e.
Proof: It is easy to see that the statement is true if h is a simple non–negative measurable function. For general h ∈ mF⁺, choose simple hn such that 0 ≤ hn ↑ h. If ∫ h dµ = 0, then by the Monotone Convergence Theorem, 0 ≤ ∫ hn dµ ≤ ∫ h dµ = 0, so that, by the above, hn = 0 µ–a.e. (since the result holds for simple functions, and hn is simple). Thus also h = limn hn = 0 µ–a.e.
Conversely, if h = 0 µ–a.e., then also hn = 0 µ–a.e., and hence ∫ h dµ = limn ∫ hn dµ = 0, by the MCT. □
Theorem 4.3.9 If f, g are measurable functions on (Ω, F, µ) such that f = g µ–a.e., and
if f is integrable, then g is integrable, and µf = µg.
Proof: We have 0 ≤ |∫ f dµ − ∫ g dµ| ≤ ∫ |f − g| dµ, by Proposition 4.1.9. But f = g µ–a.e. iff |f − g| = 0 µ–a.e., so Lemma 4.3.8 shows that 0 ≤ |∫ f dµ − ∫ g dµ| ≤ 0. □
Define f by f(ω) = limn fn(ω) if this limit exists, and let f(ω) be arbitrary otherwise. Then f ∈ L¹(Ω, F, µ), and
∫ f dµ = limn ∫ fn dµ
Proof: Let
N = {ω ∈ Ω : limn fn(ω) does not exist} ∪ ⋃n {ω ∈ Ω : |fn(ω)| > g(ω)}
Then N is a null set, and thus in F (because the measure space is assumed complete). Define
f̄n = fn I_{Nᶜ}, ḡ = g I_{Nᶜ}, f̄ = f I_{Nᶜ}
These functions are also F–measurable, and ḡ = g µ–a.e., f̄ = f µ–a.e., and f̄n = fn µ–a.e. Since ∫ ḡ dµ = ∫ g dµ < ∞ (by Theorem 4.3.9), we see that ḡ is integrable. Also, we have
limn f̄n(ω) = f̄(ω) and |f̄n(ω)| ≤ ḡ(ω) for all ω ∈ Ω
A tagged partition is a partition P together with a choice t∗k ∈ [tk−1 , tk ] for each k = 1, . . . , n.
Tagged partitions will be indicated by a ∗, i.e. if P is a partition, then P ∗ denotes an associated
tagged partition.
With each tagged partition, we can associate a Riemann sum
S(P*, f) := ∑_{k=1}^n f(t*k)(tk − tk−1) = ∑_{k=1}^n f(t*k) ∆kt
The Riemann integral ∫ₐᵇ f dt should be the limit of the Riemann sums, over all tagged partitions P*, as ||P|| → 0, where ||P|| := maxk ∆kt is the mesh of the partition. To be precise, we write
∫ₐᵇ f dt = lim_{||P||→0} S(P*, f)
provided this limit exists, and say that f is Riemann integrable on [a, b].
With each partition {a = t0 < t1 < · · · < tn = b} it is possible to associate three natural tagged
partitions, namely those having tags equal to the left endpoint, right endpoint and midpoint of
each interval. This yields:
• The lefthand Riemann sum ∑k f(tk−1)∆kt;
• The righthand Riemann sum ∑k f(tk)∆kt;
• The symmetric Riemann sum ∑k f((tk−1 + tk)/2)∆kt;
where ∆kt := tk − tk−1. If f is Riemann integrable over [a, b], then each of these sums must converge as ||P|| → 0, and all to the same limit.
Remarks 4.4.1 A slightly different definition uses Darboux sums rather than Riemann sums. Given a real–valued function f defined and bounded on an interval [a, b], and a partition P = {a = t0 < t1 < · · · < tn = b}, let the upper and lower Darboux sums be defined by
U(P, f) := ∑_{k=1}^n sup{f(t) : t ∈ [tk−1, tk]} · (tk − tk−1)
L(P, f) := ∑_{k=1}^n inf{f(t) : t ∈ [tk−1, tk]} · (tk − tk−1)
i.e. the Darboux sums give the most extreme values of the Riemann sums for any given partition. Furthermore, given any ε > 0, it is possible to find tags P*, P′ for the same partition P such that |S(P*, f) − L(P, f)|, |U(P, f) − S(P′, f)| ≤ ε: To see this, observe that if we choose t*k ∈ [tk−1, tk] so that f(t*k) − inf_{tk−1≤t≤tk} f(t) < ε/(n||P||), then
0 ≤ S(P*, f) − L(P, f) = ∑_{k=1}^n [f(t*k) − inf_{tk−1≤t≤tk} f(t)] ∆kt < ∑_{k=1}^n (ε/(n||P||)) · ||P|| = ε
A bounded function f : [a, b] → R is Riemann integrable if and only if the limits lim_{||P||→0} L(P, f) and lim_{||P||→0} U(P, f) exist and are equal.
Observe, however, that Riemann sums may be defined even when f is Banach space–valued, whereas Darboux sums, being dependent on sup's and inf's, make sense for real–valued functions only.
From calculus, we know that the Riemann integral ∫ₐᵇ f dt exists when f is continuous (or even piecewise continuous) on [a, b] — cf. also Thm. 4.4.4. When the function is too discontinuous, we run into trouble, however:
Consider f = I_Q restricted to [a, b], where Q is the set of rational numbers. If P = {a = t0 < t1 < · · · < tn = b} is any partition of [a, b], no matter how fine, we can always find tags t*k, t′k ∈ [tk−1, tk] so that t*k is rational, and t′k is irrational. Thus I_Q(t*k) = 1, I_Q(t′k) = 0. It follows that
S(P*, f) = ∑k 1 · (tk − tk−1) = b − a, S(P′, f) = ∑k 0 · (tk − tk−1) = 0
and thus S(P*, f), S(P′, f) cannot be made to lie arbitrarily close to each other, no matter how fine the partition P. Thus lim_{||P||→0} S(P*, f) does not exist.
||P ||→0
When the Riemann integral is first encountered in calculus, it is taught as “the area under a curve”: If f ≥ 0 is continuous, then ∫ₐᵇ f dt is the area under the curve described by f, between t = a and t = b. For A ⊆ R, define the indicator function of A by
I_A(t) := 1 if t ∈ A, and I_A(t) := 0 else
Consider now I_Q, where Q is the set of rational numbers. This function is very discontinuous. If we try to compute the area under this “curve” over the interval [0, 1] using the Riemann integral, we run into trouble: The Riemann integral ∫₀¹ I_Q dt does not exist.
We can make a convincing argument that the area under the curve over the interval [0, 1] should be zero, as follows: Use the fact that Q is countable to enumerate the rational numbers in [0, 1], i.e. write [0, 1] ∩ Q = {qn : n ∈ N}. For any ε > 0, define
Bn := [qn − ε/2ⁿ⁺¹, qn + ε/2ⁿ⁺¹] for n ∈ N, and f = I_{⋃n Bn}
The area under the curve of f is made up of (possibly overlapping) rectangles of height 1 centered at the rational numbers. Thus the area under f over [0, 1] is ≤ ∑n 1 · (length of Bn) = ∑n ε/2ⁿ = ε. It is also clear that 0 ≤ I_Q ≤ f, and thus that the area under I_Q is less than the area under f, i.e. that the area under I_Q is ≤ ε. Since this is true for any ε > 0, we conclude that the area under I_Q is 0.
Thus we have the following:
The Riemann integral ∫₀¹ I_Q dt is undefined, but it should be zero
The Riemann integral is simply not powerful enough to handle functions like IQ .
You may counter that a function such as IQ is pathological, and unlikely to be encountered in
practice. It is true that we chose it here simply to make a point. However, the following example
should cause you to feel uneasy about the assertion that IQ is “pathological”:
Unlike the Lebesgue integral, the Riemann integral does not handle limits well. We now show that when the Riemann integral exists, then the Lebesgue integral (w.r.t. Lebesgue measure λ) does too, and the integrals' values coincide:
Theorem 4.4.4 Let f be a bounded real–valued function on the compact interval [a, b]. Then
(a) f is Riemann integrable if and only if f is continuous λ–a.e. on [a, b].
(b) If f is Riemann integrable, then f is Lebesgue integrable, and the integrals are equal: ∫ₐᵇ f dt = ∫_{[a,b]} f dλ.
Proof: Assume that f is Riemann integrable. Then we can choose a sequence Pn of successively finer partitions of [a, b] such that U(Pn, f) − L(Pn, f) < 1/n — see Remarks 4.4.1 for the definitions of U(Pn, f) and L(Pn, f). Define functions gn, hn on [a, b] as follows: For each n, gn(a) = hn(a) = f(a). If Pn = {a = tⁿ0 < tⁿ1 < tⁿ2 < · · · < tⁿmn = b}, then gn, hn are step functions, with steps determined by Pn, defined as follows: If t ∈ [a, b], then t ∈ (tⁿk−1, tⁿk] for some k, and we define
gn(t) = inf{f(x) : tⁿk−1 < x ≤ tⁿk}, hn(t) = sup{f(x) : tⁿk−1 < x ≤ tⁿk}
Moreover ⟨gn⟩n is a bounded increasing sequence, with gn ≤ f, and ⟨hn⟩n is a bounded decreasing sequence, with hn ≥ f. Define g = limn gn, h = limn hn. Then g, h are Borel functions, and by the Dominated Convergence Theorem we have ∫_{[a,b]} g dλ = limn L(Pn, f) = ∫ₐᵇ f dt and ∫_{[a,b]} h dλ = limn U(Pn, f) = ∫ₐᵇ f dt. Hence ∫_{[a,b]} (h − g) dλ = 0.
Now since h ≥ g, Lemma 4.3.8 implies that h = g λ–a.e. on [a, b]. Since g ≤ f ≤ h, we must have g = f = h λ–a.e., and thus ∫ f dλ = ∫ g dλ = limn L(Pn, f) = ∫ₐᵇ f dt. This proves (b).
Next, note that if t ∉ ⋃n Pn, and if h(t) = g(t), then f is necessarily continuous at t: For then g(t) = f(t) = h(t), i.e.
limn inf{f(x) : tⁿkn−1 < x ≤ tⁿkn} = f(t) = limn sup{f(x) : tⁿkn−1 < x ≤ tⁿkn}
Proof: We need only check that ν is countably additive. Suppose that A = ⋃_{k=1}^∞ Ak is a union of a family of mutually disjoint members of F. Put fn = ∑_{k≤n} f I_{Ak}. Then fn ↑ f I_A, and so ∫ fn dµ ↑ ∫_A f dµ, by the MCT, i.e. limn ∫ fn dµ = ν(A). But ∫ fn dµ = ∑_{k≤n} ∫ f I_{Ak} dµ = ∑_{k≤n} ν(Ak), and thus ∫ fn dµ ↑ ∑_{k=1}^∞ ν(Ak) as n → ∞. We conclude that also limn ∫ fn dµ = ∑_{k=1}^∞ ν(Ak), and hence that ν(A) = ∑_{k=1}^∞ ν(Ak). □
The following proposition explains the notation dν/dµ:
i.e.
∫ gf dµ = ∫ g (dν/dµ) dµ = ∫ g dν (?)
This means that whenever one side of (?) exists, then so does the other, and the two sides are equal.
Proof: If g = I_A is an indicator function, then ∫ I_A f dµ = ν(A) = ∫ I_A dν, by definition of ν. If g = ∑_{k≤n} αk I_{Ak} is simple, then ∫ gf dµ = ∑_{k≤n} αk ∫ I_{Ak} f dµ = ∑_{k≤n} αk ν(Ak) = ∫ g dν, by linearity of the integral. So the result holds for simple g.
If g is a non–negative measurable function, we may choose simple gn ↑ g, by Proposition 2.4.3. Since f ≥ 0, we have also gn f ↑ gf. Then by the MCT and the fact that the result holds for simple gn, we obtain ∫ gf dµ = limn ∫ gn f dµ = limn ∫ gn dν = ∫ g dν.
Finally, if g is an arbitrary measurable function, then ∫ |gf| dµ = ∫ |g|f dµ = ∫ |g| dν, since f, |g| are non–negative. Hence gf is µ–integrable if and only if g is ν–integrable (by Proposition 4.1.9). Now split g into its positive and negative parts to see that ∫ gf dµ = ∫ g⁺f dµ − ∫ g⁻f dµ = ∫ g⁺ dν − ∫ g⁻ dν = ∫ g dν. □
Remarks 4.5.3 The above proof illustrates a useful technique, which David Williams¹ calls the standard machine. To prove something holds for all integrals of a certain type:
• First verify the result for indicator functions, and extend it to simple functions by linearity;
• Then use the MCT to lift the result to non–negative measurable functions;
• And finally split an arbitrary measurable f into its positive and negative parts, and use linearity once again.
¹cf. his excellent (and short) book Probability with Martingales.
We now come back to the measure µf⁻¹ defined in Section 2.5: Recall that if f : (Ω, F, µ) → (T, T) is measurable, then the map
µf⁻¹ : T → R̄ : B 7→ µ(f⁻¹[B])
defines a measure on (T, T) — see Proposition 2.5.1. Also if g : (T, T) → (R̄, B(R̄)) is measurable, then so is g ∘ f : (Ω, F) → (R̄, B(R̄)), by Proposition 2.3.5.
The next result shows that the integrals ∫ g ∘ f dµ and ∫ g d(µf⁻¹) are equal:
∫ g ∘ f dµ = ∫ g d(µf⁻¹)
i.e. whenever one side of this equation exists, then so does the other, and the two sides are equal.
Exercise 4.5.5 Prove Proposition 4.5.4.
[Hint: Use the standard machine: first for indicators, then simple — and then non-negative measurable g. For arbitrary measurable g, to check integrability, observe that ∫ |g ∘ f| dµ = ∫ |g| ∘ f dµ = ∫ |g| d(µf⁻¹), because |g| is non–negative. Now split g into its positive and negative parts.]
Consider a discrete random variable
X = ∑_{k=1}^∞ xk I_{Ak}
where the xk are the values that X can take, and Ak = {ω ∈ Ω : X(ω) = xk}. We are interested in the value of ∫ X dP, assuming that it exists. Using the Lebesgue Dominated Convergence Theorem, it is easy to see that
∫ X dP = ∑_{k=1}^∞ xk P(Ak) = ∑_{k=1}^∞ xk P(X = xk)
(because Xn = ∑_{k=1}^n xk I_{Ak} is dominated by |X| = ∑_{k=1}^∞ |xk| I_{Ak}, assuming that the Ak are mutually disjoint.) But this sum is just the definition of the expected value of a discrete random variable, i.e.
E[X] = ∫ X dP
Example 4.6.2 Let (Ω, F, P) be a probability space, and let X be a continuous random variable, i.e. a random variable that has a probability density function fX such that
P(X ≤ x) = ∫_{−∞}^x fX(t) dt = ∫_{(−∞,x]} fX dλ
(This is the definition of a continuous random variable.) Now let PX := P ∘ X⁻¹ be the law of X. Recall that PX is a probability measure on (R, B(R)) with the property that PX((−∞, x]) = P(X ≤ x). Defining the measure νX on (R, B(R)) by νX(B) := ∫_B fX dλ, it follows that νX = PX. This is because PX, νX agree on the π–system of intervals of the form (−∞, a] which generates B(R) — see Proposition 1.6.4.
In particular:
fX = dPX/dλ
i.e. the density of a continuous random variable X is precisely the Radon–Nikodým derivative of the law of X w.r.t. Lebesgue measure!
Now let g : R → R be a Borel function and consider the integral ∫ g(X) dP, assuming that it exists. We use the Change of Variable Formula (Proposition 4.5.4) to obtain
∫ g(X) dP = ∫ g ∘ X(ω) P(dω) = ∫ g(x) P ∘ X⁻¹(dx) = ∫ g(x) PX(dx) = ∫ g(x) fX(x) λ(dx)
where the last step uses fX = dPX/dλ. But the integral on the right is just the definition of the expected value of a continuous random variable, i.e.
E[g(X)] = ∫ g(X) dP
In particular, with g(x) := x, we have E[X] = ∫ X dP.
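The two readings of E[g(X)] — an integral over Ω, estimated by averaging samples of g(X), and an integral over R against the density fX — can be compared numerically. A Python sketch (standard normal X and g(x) = x² are illustrative choices; both values should be near E[X²] = 1):

    import numpy as np

    rng = np.random.default_rng(0)
    g = lambda x: x**2

    # E[g(X)] as an integral over Omega: average g over samples of X.
    X = rng.standard_normal(10**6)
    print(g(X).mean())                         # approximately 1

    # E[g(X)] as an integral over R against the density f_X = dP_X/dlambda.
    dx = 1e-3
    x = np.arange(-10, 10, dx)
    f_X = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
    print((g(x) * f_X * dx).sum())             # approximately 1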
We now wipe the slate clean and redefine the expectation of a random variable as follows:
The two examples show that this definition will give the same results as the two earlier definitions
of expectation for discrete and continuous random variables respectively. Moreover, we now also
have a definition of the expected value of a random variable which is neither discrete nor continuous.
Some notation: If X is a random variable on a probability space (Ω, F, P) and F ∈ F, we define the integral E[X; F] of X over F by
E[X; F] := ∫_F X dP := ∫ X I_F dP
Now that we’ve defined the expectation of a random variable as its integral with respect to the
probability measure, several facts about expectation are immediately obvious:
(b) E[aX +bY ] = aE[X]+bE[Y ] for any random variables X, Y (whose expectations exist)
and any a, b ∈ R.
(d) If X, Y are random variables and X = Y almost surely, then E[X] = E[Y ] (where one
expectation exists if and only if the other exists).
Remark 4.6.5 We’ll stress once more the point that our definition of expectation as a Lebesgue
integral is superior to the definitions in elementary courses. Suppose that you want to prove the
following general theorem:
E[X + Y ] = E[X] + E[Y ]
If X, Y are discrete, this is easy. If X, Y have a joint density function, this is a little harder, but
doable. Suppose, however, that X, Y do not have a joint density function nor a joint probability
mass function. It is easy to construct such pairs. For example, suppose X is a standard normal
random variable, and that Y is a Bernoulli variable with values 1 and −1. In that case, the pair
(X, Y ) can assume uncountably many values, so there is no joint probability mass function. On
the other hand, were the joint density f to exist, we would have f(x, y) ≠ 0 only for pairs (x, y) in a set whose total area is zero, namely the union of two horizontal lines y = ±1. It follows that ∫_{R²} f(x, y) dx dy = 0, so f cannot be a density.
Of course, the fact that E[X + Y ] = E[X] + E[Y ] follows directly from the linearity of the
integral.
Limit theorems about integration translate directly into limit theorems about expectation.
Xn → X a.s.
Then
(c) If there is a non-negative random variable Y with finite expectation such that |Xn | ≤ Y
for all n, then E[|Xn − X|] → 0, so that E[Xn ] → E[X].
Proof: (a) is the Monotone Convergence Theorem, (b) is Fatou’s Lemma, and (c) is the Lebesgue
Dominated Convergence Theorem. a
4.7 Inequalities
We end this chapter by proving some important inequalities. For a random variable X, we define
its variance Var(X) by
Var(X) := E[(X − E[X])2 ]
i.e. Var(X) measures how far we expect an outcome X(ω) to lie from the mean E[X].
Markov's Inequality: Let X be a random variable and let g : R → [0, ∞) be non–decreasing. Then for any c ∈ R,
E[g(X)] ≥ g(c)P(X ≥ c)
Proof: Let F = {ω ∈ Ω : X(ω) ≥ c} = {X ≥ c}. Markov's inequality follows directly from the fact that
∫ g(X) dP ≥ ∫_F g(X) dP ≥ ∫_F g(c) dP = g(c)P(F)
Chebyshev's Inequality is a direct consequence of Markov's — cf. next exercise. □
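A quick numerical check of Markov's inequality with g(x) = x (the exponential distribution is an illustrative choice of a non–negative random variable):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.exponential(scale=1.0, size=10**6)   # X >= 0, E[X] = 1

    for c in [1.0, 2.0, 5.0]:
        # Markov: P(X >= c) <= E[X]/c.
        print(c, (X >= c).mean(), X.mean() / c)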
Next, we tackle Jensen's Inequality, whose importance is hard to overstate. This result states that if g : R → R is a convex function (roughly, a concave–up function), then E[g(X)] ≥ g(E[X]). We begin with a definition:
Remarks 4.7.4 The following remarks feature in the proof of Jensen’s inequality, the next propo-
sition, and should be digested thoroughly.
(a) Recall that if ~a, ~b are points in Rn , then {λ~a +(1−λ)~b : 0 ≤ λ ≤ 1} is simply the line segment in
Rn joining ~b to ~a. The point with coordinates (λx+(1−λ)y, g(λx+(1−λ)y)) is simply a point on
the graph of g between x and y. On the other hand, the point (λx + (1 − λ)y, λg(x) + (1 − λ)g(y))
is a point on the line segment joining (y, g(y)) to (x, g(x)). These two points have the same
x–coordinate, namely λx + (1 − λ)y. We can now interpret convexity geometrically: A function
g is convex if and only if its graph lies below any chord joining two points on the graph of g.
g(v) ≤ λg(u) + (1 − λ)g(w) = ((v − w)/(u − w)) g(u) + ((u − v)/(u − w)) g(w)
Hence
(u − v)g(v) + (v − w)g(v) ≥ (v − w)g(u) + (u − v)g(w)
Rearranging yields the result.
(ii) ∆(u, v) ≤ ∆(v, w), and thus ∆(u, v) is bounded from above as u ↑ v.
Since a sequence which is increasing and bounded from above must converge, D⁻(v) = lim_{u↑v} ∆(u, v) exists. Similar reasoning shows that D⁺(v) = lim_{w↓v} ∆(v, w) must exist for every v ∈ U. Thus left– and right derivatives exist at every point v. Moreover, D⁻(v) ≤ D⁺(v), because each ∆(u, v) is ≤ each ∆(v, w). If these limits are equal, then g is differentiable at v.
(e) A convex function is automatically continuous, and thus Borel measurable: For let v ∈ U. If there is a discontinuity at v, then it is easy to see that either lim_{u↑v} ∆(u, v) or lim_{w↓v} ∆(v, w) does not exist.
Proof: We use notation and results from Remarks 4.7.4. Let v ∈ U, and let D⁻(v) = lim_{u↑v} ∆(u, v) and D⁺(v) = lim_{w↓v} ∆(v, w). Then D⁻(v), D⁺(v) both exist, and D⁻(v) ≤ D⁺(v). Now suppose that m is a real number satisfying D⁻(v) ≤ m ≤ D⁺(v), and that x ∈ U. We consider two cases: If (i) x ≤ v, then ∆(x, v) ≤ D⁻(v) (since ∆(u, v) increases as u ↑ v) and thus ∆(x, v) ≤ m. It follows that g(x) ≥ m(x − v) + g(v). Next, if (ii) x ≥ v, then ∆(v, x) ≥ D⁺(v) (because ∆(v, w) decreases as w ↓ v) and thus ∆(v, x) ≥ m. It follows that g(x) ≥ m(x − v) + g(v). Hence, in either case, we have
g(x) ≥ m(x − v) + g(v) for all v, x ∈ U and D⁻(v) ≤ m ≤ D⁺(v)
Geometrically, this means that the graph of g lies above the graph of the line m(x − v) + g(v). Note that both the graph of g and that of the line go through the point (v, g(v)).
We are now ready to prove Jensen's inequality: Put v = EX. Then v ∈ U because X takes values almost surely in U. We thus have
g(X) ≥ m(X − EX) + g(EX) almost surely
If we now take expectations on both sides (i.e. if we integrate with respect to P on both sides), then
E[g(X)] ≥ m(EX − EX) + g(EX) = g(EX) □
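A standard instance worth keeping in mind: taking the convex function g(x) = x², Jensen's inequality gives E[X²] ≥ (E[X])², i.e. Var(X) = E[X²] − (E[X])² ≥ 0, recovering the non–negativity of the variance defined above.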
Chapter 5
The aim of this section is to construct, out of two measure spaces (S, S, µ), (T, T, ν), a new measure space (S × T, S ⊗ T, µ ⊗ ν) satisfying the following requirements:
(i) A subset of S × T is called a measurable rectangle if it has the form A × B, where A ∈ S, B ∈ T. S ⊗ T will be the σ–algebra on S × T which is generated by the measurable rectangles, i.e. it is the smallest σ–algebra on S × T which has all rectangles with measurable sides as members. (Please note that this is an abstract definition: the sets S, T do not have to be R, so A × B may look nothing like a rectangle in the plane.)
(ii) For each measurable rectangle A × B, we require that (µ ⊗ ν)(A × B) = µ(A) · ν(B)
Remarks 5.1.2 (a) A remark on notation: We will be working with functions of more than one variable, and may integrate with respect to just one of those variables. Thus, for example, ∫ f(x, y) µ(dx) integrates the function f(x, y) over x, keeping y fixed. The integral ∫∫ f(x, y) µ(dx) ν(dy) is a double integral that first integrates f w.r.t. µ over the variable x, and then integrates the function y 7→ ∫ f(x, y) µ(dx) w.r.t. ν over the variable y.
(b) Several times below, we will prove a result for finite measures, and then refer to a “standard argument” to lift the result to σ–finite measures. This is done as follows: Suppose that µ is σ–finite on (S, S), and that a result Φ has been proved to hold for finite measures. Since µ is σ–finite, there exists a sequence of measurable sets An ↑ S such that µAn < ∞ for all n ∈ N. The measures µn := I_{An} · µ are finite on (S, S), so that result Φ holds for the µn. By the MCT, if f ∈ mS⁺, then
µf = µ(limn f I_{An}) = limn µn f
Define the projection maps πS : S × T → S and πT : S × T → T by πS : (s, t) 7→ s and πT : (s, t) 7→ t.
The interpretation is as follows: (s, t) denotes a sample point in a space of “combined” outcomes:
i.e. s ∈ S occurred and t ∈ T occurred. Given such a combined outcome ω = (s, t), πS (ω) = s
measures which outcome occurred in S, and πT (ω) = t measures which outcome occurred in T .
Given that we know a combined outcome ω = (s, t), we should also know the component outcomes
s and t. Thus the projection mappings πS , πT should be measurable. The product σ–algebra
S ⊗ T is defined to be the smallest σ–algebra on S × T which makes these maps measurable. To
recapitulate:
Definition 5.1.3 Let (S, S) and (T, T ) be measurable spaces. Define projections πS :
S × T → S, πT : S × T → T by
πS : (s, t) 7→ s πT : (s, t) 7→ t
Then define S ⊗ T := σ(πS , πT ) to be the smallest σ–algebra for which both projections
are measurable.
Suppose that (S, S, µ) and (T, T , ν) are measure spaces. We would like to construct a measure
µ ⊗ ν on (S × T, S ⊗ T ). One way that suggests itself is to define
(1) (µ ⊗ ν)(B) := ∫∫ I_B(s, t) ν(dt) µ(ds), B ∈ S ⊗ T
Another is to define it as
(2) (µ ⊗ ν)(B) := ∫∫ I_B(s, t) µ(ds) ν(dt), B ∈ S ⊗ T
We shall soon see that (i) the above definitions are both possible, and (ii) they coincide.
We first investigate the possibility of defining µ ⊗ ν in the above manner. To be able to perform a double integral ∫∫ f(s, t) ν(dt) µ(ds) it is necessary that:
(i) for each s ∈ S, the map t 7→ f(s, t) must be T–measurable, so that we can calculate the inner integral ∫ f(s, t) ν(dt);
(ii) the map s 7→ F(s) := ∫ f(s, t) ν(dt) must be S–measurable, so that we can calculate the outer integral ∫ F(s) µ(ds).
Lemma 5.1.7 Suppose that (S, S) and (T, T) are measurable spaces, that µ is a σ–finite measure on (S, S), and that f : S × T → R⁺ is S ⊗ T–measurable. Then the analogues of conditions (i) and (ii) hold: each section t 7→ f(s, t) and s 7→ f(s, t) is measurable, and the map t 7→ ∫ f(s, t) µ(ds) is T–measurable.
Proof: We apply the Monotone Class Theorem (Theorem 2.4.4). First assume that µ is a finite measure, and let
H = {f ∈ m(S ⊗ T) : f is bounded and satisfies (i) and (ii)}
It is easy to verify that H is a vector space (we need the finiteness of µ in order to avoid expressions of the form ∞ − ∞), and that each I_{A×B} ∈ H, where A ∈ S, B ∈ T. By the MCT, H is closed under bounded limits of increasing non–negative sequences. Moreover, the set R := {A × B : A ∈ S, B ∈ T} is a π–system with the property that I_R ∈ H for every R ∈ R, and thus by Thm. 2.4.4 every bounded S ⊗ T–measurable function belongs to H (since σ(R) = S ⊗ T). Now each non–negative measurable function f is the limit of bounded non–negative measurable functions (f = limn f ∧ n), and thus another application of the MCT shows that every f ∈ m(S ⊗ T)⁺ satisfies (i) and (ii).
Now drop the assumption that µ is a finite measure. Because µ is σ–finite, we can choose An ↑ S such that µ(An) < ∞. The measures µn defined by dµn/dµ := I_{An} are finite measures, and thus each map t 7→ ∫ f(s, t) µn(ds) is T–measurable (where f ≥ 0). Since ∫ f(s, t) µ(ds) = limn ∫ f(s, t) µn(ds), the MCT implies that the result holds for µ. □
We now know that it is possible to define µ ⊗ ν in the ways indicated. What we don’t (yet)
know is that these constructions define a measure, and that they coincide.
For definiteness, we arbitrarily fix one of the above definitions:
Definition and Proposition 5.1.8 Suppose that (S, S, µ) and (T, T , ν) are σ–finite
measure spaces. Define a map µ ⊗ ν : S ⊗ T → R̄+ by
(µ ⊗ ν)(B) := ∫∫ I_B(s, t) ν(dt) µ(ds) = µ^s(ν^t I_B(s, t)), B ∈ S ⊗ T
Exercise 5.1.9 Prove the above result, i.e. show that µ ⊗ ν defines a σ–finite measure on (S ×
T, S ⊗ T ).
The next two results show that (modulo certain conditions) we can calculate the integral w.r.t.
µ ⊗ ν as a double integral, and the order of integration doesn’t matter:
Tonelli's Theorem: If f ∈ m(S ⊗ T)⁺, then
(∗) ∫ f d(µ ⊗ ν) = ∫∫ f(s, t) ν(dt) µ(ds) = ∫∫ f(s, t) µ(ds) ν(dt)
Proof: We use the Monotone Class Theorem (Thm. 2.4.4). First assume that µ, ν are finite measures. The result is obvious if f = I_{A×B}, where A × B is a measurable rectangle (or cf. Exercise 5.1.6). The class
H = {f ∈ m(S ⊗ T) : f is bounded and satisfies (∗)}
is easily seen to satisfy the requirements of Thm. 2.4.4, which thus implies that H contains every bounded S ⊗ T–measurable function. The result for arbitrary non–negative f follows by MCT.
A standard argument lifts the result to the case where µ, ν are merely σ–finite. □
As a by–product, we obtain the result that our two possible definitions of µ ⊗ ν as iterated
integrals coincide: If B ∈ S ⊗ T , then IB is a non–negative measurable function, and we may apply
Tonelli's Thm.
For non–negative functions f, the integral ∫ f dµ always makes sense, but we may have ∫ f dµ = ∞. For arbitrary measurable f, we therefore have to be more careful:
Here the map t 7→ µs f (s, t) belongs to L1 (T, T , ν) for ν–a.e. t ∈ T . Similarly, the map
s 7→ ν t f (s, t) belongs to L1 (S, S, µ) for µ–a.e. s ∈ S.
Proof: The result holds for |f|, by Tonelli's Thm., and hence N_S = {s ∈ S : ∫ |f(s, t)| ν(dt) = +∞} is µ–null by Proposition 4.1.9. Similarly, N_T = {t ∈ T : ∫ |f(s, t)| µ(ds) = +∞} is ν–null. Redefine f(s, t) to be zero when either s ∈ N_S or t ∈ N_T; this won't affect the integral of f, by Theorem 4.3.9. The result follows by splitting f into positive and negative parts. □
Remarks 5.1.12 (a) Fubini's Theorem allows the interchange of the order of integration, provided the integrand is integrable w.r.t. the product measure. It follows from Fubini's Theorem that
∫∫ f dν dµ = ∫∫ f dµ dν
(b) To check if f ∈ L¹(S × T, S ⊗ T, µ ⊗ ν), observe that Tonelli's Theorem applies to |f|. Thus if the double integral ∫∫ |f(s, t)| µ(ds) ν(dt) is finite, then f ∈ L¹(S × T, S ⊗ T, µ ⊗ ν), and Fubini's Theorem may be applied to f.
(c) Fubini's Theorem also easily extends to arbitrary finite products: If (Si, Si, µi) are σ–finite measure spaces for i = 1, . . . , n, then the integral of an integrable function w.r.t. µ1 ⊗ · · · ⊗ µn may be computed as an iterated integral, in any order of integration.
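The order-swap of (a)–(b) is easy to check numerically for a concrete integrable integrand. A Python sketch (f(s, t) = e^{−st} on [0, 1]², with Lebesgue measure in both coordinates, is an illustrative choice):

    import numpy as np

    ds = dt = 1e-3
    s = np.arange(0, 1, ds) + ds/2
    t = np.arange(0, 1, dt) + dt/2
    S, T = np.meshgrid(s, t, indexing='ij')
    f = np.exp(-S * T)                   # bounded, hence integrable on [0,1]^2

    t_then_s = ((f * dt).sum(axis=1) * ds).sum()   # integrate over t, then s
    s_then_t = ((f * ds).sum(axis=0) * dt).sum()   # integrate over s, then t
    print(t_then_s, s_then_t)            # agree up to grid error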
where r is the continuously compounded riskless rate — a market observable — and Q is a so–called risk–neutral measure, which is not directly observable. The method of Breeden–Litzenberger recovers the distribution function F(x) := Q(ST ≤ x) and the density function f of the asset price ST at maturity under the risk–neutral measure Q from market–observable prices of (liquid) vanilla European calls C(K) (provided that C(K) is known for a largish set of strikes K). This distribution can then be used to price more exotic European–style options, whose prices are not market–observable.
Show that the risk–neutral distribution and density functions of ST are given by:
F(y) = 1 + e^{+rT} ∂C(y)/∂y, f(y) = e^{+rT} ∂²C(y)/∂y²
[Hint: by (a), C(y) = e^{−rT} ∫₀^∞ Q((ST − y)⁺ > x) λ(dx) = e^{−rT} ∫_y^∞ (1 − F(x)) λ(dx).]
5.2 Independence
The covariance Cov(X, Y) and correlation ρ_{X,Y} between two random variables are defined by
Cov(X, Y) := E[(X − EX)(Y − EY)], ρ_{X,Y} := Cov(X, Y)/√(Var(X)Var(Y))
when these quantities exist. It is an important and well–known fact that independent random variables are uncorrelated (i.e. have ρ_{X,Y} = 0):
Proof: Recall that X, Y are independent if and only if σ(X), σ(Y ) are independent σ–algebras.
We use the standard machine, but need to pay special attention to measurability.
If X, Y are both indicator functions,
X = I_A, Y = I_B
then A ∈ σ(X) and B ∈ σ(Y) are independent events, so
E[XY] = ∫ I_{A∩B} dP = P(A ∩ B) = P(A)P(B) = E[X]E[Y]
If X = ∑j aj I_{Aj} and Y = ∑k bk I_{Bk} are non–negative simple random variables (where the aj are distinct, as are the bk), then each Aj ∈ σ(X) and each Bk ∈ σ(Y). Thus the Aj, Bk are independent, and it follows that
E[XY] = ∫ XY dP = ∑_{j,k} aj bk P(Aj ∩ Bk) = (∑j aj P(Aj)) (∑k bk P(Bk)) = E[X]E[Y]
again because P(Aj ∩ Bk) = P(Aj)P(Bk) for each j, k. If X, Y are non–negative random variables, then by Proposition 2.4.3 we may choose simple non–negative random variables Xn ↑ X, Yn ↑ Y, with each Xn measurable w.r.t. σ(X), and each Yn measurable w.r.t. σ(Y) (e.g. take Xn := ∑_{k=0}^{n2ⁿ−1} (k/2ⁿ) I_{{k/2ⁿ ≤ X < (k+1)/2ⁿ}} + n I_{{X≥n}} to be the usual staircase functions). In that case Xn, Yn are independent (being measurable with respect to independent σ–algebras) and Xn Yn ↑ XY. By the Monotone Convergence Theorem,
E[XY] = limn E[Xn Yn] = limn E[Xn]E[Yn] = E[X]E[Y]
Finally if X, Y ∈ L¹, first observe that E[|XY|] = E[|X|]E[|Y|] (because the result has been proved for the non–negative independent random variables |X|, |Y|), so that XY is integrable when X, Y are integrable. Now split X, Y up into their positive and negative parts and apply linearity. □
Remarks 5.2.2 Note that it is not true in general that if X, Y ∈ L1 , then XY ∈ L1 . However,
the above theorem shows that this is the case if X, Y are independent.
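A simulation sketch of the theorem, and of how the product rule fails without independence (the distributions below are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 10**6

    X = rng.standard_normal(n)
    Y = rng.exponential(size=n)                    # independent of X
    print((X * Y).mean(), X.mean() * Y.mean())     # both close to 0

    W = X**2                                       # W is certainly not independent of W
    print((W * W).mean(), W.mean()**2)             # about 3 versus about 1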
Actually, there is an easier proof of Propn. 3.3.5, if we adopt another point of view: If X, Y are random variables, then (X, Y) is a random vector, i.e. a map (X, Y) : (Ω, F, P) → (R², B(R²)). Its distribution is a probability measure on (R², B(R²)) given by µ_{X,Y}(B) = P{(X, Y) ∈ B}, where B ∈ B(R²). If µX, µY are the distributions of X, Y respectively, then the product measure µX ⊗ µY is another probability measure on (R², B(R²)). It turns out that X, Y are independent iff µ_{X,Y} = µX ⊗ µY:
Theorem 5.2.3 Let X, Y be random variables on (Ω, F, P). Let µX,Y , µX , µY be the
distributions of the random elements (X, Y ), X and Y . Then X, Y are independent iff
µX,Y = µX ⊗ µY .
Hence µ_{X,Y} and µX ⊗ µY agree on a π–system that generates B(R²) (the family of measurable rectangles), so by Proposition 1.6.4 they are equal: µ_{X,Y} = µX ⊗ µY.
Conversely, if µ_{X,Y} = µX ⊗ µY, then for x, y ∈ R we have
P(X ≤ x, Y ≤ y) = µ_{X,Y}((−∞, x] × (−∞, y]) = µX((−∞, x]) µY((−∞, y]) = P(X ≤ x)P(Y ≤ y)
so that X, Y are independent in the elementary sense of Exercise 3.3.4.
The norm ||v|| of a vector v ∈ V should be interpreted as its length, as in the following standard
examples.
Exercise 6.1.5 Suppose that [a, b] is a closed interval in R. Let C[a, b] be the set of all continuous
functions f : [a, b] → R.
(a) Show that C[a, b] is a vector space over the scalar field R, where the operations of addition and
scalar multiplication are defined pointwise.
If (V, || · ||) is a normed space, then we can define the distance d(v, w) between two elements
v, w ∈ V by
d(v, w) := ||v − w||
It is easy to see that the following holds:
(i) d(v, w) ≥ 0;
(ii) d(v, w) = 0 iff v = w;
(iii) d(v, w) = d(w, v);
(iv) d(u, w) ≤ d(u, v) + d(v, w) for all u, v, w ∈ V.
The properties (i)–(iv) capture the notion of the distance between two points.
Proposition 6.1.8 Every norm induces a metric or distance: If (V, || · ||) is a normed space, then (V, d) is a metric space, where d : V × V → R is defined by d(v, w) := ||v − w||, i.e. d satisfies (i)–(iv) above.
Once we have a metric, i.e. a concept of distance, we can use it to talk about limits and
convergence:
Definition 6.1.9 Suppose that (X, d) is a metric space, that ⟨xn⟩ is a sequence in X, and that x ∈ X. We say that
xn → x, or that limn xn = x, iff d(xn, x) → 0 as n → ∞.
In the above definition, note that hd(xn , x)i is a sequence of real numbers, so that we already
know what d(xn , x) → 0 means — see Appendix B.1:
Definition 6.1.10 (i) A sequence ⟨xn⟩n in a metric space (X, d) is called a Cauchy sequence iff ∀ε > 0 ∃N ∈ N ∀n, m ≥ N [d(xn, xm) < ε], i.e. from some point N onwards any two terms of the sequence (not necessarily successive) lie within a distance of ε of each other.
(ii) A metric space (X, d) is said to be complete if every Cauchy sequence in X converges
(to a point in X).
(iii) A normed vector space which is complete (w.r.t. the metric induced by the norm) is
called a Banach space.
In Appendix B.4, it is proved that every Cauchy sequence in R converges. It follows easily that Rⁿ, equipped with the usual Euclidean norm, is a Banach space.
Definition 6.1.11 An inner product space is a pair (V, ⟨·, ·⟩), where V is a vector space over R and ⟨·, ·⟩ is an inner product on V, i.e. a function ⟨·, ·⟩ : V × V → R with the following properties:
(i) ⟨x, x⟩ ≥ 0, with ⟨x, x⟩ = 0 iff x = 0;
(ii) ⟨x, y⟩ = ⟨y, x⟩;
(iii) ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩
for all x, y, z ∈ V and α, β ∈ R.
Rⁿ is an inner product space when equipped with the usual dot product:
⟨x, y⟩ := x · y = ∑_{j=1}^n xj yj
Theorem 6.1.12 (Cauchy–Schwarz Inequality¹) If (V, ⟨·, ·⟩) is an inner product space, then
|⟨x, y⟩| ≤ √⟨x, x⟩ √⟨y, y⟩ for all x, y ∈ V.
Moreover we have equality iff y is a scalar multiple of x.
¹Also called the Cauchy–Bunyakovskii–Schwarz Inequality
Proof: Let (V, ⟨·, ·⟩) be an inner product space. Note that for α ∈ R and x, y ∈ V we have
0 ≤ ⟨αx − y, αx − y⟩ = α²⟨x, x⟩ − α⟨x, y⟩ − α⟨y, x⟩ + ⟨y, y⟩ = α²⟨x, x⟩ − 2α⟨x, y⟩ + ⟨y, y⟩
Now, with x, y held fixed, consider the righthand side of the above inequality as a quadratic polynomial in α. Since it is always non–negative, it can have at most one root, and thus its discriminant is ≤ 0, i.e.
⟨x, y⟩² − ⟨x, x⟩⟨y, y⟩ ≤ 0
which yields the result upon taking square roots.
We have ⟨x, y⟩² − ⟨x, x⟩⟨y, y⟩ = 0 only when this quadratic has a root, i.e. if there is α such that ⟨αx − y, αx − y⟩ = 0. But then αx = y, i.e. y is a scalar multiple of x. □
Proposition 6.1.13 If (V, ⟨·, ·⟩) is an inner product space, then the map || · || : V → R given by
||x|| := √⟨x, x⟩
defines a norm on V.
Proof: We only verify the triangle inequality. Note that the Cauchy–Schwarz inequality states that |⟨x, y⟩| ≤ ||x|| ||y||. Thus
||x + y||² = ⟨x + y, x + y⟩ = ⟨x, x⟩ + 2⟨x, y⟩ + ⟨y, y⟩ = ||x||² + 2⟨x, y⟩ + ||y||² ≤ ||x||² + 2||x|| ||y|| + ||y||² = (||x|| + ||y||)² □
Thus an inner product induces a norm, and a norm induces a metric. An inner product space
which is complete (w.r.t. the metric induced by the inner product) is called a Hilbert space. In
particular, every Hilbert space is also a Banach space.
In Rn , the dot product does not only induce a length; it also induces an angle: The angle θ
between two vectors x, y ∈ Rn is given by
cos θ = (x · y) / (||x|| ||y||)
We can imitate this definition in an abstract inner product space (V, h·, ·i), and define the angle
between x, y ∈ V by
cos θ := ⟨x, y⟩ / (||x|| ||y||), where ||x|| := √⟨x, x⟩
By the Cauchy–Schwarz inequality it follows immediately that | cos θ| ≤ 1, so that this definition
makes sense. It also follows that | cos θ| = 1 if and only if x is a scalar multiple of y, i.e. iff x, y are
parallel. We can also define orthogonality in an abstract inner product space, in the obvious way:
Definition 6.1.14 Suppose that (V, h·, ·i) is an inner product space. We say that x, y ∈ V
are orthogonal, and write x ⊥ y, if and only if hx, yi = 0.
If G ⊆ V , we say that x ⊥ G iff ∀g ∈ G(x ⊥ g).
Proposition 6.1.15 Let V be an inner product space with induced norm || · ||.
(a) (Pythagoras) If v, w ∈ V and v ⊥ w, then ||v + w||² = ||v||² + ||w||².
(b) (Parallelogram Law) If v, w ∈ V, then ||v + w||² + ||v − w||² = 2||v||² + 2||w||².
Now suppose that W is a linear subspace of V and that v0 ∈ V. We seek a vector w0 with the
following properties:
(i) w0 ∈ W,
(ii) ||v0 − w0|| = inf{||v0 − w|| : w ∈ W}, i.e. w0 is the vector in W that lies closest to v0, and
(iii) v0 − w0 ⊥ W.
The vector w0 satisfying (i)–(iii) is called the orthogonal projection of v0 onto W. Indeed, v0 =
w0 + (v0 − w0) decomposes v0 into a vector in W and a vector orthogonal to W.
It remains to show that orthogonal projections exist and are unique. To be able to do that, we
need an additional condition: We say that a linear subspace W of a Hilbert space V is closed if W
is itself a Hilbert space. This simply means that if hwn i is a sequence in W and wn → v for some
v ∈ V , then v ∈ W , i.e. that if a sequence in W converges, then the limit lies in W . Any linear
subspace of Euclidean space is automatically closed.
Proposition 6.1.16 Let V be a Hilbert space, and let W be a closed linear subspace of
V. Any v0 ∈ V has a unique decomposition
v0 = v0∥ + v0⊥ where v0∥ ∈ W, v0⊥ ⊥ W
v0∥ is called the orthogonal projection of v0 onto W.
Moreover, v0∥ is the best approximation to v0 in W, i.e. the vector in W which lies closest
to v0:
||v0 − v0∥|| = inf{||v0 − w|| : w ∈ W}
Proof: Uniqueness: Suppose that v0 = v0∥ + v0⊥ = u0∥ + u0⊥,
where v0∥, u0∥ ∈ W and v0⊥, u0⊥ ⊥ W. Then
v0∥ − u0∥ = u0⊥ − v0⊥ =: x
is a vector with the properties that x ∈ W and that x ⊥ W. This implies that x ⊥ x, i.e. that
⟨x, x⟩ = 0. Hence x = 0, and so v0∥ = u0∥, v0⊥ = u0⊥.
Existence: Let δ = inf{||v0 −w|| : w ∈ W }, and choose a sequence wn ∈ W such that ||v0 −wn || → δ.
We show that hwn in is a Cauchy sequence in W : for if ε > 0, we may choose N such that
||v0 − wn||² − δ² < ε whenever n ≥ N. By the Parallelogram Law it follows that if n, m ≥ N, then
2ε + 2δ² > ||v0 − wn||² + ||v0 − wm||² = 2||v0 − ½(wn + wm)||² + 2||½(wn − wm)||² ≥ 2δ² + ½||wn − wm||²
since ½(wn + wm) ∈ W, and thus ||wn − wm||² < 4ε for n, m ≥ N.
Since ⟨wn⟩n is a Cauchy sequence, and since W is closed, there is w0 ∈ W such that wn → w0. We
will show that w0 = v0∥. The fact that ||v0 − w0|| ≤ ||v0 − wn|| + ||wn − w0|| (for all n ∈ N) is then
easily seen to imply that ||v0 − w0|| = δ. In particular, ||v0 − w0|| = inf{||v0 − w|| : w ∈ W}.
It remains to show that v0 − w0 ⊥ W. Given an arbitrary w ∈ W and a real λ ∈ R, we have
||v0 − w0||² = δ² ≤ ||v0 − (w0 + λw)||², so that
−2λ⟨v0 − w0, w⟩ + λ²||w||² ≥ 0
Since this holds for all real λ, we must have ⟨v0 − w0, w⟩ = 0. (Another way to see this is to note
that the quadratic in λ has a unique root at λ = 0, and to calculate the discriminant.)
a
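In a finite-dimensional Hilbert space such as Rn every subspace is closed, and the projection can be computed explicitly. The following sketch (using numpy's least-squares solver; the matrix A and vector v0 are arbitrary illustrative choices) checks the three defining properties numerically.

```python
# Orthogonal projection of v0 onto W = column span of A, in the Hilbert space R^3.
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])          # W = span of the columns of A
v0 = np.array([1.0, 2.0, 4.0])

# Solving the least-squares problem min_c ||v0 - A c|| gives the projection.
c, *_ = np.linalg.lstsq(A, v0, rcond=None)
v0_par = A @ c                       # v0_par in W
v0_perp = v0 - v0_par                # v0_perp should be orthogonal to W

print(A.T @ v0_perp)                 # ~ [0, 0]: orthogonal to every column of A
# Best-approximation property: no other w in W is closer to v0.
rng = np.random.default_rng(0)
for _ in range(5):
    w = A @ rng.normal(size=2)
    assert np.linalg.norm(v0 - v0_par) <= np.linalg.norm(v0 - w) + 1e-12
```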
6.2 Lp Spaces
Definition 6.2.1 Suppose that (Ω, F, µ) is a measure space. For 1 ≤ p < ∞, let
Lp(Ω, F, µ) := {f : Ω → R : f is F–measurable and ∫ |f|^p dµ < ∞}, ||f||p := (∫ |f|^p dµ)^{1/p}
For p = ∞, let L∞(Ω, F, µ) be the family of essentially bounded measurable functions, i.e. those f
for which ||f||∞ := inf{M ≥ 0 : |f| ≤ M µ–a.e.} < ∞.
If the underlying measure space is understood from context, we shall write Lp instead of
Lp(Ω, F, µ).
Remarks 6.2.2 • Note that L1 is just the family of µ–integrable functions (cf. Propn. 4.1.7).
• Also, if f : Ω → R is F–measurable, then f ∈ Lp iff |f|^p ∈ L1 iff ||f||p < ∞.
• If f ∈ L∞, then |f| ≤ ||f||∞ µ–a.e. To see this, note that if M > ||f||∞, then |f| ≤ M µ–a.e.,
and hence {ω ∈ Ω : |f(ω)| > M} is a µ–null set. But then
{ω ∈ Ω : |f(ω)| > ||f||∞} = ⋃_n {ω ∈ Ω : |f(ω)| > ||f||∞ + 1/n}
is a countable union of µ–null sets, and hence itself µ–null.
In probability theory, the spaces L1 (Ω, F, P) and L2 (Ω, F, P) are the spaces that occur most fre-
quently, for reasons that will become apparent in Section 6.3.
We shall show that the Lp spaces are almost Banach spaces, and that L2 is almost a Hilbert
space.
Proposition 6.2.3 Lp(Ω, F, µ) is a vector space for 1 ≤ p ≤ ∞.
Proof: We show that Lp is closed under addition, and leave scalar multiplication as an easy
exercise.
If 1 ≤ p < ∞, and f, g ∈ Lp, then |f + g|^p ≤ (|f| + |g|)^p ≤ max{2|f|, 2|g|}^p ≤ 2^p (|f|^p + |g|^p).
Hence if f, g ∈ Lp, then ∫ |f + g|^p dµ ≤ 2^p (∫ |f|^p dµ + ∫ |g|^p dµ) < ∞, i.e. ||f + g||_p^p ≤
2^p (||f||_p^p + ||g||_p^p) < ∞.
If p = ∞, and f, g ∈ L∞ , then clearly ||f + g||∞ ≤ ||f ||∞ + ||g||∞ < ∞. a
For the next theorem, note that if a, b ≥ 0, and if 1 < p, q < ∞ are such that p−1 + q −1 = 1,
then
ab ≤ a^p/p + b^q/q
To see this, define h(t) = tb − t^p/p, and find the maximum of h (using differential calculus).
Alternatively, apply the Arithmetic–Geometric Mean inequality.
Theorem 6.2.4 Let (Ω, F, µ) be a measure space and let f, g be real–valued F–measurable
functions.
(a) (Hölder’s Inequality) Suppose that 1 ≤ p ≤ ∞ and that 1/p + 1/q = 1. If f ∈ Lp, and
g ∈ Lq, then fg ∈ L1, and
||fg||1 ≤ ||f||p ||g||q
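Hölder's inequality is easy to check empirically. The sketch below approximates the integrals on ([0, 1], Lebesgue) by Monte Carlo averages; the functions f, g, the exponent p and the sample size are arbitrary illustrative choices.

```python
# Monte Carlo sanity check of Hoelder's inequality ||fg||_1 <= ||f||_p ||g||_q
# on [0,1] with Lebesgue measure, using averages of N uniform samples.
import random

random.seed(1)
N = 100_000
p = 3.0
q = p / (p - 1)                     # conjugate exponent: 1/p + 1/q = 1

f = lambda x: x ** 0.5
g = lambda x: 1.0 + x

xs = [random.random() for _ in range(N)]
lhs = sum(abs(f(x) * g(x)) for x in xs) / N
norm_p = (sum(abs(f(x)) ** p for x in xs) / N) ** (1 / p)
norm_q = (sum(abs(g(x)) ** q for x in xs) / N) ** (1 / q)

print(lhs, norm_p * norm_q)         # up to sampling error, lhs is the smaller number
```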
Definition and Proposition 6.2.5 Let (Ω, F, µ) be a measure space, and let 1 ≤ p ≤
∞. Define a relation ≡ on Lp by f ≡ g iff f = g µ–a.e. Then ≡ is an equivalence relation^a
on Lp. Define [f] := {g ∈ Lp : f ≡ g}. Then [0] = {g ∈ Lp : g = 0 µ–a.e.} is a vector
subspace of Lp, and [f] = f + [0] := {f + g : g ∈ [0]}. Let
Lp = Lp(Ω, F, µ) := {[f] : f ∈ Lp}
Then Lp is a vector space and the map, which by abuse of notation we also call || · ||p,
which is defined by
||[f]||p := ||f||p
is a norm on Lp.
a
See Appendix A.4 for a discussion of equivalence relations.
Proof: That ≡ is an equivalence relation is straightforward, as is the statement that [0] is a vector
subspace of Lp . It is also easy to see that Lp is a vector space, if the operations are defined in
the natural way (e.g. [f ] + [g] := [f + g] — one must check that this is well–defined, i.e. that
if [f1] = [f2] and [g1] = [g2], then [f1 + g1] = [f2 + g2], but that is easy.) [0] is clearly the zero
vector in Lp. Also, if [f1] = [f2], then f1 = f2 µ–a.e., and thus ∫ |f1|^p dµ = ∫ |f2|^p dµ, which shows
that ||f1||p = ||f2||p (in case p < ∞), and thus that || · ||p is well–defined on Lp. To see that it is
a norm, note that (i) ||[f ]||p = ||f ||p ≥ 0, and that ||[f ]||p = 0 iff f = 0 µ–a.e. iff [f ] = [0]; (ii)
||α[f ]||p = ||αf ||p = |α| ||[f ]||p , and (iii) ||[f ] + [g]||p = ||f + g||p ≤ ||[f ]||p + ||[g]||p .
In case p = ∞, it is also straightforward to see that || · ||∞ is a well–defined norm on L∞ . a
In practice, we usually don’t bother too much about the distinction between Lp (a space of functions) and Lp (the corresponding space of equivalence classes).
Now that we know that || · ||p is a norm, we have a notion of convergence:
We now have two notions of convergence for measurable functions: almost everywhere convergence,
and convergence in mean. We write
fn →a.e. f and fn →Lp f
respectively. We will introduce a third notion of convergence in Section 6.4, and discuss the relationships between
these various notions.
Theorem (Riesz–Fischer) Let (Ω, F, µ) be a measure space and 1 ≤ p ≤ ∞. Then Lp(Ω, F, µ)
is a Banach space.
Proof: Suppose that ⟨fn⟩n is a Cauchy sequence in Lp, i.e. that sup_{m≥n} ||fm − fn||p → 0 as n → ∞
(see Remark B.4.2(b)).
First assume that p < ∞. For k ∈ N, choose an increasing sequence ⟨nk⟩k of natural numbers such
that sup_{m≥nk} ||fm − fnk||p < 2^{−k}. Then by the Monotone Convergence Theorem
|| Σ_k |f_{nk+1} − f_{nk}| ||p ≤ Σ_k ||f_{nk+1} − f_{nk}||p < ∞
By Proposition 4.1.9, we see that Σ_k |f_{nk+1} − f_{nk}| < ∞ µ–a.e., from which it follows easily
that ⟨fnk⟩k is a Cauchy sequence µ–a.e. Define f : Ω → R by f(ω) = lim_k fnk(ω), if this limit
exists, and f(ω) = 0 else. Then f is measurable, and fnk → f µ–a.e. as k → ∞. Now by Fatou’s
Lemma,
||f||p ≤ lim inf_k ||fnk||p < ∞ (because Cauchy sequences are bounded — cf. Lemma B.4.5), so that f ∈ Lp, and similarly
||f − fn||p ≤ lim inf_k ||fnk − fn||p ≤ sup_{m≥n} ||fm − fn||p → 0 as n → ∞
Thus fn →Lp f.
Next, assume that p = +∞. We have sup_{m≥n} |fm − fn| ≤ sup_{m≥n} ||fm − fn||∞ µ–a.e., and thus
⟨fn⟩n is a Cauchy sequence µ–a.e. Define f : Ω → R as above: f(ω) = lim_n fn(ω) if this limit
exists, and f(ω) = 0 otherwise. Then
|f| ≤ |fn| + |fn − f| = |fn| + lim_m |fn − fm| ≤ ||fn||∞ + sup_{m≥n} ||fn − fm||∞ µ–a.e.
so that f ∈ L∞, and ||fn − f||∞ ≤ sup_{m≥n} ||fn − fm||∞ → 0 as n → ∞, proving that fn →L∞ f. a
Proposition The map
⟨f, g⟩ := ∫ fg dµ
is an inner product on L2(Ω, F, µ) which induces the L2–norm || · ||2. Hence L2(Ω, F, µ)
is a Hilbert space.
Proof: Suppose that f, g ∈ L2(Ω, F, µ). By Hölder’s inequality (or the Cauchy–Schwarz inequality),
we have ||fg||1 ≤ ||f||2 ||g||2, so fg is integrable. It is now easy to see that ⟨f, g⟩ := ∫ fg dµ
defines an inner product on L2 (where we identify functions that are µ–a.e. equal to ensure ⟨f, f⟩ = 0
implies f = 0). Furthermore, the norm induced by this inner product is precisely
||f|| := ⟨f, f⟩^{1/2} = (∫ |f|² dµ)^{1/2} = ||f||2
Thus L2 (Ω, F, µ) is an inner product space. Since it is also a Banach space by the Riesz–Fischer
Theorem, it is a complete inner product space, i.e. a Hilbert space. a
6.3.1 Moments
Definition 6.3.1 (i) If X is a random variable on (Ω, F, P), then its pth moment is
defined to be E[X^p] (which exists iff X ∈ Lp(Ω, F, P)).
If p = ∞, then Lp(Ω, F, P) consists of all essentially bounded random variables. Hölder’s inequality
is simply the statement
E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}
where X ∈ Lp, Y ∈ Lq and 1/p + 1/q = 1.
For probability theory, the following result is also useful:
Proposition 6.3.2 Let X be a random variable on (Ω, F, P), and suppose that 1 ≤ p ≤ r ≤ ∞.
If X ∈ Lr, then X ∈ Lp, and
||X||p ≤ ||X||r
Moreover, if X ∈ L∞, then ||X||∞ = lim_{p→∞} ||X||p.
Proof: Note that if X ∈ Lr, then |X|^p ∈ L^{r/p}. Now p′ = r/p and q′ = r/(r − p) satisfy the relation
1/p′ + 1/q′ = 1, and so Hölder’s inequality applied to f = |X|^p and g = 1 yields
||X||_p^p = ∫ fg dP ≤ ||f||_{p′} · ||g||_{q′} = (∫ |X|^r dP)^{p/r} · 1 = ||X||_r^p
Next, suppose that X ∈ L∞. Then ||X||p ≤ ||X||∞, so lim sup_p ||X||p ≤ ||X||∞.
If M < ||X||∞, then ∫ |X|^p dP ≥ M^p P(|X| > M) and so ||X||p ≥ M P(|X| > M)^{1/p}. Now
P(|X| > M) > 0, because M < ||X||∞, and thus lim inf_p ||X||p ≥ M (because P(|X| > M)^{1/p} → 1).
Since M was arbitrary, also lim inf_p ||X||p ≥ ||X||∞.
a
Propn. 6.3.2 states that if 1 ≤ p ≤ r ≤ ∞, and if the rth absolute moment of X exists, then so
does the pth absolute moment. Thus it is not possible, for example, for a random variable to have
a variance, but no mean.
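The monotone growth of the p-norms (and their approach to the essential sup) is easy to see on a sample. The distribution below is an arbitrary illustrative choice.

```python
# Empirical p-norms of a bounded random variable X ~ Uniform(0, 2) on a
# probability space: they increase with p and approach ||X||_inf = 2.
import random

random.seed(2)
sample = [random.uniform(0, 2) for _ in range(100_000)]

for p in [1, 2, 4, 8, 16, 32]:
    norm_p = (sum(x ** p for x in sample) / len(sample)) ** (1 / p)
    print(p, round(norm_p, 4))      # increasing in p, tending towards 2
```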
For centered (mean–zero) random variables X, Y ∈ L2(Ω, F, P), we have
ρ_{X,Y} = Cov(X, Y)/(σX σY) = ⟨X, Y⟩/(||X||2 ||Y||2) = cos θ
i.e. the correlation between two centered random variables can be interpreted as the cosine of angle
between them in L2 . In particular, two centered random variables are uncorrelated iff they are
orthogonal.
The Cauchy–Schwarz inequality is simply the statement that |E(XY )| ≤ ||X||2 ||Y ||2 , with
equality only if Y is a scalar multiple of X. If X, Y have zero mean, this amounts to saying
|Cov(X, Y )| ≤ σX σY , i.e. −1 ≤ ρX,Y ≤ 1, with equality only if Y is a scalar multiple of X. Since
ρX,Y = cos θ, this is not surprising.
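The identification of correlation with a cosine can be checked on sampled data: after centering, the sample correlation is exactly the cosine of the angle between the two samples viewed as vectors. The joint distribution below is an arbitrary illustrative choice.

```python
# Correlation as a cosine: for centered samples x, y in R^N, the empirical
# correlation equals cos(theta) for the Euclidean angle between them.
import math
import random

random.seed(3)
N = 50_000
x = [random.gauss(0, 1) for _ in range(N)]
y = [0.6 * xi + 0.8 * random.gauss(0, 1) for xi in x]   # true correlation 0.6

mx, my = sum(x) / N, sum(y) / N
xc = [xi - mx for xi in x]                              # centre the samples
yc = [yi - my for yi in y]

dot = sum(a * b for a, b in zip(xc, yc))
cos_theta = dot / (math.sqrt(sum(a * a for a in xc))
                   * math.sqrt(sum(b * b for b in yc)))
print(cos_theta)    # ~ 0.6: simultaneously the correlation and cos(theta)
```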
Moreover, the mean E[X] of an L2–random variable can be given a geometric interpretation:
The mean of X is that real number c∗ which lies closest to X in L2, i.e.
E[(X − c∗)²] = min_{c∈R} E[(X − c)²]
To see this, define f : R → R : c ↦ E[(X − c)²], where E[(X − c)²] = ||X − c||²₂. To find the value
of c where f has a minimum, we solve f′(c) = 0 to obtain E[2(X − c)] = 0, and conclude that the
minimum occurs at c∗ = E[X]. This theme will be taken up again in the next chapter, when we
discuss conditional expectation.
The next exercise shows that we have a similar L1 –interpretation for the median of a random
variable.
Exercise 6.3.3 If X is a random variable, then a median (there may be more than one) of X is a
real number c∗ so that P(X ≤ c∗) = 1/2.
Assume that X has a density function hX(x). Show that if we define f(c) := E[|X − c|], then
f(c) = ∫_{−∞}^{c} (c − x)hX(x) dx + ∫_{c}^{∞} (x − c)hX(x) dx
     = 2 ∫_{−∞}^{c} (c − x)hX(x) dx + ∫_{−∞}^{∞} (x − c)hX(x) dx
Differentiating, conclude that f is minimized at any c∗ satisfying
P(X ≤ c∗) = ∫_{−∞}^{c∗} hX(x) dx = 1/2
Definition 6.4.1 Let X and X1, X2, . . . be random variables on a probability space (Ω, F, P).
(a) Xn converges to X almost surely, written Xn → X a.s., provided that the set
{ω ∈ Ω : Xn(ω) ↛ X(ω)} is P–null, i.e.
P(Xn → X) = 1
(b) Xn converges to X in probability, written Xn →P X, provided that
lim_n P(|Xn − X| > ε) = 0 for every ε > 0
(c) Xn converges to X in Lp, or in pth mean, written Xn →Lp X, iff each Xn ∈ Lp(Ω, F, P),
and ||Xn − X||p → 0, i.e. iff E|Xn|^p < +∞ for all n and
E|Xn − X|^p → 0 as n → +∞
Remarks 6.4.2 (a) Many of the above notions can be extended to random vectors. For example,
if Xn, X : (Ω, F, P) → (R^d, B(R^d)), we say that Xn →P X iff lim_n P(||Xn − X|| > ε) = 0 for all
ε > 0. Proofs of the results below can easily be extended to the more general case.
(b) Just a note of caution: The limits in the above notions of convergence are not unique, but
unique only up to a.s.–equality. Thus, for example, if Xn →P X and X = Y a.s., then also
Xn →P Y. The same goes for a.s. convergence and convergence in Lp.
Lemma 6.4.3 Suppose that f : R+ → R+ is an increasing bounded continuous function with the
properties that f (0) = 0, and that f (x) > 0 whenever x > 0. Then:
Xn →P X iff Ef(|Xn − X|) → 0
Proof: (⇒): Let K be a bound for f , and let ε > 0. Choose δ such that 0 ≤ f (x) < ε whenever
0 ≤ x ≤ δ. Then
Ef(|Xn − X|) = E[f(|Xn − X|); |Xn − X| > δ] + E[f(|Xn − X|); |Xn − X| ≤ δ]
≤ K P(|Xn − X| > δ) + ε P(|Xn − X| ≤ δ)
Since by assumption P(|Xn − X| > δ) → 0 as n → ∞, we obtain lim sup_n Ef(|Xn − X|) ≤ ε. Since ε
was arbitrary, lim_n Ef(|Xn − X|) = 0.
(⇐): Let ε > 0. Since f is increasing and f(ε) > 0, we have f(ε) I_{|Xn−X|>ε} ≤ f(|Xn − X|),
and so
0 ≤ f(ε) lim sup_n P(|Xn − X| > ε) ≤ lim_n Ef(|Xn − X|) = 0
Hence P(|Xn − X| > ε) → 0 for every ε > 0, i.e. Xn →P X. a
Proposition 6.4.4 If p ≥ 1, then Xn →P X iff E[|Xn − X|^p ∧ 1] → 0.
Proposition 6.4.5 Let ⟨Xn⟩n be random variables with Σ_n E|Xn| < ∞. Then Σ_n Xn converges
a.s., and E[Σ_n Xn] = Σ_n E[Xn].
Proof: Let Tn := Σ_{k=1}^n Xk, Sn := Σ_{k=1}^n |Xk| and S := Σ_{n=1}^∞ |Xn|. By the Monotone
Convergence Theorem, E[S] = Σ_n E|Xn| < ∞. Hence S is integrable. It follows that P(S = ∞) = 0
(by Proposition 4.1.9), so that for almost all ω ∈ Ω the series Σ_{n=1}^∞ Xn(ω) is absolutely
convergent, and hence convergent. Thus T := Σ_{k=1}^∞ Xk = lim_n Tn converges a.s.
Now since |Tn | ≤ Sn ≤ S, and S is integrable, the Dominated Convergence Theorem ensures
that
E[Σ_{n=1}^∞ Xn] = E[lim_n Tn] = lim_n E[Tn] = lim_n Σ_{k=1}^n E[Xk] = Σ_{n=1}^∞ E[Xn]
Proposition 6.4.6 Xn →P X iff every subsequence ⟨Xnk⟩k has a subsubsequence ⟨Xnkj⟩j
such that Xnkj → X a.s.
In particular, if Xn → X a.s., then Xn →P X.
Proof: (⇒): Fix a subsequence ⟨Xnk⟩k of ⟨Xn⟩n. Then certainly also Xnk →P X, so we may choose
a subsubsequence ⟨Xnkj⟩j such that
E[|Xnkj − X| ∧ 1] < 2^{−j}
Then Σ_{j=1}^∞ E[|Xnkj − X| ∧ 1] < ∞. Thus by Proposition 6.4.5 we have that Σ_j |Xnkj − X| ∧ 1
converges a.s. Now if a series converges, the terms of the series must converge to zero, so |Xnkj −
X| → 0 a.s., i.e. Xnkj → X a.s. as j → ∞.
(⇐): We use Propn. 6.4.4 with p = 1. Suppose that Xn does not converge to X in probability,
i.e. that lim sup_n E[|Xn − X| ∧ 1] ≥ ε for some ε > 0. Choose a subsequence ⟨Xnk⟩k such that
E[|Xnk − X| ∧ 1] > ε for all k. If ⟨Xnkj⟩j
is any subsubsequence, we cannot have Xnkj → X a.s., for then |Xnkj − X| ∧ 1 → 0 a.s., and hence
E[|Xnkj − X| ∧ 1] → 0 by the Dominated Convergence Theorem. a
Corollary 6.4.7 If Xn →P X and f : R → R is a.s. continuous at X, then also f(Xn) →P
f(X).
Proof: Given a subsequence ⟨Xnk⟩k, choose a subsubsequence such that Xnkj → X a.s. Then
f(Xnkj) → f(X) a.s., because f is a.s. continuous at X. By Propn. 6.4.6, f(Xn) →P f(X). a
Proposition 6.4.8 If Xn →Lp X for p ≥ 1, then Xn →P X.
Proof: Since |Xn − X|^p ∧ 1 ≤ |Xn − X|^p, we see that E[|Xn − X|^p ∧ 1] → 0 whenever E|Xn − X|^p → 0.
By Propn. 6.4.4, we see that Xn →P X whenever Xn →Lp X. a
Remarks 6.4.9 So far we have seen that a.s.–convergence and Lp –convergence both imply con-
vergence in probability, and that convergence in probability implies weak convergence. Without
further assumptions, this is the best we can do:
(a) Convergence in Lp does not imply almost sure convergence: On the unit interval with Lebesgue
measure, consider the following sequence of intervals:
[0, 1], [0, 1/2], [1/2, 1], [0, 1/3], [1/3, 2/3], [2/3, 1], [0, 1/4], [1/4, 1/2], [1/2, 3/4], [3/4, 1], . . .
Let Xn be the indicator function of the nth interval on this list. It is clear that for sufficiently
large n, ∫ Xn^p dλ is arbitrarily small, and thus that Xn converges to 0 in Lp for all p ≥ 1. How-
ever, Xn does not converge pointwise anywhere: For any x, there are infinitely many intervals
on the list which contain x, and infinitely many which don’t. It follows that lim sup_n Xn = 1,
lim inf_n Xn = 0, i.e. that Xn is nowhere convergent.
(b) Almost sure convergence does not imply convergence in Lp: On the unit interval with Lebesgue
measure, let Xn = n I_{[0, n^{−p}]}. Then Xn → 0 a.s., but E|Xn − 0|^p = 1 for all n ∈ N, so Xn does
not converge to 0 in Lp mean.
(c) Convergence in probability does not imply convergence in Lp or almost sure convergence: This
follows from the previous two examples. For instance, (a) is an example of a sequence which
converges in mean, and thus in probability, but not almost surely.
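The "typewriter" sequence of example (a) is easy to simulate; the sketch below (an illustration only, with an arbitrary cut-off of 12 levels) exhibits both phenomena: vanishing integrals, but infinitely many hits at a fixed point.

```python
# The typewriter sequence: indicators of [0,1], [0,1/2], [1/2,1], [0,1/3], ...
# L^1 norms tend to 0, yet any fixed x is covered infinitely often.
from fractions import Fraction

def intervals(max_level):
    """Yield [j/k, (j+1)/k] for k = 1..max_level, sweeping across [0,1]."""
    for k in range(1, max_level + 1):
        for j in range(k):
            yield (Fraction(j, k), Fraction(j + 1, k))

x = Fraction(1, 3)
hits, norms = [], []
for n, (a, b) in enumerate(intervals(12), start=1):
    norms.append(float(b - a))          # integral of the indicator = its length
    if a <= x <= b:
        hits.append(n)

print(norms[-3:])   # lengths ~ 1/12: the L^1 (and L^p) norms tend to 0
print(hits)         # X_n(1/3) = 1 at least once per level k: infinitely often
```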
Exercise 6.4.10 (a) We show that rapid convergence in probability implies a.s. convergence.
Suppose that hXn in is a sequence of random variables. Say that hXn in converges to X rapidly
in probability if and only if
Σ_{n=1}^∞ P(|Xn − X| > ε) < ∞ for every ε > 0
Show that, in that case, Xn → X a.s.
(b) Use (a) to prove that if Xn →P X, then ⟨Xn⟩n has a subsequence ⟨Xnk⟩k such that Xnk → X a.s.
[Hint: (a) Suppose that Xn does not converge to X a.s. Explain why there is ε > 0 such that P(|Xn − X| > ε i.o.) > 0.
Now use a Borel–Cantelli Lemma.
(b) Explain why we may choose an increasing sequence of natural numbers ⟨nk⟩k such that P(|Xnk −
X| > 2^{−k}) < 2^{−k} for all k ∈ N. Given ε > 0, choose k0 such that 2^{−k0} < ε. Explain why
Σ_{k≥k0} P(|Xnk − X| > ε) < ∞. Now use (a).]
Chapter 7
Conditional Expectation
Suppose that B1, B2, . . . is a countable partition of Ω into measurable blocks
which generates G, i.e. such that G = σ(B1, B2, . . . ). In the previous subsection, we defined E[X|G]
for any G ∈ G. In particular, cn := E[X|Bn] has been defined for each block Bn.
Consider now a random variable Z defined by
Z(ω) = cn when ω ∈ Bn
i.e. Z is a random variable which takes the constant value cn on the block Bn :
Z = Σ_n cn I_{Bn}
Since Z := E[X|G] is a random variable, we can try to integrate it. If Z(ω) = cn when ω ∈ Bn,
i.e. if cn = E[X|Bn], then
(i) Z is G–measurable, and
(ii) E[Z; G] = E[X; G] for every G ∈ G.
Here θ is the angle between v1 and v2 . We say that v1 , v2 are orthogonal if hv1 , v2 i = 0. Hilbert
spaces are also complete, i.e. every Cauchy sequence in V converges (to a vector in V ).
Suppose that W is a complete subspace of V. We then have the notion of orthogonal projection
onto W: Given any vector v ∈ V, there exists a unique decomposition
v = v∥ + v⊥
with the following properties:
(1) v∥ ∈ W.
(2) v⊥ ⊥ W, i.e. ⟨v⊥, w⟩ = 0 for all w ∈ W.
(3) ||v − v∥|| = inf{||v − w|| : w ∈ W}.
Thus v∥ is the vector in W which is the best approximation of v: It lies closer to v than any other
w ∈ W. v∥ is called the orthogonal projection of v onto W.
Recall also that L2(Ω, F, P) is a Hilbert space, with inner product ⟨X, Y⟩ = E[XY] and induced
norm ||X||2 = (EX²)^{1/2}. (We ignore here the small difference between L2 and the space L2 of
equivalence classes.)
Proof of Thm. 7.1.3: First assume that X ∈ L2 (Ω, F, P). Note that L2 (Ω, G, P) is a closed
subspace of L2 (Ω, F, P), and thus there exists a decomposition
X =Z +Y where Z ∈ L2 (Ω, G, P) and Y ⊥ L2 (Ω, G, P)
Moreover, ||X − Z||2 = inf{||X − U ||2 : U ∈ L2 (Ω, G, P)}. Now Z is clearly G–measurable. Also, if
G ∈ G, then IG ∈ L2 (Ω, G) and so Y ⊥ IG . Hence
E[Z; G] = hZ, IG i = hX, IG i = E[X; G] all G ∈ G
It follows that Z = E[X|G] a.s.
For X ∈ L1 (Ω, F, P), we use an approximation argument. First assume that X ≥ 0, and for
n ∈ N, define Xn := X ∧ n. Then Xn ↑ X, and each Xn ∈ L2 (Ω, F, P). By the above, there are
Zn ∈ L2 (Ω, G, P) such that Zn = E[Xn |G] a.s. Next, note that if n ≤ m, then Xn ≤ Xm , and
thus Zn ≤ Zm a.s.: For if ε > 0, and Gε := {Zn − Zm > ε}, then Gε ∈ G, so that 0 ≤ ε · P(Gε) ≤
E[Zn − Zm; Gε] = E[Xn − Xm; Gε] ≤ 0. Hence P(Gε) = 0 for all ε > 0. Now {Zn > Zm} = ⋃_{k∈N} G_{1/k},
and hence P(Zn > Zm) = 0, i.e. Zn ≤ Zm a.s., for each pair n ≤ m. Taking the (countable)
intersection over all such pairs yields P(Zn is an increasing sequence) = 1, i.e. the sequence (Zn )n
is increasing a.s. Define Z = lim supn Zn . Then Z is G-measurable, and Zn ↑ Z a.s. If G ∈ G, then
by two applications of the MCT we have
E[Z; G] = lim E[Zn ; G] = lim E[Xn ; G] = E[X; G]
n n
Remarks 7.1.5 (a) If G = {∅, Ω}, the algebra of zero information, then
E[X|G] = EX
Recall that only the constant functions are G–measurable, and it is obvious that the best
constant approximation to X is EX. However, we must be precise, and show that EX is
a version of the conditional expectation. Since EX is just a number, we can regard it as a
constant function, so that EX is G–measurable. Moreover
∫_∅ X dP = ∫_∅ EX dP = 0 and ∫_Ω X dP = ∫_Ω EX dP = EX
and thus ∫_G X dP = ∫_G EX dP for all G ∈ G.
(c) If B ∈ G and P(B) > 0, then E[E[X|G]|B] = E[X|B]. To see this, note that
E[E[X|G]|B] = (1/P(B)) ∫_B E[X|G] dP = (1/P(B)) ∫_B X dP = E[X|B]
Thus the random variables X and E[X|G] have the same conditional expectations if we condition
over events in G.
Example 7.1.6 Let (Ω, F, P) be the probability space which models the rolling of a fair die, and
let G be the σ–algebra which contains the information whether the outcome is odd or even, i.e. G =
σ({1, 3, 5}, {2, 4, 6}). Let X be a random variable with X(ω) = ω 2 . We want to determine E[X|G].
Now E[X|G] is G–measurable, and therefore constant on the sets A = {1, 3, 5} and B = {2, 4, 6}.
The integral of X over the set A is
E[X; A] = (1/6)(1² + 3² + 5²) = 35/6
Since E[X|G] is constant on the set A, with value cA say, we must have
∫_A cA dP = ∫_A E[X|G] dP = ∫_A X dP = 35/6
which yields cA = 35/3. Similarly the value cB of E[X|G] on B must be cB = 56/3. We thus have
E[X|G] = (35/3) I_{{1,3,5}} + (56/3) I_{{2,4,6}}
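Conditional expectation with respect to a σ-algebra generated by a finite partition is just a block-by-block average, so Example 7.1.6 can be computed directly; the following sketch reproduces the values 35/3 and 56/3 exactly.

```python
# Example 7.1.6 computed directly: E[X|G] on the fair-die space, with G
# generated by the blocks {1,3,5} and {2,4,6}, and X(w) = w^2.
from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}
X = {w: w * w for w in omega}
blocks = [{1, 3, 5}, {2, 4, 6}]

Z = {}
for B in blocks:
    pB = sum(P[w] for w in B)
    cB = sum(X[w] * P[w] for w in B) / pB   # E[X; B] / P(B) = E[X | B]
    for w in B:
        Z[w] = cB                            # Z is constant on each block

print(Z)   # {1: 35/3, 3: 35/3, 5: 35/3, 2: 56/3, 4: 56/3, 6: 56/3}
```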
Example 7.1.7 Take Ω = (0, 1] with the σ–algebra of Borel sets and with P the Lebesgue measure
on (0, 1]. Define two random variables X, Y as follows:
X(x) = 3x²
Y(x) = 1 if 0 < x ≤ 1/2, and Y(x) = 4x if 1/2 < x ≤ 1
We want to find a version of E[X|Y]. First we need to find the σ–algebra generated by Y, i.e. we
must describe the family of sets Y⁻¹(B), B ∈ B. Now Y⁻¹({1}) = (0, 1/2], which shows that this
half–open interval is a smallest non–empty set in σ(Y). This makes sense: Y cannot distinguish
between any of the elements of (0, 1/2], and therefore neither can σ(Y). Moreover, if 2 < a < b ≤ 4,
then Y⁻¹(a, b) = (a/4, b/4) ⊆ (1/2, 1]. It is therefore clear that every open set in (1/2, 1] belongs to σ(Y),
and thus that the family B(1/2, 1] of Borel subsets of (1/2, 1] is a subset of σ(Y). It is now easy to see
that A ∈ σ(Y) iff there is B ∈ B(1/2, 1] such that either A = B or A = (0, 1/2] ∪ B. This completely
describes the family σ(Y). Since E[X|Y] has to be σ(Y)–measurable, it must be constant on (0, 1/2].
Also, since (0, 1/2] ∈ σ(Y), we must have
∫_{(0,1/2]} E[X|Y] dP = ∫_{(0,1/2]} X dP = ∫_0^{1/2} 3x² dx = 1/8
so that the random variable E[X|Y] must take the constant value 1/4 for x ∈ (0, 1/2]. The σ–algebra
σ(Y) can distinguish between all the points in the interval (1/2, 1]: For example, if we know that
Y = 3, then we know that the outcome is x = 3/4. Thus if we know the value of Y for values
2 < Y ≤ 4, then we know the outcome, and if we know the outcome, we know the value of X. We
therefore expect the best σ(Y)–measurable approximation of X over (1/2, 1] to be X itself, i.e. we
expect E[X|Y](x) = 3x² for 1/2 < x ≤ 1. Putting this together, define a random variable
Z(x) = 1/4 if 0 < x ≤ 1/2, and Z(x) = 3x² if 1/2 < x ≤ 1
We just need to show that Z is a version of E[X|Y]. By similar arguments as for Y, it is clear
that σ(Z) = σ(Y), and thus that Z is σ(Y)–measurable. Now suppose that A ∈ σ(Y). Then
either A = B for some B ∈ B(1/2, 1] or A = (0, 1/2] ∪ B. In the former case we obviously have
E[Z; A] = E[X; A], because Z = X on the interval (1/2, 1]. In the latter case, we have
E[Z; A] = E[Z; (0, 1/2]] + E[Z; B] = E[X; (0, 1/2]] + E[X; B] = E[X; A]
since E[Z; (0, 1/2]] = 1/8 = E[X; (0, 1/2]].
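The defining property E[Z; A] = E[X; A] is also easy to confirm numerically with simple Riemann sums; the test intervals below are arbitrary illustrative choices.

```python
# Numerical check for Example 7.1.7: Z integrates like X over (0, 1/2] and over
# subintervals of (1/2, 1], approximating the integrals by midpoint sums.
def X(x): return 3 * x * x
def Z(x): return 0.25 if x <= 0.5 else 3 * x * x

def integral(h, a, b, n=100_000):
    dx = (b - a) / n
    return sum(h(a + (i + 0.5) * dx) for i in range(n)) * dx

print(integral(X, 0.0, 0.5), integral(Z, 0.0, 0.5))   # both ~ 1/8
print(integral(X, 0.6, 0.9), integral(Z, 0.6, 0.9))   # agree, since Z = X there
```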
(f ) cFATOU: If Xn ≥ 0, then E[lim inf n Xn |G] ≤ lim inf n E[Xn |G] a.s.
(g) cDCT: If |Xn | < Y (all n ∈ N) for some integrable Y , and if Xn → X, then
E[Xn |G] → E[X|G] a.s.
(h) PROJECTION: E[X · E[Y|G]] = E[E[X|G] · Y] = E[E[X|G] · E[Y|G]].
Since E[Y|G] is integrable, it is finite a.s., and hence can be cancelled to yield lim inf_n (±E[Xn|G]) ≥
±E[X|G], which implies
E[X|G] ≤ lim inf_n E[Xn|G] ≤ lim sup_n E[Xn|G] ≤ E[X|G] a.s.
(h): This follows from the usual properties of projections if X, Y ∈ L2(Ω, F, P): For suppose that
X = X∥ + X⊥, Y = Y∥ + Y⊥ are decompositions of X, Y into components parallel and perpendicular
to L2(Ω, G, P), so that X∥ = E[X|G], Y∥ = E[Y|G] (by the second proof of Thm. 7.1.3). Then
E[X · E[Y|G]] = ⟨X, Y∥⟩ = ⟨X∥, Y∥⟩ = E[E[X|G] · E[Y|G]]
because ⟨X⊥, Y∥⟩ = 0. If X, Y are non–negative (not necessarily in L2(Ω, F, P)), we may define
Xn := X ∧ n, Yn := Y ∧ n. Then 0 ≤ Xn ↑ X and 0 ≤ Yn ↑ Y, and Xn, Yn ∈ L2(Ω, F, P). It follows
by the MCT and cMCT that
E[X · E[Y|G]] = lim_n E[Xn · E[Yn|G]] = lim_n E[E[Xn|G] · E[Yn|G]] = E[E[X|G] · E[Y|G]]
Hence Y E[X|G] is a version of E[Y X|G]. The result now follows by linearity and cMCT.
(j): Consider the case where X ∈ L2 (Ω, F, P). Since L2 (Ω, H, P) ⊆ L2 (Ω, G, P) ⊆ L2 (Ω, F, P)
are closed Hilbert subspaces, the result follows from the fact that a projection of a projection is a
projection. Alternatively, let
Y := E[X|G] a.s., Z := E[E[X|G]|H] = E[Y|H] a.s.
If H ∈ H ⊆ G, then
E[Z; H] = E[Y; H] = E[X; H]
and hence Z is a version of E[X|H], i.e. E[E[X|G]|H] = E[X|H] a.s.
The fact that E[E[X|H]|G] = E[X|H] a.s. follows directly from (i).
(k): Let Y := E[X|G]. Since Y is certainly G ∨ H–measurable, we must show that E[Y ; F ] =
E[X; F ] for all F ∈ G ∨ H. Now let C := {G ∩ H : G ∈ G, H ∈ H}, and let D := {F ∈ G ∨ H :
E[Y ; F ] = E[X; F ]}. First note that C ⊆ D: For if G ∈ G, H ∈ H, then E[X; G∩H] = E[XIG ]E[IH ],
by independence, and so
E[X; G ∩ H] = E[X; G]E[IH ] = E[Y ; G]E[IH ] = E[Y ; G ∩ H]
since Y IG is independent of H. It is straightforward to verify that C is a π–system that generates
G ∨ H, and that D is a λ–system. Hence by Dynkin’s Lemma (Thm. 1.6.3), D = G ∨ H.
a
Proposition 7.2.3 (Jensen’s inequality)
Suppose that g : U → R is a convex function on an open interval U ⊆ R, and that X is
a random variable with values in U (a.s.) such that both X and g(X) have finite expected
values. Then
E[g(X)|G] ≥ g (E[X|G])
Proof: We use notation and results from Remarks 4.7.4. Let v ∈ U, and let D−(v) = lim_{u↑v} ∆(u, v)
and D+(v) = lim_{w↓v} ∆(v, w). Then D−(v), D+(v) both exist, and D−(v) ≤ D+(v). Now suppose that
m is a real number satisfying D−(v) ≤ m ≤ D+(v), and that x ∈ U. We consider two cases: If (i)
x ≤ v, then ∆(x, v) ≤ D−(v) (since ∆(u, v) increases as u ↑ v) and thus ∆(x, v) ≤ m. It follows
that g(x) ≥ m(x − v) + g(v).
that g(x) ≥ m(x − v) + g(v). Next, if (ii) x ≥ v, then ∆(v, x) ≥ D+ (v) (because ∆(v, w) decreases
as w ↓ v) and thus ∆(v, x) ≥ m. It follows that g(x) ≥ m(x − v) + g(v). Hence, in either case, we
have
g(x) ≥ m(x − v) + g(v)
Examples 7.2.4 (a) Suppose that X is a random variable in Lp(Ω, F, P), where p ≥ 1, and let Y
be a version of E(X|G). Then Y ∈ Lp(Ω, F, P) as well. This is because g(x) = |x|^p is convex.
By Jensen’s inequality, we therefore have |E(X|G)|^p ≤ E(|X|^p |G) a.s., i.e. |Y|^p ≤ E(|X|^p |G). It
follows that E|Y|^p ≤ E[E(|X|^p |G)] = E|X|^p < +∞, by (I).
(b) Suppose that X ∈ L2(Ω, F, P) (i.e. that var(X) exists). If Y is a version of E(X|G), then
Y ∈ L2(Ω, F, P) as well, by (a), and EY² ≤ EX². Since EY = EX (by (I)), we thus have
var(Y) = EY² − (EY)² ≤ EX² − (EX)² = var(X)
Thus var(Y) ≤ var(X). This reflects the fact that Y, being cruder, can’t vary as much as X
can.
Definition 7.3.1 Let µ, ν be measures on a measurable space (Ω, F).
(i) We say that ν is absolutely continuous w.r.t. µ, and write ν ≪ µ, iff µ(A) = 0 implies
ν(A) = 0 for all A ∈ F.
(ii) We say that µ, ν are equivalent, and write µ ∼ ν, iff both ν ≪ µ and µ ≪ ν.
(iii) We say that µ, ν are mutually singular, and write µ ⊥ ν, iff there exists A ∈ F such
that µ(A) = 0 = ν(Ac).
Remarks 7.3.2 By definition, two measures are equivalent iff they have the same null sets. It is
easy to show that, in that case, they have the same sets of positive measure, and (assuming they
are probability measures) they have the same sets of measure 1.
Suppose that (Ω, F, µ) is a measure space, and that f ∈ F+. Recall from Propn. 4.5.1 that
there is a measure ν on (Ω, F), defined by
ν(A) = ∫_A f dµ
The map f is called the density, or Radon–Nikodým derivative, of ν w.r.t. µ, and is also denoted
dν/dµ. Furthermore, we showed that
∫ g dν = ∫ g (dν/dµ) dµ
for any ν–integrable function g (cf. Propn. 4.5.2, the Chain Rule).
It should be clear that ν ≪ µ.
The Radon–Nikodým Theorem (below) states that this way of constructing an absolutely con-
tinuous measure is the only way to do so: If ν ≪ µ, then ν has a density, i.e. then ν(A) = ∫_A f dµ
for some non–negative measurable f.
We leave the following proposition as an exercise:
Proposition 7.3.3 (a) Let µ, ν, η be σ–finite measures on a measurable space (Ω, F), and
suppose that dν/dµ and dη/dν exist. Then dη/dµ exists, and
dη/dµ = (dη/dν) · (dν/dµ)
(b) If dν/dµ exists and dν/dµ > 0 µ–a.e., then dµ/dν exists and
dµ/dν = 1/(dν/dµ) µ–a.e.
where dµ/dν may be defined to be an arbitrary constant (e.g. 0) where dν/dµ = 0.
In particular, µ, ν are equivalent measures.
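For measures on a finite set, Radon–Nikodým derivatives are simply pointwise ratios of weights, and the chain rule of part (a) reduces to an identity of fractions. The weights below are arbitrary strictly positive choices, so all the absolute-continuity hypotheses hold trivially.

```python
# Finite sanity check of Proposition 7.3.3(a): for measures given by weights on
# a three-point space, d(eta)/d(mu) = d(eta)/d(nu) * d(nu)/d(mu) pointwise.
from fractions import Fraction as F

omega = ["a", "b", "c"]
mu  = {"a": F(1), "b": F(2), "c": F(3)}   # all weights > 0, so nu << mu etc.
nu  = {"a": F(2), "b": F(1), "c": F(6)}
eta = {"a": F(4), "b": F(3), "c": F(2)}

dnu_dmu  = {w: nu[w] / mu[w] for w in omega}
deta_dnu = {w: eta[w] / nu[w] for w in omega}
deta_dmu = {w: eta[w] / mu[w] for w in omega}

for w in omega:
    assert deta_dmu[w] == deta_dnu[w] * dnu_dmu[w]   # the chain rule
print(dnu_dmu, deta_dnu, deta_dmu)
```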
Theorem 7.3.4 (Radon–Nikodým Theorem) Let µ, ν be σ–finite measures on (Ω, F) with
ν ≪ µ. Then dν/dµ exists, i.e. there is a non–negative measurable f such that ν(A) = ∫_A f dµ
for all A ∈ F.
Proof: We prove the result for the case where µ, ν are finite measures on (Ω, F), and ν ≪ µ.
If ν(Ω) = 0, then ν(A) = ∫_A 0 dµ for all A ∈ F, i.e. dν/dµ = 0. We may therefore assume that
ν(Ω) > 0, from which it follows that also µ(Ω) > 0.
Define the probability space (Ω̃, F̃, P) by
Ω̃ := Ω × {0, 1}, F̃ := {(A × {0}) ∪ (B × {1}) : A, B ∈ F}
and
P((A × {0}) ∪ (B × {1})) := (1/2)(µ(A)/µ(Ω) + ν(B)/ν(Ω))
(It is straightforward to verify that (Ω̃, F̃, P) is a probability space.)
Now define a random variable X and a σ–subalgebra G̃ ⊆ F̃ by
X := I_{Ω×{1}}, G̃ := {A × {0, 1} : A ∈ F}
Let Z be a version of E[X|G̃]. Since Z is G̃–measurable, Z(ω, 0) = Z(ω, 1), so that we may define
g : Ω → R by
g(ω) := Z(ω, j), j = 0, 1
For every A ∈ F we have E[Z; A × {0, 1}] = E[X; A × {0, 1}] = P(A × {1}), which works out to
(1/2µ(Ω)) ∫_A g dµ = (1/2ν(Ω)) ∫_A (1 − g) dν   (∗)
Let η(A) denote the common value in (∗). Then
dη/dµ = g/(2µ(Ω)), dη/dν = (1 − g)/(2ν(Ω))
and thus η ≪ µ, η ≪ ν.
With A := {ω ∈ Ω : g(ω) = 1} in (∗), we see that (1/ν(Ω)) ∫_A (1 − g) dν = 0, and hence that
(1/µ(Ω)) ∫_A g dµ = 0 also. It follows that µ({ω ∈ Ω : g(ω) = 1}) = 0, and since ν ≪ µ, that
ν({ω ∈ Ω : g(ω) = 1}) = 0. In particular ν({ω ∈ Ω : 1 − g(ω) = 0}) = 0, and hence dη/dν > 0 ν–a.e.
It follows that dν/dη exists, and that dν/dη = 1/(dη/dν) — cf. Propn. 7.3.3.
Now define f : (Ω, F) → (R, B(R)) by
f(ω) := (ν(Ω)/µ(Ω)) · g(ω)/(1 − g(ω)) if g(ω) < 1, and f(ω) := 0 else
Then
∫_A f dµ = ∫_A [g/(2µ(Ω))] / [(1 − g)/(2ν(Ω))] dµ = ∫_A (dη/dµ)(dν/dη) dµ = ∫_A (dν/dµ) dµ = ν(A)
and thus f = dν/dµ.
a
Exercise 7.3.5 Extend the above proof to the case where µ, ν are σ–finite.
Appendix A
Logic and Sets
A.1 Logic
We introduce here a formal language for talking about mathematical objects. This language is
very precise, and unambiguous — properties which are largely absent from spoken languages such
as English, but obviously essential for mathematics. But, as a result, this language is rather
restricted in scope. The reason we use it is to make certain statements amenable to logical analysis.
The purpose of logical analysis is to decide whether a particular sentence/expression (e.g. about
mathematical objects) is true (T) or false (F). A sentence/expression that is either true or false
(but not both!) is called a statement.
More complicated statements in our formal language are built up from a collection of symbols,
including amongst others
• Symbols for objects, operations and relations;
• Logical Connectives;
• Quantifiers;
We will briefly discuss each of these in turn. None of this material is difficult, though it may take
a little while to get used to.
In probability theory, the set operations of union, intersection and complementation can be
interpreted as the logical connectives or, and and not.
In our formal language, the logical connectives have very precise meanings: If φ, ψ denote
statements, then
This means, for example, that if φ is true and ψ is false — in the second row of the table — then
φ ∧ ψ is false, φ ∨ ψ is true, φ → ψ is true, etc.
Now it is extremely important to note that the logical use of and ∧, or ∨, and implies →,
though related to their common usage in English, is certainly not identical to it. In particular the
truth value T or F of an expression such as φ ∧ ψ, φ ∨ ψ, φ → ψ etc. depends only on the truth
values of φ and ψ, and not on any meaning that the statements φ, ψ might possess! Let us discuss
some of the pitfalls:
• And, ∧:
  φ ψ | φ ∧ ψ
  T T |   T
  T F |   F
  F T |   F
  F F |   F
To say that φ ∧ ψ is true simply means that both φ and ψ are true. It does not assert any connection
(causal or otherwise) between φ and ψ. This is not typically true in English. With the
English and, the following sentences have rather different meanings, but with the logical and
they mean the same thing:
• Or, ∨:
  φ ψ | φ ∨ ψ
  T T |   T
  T F |   T
  F T |   T
  F F |   F
φ ∨ ψ is true precisely when at least one of φ, ψ is true, possibly both. In particular, it is not
exclusive–or (“either. . . , or. . . ”). Thus the statement
is true.
• Implies, Then, →:
  φ ψ | φ → ψ
  T T |   T
  T F |   F
  F T |   T
  F F |   T
The logical (or material) implication is likely to present you with the most difficulties, as it
diverges considerably from its meaning in natural language. In English usage, implies (or
then) usually involves a causal connection, as in “If it is raining, then it is wet outside.” It is
wet because of the rain. But such a connection is irrelevant for the logical then. As we said
before, the truth value of the statement φ → ψ depends only on the truth values of φ and ψ.
Now the one thing that is certain is that a true statement cannot imply a false statement.
Thus T → F must be false, and we define:
Note that a false statement can obviously imply a false statement. This is because any
statement implies itself, i.e. φ → φ. Indeed, if φ is true, then φ is true. So φ → φ is true even
when φ is false. But a false statement can also imply a true statement, e.g. “If the moon is
made of green cheese, then the moon has mass”.
Thus there are severe differences between the English usage and the mathematical usage of
implies. For example, the statement (1 > 0) → (5 is prime)
is true. Of course, the reason that 5 is prime is not because of the fact that 1 > 0! There is
no causal connection. Indeed, (1 < 0) → (5 is prime) is true also!
We repeat: A logical φ → ψ statement is false only when φ is true and ψ is false — just look
at the truth table.
A.1.3 Quantifiers
Many mathematical statements assert the existence of a mathematical object with certain proper-
ties. For example to say that
x2 − 1 = 0 has a real root
is to say that there exists a real number c such that c2 − 1 = 0.
Other mathematical statements assert that something is true for all objects (of a prespecified type),
for example
For every real number x, x2 ≥ 0.
We therefore introduce the following symbols for quantifiers:
∀ For all
∃ There exists
A quantifier always occurs in conjunction with a variable, i.e. as ∀x or as ∃x. Thus if φ(x) is a
statement about x, then
Frequently, if we want to restrict the domain to a particular set X, we may also write ∀x ∈ X φ(x)
or ∃x ∈ X φ(x). Thus
(∃x ∈ X)φ(x) is true iff there is at least one x ∈ X for which the statement φ(x) is true
Thus the statement ∃x ∈ R(x2 − 1 = 0) asserts that the equation x2 − 1 = 0 has a real root.
The statement ∀x ∈ R(x2 ≥ 0) asserts that the square of any real number is non–negative.
Exercise A.1.2 Decide if the following sentences about real numbers are true or false:
(b) ∃x ∈ N(4x = 1)
(c) ∃x ∈ R(4x = 1)
(d) ∀x ∈ R ∃y ∈ R(x ≤ y)
(e) ∃y ∈ R ∀x ∈ R(x ≤ y)
(h) ∀x ∈ R ∀y ∈ R ∃z ∈ R[x + z = y]
(i) ∃z ∈ R ∀x ∈ R ∀y ∈ R[x + z = y]
For if it isn’t the case that the statement ϕ(x) is true for every x, then there is at least one x
for which the statement ϕ(x) is false, and thus for which ¬ϕ(x) is true. Thus a negation sign can
“creep” past a quantifier, but it flips the quantifier in the process. For example, ¬∀x φ(x) ≡ ∃x ¬φ(x), and ¬∃x φ(x) ≡ ∀x ¬φ(x).
One more thing: The variable x in a statement of the form ∀xφ(x) or ∃xφ(x) is unimportant, i.e.
the meaning of the statement remains the same if we change the variable (provided that the new
variable does not already occur in the statement φ). This is just like what happened for definite
integrals: For example, we have
∫_a^b f(x) dx = ∫_a^b f(y) dy
Just so, we have ∀x φ(x) ≡ ∀y φ(y) and ∃x φ(x) ≡ ∃y φ(y).
A.2 Sets
In the early twentieth century, the following principle was established:
All mathematical objects are sets.
All mathematical notions can be expressed as relationships between sets.
x∈A (x is an element of A)
We will frequently write ⋃ Ai or ⋃_i Ai instead of ⋃_{i∈I} Ai. We will also write ⋃_{n=1}^∞ An instead
of ⋃_{n∈N} An. The same holds for ⋂.
etc.
A − B = {x : x ∈ A ∧ x 6∈ B}
A∆B = (A − B) ∪ (B − A) = (A ∪ B) − (A ∩ B)
Often, we will be working with subsets of some universal set Ω. If A ⊆ Ω, we define the
complement of A by
Ac = Ω − A
• Idempotent laws:
A∪A=A A∩A=A
• Commutative laws:
A ∪ B = B ∪ A  A ∩ B = B ∩ A
• Associative laws:
(A ∪ B) ∪ C = A ∪ (B ∪ C)  (A ∩ B) ∩ C = A ∩ (B ∩ C)
• Distributive laws:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)
A ∩ ⋃_{i∈I} Bi = ⋃_{i∈I} (A ∩ Bi)  A ∪ ⋂_{i∈I} Bi = ⋂_{i∈I} (A ∪ Bi)
• Absorption laws:
A ∩ (A ∪ B) = A A ∪ (A ∩ B) = A
• Complementation laws:
(Ac )c = A (A∆B)c = Ac ∆B
• De Morgan’s laws:
(A ∩ B)c = Ac ∪ B c (A ∪ B)c = Ac ∩ B c
(⋃_{i∈I} Ai)^c = ⋂_{i∈I} Ai^c  (⋂_{i∈I} Ai)^c = ⋃_{i∈I} Ai^c
A − (B ∪ C) = (A − B) ∩ (A − C)  A − (B ∩ C) = (A − B) ∪ (A − C)
A − ⋃_{i∈I} Bi = ⋂_{i∈I} (A − Bi)  A − ⋂_{i∈I} Bi = ⋃_{i∈I} (A − Bi)
Using ordered tuples, we can define one more way of making new sets from old:
(a) Suppose that A1 , A2 , . . . , An are sets. The cartesian product of A1 , . . . , An is the set
of all n–tuples (a1 , . . . , an ), with each ak ∈ Ak .
A1 × A2 × · · · × An = {(a1 , a2 , . . . , an ) : ak ∈ Ak for k = 1, 2, . . . , n}
• In probability theory, Cartesian products are used to combine random experiments. The
experiment of tossing a coin is modelled by the sample space ΩC := {T, H}. The experiment of
rolling a die is modelled by the sample space ΩD := {1, 2, 3, 4, 5, 6}. The combined experiment
of tossing a coin and then rolling a die is modelled by the sample space
Ω := ΩC × ΩD = {(T, 1), (T, 2), . . . , (T, 6), (H, 1), (H, 2), . . . , (H, 6)}
i.e.
f −1 ◦ f = 1A f ◦ f −1 = 1B
where 1A : A → A : a 7→ a is the identity function (which maps an element a to itself), and similarly
1B : B → B : b 7→ b.
The above is possible only for bijective functions f. We now introduce a similar notion which
will work for any function f : A → B. However, instead of operating on elements, these operate
on sets: With every function f : A → B (not necessarily invertible), we can associate two new
functions between the power sets of A and B:
f[·] : P(A) → P(B) and f⁻¹[·] : P(B) → P(A)
Thus f[·] assigns to each subset A′ of A a subset f[A′] ⊆ B. Similarly, f⁻¹[·] transforms each subset
B′ of B into a subset f⁻¹[B′] ⊆ A.
We will, for the moment, use square brackets to distinguish the various functions, but will drop
this convention later. Which function is meant will be clear from context. We shall also call f [A0 ]
the direct image of A′ along f, and f⁻¹[B′] the inverse image or pullback of B′ along f. Note that
f[A′] = set of all images f(a) of a ∈ A′
whereas
f⁻¹[B′] = set of all preimages of b ∈ B′
Inverse images play a very important role in mathematics. It is therefore useful to remember
the following:
(d) Give an example to show that we may not have f [G ∩ H] = f [G] ∩ f [H];
(f) Give an example to show, in (e), that both ⊆’s may fail to be =’s.
[x]R := {y ∈ X : xRy}
We may write just [x] if R is understood from context.
The following device yields a fruitful way of looking at equivalence relations:
Definition A.4.3 A family P of non–empty sets is called a partition of a set X if and only if:
(i) ⋃{P : P ∈ P} = X.
(ii) Any two distinct members of P are disjoint, i.e. P, Q ∈ P and P 6= Q implies P ∩ Q = ∅.
The members P ∈ P are sometimes called the blocks of the partition. Observe that (i) of the above
definition states that each x ∈ X belongs to at least one block, whereas (ii) states that no x ∈ X
belongs to more than one block. Thus each x ∈ X belongs to exactly one block.
Exercise A.4.4 (a) Every equivalence relation on X induces a partition of X: Let R be an equiv-
alence relation on a set X. Define a family of sets PR by
PR := {[x]R : x ∈ X}
Show that PR is a partition of X.
(b) Every partition of X induces an equivalence relation on X: Let P be a partition of X. Define
a binary relation RP on X by
x RP y if and only if ∃P ∈ P (x ∈ P ∧ y ∈ P)
(i.e. x RP y if and only if x, y belong to the same block of P.) Show that RP is an equivalence
relation. Further show that the blocks of P are precisely the equivalence classes of RP , i.e.
that [x]RP = P if and only if x ∈ P .
(c) Show that the constructions in (a), (b) above are inverses of each other, in the following sense:
Starting from an equivalence relation R, the construction in (a) yields a partition PR . If we
then apply the construction in (b) to the partition PR , we get an equivalence relation RPR .
This relation is precisely the original equivalence relation R, i.e. RPR = R.
Similarly, PRP = P.
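The correspondence between equivalence relations and partitions is easy to see in miniature. The sketch below uses the relation "x ≡ y (mod 3)" on a small finite set, an arbitrary illustrative choice.

```python
# Equivalence classes as a partition: x ~ y iff x mod 3 == y mod 3 on
# X = {0, ..., 9}. The classes are pairwise disjoint and their union is X.
X = range(10)
classes = {}
for x in X:
    classes.setdefault(x % 3, []).append(x)   # [x] is determined by x mod 3

print(list(classes.values()))
# [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```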
If R is an equivalence relation on a set X, then the set of equivalence classes — which we above
denoted by PR — is usually denoted by X/R. The equivalence class of an element x — above
denoted by [x]R — is often denoted by x/R.
Here is another reason why equivalence relations are ubiquitous in mathematics:
Exercise A.4.5 (a) Every function induces an equivalence relation on its domain: Suppose that
f : X → Y is a function. Define a binary relation R on X by
x R y if and only if f(x) = f(y)
The relation R is called the kernel of f, and denoted R = ker f. Show that ker f is an
equivalence relation.
equivalence relation.
(b) Assuming the Axiom of Choice, every equivalence relation on a set X is induced by a function
with domain X: Show that if R is an equivalence relation, then there is a function f with
domain X such that R = ker f .
Appendix B
Convergence
• We say ⟨xn⟩n has property P infinitely often iff P(xn) is true for infinitely many n.
• We say ⟨xn⟩n has property P eventually iff P(xn) is true for all n from some point onwards.
These are intuitive descriptions, not formal definitions — we’ll come to that. But first, some
examples:
Examples B.1.1 (1) The sequence 1, 2, 3, 4, 5, . . . is prime infinitely often: Infinitely many terms
are prime numbers. It is also even infinitely often. It is eventually greater than 1010 .
(2) The sequence −2, −1, 0, 1, 2, 3, . . . is positive eventually: From the fourth term onwards, all
terms are positive (i.e. > 0).
Remarks B.1.2 • First note that ⟨xn⟩n has property P eventually if and only if there exists
an N ∈ N such that every term after the Nth has property P. Formally,
(∃N ∈ N)(∀n ≥ N) [xn has property P]
• Next note that if ⟨xn⟩n has property P infinitely often, then the following is true:
(∀N)(∃n ≥ N) [xn has property P]   (∗)
For if this is not the case, then there is some N such that no n ≥ N has property P . Thus if
xn does have property P , then n < N .
But then there are only finitely many xn which have property P — at most those xn for
n = 1, 2, . . . , N − 1!! This contradicts the assumption that hxn in has property P infinitely
often.
It follows that ⟨xn⟩n has property P infinitely often if and only if (∗) is true.
• Finally, observe that infinitely often and eventually are closely related: If it is not the case that
a sequence hxn in has property P infinitely often, then eventually hxn in must have property
¬P .
¬(∀N )(∃n ≥ N ) [xn has property P ] ≡ (∃N )(∀n ≥ N ) [xn has property ¬P ]
Definition B.1.3 (a) We say that a sequence ⟨xn⟩n has property P infinitely often if and
only if
(∀N ∈ N) (∃n ∈ N) [n ≥ N ∧ xn has property P]
(b) We say that a sequence ⟨xn⟩n has property P eventually if and only if
(∃N ∈ N) (∀n ∈ N) [n ≥ N → xn has property P]
The notion “small” is subjective, so we will demand that it holds for absolutely anybody’s idea of
“small”. Specifically, suppose you define “small” by specifying some number ε > 0 and saying “A
non–negative number x is small iff x < ε”. To say that hxn in is eventually small then means that
from some point onwards all the xn ’s are small, i.e
∃N ∀n ≥ N [xn < ε]
This must be true no matter what gauge ε > 0 of “smallness” you use. Thus:
xn → 0 if and only if ∀ε > 0 ∃N ∈ N ∀n ≥ N [xn < ε]
We also write
lim_{n→∞} xn = 0
Thus xn → 0 iff given any ε > 0 it is possible to find a natural number N such that
xn < ε whenever n ≥ N
The number N typically depends on ε: The smaller ε > 0, the greater N usually has to be. It is
now rather simple to define convergence of arbitrary sequences in R: To say that xn → x means
that the distance between xn and x converges to 0, i.e.
xn → x ⇔ |xn − x| → 0
Now the distance |xn − x| between xn and x is non–negative, so we already know what |xn − x| → 0:
It means ∀ε > 0 ∃N ∀n ≥ N [|xn − x| < ε]. Thus
xn → x if and only if ∀ε > 0 ∃N ∈ N ∀n ≥ N [|xn − x| < ε]
We then say that ⟨xn⟩n is a convergent sequence, with limit x. We also write
x = lim xn for xn → x
n
Thus xn → x if and only if, given any ε > 0, the distance between xn and x is eventually < ε, i.e.
all but finitely many terms xn lie within a distance of ε of x.
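The definition is constructive enough to automate: for a concrete sequence one can exhibit the witness N for any given ε. A minimal sketch for xn = 1/n (the sequence is an arbitrary illustrative choice):

```python
# The epsilon-N game for x_n = 1/n -> 0: given eps, produce an N that works,
# and spot-check the condition x_n < eps for n >= N.
import math

def witness_N(eps):
    """Smallest N with 1/n < eps for all n >= N (valid for x_n = 1/n)."""
    return math.floor(1 / eps) + 1

for eps in (0.1, 0.01, 0.001):
    N = witness_N(eps)
    print(eps, N, all(1 / n < eps for n in range(N, N + 1000)))
# The smaller eps is, the larger N has to be -- as the text notes.
```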
Let A ⊆ R.
(a) We say that u ∈ R is an upper bound of A if and only if
∀a ∈ A(a ≤ u)
(b) We say that l ∈ R is a lower bound of A if and only if
∀a ∈ A(l ≤ a)
(c) We say that A is bounded if and only if it has both an upper bound and a lower bound.
(d) We say that u0 is the supremum, or least upper bound of A if and only if the following
hold:
(i) u0 is an upper bound of A; and
(ii) if u is any upper bound of A, then u0 ≤ u.
We write
u0 = sup A or u0 = l.u.b.(A)
(e) We say that l0 is the infimum, or greatest lower bound of A if and only if the following
hold:
(i) l0 is a lower bound of A; and
(ii) if l is any lower bound of A, then l ≤ l0.
We write
l0 = inf A or l0 = g.l.b.(A)
(f) We say that u0 is the maximum of A, and write u0 = max A, if and only if
u0 ∈ A and u0 = sup A
(g) Similarly, we say that l0 is the minimum of A, denoted l0 = min A, if and only if
l0 ∈ A and l0 = inf A
By convention, sup ∅ = −∞ and inf ∅ = +∞.
The following trivial observation is nonetheless very useful: If x < sup(A), then x is smaller
than the least upper bound of A, and hence x is not an upper bound of A (else it would be an
upper bound that is less than the least upper bound). It follows that there is a ∈ A such that x < a.
Thus:
Lemma B.2.4 Let A ⊆ R. If x < sup(A), then there exists a ∈ A with x < a.
Lemma B.2.5 Let A ⊆ R and write −A := {−a : a ∈ A}. Then
sup(−A) = − inf(A)
Proof: Let u be an upper bound for −A. Then u ≥ −x for every x ∈ A, and hence −u ≤ x for
every x ∈ A, i.e. −u is a lower bound for A. Similarly, if l is a lower bound for A, then −l is an
upper bound for −A.
Now let u0 := sup(−A). We must show that −u0 = inf(A). Since u0 is an upper bound for
−A, we see that −u0 is a lower bound for A. Furthermore, if l is another lower bound for A, then
−l is an upper bound for −A, and hence −l ≥ u0 (because u0 is the least upper bound for −A, so
less than the upper bound −l). It follows that −u0 ≥ l, i.e. −u0 is a lower bound of A which is
greater than any other lower bound l. So −u0 = inf(A). a
The existence of suprema and infima is taken as a fundamental axiom of the real number system:
Completeness Axiom:
Every non–empty A ⊆ R which is bounded above has a supremum.
Every non–empty A ⊆ R which is bounded below has an infimum.
Lemma B.2.6 (a) If ⟨xn⟩ is an increasing sequence which is bounded above, then xn → sup_n xn.
(b) If ⟨xn⟩ is a decreasing sequence which is bounded below, then xn → inf_n xn.
Proof: We prove only (a), as (b) is similar. We are given an increasing sequence ⟨xn⟩, with
a := supn xn . Let ε > 0. We must show that there is N ∈ N such that |xn − a| < ε whenever
n ≥ N . Observe that xn ≤ a (as a = supn xn ), so that |xn − a| = a − xn . Now a − ε < a, so by
Lemma B.2.4 there is N such that a − ε < xN . If n ≥ N , then xn ≥ xN > a − ε, since hxn i is
increasing. It follows that ∀n ≥ N (a − xn < ε), and thus that ∀n ≥ N (|xn − a| < ε). a
Next, we show that every sequence of real numbers has a monotone subsequence. For the
purpose of the proof, we briefly introduce some non–standard terminology. Let hxn in be a sequence
of real numbers. Imagine that you are walking along a landscape, and that xn is your height above
sea level at time n. Call xn a vista if you can see the whole landscape ahead of you, i.e. if xn ≥ xm
for all m ≥ n. Thus if hxn in is decreasing, then each xn is a vista, whereas if hxn in is increasing,
there are no vistas at all. If xn := 1 + (−1)^n/n, then every even point x2n is a vista.
Theorem B.2.7 Every sequence of real numbers has a monotone subsequence.
Proof: We consider two cases: Either (1) ⟨xn⟩n has infinitely many vistas, or (2) it has only
finitely many. In case (1), let xn1, xn2, xn3, . . . be the subsequence of vistas, in order of increasing
subscript. Note that
xn1 ≥ xn2 ≥ xn3 ≥ · · · ≥ xnk ≥ . . .
i.e. this subsequence is decreasing, hence monotone. In case (2), choose n1 larger than the index of
every vista. Since xn1 is not a vista, there is n2 > n1 with xn2 > xn1; since xn2 is not a vista, there
is n3 > n2 with xn3 > xn2. Continuing in this way, we obtain an increasing subsequence. a
Theorem B.2.8 (Bolzano–Weierstrass) Every bounded sequence in R has a convergent subsequence.
Proof: By Theorem B.2.7, any sequence ⟨xn⟩ has a monotone subsequence. If ⟨xn⟩ is bounded,
then so is the subsequence. But a bounded monotone sequence converges, by Lemma B.2.6. a
Let ⟨xn⟩ be a bounded sequence in R, and define
yn := sup_{m≥n} xm, zn := inf_{m≥n} xm
Because ⟨xn⟩ is bounded, yn and zn exist (i.e. are finite real numbers), by the Completeness Axiom.
Example B.3.1 Suppose, for example, that xn = (−1)^n/n for n ≥ 1. Then
y1 = sup{−1, 1/2, −1/3, 1/4, −1/5, . . . } = 1/2
y2 = sup{1/2, −1/3, 1/4, −1/5, . . . } = 1/2
y3 = sup{−1/3, 1/4, −1/5, . . . } = 1/4
y4 = sup{1/4, −1/5, . . . } = 1/4
Observe that by Lemma B.2.3 we see that ⟨yn⟩ is a bounded decreasing sequence, and that ⟨zn⟩ is
a bounded increasing sequence.
By Lemma B.2.6, it follows that hyn i converges, and that limn yn = inf n yn . Similarly, limn zn =
supn zn exists also. We now define lim supn xn = limn yn , and lim inf n xn = limn zn :
Definition B.3.2 Let ⟨xn⟩ be a sequence in R. We define the limit superior of ⟨xn⟩ by
lim sup_n xn := lim_n sup_{m≥n} xm = inf_n sup_{m≥n} xm
where we adopt the convention that if ⟨xn⟩ is not bounded above, we set lim sup_n xn = +∞.
Similarly, we define the limit inferior of ⟨xn⟩ by
lim inf_n xn := lim_n inf_{m≥n} xm = sup_n inf_{m≥n} xm
where we adopt the convention that if ⟨xn⟩ is not bounded below, we set lim inf_n xn = −∞.
Let’s analyse the notions of lim sup and lim inf. Let hxn in be a sequence, and let yn :=
supm≥n xm and zn := inf m≥n xm .
• If lim sup_n xn > a, then lim_n yn > a. Hence yn > a for all n (since ⟨yn⟩n is decreasing). It
follows that, for all n, sup_{m≥n} xm > a, i.e. that there exists m ≥ n such that xm > a. Thus
xn > a for infinitely many n, i.e.
lim sup_n xn > a =⇒ xn > a infinitely often
• On the other hand, if xn ≥ a infinitely often, then for every n there is m ≥ n such that
xm ≥ a. Thus yn := sup_{m≥n} xm ≥ a for all n, and hence lim sup_n xn = lim_n yn ≥ a, i.e.
xn ≥ a infinitely often =⇒ lim sup_n xn ≥ a
• From the logical equivalence of ϕ → ψ and ¬ψ → ¬ϕ, and using facts like ¬(xn > a, i.o.) ≡
(xn ≤ a, ev.), we see that
xn ≤ a eventually =⇒ lim sup_n xn ≤ a
and
lim sup_n xn < a =⇒ xn < a eventually
• By Lemmas B.2.5 and B.2.6, we have lim inf n xn = supn inf m≥n xm = − inf n supm≥n (−xm ) =
− lim supn (−xn ). Thus we need to prove a result only for lim sup, in order to get immediately
a corresponding result for lim inf. Similar statements therefore hold for lim inf.
Summarizing in a box:
lim sup_n xn > a =⇒ xn > a infinitely often
xn ≥ a infinitely often =⇒ lim sup_n xn ≥ a
lim sup_n xn < a =⇒ xn < a eventually
xn ≤ a eventually =⇒ lim sup_n xn ≤ a
If you understand the implications in the box, you understand lim sup and lim inf.
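The tail sequences yn and zn can also be computed directly. The sketch below (an illustration only, necessarily truncated to finitely many terms) reproduces the values from Example B.3.1.

```python
# lim sup / lim inf via the tails y_n = sup_{m>=n} x_m and z_n = inf_{m>=n} x_m,
# for x_n = (-1)^n / n, using a finite truncation of the sequence.
xs = [(-1) ** n / n for n in range(1, 2001)]

tails_sup = [max(xs[n:]) for n in range(8)]   # y_1, y_2, ... (decreasing)
tails_inf = [min(xs[n:]) for n in range(8)]   # z_1, z_2, ... (increasing)
print(tails_sup)   # 0.5, 0.5, 0.25, 0.25, ... -> lim sup = 0
print(tails_inf)   # -1.0, -0.333..., ...       -> lim inf = 0
```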
Though a bounded sequence hxn in may not have a limit, it always has a lim sup and a lim inf.
When hxn in does converge, the three notions coincide, and conversely, as we shall see next. Note
that always lim supn xn ≥ lim inf n xn :
Proposition B.3.3 Suppose that hxn in is a bounded sequence of real numbers. Then
hxn in converges if and only if lim supn xn = lim inf n xn . In that case, limn xn =
lim supn xn = lim inf n xn .
Proof: (⇒): Suppose that xn → x, and let ε > 0. Then |xn − x| < ε eventually, and thus in
particular xn ≤ x + ε eventually. Thus lim supn xn ≤ x + ε.
Similarly x − ε ≤ xn eventually, and thus lim inf n xn ≥ x − ε.
It follows that for all ε > 0, we have
x − ε ≤ lim inf_n xn ≤ lim sup_n xn ≤ x + ε
Since ε > 0 was arbitrary, lim inf_n xn = lim sup_n xn = x.
Definition B.4.1 A sequence ⟨xn⟩n in R is called a Cauchy sequence if and only if for
every ε > 0 there is an N ∈ N such that
|xn − xm| < ε whenever n, m ≥ N
Remarks B.4.2 (a) Note that all terms from some point onwards need to be within ε of each
other, not just successive terms. Thus, for example, if N = 100, then not just do we have
|x100 − x101 | < ε, but also |x301 − x15 673 428 | < ε.
(b) A neat way to characterize Cauchy sequences is as follows:
⟨xn⟩n is a Cauchy sequence ⟺ lim_{N→∞} sup_{n≥N} |xn − xN| = 0
Here (⇒) is obvious. (⇐) follows by the triangle inequality: Given ε > 0, choose N such that
supk≥N |xk − xN | < ε/2. Then for n, m ≥ N we have
|xn − xm | ≤ |xn − xN | + |xN − xm | ≤ 2 sup |xk − xN | < ε
k≥N
Example B.4.3 The sequence ⟨1 + (−1)^n 2^{−n}⟩n is Cauchy. Indeed, given ε > 0, we may choose
N ∈ N such that 2^{−N} < ε/2. If n, m > N, then (by the triangle inequality)
|xn − xm| ≤ |xn − 1| + |1 − xm| = 2^{−n} + 2^{−m} ≤ 2 · 2^{−(N+1)} = 2^{−N} < ε/2 < ε
Lemma B.4.4 Every convergent sequence in R is a Cauchy sequence.
Proof: Suppose that xn → x, and that we are given ε > 0. We must find N such that |xn − xm| < ε
whenever n, m > N .
Now because xn → x there is N ∈ N such that |xn − x| < 2ε whenever n ≥ N . In particular, if
n, m ≥ N , then
|xn − xm| ≤ |xn − x| + |x − xm| < ε/2 + ε/2 = ε
Hence hxn in is a Cauchy sequence. a
More importantly, the converse is true: Any Cauchy sequence in R is convergent. To prove this,
we will need a number of lemmas. We shall prove:
• Every Cauchy sequence is bounded.
• Every bounded sequence has a convergent subsequence.
• If a Cauchy sequence ⟨xn⟩n has a convergent subsequence, then ⟨xn⟩n is itself convergent.
Actually, the second point has already been proved. It is the Bolzano–Weierstrass theorem (Theo-
rem B.2.8). Thus we need only prove the first and the last point.
Lemma B.4.5 Every Cauchy sequence ⟨xn⟩n in R is bounded.
Proof: Choose N ∈ N such that |xn − xm| < 1 whenever n, m ≥ N. (This is possible, because
⟨xn⟩n is Cauchy — we have taken ε = 1.) Now define
K := max{|x1|, |x2|, . . . , |xN|} + 1
We show that K is a bound for hxn in , i.e. that |xn | ≤ K for all n ∈ N.
Consider separately the two case (i) n ≤ N , and (ii) n > N . In case (i), we obviously have
|xn | ≤ K, by definition of K. Suppose therefore, that n > N . In that case, both n and N are ≥
N , and thus
|xn | ≤ |xn − xN | + |xN | ≤ 1 + |xN | ≤ K
which finishes case (ii). a
Lemma B.4.6 If hxn in is a Cauchy sequence, and if hxn in has a convergent subsequence, then
hxn in itself converges.
Proof: Suppose that hxnk ik is a subsequence of the Cauchy sequence hxn in , and that xnk → x (as
k → ∞). We show that xn → x (as n → ∞).
So let ε > 0. We must show that there is N ∈ N such that |xn − x| < ε whenever n ≥ N . Now
because hxn in is a Cauchy sequence, we can find an N1 such that
n, m ≥ N1 implies |xn − xm| < ε/2
Also, because xnk → x, we can find a K ∈ N such that |xnk − x| < ε/2 whenever k ≥ K.
Now define N = max{N1, nK}, and let n ≥ N. Choose k such that nk ≥ N. Then (i)
n, nk ≥ N1 , and (ii) k ≥ K (because nk ≥ N ≥ nK ). It follows that
|xn − x| ≤ |xn − xnk| + |xnk − x| < ε/2 + ε/2 = ε
whenever n > N . a
Theorem B.4.7 Let hxn in be a sequence in R. Then hxn in converges if and only if it is
a Cauchy sequence.