
Probability Theory

P. Ouwehand

African Institute of Financial Markets and Risk Management


University of Cape Town
2015
Andrey Nikolaevich Kolmogorov 1903-1987
Contents

1 Events and Probabilities 1


1.1 Motivation for Measure Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Probabilistic Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Countable and Uncountable Sets . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 What is “Area”? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.4 Structure of Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2 Events and σ–algebras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 Properties of Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.1 Additivity properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.2 Limit Operations on Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4.3 Limits of Sets and Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.5 Lebesgue Measure from Coin Tossing . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.6 Other Families of Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1.7 Lebesgue Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.7.1 Outer Measures and their σ–Algebras . . . . . . . . . . . . . . . . . . . . . . 27
1.7.2 Lebesgue measure on R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
1.7.3 Lebesgue Measure on Rd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2 Measurable Functions and Random Variables 33


2.1 Definition of Measurable Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Some Examples of Measurable Functions . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.3 Combinations of Measurable Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4 Approximation by Simple Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Pushing Measures along Functions: The Laws of a Random Variable . . . . . . . . . 42

3 Information and Independence 45


3.1 Conditional Probability and Independence of Events . . . . . . . . . . . . . . . . . . 45
3.2 Information in Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Independence of σ–algebras and Random Variables . . . . . . . . . . . . . . . . . . . 50
3.4 Borel–Cantelli Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52


4 Integration and Expectation 55


4.1 The Integral: Definition and Basic Properties . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Lebesgue’s Dominated Convergence Theorem . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Measure Zero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Riemann Integral vs. Lebesgue Integral . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Chain Rule, Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Definition of Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5 Products and Independence 79


5.1 Product Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.2 Products of Measure Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.2 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

6 Spaces of Random Variables 89


6.1 Topological Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.1.1 Normed Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.1.2 Inner Product Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.1.3 Orthogonal Projection in Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . 94
6.2 Lp Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3 Lp –Spaces and Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.1 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.2 L2 and Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.4 Convergence of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4.1 Modes of Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4.2 Convergence in Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4.3 Relationships between Different Modes of Convergence . . . . . . . . . . . . . 104

7 Conditional Expectation 107


7.1 Definition of Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1.1 Conditioning on an Event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1.2 Conditioning on a σ–Algebra in a Discrete Probability Space . . . . . . . . . 107
7.1.3 Conditioning on a σ–Algebra in a General Probability Space . . . . . . . . . 109
7.2 Properties of Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3 The Radon–Nikodým Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

A Logic and Sets 119


A.1 Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
A.1.1 Symbols denoting Objects, Operations and Relations . . . . . . . . . . . . . . 120
A.1.2 Logical Connectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.1.3 Quantifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
A.2 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.2.1 Union and intersection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125


A.2.2 Set difference, complementation and symmetric difference . . . . . . . . . . . 126


A.2.3 Set algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
A.2.4 Cartesian Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A.3 Functions Operating On Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.4 Equivalence Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

B Convergence 133
B.1 Convergence of Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
B.1.1 “Infinitely Often” and “Eventually” . . . . . . . . . . . . . . . . . . . . . . . 133
B.1.2 Formal Definition of Convergence of Sequences . . . . . . . . . . . . . . . . . 134
B.2 Sup and Inf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
B.3 lim sup and lim inf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
B.4 Cauchy Sequences and Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

Chapter 1

Events and Probabilities

1.1 Motivation for Measure Theory


1.1.1 Probabilistic Modelling
A model for an experiment involving randomness takes the form (Ω, F, P). Intuitively, Ω is the
set of all possible outcomes of the experiment, and is called the sample space. F is the set of all
events, i.e. “permissible” combinations of outcomes. (We shall see that not all combinations of
outcomes need be permissible.) P is a map P : F → [0, 1] which assigns to each (permissible) event a
probability that it occurs.

Example 1.1.1 A die is rolled once. The possible outcomes are the integers from one to six.
Thus the sample space can be taken to be Ω = {1, 2, . . . , 6}. We may be interested in the following
events:

(a) The outcome is the number 1;

(b) The outcome is an even number;

(c) The outcome is an odd number which is strictly greater than 1;

Each of these events can be described by a subset of the sample space. Thus if A, B, C are the
subsets corresponding to the events (a), (b), (c), then

A = {1}
B = {2, 4, 6}
C = {3, 5}

The probabilities of these events, by elementary reasoning, are P(A) = 1/6, P(B) = 1/2 and P(C) = 1/3,
provided that the die is fair. Every subset of Ω is a “permissible” event, and thus F = P(Ω).

Mathematically, an event is a set, i.e. events are just subsets of the sample space. The outcome
of any random experiment must be some element ω of the sample space Ω. Now Ω is itself a subset


of Ω and thus corresponds to some event. We call it the certain event, since we are certain that
ω ∈ Ω. We must always have P(Ω) = 1. The empty set ∅ is also a subset of Ω and thus corresponds
to some event. We call it the impossible event, since it is impossible that an outcome ω is in ∅.
We will always have P(∅) = 0.
Note that the sample space corresponding to a random experiment need not be unique. Consider,
for example, the random experiment of rolling two dice. Then we can choose the sample
space to be the 36–element set Ω1 = {(i, j) : i, j ∈ {1, 2, 3, 4, 5, 6}}. The probabilities for each
outcome are then the same: P(ω) = 1/36 for each ω ∈ Ω1 . This is a so–called uniform distribution.
On the other hand, we can choose the sample space to be the 11–element set Ω2 = {2, 3, 4, . . . , 12}
corresponding to the total of the two dice. In this case, the probability distribution is non–uniform:
P({7}) = 1/6 whereas P({2}) = 1/36. Choosing the sample space and the corresponding probability
distribution for a particular situation is part of the art of probabilistic modelling.
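The two models above assign consistent probabilities to the same events; a quick sketch in Python (the variable names are ours, not the text's) tabulates the non-uniform distribution on Ω2 induced by the uniform model Ω1 :

```python
from fractions import Fraction
from collections import Counter

# The 36 equally likely outcomes (i, j) of Omega_1, mapped to their total i + j.
totals = Counter(i + j for i in range(1, 7) for j in range(1, 7))

# Induced (non-uniform) distribution on Omega_2 = {2, 3, ..., 12}.
p = {t: Fraction(count, 36) for t, count in totals.items()}

print(p[7], p[2])  # 1/6 1/36
```

Summing p over all eleven totals gives 1, confirming that Ω2 exhausts the outcomes.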

Example 1.1.2 A coin is flipped until the first head turns up. This may happen on the first toss
or the second, or. . . or never. Thus the sample space is Ω = {ω1 , ω2 , . . . , ω∞ }, where the outcome
ωn denotes the event that the first head turns up on the nth toss, and ω∞ denotes the event of
never flipping heads. It is clear from elementary probability that P({ωn }) = 1/2^n (provided that the
coin is fair). We may now consider various composite events, such as:

(a) Let A be the event that the first head appears on either the third or the fourth toss. Then
A = {ω3 } ∪ {ω4 } = {ω3 , ω4 }. Clearly P(A) = 1/2^3 + 1/2^4 .

(b) Let B be the event that the first head appears after an even number of tosses. Thus
B = ∪_{n≥1} {ω2n } and P(B) = Σ_{n=1}^∞ 1/2^{2n} = 1/3. Did you think that the probability that the first head
appears after an even number of tosses is 1/2? If so, note that the probability that the first
head appears on the first toss is 1/2, and the probability that the first head appears after an odd
number of tosses is therefore greater than 1/2.

(c) Let C be the event that both events A and B occur. Clearly C = {ω4 } = A ∩ B.

(d) Let D be the event that a head does occur after a finite number of tosses. Thus D is the
complement of the event that heads never occurs. Thus D = Ω − {ω∞ } = {ω1 , ω2 , . . . }. Hence
P(D) = Σ_{n=1}^∞ 1/2^n = 1. This can also be seen from the fact that P({ω∞ }) = 0.
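The probabilities in (a)-(d) can be checked with exact rational arithmetic; a minimal sketch, with helper names of our own choosing:

```python
from fractions import Fraction

def p_first_head(n):
    """P({omega_n}) = 1/2^n: first head of a fair coin on the n-th toss."""
    return Fraction(1, 2 ** n)

# Event A: first head on toss 3 or 4.
p_A = p_first_head(3) + p_first_head(4)

# Event B: first head on an even-numbered toss; the partial sums of
# sum_{n >= 1} 1/2^(2n) converge to 1/3.
p_B = sum(p_first_head(2 * n) for n in range(1, 60))

print(p_A)          # 3/16
print(float(p_B))
```

The partial sum for B falls short of 1/3 only by (1/3)·4^(−59), far below floating-point resolution.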

1.1.2 Countable and Uncountable Sets


In this section, we investigate the idea of the cardinality (or size) of a set, with particular emphasis
on countable sets. We shall soon see why the notion of countability is important for probability
theory, and analysis in general.
For finite sets, we can determine the size of a set simply by counting its elements. Thus for
example, the set {a, b, c} has cardinality 3 (it has 3 elements). We are going to extend this idea of
counting to measure the sizes of infinite sets.

First, we explore the idea of counting. To say that A = {a, b, c} has 3 elements is equivalent
to saying that there is a one-to-one correspondence or bijection between the sets A and {1, 2, 3}:
When we count “One, two, three”, pointing our finger at a, b, c, we are defining a map

f : {1, 2, 3} → A : 1 ↦ a, 2 ↦ b, 3 ↦ c

Thus, when we count a finite collection of objects, we mentally form a list. To count how many
people there are in a room, we may form a list such as

1     2     3       4      ...  27
↕     ↕     ↕       ↕      ...  ↕
Bob   Mary  Stoffel Sannie ...  Cyril

We have made a one-to-one correspondence between a set of numbers

{1, 2, 3, 4 . . . , 27}

and a set of people


{Bob, Mary, Sannie, Stoffel,. . . , Cyril}
More generally, two sets A, X have the same size, i.e. the same number of elements, if there is a
bijection between the elements of A and the elements of X:

A = { a b c ... }
      ↕ ↕ ↕ ...
X = { x y z ... }

Clearly, if there is a surjection from A onto X, then A has at least as many elements as X, and if
there is an injection from A into X, then A has at most as many elements as X.
We can use these ideas to measure infinite sets. We say that a set A of objects is countable if
we can make a finite or infinite list of all of its elements

A = {a1 , a2 , a3 , . . . , an } or A = {a1 , a2 , a3 , a4 , . . . }

Here a1 is the first element on the list, a2 the second, etc. If we allow lists with repetitions, then
we see that a set A is countable if and only if there is a list

A = {a1 , a2 , a3 , a4 . . . }

i.e. the elements of A can be listed 1st , 2nd , 3rd etc., without leaving any out.

Definition 1.1.3 A non–empty set A is countable if and only if there is a surjective map
f : N → A from the set of natural numbers onto A.
The empty set is also defined to be countable.
A set which is not countable is said to be uncountable.

Intuitively, a set is countable if its size is smaller than (or equal to) the size of the set of natural
numbers.
Observe that we take 0 to be an element of N. Following common mathematical practice, a
listing of a set A may often start with a zeroth element: A = {a0 , a1 , a2 , . . . }: We begin counting
from 0.
Example 1.1.4 The Hilbert Hotel is a hotel with infinitely many rooms, numbered 0, 1, 2, 3, . . . .
Imagine that you are the manager of the Hilbert Hotel.
1. One day, someone arrives at your desk and requests a room. You look at the guest list, and
notice that every room is full. What do you do?
2. One day, a busload of infinitely many people (numbered 0, 1, 2 . . . , m, . . . ) arrives at your desk
with a request for accommodation. You look at the guest list, and notice that every room is
full. What do you do?
3. One day infinitely many buses (numbered 0, 1, 2, . . . , n, . . . ) arrive, each bus filled with infinitely
many people, so that the mth person on the nth bus is numbered (n, m). You look at the guest
list, and notice that every room is full. What do you do?
What you do is this:
1. Move every person in the hotel to the next room, i.e. the person in room n will move to room
n + 1. Now room 0 is empty!
2. Move every person in the hotel as follows: The person in room n goes to room 2n. Then all the
odd–numbered rooms are empty. Put bus-passenger m in room 2m + 1.
3. It can be done! Wait and see until after Proposition 1.1.6.

Examples 1.1.5 A non–empty set A is countable if there is a list of all its elements, possibly with
repetitions.
A = {a0 , a1 , a2 , . . . , an , . . . }
(a) Every finite set is countable: If A = {a0 , a1 , . . . , am }, pick an arbitrary a ∈ A, and let an := a
for n > m. Then A = {an : n ∈ N}.
(b) N = {0, 1, 2, 3, . . . } is countable: Take an = n.
(c) The set {0, 2, 4, 6, . . . } of even natural numbers is countable: Take an = 2n.
(d) The set Z of integers is countable: Z = {0, 1, −1, 2, −2, 3, −3, . . . }, i.e.

    an = (n + 1)/2 if n is odd,    an = −n/2 if n is even.
If you think of the positive integers as the people already staying in the Hilbert Hotel, and the
negative integers as the people in an arriving bus, this shows how to accommodate everyone.
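The listing of Z in (d) can be sketched in a few lines (the function name is ours):

```python
def a(n):
    # a_n = (n + 1)/2 for odd n, a_n = -n/2 for even n.
    return (n + 1) // 2 if n % 2 else -(n // 2)

print([a(n) for n in range(7)])  # [0, 1, -1, 2, -2, 3, -3]
```

Every integer appears exactly once: the odd indices enumerate the positive integers and the even indices the non-positive ones.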

Proposition 1.1.6 (a) A subset of a countable set is countable.

(b) Suppose that each of A0 , A1 , . . . , Ak , . . . are countable sets. Then their union
A := ∪_{k∈N} Ak = A0 ∪ A1 ∪ · · · ∪ Ak ∪ . . .

is countable also. (i.e. a union of countably many countable sets is countable).

Proof: (a) Suppose that B := {b0 , b1 , b2 , . . . } is a countable set, and let A ⊆ B. If A = ∅, then A
is countable by definition. If A ≠ ∅, pick a ∈ A. Now define an (n ∈ N) as follows:

an := bn if bn ∈ A, and an := a otherwise.

Then A = {an : n ∈ N}. (Remember: Repetitions are allowed!)


(b) Suppose that each set Ak is countable, and thus can be listed:

Ak := {ak,0 , ak,1 , ak,2 , . . . } (k ∈ N)

Thus ak,n is the nth member of the kth set. Put the members of ∪_{k∈N} Ak into an array
A0 A1 A2 A3 A4 A5 ...
a0,0 a1,0 a2,0 a3,0 a4,0 a5,0 ...
a0,1 a1,1 a2,1 a3,1 a4,1 a5,1 ...
a0,2 a1,2 a2,2 a3,2 a4,2 a5,2 ...
a0,3 a1,3 a2,3 a3,3 a4,3 a5,3 ...
a0,4 a1,4 a2,4 a3,4 a4,4 a5,4 ...
a0,5 a1,5 a2,5 a3,5 a4,5 a5,5 ...
a0,6 a1,6 a2,6 a3,6 a4,6 a5,6 ...
.. .. .. .. .. ..
. . . . . . ...
We can then trace a zig–zag path that moves through all the elements in the array as follows. Start
at the top row and move diagonally down to the left until you reach the leftmost column. Repeat.
We thus obtain a sequence
a0,0 ; a1,0 , a0,1 ; a2,0 , a1,1 , a0,2 ; a3,0 , a2,1 , a1,2 , a0,3 ; a4,0 , a3,1 , . . .

where the successive groups run along the diagonals k + n = 0, k + n = 1, k + n = 2, k + n = 3, . . . ,
and ak,n is the nth element of Ak .


The union A can be listed as follows: A = {a0 , a1 , a2 , . . . , an , . . . }, where
a0 := a0,0 , a1 := a1,0 , a2 := a0,1 , a3 := a2,0 , a4 := a1,1 , . . .
We thus have found a surjective map f : N → A : n ↦ an . Now even though we haven’t given a
formula for f , it is perfectly well-defined, and can be calculated. For example, can you see that
a16 = a4,1 ?
As a matter of fact, there is in this case a formula for f : f (n) = an is found as follows:

• Find the biggest k such that k(k + 1)/2 ≤ n.

• Let j = n − k(k + 1)/2 and i = k − j. Then an := ai,j .
a
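The formula can be written out directly; `diag_index` is our own name for f, and the final line recovers the example a16 = a4,1 from the text:

```python
def diag_index(n):
    """Map n to the pair (i, j) with a_n = a_{i,j} under the zig-zag order."""
    # Largest k with k(k+1)/2 <= n: index n lies on the diagonal i + j = k.
    k = 0
    while (k + 1) * (k + 2) // 2 <= n:
        k += 1
    j = n - k * (k + 1) // 2   # steps taken along the diagonal
    i = k - j                  # the diagonal starts at a_{k,0}
    return (i, j)

print(diag_index(16))  # (4, 1)
```

Running it over n = 0, 1, 2, . . . reproduces the zig-zag sequence a0,0 , a1,0 , a0,1 , a2,0 , . . . exactly.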

Remarks 1.1.7 If you think of A0 as the people already staying in the Hilbert Hotel, and of An as
the nth bus (for n = 1, 2, 3, . . . ), then the proof of Proposition 1.1.6(b) shows how to accommodate
infinitely many busloads (cf. Example 1.1.4). 

Proposition 1.1.8 The set Q of rational numbers is countable.

Proof:

Q = A1 ∪ A2 ∪ · · · ∪ An ∪ . . .

is a union of countably many sets A1 , A2 , A3 , . . . , where An := {0, 1/n, −1/n, 2/n, −2/n, 3/n, −3/n, . . . } is the
set of rational numbers with denominator n. Clearly each An is countable. a

In light of the above, the following proposition may come as a complete surprise:

Proposition 1.1.9 The set of real numbers R is uncountable.


In fact, any sub-interval of R of non-zero length is uncountable.

Proof: Suppose we are given an arbitrary list r0 , r1 , r2 , . . . , rk , . . . (k ∈ N) of real numbers in [0, 1].
Write
r0 = 0.r00 r01 r02 r03 r04 r05 . . .
r1 = 0.r10 r11 r12 r13 r14 r15 . . .
r2 = 0.r20 r21 r22 r23 r24 r25 . . .
r3 = 0.r30 r31 r32 r33 r34 r35 . . .
. . .

where rkn is the nth digit in the decimal representation of rk . Pick a digit an ∈ {0, 1, . . . , 9} so that
an ≠ rnn , and define a ∈ [0, 1] by
a := 0.a0 a1 a2 a3 a4 a5 . . .
(i.e. the nth digit in the decimal representation of a is an .)
By construction, the nth digit in the decimal representation of a differs from the nth digit in
the decimal representation of rn . Hence a ≠ rn (because a and rn do not have the same decimal
expansion¹ ). But as this holds for any n, we are forced to conclude that the number a is not on
the list!
Thus, given any list of real numbers in [0, 1], there is a number in [0, 1] which is not on the list.
It follows that there can be no list of all the numbers in [0, 1]. Thus [0, 1] is uncountable.
¹ For the purists: Because decimal expansions of numbers are not unique (e.g. 0.2499999 · · · = 0.250000 · · · = 0.25)
we have to be a little more careful here. Every non-zero real number does have a unique non-terminating decimal
expansion, i.e. one that does not terminate in all zeroes from some point onwards. So one way to get around the
problem of non-unique decimal expansions is to insist that all the real numbers rn (except 0) be expressed in their
non-terminating decimal expansion, and that each digit an is chosen so that an ≠ 0.

It follows that R is uncountable: if R were countable, then every subset of R, in particular [0, 1], would be countable as well.


Now observe that if [a, b] is a closed interval where a < b, then there is a bijection f : [a, b] →
[0, 1] given by f (x) := (x − a)/(b − a). If it were the case that [a, b] = {xn : n ∈ N} is countable, then
[0, 1] = {f (xn ) : n ∈ N} would be countable as well — but it isn’t! It follows that [a, b] cannot be
countable, i.e. that every closed interval of non–zero length is uncountable.
Finally, every interval of non–zero length contains a closed interval, and hence is uncountable
as well. a
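The diagonal construction in the proof is completely effective; here is a sketch (the function name is ours) that, given the leading digits of any list r0 , r1 , . . . , produces a number differing from every rn . Digits are chosen from {1, 2}, which also sidesteps the non-unique-expansion issue discussed in the footnote:

```python
def diagonal_escape(rows):
    """rows[n] is a string of decimal digits of r_n; return a digit string
    '0.d0d1...' that differs from each r_n in the n-th digit."""
    digits = []
    for n, row in enumerate(rows):
        # Flip the n-th digit of r_n: never 0 or 9, so no expansion ambiguity.
        digits.append('2' if row[n] == '1' else '1')
    return '0.' + ''.join(digits)

print(diagonal_escape(['1415', '7182', '6180', '5772']))  # 0.2211
```

By construction the output disagrees with the nth row in position n, so it cannot equal any rn on the list.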

Thus the rational numbers can fit into the Hilbert Hotel, but the real numbers cannot!

1.1.3 What is “Area”?


“Area” is a number associated with certain subsets of the Euclidean plane R2 , i.e. it is a function
| · | which assigns to a set E ⊆ R2 its area |E|. Intuitively, the area function | · | should have certain
properties:

1. If E ⊆ R2 is bounded, then 0 ≤ |E| < ∞.

2. | · | is rotation– and translation invariant: If a set F is obtained by rotating and shifting


a set E, then |F | = |E|.

3. If E is a rectangle, then |E| = length × breadth.

4. | · | is additive: If E, F are disjoint bounded subsets of R2 , then |E ∪ F | = |E| + |F |.


More generally, | · | is countably additive: if E1 , E2 , . . . is a countable sequence of
mutually disjoint bounded subsets of R², then |∪n En | = Σ_{n=1}^∞ |En |.

From properties (1)-(4) it is easy to calculate the area of triangles, and thus that of polygons. But what
about non–polygonal sets? For example, how do we justify that the area of a circle of radius r is
πr2 ? Before we do this, do the following exercise.
Exercise 1.1.10 Use the properties (1)-(4) of the area function to show the following:
(a) |∅| = 0.

(b) If A ⊆ B then |B − A| = |B| − |A|.

(c) If A ⊆ B then |A| ≤ |B|.


(d) If A1 ⊆ A2 ⊆ A3 ⊆ . . . and if A := ∪n An , then |A| = limn |An |.
[Hint: Let B1 := A1 and, for n = 1, 2, 3, . . . , let Bn+1 := An+1 − An . Then An = ∪_{k=1}^{n} Bk is
a union of disjoint sets. Use the fact that the value of an infinite series is the limit of its partial sums.]

Note that Exercise 1.1.10 (d) relies on the countable additivity of | · |.

Exercise 1.1.11 We use the properties of the area function to show that the area of an open circle
A of radius r is |A| = πr2 .

(a) For n = 1, 2, 3, . . . , let An be the regular open polygon with 2^{n+1} sides, inscribed in a circle of
radius r. Thus A1 is a square, A2 an octagon, etc. Note that A1 ⊆ A2 ⊆ A3 ⊆ . . . . Also note
that ∪n An = A. (This is why we need the sets An and A to be open subsets of R².)

(b) An consists of 2^{n+1} congruent isosceles triangles, constructed by joining each of the sides of
the polygon to the centre of the circle. Explain why each such triangle has area (1/2) r² sin(π/2^n ),
and conclude that |An | = 2^n r² sin(π/2^n ).

(c) Conclude that |A| = limn πr² · sin(π/2^n )/(π/2^n ) = πr² .
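The limit in part (c) is easy to check numerically; a short sketch:

```python
import math

# |A_n| / r^2 = 2^n * sin(pi / 2^n) increases towards pi (since sin x < x).
for n in (1, 5, 10, 20):
    print(n, 2 ** n * math.sin(math.pi / 2 ** n))
```

Already at n = 20 the inscribed polygons capture π to about nine decimal places.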


The technique used to compute the area in Exercise 1.1.11 relies on the set A being approximated
from the inside by triangles. Not every set can be so approximated, however. Take for example the
set A := {(p, q) : p, q are rational numbers with 0 ≤ p, q ≤ 1}. It is not clear what the area of A
should be: On the one hand, the set A is dense inside the unit rectangle, so one might guess that
|A| = 1. If we approximate A from the outside however, we obtain a convincing argument that
|A| = 0:
Exercise 1.1.12 Recall that the set of rational numbers is countable. So we can write the elements
of A in a list: A = {(pn , qn ) : n = 1, 2, 3, . . . }. Fix ε > 0. For each n, let Rn be a square centred at (pn , qn )
with area ε/2^n . Let B = ∪n Rn , and show that |B| ≤ Σn |Rn | = ε. Also show that A ⊆ B, so that
|A| ≤ ε. Since ε > 0 was arbitrary, we have |A| ≤ ε for all ε > 0, i.e. |A| = 0.
Observe that the technique used in Exercise 1.1.12 can be used to prove that every countable
subset of R2 has zero area. It is therefore necessary that the set of real numbers be uncountable
for the concept of area to have a useful meaning!
Not only intervals or unions of intervals have length. Some quite complicated sets can be
measured. Consider the following example:
Exercise 1.1.13 The Cantor set:
The Cantor set C is a subset of [0, 1] which is constructed as follows: Let C0 := [0, 1]. Now let C1
be C0 with its middle third removed, i.e. C1 := [0, 1/3] ∪ [2/3, 1]. Now remove the middle thirds of the
two intervals that make up C1 to form C2 , i.e. C2 := [0, 1/9] ∪ [2/9, 1/3] ∪ [2/3, 7/9] ∪ [8/9, 1]. Continue in this
way, removing the middle thirds of each of the intervals comprising Cn to form Cn+1 . Finally, let
C := ∩_{n=0}^∞ Cn . Then C is called the Cantor set.

(a) Show that λ(C) = 0, where λ(A) denotes the length of a subset A ⊆ R. [Hint: First calculate
λ(Cn ).]

(b) Every real number a ∈ [0, 1] has a non–terminating ternary (base 3) expansion
a = (0.a1 a2 a3 . . . )3 := Σ_{i=1}^∞ ai /3^i , where each ai = 0, 1 or 2 — cf. remarks on base n–expansions
in the box below. Show that the Cantor set is formed by removing all numbers which have a 1
occurring in their non–terminating ternary expansion.
[Hint: Show that C1 is formed by removing all numbers which have a 1 in the first digit, C2 is
formed by removing all numbers in C1 which have a 1 in the second digit, and so on.]

(c) Show that there are as many elements in C as there are real numbers. Conclude that C is
uncountable. [Hint: Define f : C → R so that f ((0.202202 . . . )3 ) = (0.101101 . . . )2 .]
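Parts (a) and (b) can be illustrated computationally; a sketch with helper names of our own (the membership test reads off ternary digits and ignores the endpoint subtlety for numbers, such as 1/3, whose terminating expansion contains a 1):

```python
from fractions import Fraction

def cantor_stage_length(n):
    # C_n is a union of 2^n intervals, each of length 3^(-n).
    return Fraction(2, 3) ** n

def in_cantor(a, digits=40):
    """Check the first `digits` ternary digits of a in [0, 1] for a digit 1."""
    for _ in range(digits):
        a = a * 3
        d = int(a)        # next ternary digit
        if d == 1:
            return False
        a -= d
    return True

print(cantor_stage_length(3))     # 8/27
print(in_cantor(Fraction(1, 4)))  # True:  1/4 = (0.020202...)_3
print(in_cantor(Fraction(1, 2)))  # False: 1/2 = (0.111111...)_3
```

Since λ(Cn ) = (2/3)^n → 0, part (a) follows, while the digit test mirrors the characterisation in part (b).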


Base n–expansions: We are familiar with the fact that every real number has a base
10–expansion, i.e. a decimal expansion using digits in {0, 1, . . . , 9}, e.g.

π = 3.14159265 · · · = 3/10^0 + 1/10^1 + 4/10^2 + 1/10^3 + 5/10^4 + . . .

The reliance on the base 10 is owing to an accident in human evolution: We have 10
fingers! But the number 10 is not special: In exactly the same way, every real number has
a base n–expansion, for every integer n ≥ 2, using digits in {0, 1, . . . , n − 1}. For example

(0.10110)2 = 1/2 + 0/2^2 + 1/2^3 + 1/2^4 = 11/16 = (0.6875)10
(0.2121212 . . . )3 = 2/3 + 1/3^2 + 2/3^3 + 1/3^4 + · · · = 7/8 = (0.875)10

Such an expansion is typically unique, except for one small problem:

(0.(n − 1)(n − 1)(n − 1) . . . )n = Σ_{k=1}^∞ (n − 1)/n^k = ((n − 1)/n) · 1/(1 − 1/n) = 1 = (1.000 . . . )n ,

i.e. every number whose base n–expansion ends in zeroes also has one that ends in (n − 1)’s.
For example,

(0.9999 . . . )10 = (1.0000 . . . )10      (0.011111 . . . )2 = 1/2 = (0.1000 . . . )2
(0.102222 . . . )3 = 4/9 = (0.110000 . . . )3

Say that an expansion is terminating if it eventually ends in all zeroes. Then every real
number has a unique non–terminating base n–expansion.
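The identities in the box can be verified mechanically; a small helper (the name is ours) that evaluates a finite digit string exactly:

```python
from fractions import Fraction

def from_digits(digits, base):
    """Exact value of the finite expansion (0.d1 d2 d3 ...)_base."""
    return sum(Fraction(d, base ** (i + 1)) for i, d in enumerate(digits))

print(from_digits([1, 0, 1, 1, 0], 2))  # 11/16
```

Truncations of the repeating expansion (0.212121 . . . )3 computed this way converge rapidly to 7/8.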

Using an argument due to Vitali in 1905, it can be shown that it is impossible to assign a length
to every bounded subset of R, i.e. there is no length function L(·) satisfying the analogues of
properties (1)-(4) above (with intervals in place of rectangles) which is defined for every bounded
subset of R. Thus there are subsets of R which have no length. This does not mean that these sets
have zero length; it means that there is no number which can be called their length, and which is
consistent with (1)-(4).
We present the proof, but it can be safely omitted:
Example 1.1.14 (⋆) Define an equivalence relation ∼ on R by

x∼y ⇐⇒ y−x∈Q

Let {Ei : i ∈ I} be the collection of equivalence classes of ∼. Note that if x ∈ R, then there exists
q ∈ Q such that 0 ≤ x + q ≤ 1. Now since x ∼ x + q, we see that for every x there is y ∈ [0, 1] such that
x ∼ y. Thus [0, 1] ∩ Ei ≠ ∅ for every i ∈ I.
Now pick² for each i ∈ I one xi ∈ [0, 1] ∩ Ei , and define a Vitali set H by H := {xi : i ∈ I}. Thus for each
y ∈ R there is a unique i ∈ I such that y ∼ xi .
For q ∈ Q, define H + q := {xi + q : i ∈ I}. First note that the H + q are mutually disjoint: For if
y ∈ (H + q) ∩ (H + q′ ) for rational numbers q, q′ , then y = xi + q = xj + q′ for some i, j ∈ I, and thus
xi = xj + (q′ − q), i.e. xi ∼ xj . It follows that xi = xj , thus that q = q′ , and thus that H + q = H + q′ .
² This requires the Axiom of Choice.

Next, we claim that for each y ∈ R there is a unique q ∈ Q such that y ∈ H + q := {xi + q : i ∈ I}.
Indeed, existence follows from the fact that there is an i ∈ I such that y ∼ xi , so that q := y − xi has the
property that y ∈ H + q. Uniqueness follows from the disjointness of the H + q.
Now let {qn : n ∈ N} be an enumeration of Q ∩ [−1, 1]. Note that if x ∈ [0, 1], there is a unique i ∈ I such
that x − xi ∈ Q ∩ [−1, 1], so that x ∈ ∪_{n∈N} (H + qn ). Since H ⊆ [0, 1], we also have ∪_{n∈N} (H + qn ) ⊆ [−1, 2].
Thus:

[0, 1] ⊆ ∪_{n∈N} (H + qn ) ⊆ [−1, 2]

Now suppose that the Vitali set H has a length, i.e. that L(H) exists. Each H + qn is a translation of
H, and thus L(H + qn ) = L(H) for all n ∈ N. Now since [0, 1] ⊆ ∪_{n∈N} (H + qn ) ⊆ [−1, 2], it follows that

1 ≤ L( ∪_{n∈N} (H + qn ) ) ≤ 3

Furthermore, the sets H + qn are mutually disjoint, so

L( ∪_{n∈N} (H + qn ) ) = Σ_{n=1}^∞ L(H + qn ) = Σ_{n=1}^∞ L(H)

The fact that 1 ≤ Σ_{n=1}^∞ L(H) implies that L(H) > 0, whereas the fact that Σ_{n=1}^∞ L(H) ≤ 3 < ∞ implies
that L(H) cannot be positive — contradiction. Hence H ⊆ R is a bounded set to which a length cannot be assigned.

Similarly, there are subsets of R² which have no area, and subsets of R³ which fail to have a volume.
This does not mean that these sets have zero area/volume; it means that there is no number which
can be called their area/volume, and which is consistent with (1)-(4). In fact, for R³, things are
very much worse: Look up the Banach–Tarski Paradox!
Here are some points to consider:

• To calculate the lengths/areas of sets more complicated than intervals/rectangles, it


is necessary to approximate these sets from above and/or below, and to take limits.
The countable additivity of the length/area function is what permits the taking of
limits.

• Not every subset of R/R2 can be assigned a length/area. It will therefore be necessary to exclude these non–measurable sets from consideration: They are not “permissible”.

Now note that length and area may be considered as special cases of probability, namely uniform
probability.

Example 1.1.15 Consider the experiment of randomly choosing a number ω from the unit interval
[0, 1] with equal probability.
• The probability that 0 ≤ ω ≤ 1/2 is 1/2. The event that 0 ≤ ω ≤ 1/2 corresponds to the set [0, 1/2]. Its length is L([0, 1/2]) = 1/2.

• Similarly, if (a, b) ⊆ [0, 1], then the probability that ω ∈ (a, b) is P((a, b)) = L((a, b)) = b − a.

• Thus if E ⊆ [0, 1], then the probability that ω ∈ E is P(E) = L(E).

• In particular, if H is the Vitali set of Example 1.1.14, then the probability that ω ∈ H is
undefined — it is impossible to meaningfully assign a probability to H while maintaining
translation invariance. H is not “permissible”.

Similarly, consider the experiment of choosing a point ω from the unit square [0, 1] × [0, 1] with
equal probability. If E ⊆ [0, 1] × [0, 1], then the probability that ω ∈ E is just A(E).
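This identification of probability with length can be checked empirically. The following sketch (illustrative only; the helper name is my own choosing, not from the text) estimates the probability that a uniformly chosen ω lies in (a, b) by simulation and compares it with the length b − a:

```python
import random

def estimate_uniform_probability(a, b, n_samples=200_000, seed=0):
    """Estimate P(omega in (a, b)) for omega chosen uniformly from [0, 1]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if a < rng.random() < b)
    return hits / n_samples

# The estimate should be close to the length L((a, b)) = b - a.
a, b = 0.2, 0.7
print(estimate_uniform_probability(a, b), b - a)
```

With 200 000 samples the standard error is about 0.001, so the estimate agrees with b − a to roughly two decimal places.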

1.1.4 Structure of Events


In order for our mathematical theory of probability to bear some resemblance to the real world, it
is clear that we should be able to combine events in the following ways:

• If A is an event, then the possibility of A not occurring should also be an event. Now if
the outcome of a random experiment is ω ∈ Ω, then the event A occurs if and only if ω ∈ A
(remember that we consider an event to be a subset of the sample space). Thus the event that
A does not occur corresponds to ω ∉ A, i.e. to the set Ac = Ω − A. We want the probabilities
of these events to be related by P(Ac ) = 1 − P(A).

• If A, B are events, then the possibility of both A and B occurring should also be an event.
Now if the outcome of a random experiment is ω ∈ Ω, then both A and B occur if and only
if ω ∈ A and ω ∈ B, i.e. if and only if ω ∈ A ∩ B. Thus the event of both A and B occurring
corresponds the set A ∩ B.

• In the same way, if A and B are events, then the possibility of at least one of A or B occurring
should be an event as well. This corresponds to the set A ∪ B. We say that events are disjoint
or mutually exclusive if they cannot occur simultaneously. Thus if A, B are disjoint, then ω ∈ A implies ω ∉ B. Clearly, therefore, A and B are disjoint if and only if A ∩ B = ∅ (i.e. the event that both A and B occur is impossible). For disjoint events A and B, we want P(A ∪ B) = P(A) + P(B). This is because P(A ∪ B) = N_{A∪B}/N = (N_A + N_B)/N = P(A) + P(B) (where N_A is the number of elements in the set A, etc.)

• In fact we demand more: Given a countable list of events A1, A2, A3, . . . , the possibilities of either all of these events occurring, or of at least one of these events occurring, should also be events. They correspond to the sets ⋂_{n=1}^∞ An and ⋃_{n=1}^∞ An respectively. If the events An are mutually disjoint, i.e. if An ∩ Am = ∅ whenever n ≠ m, then we want P( ⋃_{n=1}^∞ An ) = Σ_{n=1}^∞ P(An).

The concept of probability has rather a lot in common with that of length, area and volume:

• “Probability” is measured by a non–negative number P(A) assigned to a subset A of the sample space Ω.
“Area” is measured by a non–negative number |A| assigned to a subset of R2.

• P(∅) = 0;
|∅| = 0.

• If An are disjoint events, then P( ⋃n An ) = Σn P(An);
If An are disjoint sets, then | ⋃n An | = Σn |An|.

When we isolate and study the common features of probability and length/area/volume, we get the subject of measure theory. We shall show that we can develop a theory which allows us to form integrals ∫ f dµ of functions f with respect to measures µ (rather than variables). It will turn out that the integral with respect to Lebesgue measure (yet to be defined) is a more powerful generalization of the ordinary Riemann integral. It will also transpire that the integral with respect to a probability measure precisely captures the notion of probabilistic expectation.
Armed with the intuition and motivation provided by the above examples, we now proceed with
the formal theory.

1.2 Events and σ–algebras


To model a random experiment, we need to define three objects:
• A sample space Ω, representing the possible outcomes of the experiment. The outcomes ω ∈ Ω
are called sample points.

• A family F of events.
An event is a (permissible/relevant) subset of Ω. If A is an event, we say that A occurs if the
outcome ω is an element of A.
We shall require F to be a σ–algebra (which we define below).

• A probability measure P which assigns a probability P(A) to each event A.


Thus P : F → [0, 1].
The triple (Ω, F, P) will be called a probability space (subject to certain conditions on F and P).
Recall that an event E ∈ F is said to have occurred if the outcome ω of the random experiment
belongs to E.

Intuitively, we think of F as a set of events E for which we can decide whether or not E
occurred at the termination of the experiment.
Note: . . . whether or not. . . .
This intuition imposes the following constraints on F:
(a) Ω ∈ F and ∅ ∈ F.
Indeed, every outcome ω belongs to Ω, and thus the event Ω always occurs — it’s the certain
event.

Similarly, no outcome ω belongs to ∅, and thus the event ∅ never occurs — it’s the impossible
event.
(b) If E ∈ F, then E c ∈ F, i.e. F is closed under complementation.
For if we can decide whether or not E occurred, then we can also decide whether or not E c
occurred: For suppose that the outcome of the experiment is ω. If E occurred, then ω ∈ E, so
ω 6∈ E c , hence E c did not occur.
Similarly, if E did not occur, then E c did occur.
(c) If E1 , E2 ∈ F, then E1 ∩ E2 ∈ F, i.e. F is closed under intersection.
For if we can decide whether or not E1 occurred, and also whether or not E2 occurred, then
we can decide whether or not E1 ∩ E2 occurred: E1 ∩ E2 occurred iff ω ∈ E1 ∩ E2 iff ω ∈ E1
and ω ∈ E2 iff both E1 and E2 occurred.
Thus if we can decide whether or not E1 , E2 occurred, we can also decide whether or not
E1 ∩ E2 occurred.
(d) Similarly, F is closed under union: The event E1 ∪ E2 occurs iff either E1 occurred, or E2
occurred (or both).
(e) We can generalize (c) and (d) somewhat: If E1, E2, E3, . . . , En, . . . , is a countable sequence of members of F, then also ⋂n En ∈ F and ⋃n En ∈ F, i.e. F is closed under countable intersections and unions.
For ⋂n En occurred iff each of the En occurred, and ⋃n En occurred iff at least one of the En occurred. Thus if we can decide whether or not each En occurred, we can also decide whether or not ⋂n En and ⋃n En occurred.
This leads to the following definitions:
Definition 1.2.1 Let Ω be a set. A collection A of subsets of Ω is called an algebra (or
field) on Ω if

(i) ∅ ∈ A;

(ii) A ∈ A ⇒ Ac ∈ A;

(iii) A, B ∈ A ⇒ A ∪ B ∈ A.

A collection F of subsets of Ω is called a σ–algebra (or σ–field) if it satisfies (i), (ii) and

(iii)σ If An ∈ F (for n ∈ N), then ⋃n An ∈ F.
Thus a σ–algebra on Ω is a non–empty family of subsets of Ω which is closed under complementation
and countable unions.
Remarks 1.2.2 (i) By De Morgan’s rules, an algebra is closed under (finite) intersections, and a σ–algebra is closed under countable intersections:

    ⋂n An = ( ⋃n Anᶜ )ᶜ

(ii) If A is an algebra and A, B ∈ A, then A − B and A∆B belong to A.

(iii) If Ω is a set, then F0 := {∅, Ω} is the smallest σ–algebra on Ω, and F∞ := P(Ω) is the biggest
σ–algebra on Ω.
T
(iv) If {Fi : i ∈ I} is a family of σ–algebras on Ω, then F := i∈I Fi is also a σ–algebra on Ω.

Events are organized in σ–algebras. The set–theoretic operations ⋃, ⋂, (·)ᶜ correspond to the logical combinations or, and, not of events.

Frequently, the events of interest form a collection C which is not a σ–algebra. Suppose that C
is a collection of events which can be decided, i.e. if E ∈ C, then we can decide whether or not E
occurred. We can then also decide whether or not E c occurred, but E c may not be an element of
C. The bigger set F of all events that can be decided, given that we can decide all the events in C,
is a σ–algebra containing C.

Definition and Proposition 1.2.3 Let C be a family of subsets of Ω. There exists a


unique smallest σ–algebra F which contains C (i.e. C ⊆ F, and if G is any σ–algebra such
that C ⊆ G, then F ⊆ G also).
F is called the σ–algebra generated by C, and denoted by

F = σ(C)
Proof: Let 𝒢 = {G : G a σ–algebra with C ⊆ G}, and let F = ⋂ 𝒢. Then F ∈ 𝒢. (Why?)
Moreover, if G is a σ–algebra which contains C, then G ∈ 𝒢, and so F ⊆ G. (Why?) □

We repeat the following important intuition

σ(C) consists of all those events F for which we can decide whether or not F has occurred,
given that we know exactly which of the E ∈ C have occurred.

Exercise 1.2.4 Let Ω = (0, 1], and let A be the family of all those sets which can be written as a
union of finitely many intervals of the form (a, b], where 0 ≤ a < b ≤ 1. Show that A is an algebra,
but not a σ–algebra. 

The following exercise is extremely important for building intuition:

Exercise 1.2.5 Let Ω be a set. A partition of Ω is a collection of pairwise disjoint subsets


{Bi : i ∈ I} whose union is Ω, i.e.

    Bi ∩ Bj = ∅ when i ≠ j,    ⋃_{i∈I} Bi = Ω

The Bi are called the blocks of the partition.



(a) Suppose that Ω is a set, and that B := {Bn : n ∈ N} is a partition of Ω with countably many blocks. Show that the σ–algebra σ(B) generated by B is precisely the family of those sets that can be written as countable unions of the blocks Bn.
[Hint: Consider union and complements of B2 ∪ B4 ∪ B6 ∪ . . . and B2 ∪ B3 ∪ B5 ∪ B7 ∪ B11 ∪ . . . .]

(b) Show that if Ω is a countable set, and if F is a σ–algebra on Ω, then there is a countable
partition B := {Bn : n ∈ N} which generates F.
[Hint: Define a relation ∼ on Ω by

    ω ∼ ω′ ⇔ ∀F ∈ F [ω ∈ F ↔ ω′ ∈ F]

Show that ∼ is an equivalence relation, and consider the equivalence classes of ∼.]

(c) In a gambling game, a die is rolled. The sample space is Ω = {1, 2, . . . , 6}. I will tell you
whether or not the outcome is even, and whether or not the outcome is ≤ 4. What is the
σ–algebra on Ω which contains this information?

(d) Suppose that F = σ(B), where B is a partition consisting of n blocks. Explain why F has exactly 2^n = 2^(no. of blocks) elements.


Definition 1.2.6 If the sample space Ω is a topological space, we define the Borel algebra
of Ω by
B(Ω) = σ(open sets of Ω)
In particular, B(R) is the smallest σ–algebra on R which contains all the open intervals of
R.

B(R) is one of the most important σ–algebras that we shall work with.

Exercise 1.2.7 Prove that the following sets belong to B(R):


(i) All closed intervals [x, y], where x ≤ y ∈ R.

(ii) The half–open intervals (x, y] and [x, y), where x < y ∈ R.

(iii) Every singleton {x}, where x ∈ R.

(iv) Every countable subset of R.

(v) The sets Q of rational numbers and Irr of irrational numbers.




Proposition 1.2.8 Define

C = collection of all intervals of the form (−∞, x], where x ∈ R

Then σ(C) = B(R).



Proof: As (−∞, x] = ⋂_{n=1}^∞ (−∞, x + 1/n), we see that C ⊆ B(R), and thus σ(C) ⊆ B(R). (Why?)
Conversely, suppose first that I = (a, b) is an open interval. Then I = (−∞, a]ᶜ ∩ ⋃n (−∞, b − 1/n], so I ∈ σ(C), i.e. σ(C) contains every open interval.
Now since every open subset of R can be represented as a union of countably many open intervals³, we see that σ(C) contains every open subset of R. Thus B(R) ⊆ σ(C). □

1.3 Measures
The notion of measure generalizes the concepts of length, area, volume, mass, and probability.

Definition 1.3.1 Let F be a σ–algebra on a set Ω. A function µ : F → R̄ is called a


(countably additive, non–negative) measure if and only if

(i) µ is non–negative: 0 ≤ µ(A) ≤ ∞ for each A ∈ F.

(ii) µ(∅) = 0.

(iii) µ is countably additive (or σ–additive): If A1, A2, · · · ∈ F is a countable sequence of pairwise disjoint sets, then

    µ( ⋃n An ) = Σn µ(An)
If µ(Ω) = 1, then µ is called a probability measure.

If F is a σ–algebra on a set Ω, then the pair (Ω, F) is called a measurable space. The elements
of F are called measurable sets, or events in the probabilistic framework. If, in addition, µ is
a measure on F, the triple (Ω, F, µ) is called a measure space. The symbols P, Q are used for
probability measures, and (Ω, F, P) will always denote a probability space.

Example 1.3.2 Important: Lebesgue Measure: We shall soon show that there is a unique
measure λ on (R, B(R)), which assigns every interval its length, i.e.

λ(a, b) = λ(a, b] = λ[a, b] = b − a

This measure is called Lebesgue measure, and provided the original impetus for the development of
the subject of measure theory.
Example 1.1.15 makes clear that Lebesgue measure is also important in probability theory:
Consider the experiment of drawing a uniformly distributed random number from the unit interval
³ Let U ⊆ R be a bounded open set, and let {qn : n ∈ N} enumerate the rationals in U. Define rn := sup{r : (qn − r, qn + r) ⊆ U}, and let In = (qn − rn, qn + rn). Then ⋃n In ⊆ U. Conversely, if x ∈ U, there is ε > 0 so that (x − ε, x + ε) ⊆ U. Choose qn such that |x − qn| < ε/2. Then if |y − qn| < ε/2, we have |x − y| ≤ |x − qn| + |qn − y| < ε, and thus x ∈ (qn − ε/2, qn + ε/2) ⊆ (x − ε, x + ε) ⊆ U. It follows that rn ≥ ε/2, and thus that x ∈ ⋃n In. Hence U ⊆ ⋃n In. If U is not bounded, note that U = ⋃n (U ∩ (−n, n)) is a union of bounded open sets.

[0, 1]. The probability of drawing a number between a and b (where 0 ≤ a ≤ b ≤ 1) is P([a, b]) =
b − a. Thus the appropriate measure P is just Lebesgue measure, restricted to [0, 1].
There are higher dimensional analogues of Lebesgue measure: There is a measure, also denoted
λ and called Lebesgue measure, on (Rn , B(Rn )), which assigns to every n–dimensional rectangle its
volume. 
Example 1.3.3 Let Ω be a set. For A ⊆ Ω, define |A| = no. of elements of A. Then | · | defines a
measure on (Ω, P(Ω)), called counting measure. 
Exercise 1.3.4 Let (X, F) be a measurable space, and let x₀ ∈ X. Define δ_{x₀} : F → R by

    δ_{x₀}(F) = 1 if x₀ ∈ F,  and  δ_{x₀}(F) = 0 if x₀ ∉ F

Show that δ_{x₀} is a measure on (X, F).
δ_{x₀} is called the Dirac measure, or point mass, at x₀. □
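Both counting measure and the point mass are simple enough to realize directly as functions on events. The sketch below (illustrative code, with names of my own choosing) checks non-negativity, µ(∅) = 0, and additivity on a pair of disjoint sets for each of them:

```python
def counting_measure(A):
    """|A| = number of elements of A (finite sets only in this sketch)."""
    return len(A)

def dirac(x0):
    """Return the point mass delta_{x0}, viewed as a function on events."""
    return lambda F: 1 if x0 in F else 0

A, B = {1, 2, 3}, {7, 8}                    # disjoint events
for mu in (counting_measure, dirac(2)):
    assert mu(set()) == 0                   # mu(empty set) = 0
    assert mu(A) >= 0 and mu(B) >= 0        # non-negativity
    assert mu(A | B) == mu(A) + mu(B)       # additivity on disjoint sets

print(counting_measure(A), dirac(2)(A), dirac(2)(B))  # 3 1 0
```

Of course only finite additivity can be tested this way; countable additivity is a property of the abstract definition.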
Example 1.3.5 Suppose that F : R → R is an increasing right–continuous function, i.e.

    F(s) ≤ F(t) when s ≤ t,  and  F(t) = lim_{s↓t} F(s)

We shall prove later that there is a unique measure µ_F on (R, B(R)) with the property that

    µ_F(a, b] = F(b) − F(a)  for all a < b ∈ R

µ_F is called the Lebesgue–Stieltjes measure associated with F. Note that if F(t) := t, then µ_F = λ is Lebesgue measure. □
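The defining property µ_F(a, b] = F(b) − F(a) can be exercised numerically. In the sketch below (hypothetical code; F is taken to be the standard exponential distribution function, chosen only as a convenient increasing right-continuous example), additivity over adjacent half-open intervals holds automatically because the F-values at the common endpoint cancel:

```python
import math

def F(t):
    """An increasing, right-continuous function (the Exp(1) distribution function)."""
    return 1 - math.exp(-t) if t >= 0 else 0.0

def mu_F(a, b):
    """Lebesgue-Stieltjes measure of the half-open interval (a, b]."""
    return F(b) - F(a)

# (0, 2] is the disjoint union of (0, 1] and (1, 2]:
assert math.isclose(mu_F(0, 2), mu_F(0, 1) + mu_F(1, 2))

# With F(t) = t, the same recipe assigns each interval its length:
length = lambda a, b: b - a
assert length(0.25, 0.75) == 0.5
print(mu_F(0, 1))  # 1 - 1/e = 0.632...
```

This telescoping is exactly why half-open intervals (a, b] are the convenient building blocks for Lebesgue–Stieltjes measures.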
Note that, for general measures, we allow +∞ as a value. For example, the length of the real line is +∞, so λ(R) = +∞, where λ is Lebesgue measure on (R, B(R)). However, we often need to get a “handle” on infinity:
Definition 1.3.6 A measure µ on a measurable space (Ω, F) is called

(a) finite, if µ(Ω) < ∞;

(b) σ–finite, if Ω is the countable union of sets of finite measure, i.e. if there is a sequence A1, A2, · · · ∈ F of measurable sets such that each µ(An) < ∞, and such that Ω = ⋃n An.

Exercise 1.3.7 (a) Lebesgue measure on (R, B(R)) is σ–finite, but not finite.
(b) Counting measure on (R, B(R)) is not σ–finite. 
The following exercise is often useful:
Exercise 1.3.8 Suppose that (Ω, F, µ) is a measure space, and that A ∈ F. Define
F ∩ A = {F ∩ A : F ∈ F}
(this is an abuse of notation), and let µA = µ|F ∩ A. Then (A, F ∩ A, µA ) is a measure space also
— the restriction of (Ω, F, µ) to A. 

1.4 Properties of Measures


1.4.1 Additivity properties
Recall that
A − B := A ∩ B c

Proposition 1.4.1 (Additivity Properties)


Suppose that (Ω, F, µ) is a measure space, and let A, B, A1 , A2 , · · · ∈ F.

(a) If A ⊆ B, then µ(A) ≤ µ(B).

(b) If A ⊆ B and µ(A) < ∞, then µ(B − A) = µ(B) − µ(A).

(c) µ(A ∪ B) = µ(A) + µ(B) − µ(A ∩ B)


(d) Countable Subadditivity: µ( ⋃n An ) ≤ Σn µ(An).

Proof: Suppose that A ⊆ B. Then B = A ∪ (B − A) is a union of disjoint sets, and hence µ(B) = µ(A) + µ(B − A). It follows that µ(B) ≥ µ(A). Furthermore, if µ(A) < ∞, then we obtain µ(B) − µ(A) = µ(B − A), proving (a), (b).
(c) follows from the fact that A ∪ B can be written as a disjoint union A ∪ B = A ∪ (B − A ∩ B), so that µ(A ∪ B) = µ(A) + µ(B − A ∩ B) = µ(A) + µ(B) − µ(A ∩ B).
(d): Define Bn inductively by

    B1 := A1,    B_{n+1} := A_{n+1} − (A1 ∪ · · · ∪ An)

Then Bn ⊆ An for all n, so that µ(Bn) ≤ µ(An). Moreover, the Bn are pairwise disjoint, ⋃_{k≤n} Bk = ⋃_{k≤n} Ak for all n, and ⋃n An = ⋃n Bn. Hence

    µ( ⋃n An ) = µ( ⋃n Bn ) = Σn µ(Bn) ≤ Σn µ(An)

Exercise 1.4.2 Suppose that (Ω, F, P) is a probability space, and that A, A1, A2, · · · ∈ F. Show that if P(An) = 1 for n ∈ N, then P( ⋂n An ) = 1 also. □

Next, we introduce some useful terminology:

Definition 1.4.3 Let (Ω, F, µ) be a measure space, and let A ⊆ Ω.

(a) We say that A is µ–null if there exists B ∈ F such that A ⊆ B and µ(B) = 0.

(b) We shall say that a statement ϕ holds µ–almost everywhere (or µ–almost surely in the
probabilistic framework), if the set of ω ∈ Ω where ϕ fails to hold is µ–null.

We abbreviate µ–almost everywhere and µ–almost surely by µ–a.e. and µ–a.s. respectively.

Remarks 1.4.4 Note that in (a) above, the set A might not belong to F so µ(A) might be
undefined. However, clearly µ(A) “ought” to be zero. Later, this insight will allow us to extend
measures to σ–algebras larger than the ones we start off with.
As an example of (b), consider the reals with Lebesgue measure: (R, B(R), λ). Every point is
λ–null, since λ{x} = λ[x, x] = x − x. Hence the set Q of rational numbers is λ–null: The set Q is
countable, and has an enumeration Q = {qn : n ∈ N}. By countable additivity,

    λ(Q) = Σn λ{qn} = 0

If the functions f, g : R → R are defined by

    f := 0,    g(x) := I_Q(x) := 1 if x ∈ Q, 0 else

then f, g are equal λ–almost everywhere. □


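The same conclusion can be reached via covers, which is how Lebesgue measure is actually constructed in Section 1.7: enclose the n-th rational in an interval of length ε/2ⁿ, so that all of Q ∩ [0, 1] is covered by intervals of total length at most ε, for every ε > 0. A sketch (hypothetical code, using a finite initial segment of an enumeration):

```python
from fractions import Fraction

def rationals_in_unit_interval(max_denominator):
    """A finite initial segment of an enumeration of Q ∩ [0, 1]."""
    seen = []
    for q in range(1, max_denominator + 1):
        for p in range(q + 1):
            x = Fraction(p, q)
            if x not in seen:
                seen.append(x)
    return seen

def total_cover_length(points, eps):
    """Cover the n-th point by an interval of length eps/2^n; return the total length."""
    return sum(Fraction(eps) / 2 ** (n + 1) for n in range(len(points)))

qs = rationals_in_unit_interval(20)
for eps in (Fraction(1, 10), Fraction(1, 1000)):
    # However many points we cover, the total length of the cover stays below eps:
    assert total_cover_length(qs, eps) < eps
print(len(qs), float(total_cover_length(qs, Fraction(1, 10))))
```

Since the geometric series ε/2 + ε/4 + · · · sums to ε, the bound persists no matter how many points are covered, which is the content of λ(Q) = 0.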
Exercise 1.4.5 (a) Let N denote the family of all µ–null sets. Prove that N is closed under countable unions.

(b) Prove that if (Ω, F, P) is a probability space and if An ∈ F are such that P(An) = 1 for all n ∈ N, then P( ⋂n An ) = 1 also.


1.4.2 Limit Operations on Sets


Definition 1.4.6 Let (An)_{n∈N} be a sequence of sets.

• We say that (An)n is increasing if A1 ⊆ A2 ⊆ A3 ⊆ . . . .
We say that An ↑ A if (An)n is increasing, and ⋃n An = A.

• Similarly, we say that (An)n is decreasing if A1 ⊇ A2 ⊇ A3 ⊇ . . . .
We say that An ↓ A if (An)n is decreasing, and ⋂n An = A.

• We define the limit superior of the sequence (An)n by

    lim supn An = ⋂_{n=1}^∞ ⋃_{k=n}^∞ Ak

• We define the limit inferior of the sequence (An)n by

    lim infn An = ⋃_{n=1}^∞ ⋂_{k=n}^∞ Ak

Remarks 1.4.7 1. Suppose that (An)_{n∈N} is a sequence of sets. Define

    Bn := ⋃_{k=n}^∞ Ak,    Cn := ⋂_{k=n}^∞ Ak

Observe that the sequence (Bn)n is decreasing (why?) and that Bn ↓ lim supn An. Similarly, (Cn)n is increasing (why?) and Cn ↑ lim infn An.

2. Also note the following simple interpretations of the above limit operations:

    x ∈ lim supn An ⇔ ∀n [x ∈ ⋃_{k=n}^∞ Ak]
                    ⇔ ∀n ∃k ≥ n [x ∈ Ak]
                    ⇔ x belongs to infinitely many of the sets Ak

Similarly,

    x ∈ lim infn An ⇔ ∃n [x ∈ ⋂_{k=n}^∞ Ak]
                    ⇔ ∃n ∀k ≥ n [x ∈ Ak]
                    ⇔ x belongs to all the Ak from some n onwards

In particular, x ∈ lim infn An iff x eventually belongs to all the An, i.e. belongs to all but finitely many of the An.


In light of Remarks 1.4.7(2), we use the following terminology in probability theory:

(An, i.o.) = lim supn An        (An, ev.) = lim infn An

where i.o. means infinitely often, and ev. means eventually.

Thus x ∈ (An , i.o.) iff x belongs to infinitely many of the sets An , etc.

Proposition 1.4.8 (a) lim infn An ⊆ lim supn An

(b) (lim supn An)ᶜ = lim infn Anᶜ,    (lim infn An)ᶜ = lim supn Anᶜ

Exercise 1.4.9 Prove Proposition 1.4.8. 



Remarks 1.4.10 ⋆ Recall the definitions of lim supn xn, lim infn xn of a sequence ⟨xn⟩n of real numbers — cf. Appendix B.3. How is this related to the definitions lim supn An and lim infn An of a sequence ⟨An⟩n of sets?
For A ⊆ Ω, define the indicator function I_A : Ω → {0, 1} of A by

    I_A(ω) = 1 if ω ∈ A,  0 else

(Elsewhere in the mathematical literature, indicator functions are often called characteristic func-
tions, but in probability theory, the term characteristic function has a different meaning.)
Suppose that (An : n ∈ N) is a countable sequence of subsets of Ω. For ω ∈ Ω, we have

    lim supn I_{An}(ω) = 1 ⇔ ω ∈ (An, i.o.) ⇔ I_{(An, i.o.)}(ω) = 1

Thus lim supn I_{An} = I_{lim supn An}. A similar argument shows that lim infn I_{An} = I_{lim infn An}. □
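For an eventually periodic sequence of sets the tail unions and intersections stabilize, so lim sup and lim inf can be computed from a finite truncation. The following sketch (hypothetical code, valid only under that periodicity assumption) checks the alternating sequence A, B, A, B, . . . , for which lim supₙ Aₙ = A ∪ B and lim infₙ Aₙ = A ∩ B, together with the indicator identity above:

```python
def lim_sup(sets):
    """For a periodic tail: intersection over n of union_{k >= n} A_k."""
    tails = [frozenset().union(*sets[n:]) for n in range(len(sets) // 2)]
    return frozenset.intersection(*tails)

def lim_inf(sets):
    """For a periodic tail: union over n of intersection_{k >= n} A_k."""
    tails = [frozenset.intersection(*sets[n:]) for n in range(len(sets) // 2)]
    return frozenset().union(*tails)

A, B = frozenset({1, 2, 3}), frozenset({3, 4})
seq = [A, B] * 10          # A_n alternates between A and B

assert lim_sup(seq) == A | B    # points lying in infinitely many A_n
assert lim_inf(seq) == A & B    # points lying in all but finitely many A_n

# Indicator identity: limsup_n I_{A_n}(w) equals I_{limsup A_n}(w).
for w in A | B:
    tail_max = max(1 if w in s else 0 for s in seq[10:])  # tail values recur forever
    assert tail_max == (1 if w in lim_sup(seq) else 0)
print(sorted(lim_sup(seq)), sorted(lim_inf(seq)))  # [1, 2, 3, 4] [3]
```

Here the point 3 lies in every set, so it is in (An, ev.); the other points recur infinitely often without ever being in all but finitely many sets.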

1.4.3 Limits of Sets and Measures


Proposition 1.4.11 (Continuity properties)
Suppose that (Ω, F, µ) is a measure space, and let A1, A2, · · · ∈ F.

(a) If An ↑ A, then µ(An ) ↑ µ(A).

(b) If An ↓ A, and if µ(An0 ) < ∞ for some n0 , then µ(An ) ↓ µ(A).

Proof: (a) Define Bn inductively for n ∈ N by

    B1 := A1,    B_{n+1} := A_{n+1} − An

Then

    µ(A) = µ( ⋃n An ) = µ( ⋃n Bn ) = Σn µ(Bn) = lim_{n→∞} Σ_{k=1}^n µ(Bk) = lim_{n→∞} µ( ⋃_{k=1}^n Bk ) = lim_{n→∞} µ(An)

(b) Note that ⋂_{n=1}^∞ An = ⋂_{n=n₀}^∞ An, so we may assume without loss of generality that n₀ = 1. Then A1 − An ↑ A1 − A as n → ∞, so that µ(A1 − An) ↑ µ(A1 − A). Now recall that µ(A1 − A) = µ(A1) − µ(A). □

Exercise 1.4.12 (a) Show that Propn. 1.4.11(b) may fail if we drop the assumption that at least one of the µ(An) is finite.

(b) Suppose that µ is finitely additive on the measurable space (Ω, F). Show that if

µAn → 0 whenever An ↓ ∅

then µ is countably additive.




We end this section with some important results:

Proposition 1.4.13 (a) FATOU’S LEMMA: If µ is a measure on (Ω, F), and if


A1 , A2 , · · · ∈ F, then
µ(lim inf An ) ≤ lim inf µ(An )
n n

(b) REVERSE FATOU LEMMA: If µ is a finite measure on (Ω, F), and if A1 , A2 , · · · ∈ F,


then
µ(lim sup An ) ≥ lim sup µ(An )
n n
Proof: (a) Let Bn = ⋂_{m≥n} Am. Then lim infn An = ⋃n ⋂_{m≥n} Am = ⋃n Bn, so Bn ↑ lim infn An.
By Propn. 1.4.11(a), we see that µ(Bn) ↑ µ(lim infn An). Also, clearly Bn ⊆ An, and so µ(An) ≥ µ(Bn). It follows that

    lim infn µ(An) ≥ lim infn µ(Bn) = µ(lim infn An)

(b) is left as an exercise. □

Exercise 1.4.14 We prove the reverse Fatou lemma:

(a) Let Bn = ⋃_{m≥n} Am. Explain why Bn ↓ lim supn An. Conclude that µ(Bn) ↓ µ(lim supn An).

(b) Explain why µ(Bn) ≥ sup_{m≥n} µ(Am).

(c) Conclude that µ(lim supn An) ≥ limn sup_{m≥n} µ(Am) = lim supn µ(An).

(d) Where did we need the fact that µ is a finite measure?




1.5 Lebesgue Measure from Coin Tossing


Consider again the random experiment of tossing a coin infinitely many times. We want to find an appropriate probability space for this experiment. It is clear that, if the coin is fair, each outcome is equally likely, and thus that the probability of a given outcome must be zero (because there are infinitely many outcomes). We can say this before we have even decided upon an appropriate sample space. This we tackle now:
Letting 1 stand for “Heads” and 0 for “Tails”, we take the sample space to be

Ω̂ = {0, 1}N

i.e. Ω̂ is the set of N–indexed sequences of 0’s and 1’s. We take a slightly different view, however. Every sequence of 0’s and 1’s can be regarded as the dyadic or binary expansion of a real number. For example, the sequence

    1 1 0 1 0 0 0 1 . . .

can be thought of as the binary number

    0.11010001 . . . (binary) = 1·(1/2) + 1·(1/2²) + 0·(1/2³) + 1·(1/2⁴) + 0·(1/2⁵) + 0·(1/2⁶) + 0·(1/2⁷) + 1·(1/2⁸) + · · · = 0.816 . . . (decimal)

We thus have a correspondence between sequences (an : n ∈ N) of 0’s and 1’s, and real numbers between 0 and 1:

    (an : n ∈ N) ↦ Σ_{n=1}^∞ an/2ⁿ

Clearly (0, 0, 0, . . . ) ↦ 0, (1, 1, 1, . . . ) ↦ 1, (1, 0, 0, 0, . . . ) ↦ 1/2, etc. In this way, every sequence of 0’s and 1’s provides us with a unique number between 0 and 1. The only problem is that this correspondence is not one-to-one:

    (1, 0, 0, 0, . . . ) ↦ 1/2,    (0, 1, 1, 1, . . . ) ↦ 1/2
However, we can deal with this in a clever way. Call a dyadic expansion terminating if it eventually
ends in all 0’s (tails), i.e. if there are only finitely many 1’s. Let T = {ω ∈ Ω̂ : ω is terminating}. It
is clear that every terminating expansion has associated with it a non–terminating expansion that
corresponds to the same number: If a_l = 1 is the last 1 in the terminating sequence (an : n ∈ N), then

    0.a₁a₂a₃ . . . a_{l−1} a_l 000 . . . (binary) = 0.a₁a₂a₃ . . . a_{l−1} 0111 . . . (binary)

Moreover, a little thought shows that if we chuck out the terminating sequences from Ω̂, then the
correspondence above is a bijection between Ω̂ − T and (0, 1]. How many such terminating dyadic
expansions are there? It is not hard to see that T is countable. Moreover, since each element of
Ω̂ has probability 0, the probability of the event T is also 0 (being a countable union of events of
probability 0). It is therefore practically certain that the event T won’t occur. The terminating
dyadic expansions are therefore, in a sense, redundant. Nothing is lost by chucking them out
(except a set of measure 0). We may therefore take the sample space to be the set

Ω = (0, 1]

A natural algebra for this experiment is

F = all events that can be decided after only finitely many tosses

Here are some examples of elements of F:



(a) A = the first toss is “Heads”. These are the dyadic numbers with a 1 in the first place, i.e. all the numbers from 0.1000 · · · = 1/2 to 0.1111 · · · = 1, but not including 0.1000 . . . , because it is terminating. Thus A = (1/2, 1].
(b) B = the third toss is “Tails”. This is the set of all dyadic numbers with a 0 in the third place.
These are the numbers in intervals
binary = decimal
(0.000000 . . . , 0.000111 . . . ] = (0, 0.125]
(0.010000 . . . , 0.010111 . . . ] = (0.25, 0.375]
(0.100000 . . . , 0.100111 . . . ] = (0.5, 0.625]
(0.110000 . . . , 0.110111 . . . ] = (0.75, 0.875]
Hence B = (0, 0.125] ∪ (0.25, 0.375] ∪ (0.5, 0.625] ∪ (0.75, 0.875].
(c) C = There are 2 “Heads” and 1 “Tail” in the first 3 tosses. A little thought shows that
C = (0.375, 0.5] ∪ (0.625, 0.75] ∪ (0.75, 0.875]

Reasoning along these lines, it is clear that F is the σ–algebra generated by all intervals of the form (k/2ⁿ, (k+1)/2ⁿ], where n ∈ N and k < 2ⁿ. It is therefore not hard to see that F = B(0, 1] (since every real number can be approximated arbitrarily closely by a dyadic rational, i.e. a number of the form k/2ⁿ).
Having identified the “right” σ–algebra, we turn to the probability measure appropriate for this
experiment.
(a) P(A) = probability that the first toss is “Heads”. Assuming a fair coin, this is clearly 1/2. Now the event A corresponds to the interval (1/2, 1], and λ(1/2, 1] = 1/2 (where λ is the Lebesgue measure introduced in Example 1.3.2). Thus P(A) = λ(A).
(b) P(B) = probability that the third toss lands “Tails”. This is clearly also 1/2, as the third toss is just as likely to land “Heads” as it is “Tails”. Now λ(B) = λ(0, 0.125] + λ(0.25, 0.375] + λ(0.5, 0.625] + λ(0.75, 0.875] = 4 × 1/8 = 1/2, and thus P(B) = λ(B) in this case also.
(c) P(C) = probability that there are 2 heads and 1 tail in the first 3 tosses. This probability is clearly (3 choose 2) · 2⁻³ = 3/8. In this case we therefore also have P(C) = λ(C).
It therefore becomes apparent that the “right” probability space for the random experiment of tossing a coin infinitely many times is just the same as that for the random experiment of picking a number from (0, 1] (Example 1.3.2).
All of this leads us to formulate the following principle:

Borel’s Principle:
Consider the random experiment of tossing a (fair) coin infinitely many times, and let E
be an event. Interpret E as a subset of (0, 1]. Then P(E) = λ(E), i.e. the probability
that the event E occurs is just the Lebesgue measure of the associated subset of the unit
interval.
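Borel’s principle can be probed computationally. The sketch below (hypothetical code) maps each length-3 prefix of tosses to its dyadic subinterval of (0, 1], recovers the four intervals found above for the event B = “the third toss is Tails”, and checks that their total length is 1/2:

```python
from fractions import Fraction
from itertools import product

def interval_for_prefix(bits):
    """Dyadic interval of outcomes whose first tosses are `bits` (1 = Heads, 0 = Tails).

    The non-terminating expansions beginning 0.b1...bn fill (k/2^n, (k+1)/2^n],
    where k is the integer with binary digits b1...bn.
    """
    n = len(bits)
    k = int("".join(map(str, bits)), 2)
    return Fraction(k, 2 ** n), Fraction(k + 1, 2 ** n)

# B = "the third toss is Tails": all length-3 prefixes with a 0 in position 3.
intervals = [interval_for_prefix(bits)
             for bits in product([0, 1], repeat=3) if bits[2] == 0]
total = sum(b - a for a, b in intervals)
assert total == Fraction(1, 2)      # P(B) = lambda(B) = 1/2
print(sorted((float(a), float(b)) for a, b in intervals))
# [(0.0, 0.125), (0.25, 0.375), (0.5, 0.625), (0.75, 0.875)]
```

The printed intervals are exactly those computed by hand in example (b) above, and the same recipe works for any event decided by finitely many tosses.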

1.6 Other Families of Sets


It turns out that σ–algebras can be quite complicated to deal with, especially if the sample space is infinite. In many cases, it is easier to work with simpler classes of sets, especially π–systems.

Definition 1.6.1 Let C be a collection of subsets of Ω

(a) C is called a π–system if it is closed under finite intersections.

(b) C is called a λ–system if

(i) Ω ∈ C;
(ii) A, B ∈ C and A ⊆ B implies B − A ∈ C;
(iii) If A1 , A2 , · · · ∈ C and An ↑ A, then A ∈ C.

(c) We denote by π(C) and λ(C) the π–, respectively, λ–system generated by C, i.e. the
smallest π–, respectively, λ–system on Ω which contains C.

Why do π(C), λ(C) always exist? It follows from the easily proved fact that the intersection of
an arbitrary family of π–systems (resp. λ–systems) is again a π–system (resp. λ–system).

Proposition 1.6.2 A family C of subsets of Ω is a σ–algebra iff it is both a π–system


and a λ–system.

Proof: It is clear that a σ–algebra is also a π–system and a λ–system.


Conversely, suppose C is both a π– and a λ–system. Then C is closed under complementation, by (i), (ii) of Defn. 1.6.1(b). De Morgan’s Laws applied to Defn. 1.6.1(a) show that C is closed under finite unions. Finally, given A1, A2, · · · ∈ C, let A = ⋃n An. Define

    Bn = ⋃_{m≤n} Am

Then each Bn ∈ C, and Bn ↑ A. Hence A ∈ C, by Defn. 1.6.1(b)(iii), and thus C is closed under countable unions. □
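On a small finite Ω, Proposition 1.6.2 can be tested mechanically. The helper functions below are a sketch (names and examples of my own choosing); note that on a finite Ω an increasing sequence of sets is eventually constant, so condition (iii) of Definition 1.6.1(b) is automatic once (i) and (ii) hold:

```python
from itertools import product

def is_pi_system(omega, C):
    """Closed under (finite) intersections."""
    return all(A & B in C for A, B in product(C, repeat=2))

def is_lambda_system(omega, C):
    """Omega in C, and proper differences stay in C.

    On a finite Omega, closure under increasing limits (condition (iii))
    follows automatically, since increasing sequences are eventually constant.
    """
    return omega in C and all(B - A in C for A, B in product(C, repeat=2) if A <= B)

omega = frozenset({1, 2, 3, 4})
# The sigma-algebra generated by the partition {1,2} | {3,4}:
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), omega}
assert is_pi_system(omega, F) and is_lambda_system(omega, F)

# A lambda-system that is NOT a pi-system (hence not a sigma-algebra):
D = {frozenset(), frozenset({1, 2}), frozenset({3, 4}),
     frozenset({1, 3}), frozenset({2, 4}), omega}
assert is_lambda_system(omega, D) and not is_pi_system(omega, D)
```

The family D fails to contain {1, 2} ∩ {1, 3} = {1}, so by Proposition 1.6.2 it cannot be a σ-algebra even though it is a λ-system.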

The following technical result often allows us to work with “easy” π–systems, instead of the
“difficult” σ–algebras:

Theorem 1.6.3 (Dynkin’s Lemma, Monotone Class Theorem)

(a) If C is a π–system on Ω, then


λ(C) = σ(C)

(b) Suppose that C is a π–system and that D is a λ–system (both on a set Ω), and also
that C ⊆ D. Then σ(C) ⊆ D.

Proof: (a) Let D = λ(C). By Propn. 1.6.2, it suffices to show that D is a π–system. We do this
in two steps.
STEP I: Fix C ∈ C, and define

DC = {A ∈ D : A ∩ C ∈ D}

Then C ⊆ DC ⊆ D (because C is a π–system). We now show that DC = D. To that end, it suffices


to show that DC is a λ–system (because then DC is a λ–system containing C, and D is the smallest
such). We therefore verify (i)-(iii) of Defn. 1.6.1:
(i) is obvious.
If A, B ∈ DC and A ⊆ B, then (B − A) ∩ C = (B ∩ C) − (A ∩ C). But B ∩ C, A ∩ C ∈ D by
definition of DC , and thus (B − A) ∩ C ∈ D, because D is a λ–system. Thus (B − A) ∈ DC .
Finally, if A1 , A2 , · · · ∈ DC and An ↑ A, then A1 ∩ C, A2 ∩ C, · · · ∈ D and (An ∩ C) ↑ A ∩ C. Hence
A ∩ C ∈ D, and so A ∈ DC .
We now know that DC = D for every C ∈ C.
STEP II: Now, fix any D ∈ D, and define

DD = {A ∈ D : A ∩ D ∈ D}

First note that if C ∈ C, then DC = D, so D ∈ DC . It follows that D ∩ C ∈ D, and thus that


C ∈ DD , for every C ∈ C. Thus C ⊆ DD , for all D ∈ D.
It follows as above that DD is a λ–system, and thus that DD = D, for all D ∈ D.
In particular, if A, B ∈ D, then A ∈ DB , and so A ∩ B ∈ D. This shows that D is a π–system,
and thus a σ–algebra.
(b) follows directly from (a): If C is a π–system and D a λ–system such that C ⊆ D, then
σ(C) = λ(C) ⊆ λ(D) = D — the smallest λ–system containing D is D itself. a

As an application, here is an easy but useful result: Two probability measures which agree on
a π–system agree on the σ–algebra generated by that π–system.

Proposition 1.6.4 Suppose that µ1, µ2 are finite measures on a measurable space (Ω, F), and let C be a π–system such that Ω ∈ C and σ(C) = F. Then if µ1 and µ2 agree on C, they agree on F.

Proof: Suppose that µ1, µ2 agree on C. Define D := {A ∈ F : µ1(A) = µ2(A)}. It is easy to verify that D is a λ–system:

• Since Ω ∈ C, we have µ1(Ω) = µ2(Ω), and so Ω ∈ D.

• If A, B ∈ D and A ⊆ B, then by the additivity properties of measures,

µ1 (B − A) = µ1 (B) − µ1 (A)
= µ2 (B) − µ2 (A) because A, B ∈ D
= µ2 (B − A)

so that B − A ∈ D also.

• Finally, if An ∈ D and An ↑ A, then

    µ1(A) = limn µ1(An) = limn µ2(An) = µ2(A)

by the continuity properties of measures, so A ∈ D as well.

Hence D is a λ–system. Since by assumption we have C ⊆ D, it follows by Dynkin’s Lemma (Theorem 1.6.3) that F = σ(C) ⊆ D, and thus that µ1(A) = µ2(A) for every A ∈ F. □

1.7 Lebesgue Measure

In this technical section, we construct the natural notion of length, namely Lebesgue measure on
R. We proceed as follows:

1. We first define the notion of outer measure on a set Ω, i.e. a monotone countably subadditive
map µ∗ : P(Ω) → R̄+ with µ∗ (∅) = 0.

2. We show that with each such outer measure is associated the σ–algebra M(µ∗ )
of µ∗ –measurable sets. Moreover, if we denote the restriction µ∗ ↾ M(µ∗ ) by µ, then µ is a
measure, i.e. (Ω, M(µ∗ ), µ) is a measure space.

3. We define the Lebesgue outer measure λ∗ : P(R) → [0, +∞] as follows: Given an interval I ⊆ R,
define |I| to be the length of the interval. Then, for A ⊆ R, we define

      λ∗ (A) := inf { Σn |In | : each In an interval, A ⊆ ∪n In }

4. We show that λ∗ (I) = |I| for every interval I: λ∗ assigns to each interval its length.

5. Finally, we show that B(R) ⊆ M(λ∗ ): Every Borel set is λ∗ –measurable.

1.7.1 Outer Measures and their σ–Algebras


Definition 1.7.1 Let Ω be a non–empty set.

(a) A map µ∗ : P(Ω) → [0, +∞] is said to be an outer measure on Ω iff:

(i) µ∗ (∅) = 0;
(ii) µ∗ is monotone: A ⊆ B implies µ∗ (A) ≤ µ∗ (B);
(iii) µ∗ is countably sub–additive: If A1 , A2 , · · · ⊆ Ω, then µ∗ (∪n An ) ≤ Σn µ∗ (An ).

(b) A set A ⊆ Ω is said to be µ∗ –measurable if and only if

µ∗ (E) = µ∗ (E ∩ A) + µ∗ (E ∩ Ac ) for all E ⊆ Ω
Note that we require µ∗ A to be defined for every subset A ⊆ Ω. We haven’t mentioned a base
σ–algebra. But there is one:

Theorem 1.7.2 Let µ∗ be an outer measure on Ω, let M(µ∗ ) be the family of all µ∗ –
measurable sets, and define µ := µ∗ ↾ M(µ∗ ). Then M(µ∗ ) is a σ–algebra, and µ is a
measure on (Ω, M(µ∗ )).

Proof: We must show that M(µ∗ ) is a σ–algebra, and that µ∗ is countably additive on M(µ∗ ).
Certainly ∅ is µ∗ –measurable. Also, it is obvious that A ∈ M(µ∗ ) implies Ac ∈ M(µ∗ ).
Next, we show that M(µ∗ ) is closed under finite intersections: Let A, B ∈ M(µ∗ ), and let
E ⊆ Ω. Then
µ∗ (E) = µ∗ (E ∩ A) + µ∗ (E ∩ Ac )
= µ∗ (E ∩ A ∩ B) + µ∗ (E ∩ A ∩ B c ) + µ∗ (E ∩ Ac )
= µ∗ (E ∩ (A ∩ B)) + µ∗ (E ∩ (A ∩ B)c )
≥ µ∗ E
where the third line follows because µ∗ (E ∩(A∩B)c ) = µ∗ (E ∩(A∩B)c ∩A)+µ∗ (E ∩(A∩B)c ∩Ac ) =
µ∗ (E ∩ A ∩ B c ) + µ∗ (E ∩ Ac ), and the final line holds because µ∗ is sub–additive. It follows that
A ∩ B ∈ M(µ∗ ). Hence M(µ∗ ) is an algebra.
Next, let A, B ∈ M(µ∗ ) be disjoint, and let E ⊆ Ω. Then B ⊆ Ac , and hence

µ∗ (E ∩ (A ∪ B)) = µ∗ ((E ∩ (A ∪ B)) ∩ A) + µ∗ ((E ∩ (A ∪ B)) ∩ Ac ) = µ∗ (E ∩ A) + µ∗ (E ∩ B) (∗)

Specializing to E = Ω proves that µ∗ is finitely additive on M(µ∗ ).

Now let A1 , A2 , . . . be a countable sequence of disjoint elements of M(µ∗ ), and let E ⊆ Ω.
Define Bn = ∪m≤n Am , and let A = ∪n An = ∪n Bn . Then by monotonicity and (∗), we see that

      µ∗ (E ∩ A) ≥ µ∗ (E ∩ Bn ) = Σm≤n µ∗ (E ∩ Am )

As n → ∞, and invoking the fact that µ∗ is subadditive, we see that

      µ∗ (E ∩ A) ≥ Σn µ∗ (E ∩ An ) ≥ µ∗ (E ∩ A)      (∗∗)

so that equality holds throughout. Countable additivity of µ∗ on M(µ∗ ) is then obtained by
specializing to the case E = Ω.
Also, since Bn ∈ M(µ∗ ), we see, by monotonicity and (∗∗), that

      µ∗ (E) = µ∗ (E ∩ Bn ) + µ∗ (E ∩ Bnc )
             ≥ Σm≤n µ∗ (E ∩ Am ) + µ∗ (E ∩ Ac )
             → µ∗ (E ∩ A) + µ∗ (E ∩ Ac )
             ≥ µ∗ (E)

Hence equality holds throughout, and so A ∈ M(µ∗ ). This proves that M(µ∗ ) is a σ–algebra. a
1.7.2 Lebesgue measure on R


First, we define the Lebesgue outer measure λ∗ : P(R) → [0, +∞]. Given an interval I ⊆ R, define
|I| to be the length of the interval. Then, for A ⊆ R, define

      λ∗ (A) := inf { Σn |In | : each In an interval, A ⊆ ∪n In }

Remarks 1.7.3 Note that we also have

      λ∗ (A) = inf { Σn |In | : each In a finite open interval, A ⊆ ∪n In }

For let λ̄(A) = inf { Σn |In | : each In a finite open interval, A ⊆ ∪n In }. Then clearly
λ∗ (A) ≤ λ̄(A). To prove the reverse inequality, fix A ⊆ R and ε > 0, and choose intervals In
such that Σn |In | < λ∗ (A) + ε/2. If |In | = +∞ for some n ∈ N, then clearly λ∗ (A) = +∞, in
which case λ̄(A) = λ∗ (A) (i.e. there is nothing to prove). Hence we may assume that each In is
a finite interval. Now let Jn be an open interval such that In ⊆ Jn and |Jn | ≤ |In | + ε2−n−1 .
Then λ̄(A) ≤ Σn |Jn | ≤ Σn |In | + ε/2 < λ∗ (A) + ε. Letting ε ↓ 0, we see that λ̄(A) ≤ λ∗ (A), as
required. 

Proposition 1.7.4 λ∗ is an outer measure on R, and λ∗ (I) = |I| for every interval I.

Proof: It is clear that λ∗ is a monotone increasing non–negative function with λ∗ (∅) = 0. To prove
that λ∗ is also countably sub–additive, let A1 , A2 , · · · ⊆ R, and fix an arbitrary ε > 0. By definition
of λ∗ we may, for each n ∈ N, choose open intervals In,1 , In,2 , . . . such that
      An ⊆ ∪k In,k      Σk |In,k | ≤ λ∗ (An ) + ε2−n      for all n ∈ N

Then

      ∪n An ⊆ ∪n ∪k In,k      λ∗ (∪n An ) ≤ Σn Σk |In,k | ≤ Σn λ∗ (An ) + ε

Since ε was arbitrary (> 0), it follows that λ∗ (∪n An ) ≤ Σn λ∗ (An ).

It remains to show that λ∗ (I) = |I| for every interval I ⊆ R. It is obvious that we always
have λ∗ I ≤ |I|. To prove the reverse inequality, first assume that I is a compact interval (i.e.
I = [a, b], for some −∞ < a ≤ b < +∞). Choose finite open intervals (a1 , b1 ), (a2 , b2 ), . . . such
that [a, b] ⊆ ∪n (an , bn ) — cf. Remarks 1.7.3. By compactness⁴, there is n such that

      [a, b] ⊆ ∪k=1..n (ak , bk )

We now check by induction on n that b − a ≤ Σk=1..n (bk − ak ). This is obvious if n = 1.
Now suppose that the assertion is true for n − 1, and let (a1 , b1 ), . . . , (an , bn ) be a cover for [a, b].
⁴ Or by a version of the Heine–Borel Theorem. . .

Without loss of generality (relabelling if necessary), we may assume that b ∈ (an , bn ). Then
(a1 , b1 ), . . . , (an−1 , bn−1 ) is a cover of the interval [a, an ]. By induction hypothesis, we have
an − a ≤ Σk=1..n−1 (bk − ak ), and hence

      b − a = (b − an ) + (an − a) ≤ (bn − an ) + Σk=1..n−1 (bk − ak ) = Σk=1..n (bk − ak )

as required.
It follows that b − a ≤ Σk=1..∞ (bk − ak ). Since the (an , bn ) were an arbitrary open covering of
[a, b], it follows that b − a ≤ λ∗ [a, b], i.e. that |I| ≤ λ∗ (I) for every compact interval I.
It remains to deal with intervals that are non–compact. If I is a bounded interval, then there
is a compact interval J such that J ⊆ I and |I| ≤ |J| + ε (where we fix ε > 0). It follows that
|I| ≤ λ∗ (J) + ε. Now since also λ∗ (J) ≤ λ∗ (I), we must have |I| ≤ λ∗ (J) + ε ≤ λ∗ (I) + ε. Letting
ε ↓ 0, we see that |I| ≤ λ∗ (I) if I is a bounded interval.
Finally, if I is an unbounded interval, then λ∗ (I) = +∞: Indeed, if K > 0 is arbitrary, there is
a compact interval J ⊆ I such that |J| ≥ K. Then

λ∗ (I) ≥ λ∗ (J) ≥ K

Letting K ↑ ∞, we see that λ∗ (I) = +∞ = |I| when I is unbounded. a
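The induction step above says that any finite open cover of [a, b] has total length at least b − a. This can be sanity-checked numerically; here is a small Python sketch (the three covers of [0, 1] are arbitrary hand-picked examples, and the covering test is a crude dense-grid check, not a proof):

```python
def covers(intervals, a, b, n=10000):
    """Crude numerical check that a finite family of open intervals
    covers [a, b]: test a dense grid of sample points."""
    return all(any(l < a + k * (b - a) / n < r for (l, r) in intervals)
               for k in range(n + 1))

# Three hand-picked finite open covers of [0, 1]:
examples = [
    [(-0.1, 0.6), (0.5, 1.2)],                    # total length 1.4
    [(-0.01, 0.34), (0.33, 0.67), (0.66, 1.01)],  # total length 1.04
    [(-1.0, 2.0)],                                # total length 3.0
]
for cover in examples:
    assert covers(cover, 0.0, 1.0)
    # the conclusion of the induction: total length >= b - a = 1
    assert sum(r - l for (l, r) in cover) >= 1.0
```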

Now that we have constructed an outer measure λ∗ , it follows by Thm. 1.7.2 that there is
a σ–algebra L(R) on R such that λ = λ∗ |L(R) is a measure on (R, L(R)). Indeed, we have
L(R) := M(λ∗ ), the family of all λ∗ –measurable sets. The σ–algebra L(R) is called the σ–algebra
of all Lebesgue measurable sets, or the Lebesgue algebra, on R.
Our next aim is to show that B(R) ⊆ L(R), i.e. that every Borel set is Lebesgue measurable.

Proposition 1.7.5 Every Borel set is Lebesgue measurable.

Proof: It suffices to prove that every interval of the form (−∞, a] is Lebesgue measurable, because
the collection of intervals of this form generates B(R). Let E ⊆ R be arbitrary, and let I = (−∞, a].
Fix ε > 0, and choose intervals I1 , I2 , . . . such that E ⊆ ∪n In and Σn |In | ≤ λ∗ (E) + ε. Note that
if J is an arbitrary interval, then so are J ∩ I and J ∩ I c , and |J| = |J ∩ I| + |J ∩ I c |.
Now

      λ∗ (E) + ε ≥ Σn |In |
                 = Σn |In ∩ I| + Σn |In ∩ I c |
                 ≥ λ∗ (E ∩ I) + λ∗ (E ∩ I c )

Letting ε ↓ 0, we see that λ∗ (E) ≥ λ∗ (E ∩ I) + λ∗ (E ∩ I c ), for all E ⊆ R. a

We have almost proved the following theorem:

Theorem 1.7.6 There exists a unique measure λ on (R, B(R)) such that λI = |I| for every
interval I.

Proof: By Propn. 1.7.4, λ∗ is an outer measure with λ∗ (I) = |I| for every interval, and thus it
follows by Thm. 1.7.2 that there is a σ–algebra L(R) on R such that λ = λ∗ |L(R) is a measure on
(R, L(R)). By Propn. 1.7.5, B(R) ⊆ L(R).
It remains to prove uniqueness: Suppose that µ is another measure on (R, B(R)) with the
property that µ(I) = |I| for all intervals I. Let In = [−n, n]. Then λn := λ ↾ In , µn := µ ↾ In are
finite measures on (In , B(In )). Now Cn = {J : J an interval, J ⊆ In } is a π–system which generates
B(In ), and λn , µn agree on Cn . Hence, by Propn. 1.6.4, λn , µn agree on B(In ).
Now, if B ∈ B(R), then by Propn. 1.4.11,

      λ(B) = limn→∞ λn (B ∩ In ) = limn→∞ µn (B ∩ In ) = µ(B)

1.7.3 Lebesgue Measure on Rd


It is possible to modify the construction of Lebesgue measure on R to obtain a unique measure on
(Rd , B(Rd )), also denoted by λ, which assigns to every rectangle its volume: A rectangle in Rd is a
set
R = I1 × I2 × · · · × Id where I1 , I2 , . . . , Id are intervals in R
The volume of such a rectangle is defined by vol(R) = |I1 | × |I2 | × · · · × |Id |.
We can then define a map λ∗ : P(Rd ) → [0, +∞] by

      λ∗ (A) = inf { Σn vol(Rn ) : each Rn a rectangle, A ⊆ ∪n Rn }

and show that λ∗ is an outer measure, and that every Borel set is λ∗ –measurable.
Instead of performing the construction outlined above, we elect to wait until we have constructed
products of measure spaces: Given measure spaces (Ωi , Fi , µi ), where i = 1, . . . , n, it is possible to
construct, in a canonical way, a measure space
      ( Πi≤n Ωi , ⊗i≤n Fi , ⊗i≤n µi )

It will turn out that

      B(Rd ) = ⊗i≤d B(R)      λd = ⊗i≤d λ1

where λd denotes the d–dimensional Lebesgue measure.


Thus it is possible to construct (Rd , B(Rd ), λd ) directly from (R, B(R), λ), and a repetition of
the construction of Lebesgue measure proves unnecessary.
Chapter 2

Measurable Functions and Random Variables

2.1 Definition of Measurable Function


Let f : A → S be a map between two sets. Recall that f induces a set map f −1 : P(S) → P(A)
between the power sets — in the opposite direction — by

f −1 [T ] = {a ∈ A : f (a) ∈ T }

We call the set f −1 [T ] the pullback (or inverse image) of T along f . See Section A.3 for more
information.

Remarks 2.1.1 Here is some motivation for the definition of measurable function.
Suppose that X is a random variable on a probability space (Ω, F, P), i.e. a function X : Ω → R
which assigns a number X(ω) to every outcome ω ∈ Ω — we will make this notion more precise
shortly. We would like to be able to discuss the probability that X = 0, or that X lies between −1
and 1, etc. Thus we’d like to know

P(X = 0) := P({ω ∈ Ω : X(ω) = 0}) = P(X −1 {0})


P(−1 ≤ X ≤ 1) := P({ω ∈ Ω : −1 ≤ X(ω) ≤ 1}) = P(X −1 [−1, 1])

However, P(F ) makes sense only if F ∈ F. Thus, in order to be able to discuss the above proba-
bilities, it is necessary that the sets

X −1 {0} := {ω ∈ Ω : X(ω) = 0} X −1 [−1, 1] := {ω ∈ Ω : X(ω) ∈ [−1, 1]}

belong to F.
More generally, given a Borel set B, we want to be able to discuss the probability that the
outcome X(ω) belongs to B. For

P(X ∈ B) := P({ω ∈ Ω : X(ω) ∈ B})


to make sense, it is necessary that the set


X −1 [B] := {ω ∈ Ω : X(ω) ∈ B}
belongs to F.
Thus: We can only meaningfully discuss the possible values of the random variable X in a
probabilistic setting if X −1 [B] ∈ F for every B ∈ B(R), i.e. it is necessary that
X −1 (B(R)) ⊆ F

A basic fact about pullbacks is that they preserve set operations — see Proposition A.3.1:
Proposition 2.1.2 If T, Tn ⊆ S, then

(a) f −1 [T c ] = (f −1 [T ])c

(b) f −1 [∪n Tn ] = ∪n f −1 [Tn ]

(c) f −1 [∩n Tn ] = ∩n f −1 [Tn ]

If f : A → S, and S is a family of subsets of S, we denote by f −1 S the family of subsets of A


defined by
f −1 S = {f −1 [T ] : T ∈ S}

Proposition 2.1.3 Suppose that (A, A), (S, S) are measurable spaces, and that f : A → S
is a map. Then

(i) A0 = f −1 S is a σ–algebra on A.

(ii) S 0 = {T ⊆ S : f −1 [T ] ∈ A} is a σ–algebra on S.
Exercise 2.1.4 Prove Propn. 2.1.2 and 2.1.3. 

f
Definition 2.1.5 (1.) Let (A, A), (S, S) be measurable spaces. A map A → S is said to
be A/S–measurable if and only if f −1 S ⊆ A (i.e. f −1 [T ] ∈ A for all T ∈ S).
If the σ–algebras A, S are obvious from context, we simply call f a measurable func-
tion.

(2.) A measurable function X from a measurable space (Ω, F) to (R, B(R)) is called a
random variable.
A measurable function X from a measurable space (Ω, F) to (Rd , B(Rd )) is called a
random vector.
More generally, any measurable function from a measurable space to a measurable
space is called a random element.
f
(3.) If S is a topological space, then a measurable function (S, B(S)) → (R, B(R)) is called
a Borel function.
We are usually interested in the case where S = R or Rd .

Thus we have the following pullback condition for measurability:

      A function is measurable iff pullbacks of measurable sets
      are measurable.


Remark 2.1.6 ∗ Note the similarity with the definition of continuous function: A function f between
topological spaces X, Y is continuous iff f −1 [V ] is an open subset of X whenever V is an open subset of Y ,
i.e. iff pullbacks of open sets are open. 

Remarks 2.1.7 (1) The notion of measure does not occur in the definition of measurable func-
tion/random variable: Only the measurable spaces ( = set + σ–algebra) play a role.
f
(2) If (A, A) → (S, S) is measurable, then

      f : A → S      f −1 : P(S) → P(A)

(3) If X is a random variable on (Ω, F) and B ⊆ R, we write

{X ∈ B} for X −1 [B] := {ω ∈ Ω : X(ω) ∈ B}

f
(4) We will also allow extended real–valued maps: A → R̄, where R̄ := [−∞, +∞].


To check if a function is measurable, it suffices to check the pullback condition on a generating
set:
f
Proposition 2.1.8 Suppose that (Ω, F), (S, S) are measurable spaces, and that Ω → S is
a map. Suppose further that C is a family of subsets of S such that σ(C) = S. Then f is
F/S–measurable iff f −1 C ⊆ F.

Proof: (⇒) is obvious: Clearly f −1 C ⊆ f −1 S, and if f is measurable, then f −1 S ⊆ F by definition
of measurability.
(⇐): Let T = {T ∈ S : f −1 [T ] ∈ F}. Then C ⊆ T , by assumption, and T is a σ–algebra, by
Propn. 2.1.2. (Check this!) Hence S = σ(C) ⊆ T ⊆ S i.e. T = S. a

2.2 Some Examples of Measurable Functions


Example 2.2.1 Let (S, S) be a measurable space. For each A ⊆ S define the indicator function
IA : S −→ R by (
1 if s ∈ A
IA (s) =
0 otherwise

If A is a measurable set, i.e. A ∈ S, and if B ∈ B(R) is a Borel set, then

      IA −1 [B] = ∅     if neither 0 nor 1 ∈ B
                = A     if 1 ∈ B and 0 ∉ B
                = Ac    if 0 ∈ B and 1 ∉ B
                = S     if both 0 and 1 ∈ B

−1
It follows that IA [B] ∈ S for every B ∈ B(R), so that IA is a measurable function.
−1
Similarly, if IA is a measurable function, then IA {1} = A ∈ S, since {1} is a Borel set. Thus:

A is a measurable set if and only if IA is a measurable function.

The above example is important enough to restate as a Proposition:

Proposition 2.2.2 Suppose that (Ω, F) is a measurable space, and that A ⊆ Ω. Then
the indicator IA : Ω → R is a measurable function iff A is a measurable set (i.e. IA is
F/B(R)–measurable iff A ∈ F).
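On a finite space the pullback condition for an indicator can be verified exhaustively, since IA takes only the values 0 and 1 and the four pullbacks computed in Example 2.2.1 decide the matter. A minimal Python sketch (the space and the σ–algebra are invented for illustration):

```python
Omega = frozenset({1, 2, 3, 4})
# A sigma-algebra on Omega (generated by the partition {1,2}, {3,4}):
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), Omega}

def indicator(A):
    return lambda w: 1 if w in A else 0

def pullback(f, B):
    """f^{-1}[B], as a subset of Omega."""
    return frozenset(w for w in Omega if f(w) in B)

def indicator_measurable(A):
    # An indicator only takes the values 0 and 1, so these four
    # pullbacks (as in Example 2.2.1) decide measurability:
    return all(pullback(indicator(A), B) in F
               for B in [set(), {0}, {1}, {0, 1}])

assert indicator_measurable(frozenset({1, 2}))       # A in F
assert not indicator_measurable(frozenset({1, 3}))   # A not in F
```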

We shall soon see that all measurable functions can be represented as limits of linear combinations
of measurable indicator functions, a fact that will be extremely important when we develop the
notions of integration and expectation.

Exercise 2.2.3 Suppose that Ω is a set, that F := {∅, Ω} is the trivial σ–algebra on Ω, and that
G := P(Ω) is the powerset–algebra on Ω.

f
(a) Determine all F/B(R)–measurable functions Ω → R.

f
(b) Determine all G/B(R)–measurable functions Ω → R.

Exercise 2.2.4 Suppose that (Ω, F), (S, S) are measurable spaces, and that f : Ω → S is F/S–
measurable.

(1.) Show that if G is a σ–algebra on Ω such that F ⊆ G, then f is also G/S–measurable.

(2.) Show that if T is a σ–algebra on S such that T ⊆ S, then f is also F/T –measurable.

Here are some special cases of Propn. 2.1.8:



Corollary 2.2.5 A function f : (Ω, F) → (R, B(R)) is measurable iff one of the following
conditions holds :

(a) {f ≤ c} ∈ F for all c ∈ R. (Recall that {f ≤ c} := {ω ∈ Ω : f (ω) ≤ c}.)

(b) {f < c} ∈ F for all c ∈ R.

(c) {f ≥ c} ∈ F for all c ∈ R.

(d) {f > c} ∈ F for all c ∈ R.

Proof: (a) In Propn. 2.1.8, take (S, S) = (R, B(R)) and C to be the collection of all intervals of the
form (−∞, c]. We already know that these intervals generate the Borel algebra on R — cf. Propn.
1.2.8.
(b),(c),(d) are proved similarly. a

Corollary 2.2.6 Any continuous function from R to R is a Borel function, i.e.


B(R)/B(R)–measurable.

Proof: Suppose that f : R → R is continuous, and c ∈ R. Given x ∈ {f < c}, let 0 < εx < c − f (x),
and choose δx > 0 so that |f (x) − f (y)| < εx whenever |x − y| < δx (by definition of continuity).
Then (x − δx , x + δx ) ⊆ {f < c}, and hence {f < c} = ∪x∈{f <c} (x − δx , x + δx ) is a union of open
intervals. By footnote 3 of Chapter 1, it follows that {f < c} is a countable union of open intervals.
Hence {f < c} ∈ B(R). a

Corollary 2.2.7 Any monotone function from R to R is a Borel function, i.e.


B(R)/B(R)–measurable.

f
Proof: If R → R is monotone, then {f < c} is an interval (for all c ∈ R): Suppose, for example,
that f is increasing, and let x0 = sup{x : f (x) < c}. Since f is increasing we have f (x) ≤ f (x0 )
whenever x ≤ x0 . Hence

      {f < c} = (−∞, x0 ]   if f (x0 ) < c
              = (−∞, x0 )   if f (x0 ) ≥ c
Hence {f < c} is an interval, and thus a member of B(R).
The case where f is decreasing can be dealt with similarly. a

The next proposition shows that if a σ–algebra F on Ω is generated by a partition {An : n ∈ N}


of Ω, then the F–measurable functions are precisely the functions which are constant on each of
the blocks An of the partition. The intuition behind this is as follows: If An is a block of a partition
which generates F, then F cannot distinguish between different elements of An . There is enough
information in F to decide whether the event An occurred, i.e. whether or not the outcome ω
belongs to An . But if ω1 , ω2 are distinct elements of An , then F cannot tell us which of ω1 , ω2 is
the outcome, if any. In the same way, an F–measurable function cannot tell the difference between
ω1 and ω2 : If ω1 , ω2 belong to the same block An and if f is F–measurable, then f (ω1 ) = f (ω2 ).

Proposition 2.2.8 Suppose that (Ω, F) is a measurable space, and that A = {An : n ∈ N}
is a partition of Ω which generates F, i.e. σ(A) = F. Then f : (Ω, F) → R̄ is measurable
if and only if f is constant on each block An .

Proof: Recall that every element of F is a union of some of the blocks An — see Exercise 1.2.5.
Further recall that each ω ∈ Ω belongs to exactly one block An .
Suppose first that f : (Ω, F) → R̄ is F/B(R̄)–measurable. Suppose that ω1 , ω2 belong to the
same block Ak . Let c := f (ω1 ). Then f −1 {c} ∈ F, and hence f −1 {c} is a union of blocks. Now
ω1 ∈ f −1 {c}, and hence Ak is one of the blocks in the union which makes up f −1 {c}, i.e. Ak ⊆
f −1 {c}. Since ω2 ∈ Ak , we also have ω2 ∈ f −1 {c}, and thus f (ω2 ) = c also. Thus f (ω1 ) = f (ω2 ). It
follows that f is constant (with value c) on the block Ak .
Conversely, suppose that f is constant on the blocks. Let f take the value cn on the block An ,
i.e. f (ω) = cn for all ω ∈ An . If B ∈ B(R̄), then
      f −1 [B] = {ω : f (ω) ∈ B} = ∪ {An : cn ∈ B}

Thus f −1 [B] is a union of blocks, and hence f −1 [B] ∈ F. a
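Proposition 2.2.8 is easy to test by brute force on a finite Ω. In the following Python sketch (the partition and the two functions are invented for illustration), σ(A) consists of all unions of blocks, as in Exercise 1.2.5, and a function passes the pullback test iff it is constant on each block:

```python
from itertools import chain, combinations

Omega = [1, 2, 3, 4]
blocks = [frozenset({1, 2}), frozenset({3}), frozenset({4})]  # a partition

def all_subsets(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# sigma(blocks) = all unions of blocks (cf. Exercise 1.2.5):
F = {frozenset().union(*combo) for combo in all_subsets(blocks)}

def measurable(f):
    """Pullback condition; for a finite range it suffices to test every
    subset of the set of values f actually takes."""
    values = sorted({f(w) for w in Omega})
    return all(frozenset(w for w in Omega if f(w) in B) in F
               for B in map(set, all_subsets(values)))

f = {1: 5.0, 2: 5.0, 3: -1.0, 4: 0.0}.get   # constant on each block
g = {1: 5.0, 2: 7.0, 3: -1.0, 4: 0.0}.get   # splits the block {1, 2}
assert measurable(f) and not measurable(g)
```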

2.3 Combinations of Measurable Functions


Measurable functions can be combined in a variety of ways to form new measurable functions:

Proposition 2.3.1 (a) Suppose that f, g : (Ω, F) → R̄ are measurable functions and that
α ∈ R. Then
f +g f2 αf f ·g f /g
are measurable functions, where we assume g 6= 0 on Ω for the case f /g.

(b) If fn : (Ω, F) → R̄ are measurable functions for n ∈ N, then

      supn fn      inf n fn      lim supn fn      lim inf n fn

are measurable.

(c) If fn : (Ω, F) → R̄ are measurable functions for n ∈ N, and if fn → f pointwise on Ω,


then f is measurable.

Proof: (a) Suppose that f, g are measurable. First, we show that f + g is measurable. By Propn.
2.1.8 it suffices to show that

{f + g > c} ∈ F for all c ∈ R

Now f (s) + g(s) > c iff f (s) > c − g(s) iff f (s) > q > c − g(s) for some q ∈ Q. Thus
      {f + g > c} = ∪q∈Q ( {f > q} ∩ {g > c − q} )

Now {f > q}, {g > c − q} ∈ F because f, g are measurable. Since Q is a countable set, {f + g >
c} ∈ F also.
Next, we show that f 2 is measurable. This follows easily from Propn. 2.1.8 using the fact that
      {f 2 ≤ c} = {−√c ≤ f ≤ √c}   if c ≥ 0
                = ∅                 if c < 0
To see that αf is measurable is easy, e.g. if α > 0, then {αf < c} = {f < c/α}.
Next, to see that f g is measurable, use the polarization identity

      f g = (1/4)[(f + g)2 − (f − g)2 ]
Finally, to see that f /g is measurable, it suffices to see that 1/g is measurable. But if c > 0

      {1/g < c} = ( {1/c < g} ∩ {g > 0} ) ∪ ( {1/c > g} ∩ {g < 0} )

Similar arguments work if c < 0 or c = 0.


(b) Note that

      {supn fn > c} = ∪n {fn > c}
that inf n fn = − supn (−fn ), that lim supn fn = inf n supk≥n fk and that lim inf n fn = − lim supn (−fn ).
(c) Note that if fn → f , then f = lim supn fn = lim inf n fn . a

Corollary 2.3.2 If f, g : (Ω, F) → R̄ are measurable functions, then so are

• f ∨ g := max{f, g} and f ∧ g := min{f, g}.

• f + := f ∨ 0 = max{f, 0} and f − := −(f ∧ 0) = max{−f, 0}.

• |f |

Proof: f ∨ g = sup{f, g}, f ∧ g = inf{f, g}. Furthermore |f | = f + + f − , as is easily verified. a


Remarks 2.3.3 With respect to the above definitions, it is useful to note that:
f = f+ − f− |f | = f + + f −

Exercise 2.3.4 Suppose that ⟨fn ⟩n is a sequence of measurable functions from a measurable space
(S, S) to R̄. Prove that the set {s ∈ S : limn fn (s) exists} is measurable. 
Measurability, like continuity, is preserved under composition:
f g
Proposition 2.3.5 If (A, A) → (S, S) and (S, S) → (T, T ) are measurable functions,
g◦f
then (A, A) → (T, T ) is measurable.

Proof: If C ∈ T , then (g ◦ f )−1 [C] = f −1 [g −1 [C]]. Now g −1 [C] ∈ S because g is measurable. Thus
f −1 [g −1 [C]] ∈ A, because f is measurable. a

2.4 Approximation by Simple Functions


We have already seen that an indicator function IA is a measurable function iff A is a measurable
set. It follows that from Propn. 2.3.1 that linear combinations of such indicators are measurable
as well.
f
Definition 2.4.1 A measurable function (Ω, F) → R̄ is called a simple function if ranf
is a finite set.

Let (Ω, F) be a measurable space. Recall that a finite or countably infinite sequence (Fn )n of
members of F is said toSform a partition of Ω iff (i) The Fn are mutually disjoint (i.e. Fn ∩ Fm = ∅
when n 6= m), and (ii) n Fn = Ω.

Proposition 2.4.2 A measurable function f : (Ω, F) → R̄ is simple iff it is a linear


combination of measurable indicator functions:
      f = Σi=1..n ci IFi

Moreover, the sets Fi ∈ F can be chosen to form a partition of Ω.

Proof: It is obvious that a function of the form f = Σi=1..n ci IFi (where each Fi ∈ F) is simple: f
can only take on values which are sums of finitely many of the ci .
Suppose now that f is simple, i.e. that ranf = {c1 , . . . , cn } is a finite set. Define Fi = f −1 {ci }
for i = 1, . . . , n. Then the Fi form a partition of Ω, and f = Σi=1..n ci IFi . a

Simple functions play an important part in integration theory. Many important results are
proved first for simple functions, and then extended to arbitrary measurable functions by taking
limits. The next result is therefore extremely important:

f
Proposition 2.4.3 (a) For any non–negative measurable function (Ω, F) → R̄+ there
exists a sequence of simple measurable functions fn , n ∈ N such that 0 ≤ fn ↑ f .
Moreover, if f is bounded, we can choose the fn so that fn → f uniformly.

(b) For any measurable function (Ω, F) → R̄, there is a sequence of simple measurable
functions such that fn → f . Moreover, if f is bounded, we can choose the fn so that
fn → f uniformly.

Proof: (a) Define

      fn (s) := 2−n [2n f (s)] ∧ n

where [x] is the greatest integer ≤ x. This elegant definition deciphers as follows:

      fn := Σk=0..n2^n −1 (k/2n ) I{k2−n ≤f <(k+1)2−n } + n I{f ≥n}

which means that


      If f (s) ≤ n, then fn (s) = k/2n exactly when k/2n ≤ f (s) < (k + 1)/2n
      If f (s) > n, then fn (s) = n

Thus fn is simple and non–negative. Moreover, 0 ≤ f (s) − fn (s) < 2−n whenever f (s) ≤ n.


Next, we show that ⟨fn ⟩n is an increasing sequence: Suppose that s ∈ S, and that fn+1 (s) =
m/2n+1 . Let k := [m/2], so that k ≤ m/2 < k + 1. Then k/2n ≤ m/2n+1 ≤ f (s) < (m + 1)/2n+1 ≤
(k + 1)/2n , so that fn (s) = k/2n . It follows that fn (s) ≤ fn+1 (s), i.e. that ⟨fn (s)⟩n is increasing.
Next, we show that fn (s) → f (s) for all s ∈ S. If f (s) = +∞, then fn (s) = n for all
n ∈ N, so certainly fn (s) → f (s). If f (s) < ∞, choose N such that f (s) < N . If n ≥ N , then
0 ≤ f (s) − fn (s) ≤ 2−n , and thus |f (s) − fn (s)| ≤ 2−n . Thus fn (s) → f (s) in this case also.
Finally, if f is bounded, i.e. f ≤ N for some N ∈ N, then we see that |f (s) − fn (s)| ≤ 2−n for
all n ≥ N and all s ∈ S, i.e. fn → f uniformly.
(b) Now let f be an arbitrary measurable function to R̄. Then f is the difference of two non–
negative measurable functions f = f + − f − (cf. Remarks 2.3.3), and thus, as in (a), there exist
non–negative simple functions fn+ , fn− such that fn+ ↑ f + , fn− ↑ f − . Clearly then also (fn+ −fn− ) → f .
Now note that if f is bounded, so are f + , f − . If the fn+ and fn− converge uniformly, then also
(fn+ − fn− ) → f uniformly. a
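The dyadic construction in the proof of (a) is easy to experiment with numerically. A small Python sketch (the sample function f is an arbitrary choice; the arithmetic here happens to be exact, since multiplying a double by 2^n only shifts its exponent):

```python
import math

def f_n(y, n):
    """The n-th dyadic approximation: 2^{-n} [2^n y], capped at n."""
    return min(math.floor((2 ** n) * y) / (2 ** n), n)

f = lambda x: x ** 2 + 0.5          # a sample non-negative function

for x in [0.0, 0.3, 1.7, 2.5]:
    y = f(x)
    vals = [f_n(y, n) for n in range(1, 21)]
    # the sequence increases ...
    assert all(a <= b for a, b in zip(vals, vals[1:]))
    # ... and once n >= y the error is below 2^{-n}:
    for n in range(7, 21):
        assert 0 <= y - f_n(y, n) < 2 ** (-n)
```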

Here follows a very powerful result:

Theorem 2.4.4 (Monotone Class Theorem)


Let H be a set of bounded functions from a set Ω into R satisfying the following conditions:

(i) H is a vector space.

(ii) The constant function 1 belongs to H.

(iii) Given any sequence hn of non–negative elements of H such that hn ↑ h, if h is


bounded, then h ∈ H.

Let A be a π–system on Ω with the property that IA ∈ H for every A ∈ A.


Then every bounded σ(A)–measurable function belongs to H.

Proof: Let D = {F ⊆ Ω : IF ∈ H}. It is not hard to show that D is a λ–system. By Dynkin's
Lemma (Thm 1.6.3), D ⊇ σ(A).
Let h be a non–negative, bounded σ(A)–measurable function, with upper bound K, i.e.

0 ≤ h(ω) ≤ K for all ω ∈ Ω

Let hn , n ∈ N be a sequence of simple σ(A)–measurable functions such that hn ↑ h, say
hn = Σk cn,k IA(n,k) with each A(n, k) ∈ σ(A) (cf. Propn. 2.4.2 and 2.4.3). Since σ(A) ⊆ D,
each A(n, k) ∈ D, i.e. IA(n,k) ∈ H. Because H is a vector space, we now see that hn ∈ H for
each n ∈ N. Thus, by (iii), h ∈ H as well.
We have now shown that every non–negative bounded σ(A)–measurable function belongs to H.
The same result can be obtained for arbitrary bounded h by splitting into positive and negative
parts: h = h+ − h− . a

2.5 Pushing Measures along Functions: The Laws of a Random


Variable
The next proposition shows that measures can be pushed forward along measurable functions.

f
Proposition 2.5.1 Suppose that (Ω, F) → (S, S) is measurable, and that µ is a (proba-
bility) measure on (Ω, F). Define a set function µf −1 on S by

      µf −1 (T ) := µ( f −1 [T ] )

Then µf −1 is a (probability) measure on (S, S). 
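On a finite probability space, pushing a measure forward along f amounts to re-binning the point masses. A Python sketch on the die space (the map X(ω) = ω mod 3 is an invented example, not the one in Exercise 2.5.4 below):

```python
from fractions import Fraction

# The die space: Omega = {1,...,6}, each point has probability 1/6.
P = {w: Fraction(1, 6) for w in range(1, 7)}

def pushforward(P, f):
    """The image measure P f^{-1} on the range of f, as point masses."""
    law = {}
    for w, p in P.items():
        law[f(w)] = law.get(f(w), Fraction(0)) + p
    return law

law = pushforward(P, lambda w: w % 3)    # an invented X(w) = w mod 3
assert law == {0: Fraction(1, 3), 1: Fraction(1, 3), 2: Fraction(1, 3)}
assert sum(law.values()) == 1   # P f^{-1} is again a probability measure
```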

Exercise 2.5.2 Prove Propn. 2.5.1. 


X
Remarks 2.5.3 If (Ω, F, P) → R is a random variable, then PX −1 is a probability measure on
(R, B(R)), called the distribution or law of the random variable X. Note that

PX −1 B = P(X ∈ B)


Exercise 2.5.4 (a) Suppose that (Ω, F, P) is the die space, i.e. Ω = {1, 2, . . . , 6}, F = P(Ω) and
P(ω) = 1/6 for all ω ∈ Ω. Define X : Ω → R : ω 7→ ω 2 − 5. Show that X is F/B(R)–measurable,
and determine the law of X, i.e. the measure PX −1 on (R, B(R)).

(b) Suppose that F : R → R : x 7→ x2 . Show that F is a Borel function, and calculate λF −1 [−1, 3]
(where λ is Lebesgue measure).


A measure µ on (R, B(R)) is said to be locally finite iff µ(I) < ∞ for every compact interval I.
The next theorem states that there is a one–to–one correspondence between locally finite measures
and increasing right–continous functions.

Theorem 2.5.5 ∗

(a) Suppose that F : R → R is a right–continuous increasing function with F (0) = 0.


There is a unique locally finite measure µ on (R, B(R)) with the property that

µ(a, b] = F (b) − F (a) −∞<a<b<∞

The measure µ is called the Lebesgue–Stieltjes measure associated with F .

(b) Conversely, given a locally finite measure µ on (R, B(R)), there is a unique right–
continuous increasing function F with F (0) = 0 so that

F (b) − F (a) = µ(a, b] −∞<a<b<∞



Exercise 2.5.6 ∗ We prove Thm. 2.5.5.

(a) Suppose that F is right–continuous increasing with F (0) = 0. Define a function g : R → R̄ by

g(t) = inf{s ∈ R : F (s) ≥ t} t∈R

(Recall that inf ∅ := ∞.)

(a.1) Show that g(t) ≤ x iff t ≤ F (x), so that g is a generalized inverse of F .


(a.2) Show that g is increasing and left–continuous.
(a.3) Explain why g is a Borel function.
(a.4) Define a measure µ on (R̄, B(R̄)) by µ := λg −1 , where λ is Lebesgue measure. Use (a.1)
to show that µ(a, b] = F (b) − F (a) whenever −∞ < a < b < ∞.
(a.5) Now prove the uniqueness of µ: Explain why if ν is any other measure on R that satisfies
ν(a, b] = F (b) − F (a) for all −∞ < a < b < ∞, then ν = µ.

(b) Suppose that µ is a locally finite measure on (R, B(R)). Define

      F (x) := µ(0, x]     if x ≥ 0
            := −µ(x, 0]    if x < 0

(b.1) Show that F is right–continuous increasing with F (0) = 0.


(b.2) Show that µ(a, b] = F (b) − F (a) whenever −∞ < a < b < ∞.
(b.3) Show that F is the unique function satisfying (b.1) and (b.2).
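The generalized inverse of part (a) can be explored for a concrete step function. In this Python sketch (the particular F below is an invented example: the "CDF" of point masses 1/2 at x = 1 and 1/2 at x = 2), property (a.1) is checked on a grid of points:

```python
def F(x):
    """A right-continuous increasing step function with F(0) = 0."""
    if x < 1:
        return 0.0
    if x < 2:
        return 0.5
    return 1.0

def g(t):
    """g(t) = inf{s : F(s) >= t}, with inf of the empty set = +infinity."""
    if t <= 0:
        return float("-inf")
    if t <= 0.5:
        return 1.0
    if t <= 1.0:
        return 2.0
    return float("inf")

# Property (a.1): g(t) <= x  iff  t <= F(x), checked on a grid.
xs = [k / 10 for k in range(-20, 40)]
ts = [k / 20 for k in range(-10, 30)]
assert all((g(t) <= x) == (t <= F(x)) for t in ts for x in xs)
```

Note that g jumps exactly at the flat stretches of F and is flat across the jumps of F, which is why it is left-continuous where F is right-continuous.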
Chapter 3

Information and Independence

3.1 Conditional Probability and Independence of Events


Probability is all about information. I toss a coin and see that it lands heads. You don't see
the coin. For you the probability that the coin has landed heads is 1/2, but for me it is 1. New
information changes the probability measure.
Let (Ω, F, P) be a probability space, and let A, B be events. Knowledge that B has occurred
can change our estimation of the probability that A has occurred. We write P(A|B) for the
probability that A occurs given that we know that B has occurred. We call P(A|B) the conditional
probability of A given B.

Example 3.1.1 A die is rolled. Let A be the event that the outcome is a 6, let B be the event
that the outcome is an even number, and let C be the event that the outcome is an odd number.
Clearly P(A) = 1/6. However, if we know for sure that the outcome is an even number, then the
probability of getting a 6 is 1/3, i.e. P(A|B) = 1/3. In the same way, if B occurs, then C cannot
possibly occur, so although P(C) = 1/2, we have P(C|B) = 0. 

Basically, what’s happening here is that we have to modify our probability measure to accom-
modate the “new” information that B has occurred. If P( |B) is the new probability measure on
(Ω, F), then we must have P(B|B) = 1 and P(B c |B) = 0. If A is another event, then A occurs if
and only if A ∩ B occurs, since we know that B also occurs, and it makes sense to assume that the
new probability that A occurs is proportional to the old probability that A ∩ B occurs, i.e. that
P(A|B) = cP(A ∩ B) for some constant c. To ensure P(B|B) = 1, we must have c = P(B)−1 . We
therefore find that
      P(A|B) = P(A ∩ B)/P(B)
the standard formula given in elementary probability theory texts.
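The formula is easy to check mechanically on the die space of Example 3.1.1; here is a Python sketch using exact rational arithmetic:

```python
from fractions import Fraction

Omega = set(range(1, 7))            # the die space, P({w}) = 1/6

def P(event):
    return Fraction(len(event & Omega), 6)

def P_given(A, B):
    """P(A|B) = P(A n B) / P(B)."""
    return P(A & B) / P(B)

A = {6}            # the outcome is a six
B = {2, 4, 6}      # the outcome is even
C = {1, 3, 5}      # the outcome is odd
assert P_given(A, B) == Fraction(1, 3)   # as in Example 3.1.1
assert P_given(C, B) == 0
assert P_given(B, B) == 1
```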

Exercises 3.1.2 (1.) Prove that P( |B) is a probability measure on (Ω, F).


(2.) For events A1 , . . . , An , prove that

P(A1 ∩ · · · ∩ An ) = P(A1 )P(A2 |A1 )P(A3 |A1 ∩ A2 ) . . . P(An |A1 ∩ · · · ∩ An−1 )

Example 3.1.3 A couple has two children. Assuming that boys and girls are equally likely, and
given that one of the children is a girl, what is the probability that the other child is also a girl?
We can model this probability space as follows:

    Ω = {bb, bg, gb, gg}

    F = P(Ω)

    P(ω) = 1/4 for all ω ∈ Ω
Let B be the event that at least one of the children is a girl, i.e. B = {bg, gb, gg}, and let A be the
event that both children are girls, i.e. A = {gg}. Then

    P(A|B) = P(A ∩ B)/P(B) = (1/4)/(3/4) = 1/3
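The same conditioning can be computed exhaustively in code; a minimal sketch using the model just described (outcomes encoded as two-character strings, older child first, which is my own convention):

```python
from fractions import Fraction

omega = ["bb", "bg", "gb", "gg"]          # older child listed first
P = {w: Fraction(1, 4) for w in omega}

B = {w for w in omega if "g" in w}        # at least one child is a girl
A = {"gg"}                                 # both children are girls

p = sum(P[w] for w in A & B) / sum(P[w] for w in B)
print(p)  # 1/3
```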

Two events A, B are said to be independent if knowledge of B tells us nothing about A, and
vice versa. By this we mean that our estimate of the probability that A occurs isn’t revised by the
knowledge that B has occurred. Thus:

    P(A) = P(A|B) = P(A ∩ B)/P(B)

and hence we have the multiplication law

P(A ∩ B) = P(A) · P(B)

The above equation gives us the definition of independent events:

Definition 3.1.4 Let (Ω, F, P) be a probability space. A (possibly infinite) set A = {Ai :
i ∈ I} of events is said to be an independent family provided that for any distinct
i1 , i2 , . . . , in ∈ I

P(Ai1 ∩ Ai2 ∩ · · · ∩ Ain ) = P(Ai1 )P(Ai2 ) . . . P(Ain )

Example 3.1.5 It’s worth pointing out that whether or not two events are independent depends
on the probability measure, i.e. it is possible for events to be independent under one measure, but
not under another. The notion of independence is therefore a genuinely probabilistic notion, which
has no analogue in general measure theory.

(a) Consider the random trial of tossing a coin twice. The sample space Ω is the 4–element set
{HH, HT, TH, TT} and the associated σ–algebra is just P(Ω). Intuitively, if the coin is fair,
the outcome of the first coin should have no influence on the second. Thus knowing that the first
coin has landed heads should make no difference to whether the second coin lands heads. Let
B = {HH,HT} be the event that the first coin lands heads, and let A = {HH,TH} be the event
that the second coin lands heads. Then P(A ∩ B) = P({HH}) = 41 , and P(A) · P(B) = 12 · 12 = 41 .
Thus P(A ∩ B) = P(A) · P(B), i.e. the events A and B are indeed independent.

(b) Consider the same experiment as in (a), but with one important difference: Before the exper-
iment starts, we are told that the coin is unfair. It has either two heads, or two tails, but we
are not told which. Each possibility is equally likely.
To model this, we use a different probability measure Q, which has

    Q({HH}) = 1/2 = Q({T T }),    Q({HT }) = 0 = Q({T H})

In this case Q(A ∩ B) = 1/2, whereas Q(A)Q(B) = 1/4. Thus A and B are not independent under
Q.
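Both measures in (a) and (b) can be checked exhaustively; a small sketch with exact arithmetic (the string encoding of outcomes is my own):

```python
from fractions import Fraction

omega = ["HH", "HT", "TH", "TT"]
P = {w: Fraction(1, 4) for w in omega}                      # fair, independent tosses
Q = {"HH": Fraction(1, 2), "TT": Fraction(1, 2),
     "HT": Fraction(0), "TH": Fraction(0)}                  # two-headed or two-tailed

A = {w for w in omega if w[1] == "H"}    # second toss lands heads
B = {w for w in omega if w[0] == "H"}    # first toss lands heads

def indep(m, A, B):
    """Check the multiplication law m(A ∩ B) = m(A) m(B)."""
    pr = lambda E: sum(m[w] for w in E)
    return pr(A & B) == pr(A) * pr(B)

print(indep(P, A, B), indep(Q, A, B))  # True False
```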

Remarks 3.1.6 Can an event be independent of itself, i.e. given an event A, can the events A, A
be independent? Here we have to be a little careful. From the intuitive point of view, the answer
would seem to be no, since the information that the event A has occurred will certainly make us re-
evaluate our estimation of the probability that A has occurred! However, if we look at the definition,
A and A will be independent provided P(A ∩ A) = P(A)P(A), i.e. provided P(A) = P(A)^2. This can
happen only if P(A) is either 0 or 1. That’s not too far removed from our intuition. If P(A) = 1,
for example, then A happens almost surely, so telling us that A has happened does not really give
us any information. We were practically certain that it would anyway. 

Exercise 3.1.7 A gambling game involves the rolling of a fair die followed by the flipping of a fair
coin.

(a) Set up a reasonable probability space to model this situation.

(b) Let A be the event that the die lands on an even number, and let B be the event that the coin
lands tails. Show that A and B are independent events.

Exercise 3.1.8 An breathalizer test for drinking and driving is 95% accurate, i.e. it gives the
correct result 95% of the time. John lives in a small town with a 1000 inhabitants, about 50 of
whom are drunk on any given evening. One evening, John is stopped by the police, and tested.
The test says that John is drunk. What is the probability that John is drunk? 
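A Monte Carlo sanity check is easy to set up for this exercise. The sketch below assumes the test errs with probability 5% for drunk and sober drivers alike, which is one reading of "95% accurate":

```python
import random

random.seed(0)
positive = drunk_and_positive = 0

for _ in range(200_000):
    drunk = random.random() < 50 / 1000        # about 50 of 1000 inhabitants are drunk
    correct = random.random() < 0.95           # the test is right 95% of the time
    says_drunk = drunk if correct else (not drunk)
    if says_drunk:
        positive += 1
        drunk_and_positive += drunk

print(drunk_and_positive / positive)  # an estimate of P(drunk | test says drunk)
```

The estimate should agree with whatever answer you derive by conditioning.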
48 Information in Random Variables

Exercise 3.1.9 (Monty Hall Problem)


On a TV game show there are three doors, 1, 2 and 3. Behind one door, there is a brand new
Porsche, but behind each of the remaining two doors there is a goat. You are a contestant on
this show, and you are asked to choose a door. You will win whatever is behind the door. The
probability of winning the Porsche is therefore 1/3. Say you choose door 1. Now before opening
door 1, the game show host opens one of the other doors, and reveals a goat. (He can always do
this, because there are two goats. At least one of doors 2 and 3 must hide a goat.) Suppose he
opens door 2. He now asks you whether you would like to change your mind and opt for door 3.
Should you do it? What is the probability that the Porsche is behind door 3, conditional on the above
information? 
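The unconditional value of switching can be estimated by simulation before (or after) reasoning it out; a minimal sketch (the host here always opens the lowest-numbered eligible door, an assumption made only for the illustration):

```python
import random

random.seed(1)
n = 100_000
switch_wins = 0

for _ in range(n):
    car = random.randint(1, 3)
    choice = 1                                   # you pick door 1
    # Host opens the lowest-numbered door that is neither yours nor the car.
    opened = next(d for d in (2, 3) if d != choice and d != car)
    # Switching means taking the remaining closed door.
    switched = next(d for d in (1, 2, 3) if d != choice and d != opened)
    switch_wins += (switched == car)

print(switch_wins / n)  # close to 2/3
```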

3.2 Information in Random Variables


We already introduced the notion of a σ–algebra generated by a family of sets. We can use this to
define the notion of a σ–algebra generated by a random variable.

Definition and Proposition 3.2.1 (a) Let (S, S) be a measurable space, and suppose that
X is a collection of functions Ω → S. There is a smallest σ–algebra on Ω, denoted by

    σ(X )

such that all X ∈ X are σ(X )/S–measurable. σ(X ) is called the σ–algebra
generated by X .
We also write σ(Xi : i ∈ I) for the σ–algebra generated by the family X = {Xi : i ∈ I}.

(b) If X is a measurable function, then

σ(X) = {X −1 [T ] : T ∈ S}

Proof: (a) Let


C = {X −1 [T ] : X ∈ X , T ∈ S}
Then C is a family of subsets of Ω, and clearly σ(X ) = σ(C). (We already know what is meant by
σ(C), as C is a family of sets.)
(b) By (a), σ(X) is the smallest σ–algebra which includes the family C = {X −1 [T ] : T ∈ S}.
However, by Propn. 2.1.3, C is a σ–algebra, and thus C = σ(X). a

In the probabilistic framework, σ–algebras play the role of carriers of information: Earlier, we
saw that if (Ω, F, P) is a probability space, then

• F is the set containing all those events for which it can be decided whether or not they
occurred.

• If C is a family of events, then σ(C) is the set containing all those events for which it can be
decided whether or not they occurred, given that we can decide all the events in C.

Similarly:

For a random variable X on a probability space Ω, the σ–algebra σ(X) can be interpreted
in two ways (which are two sides of the same coin):

• σ(X) is the information carried by X: It is the set of all events that can be decided,
given that we know the value of X.

• It is the smallest amount of information that we need in order to be able to determine


the value of X.
Example 3.2.2 For example, consider the experiment of rolling a die, so that Ω = {1, 2, . . . , 6}
and F = P(Ω). Let the random variable X : Ω → R be defined by
    X(ω) = 0 if ω is even,    X(ω) = 1 if ω is odd.

It is easy to check that

    σ(X) = {∅, {1, 3, 5}, {2, 4, 6}, Ω}
Let’s consider our interpretations (i) and (ii) above:
(i) If we know the value of X, all we know is whether the outcome of rolling the die is an even
number or an odd number, i.e. all we can decide is whether {2, 4, 6} or {1, 3, 5} occurred (in
addition to being able to decide the certain and impossible events).
(ii) To know the value of X, all we need to know is whether the outcome of the die roll was even
or odd. We do not need to know the exact outcome of rolling the die.
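For finite Ω the recipe of Proposition 3.2.1(b) can be carried out literally; a small sketch for the parity random variable above:

```python
from itertools import combinations

omega = frozenset(range(1, 7))
X = {w: 0 if w % 2 == 0 else 1 for w in omega}   # the parity random variable

# sigma(X) = { X^{-1}[T] : T a subset of the range of X }   (finite case)
image = sorted(set(X.values()))
subsets = [set(c) for r in range(len(image) + 1) for c in combinations(image, r)]
sigma_X = {frozenset(w for w in omega if X[w] in T) for T in subsets}

print(sorted(sorted(s) for s in sigma_X))
# [[], [1, 2, 3, 4, 5, 6], [1, 3, 5], [2, 4, 6]]
```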


Exercise 3.2.3 Suppose that X : (Ω, F) → (R, B(R)) is a function, and that ω1 , ω2 ∈ Ω are two
elements with the following property:
For all F ∈ F we have ω1 ∈ F ⇔ ω2 ∈ F
Show that if X is F–measurable, then X(ω1 ) = X(ω2 ). Thus if F cannot distinguish between ω1
and ω2 , neither can any F–measurable random variable.
[Hint: Define x := X(ω1 ) and consider X −1 {x}.] 

If X, Y are random variables such that σ(Y ) ⊆ σ(X), then the information needed to determine
the value of Y is a subset of the information required to determine the value of X. Hence, if we
know the value of X, we should also know the value of Y . This suggests that Y is a function of X.
The following theorem makes this precise.
Theorem 3.2.4 (Doob–Dynkin Lemma)
Suppose that Xi , Y : (Ω, F) → (R, B(R)) (i = 1, . . . , n) are measurable. Then Y
is σ(X1 , . . . , Xn )–measurable iff there is a Borel function h : Rn → R such that Y =
h(X1 , . . . , Xn ).

Proof: (⇐): We first show that the map X : Ω → Rn , ω 7→ (X1 (ω), . . . , Xn (ω)), is σ(X1 , . . . , Xn )/B(Rn )–
measurable. By Propn. 2.1.8, it suffices to check that X^{-1}[ ∏_{i=1}^n (−∞, ci ] ] ∈ σ(X1 , . . . , Xn ) for all
(c1 , . . . , cn ) ∈ Rn , because the family of these lower orthants generates B(Rn ). But

    X^{-1}[ ∏_{i=1}^n (−∞, ci ] ] = ⋂_{i=1}^n Xi^{-1} (−∞, ci ]

so this is obvious. Now h(X1 , . . . , Xn ) = h ◦ X is a composition of measurable functions, and hence
measurable.
(⇒): First assume that Y is simple, i.e. Y = Σ_{j=1}^d yj I_{Aj} for some family of mutually disjoint
sets Aj (cf. Propn. 2.4.2). Since Y is assumed to be σ(X1 , . . . , Xn )–measurable, we see that each
Aj = Y^{-1} {yj } belongs to σ(X1 , . . . , Xn ). Define X = (X1 , . . . , Xn ), as above. Reasoning as in
Propn. 3.2.1, it is easy to see that

    A ∈ σ(X1 , . . . , Xn ) iff A = X^{-1} [B] for some B ∈ B(Rn )

and thus Aj = X^{-1} [Bj ] for some Bj ∈ B(Rn ). Now define

    h = Σ_{j=1}^d yj I_{Bj}

Then h(X1 , . . . , Xn ) = Y , as required.


Now assume that Y is an arbitrary σ(X1 , . . . , Xn )–measurable random variable. Choose a se-
quence of simple random variables Yk (k ∈ N) such that Yk → Y pointwise (cf. Propn. 2.4.3).
Hence there exist Borel functions hk such that Yk = hk (X1 , . . . , Xn ). Let M = {x ∈ Rn :
⟨hk (x)⟩k converges}. Then M ∈ B(Rn ) (because M = g^{-1} {0}, where g = lim sup_k hk − lim inf_k hk
is measurable, by Proposition 2.3.1). Define h : Rn → R by

    h(x) = lim_k hk (x) if x ∈ M,    h(x) = 0 else.

Then h = (lim_k hk ) I_M is a Borel function. Now

    Y (ω) = lim_k Yk (ω) = lim_k hk (X1 (ω), . . . , Xn (ω))

which implies two things: (i) (X1 (ω), . . . , Xn (ω)) ∈ M , and (ii) Y = h(X1 , . . . , Xn ), as required. □

3.3 Independence of σ–algebras and Random Variables


The intuitive idea about independence is the following: Two events are independent if the informa-
tion that one of the events has occurred does not lead us to revise our estimation of the probability
that the other has occurred. Now σ–algebras are the carriers of information, and we would there-
fore like a definition of independence which involves σ–algebras. We therefore define independence
anew:

Definition 3.3.1 Let (Ω, F, P) be a probability space.

• Sub–σ–algebras G1 , G2 , . . . of F are said to be independent if events in distinct Gn
  are independent, i.e. whenever n1 , n2 , . . . , nm ∈ N are distinct positive integers and
  Gn1 ∈ Gn1 , Gn2 ∈ Gn2 , . . . , Gnm ∈ Gnm are events, we have

      P(Gn1 ∩ Gn2 ∩ · · · ∩ Gnm ) = ∏_{k≤m} P(Gnk )

• A random variable X is independent of a σ–algebra G if σ(X), G are independent


σ–algebras.

• X1 , X2 . . . are independent random variables if σ(X1 ), σ(X2 ), . . . are independent


σ–algebras

The basic idea is that two σ–algebras are independent if there is no information about an event
in one σ–algebra that would lead us to revise our estimate of the probability of any event in the
other σ–algebra.
Other variations (e.g. what it means for random variables X1 , . . . , Xn to be independent)
should be obvious.
We use the symbol ⊥⊥ to denote independence. Thus X ⊥⊥ G means that the random variable
X is independent of the σ–algebra G, etc.

Example 3.3.2 Suppose that A, B are events in some probability space (Ω, F, P). Then A =
{Ω, A, Ac , ∅} and B = {Ω, B, B c , ∅} are the σ–algebras of events that can be decided by knowledge
of A, B respectively. It is easy to show that A, B are independent events if and only if A, B are
independent σ–algebras. For example, P(A) = P(A ∩ B) + P(A ∩ B c ), and thus P(A ∩ B c ) =
P(A)[1 − P(B)] = P(A)P(B c ), by independence of A, B. It follows that A, B c are independent if
A, B are. The other combinations of events are similarly proven independent. 

Recall that we decomposed the notion of σ–algebra into two parts, namely π–systems and λ–
systems (cf. Proposition 1.6.2). The idea is that π–systems are easy to work with, whereas λ–
systems mesh well with the properties of measures. As an example, we have the following result:
If two π–systems are independent, so are the σ–algebras generated by those π–systems:

Theorem 3.3.3 Let {Ct }t∈T be a collection of independent π–systems on (Ω, F, P). Then
{σ(Ct )}t∈T is a collection of independent σ–algebras.

Proof: We must show that if t1 , . . . , tn ∈ T are distinct, then σ(Ct1 ), . . . , σ(Ctn ) are independent.
We proceed by recursion. Fix t1 , . . . , tn ∈ T , define Ftk := σ(Ctk ), and also fix Ct2 ∈ Ct2 , . . . , Ctn ∈
Ctn . Let
D := {F ∈ Ft1 : P(F ∩ Ct2 ∩ · · · ∩ Ctn ) = PF · PCt2 · . . . · PCtn }
By assumption, Ct1 ⊆ D. Using the additivity and continuity properties of measures, it is straight-
forward to check that D is a λ–system. Thus by Thm. 1.6.3 we have D = Ft1 for every selection of

Ctk ∈ Ctk k = 2, 3, . . . , n, and hence the families Ft1 , Ct2 , Ct3 , . . . , Ctn are independent. Repeat: Fix
Ft1 ∈ Ft1 and Ct3 ∈ Ct3 , . . . Ctn ∈ Ctn . Redefine

D := {F ∈ Ft2 : P(Ft1 ∩ F ∩ Ct3 ∩ · · · ∩ Ctn ) = PFt1 · PF · PCt3 · . . . · PCtn }

Again, D is a λ–system containing Ct2 , and hence by Thm. 1.6.3 D = Ft2 . From this it follows
that Ft1 , Ft2 , Ct3 , . . . , Ctn are independent. Repeat the construction n − 2 more times to deduce
that Ft1 , . . . Ftn are independent. a

If you’re familiar with the elementary definition of independence for random variables, given in
introductory courses on probability and statistics, you will want to know the following:

Exercise 3.3.4 Random variables X, Y are said to be independent in the elementary sense iff

P(X ≤ x, Y ≤ y) = P(X ≤ x) · P(Y ≤ y) all x, y ∈ R

Prove that random variables are independent iff they are independent in the elementary sense.
[Hint: First show that the set X := {{X ≤ x} : x ∈ R} is a π–system which generates σ(X).] □

The following result is very easy to prove within the measure–theoretic framework, but very difficult
to prove using the elementary definition of independence given in Exercise 3.3.4:

Theorem 3.3.5 Suppose that X1 , . . . , Xn+m are independent random variables, and that
f : Rn → R and g : Rm → R are Borel functions. Then Y = f (X1 , . . . , Xn ) and Z =
g(Xn+1 , . . . , Xn+m ) are independent.

Proof: We see that σ(X1 , . . . , Xn ) and σ(Xn+1 , . . . , Xm+n ) are independent σ–algebras. Now
Y is σ(X1 , . . . , Xn )–measurable, i.e. σ(Y ) ⊆ σ(X1 , . . . , Xn ). Similarly Z is σ(Xn+1 , . . . , Xm+n )–
measurable, i.e. σ(Z) ⊆ σ(Xn+1 , . . . , Xm+n ). Thus σ(Y ) and σ(Z) are independent. a

In particular, if X ⊥⊥ Y then also f (X) ⊥⊥ g(Y ) for any Borel functions f, g.

3.4 Borel–Cantelli Lemmas


Let (Ω, F, P) be a probability space, and let A = {An : n ∈ N} be a countable set of events. It may
help to think of A as a sequence of events, An+1 following An .

Proposition 3.4.1 (First Borel–Cantelli Lemma)


Let (Ω, F, P) be a probability space, and let {An : n ∈ N} be a sequence of events. If

    Σ_n P(An ) < ∞

then

    P(An , i.o.) = 0


Proof: Let Bn = ⋃_{k=n}^∞ Ak . Then Bn ↓ lim sup_n An = (An , i.o.). Hence by countable subadditivity

    P(An , i.o.) ≤ P(Bn ) ≤ Σ_{k=n}^∞ P(Ak )

for all n ∈ N. Now as n → ∞, the right–hand sum goes to zero, since Σ_n P(An ) converges. Hence
P(An , i.o.) = 0. □

Proposition 3.4.2 (Second Borel–Cantelli Lemma)


Let (Ω, F, P) be a probability space, and let {An : n ∈ N} be a family of independent events.
If

    Σ_n P(An ) = ∞

then

    P(An , i.o.) = 1

Proof: The proof depends on the fact that 1 − x ≤ e^{−x} for all x ∈ R, an inequality which is easily
proved using first–year calculus. Now clearly (An , i.o.) = lim sup_n An = (lim inf_n An^c )^c = (An^c , ev.)^c ,
and thus it suffices to prove that P(An^c , ev.) = 0. But (An^c , ev.) = ⋃_{n=1}^∞ ⋂_{k=n}^∞ Ak^c by definition, and so
it suffices to show that P( ⋂_{k=n}^∞ Ak^c ) = 0 for all n. Now by independence of the An , and thus of the An^c ,
we have

    P( ⋂_{k=n}^{n+m} Ak^c ) = ∏_{k=n}^{n+m} [1 − P(Ak )] ≤ ∏_{k=n}^{n+m} e^{−P(Ak )} = e^{− Σ_{k=n}^{n+m} P(Ak )}

for all m ∈ N. Now since Σ_n P(An ) diverges, the exponent on the right tends to −∞ as m → ∞, and so the
right–hand side tends to zero. Thus P( ⋂_{k=n}^∞ Ak^c ) = lim_{m→∞} P( ⋂_{k=n}^{n+m} Ak^c ) = 0, as required. □
k=n m→∞ k=n

Remarks 3.4.3 The First Borel–Cantelli Lemma says that given events An , not necessarily in-
dependent, if the sum of the probabilities P(An ) converges, then (An , i.o.) is an event of zero
probability. The Second Borel–Cantelli Lemma says that if the An are independent and the sum
of the probabilities P(An ) diverges, then the event (An , i.o.) occurs almost surely, i.e. with prob-
ability 1. Thus for independent events An , there is no middle road: (An , i.o) is either an event of
probability 0 or an event of probability 1. 

Example 3.4.4 Suppose that Xn are independent exponentially distributed random variables with
parameter λ for n = 0, 1, 2, . . . , i.e.
    P(Xn ≤ x) = 1 − e^{−λx} if x ≥ 0,    P(Xn ≤ x) = 0 if x < 0

• What is the probability that Xn ≥ 1 for infinitely many n?

  Let An := {ω ∈ Ω : Xn (ω) ≥ 1}. Then P(An ) = e^{−λ} > 0 for all n, and hence Σ_{n=0}^∞ P(An ) =
  +∞. By the second Borel–Cantelli lemma, P(An , i.o.) = 1.

• What is the probability that Xn ≥ n for infinitely many n?

  Let Bn := {Xn ≥ n}. Then P(Bn ) = e^{−λn}. It follows that Σ_{n=0}^∞ P(Bn ) = 1/(1 − e^{−λ}) < ∞ (a
  geometric series). Hence P(Bn , i.o.) = 0, by the first Borel–Cantelli Lemma. This is true even
  if the Xn are not independent.
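Both bullet points can be observed numerically; the sketch below fixes λ = 1 (a choice made only for this illustration) and counts both kinds of events along one simulated path:

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 1.0
N = 2000
X = rng.exponential(scale=1 / lam, size=N)   # X_0, ..., X_{N-1}

ge_one = int(np.sum(X >= 1))                 # how often {X_n >= 1} occurs
ge_n = int(np.sum(X >= np.arange(N)))        # how often {X_n >= n} occurs

print(ge_one, ge_n)  # many events of the first kind, only a handful of the second
```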

Exercise 3.4.5 It is sometimes asserted that if a monkey hit the keys of a typewriter at random,
it would eventually produce, in one continuous stream, the complete works of William Shakespeare.
Prove it. 
Chapter 4

Integration and Expectation

4.1 The Integral: Definition and Basic Properties


The aim of this section is to define the integral ∫ f dµ of a measurable function f w.r.t. a measure
µ. Why do we want this? Because

    Expectation = Integration
    E_P [X] = ∫ X dP

Throughout this section let (Ω, F, µ) be a measure space, and let mF be the set of all measurable
functions from (Ω, F) to R̄. We will define a (partial) linear functional, also denoted by µ, or by
∫ · dµ, from mF to R̄, i.e.

    µ = ∫ · dµ : mF → R̄,    f 7→ µ(f ) = ∫ f dµ

The quantity µ(f ) = ∫ f dµ need not exist for every measurable function f . If it does exist, we
say that f is integrable.
Below follows a wish list of properties that we would like the integral to possess. Bear in mind
that not all wishes come true. In particular, we shall have to be content with a weaker version of
wish (IV.).

WISH LIST:

I. ∫ I_A dµ = µ(A).

II. (Linearity) ∫ αf + βg dµ = α ∫ f dµ + β ∫ g dµ

III. (Monotonicity) If f ≤ g then ∫ f dµ ≤ ∫ g dµ

IV. (Continuity) Suppose that fn → f . Then ∫ fn dµ → ∫ f dµ.


From these wishes, we will be able to derive instructions for the definition of the integral.
Note that wish (I.) states that the integral µ is, in some sense, an extension of the measure
µ: Every measurable set can be identified with a measurable function (the set A is identified with
the indicator function I_A ), and we require that µ(A) = µ(I_A ). The integral ∫ f dµ = µ(f ) can be
thought of as extending the measure µ from measurable sets to measurable functions.
The definition of the integral proceeds in three steps:

Step 1. Define the integral µ on the set sF + of non–negative simple functions.

Step 2. Extend the definition to the set mF + of all non–negative measurable functions.

Step 3. Finally extend the definition to the set mF of measurable functions.

If ϕ is a non–negative simple function, there is only one way to define the integral to be consistent
with wishes (I.) and (II.):

Definition 4.1.1 (The Integral: Step 1)
If ϕ ∈ sF + , i.e. ϕ = Σ_{k=1}^n ak I_{Ak} is a simple non–negative function, we define

    ∫ ϕ dµ := Σ_{k=1}^n ak µ(Ak )

Some things need checking:


Proposition 4.1.2 (a) The definition of ∫ ϕ dµ does not depend on the representation
of ϕ as a linear combination of indicators, i.e. if ϕ = Σ_{k=1}^n ak I_{Ak} = Σ_{j=1}^l bj I_{Bj} , then
Σ_k ak µ(Ak ) = Σ_j bj µ(Bj ).

(b) ∫ I_A dµ = µ(A), i.e. µ(I_A ) = µ(A).

(c) (Linearity) If ϕ, ψ ∈ sF + and α, β > 0, then ∫ αϕ + βψ dµ = α ∫ ϕ dµ + β ∫ ψ dµ

(d) (Monotonicity) If ϕ ≤ ψ ∈ sF + , then ∫ ϕ dµ ≤ ∫ ψ dµ.
R R

Proof: We may assume that the ak are all distinct from each other, and that the bj are all distinct
from each other. Thus Ak , Bj ∈ σ(ϕ), the σ–algebra generated by ϕ. A little thought shows that
there is a representation ϕ = Σ_m cm I_{Cm} of ϕ such that (Cm )m forms a partition of S, and such
that each Ak and Bj is a union of some Cm ’s — just let the Cm ’s be the blocks of the partition
that generates the σ–algebra σ(ϕ). In particular, for each k, m, either Ak ∩ Cm = ∅, or Cm ⊆ Ak .
Also, cm = Σ {ak : Cm ⊆ Ak }. A similar statement holds for the Bj .
(a) We have

    Σ_k ak µ(Ak ) = Σ_k Σ_m ak µ(Ak ∩ Cm ) = Σ_m Σ_k ak µ(Ak ∩ Cm )
                  = Σ_m Σ {ak µ(Cm ) : Cm ⊆ Ak } = Σ_m cm µ(Cm )

(b) is obvious.
(c) Suppose that ϕ = Σ_k ak I_{Ak} , ψ = Σ_j bj I_{Bj} , where (Ak )k , (Bj )j are partitions of S. Then
ϕ + ψ = Σ_{k,j} (ak + bj )I_{Ak ∩ Bj} and hence

    µ(ϕ + ψ) = Σ_{k,j} (ak + bj )µ(Ak ∩ Bj ) = Σ_k ak µAk + Σ_j bj µBj

(d) is obvious. □
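Proposition 4.1.2(a) is easy to test on a toy example; the sketch below uses a made-up finite measure and two different representations of the same simple function:

```python
from fractions import Fraction

# A finite measure on Omega = {0,...,5} with mu({w}) = w (not a probability measure).
omega = range(6)
mu_point = {w: Fraction(w) for w in omega}

def mu(A):
    return sum(mu_point[w] for w in A)

def integral_simple(coeffs):
    """Integral of phi = sum_k a_k I_{A_k}, given as [(a_k, A_k), ...]."""
    return sum(a * mu(A) for a, A in coeffs)

# Two representations of the same simple function phi
# (phi = 2 on {0,1,2} and phi = 5 on {3,4,5} in both cases):
rep1 = [(Fraction(2), {0, 1, 2}), (Fraction(5), {3, 4, 5})]
rep2 = [(Fraction(2), {0, 1, 2, 3, 4, 5}), (Fraction(3), {3, 4, 5})]
print(integral_simple(rep1), integral_simple(rep2))  # 66 66
```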
If f is a non–negative measurable function, then wish (III.) requires that we must have ∫ f dµ ≥
∫ ϕ dµ whenever ϕ is simple, with f ≥ ϕ. By Proposition 2.4.3, there is a sequence ⟨ψn ⟩ of simple
non–negative functions such that ψn ↑ f . Then

    ∫ f dµ = lim_n ∫ ψn dµ    (by wish (IV.))
           = sup_n ∫ ψn dµ    (because if ⟨xn ⟩ is increasing, then lim_n xn = sup_n xn )
           ≤ sup { ∫ ϕ dµ : ϕ ∈ sF + , ϕ ≤ f }
           ≤ ∫ f dµ    (by wish (III.))

The most parsimonious choice — and one that does not depend on the approximating sequence
⟨ψn ⟩ — is therefore:

Definition 4.1.3 (The Integral: Step 2)
If f ∈ mF + is a non–negative measurable function, we define

    ∫ f dµ := sup { ∫ ϕ dµ : ϕ ≤ f, ϕ ∈ sF + }

Note that ∫ f dµ may be equal to +∞.
Exercise 4.1.4 (a) If ϕ = Σ_k ak I_{Ak} is non–negative simple, it is also non–negative measurable,
and thus we now have two definitions of ∫ ϕ dµ, namely

    ∫ ϕ dµ = Σ_k ak µ(Ak )    and    ∫ ϕ dµ = sup { ∫ ψ dµ : ψ ≤ ϕ, ψ ∈ sF + }

Show that these two values of µϕ coincide.

(b) Verify (III.) for non–negative measurable functions, i.e. show that if f ≤ g ∈ mF + , then
∫ f dµ ≤ ∫ g dµ.


Proving that the integral defined in Step 2 is linear, i.e. that wish (II.) holds, is much more
difficult, and requires a version of (IV.). In fact, a weak version of (IV.) forms the foundation for
the whole edifice of integration theory:

Theorem 4.1.5 (Monotone Convergence Theorem, MCT)
Suppose that fn , f ∈ mF + are such that fn ↑ f . Then ∫ fn dµ ↑ ∫ f dµ, i.e.

    ∫ (↑ lim_n fn ) dµ = ↑ lim_n ∫ fn dµ
Proof: It is easy to see that ⟨ ∫ fn dµ ⟩n is an increasing sequence, and that each ∫ fn dµ ≤ ∫ f dµ,
so that lim_n ∫ fn dµ exists (in the extended reals, i.e. may be +∞) and lim_n ∫ fn dµ ≤ ∫ f dµ.
Let f ≥ ϕ ∈ sF + , and suppose ϕ = Σ_k ak I_{Ak} , where the Ak are disjoint, and each ak > 0. For
ε > 0, define

    ϕn = Σ_k (1 − ε)ak I_{Ak ∩ {fn ≥ (1−ε)ak }}
Then ϕn is a non–negative simple measurable function with ϕn ≤ fn . Hence

    ∫ fn dµ ≥ ∫ ϕn dµ = (1 − ε) Σ_k ak µ(Ak ∩ {fn ≥ (1 − ε)ak })

Now note that


Ak ∩ {fn ≥ (1 − ε)ak } ↑ Ak
For if ω ∈ Ak , then ak = ϕ(ω) ≤ f (ω) = limn fn (ω), so that fn (ω) > (1 − ε)ak as soon as n is
sufficiently large, and thus ω ∈ Ak ∩ {fn ≥ (1 − ε)ak } if n is sufficiently large. By the continuity
properties of measures,

µ(Ak ∩ {fn ≥ (1 − ε)ak }) ↑ µ(Ak ) as n → ∞

which in turn yields

    ∫ ϕn dµ ↑ (1 − ε) Σ_k ak µ(Ak ) = (1 − ε) ∫ ϕ dµ    as n → ∞
Now ∫ fn dµ ≥ ∫ ϕn dµ for each n ∈ N, and thus

    lim_n ∫ fn dµ ≥ (1 − ε) ∫ ϕ dµ

This is true for any non–negative simple ϕ ≤ f and any ε > 0. Taking the supremum over those
ϕ, we see that

    lim_n ∫ fn dµ ≥ (1 − ε) sup { ∫ ϕ dµ : ϕ ≤ f, ϕ ∈ sF + } = (1 − ε) ∫ f dµ

Letting ε → 0, we conclude that we also have lim_n ∫ fn dµ ≥ ∫ f dµ. □
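The theorem can be watched numerically. A sketch: take f (x) = x^{-1/2} on (0, 1] and fn = min(f, n) ↑ f ; the exact integrals are ∫ fn dλ = 2 − 1/n ↑ 2 = ∫ f dλ, and midpoint Riemann sums (which agree with the Lebesgue integrals for these functions, cf. Remarks 4.1.8) confirm the monotone convergence:

```python
import numpy as np

# f(x) = x**(-1/2) on (0, 1]; f_n = min(f, n) increases pointwise to f.
# Exact value: integral of f_n over (0, 1] equals 2 - 1/n.
xs = (np.arange(200_000) + 0.5) / 200_000        # midpoints of a fine grid on (0, 1)

def int_fn(n):
    return float(np.mean(np.minimum(xs ** -0.5, n)))

vals = [int_fn(n) for n in (1, 4, 16, 64)]
print(vals)  # increasing towards 2
```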

In Propn. 4.1.2(b), Exercise 4.1.4(b) and Thm 4.1.5, we have seen that wishes (I.) and (III.),
and a weak version of wish (IV.) hold. We have also verified wish (II.) for non–negative simple
functions (cf. Proposition 4.1.2(c)). Now we can verify that wish (II.) holds for non–negative
measurable functions:

Proposition 4.1.6 If f, g ∈ mF + and if α, β ≥ 0, then

    ∫ αf + βg dµ = α ∫ f dµ + β ∫ g dµ

Proof: By Proposition 2.4.3, we may choose sequences ⟨ϕn ⟩n , ⟨ψn ⟩n of non–negative simple functions
such that ϕn ↑ f , ψn ↑ g. Then each αϕn + βψn is non–negative simple, and (αϕn + βψn ) ↑
(αf + βg). Since wish (II.) holds for simple functions, by the Monotone Convergence Theorem, we
see that

    ∫ αf + βg dµ = lim_n ∫ αϕn + βψn dµ = α lim_n ∫ ϕn dµ + β lim_n ∫ ψn dµ = α ∫ f dµ + β ∫ g dµ    □

It remains to define the integral for arbitrary measurable functions. Recall that if f ∈ mF,
then

    f = f^+ − f^−    |f | = f^+ + f^−

where f^+ := max{f, 0} and f^− := max{−f, 0}. Since f^+ , f^− , |f | ∈ mF + , the three integrals
∫ f^+ dµ, ∫ f^− dµ, ∫ |f | dµ have already been defined in Step 2. Moreover, by wish (II.) for non–
negative measurable functions (i.e. Proposition 4.1.6), we have ∫ f^+ dµ + ∫ f^− dµ = ∫ |f | dµ.
If we want wish (II.) to hold for arbitrary measurable functions, i.e. if we want to preserve
linearity, we have no choice but to define ∫ f dµ by

    ∫ f dµ = ∫ f^+ dµ − ∫ f^− dµ

However, here we face a problem: If both ∫ f^+ dµ, ∫ f^− dµ are equal to +∞, we have ∫ f dµ =
∞ − ∞, an indeterminate form.
We therefore demand that both integrals ∫ f^+ dµ, ∫ f^− dµ are finite. Since both integrals are
non–negative, this is equivalent to demanding that the sum ∫ f^+ dµ + ∫ f^− dµ = ∫ |f | dµ is finite.

Definition 4.1.7 (The Integral: Step 3)
A function f ∈ mF is said to be µ–integrable if ∫ |f | dµ < ∞. The class of all µ–integrable
functions is denoted by L1 (Ω, F, µ).
If f ∈ L1 (Ω, F, µ), we define

    ∫ f dµ = ∫ f^+ dµ − ∫ f^− dµ

Notation:

(a) If f is an integrable function and A is a measurable set, we define

    ∫_A f dµ := ∫ f I_A dµ =: µ(f ; A)

to be the integral of f over the set A.

(b) If f (x) is an integrable function, and we want to make the variable x explicit, we will
write ∫ f (x) µ(dx) instead of ∫ f dµ.

Remarks 4.1.8 Later, we will prove the following important fact: If the Riemann integral ∫_a^b f (x) dx
of a function f : R → R exists, then

    ∫_a^b f (x) dx = ∫_{[a,b]} f (x) λ(dx)

where λ is Lebesgue measure on (R, B(R)). If the Riemann integral of a function exists, then so
does the Lebesgue integral, and the two integrals coincide. This is obvious if f is a simple function,
as you can easily check, but the proof for general f is deferred to a later section. Note that the
Lebesgue integral may exist even when the Riemann integral does not. 

Obvious, but often useful, are the following facts:

Proposition 4.1.9 (a) If f is measurable, then f is integrable iff |f | is integrable.

(b) If f, g are measurable, g is integrable and |f | ≤ g, then f is integrable as well.

(c) If f is integrable, then µ{f = ±∞} = 0, i.e. f is finite µ–a.e.

(d) If f is integrable, then | ∫ f dµ| ≤ ∫ |f | dµ.

Proof: (a) is obvious. (b) follows from the fact that µ|f | ≤ µg < ∞ (because we have (III.),
monotonicity, for non–negative measurable functions).
(c) Let A = {s ∈ S : |f (s)| = ∞}. Then nI_A ≤ |f | for all n ∈ N, and hence n µ(A) ≤ ∫ |f | dµ.
Letting n → ∞, we see that we must have ∫ |f | dµ = +∞ if µ(A) > 0.
(d) follows because by the triangle inequality | ∫ f dµ| = | ∫ f^+ dµ − ∫ f^− dµ| ≤ | ∫ f^+ dµ| +
| ∫ f^− dµ| = ∫ |f | dµ. □

Exercise 4.1.10 The decomposition f = f + − f − is but one of many ways that f can be decom-
posed as a difference of non–negative measurable functions. Show that if f = g − h is a difference of
non–negative functions, then µf = µg − µh. Thus the definition of the integral of f is independent
of the representation of f as a difference of non–negative measurable functions.
[Hint: Apply Proposition 4.1.6 to f + + h = g + f − .] 

Looking at our wish list of properties, i.e. (I.)–(IV.), we see that (I.) holds automatically. (III.)
(monotonicity) is easy: If f ≤ g, then f^+ ≤ g^+ and f^− ≥ g^− , so ∫ f^+ dµ ≤ ∫ g^+ dµ and
∫ f^− dµ ≥ ∫ g^− dµ (because (III.) holds for non–negative measurable functions, cf. Exercise
4.1.4(b)), and hence ∫ f dµ ≤ ∫ g dµ.

We finish this subsection by dealing with (II.) (linearity), and leave (IV.) (continuity) to the
next section.


We finish this subsection by dealing with (II.) (linearity), and leave (IV.) (continuity) to the
next section.

Theorem 4.1.11 If f, g ∈ L1 (Ω, F, µ), and α, β ∈ R, then

    ∫ αf + βg dµ = α ∫ f dµ + β ∫ g dµ

Proof: It suffices to prove that ∫ f + g dµ = ∫ f dµ + ∫ g dµ and that ∫ αf dµ = α ∫ f dµ (for
f, g ∈ L1 and α ∈ R). Now

    f + g = (f^+ + g^+ ) − (f^− + g^− )

is a representation of f + g as a difference of non–negative measurable functions. By Exercise
4.1.10, it follows that ∫ f + g dµ = ∫ f^+ + g^+ dµ − ∫ f^− + g^− dµ. Proposition 4.1.6 implies that
∫ f + g dµ = ∫ f dµ + ∫ g dµ.
Similarly, an application of Proposition 4.1.6 and Exercise 4.1.10 to αf = αf^+ − αf^− (if α ≥ 0),
or αf = (−α)f^− − (−α)f^+ (if α < 0) yields the conclusion that ∫ αf dµ = α ∫ f dµ. □

Exercise 4.1.12 Show that L1 (Ω, F, µ) is a vector space. Also give an example to show that it
may not be closed under multiplication. 

4.2 Lebesgue’s Dominated Convergence Theorem


Let Φ be a set of functions f : Ω → R, and let g : Ω → R+ . We say that Φ is dominated by g if and only
if |f | ≤ g for every f ∈ Φ.
It follows by Proposition 4.1.9 (b) that if Φ is dominated by g, and if g is integrable, then each
f ∈ Φ is integrable as well.
The following proposition serves as stepping stone in the proof of the Dominated Convergence
theorem, but is also very useful in other situations. See Section B.3 for the definitions and properties
of lim supn xn and lim inf n xn .

Proposition 4.2.1 (a) FATOU’S LEMMA: If fn ∈ mF + for n ∈ N, then

    ∫ lim inf_n fn dµ ≤ lim inf_n ∫ fn dµ

(b) REVERSE FATOU LEMMA: Suppose that fn ∈ mF + for n ∈ N, and that there is
g ∈ L1 (Ω, F, µ) such that {fn : n ∈ N} is dominated by g. Then

    lim sup_n ∫ fn dµ ≤ ∫ lim sup_n fn dµ

Proof: (a) Let f = lim inf_n fn , and define gn = inf_{m≥n} fm . Then gn ↑ f (by definition of lim inf),
and so the Monotone Convergence Theorem implies that ∫ gn dµ ↑ ∫ f dµ. Moreover, ∫ gn dµ ≤
inf_{m≥n} ∫ fm dµ (by monotonicity, (III.)), and so ∫ f dµ = lim_n ∫ gn dµ ≤ lim_n inf_{m≥n} ∫ fm dµ =
lim inf_n ∫ fn dµ. □

Exercise 4.2.2 Prove the Reverse Fatou Lemma by applying Fatou’s Lemma to the sequence
g − fn . (Why do we require that g ∈ L1 ? Cancellation! For x + y = x + z does not imply that
y = z when x = ∞.) 

Remarks 4.2.3 Under suitable conditions, we see that we have

    ∫ lim inf_n fn dµ ≤ lim inf_n ∫ fn dµ ≤ lim sup_n ∫ fn dµ ≤ ∫ lim sup_n fn dµ

This provides a useful mnemonic: The terms with the limits on the outside (of the integral) are on
the inside (of the string of inequalities).
The mnemonic "Terms with limits on the inside are on the outside" also works. □
Theorem 4.2.4 (Dominated Convergence Theorem, DCT)
Suppose that f1 , f2 , f3 , . . . is a sequence of measurable functions on (Ω, F, µ) such that

(i) lim_n fn (ω) exists for all ω ∈ Ω;

(ii) There is a g ∈ L1 (Ω, F, µ) such that |fn | ≤ g for all n ∈ N.

Then the function f := lim_n fn belongs to L1 (Ω, F, µ), and

    ∫ f dµ = lim_n ∫ fn dµ    i.e.    ∫ lim_n fn dµ = lim_n ∫ fn dµ

Proof: Since |fn | ≤ g, the functions g ± fn are non–negative measurable functions, and thus by
Fatou’s lemma, we see that

    ∫ g dµ + lim inf_n ( ± ∫ fn dµ ) = lim inf_n ∫ g ± fn dµ
                                     ≥ ∫ lim inf_n (g ± fn ) dµ
                                     = ∫ g ± f dµ
                                     = ∫ g dµ ± ∫ f dµ

Subtracting ∫ g dµ < ∞ from both sides, we see that lim inf_n ∫ fn dµ ≥ ∫ f dµ and that lim inf_n ( − ∫ fn dµ ) ≥
− ∫ f dµ, i.e. that lim sup_n ∫ fn dµ ≤ ∫ f dµ. Combining, we obtain

    ∫ f dµ ≤ lim inf_n ∫ fn dµ ≤ lim sup_n ∫ fn dµ ≤ ∫ f dµ    □

Exercise 4.2.5 (1.) Let fn = (1/n) I[0,n] for n ∈ N.

(a) Show that fn −→ 0 as n −→ +∞.
(b) Show that ∫ fn dλ = 1 for all n ∈ N.
(c) Why does this not contradict
    (i) the Monotone Convergence Theorem?
    (ii) Fatou’s Lemma?
    (iii) the Lebesgue Dominated Convergence Theorem?

(2.) Let fn = n I(0,1/n] for n ∈ N.

(a) Find the function lim_{n→+∞} fn .
(b) Show that ∫ lim_{n→+∞} fn dλ ≠ lim_{n→+∞} ∫ fn dλ.
(c) Why does this not contradict the Lebesgue Dominated Convergence Theorem?
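Part (1.) can be probed numerically: the mass of fn = (1/n) I[0,n] escapes to infinity, so the integrals stay at 1 while the functions tend to 0 pointwise. A small Python sketch (our own illustration, not part of the text):

```python
# f_n = (1/n) * I_[0,n]: integrates to 1 for every n, yet f_n -> 0 pointwise.

def integral_fn(n, steps=100_000):
    """Approximate the Lebesgue integral of f_n by a Riemann sum on [0, n]."""
    width = n / steps
    return sum((1.0 / n) * width for _ in range(steps))

# Each f_n integrates to (approximately) 1 ...
for n in (1, 10, 100):
    assert abs(integral_fn(n) - 1.0) < 1e-9

# ... yet at any fixed point t, f_n(t) = 1/n -> 0 as n grows.
t = 3.0
values = [1.0 / n if 0 <= t <= n else 0.0 for n in (10, 100, 1000)]
assert values == [0.1, 0.01, 0.001]
```

No integrable g dominates all the fn here (the smallest candidate, sup_n fn , has infinite integral), so the DCT hypothesis fails, which is why there is no contradiction.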


4.3 Measure Zero


Suppose that (Ω, F, µ) is a measure space. It may be possible to extend the measure µ to a class
of sets larger than F, where the measure of the added new sets is determined by µ and F. For
example, suppose that

(i) F ∈ F is such that µ(F ) = 0;

(ii) A ⊆ F

then “clearly” µ(A) = 0 also. However, if A ∉ F, then µ(A) is undefined! Yet, µ(A) clearly “ought” to be zero. By adding all those sets whose measure “ought” to be zero, we get a new σ–algebra F µ , called the completion of F w.r.t. µ.

Definition 4.3.1 Let (Ω, F, µ) be a measure space, and let A ⊆ Ω.

(a) We say that A is µ–null if there exists B ∈ F such that A ⊆ B and µ(B) = 0.
(It is not necessary that A ∈ F.)

(b) We denote the family of µ–null sets by N µ .

(c) The measure space (Ω, F, µ) is said to be complete iff every µ–null set is measurable,
i.e. belongs to F.

Exercise 4.3.2 Show that the family N µ of µ–null sets is closed under countable unions, i.e. that if Nn ∈ N µ for n ∈ N, then also ⋃n Nn ∈ N µ . 


Definition and Proposition 4.3.3 Let (Ω, F, µ) be a measure space. Let N µ be the
family of µ–null sets.

(a) The family of sets


    F µ := {F ∪ N : F ∈ F, N ∈ N µ }

is a σ–algebra, called the completion of F w.r.t. µ. Moreover, F µ = σ(F ∪ N µ ).

(b) We have

G ∈ F µ iff there are F1 , F2 ∈ F such that F1 ⊆ G ⊆ F2 and µ(F2 − F1 ) = 0 (equivalently, when µ(F1 ) < ∞, µ(F1 ) = µ(F2 ))

(c) We can extend the measure µ to a measure µ̄ on the σ–algebra F µ in the obvious way:

If G = F ∪ N , where F ∈ F, N ∈ N µ , define µ̄(G) := µ(F )

(d) The space (Ω, F µ , µ̄) is a complete measure space.

Proof: (a) We first show that F µ is a σ–algebra. That F µ is closed under countable unions follows
straightforwardly from the fact that both F and N µ are closed under countable unions. To check
that F µ is closed under complementation, suppose that F ∪ N ∈ F µ , where F ∈ F, N ∈ N µ . We
must show that (F ∪ N )c ∈ F µ as well. Choose G ∈ F such that µ(G) = 0 and N ⊆ G. Then

(F ∪ N )c = (F ∪ G)c ∪ [G − (F ∪ N )]

Now (F ∪ G)c ∈ F, and G − (F ∪ N ) ∈ N µ (being a subset of G). Hence (F ∪ N )c ∈ F µ , proving that F µ is a σ–algebra. Clearly F µ = σ(F ∪ N µ ).
(b) If F1 , F2 ∈ F are such that µ(F2 − F1 ) = 0, and if F1 ⊆ G ⊆ F2 , then G = F1 ∪ (G − F1 ), where G − F1 ⊆ F2 − F1 , so that G − F1 ∈ N µ . It follows that G ∈ F µ .
Next, if G ∈ F µ , then (by definition of F µ ) there are F1 ∈ F, N ∈ N µ such that G = F1 ∪ N . Also (by definition of N µ ) there is F ∈ F such that µ(F ) = 0 and N ⊆ F . If we now define F2 := F1 ∪ F , we see that F1 , F2 ∈ F, with F1 ⊆ G ⊆ F2 , and µ(F2 − F1 ) ≤ µ(F ) = 0.
(c) We need to verify two things: That the extension µ̄ of µ is well–defined on F µ , and that
it is a measure. To see that it is well–defined, suppose that G = F1 ∪ N1 = F2 ∪ N2 are two
representations of G, where F1 , F2 ∈ F, N1 , N2 ∈ N µ . We must show that µ(F1 ) = µ(F2 ). But
F1 = F1 ∩(F1 ∪N1 ) = F1 ∩(F2 ∪N2 ) = (F1 ∩F2 )∪(F1 ∩N2 ). It follows easily that µ(F1 ) = µ(F1 ∩F2 ).
Similarly µ(F2 ) = µ(F1 ∩ F2 ), and hence µ(F1 ) = µ(F1 ∩ F2 ) = µ(F2 ).
Next, we show that µ̄ is a measure on F µ : Suppose that Gn (n ∈ N) are mutually disjoint members of F µ . Choose Fn ∈ F, Nn ∈ N µ such that Gn = Fn ∪ Nn (for n ∈ N). Then the Fn (n ∈ N) are mutually disjoint, and hence µ̄(Gn ) = µ(Fn ). Now ⋃n Gn = F ∪ N , where F := ⋃n Fn ∈ F and N := ⋃n Nn ∈ N µ (because a countable union of µ–null sets is µ–null). Then by definition of the extension of µ to F µ , we have µ̄(⋃n Gn ) = µ(F ) = µ(⋃n Fn ) = Σn µ(Fn ) = Σn µ̄(Gn ), where we used the fact that µ is a measure on F to deduce that µ(⋃n Fn ) = Σn µ(Fn ).
(d) Suppose that N is a µ̄–null set for (Ω, F µ , µ̄). We must show that N ∈ F µ . By definition of µ̄–null set there exists G ∈ F µ such that N ⊆ G and such that µ̄(G) = 0. By definition of µ̄(G), there exists F ∈ F such that G ⊆ F and µ(F ) = 0. Putting all this together, we obtain N ⊆ F and µ(F ) = 0. Thus N ∈ N µ ⊆ F µ . □
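Proposition 4.3.3 can be verified by brute force on a toy example (our own construction, in Python): take Ω = {1, 2, 3}, F = {∅, {1}, {2, 3}, Ω}, with µ({1}) = 1 and µ({2, 3}) = 0.

```python
from itertools import combinations

Omega = frozenset({1, 2, 3})
F = {frozenset(), frozenset({1}), frozenset({2, 3}), Omega}
mu = {frozenset(): 0, frozenset({1}): 1, frozenset({2, 3}): 0, Omega: 1}

def powerset(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

# N^mu: sets contained in some measurable set of measure zero.
null_sets = {A for A in powerset(Omega)
             if any(A <= B and mu[B] == 0 for B in F)}

# F^mu = {F ∪ N : F in F, N in N^mu}; here the completion is the full power set.
F_mu = {A | N for A in F for N in null_sets}
assert F_mu == set(powerset(Omega))

def mu_bar(G):
    """Extension mu_bar(F ∪ N) := mu(F); check that it is well defined."""
    vals = {mu[A] for A in F for N in null_sets if A | N == G}
    assert len(vals) == 1          # every representation gives the same value
    return vals.pop()

assert mu_bar(frozenset({1, 2})) == 1   # {1,2} = {1} ∪ {2}, with {2} null
```

The well-definedness assertion inside `mu_bar` is exactly the content of part (c) of the proposition, checked exhaustively on this small space.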

Unless it is likely to cause confusion, we shall usually refer to the measure µ̄ by the name µ.

Definition 4.3.4 We shall say that a statement Φ holds µ–almost everywhere (or µ–almost surely if µ is a probability measure), if the set {ω ∈ Ω : Φ(ω) is not true} where Φ fails to hold is a µ–null set.

First note that completing a measure space does not create any interesting new measurable functions:

Proposition 4.3.5 Let (Ω, F µ , µ) be the completion of (Ω, F, µ). Then a function f : Ω → R̄ is F µ –measurable iff there is an F–measurable function g : Ω → R̄ such that f = g µ–a.e.
Proof: (⇒): First suppose that f = IA is an indicator function, where A ∈ F µ . Then there exist
F ∈ F and N ∈ N µ such that A = F ∪ N . Clearly {ω ∈ Ω : IA (ω) 6= IF (ω)} = N is µ–null, so
IA = IF µ–a.e., and IF is F–measurable.
It is now straightforward to see that the proposition holds for simple functions as well.
If f is an arbitrary F µ –measurable function, we may (by Proposition 2.4.3) choose a sequence fn of simple F µ –measurable functions such that fn → f . Then choose simple F–measurable functions gn such that fn = gn µ–a.e., for all n ∈ N. Let g = lim sup_n gn . Then g is F–measurable and f = g µ–a.e. (because {ω ∈ Ω : f (ω) ≠ g(ω)} ⊆ ⋃n {ω ∈ Ω : fn (ω) ≠ gn (ω)}, a countable union of null sets, and hence µ–null).
(⇐): Suppose that f = g µ–a.e. for some F–measurable function g. To show that f is F µ –measurable, we must show that f −1 [B] ∈ F µ for any Borel set B. Certainly g −1 [B] ∈ F, because g is F–measurable. Let N ∈ F be such that {f ≠ g} ⊆ N and µ(N ) = 0. Then a little thought will show that

    f −1 [B] = (g −1 [B] − N ) ∪ (f −1 [B] ∩ N )    (?)

where g −1 [B] − N ∈ F and f −1 [B] ∩ N ∈ N µ .

To see that f −1 [B] ⊆ (g −1 [B] − N ) ∪ (f −1 [B] ∩ N ): If ω ∈ f −1 [B], then either


(i) ω ∈ N , in which case ω ∈ f −1 [B] ∩ N , or else
(ii) ω 6∈ N , in which case f (ω) = g(ω), so ω ∈ g −1 [B] − N .
In both cases, it follows that f −1 [B] ⊆ (g −1 [B] − N ) ∪ (f −1 [B] ∩ N ).
To see that f −1 [B] ⊇ (g −1 [B] − N ) ∪ (f −1 [B] ∩ N ): If ω ∈ (g −1 [B] − N ) ∪ (f −1 [B] ∩ N ), then either
(i) ω ∈ f −1 [B] ∩ N , or else
(ii) ω ∈ g −1 [B] − N , in which case f (ω) = g(ω), so ω ∈ f −1 [B].
In both cases, it follows that (g −1 [B] − N ) ∪ (f −1 [B] ∩ N ) ⊆ f −1 [B].

It now follows by (?) that f −1 [B] ∈ F µ . a

The proof of Proposition 4.3.5 suggests an alternative characterization of the completion F µ :



Exercise 4.3.6 Show that if (Ω, F, µ) has completion (Ω, F µ , µ), then

    F µ = {A ⊆ Ω : A∆F is a µ–null set for some F ∈ F}

[Hint: Let N µ be the family of µ–null sets, and let G := {A ⊆ Ω : A∆F ∈ N µ for some F ∈ F}. First show that F, N µ ⊆ G, and conclude that F µ ⊆ G. Next, note that if A ∈ G, say A∆F ∈ N µ for some F ∈ F, then there is N ∈ F with µ(N ) = 0 such that A∆F ⊆ N . Show that A = (F − N ) ∪ (A ∩ N ), where F − N ∈ F and A ∩ N ∈ N µ , and conclude that A ∈ F µ .] 

Remarks 4.3.7 If f is F–measurable and if f = g µ–a.e., we cannot generally conclude that g is


also F–measurable. That conclusion is valid, however, if F is complete w.r.t. µ, i.e. if F = F µ . 

Next, we shall show that two functions which are equal µ–a.e. have the same integrals. To do
so, the following lemma will be useful:
Lemma 4.3.8 On (Ω, F, µ), if h ≥ 0 is measurable, then ∫ h dµ = 0 iff h = 0 µ–a.e.

Proof: It is easy to see that the statement is true if h is a simple non–negative measurable function. For general h ∈ mF + , choose simple hn such that 0 ≤ hn ↑ h. If ∫ h dµ = 0, then by monotonicity, 0 ≤ ∫ hn dµ ≤ ∫ h dµ = 0, so that, by the above, hn = 0 µ–a.e. (since the result holds for simple functions, and hn is simple). Thus also h = lim_n hn = 0 µ–a.e.
Conversely, if h = 0 µ–a.e., then also hn = 0 µ–a.e., and hence ∫ h dµ = lim_n ∫ hn dµ = 0, by the MCT. □

Theorem 4.3.9 If f, g are measurable functions on (Ω, F, µ) such that f = g µ–a.e., and
if f is integrable, then g is integrable, and µf = µg.
Proof: We have 0 ≤ |∫ f dµ − ∫ g dµ| ≤ ∫ |f − g| dµ, by Proposition 4.1.9. But f = g µ–a.e. iff |f − g| = 0 µ–a.e., so Lemma 4.3.8 shows that 0 ≤ |∫ f dµ − ∫ g dµ| ≤ 0. □

We can use this to improve the convergence theorems. For example:

Theorem 4.3.10 (Dominated Convergence Theorem)


Suppose that f1 , f2 , f3 , . . . is a sequence of measurable functions on a complete measure
space (Ω, F, µ) such that

(i) limn fn (ω) exists for µ–a.e. ω ∈ Ω;

(ii) There is a g ∈ L1 (Ω, F, µ) such that |fn | ≤ g µ–a.e. for all n ∈ N.

Define f by f (ω) = limn fn (ω) if this limit exists, and let f (ω) be arbitrary otherwise. Then f ∈ L1 (Ω, F, µ), and

    ∫ f dµ = lim_n ∫ fn dµ

Proof: Let

    N = {ω ∈ Ω : lim_n fn (ω) does not exist} ∪ ⋃n {ω ∈ Ω : |fn (ω)| > g(ω)}

Then N is a null set, and thus in F (because the measure space is assumed complete). Define

    f̄n = fn I_{N^c}    ḡ = g I_{N^c}    f̄ = f I_{N^c}

These functions are also F–measurable, and ḡ = g µ–a.e., f̄ = f µ–a.e., and f̄n = fn µ–a.e. Since ∫ ḡ dµ = ∫ g dµ < ∞ (by Theorem 4.3.9), we see that ḡ is integrable. Also, we have

    lim_n f̄n (ω) = f̄ (ω)    and    |f̄n (ω)| ≤ ḡ(ω)    for all ω ∈ Ω

By Theorem 4.2.4, we can conclude that f̄ is integrable, and that ∫ f̄ dµ = lim_n ∫ f̄n dµ. But ∫ f dµ = ∫ f̄ dµ and ∫ fn dµ = ∫ f̄n dµ, by Theorem 4.3.9. □

4.4 Riemann Integral vs. Lebesgue Integral


We recall briefly how the Riemann integral ∫_a^b f (t) dt is defined: Let f be a real–valued function defined and bounded on an interval [a, b]. A partition P of [a, b] is a finite ordered set {a = t0 < t1 < t2 < · · · < tn = b}. The size or mesh of such a partition is denoted ||P ||, and defined by

    ||P || := max_k (tk − tk−1 )

A tagged partition is a partition P together with a choice t∗k ∈ [tk−1 , tk ] for each k = 1, . . . , n.
Tagged partitions will be indicated by a ∗, i.e. if P is a partition, then P ∗ denotes an associated
tagged partition.
With each tagged partition, we can associate a Riemann sum

    S(P ∗ , f ) := Σ_{k=1}^{n} f (t∗k ) (tk − tk−1 ) = Σ_{k=1}^{n} f (t∗k ) ∆k t

The Riemann integral ∫_a^b f dt should be the limit of the Riemann sums, over all tagged partitions P ∗ , as ||P || → 0. To be precise, we say

    lim_{||P ||→0} S(P ∗ , f ) = L exists

if and only if for every ε > 0 there is δ > 0 such that


|S(P ∗ , f ) − L| < ε whenever ||P || < δ
Then we define

    ∫_a^b f dt := lim_{||P ||→0} S(P ∗ , f )

provided this limit exists, and say that f is Riemann integrable on [a, b].
With each partition {a = t0 < t1 < · · · < tn = b} it is possible to associate three natural tagged
partitions, namely those having tags equal to the left endpoint, right endpoint and midpoint of
each interval. This yields:
• The lefthand Riemann sum Σk f (tk−1 )∆k t;
• The righthand Riemann sum Σk f (tk )∆k t;
• The symmetric Riemann sum Σk f ((tk−1 + tk )/2)∆k t.

where ∆k t := tk − tk−1 . If f is Riemann integrable over [a, b], then each of these sums must converge as ||P || → 0, and all to the same limit.
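The three tagged sums can be compared numerically; the sketch below (our own illustration, with f (t) = t², whose integral over [0, 1] is 1/3) shows all three converging as ||P || → 0:

```python
# Left, right and midpoint Riemann sums for f(t) = t^2 on [0, 1].

def riemann_sums(f, a, b, n):
    """Return (left, right, midpoint) Riemann sums for a uniform partition."""
    dt = (b - a) / n
    ts = [a + k * dt for k in range(n + 1)]
    left = sum(f(ts[k - 1]) * dt for k in range(1, n + 1))
    right = sum(f(ts[k]) * dt for k in range(1, n + 1))
    mid = sum(f((ts[k - 1] + ts[k]) / 2) * dt for k in range(1, n + 1))
    return left, right, mid

f = lambda t: t * t
for n in (10, 100, 1000):
    L, R, M = riemann_sums(f, 0.0, 1.0, n)
    # All three converge to 1/3 as ||P|| = 1/n -> 0.
    assert L <= 1/3 <= R                  # f is increasing on [0, 1]
    assert abs(M - 1/3) <= abs(L - 1/3)   # midpoint converges faster here

assert abs(riemann_sums(f, 0.0, 1.0, 10_000)[2] - 1/3) < 1e-8
```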

Remarks 4.4.1 A slightly different definition uses Darboux sums rather than Riemann sums. Given a real–valued function f defined and bounded on an interval [a, b], and a partition P = {a = t0 < t1 < · · · < tn = b}, let the upper and lower Darboux sums be defined by

    U (P, f ) := Σ_{k=1}^{n} sup{f (t) : t ∈ [tk−1 , tk ]} · (tk − tk−1 )
    L(P, f ) := Σ_{k=1}^{n} inf{f (t) : t ∈ [tk−1 , tk ]} · (tk − tk−1 )

It is easy to see that, for any set of tags for P , we have

L(P, f ) ≤ S(P ∗ , f ) ≤ U (P, f )

i.e. the Darboux sums give the most extreme values of the Riemann sums for any given partition.
Furthermore, given any ε > 0, it is possible to find tags P ∗ , P 0 for the same partition P such that |S(P ∗ , f ) − L(P, f )| ≤ ε and |U (P, f ) − S(P 0 , f )| ≤ ε: To see this, observe that if we choose t∗k ∈ [tk−1 , tk ] so that f (t∗k ) − inf_{tk−1 ≤t≤tk} f (t) < ε/(n||P ||), then

    0 ≤ S(P ∗ , f ) − L(P, f ) = Σ_{k=1}^{n} ( f (t∗k ) − inf_{tk−1 ≤t≤tk} f (t) ) ∆k t < Σ_{k=1}^{n} (ε/(n||P ||)) · ||P || = ε

A similar argument works for U (P, f ) − S(P 0 , f ). It follows easily that

A similar argument works for U (P, f ) − S(P 0 , f ). It follows easily that

A bounded function f : [a, b] → R is Riemann integrable if and only if the limits lim_{||P ||→0} L(P, f ) and lim_{||P ||→0} U (P, f ) exist and are equal.

Observe, however, that Riemann sums may be defined even when f is Banach space–valued, whereas Darboux sums, being dependent on sup’s and inf’s, make sense for real–valued functions only. 
From calculus, we know that the Riemann integral ∫_a^b f dt exists when f is continuous (or even piecewise continuous) on [a, b] — cf. also Thm. 4.4.4. When the function is too discontinuous, however, we run into trouble:

Example 4.4.2 Consider the Dirichlet function

    IQ (t) := 1 if t ∈ Q,    IQ (t) := 0 otherwise

where Q is the set of rational numbers. If P = {a = t0 < t1 < · · · < tn = b} is any partition of
[a, b], no matter how fine, we can always find tags t∗k , t0k ∈ [tk−1 , tk ] so that t∗k is rational, and t0k is
irrational. Thus IQ (t∗k ) = 1, IQ (t0k ) = 0. It follows that

    S(P ∗ , IQ ) = Σk 1 · (tk − tk−1 ) = b − a    S(P 0 , IQ ) = Σk 0 · (tk − tk−1 ) = 0

and thus S(P ∗ , IQ ), S(P 0 , IQ ) cannot be made to lie arbitrarily close to each other, no matter how fine the partition P . Thus lim_{||P ||→0} S(P ∗ , IQ ) does not exist. 

When the Riemann integral is first encountered in calculus, it is taught as “the area under a curve”: If f ≥ 0 is continuous, then ∫_a^b f dt is the area under the curve described by f , between t = a and t = b. For A ⊆ R, define the indicator function of A by

    IA (t) := 1 if t ∈ A,    IA (t) := 0 otherwise

Consider now IQ , where Q is the set of rational numbers. This function is very discontinuous. If we try to compute the area under this “curve” over the interval [0, 1] using the Riemann integral, we run into trouble: The Riemann integral ∫_0^1 IQ dt does not exist.
We can make a convincing argument that the area under the curve over the interval [0, 1] should
be zero, as follows: Use the fact that Q is countable to enumerate the rational numbers in [0, 1],
i.e. write [0, 1] ∩ Q = {qn : n ∈ N}. For any ε > 0, define
    Bn := [qn − ε/2^{n+1} , qn + ε/2^{n+1} ]    for n ∈ N,    f = I_{⋃n Bn}

The area under the curve of f is made up of (possibly overlapping) rectangles of height 1 centered at the rational numbers. Thus the area under f over [0, 1] is ≤ Σn 1 · (length of Bn ) = Σn ε/2^n = ε. It is also clear that 0 ≤ IQ ≤ f , and thus that the area under IQ is less than the area under f , i.e. that the area under IQ is ≤ ε. Since this is true for any ε > 0, we conclude that the area under IQ is 0.
Thus we have the following:

    The Riemann integral ∫_0^1 IQ dt is undefined, but it should be zero

The Riemann integral is simply not powerful enough to handle functions like IQ .
You may counter that a function such as IQ is pathological, and unlikely to be encountered in
practice. It is true that we chose it here simply to make a point. However, the following example
should cause you to feel uneasy about the assertion that IQ is “pathological”:

Example 4.4.3 Consider the function g(t) defined by

    g(t) = lim_{m→∞} lim_{n→∞} cos(m!πt)^{2n}

If t is a rational number, i.e. t = p/q where p ∈ Z, q ∈ N, then m! t ∈ Z for all m ≥ q. Consequently, cos(m!πt) = ±1 for all m ≥ q, and thus cos(m!πt)^{2n} = 1 for all n and all m ≥ q. It follows that g(t) = 1 when t is rational.
On the other hand, if t is irrational, then 0 ≤ | cos(m!πt)| < 1 for all m, and so 0 ≤ cos(m!πt)^{2n} < 1 for all m and n. Now if 0 ≤ x < 1, then x^n → 0. It follows that lim_{n→∞} cos(m!πt)^{2n} = 0 for all m, and thus that g(t) = 0 when t is irrational. Hence

    IQ = lim_{m→∞} lim_{n→∞} cos(m!πt)^{2n}

The “pathological” function IQ therefore appears as a limit of perfectly ordinary functions. 

Unlike the Lebesgue integral, the Riemann integral does not handle limits well. We now show that when the Riemann integral exists, then the Lebesgue integral (w.r.t. Lebesgue measure λ) does too, and the integrals’ values coincide:

Theorem 4.4.4 Let f be a bounded real–valued function on the compact interval [a, b].
Then

(a) f is Riemann integrable iff f is continuous λ–a.e.

(b) If f is Riemann integrable, then f is Lebesgue integrable, and the integrals are equal: ∫_a^b f dt = ∫_{[a,b]} f dλ.

Proof: Assume that f is Riemann integrable. Then we can choose a sequence Pn of successively finer partitions of [a, b] such that U (Pn , f ) − L(Pn , f ) < 1/n — see Remarks 4.4.1 for the definitions of U (Pn , f ) and L(Pn , f ). Define functions gn , hn on [a, b] as follows: For each n, gn (a) = hn (a) = f (a). If Pn = {a = t^n_0 < t^n_1 < t^n_2 < · · · < t^n_{mn} = b}, then gn , hn are step functions, with steps determined by Pn , defined as follows: If t ∈ (a, b], then t ∈ (t^n_{k−1} , t^n_k ] for some k, and we define

    gn (t) = inf{f (x) : t^n_{k−1} < x ≤ t^n_k }    hn (t) = sup{f (x) : t^n_{k−1} < x ≤ t^n_k }

Then gn , hn are clearly simple Borel functions, designed so that

    ∫_{[a,b]} gn dλ = L(Pn , f )    ∫_{[a,b]} hn dλ = U (Pn , f )

Moreover ⟨gn ⟩n is a bounded increasing sequence, with gn ≤ f , and ⟨hn ⟩n is a bounded decreasing sequence, with hn ≥ f . Define g = lim_n gn , h = lim_n hn . Then g, h are Borel functions, and by the Dominated Convergence Theorem we have ∫_{[a,b]} g dλ = lim_n L(Pn , f ) = ∫_a^b f dt and ∫_{[a,b]} h dλ = lim_n U (Pn , f ) = ∫_a^b f dt. Hence ∫_{[a,b]} (h − g) dλ = 0.
Now since h ≥ g, Lemma 4.3.8 implies that h = g λ–a.e. on [a, b]. Since g ≤ f ≤ h, we must have g = f = h λ–a.e., and thus ∫ f dλ = ∫ g dλ = lim_n L(Pn , f ) = ∫_a^b f dt. This proves (b).

Next, note that if t ∉ ⋃n Pn , and if h(t) = g(t), then f is necessarily continuous at t: For then g(t) = f (t) = h(t), i.e.

    lim_n inf{f (x) : t^n_{kn−1} < x ≤ t^n_{kn} } = f (t) = lim_n sup{f (x) : t^n_{kn−1} < x ≤ t^n_{kn} }

(where kn is such that t ∈ (t^n_{kn−1} , t^n_{kn} ]), and thus all values of f (x) must lie close to f (t) if x is close to t. Hence any discontinuity of f must belong to ⋃n Pn ∪ {t : g(t) ≠ h(t)}, a set of λ–measure zero. This shows that if f is Riemann integrable, then f is continuous λ–a.e., proving one direction of (a).
Conversely, suppose that f is continuous λ–a.e. Let Pn be a partition of [a, b] that divides it into 2^n subintervals of equal length, and construct simple Borel functions gn , hn as above. If f is continuous at t, then obviously lim_n gn (t) = f (t) = lim_n hn (t). Hence lim_n (hn − gn ) = 0 λ–a.e. By the DCT, we see that 0 = lim_n ∫_{[a,b]} (hn − gn ) dλ = lim_n (U (Pn , f ) − L(Pn , f )), from which Riemann integrability easily follows. □
d b b ∂
R R
Remarks 4.4.5 An oft–used fact in calculus is that dt a f (x, t) dx = a ∂t f (x, t) dx, provided
that ∂f ∂t is bounded — differentiation under the integral sign. Differentiation involves the taking of
limits. This canRbe justified via the DCT.
b Rb
Let G(t) := a f (x, t) µ(dx). We want to show that G0 (t0 ) = a ∂f ∂t (x, t0 ) µ(dx), under certain
commonly satisfied conditions: Suppose that there exists a µ–integrable function M (x) such that
| ∂f
∂t (x, t)| ≤ M (x) for all x, and all t ∈ (t0 − δ, t0 + δ) (where δ > 0). Let hhn in be a sequence of
non–zero reals such that hn → 0, and such that each |hn | < δ. Then
G(t0 + hn ) − G(t0 ) h Z b f (x, t + h ) − f (x, t ) i Z b
0 0 n 0
G (t0 ) = lim = lim µ(dx) = lim gn (x) µ(dx)
n hn n a hn n a

where gn (x) := f (x,t0 +hhnn)−f (x,t0 ) . Note that gn (x) → ∂f


∂t (x, t0 ). We claim that the sequence gn is
dominated by M . Indeed, by the Mean Value Theorem, there is, for each x and each n ∈ N, a
tnx ∈ (t0 − |hn |, t0 + |hn |) ⊆ (t0R − δ, t0 + δ) suchR that gn (x) = ∂f n
∂t (x, tx ), and thus |gn (x)| ≤ M (x).
Since M is µ–integrable, limn gn (x) µ(dx) = limn gn (x) µ(dx), and we are done. 
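The argument above can be checked numerically. The following sketch (our own choice of f (x, t) = e^{−tx²} on [0, 1] with µ = λ, so that |∂f /∂t| = x²e^{−tx²} ≤ 1 serves as M (x)) compares a central difference of G with the integral of ∂f /∂t:

```python
import math

def midpoint_quad(h, a, b, n=20_000):
    """Composite midpoint rule for the integral of h over [a, b]."""
    dx = (b - a) / n
    return sum(h(a + (k + 0.5) * dx) for k in range(n)) * dx

def G(t):
    return midpoint_quad(lambda x: math.exp(-t * x * x), 0.0, 1.0)

t0, eps = 0.7, 1e-4
central_diff = (G(t0 + eps) - G(t0 - eps)) / (2 * eps)
integral_of_partial = midpoint_quad(
    lambda x: -x * x * math.exp(-t0 * x * x), 0.0, 1.0)

# Differentiating under the integral sign: the two agree to high accuracy.
assert abs(central_diff - integral_of_partial) < 1e-6
```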

4.5 Chain Rule, Change of Variables


In Section 2.5 we saw how to obtain a new measure ν := µf −1 from a measure µ and a measurable
function f . We will return to this shortly, and discuss integrals with respect to this new measure.
But first: Here is another way of obtaining new measures from old:
Definition and Proposition 4.5.1 Suppose that f : (Ω, F, µ) → (R̄, B(R̄)) is a non–negative measurable function. Define a set mapping ν : F → R̄ by

    ν(A) := ∫_A f dµ = µ(f IA )

Then ν is a measure on (Ω, F).
f is called the µ–density of ν, or the Radon–Nikodým derivative of ν with respect to µ. We write f = dν/dµ.

Proof: We need only check that ν is countably additive. Suppose that A = ⋃_{k=1}^∞ Ak is a union of a family of mutually disjoint members of F. Put fn = Σ_{k≤n} f IAk . Then fn ↑ f IA , and so ∫ fn dµ ↑ ∫_A f dµ, by the MCT, i.e. lim_n ∫ fn dµ = ν(A). But ∫ fn dµ = Σ_{k≤n} ∫ f IAk dµ = Σ_{k≤n} ν(Ak ), and thus ∫ fn dµ ↑ Σ_{k=1}^∞ ν(Ak ) as n → ∞. We conclude that also lim_n ∫ fn dµ = Σ_{k=1}^∞ ν(Ak ), and hence that ν(A) = Σ_{k=1}^∞ ν(Ak ). □

The following proposition explains the notation dν/dµ:

Proposition 4.5.2 (Chain Rule)

Suppose that f : Ω → R̄+ and g : Ω → R̄ are measurable on (Ω, F, µ), and define ν by dν/dµ = f . Then

    ∫ gf dµ = ∫ g dν    (?)

i.e.

    ∫ gf dµ = ∫ g (dν/dµ) dµ = ∫ g dν

This means that whenever one side of (?) exists, then so does the other, and the two sides are equal.
Proof: If g = IA is an indicator function, then ∫ IA f dµ = ν(A) = ∫ IA dν, by definition of ν. If g = Σ_{k≤n} αk IAk is simple, then ∫ gf dµ = Σ_{k≤n} αk ∫ IAk f dµ = Σ_{k≤n} αk ν(Ak ) = ∫ g dν, by linearity of the integral. So the result holds for simple g.
If g is a non–negative measurable function, we may choose simple gn ↑ g, by Proposition 2.4.3. Since f ≥ 0, we have also gn f ↑ gf . Then by the MCT and the fact that the result holds for simple gn , we obtain ∫ gf dµ = lim_n ∫ gn f dµ = lim_n ∫ gn dν = ∫ g dν.
Finally, if g is an arbitrary measurable function, then ∫ |gf | dµ = ∫ |g|f dµ = ∫ |g| dν, since f, |g| are non–negative. Hence gf is µ–integrable if and only if g is ν–integrable (by Proposition 4.1.9). Now split g into its positive and negative parts to see that ∫ gf dµ = ∫ g+ f dµ − ∫ g− f dµ = ∫ g+ dν − ∫ g− dν = ∫ g dν. □
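On a finite measure space the chain rule reduces to a finite-sum identity, and can be checked directly. A toy sketch (our own example) with counting measure on Ω = {0, 1, 2, 3} and density f (ω) = ω:

```python
Omega = [0, 1, 2, 3]
mu = {w: 1.0 for w in Omega}          # counting measure
f = lambda w: float(w)                # density dν/dµ

# ν is determined by its atoms: ν({ω}) = f(ω) µ({ω}).
nu_atom = {w: f(w) * mu[w] for w in Omega}

def nu(A):
    """ν(A) = integral of f over A w.r.t. µ."""
    return sum(nu_atom[w] for w in A)

def integral_mu(h):
    return sum(h(w) * mu[w] for w in Omega)

def integral_nu(h):
    return sum(h(w) * nu_atom[w] for w in Omega)

g = lambda w: (w - 1) ** 2
# Chain rule: ∫ g f dµ = ∫ g dν.
assert integral_mu(lambda w: g(w) * f(w)) == integral_nu(g) == 14.0
assert nu([2, 3]) == 5.0
```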

Remarks 4.5.3 The above proof illustrates a useful technique, which David Williams1 calls the
standard machine. To prove something holds for all integrals of a certain type:

• First show that it holds for indicator functions;

• Use linearity to show that it holds for simple non–negative functions;

• Then use the MCT to lift the result to non–negative measurable functions;

• And finally split an arbitrary measurable f into its positive and negative parts, and use
linearity once again.


1
cf. his excellent (and short) book Probability with Martingales.

We now come back to the measure µf −1 defined in Section 2.5: Recall that if f : (Ω, F, µ) → (T, T ) is measurable, then the map

    µf −1 : T → R̄ : B ↦ µ(f −1 [B])

defines a measure on (T, T ) — see Proposition 2.5.1. Also if g : (T, T ) → (R̄, B(R̄)) is measurable, then so is g ◦ f : (Ω, F) → (R̄, B(R̄)), by Proposition 2.3.5.
The next result shows that the integrals ∫ g ◦ f dµ and ∫ g d(µf −1 ) are equal:

Proposition 4.5.4 (Change of Variables)


Given a measure space (Ω, F, µ), a measurable space (T, T ) and two measurable maps
f : Ω → T and g : T → R̄, then
    ∫ g ◦ f dµ = ∫ g d(µf −1 )

i.e. whenever one side of this equation exists, then so does the other, and the two sides
are equal.
Exercise 4.5.5 Prove Proposition 4.5.4.
[Hint: Use the standard machine for indicators, then simple — and then non-negative measurable g. For arbitrary measurable g, to check integrability, observe that ∫ |g ◦ f | dµ = ∫ |g| ◦ f dµ = ∫ |g| d(µf −1 ), because |g| is non–negative. Now split g into its positive and negative parts.] 
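The change-of-variables formula is likewise easy to verify on a finite space. A sketch (our own example): Ω = {0, …, 5} with the uniform measure, and f (ω) = ω mod 3:

```python
from fractions import Fraction

Omega = range(6)
mu = {w: Fraction(1, 6) for w in Omega}
f = lambda w: w % 3

# Image measure µf⁻¹ on T = f[Ω]: (µf⁻¹)(B) = µ(f⁻¹[B]).
T = {f(w) for w in Omega}
pushforward = {t: sum(mu[w] for w in Omega if f(w) == t) for t in T}

g = lambda t: t * t + 1

lhs = sum(g(f(w)) * mu[w] for w in Omega)       # ∫ g ∘ f dµ
rhs = sum(g(t) * pushforward[t] for t in T)     # ∫ g d(µf⁻¹)
assert lhs == rhs == Fraction(8, 3)
```

Exact rational arithmetic via `Fraction` makes the two sides agree identically, not just up to rounding.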

4.6 Definition of Expectation


We begin with two examples that lead up to the definition of the expected value of a random
variable.
Example 4.6.1 Let (Ω, F, P) be a probability space, and let X be a discrete random variable, i.e. a random variable with at most countably many values. In that case X can be written

    X = Σ_{k=1}^∞ xk IAk

where the xk are the values that X can take, and Ak = {ω ∈ Ω : X(ω) = xk } (so that the Ak are mutually disjoint). We are interested in the value of ∫ X dP, assuming that it exists. Using the Lebesgue Dominated Convergence Theorem, it is easy to see that

    ∫ X dP = Σ_{k=1}^∞ xk P(Ak ) = Σ_{k=1}^∞ xk P(X = xk )

(because Xn = Σ_{k=1}^n xk IAk is dominated by |X| = Σ_{k=1}^∞ |xk |IAk ). But this sum is just the definition of the expected value of a discrete random variable, i.e.

    E[X] = ∫ X dP

if X is a discrete random variable. 
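For a concrete discrete case (our own illustration): a fair die, where the sum Σ xk P(X = xk ) is finite and exact arithmetic is available:

```python
from fractions import Fraction

values = [1, 2, 3, 4, 5, 6]
probs = {x: Fraction(1, 6) for x in values}      # P(X = x_k)

# E[X] = sum of x_k * P(X = x_k)
EX = sum(x * probs[x] for x in values)
assert EX == Fraction(7, 2)

# The same recipe gives E[g(X)] for any g, e.g. g(x) = x^2.
EX2 = sum(x * x * probs[x] for x in values)
assert EX2 == Fraction(91, 6)
```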



Example 4.6.2 Let (Ω, F, P) be a probability space, and let X be a continuous random variable, i.e. a random variable that has a probability density function fX such that

    P(X ≤ x) = ∫_{−∞}^x fX (t) dt = ∫_{(−∞,x]} fX dλ

(This is the definition of a continuous random variable.) Now let PX := P ◦ X −1 be the law of X.
Recall that PX is a probability measure on (R, B(R)) with the property that

PX (B) = P(X ∈ B) B ∈ B(R)

If we define νX on (R, B(R)) so that

    dνX /dλ = fX    i.e.    νX (B) = ∫_B fX dλ

then we know that νX is a probability measure. Moreover, since

    νX (−∞, a] = ∫_{−∞}^a fX (x) λ(dx) = P(X ≤ a) = PX (−∞, a]    for all a ∈ R

it follows that νX = PX . This is because PX , νX agree on the π–system of intervals of the form (−∞, a] which generates B(R) — see Proposition 1.6.4.
In particular:

    fX = dPX /dλ

i.e. the density of a continuous random variable X is precisely the Radon–Nikodým derivative of the law of X w.r.t. Lebesgue measure!
Now let g : R → R be a Borel function and consider the integral ∫ g(X) dP, assuming that it exists. We use the Change of Variables formula (Proposition 4.5.4) to obtain

    ∫ g(X) dP = ∫ g ◦ X(ω) P(dω) = ∫ g(x) (P ◦ X −1 )(dx) = ∫ g(x) PX (dx)

However, by the Chain Rule (Proposition 4.5.2),

    ∫ g(x) PX (dx) = ∫ g(x) (dPX /dλ) dλ = ∫ g(x)fX (x) λ(dx)

and hence

    ∫ g(X) dP = ∫ g(x)fX (x) λ(dx)

But the integral on the right is just the elementary definition of the expected value of (a function of) a continuous random variable, i.e.

    E[g(X)] = ∫ g(X) dP

In particular, with g(x) := x, we have E[X] = ∫ X dP. 
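This identity can be tested numerically (our own sketch): for X standard normal and g(x) = x², the Monte Carlo estimate of ∫ g(X) dP should agree with the quadrature of ∫ g(x) fX (x) λ(dx), both being E[X²] = 1.

```python
import math
import random

random.seed(0)

def density(x):
    """Standard normal density f_X."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

g = lambda x: x * x

# Right-hand side: quadrature of g(x) f_X(x) by the midpoint rule on [-8, 8].
n, a, b = 8000, -8.0, 8.0
dx = (b - a) / n
quad = sum(g(a + (k + 0.5) * dx) * density(a + (k + 0.5) * dx)
           for k in range(n)) * dx

# Left-hand side: E[g(X)] as an integral over Ω, estimated by Monte Carlo.
N = 200_000
mc = sum(g(random.gauss(0.0, 1.0)) for _ in range(N)) / N

assert abs(quad - 1.0) < 1e-6      # E[X^2] = Var(X) = 1 exactly
assert abs(mc - quad) < 0.05       # Monte Carlo noise ~ sqrt(2/N)
```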

We now wipe the slate clean and redefine the expectation of a random variable as follows:

Definition 4.6.3 (Expectation)


Let (Ω, F, P) be a probability space. If X ≥ 0 or X ∈ L1 (Ω, F, P), then the expected value of X is defined to be

    E[X] := ∫ X dP
The two examples show that this definition will give the same results as the two earlier definitions
of expectation for discrete and continuous random variables respectively. Moreover, we now also
have a definition of the expected value of a random variable which is neither discrete nor continuous.
Some notation: If X is a random variable on a probability space (Ω, F, P) and F ∈ F, we define the integral E[X; F ] of X over F by

    E[X; F ] := ∫_F X dP := ∫ X IF dP

Now that we’ve defined the expectation of a random variable as its integral with respect to the
probability measure, several facts about expectation are immediately obvious:

Proposition 4.6.4 Let (Ω, F, P) be a probability space.

(a) E[c] = c for any constant random variable c.

(b) E[aX +bY ] = aE[X]+bE[Y ] for any random variables X, Y (whose expectations exist)
and any a, b ∈ R.

(c) If X ≥ Y are random variables, then E[X] ≥ E[Y ].

(d) If X, Y are random variables and X = Y almost surely, then E[X] = E[Y ] (where one
expectation exists if and only if the other exists).

Remark 4.6.5 We’ll stress once more the point that our definition of expectation as a Lebesgue
integral is superior to the definitions in elementary courses. Suppose that you want to prove the
following general theorem:
E[X + Y ] = E[X] + E[Y ]
If X, Y are discrete, this is easy. If X, Y have a joint density function, this is a little harder, but
doable. Suppose, however, that X, Y do not have a joint density function nor a joint probability
mass function. It is easy to construct such pairs. For example, suppose X is a standard normal
random variable, and that Y is a Bernoulli variable with values 1 and −1. In that case, the pair

(X, Y ) can assume uncountably many values, so there is no joint probability mass function. On the other hand, were the joint density f to exist, we would have f (x, y) ≠ 0 only for pairs (x, y) in a set whose total area is zero, namely the union of the two horizontal lines y = ±1. It follows that ∫_{R²} f (x, y) dx dy = 0, so f cannot be a density.
Of course, the fact that E[X + Y ] = E[X] + E[Y ] follows directly from the linearity of the
integral. 

Limit theorems about integration translate directly into limit theorems about expectation.

Theorem 4.6.6 (Convergence)


Suppose that X is a random variable and that Xn is a sequence of random variables on a
probability space (Ω, F, P) with the property that

Xn → X a.s.

Then

(a) If Xn ≥ 0 and Xn ↑ X, then E[Xn ] ↑ E[X].

(b) If Xn ≥ 0, then E[X] ≤ lim inf E[Xn ].

(c) If there is a non-negative random variable Y with finite expectation such that |Xn | ≤ Y
for all n, then E[|Xn − X|] → 0, so that E[Xn ] → E[X].

Proof: (a) is the Monotone Convergence Theorem, (b) is Fatou’s Lemma, and (c) is the Lebesgue
Dominated Convergence Theorem. a

4.7 Inequalities
We end this chapter by proving some important inequalities. For a random variable X, we define
its variance Var(X) by
Var(X) := E[(X − E[X])2 ]
i.e. Var(X) is an estimate of how far we expect an outcome X(ω) to be away from the mean E[X].

Proposition 4.7.1 Let (Ω, F, P) be a probability space.

• (Markov’s Inequality) If X is a random variable, and if g is a function which is increasing and non–negative, and whose domain includes the range of X (so that the composition g(X) is defined), then for any c in the domain of g,

    E[g(X)] ≥ g(c)P(X ≥ c)

• (Chebyshev’s Inequality) For ε > 0, we have

P(|X − E[X]| ≥ ε) ≤ ε−2 Var(X)



Proof: Let F = {ω ∈ Ω : X(ω) ≥ c} = {X ≥ c}. Markov’s inequality follows directly from the fact that (since g is increasing and non–negative)

    ∫ g(X) dP ≥ ∫_F g(X) dP ≥ ∫_F g(c) dP = g(c)P(X ≥ c)

Chebyshev’s Inequality is a direct consequence of Markov’s — cf. the next exercise. □

Exercise 4.7.2 (a) Show that if X ≥ 0 and c > 0, then P(X ≥ c) ≤ (1/c) E[X].

(b) Prove Chebyshev’s Inequality.
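Chebyshev’s Inequality is easy to probe by simulation (our own sketch): for X uniform on [0, 1], Var(X) = 1/12, so P(|X − 1/2| ≥ ε) ≤ 1/(12ε²).

```python
import random

random.seed(1)
N = 100_000
xs = [random.random() for _ in range(N)]       # X ~ Uniform[0, 1]

mean = sum(xs) / N
var = sum((x - mean) ** 2 for x in xs) / N
assert abs(mean - 0.5) < 0.01 and abs(var - 1 / 12) < 0.01

for eps in (0.3, 0.4, 0.45):
    tail = sum(1 for x in xs if abs(x - 0.5) >= eps) / N
    bound = (1 / 12) / eps ** 2                # Chebyshev upper bound
    assert tail <= bound + 0.01                # bound holds (with MC slack)
```

For this distribution the true tail is 1 − 2ε, so the Chebyshev bound is far from tight; the point is only that it always holds.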




Next, we tackle Jensen’s Inequality, whose importance is hard to overstate. This result states that if g : R → R is a convex function (roughly, a concave–up function), then E[g(X)] ≥ g(E[X]). We begin with a definition:

Definition 4.7.3 Let U be an open subinterval of R. A function g : U → R is said to be


convex if and only if for any x, y ∈ U and any λ ∈ [0, 1] we have

g(λx + (1 − λ)y) ≤ λg(x) + (1 − λ)g(y)

Remarks 4.7.4 The following remarks feature in the proof of Jensen’s inequality, the next propo-
sition, and should be digested thoroughly.

(a) Recall that if ~a, ~b are points in Rn , then {λ~a + (1 − λ)~b : 0 ≤ λ ≤ 1} is simply the line segment in Rn joining ~b to ~a. The point with coordinates (λx + (1 − λ)y, g(λx + (1 − λ)y)) is simply a point on the graph of g between x and y. On the other hand, the point (λx + (1 − λ)y, λg(x) + (1 − λ)g(y)) is a point on the line segment joining (y, g(y)) to (x, g(x)). These two points have the same x–coordinate, namely λx + (1 − λ)y. We can now interpret convexity geometrically: A function g is convex if and only if its graph lies below any chord joining two points on the graph of g.

(b) This means that g is concave up (i.e. g 00 ≥ 0 if g 00 exists).

(c) Let g : U → R, where U is an open subinterval of R. For u, v ∈ U , define ∆(u, v) = [g(u) − g(v)]/(u − v). Geometrically, ∆(u, v) is the slope of the chord joining u to v on the graph of g. Then g is convex if and only if u < v < w in U implies ∆(u, v) ≤ ∆(v, w). This is easy to see geometrically. A more rigorous proof: If u < v < w, define λ = (v − w)/(u − w). Then λ ∈ (0, 1) and v = λu + (1 − λ)w. It follows that

    g(v) ≤ λg(u) + (1 − λ)g(w) = [(v − w)/(u − w)]g(u) + [(u − v)/(u − w)]g(w)

Multiplying by u − w < 0 reverses the inequality:

    (u − v)g(v) + (v − w)g(v) ≥ (v − w)g(u) + (u − v)g(w)

Rearranging yields the result.

(d) Now if v < w in U and we let u ↑ v, then

(i) ∆(u, v) increases as u ↑ v.



(ii) ∆(u, v) ≤ ∆(v, w), and thus ∆(u, v) is bounded from above as u ↑ v.

Since a sequence which is increasing and bounded from above must converge, D−(v) = lim_{u↑v} ∆(u, v)
exists. Similar reasoning shows that D+(v) = lim_{w↓v} ∆(v, w) must exist for every v ∈ U . Thus
left– and right derivatives exist at every point v. Moreover, D− (v) ≤ D+ (v), because each
∆(u, v) is ≤ each ∆(v, w). If these limits are equal, then g is differentiable at v.

(e) A convex function is automatically continuous, and thus Borel measurable: For let v ∈ U . If
there is a discontinuity at v, then it is easy to see that either lim_{u↑v} ∆(u, v) or lim_{w↓v} ∆(v, w)
does not exist.

Proposition 4.7.5 (Jensen’s inequality)


Suppose that g : U → R is a convex function on an open interval U ⊆ R, and that X is a
random variable with values in U (almost surely) such that both X and g(X) have finite
expected values. Then
E[g(X)] ≥ g (E[X])

Proof: We use notation and results from Remarks 4.7.4. Let v ∈ U , and let D−(v) = lim_{u↑v} ∆(u, v)
and D+(v) = lim_{w↓v} ∆(v, w). Then D−(v), D+(v) both exist, and D−(v) ≤ D+(v). Now suppose that
m is a real number satisfying D− (v) ≤ m ≤ D+ (v), and that x ∈ U . We consider two cases: If (i)
x ≤ v, then ∆(x, v) ≤ D− (v) (since ∆(u, v) increases as u ↑ v) and thus ∆(x, v) ≤ m. It follows
that g(x) ≥ m(x − v) + g(v). Next, if (ii) x ≥ v, then ∆(v, x) ≥ D+ (v) (because ∆(v, w) decreases
as w ↓ v) and thus ∆(v, x) ≥ m. It follows that g(x) ≥ m(x − v) + g(v). Hence, in either case, we
have
        g(x) ≥ m(x − v) + g(v)        for all v ∈ U , x ∈ U , and D−(v) ≤ m ≤ D+(v)
Geometrically, this means that the graph of g lies above the graph of the line m(x − v) + g(v). Note
that both the graph of g and that of the line go through the point (v, g(v)).
We are now ready to prove Jensen’s inequality: Put v = EX. Then v ∈ U because X takes
values almost surely in U . We thus have

g(X) ≥ m(X − EX) + g(EX) whenever D− (EX) ≤ m ≤ D+ (EX)

If we now take expectations on both sides (i.e. if we integrate with respect to P on both sides),
then
        E[g(X)] ≥ m(EX − EX) + g(EX) = g(EX)
□
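The inequality is easy to observe numerically. A Monte Carlo sketch in Python (the choice of a Gaussian X and the convex function g(x) = x² is ours, not part of the notes):

```python
import random

random.seed(1)
n = 100_000
# X ~ N(0.5, 1), so EX ≈ 0.5 and E[X^2] = Var X + (EX)^2 ≈ 1.25.
xs = [random.gauss(0.5, 1.0) for _ in range(n)]

def g(x):
    return x * x          # convex on R

Eg = sum(g(x) for x in xs) / n     # sample estimate of E[g(X)]
gE = g(sum(xs) / n)                # g applied to the sample mean
assert Eg >= gE                    # Jensen: E[g(X)] >= g(E[X])
```

The gap Eg − gE ≈ 1.25 − 0.25 = 1 here is exactly Var X, since E[X²] − (EX)² = Var X for g(x) = x².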
Chapter 5

Products and Independence

5.1 Product Spaces


5.1.1 Introduction
We begin with some motivating examples:
Example 5.1.1 (a) Denote by µ(A) the area of a subset A of R2 . We know how to define µ on
rectangles, i.e. sets of the form A = B1 × B2 , where B1 , B2 are intervals in R: Indeed
µ(A) = λ(B1 ) × λ(B2 ) (∗)
where λ is Lebesgue measure. So µ is to be a measure on (R2 , B(R2 )) such that µ(B1 × B2 ) =
λ(B1 )λ(B2 ). Of course, many sets in B(R2 ) do not have the form B1 × B2 , and we would like
µ to be defined for them as well. So (∗) cannot serve as a definition of the area measure µ.
(b) In probability theory, it is quite natural to consider the product of two probability spaces.
Such products typically model sequences of independent experiments. For example, let Ω1 =
{H, T }, F1 = P(Ω1 ) and let P1 {H} = 1/2 = P1 {T }. Then (Ω1 , F1 , P1 ) models the tossing of a
fair coin. Now let Ω2 = {1, 2, . . . , 6}, F2 = P(Ω2 ) and P2 {1} = P2 {2} = · · · = P2 {6} = 1/6.
Then (Ω2 , F2 , P2 ) models the rolling of a fair die. The underlying set of the probability space
which models the combined random experiment “First toss a fair coin, and then roll a fair die”
can clearly be taken to be the cartesian product Ω = Ω1 × Ω2 . The natural σ–algebra will be
F = P(Ω1 × Ω2 ), and it is not hard to see that this σ–algebra is generated by the π–system
{B1 × B2 : B1 ∈ F1 , B2 ∈ F2 }. Now the event B1 × B2 ⊆ Ω consists of all those outcomes
ω = (ω1 , ω2 ) ∈ Ω1 × Ω2 having ω1 ∈ B1 and ω2 ∈ B2 . Thus B1 × B2 occurs in the combined
random experiment iff B1 and B2 occur in each of the individual experiments.
The probability measure associated with the combined random experiment would therefore
naturally satisfy
P(B1 × B2 ) = P1 (B1 )P2 (B2 ) (∗∗)
But not every event in P(Ω1 × Ω2 ) is of the form B1 × B2 , so (∗∗) cannot serve as a definition
of the combined measure P.
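On a finite product space, however, (∗∗) does pin down P completely, since every event is a finite disjoint union of singletons. A small Python sketch of the coin–die example above (the choice of the rectangle {H} × {even} is ours):

```python
from fractions import Fraction
from itertools import product

omega1, omega2 = ["H", "T"], [1, 2, 3, 4, 5, 6]
P1 = {w: Fraction(1, 2) for w in omega1}
P2 = {w: Fraction(1, 6) for w in omega2}

# Product measure on the 12-point space Omega1 x Omega2.
P = {(w1, w2): P1[w1] * P2[w2] for w1, w2 in product(omega1, omega2)}
assert sum(P.values()) == 1

# Check P(B1 x B2) = P1(B1) * P2(B2) on the rectangle {H} x {even}.
B1, B2 = {"H"}, {2, 4, 6}
lhs = sum(p for (w1, w2), p in P.items() if w1 in B1 and w2 in B2)
rhs = sum(P1[w] for w in B1) * sum(P2[w] for w in B2)
assert lhs == rhs == Fraction(1, 4)
```

Exact rational arithmetic (`Fraction`) avoids any floating-point quibbles in the check.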



The aim of this section is to construct, out of two measure spaces (S, S, µ), (T, T , ν) a new
measure space (S × T, S ⊗ T , µ ⊗ ν) satisfying the following requirements:

(i) A subset of S ×T is called a measurable rectangle if it has the form A×B, where A ∈ S, B ∈ T .
S ⊗ T will be the σ–algebra on S × T which is generated by the measurable rectangles, i.e. it
is the smallest σ–algebra on S × T which has all rectangles with measurable sides as members.
(Please note that this is an abstract definition: the sets S, T do not have to be R, so A × B
may look nothing like a rectangle in the plane.)

(ii) For each measurable rectangle A × B, we require that (µ ⊗ ν)(A × B) = µ(A) · ν(B)

Remarks 5.1.2 (a) A remark on notation: We will be working with functions of more than
one variable, and may integrate with respect to just one of those variables. Thus, for example,
∫ f (x, y) µ(dx) integrates the function f (x, y) over x, keeping y fixed. The integral
∫∫ f (x, y) µ(dx) ν(dy) is a double integral that first integrates f w.r.t. µ over the variable x,
and then integrates the function y 7→ ∫ f (x, y) µ(dx) w.r.t. ν over the variable y.

(b) Several times below, we will prove a result for finite measures, and then refer to a “standard
argument” to lift the result to σ–finite measures. This is done as follows: Suppose that µ is
σ–finite on (S, S), and that a result Φ has been proved to hold for finite measures. Since µ is
σ–finite, there exists a sequence of measurable sets An ↑ S such that µAn < ∞ for all n ∈ N.
The measures µn := IAn · µ are finite on (S, S), so that result Φ holds for the µn . By the MCT,
if f ∈ mS + , then
        µf = µ(lim_n f IAn ) = lim_n µn f

This is often enough to show that Φ holds for µ as well.




5.1.2 Products of Measure Spaces


Given two measurable spaces (S, S), (T, T ), we can construct a σ–algebra S ⊗ T on the cartesian
product S × T , as follows: Define projections πS : S × T → S, πT : S × T → T by

πS : (s, t) 7→ s πT : (s, t) 7→ t

The interpretation is as follows: (s, t) denotes a sample point in a space of “combined” outcomes:
i.e. s ∈ S occurred and t ∈ T occurred. Given such a combined outcome ω = (s, t), πS (ω) = s
measures which outcome occurred in S, and πT (ω) = t measures which outcome occurred in T .
Given that we know a combined outcome ω = (s, t), we should also know the component outcomes
s and t. Thus the projection mappings πS , πT should be measurable. The product σ–algebra
S ⊗ T is defined to be the smallest σ–algebra on S × T which makes these maps measurable. To
recapitulate:

Definition 5.1.3 Let (S, S) and (T, T ) be measurable spaces. Define projections πS :
S × T → S, πT : S × T → T by

πS : (s, t) 7→ s πT : (s, t) 7→ t

Then define S ⊗ T := σ(πS , πT ) to be the smallest σ–algebra for which both projections
are measurable.

Exercise 5.1.4 Above, we stated that S ⊗ T is the σ–algebra on S × T which is generated by


the measurable rectangles, but that is not how we define S ⊗ T . To verify this assertion, let
R := {A × B : A ∈ S, B ∈ T } be the set of all measurable rectangles. Note that R is a π–system.
Show that S ⊗ T = σ(R).
Hence the product σ–algebra is generated by the π–system of all measurable rectangles.
[Hint: A × B = (A × T ) ∩ (S × B), and A × T = πS−1 [A].] 

Exercise 5.1.5 Show that B(R2 ) = B(R) ⊗ B(R).


[Hint: Using Exercise 5.1.4, it is easy to see that B(R2 ) ⊇ B(R) ⊗ B(R). For the opposite direction,
show that any open set in R2 can be written as a countable union of sets of the form U × V , where
U, V are open intervals in R. Cf. footnote 3 in Chapter 1.] 

Suppose that (S, S, µ) and (T, T , ν) are measure spaces. We would like to construct a measure
µ ⊗ ν on (S × T, S ⊗ T ). One way that suggests itself is to define

(1)        (µ ⊗ ν)(B) := ∫ ( ∫ IB (s, t) ν(dt) ) µ(ds)        B ∈ S ⊗ T

Another is to define it as

(2)        (µ ⊗ ν)(B) := ∫ ( ∫ IB (s, t) µ(ds) ) ν(dt)        B ∈ S ⊗ T

Exercise 5.1.6 Check that

(µ ⊗ ν)(A × B) = µ(A) · ν(B) A ∈ S, B ∈ T

for both of the above possible definitions of µ ⊗ ν. 

We shall soon see that (i) the above definitions are both possible, and (ii) they coincide.
We first investigate the possibility of defining µ ⊗ ν in the above manner. To be able to perform
a double integral ∫∫ f (s, t) ν(dt) µ(ds) it is necessary that:

(i) for each s ∈ S, the map t 7→ f (s, t) must be T –measurable, so that we can calculate the inner
integral ∫ f (s, t) ν(dt);

(ii) the map s 7→ F (s) := ∫ f (s, t) ν(dt) must be S–measurable, so that we can calculate the
outer integral ∫ F (s) µ(ds).

The following lemma gives us what we need:

Lemma 5.1.7 Suppose that (S, S) and (T, T ) are measurable spaces, that µ is a σ–finite
measure on (S, S), and that f : S × T → R+ is S ⊗ T –measurable. Then

(i) For each t ∈ T , the map s 7→ f (s, t) is S–measurable.

(ii) The map t 7→ ∫ f (s, t) µ(ds) is T –measurable.

Proof: We apply the Monotone Class Theorem (Theorem 2.4.4). First assume that µ is a finite
measure, and let
H = {f ∈ mS ⊗ T : f is bounded and satisfies (i) and (ii)}
It is easy to verify that H is a vector space (we need the finiteness of µ in order to avoid expressions
of the form ∞ − ∞), and that each IA×B ∈ H, where A ∈ S, B ∈ T . By the MCT, H is
closed under bounded limits of increasing non–negative sequences. Moreover, the set R := {A × B :
A ∈ S, B ∈ T } is a π–system with the property that IR ∈ H for every R ∈ R, and thus by Thm.
2.4.4 every bounded S ⊗ T –measurable function belongs to H (since σ(R) = S ⊗ T ). Now each
non–negative measurable function f is the limit of bounded non–negative measurable functions
(f = limn f ∧ n), and thus another application of the MCT shows that every f ∈ m(S ⊗ T )+
satisfies (i) and (ii).
Now drop the assumption that µ is a finite measure. Because µ is σ–finite, we can choose
An ↑ S such that µ(An ) < ∞. The measures µn defined by dµn /dµ := IAn are finite measures,
and thus each map t 7→ ∫ f (s, t) µn (ds) is T –measurable (where f ≥ 0). Since ∫ f (s, t) µ(ds) =
lim_n ∫ f (s, t) µn (ds), the MCT implies that the result holds for µ. □

We now know that it is possible to define µ ⊗ ν in the ways indicated. What we don’t (yet)
know is that these constructions define a measure, and that they coincide.
For definiteness, we arbitrarily fix one of the above definitions:
Definition and Proposition 5.1.8 Suppose that (S, S, µ) and (T, T , ν) are σ–finite
measure spaces. Define a map µ ⊗ ν : S ⊗ T → R̄+ by
        (µ ⊗ ν)B := ∫∫ IB (s, t) ν(dt) µ(ds) = µs (ν t IB (s, t))        B ∈ S ⊗ T

Then µ ⊗ ν is a σ–finite measure on (S × T, S ⊗ T ), called the product measure of µ, ν.

Exercise 5.1.9 Prove the above result, i.e. show that µ ⊗ ν defines a σ–finite measure on (S ×
T, S ⊗ T ). 

The next two results show that (modulo certain conditions) we can calculate the integral w.r.t.
µ ⊗ ν as a double integral, and the order of integration doesn’t matter:
        ∫ f d(µ ⊗ ν) = ∫∫ f (s, t) ν(dt) µ(ds) = ∫∫ f (s, t) µ(ds) ν(dt)

We first show this for non–negative measurable functions:



Theorem 5.1.10 (Tonelli)


Suppose that (S, S, µ) and (T, T , ν) are σ–finite measure spaces. If f ∈ m(S ⊗ T )+ , then
        ∫ f d(µ ⊗ ν) = ∫∫ f (s, t) ν(dt) µ(ds) = ∫∫ f (s, t) µ(ds) ν(dt)        (∗)

Proof: We use the Monotone Class Theorem (Thm. 2.4.4). First assume that µ, ν are finite
measures. The result is obvious if f = IA×B , where A × B is a measurable rectangle (or cf. Exercise
5.1.6). The class
        H = {f ∈ m(S ⊗ T ) : f is bounded and satisfies (∗)}
is easily seen to satisfy the requirements of Thm. 2.4.4, which thus implies that H contains every
bounded S ⊗ T –measurable function. The result for arbitrary non–negative f follows by MCT.
A standard argument lifts the result to the case where µ, ν are merely σ–finite. a

As a by–product, we obtain the result that our two possible definitions of µ ⊗ ν as iterated
integrals coincide: If B ∈ S ⊗ T , then IB is a non–negative measurable function, and we may apply
Tonelli’s Thm.
For non–negative functions f , the integral ∫ f dµ always makes sense, but we may have ∫ f dµ =
∞. For arbitrary measurable f , we therefore have to be more careful:

Theorem 5.1.11 (Fubini)


Suppose that (S, S, µ) and (T, T , ν) are σ–finite measure spaces. If f ∈ L1 (S × T, S ⊗
T , µ ⊗ ν), then
        ∫ f d(µ ⊗ ν) = ∫∫ f (s, t) ν(dt) µ(ds) = ∫∫ f (s, t) µ(ds) ν(dt)

Here the map t 7→ µs f (s, t) belongs to L1 (T, T , ν) for ν–a.e. t ∈ T . Similarly, the map
s 7→ ν t f (s, t) belongs to L1 (S, S, µ) for µ–a.e. s ∈ S.
R
Proof: The result holds for |f |, by Tonelli’s Thm., and hence NS = {s ∈ S : ∫ |f (s, t)| ν(dt) = +∞}
is µ–null by Proposition 4.1.9. Similarly, NT = {t ∈ T : ∫ |f (s, t)| µ(ds) = +∞} is ν–null. Redefine
f (s, t) to be zero when either s ∈ NS or t ∈ NT ; this won’t affect the integral of f , by Theorem
4.3.9. The result follows by splitting f into positive and negative parts. □

Remarks 5.1.12 (a) Fubini’s Theorem allows the interchange of the order of integration, provided
the integrand is integrable w.r.t. the product measure. It follows from Fubini’s Theorem that

        ∫ ( ∫ f dν ) dµ = ∫ ( ∫ f dµ ) dν

provided that f ∈ L1 . See Exercise 5.1.13 for what can happen if f ∉ L1 .

(b) To check if f ∈ L1 (S × T, S ⊗ T , µ ⊗ ν), observe that Tonelli’s Theorem applies to |f |. Thus
if the double integral ∫∫ |f (s, t)| µ(ds) ν(dt) is finite, then f ∈ L1 (S × T, S ⊗ T , µ ⊗ ν), and
Fubini’s Theorem may be applied to f .

(c) Fubini’s Theorem also easily extends to arbitrary finite products: If (Si , Si , µi ) are σ–finite
measure spaces for i = 1, . . . , n, then

(i) S1 ⊗ · · · ⊗ Sn is the σ–algebra on S1 × · · · × Sn which is generated by the projections


πi : S1 × · · · × Sn → Si : (s1 , . . . , sn ) 7→ si . It is also generated by the family of measurable
“rectangles” R = {A1 × · · · × An : Ai ∈ Si for i = 1, . . . , n}.
(ii) µ1 ⊗ · · · ⊗ µn is the unique measure on S1 ⊗ · · · ⊗ Sn which assigns to every rectangle the
measure
        (µ1 ⊗ · · · ⊗ µn )(A1 × · · · × An ) = µ1 A1 · · · · · µn An
(iii) Fubini’s Theorem states that if f : S1 × · · · × Sn −→ R̄ is µ1 ⊗ · · · ⊗ µn –integrable, then
        ∫_{S1 ×···×Sn} f d(µ1 ⊗ · · · ⊗ µn ) = ∫_{S1} ( ∫_{S2} · · · ( ∫_{Sn} f dµn ) · · · dµ2 ) dµ1

and that any interchange of the order of integration is permissible.




Exercise 5.1.13 Let

        f (x, y) = (x2 − y 2 )/(x2 + y 2 )2

Show that

        ∫₀¹ ∫₀¹ f (x, y) λ(dy) λ(dx) = π/4        and        ∫₀¹ ∫₀¹ f (x, y) λ(dx) λ(dy) = −π/4

What can you conclude about

        ∫_{[0,1]×[0,1]} f d(λ ⊗ λ) ?
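For this particular f the inner integral has a closed form: the antiderivative of f (x, ·) is y/(x² + y²), so ∫₀¹ f (x, y) λ(dy) = 1/(1 + x²), and the outer integral is arctan 1 = π/4; the antisymmetry f (y, x) = −f (x, y) then gives −π/4 for the other order. A Python sketch checking this, and checking numerically that |f | is not (λ ⊗ λ)–integrable (midpoint sums for ∫∫ |f | keep growing as the grid is refined, reflecting the logarithmic divergence at the origin):

```python
import math

def f(x, y):
    return (x * x - y * y) / (x * x + y * y) ** 2

# Antisymmetry f(y, x) = -f(x, y): the iterated integrals differ only in sign.
assert f(0.3, 0.7) == -f(0.7, 0.3)

# Outer integral of the closed-form inner integral 1/(1+x^2), by midpoint rule.
N = 10_000
h = 1.0 / N
outer = sum(h / (1.0 + ((k + 0.5) * h) ** 2) for k in range(N))
assert abs(outer - math.pi / 4) < 1e-4

# |f| is NOT integrable on [0,1]^2: the midpoint sums grow without bound.
def abs_sum(n):
    step = 1.0 / n
    pts = [(k + 0.5) * step for k in range(n)]
    return sum(abs(f(x, y)) * step * step for x in pts for y in pts)

assert abs_sum(400) > abs_sum(100) > 1.0
```

Since ∫∫ |f | = ∞, Fubini’s Theorem does not apply, which is why the two iterated integrals may (and here do) disagree.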

Exercise 5.1.14 (a) Show that if X is a non–negative random variable and p > 0, then
E[X p ] = ∫₀^∞ p xp−1 P(X > x) λ(dx). In particular, E[X] = ∫₀^∞ P(X > x) dx when X ≥ 0.
[Hint: Observe that X p (ω) = ∫₀^{X(ω)} p xp−1 λ(dx) = ∫ p xp−1 I{0≤x<X(ω)} λ(dx). Now apply
Tonelli’s Theorem with the product measure P ⊗ λ.]

(b) The method of Breeden and Litzenberger: It is a well–known fact in mathematical


finance that the t = 0–price C(K) of a european call option with strike K and maturity T on
underlying asset S is given by the risk–neutral valuation formula:

C(K) = e−rT EQ [(ST − K)+ ]

where r is the c.c. riskless rate — a market observable — and Q is a so–called risk–neutral
measure, which is not directly observable. The method of Breeden–Litzenberger recovers the
distribution function F (x) := Q(ST ≤ x) and the density function f of the asset price ST at
maturity under the risk–neutral measure Q from market–observable prices of (liquid) vanilla

european calls C(K) (provided that C(K) is known for a largish set of strikes K). This
distribution can then be used to price more exotic european–style options, whose prices are not
market–observable.
Show that the risk–neutral distribution– and density functions of ST are given by:

        F (y) = 1 + e^{+rT} ∂C(y)/∂y        f (y) = e^{+rT} ∂²C(y)/∂y²

[Hint: by (a), C(y) = e^{−rT} ∫₀^∞ Q((ST − y)+ > x) λ(dx) = e^{−rT} ∫_y^∞ (1 − F (x)) λ(dx).]
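As an illustration (not part of the notes), in the Black–Scholes model C(K) is available in closed form and ST is lognormal under Q, so the Breeden–Litzenberger density formula can be verified directly: differentiating the call price twice by finite differences recovers the lognormal density. A Python sketch, with arbitrarily chosen parameters:

```python
import math

def Phi(x):                        # standard normal CDF via math.erf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

S0, r, sigma, T = 100.0, 0.05, 0.2, 1.0

def call(K):                       # Black–Scholes call price C(K)
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * Phi(d1) - K * math.exp(-r * T) * Phi(d2)

def density_bl(K, h=0.01):         # f(K) = e^{rT} C''(K) by central differences
    return math.exp(r * T) * (call(K + h) - 2 * call(K) + call(K - h)) / h**2

def density_exact(s):              # lognormal density of S_T under Q
    m = math.log(S0) + (r - 0.5 * sigma**2) * T
    v = sigma * math.sqrt(T)
    z = (math.log(s) - m) / v
    return math.exp(-0.5 * z * z) / (s * v * math.sqrt(2.0 * math.pi))

for K in (80.0, 100.0, 120.0):
    assert abs(density_bl(K) - density_exact(K)) < 1e-5
```

In practice C(K) is only observed at finitely many strikes with noise, so the second derivative must be estimated with care (interpolation or smoothing), but the identity itself is exact.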


5.2 Independence
The covariance Cov(X, Y ) and correlation ρX,Y between two random variables are defined by

        Cov(X, Y ) := E[(X − EX)(Y − EY )]        ρX,Y := Cov(X, Y ) / √(Var(X)Var(Y ))

when this integral exists. It is an important and well–known fact that independent random variables
are uncorrelated (i.e. have ρX,Y = 0):

Theorem 5.2.1 Suppose that X, Y ∈ L1 (Ω, F, P) are independent random variables.


Then XY ∈ L1 (Ω, F, P) also, and

E[XY ] = E[X] · E[Y ] i.e. Cov(X, Y ) = 0

The same result holds in the extended sense if X, Y ≥ 0.

Proof: Recall that X, Y are independent if and only if σ(X), σ(Y ) are independent σ–algebras.
We use the standard machine, but need to pay special attention to measurability.
If X, Y are both indicator functions,

X = IA , Y = IB

then Z
E[XY ] = IA∩B dP = P(A ∩ B) = P(A)P(B) = E[X]E[Y ]

because A ∈ σ(X), B ∈ σ(Y ), and thus A, B are independent events. Next if


        X = Σ_{j=1}^{n} aj IAj        Y = Σ_{k=1}^{m} bk IBk

are non–negative simple random variables (where the aj are distinct, as are the bk ) then each
Aj ∈ σ(X) and each Bk ∈ σ(Y ). Thus the Aj , Bk are independent, and it follows that
        E[XY ] = ∫ XY dP = Σ_{j,k} aj bk P(Aj ∩ Bk ) = ( Σ_j aj P(Aj ) )( Σ_k bk P(Bk ) ) = E[X]E[Y ]

again because P(Aj ∩ Bk ) = P(Aj )P(Bk ) for each j, k. If X, Y are non–negative random variables,
then by Proposition 2.4.3 we may choose simple non–negative random variables Xn ↑ X, Yn ↑ Y ,
with each Xn measurable w.r.t. σ(X), and each Yn measurable w.r.t. σ(Y ) (e.g. take
        Xn := Σ_{k=0}^{n2^n −1} (k/2^n ) I{k/2^n ≤ X < (k+1)/2^n } + n I{X ≥ n}
to be the usual staircase functions). In that case Xn , Yn are
independent (being measurable with respect to independent σ–algebras) and Xn Yn ↑ XY . By the
Monotone Convergence Theorem

        E[XY ] = lim_{n→∞} E[Xn Yn ] = lim_{n→∞} E[Xn ]E[Yn ] = lim_{n→∞} E[Xn ] · lim_{n→∞} E[Yn ] = E[X]E[Y ]

Finally if X, Y ∈ L1 , first observe that E[|XY |] = E[|X|]E[|Y |] (because the result has been proved
for the non–negative independent random variables |X|, |Y |), so that XY is integrable when X, Y
are integrable. Now split X, Y up into their positive and negative parts and apply linearity. □

Remarks 5.2.2 Note that it is not true in general that if X, Y ∈ L1 , then XY ∈ L1 . However,
the above theorem shows that this is the case if X, Y are independent. 
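A quick Monte Carlo illustration of the theorem, sketched in Python (the choice of a Gaussian X and an independent uniform Y is ours); a dependent pair is included to show that the identity genuinely can fail without independence:

```python
import random

random.seed(3)
n = 200_000
xs = [random.gauss(1.0, 1.0) for _ in range(n)]       # EX = 1
ys = [random.uniform(0.0, 2.0) for _ in range(n)]     # EY = 1, independent of xs

e_xy = sum(x * y for x, y in zip(xs, ys)) / n
e_x = sum(xs) / n
e_y = sum(ys) / n
# Independence: E[XY] = E[X]E[Y], up to Monte Carlo error.
assert abs(e_xy - e_x * e_y) < 0.02

# A dependent pair (Y = X) violates the identity: E[X^2] = 2 but (EX)^2 = 1.
e_xx = sum(x * x for x in xs) / n
assert abs(e_xx - e_x * e_x) > 0.5
```

The gap in the dependent case is of course just Var X, since Cov(X, X) = Var X.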

Actually, there is an easier proof of Theorem 5.2.1, if we adopt another point of view: If X, Y
are random variables, then (X, Y ) is a random vector, i.e. a map (Ω, F, P) → (R2 , B(R2 )).
Its distribution is a probability measure on (R2 , B(R2 )) given by µX,Y (B) = P{(X, Y ) ∈ B},
where B ∈ B(R2 ). If µX , µY are the distributions of X, Y respectively, then the product measure
µX ⊗ µY is another probability measure on (R2 , B(R2 )). It turns out that X, Y are independent iff
µX,Y = µX ⊗ µY :

Theorem 5.2.3 Let X, Y be random variables on (Ω, F, P). Let µX,Y , µX , µY be the
distributions of the random elements (X, Y ), X and Y . Then X, Y are independent iff
µX,Y = µX ⊗ µY .

Proof: Suppose that X, Y are independent. If A × B is a measurable rectangle in B(R2 ) =
B(R) ⊗ B(R), then

µX,Y (A × B) = P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B) = µX A · µY B = (µX ⊗ µY )(A × B)

Hence µX,Y and µX ⊗ µY agree on a π–system that generates B(R2 ) (the family of measurable
rectangles), so by Proposition 1.6.4 they are equal: µX,Y = µX ⊗ µY .
Conversely, if µX,Y = µX ⊗ µY , then if x, y ∈ R, we have

P(X ≤ x, Y ≤ y) = µX,Y ((−∞, x] × (−∞, y]) = µX (−∞, x] · µY (−∞, y] = P(X ≤ x)P(Y ≤ y)

Hence X, Y are independent (cf. Exercise 3.3.4). □

Exercise 5.2.4 Use the Change of Variables Theorem to show that

        ∫ XY dP = ∫ xy dµX,Y

Now prove Propn. 5.2.1 once more, using Fubini’s Theorem. 


Chapter 6

Spaces of Random Variables

6.1 Topological Vector Spaces


6.1.1 Normed Vector Spaces
Definition 6.1.1 A normed space is a pair (V, || · ||), where V is a vector space and || · ||
is a norm on V , i.e. a function || · || : V → R with the following properties:

(i) ||x|| ≥ 0 for all x ∈ V ;

(ii) ||x|| = 0 if and only if x = 0;

(iii) ||αx|| = |α| ||x|| for all x ∈ V and α ∈ K;

(iv) ||x + y|| ≤ ||x|| + ||y|| for all x, y ∈ V (Triangle Inequality);

The norm ||v|| of a vector v ∈ V should be interpreted as its length, as in the following standard
examples.

Examples 6.1.2 (a) V := R with ||v|| := |v| (the absolute value).

(b) V := Rn with the Euclidean norm

        ||v|| := ( Σ_{k=1}^{n} vk² )^{1/2}        where v = (v1 , . . . , vn )

(c) C with ||z|| = |z| (the modulus)




Here are some other norms on Rn :

Exercise 6.1.3 For x ∈ Rn , let x = (x1 , . . . , xn )


(a) Define || · ||1 on Rn by


||x||1 = |x1 | + · · · + |xn |
Show that || · ||1 is a norm on Rn .

(b) Define || · ||∞ on Rn by


||x||∞ = max{|x1 |, . . . , |xn |}
Show that || · ||∞ is a norm on Rn ;

We will give more interesting examples shortly.

Exercise 6.1.4 Let (V, || · ||) be a normed space.

(a) Prove that ||x|| = || − x|| for all x ∈ V .

(b) Prove that


        ||x − y|| ≥ | ||x|| − ||y|| |        for all x, y ∈ V

Here is a first look at function spaces:

Exercise 6.1.5 Suppose that [a, b] is a closed interval in R. Let C[a, b] be the set of all continuous
functions f : [a, b] → R.

(a) Show that C[a, b] is a vector space over the scalar field R, where the operations of addition and
scalar multiplication are defined pointwise.

(b) Define || · ||1 : C[a, b] → R by


Z b
||f ||1 := |f (t)| dt
a
Show that || · ||1 is a norm on C[a, b].

(c) Define || · ||∞ : C[a, b] → R by

||f ||∞ := sup{|f (t)| : t ∈ [a, b]}

Show that || · ||∞ is a norm on C[a, b].

If (V, || · ||) is a normed space, then we can define the distance d(v, w) between two elements
v, w ∈ V by
d(v, w) := ||v − w||
It is easy to see that the following holds:

Proposition 6.1.6 For any u, v, w in a normed space (V, || · ||):

(i) d(v, w) ≥ 0.

(ii) d(v, w) = 0 if and only if v = w.

(iii) d(v, w) = d(w, v).

(iv) d(v, w) ≤ d(v, u) + d(u, w) (∆–inequality)

The properties (i)-(iv) of the preceding proposition capture the notion of the distance between two
points.

Definition 6.1.7 If X is any set, then a distance function d : X × X → R which satisfies


conditions (i)-(iv) of Proposition 6.1.6 is called a metric on X. The pair (X, d) is called a
metric space.

Thus we see that:

Proposition 6.1.8 Every norm induces a metric or distance: If (V, || · ||) is a normed
space, then (V, d) is a metric space, where d : V × V → R is defined by

d(v, w) := ||v − w||

Once we have a metric, i.e. a concept of distance, we can use it to talk about limits and
convergence:

Definition 6.1.9 Suppose that (X, d) is a metric space, and that hxn i is a sequence in
X, and that x ∈ X. We say that

xn → x or that lim xn = x
n

if and only if d(xn , x) → 0.

In the above definition, note that hd(xn , x)i is a sequence of real numbers, so that we already
know what d(xn , x) → 0 means — see Appendix B.1:

d(xn , x) → 0 ⇐⇒ ∀ε > 0 ∃N ∀n ≥ N (d(xn , x) < ε)

We briefly introduce the following notions — see also Appendix B.4:



Definition 6.1.10 (i) A sequence hxn in in a metric space (X, d) is called a Cauchy
sequence iff ∀ε > 0 ∃N ∈ N ∀n, m ≥ N [d(xn , xm ) < ε], i.e. from some point
N onwards any two terms of the sequence (not necessarily successive) lie within a
distance of ε of each other.

(ii) A metric space (X, d) is said to be complete if every Cauchy sequence in X converges
(to a point in X).

(iii) A normed vector space which is complete (w.r.t. the metric induced by the norm) is
called a Banach space.

In Appendix B.4, it is proved that every Cauchy sequence in R converges. It follows easily that
Rn , equipped with the usual Euclidean norm, is a Banach space.

6.1.2 Inner Product Spaces


Next, we consider an additional structure on a real vector space V :

Definition 6.1.11 An inner product space is a pair (V, h·, ·i), where V is a vector space
over R and h·, ·i is an inner product on V , i.e. a function h·, ·i : V × V → R with the
following properties:

(i) hx, yi = hy, xi for all x, y ∈ V ;

(ii) hx, xi ≥ 0 for all x ∈ V ;

(iii) hx, xi = 0 if and only if x = 0;

(iv) hx, y + zi = hx, yi + hx, zi for all x, y, z ∈ V ;

(v) hαx, yi = αhx, yi for all x, y ∈ V and α ∈ R.

Rn is an inner product space when equipped with the usual dot product:

        hx, yi := x · y = Σ_{j=1}^{n} xj yj

where x = (x1 , . . . , xn ), y = (y1 , . . . , yn ).


We shall shortly present more interesting examples of inner product spaces. For the moment,
we note that every inner product induces a norm, in the same way that the dot product in Rn
yields a length: The (Euclidean) length of a vector x ∈ Rn is given by √(x · x). It turns out that if,
in an inner product space V , we define ||x|| := √hx, xi, then || · || is a norm on V .
To prove that, we need the following result:

Proposition 6.1.12 (Cauchy–Schwarz Inequality)*

If (V, h·, ·i) is an inner product space, then

        |hx, yi| ≤ √hx, xi √hy, yi

for all x, y ∈ V .
Moreover we have equality iff y is a scalar multiple of x.

* Also called the Cauchy–Bunyakovskii–Schwarz Inequality

Proof: Let (V, h·, ·i) be an inner product space. Note that for α ∈ R and x, y ∈ V we have
0 ≤ hαx − y, αx − yi = α2 hx, xi − αhx, yi − αhy, xi + hy, yi
= α2 hx, xi − 2αhx, yi + hy, yi
Now, with x, y held fixed, consider the right–hand side of the above inequality as a quadratic polyno-
mial in α. Since it is always non–negative, it can have at most one root, and thus its discriminant
is ≤ 0, i.e.
hx, yi2 − hx, xi hy, yi ≤ 0
which yields the result upon taking square roots.
We have hx, yi2 − hx, xi hy, yi = 0 only when this quadratic has a root, i.e. if there is α such
that hαx − y, αx − yi = 0. But then αx = y, i.e. y is a scalar multiple of x. □

Proposition 6.1.13 If (V, h·, ·i) is an inner product space, then the map || · || : V → R
given by
        ||x|| := √hx, xi
defines a norm on V .

Proof: We only verify the triangle inequality. Note that the Cauchy–Schwarz inequality states
that |hx, yi| ≤ ||x|| ||y||. Thus
||x + y||2 = hx + y, x + yi = hx, xi + 2hx, yi + hy, yi
= ||x||2 + 2hx, yi + ||y||2
≤ ||x||2 + 2||x|| ||y|| + ||y||2
= (||x|| + ||y||)2
□

Thus an inner product induces a norm, and a norm induces a metric. An inner product space
which is complete (w.r.t. the metric induced by the inner product) is called a Hilbert space. In
particular, every Hilbert space is also a Banach space.
In Rn , the dot product does not only induce a length; it also induces an angle: The angle θ
between two vectors x, y ∈ Rn is given by
        cos θ = (x · y) / (||x|| ||y||)

We can imitate this definition in an abstract inner product space (V, h·, ·i), and define the angle
between x, y ∈ V by
        cos θ := hx, yi / (||x|| ||y||)        where ||x|| := √hx, xi
By the Cauchy–Schwarz inequality it follows immediately that | cos θ| ≤ 1, so that this definition
makes sense. It also follows that | cos θ| = 1 if and only if x is a scalar multiple of y, i.e. iff x, y are
parallel. We can also define orthogonality in an abstract inner product space, in the obvious way:

Definition 6.1.14 Suppose that (V, h·, ·i) is an inner product space. We say that x, y ∈ V
are orthogonal, and write x ⊥ y, if and only if hx, yi = 0.
If G ⊆ V , we say that x ⊥ G iff ∀g ∈ G(x ⊥ g).

The following proposition is an easy exercise:

Proposition 6.1.15 Let V be an inner product space with induced norm || · ||.

(a) (Pythagoras’ Law) If v, w ∈ V , with v ⊥ w, then ||v + w||2 = ||v||2 + ||w||2

(b) (Parallelogram Law) If v, w ∈ V , then ||v + w||2 + ||v − w||2 = 2||v||2 + 2||w||2

6.1.3 Orthogonal Projection in Hilbert Spaces


If W is a linear subspace of Rn , then we can project any x ∈ Rn onto W . That is, we can represent
x as a sum
        x = x∥ + x⊥        where x∥ ∈ W, x⊥ ⊥ W
One can think of x∥ as the best approximation to x in W : It is the vector in W which lies closest
to x. We call x∥ the orthogonal projection of x onto W .
We would like to extend this idea of orthogonal projection to arbitrary Hilbert spaces. Suppose
that V is a Hilbert space, and that W is a closed linear subspace of V . If v0 ∈ V , we would like to
find the best approximation of v0 in W . This is the unique vector w0 with the properties that

(i) w0 ∈ W , and

(ii) ||v0 − w0 || = inf{||v0 − w|| : w ∈ W }, i.e. w0 is the vector in W that lies closest to v0 .

(iii) Moreover, (v0 − w0 ) ⊥ W .

The vector w0 satisfying (i)–(iii) is called the orthogonal projection of v0 onto W . Indeed, v0 =
w0 + (v0 − w0 ) decomposes v0 into a vector in W and a vector orthogonal to W .
It remains to show that orthogonal projections exist and are unique. To be able to do that, we
need an additional condition: We say that a linear subspace W of a Hilbert space V is closed if W
is itself a Hilbert space. This simply means that if hwn i is a sequence in W and wn → v for some
v ∈ V , then v ∈ W , i.e. that if a sequence in W converges, then the limit lies in W . Any linear
subspace of Euclidean space is automatically closed.

Proposition 6.1.16 Let V be a Hilbert space, and let W be a closed linear subspace of
V . Any v0 in V has a unique decomposition

        v0 = v0∥ + v0⊥        where v0∥ ∈ W, v0⊥ ⊥ W

v0∥ is called the orthogonal projection of v0 onto W .
Moreover, v0∥ is the best approximation to v0 in W , i.e. the vector in W which lies closest
to v0 :
        ||v0 − v0∥ || = inf{||v0 − w|| : w ∈ W }

Proof: (a) Uniqueness: If

        v0 = v0∥ + v0⊥ = u0∥ + u0⊥

where v0∥ , u0∥ ∈ W and v0⊥ , u0⊥ ⊥ W , then

        v0∥ − u0∥ = u0⊥ − v0⊥ =: x

is a vector with the properties that x ∈ W and that x ⊥ W . This implies that x ⊥ x, i.e. that
hx, xi = 0. Hence x = 0, and so v0∥ = u0∥ and v0⊥ = u0⊥ .
Existence: Let δ = inf{||v0 −w|| : w ∈ W }, and choose a sequence wn ∈ W such that ||v0 −wn || → δ.
We show that hwn in is a Cauchy sequence in W : for if ε > 0, we may choose N such that
||v0 − wn ||2 − δ 2 < ε whenever n ≥ N . By the Parallelogram Law it follows that if n, m ≥ N , then

        2ε + 2δ 2 > ||v0 − wn ||2 + ||v0 − wm ||2 = 2||v0 − ½(wn + wm )||2 + 2||½(wn − wm )||2 ≥ 2δ 2 + ½||wn − wm ||2

Since hwn in is a Cauchy sequence, and since W is closed, there is w0 ∈ W such that wn → w0 . We
will show that w0 = v0∥ . The fact that ||v0 − w0 || ≤ ||v0 − wn || + ||wn − w0 || (for all n ∈ N) then is
easily seen to imply that ||v0 − w0 || = δ. In particular, ||v0 − w0 || = inf{||v0 − w|| : w ∈ W }.
It remains to show that v0 − w0 ⊥ W . Given an arbitrary w ∈ W and a real λ ∈ R, we have
||v0 − w0 ||2 = δ 2 ≤ ||v0 − (w0 + λw)||2 , so that

−2λhv0 − w0 , wi + λ2 ||w||2 ≥ 0

Since this holds for all real λ we must have hv0 − w0 , wi = 0. (Another way to see this is to note
that the quadratic in λ has a unique root at λ = 0 and to calculate the discriminant.)
□
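In R⁴ (where every linear subspace is automatically closed) the decomposition can be computed explicitly by solving the normal equations hv0 − (a w1 + b w2 ), wi i = 0 for the coefficients of the projection. A self-contained Python sketch (the particular vectors are our arbitrary choices):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Subspace W = span{w1, w2} of R^4, and a vector v0 to project onto W.
w1 = [1.0, 0.0, 1.0, 0.0]
w2 = [0.0, 1.0, 1.0, 1.0]
v0 = [2.0, 1.0, 0.0, -1.0]

# Normal equations for v_par = a*w1 + b*w2: Gram matrix G times (a, b)
# equals (<v0, w1>, <v0, w2>); solve the 2x2 system by Cramer's rule.
g11, g12, g22 = dot(w1, w1), dot(w1, w2), dot(w2, w2)
b1, b2 = dot(v0, w1), dot(v0, w2)
det = g11 * g22 - g12 * g12
a = (b1 * g22 - b2 * g12) / det
b = (g11 * b2 - g12 * b1) / det

v_par = [a * x + b * y for x, y in zip(w1, w2)]      # orthogonal projection
v_perp = [x - y for x, y in zip(v0, v_par)]          # residual

# The residual is orthogonal to W ...
assert abs(dot(v_perp, w1)) < 1e-12 and abs(dot(v_perp, w2)) < 1e-12

# ... and v_par minimizes the distance: perturbing inside W never helps.
def dist2(c1, c2):
    w = [c1 * x + c2 * y for x, y in zip(w1, w2)]
    return sum((p - q) ** 2 for p, q in zip(v0, w))

assert all(dist2(a + da, b + db) >= dist2(a, b) - 1e-12
           for da in (-0.5, 0.0, 0.5) for db in (-0.5, 0.0, 0.5))
```

This is exactly least-squares regression: the normal equations characterize the best approximation, and the orthogonality of the residual is condition (iii) above.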

6.2 Lp Spaces
Definition 6.2.1 Suppose that (Ω, F, µ) is a measure space.

• If 1 ≤ p < ∞ (p need not be an integer), then Lp (Ω, F, µ) is defined to be the set of
all F–measurable functions f : Ω → R such that |f |p is µ–integrable.

• A function f : Ω → R is said to be essentially bounded iff there is a real number M such
that |f | ≤ M µ–a.e.
L∞ (Ω, F, µ) is defined to be the set of all essentially bounded F–measurable f : Ω → R.

• For each 1 ≤ p < ∞, we define a map || · ||p : Lp (Ω, F, µ) → R by

        ||f ||p := (µ|f |p )^{1/p}

• We also define a map || · ||∞ : L∞ (Ω, F, µ) → R by

        ||f ||∞ = inf{M : |f | ≤ M µ–a.e.}

The maps || · ||p are called Lp –norms, or just p–norms.

If the underlying measure space is understood from context, we shall write Lp instead of
Lp (Ω, F, µ).

Remarks 6.2.2 • Note that L1 is just the family of µ–integrable functions (cf. Propn. 4.1.7).

f
• Also, if S → R is F–measurable, then f ∈ Lp iff f p ∈ L1 iff ||f ||p < ∞.

• If f ∈ L∞ , then |f | ≤ ||f ||∞ µ–a.e. To see this, note that if M > ||f ||∞ , then |f | ≤ M µ–a.e., and hence {ω ∈ Ω : |f (ω)| > M } is a µ–null set. But then

{ω ∈ Ω : |f (ω)| > ||f ||∞ } = ∪n {ω ∈ Ω : |f (ω)| > ||f ||∞ + 1/n}

is a countable union of µ–null sets, and hence itself µ–null.
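As a concrete sanity check (ours, not part of the text), take µ to be counting measure on a finite set; then µ|f |^p is a finite sum and the p–norms can be computed directly. The helper name p_norm below is made up.

```python
# p-norm ||f||_p = (sum |f(w)|^p)^(1/p) under counting measure on a finite set;
# for p = infinity this reduces to the maximum of |f|.
def p_norm(values, p):
    if p == float("inf"):
        return max(abs(v) for v in values)
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

f = [3.0, -4.0]
print(p_norm(f, 1))              # 7.0
print(p_norm(f, 2))              # 5.0
print(p_norm(f, float("inf")))   # 4.0
```

Note that f ∈ Lp for every p here simply because the space is finite; the interesting containment questions arise only for infinite measure spaces.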




In probability theory, the spaces L1 (Ω, F, P) and L2 (Ω, F, P) are the spaces that occur most fre-
quently, for reasons that will become apparent in Section 6.3.
We shall show that the Lp spaces are almost Banach spaces, and that L2 is almost a Hilbert
space.

Lemma 6.2.3 The Lp spaces are vector spaces.



Proof: We show that Lp is closed under addition, and leave scalar multiplication as an easy
exercise.
If 1 ≤ p < ∞, and f, g ∈ Lp , then |f + g|^p ≤ (|f | + |g|)^p ≤ max{2|f |, 2|g|}^p ≤ 2^p (|f |^p + |g|^p ). Hence if f, g ∈ Lp , then ∫ |f + g|^p dµ ≤ 2^p ( ∫ |f |^p dµ + ∫ |g|^p dµ ) < ∞, i.e. ||f + g||p^p ≤ 2^p ( ||f ||p^p + ||g||p^p ) < ∞.
If p = ∞, and f, g ∈ L∞ , then clearly ||f + g||∞ ≤ ||f ||∞ + ||g||∞ < ∞. a

For the next theorem, note that if a, b ≥ 0, and if 1 < p, q < ∞ are such that p^{−1} + q^{−1} = 1, then

ab ≤ a^p /p + b^q /q

To see this, define h(t) = tb − t^p /p, and find the maximum of h (using differential calculus). Alternatively, apply the Arithmetic–Geometric Mean inequality.
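Young's inequality above is easy to test numerically. The following sketch (ours, not from the text) checks it on random inputs:

```python
# Numerical check of Young's inequality ab <= a^p/p + b^q/q for conjugates p, q.
import random

random.seed(0)
for _ in range(1000):
    a, b = random.uniform(0.0, 10.0), random.uniform(0.0, 10.0)
    p = random.uniform(1.01, 10.0)
    q = p / (p - 1.0)                       # Hölder conjugate: 1/p + 1/q = 1
    lhs, rhs = a * b, a**p / p + b**q / q
    assert lhs <= rhs + 1e-6 * (1.0 + rhs)  # small tolerance for rounding
print("Young's inequality verified on 1000 random samples")
```

Equality holds precisely when a^p = b^q, which is the case the calculus argument identifies as the maximum of h.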

Theorem 6.2.4 Let (Ω, F, µ) be a measure space and let f, g be real–valued F–measurable
functions.
(a) (Hölder's Inequality) Suppose that 1 ≤ p ≤ ∞ and that 1/p + 1/q = 1. If f ∈ Lp , and
g ∈ Lq , then f g ∈ L1 , and

||f g||1 ≤ ||f ||p ||g||q

(b) (Minkowski’s Inequality) Let p ≥ 1. If f, g ∈ Lp , then

||f + g||p ≤ ||f ||p + ||g||p

If 1/p + 1/q = 1, we say that p, q are Hölder conjugates.


Proof: (a) If p = 1 (so that q = +∞), then |f g| ≤ |f | ||g||∞ µ–a.e. and so ||f g||1 = µ|f g| ≤
µ|f | · ||g||∞ = ||f ||1 ||g||∞ < ∞.
If p > 1, put a = |f (ω)|/||f ||p and b = |g(ω)|/||g||q (we may assume ||f ||p , ||g||q > 0, since otherwise f g = 0 µ–a.e. and the inequality is trivial), and apply the remark just before the statement of the theorem to conclude

|f (ω)g(ω)| / ( ||f ||p ||g||q ) ≤ |f (ω)|^p / ( p ||f ||p^p ) + |g(ω)|^q / ( q ||g||q^q )
Integrating both sides w.r.t. µ yields the result.
(b) This relation is easy to prove if p = 1 or p = ∞. For 1 < p < ∞, note that q = p/(p − 1), and thus that |f + g|^{p−1} ∈ Lq (since |f + g|^{(p−1)q} = |f + g|^p is integrable, by Lemma 6.2.3). By Hölder's inequality,

||f + g||p^p ≤ ∫ |f | |f + g|^{p−1} dµ + ∫ |g| |f + g|^{p−1} dµ
            ≤ ||f ||p · || (f + g)^{p−1} ||q + ||g||p · || (f + g)^{p−1} ||q
            = ( ||f ||p + ||g||p ) ||f + g||p^{p−1}

Since ||f + g||p < ∞ by Lemma 6.2.3, dividing both sides by ||f + g||p^{p−1} yields the result (if ||f + g||p = 0 there is nothing to prove). a
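Both inequalities are easy to test numerically on a finite measure space, where ∫ h dµ becomes a weighted sum. In the sketch below (ours; the weights mu and the functions f, g are made up) p = 3 and q = 3/2 are Hölder conjugates:

```python
# Check Hölder and Minkowski on a finite measure space with weights mu.
def norm(f, p, mu):
    return sum(m * abs(x) ** p for x, m in zip(f, mu)) ** (1.0 / p)

mu = [0.2, 0.5, 1.3]                   # measure of each atom
f  = [1.0, -2.0, 0.5]
g  = [3.0, 0.25, -1.0]
p, q = 3.0, 1.5                        # 1/3 + 2/3 = 1

lhs = sum(m * abs(x * y) for x, y, m in zip(f, g, mu))      # ||fg||_1
assert lhs <= norm(f, p, mu) * norm(g, q, mu) + 1e-12       # Hölder

fg = [x + y for x, y in zip(f, g)]
assert norm(fg, p, mu) <= norm(f, p, mu) + norm(g, p, mu) + 1e-12  # Minkowski
print("Hölder and Minkowski hold for this example")
```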

It is now clear that || · ||p satisfies the following:



(i) ||f ||p ≥ 0, and ||f ||p = 0 iff f = 0 µ–a.e.

(ii) If α ∈ R, then ||αf ||p = |α| ||f ||p .

(iii) ||f + g||p ≤ ||f ||p + ||g||p


Thus || · ||p is almost a norm on Lp . The requirement that ||f ||p = 0 only if f = 0 doesn't hold: ||f ||p = 0 implies only that f = 0 almost everywhere. To get a bona fide norm, we must identify any two functions that are equal µ–a.e.:

Definition and Proposition 6.2.5 Let (Ω, F, µ) be a measure space, and let 1 ≤ p ≤
∞. Define a relation ≡ on Lp by f ≡ g iff f = g µ–a.e. Then ≡ is an equivalence relation^a
on Lp . Define [f ] := {g ∈ Lp : f ≡ g}. Then [0] = {g ∈ Lp : g = 0 µ–a.e.} is a vector
subspace of Lp , and [f ] = f + [0] := {f + g : g ∈ [0]}. Let

Lp (Ω, F, µ) = {[f ] : f ∈ Lp (Ω, F, µ)}

Then Lp is a vector space and the map, which by abuse of notation we also call || · ||p ,
which is defined by
||[f ]||p := ||f ||p
is a norm on Lp .
a
See Appendix A.4 for a discussion of equivalence relations.

Proof: That ≡ is an equivalence relation is straightforward, as is the statement that [0] is a vector
subspace of Lp . It is also easy to see that Lp is a vector space, if the operations are defined in
the natural way (e.g. [f ] + [g] := [f + g] — one must check that this is well–defined, i.e. that
if [f1 ] = [f2 ] and [g1 ] = [g2 ], then [f1 + g1 ] = [f2 + g2 ], but that is easy.) [0] is clearly the zero vector in Lp . Also, if [f1 ] = [f2 ], then f1 = f2 µ–a.e., and thus ∫ |f1 |^p dµ = ∫ |f2 |^p dµ, which shows
that ||f1 ||p = ||f2 ||p (in case p < ∞), and thus that || · ||p is well–defined on Lp . To see that it is
a norm, note that (i) ||[f ]||p = ||f ||p ≥ 0, and that ||[f ]||p = 0 iff f = 0 µ–a.e. iff [f ] = [0]; (ii)
||α[f ]||p = ||αf ||p = |α| ||[f ]||p , and (iii) ||[f ] + [g]||p = ||f + g||p ≤ ||[f ]||p + ||[g]||p .
In case p = ∞, it is also straightforward to see that || · ||∞ is a well–defined norm on L∞ . a

In practice, we usually don’t bother too much about the distinction between Lp and Lp .
Now that we know that || · ||p is a norm, we have a notion of convergence:

Definition 6.2.6 A sequence hfn in in Lp (Ω, F, µ) is said to converge to f ∈ Lp (Ω, F, µ)


in Lp (or in pth mean) iff ||fn − f ||p → 0.

We now have two notions of convergence for measurable functions: almost everywhere convergence,
and convergence in mean. We write
fn → f a.e.        fn → f in Lp

We will introduce a third notion of convergence in Section 6.4, and discuss the relationships between
these various notions.

Theorem 6.2.7 (Riesz–Fischer)


If (Ω, F, µ) is a measure space and 1 ≤ p ≤ ∞, then Lp (Ω, F, µ) is a Banach space.

Proof: Suppose that hfn in is a Cauchy sequence in Lp , i.e. that supm≥n ||fm −fn ||p → 0 as n → ∞
(see Remark B.4.2(b)).
First assume that p < ∞. Choose an increasing sequence hnk ik of natural numbers such that supm≥nk ||fm − fnk ||p < 2^{−k} . Then by the Monotone Convergence Theorem, || Σk |fnk+1 − fnk | ||p ≤ Σk ||fnk+1 − fnk ||p < ∞. By Proposition 4.1.9, we see that Σk |fnk+1 − fnk | < ∞ µ–
a.e., from which it follows easily that hfnk ik is a Cauchy sequence µ–a.e. Define f : Ω → R by
f (ω) = limk fnk (ω), if this limit exists, and f (ω) = 0 else. Then f is measurable, and fnk → f
µ–a.e. as k → ∞. Now by Fatou’s Lemma,

||f ||p ≤ lim inf_k ||fnk ||p < ∞

(because Cauchy sequences are bounded — cf. Lemma B.4.5), so that f ∈ Lp , and similarly

||f − fn ||p ≤ lim inf_k ||fnk − fn ||p ≤ sup_{m≥n} ||fm − fn ||p → 0 as n → ∞

Thus fn → f in Lp .
Next, assume that p = +∞. We have supm≥n |fm − fn | ≤ supm≥n ||fm − fn ||∞ µ–a.e., and thus
hfn in is a Cauchy sequence µ–a.e. Define f : Ω → R as above: f (ω) = limn fn (ω) if this limit
exists, and f (ω) = 0 otherwise. Then

|f | ≤ |fn | + |fn − f | = |fn | + lim_m |fn − fm | ≤ ||fn ||∞ + sup_{m≥n} ||fn − fm ||∞   µ–a.e.

so that f ∈ L∞ , and

|fn − f | = lim_k |fn − fk | ≤ sup_{m≥n} ||fn − fm ||∞   µ–a.e.

Hence ||fn − f ||∞ ≤ sup_{m≥n} ||fn − fm ||∞ → 0 as n → ∞, proving that fn → f in L∞ . a

The following result is easy, but very important:

Theorem 6.2.8 Let (Ω, F, µ) be a measure space. The map


h·, ·i : L2 (Ω, F, µ) × L2 (Ω, F, µ) → R : (f, g) 7→ ∫ f g dµ

is an inner product on L2 (Ω, F, µ) which induces the L2 –norm || · ||2 . Hence L2 (Ω, F, µ)
is a Hilbert space.

Proof: Suppose that f, g ∈ L2 (Ω, F, µ). By Hölder's inequality (or the Cauchy–Schwarz inequality), we have ||f g||1 ≤ ||f ||2 ||g||2 , so f g is integrable. It is now easy to see that hf, gi := ∫ f g dµ

defines an inner product on L2 (where we identify functions that are µ–a.e. equal to ensure hf, f i = 0
implies f = 0). Furthermore, the norm induced by this inner product is precisely
||f || := hf, f i^(1/2) = ( ∫ |f |^2 dµ )^(1/2) = ||f ||2

Thus L2 (Ω, F, µ) is an inner product space. Since it is also a Banach space by the Riesz–Fischer
Theorem, it is a complete inner product space, i.e. a Hilbert space. a

6.3 Lp –Spaces and Probability


Let us investigate the Lp –spaces in the context of a probability space.

6.3.1 Moments

Definition 6.3.1 (i) If X is a random variable on (Ω, F, P), then its pth moment is defined to be E[X^p ] (which exists iff X ∈ Lp (Ω, F, P)).

(ii) The variance of a random variable X ∈ L2 (Ω, F, P) is defined by Var(X) := E(X − EX)^2 = E X^2 − (EX)^2 . The first equality shows that Var(X) ≥ 0.

(iii) The covariance of two random variables X, Y ∈ L2 (Ω, F, P) is defined by


Cov(X, Y ) := E(X − EX)(Y − EY ) = EXY − (EX)(EY ).
(iv) The standard deviation of X ∈ L2 (Ω, F, P) is given by σX := Var(X)^(1/2) .

(v) The correlation of two random variables X, Y ∈ L2 (Ω, F, P) is defined by ρX,Y = Cov(X, Y )/(σX σY ).

Note the polarization identity


Cov(X, Y ) = (1/4) [ Var(X + Y ) − Var(X − Y ) ]
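The polarization identity is easy to verify numerically on a finite uniform sample space; the sketch below is ours, with made-up data:

```python
# Check Cov(X,Y) = (Var(X+Y) - Var(X-Y))/4 under the uniform measure
# on a four-point sample space.
def mean(v): return sum(v) / len(v)
def var(v):
    m = mean(v)
    return mean([(x - m) ** 2 for x in v])
def cov(x, y):
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])

X = [1.0, 2.0, 4.0, 8.0]
Y = [0.5, -1.0, 3.0, 2.0]
XpY = [a + b for a, b in zip(X, Y)]
XmY = [a - b for a, b in zip(X, Y)]
assert abs(cov(X, Y) - (var(XpY) - var(XmY)) / 4) < 1e-12
print("polarization identity verified")
```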
We know that for 1 ≤ p ≤ ∞, the space Lp (Ω, F, P) is a Banach space. If 1 ≤ p < ∞, then
Lp (Ω, F, P) consists of all those random variables that have pth moments. The norm on Lp is simply
the pth root of the pth absolute moment, i.e.
||X||p = (E|X|^p )^(1/p)

If p = ∞, then Lp (Ω, F, P) consists of all essentially bounded random variables. Hölder’s inequality
is simply the statement
E|XY | ≤ (E|X|^p )^(1/p) (E|Y |^q )^(1/q)

where X ∈ Lp , Y ∈ Lq and 1/p + 1/q = 1.
For probability theory, the following result is also useful:

Proposition 6.3.2 Let (Ω, F, P) be a probability space. If 1 ≤ p ≤ r ≤ ∞, then

||X||p ≤ ||X||r

for any random variable X, so that Lr (Ω, F, P) ⊆ Lp (Ω, F, P).


Moreover, if X ∈ L∞ , then
||X||∞ = lim_{p→∞} ||X||p

Proof: Note that if X ∈ Lr , then |X|^p ∈ L^{r/p} . Now p′ = r/p and q′ = r/(r − p) satisfy the relation 1/p′ + 1/q′ = 1, and so Hölder's inequality applied to f = |X|^p and g = 1 yields

||X||p^p = ∫ f g dP ≤ ||f ||p′ · ||g||q′ = ( ∫ |X|^r dP )^(p/r) · 1 = ||X||r^p

(Here ||g||q′ = 1 because P is a probability measure.) Taking pth roots gives ||X||p ≤ ||X||r .

Next, suppose that X ∈ L∞ . Then ||X||p ≤ ||X||∞ , so lim supp ||X||p ≤ ||X||∞ .
1
If M < ||X||∞ , then |X|p dP ≥ M p P(|X| > M ) and so ||X||p ≥ M P(|X| > M ) p . Now
R
1
P(|X| > M ) > 0, because M < ||X||∞ , and thus lim inf p ||X||p ≥ M (because P(|X| > M ) p → 1).
Since M was arbitrary, also lim inf p ||X||p ≥ ||X||∞ .
a

Propn. 6.3.2 states that if 1 ≤ p ≤ r ≤ ∞, and if the rth absolute moment of X exists, then so
does the pth absolute moment. Thus it is not possible, for example, for a random variable to have
a variance, but no mean.
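Both claims of Propn. 6.3.2 can be observed numerically on a finite probability space. The sketch below is ours (the values and probabilities are made up):

```python
# On a probability space, ||X||_p is nondecreasing in p and tends to ||X||_inf.
def Lp_norm(values, probs, p):
    return sum(pr * abs(v) ** p for v, pr in zip(values, probs)) ** (1.0 / p)

values = [1.0, 2.0, 5.0]
probs  = [0.5, 0.3, 0.2]                  # must sum to 1
norms = [Lp_norm(values, probs, p) for p in (1, 2, 4, 8, 50)]
assert all(a <= b + 1e-12 for a, b in zip(norms, norms[1:]))   # monotone in p
assert abs(Lp_norm(values, probs, 200) - 5.0) < 0.1            # -> ||X||_inf = 5
print(norms)
```

The monotonicity fails without the probability-measure assumption: under counting measure, for instance, ||f||_1 can exceed ||f||_2.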

6.3.2 L2 and Probability


A number of statistical notions are really geometric notions: L2 (Ω, F, P) is a Hilbert space, and
thus possesses geometry: There is an inner product hX, Y i := EXY , and this inner product induces
a notion of length ||X||2 as well as a notion of orthogonality hX, Y i = 0. These notions have nice
interpretations if we restrict ourselves to the space L20 (Ω, F, P) of centered random variables, i.e.
L2 –variables with zero mean. For then EXY = Cov(X, Y ), i.e. the covariance is the inner product
(when restricted to L20 ). Similarly, the standard deviation of X is simply the norm (length) of X:
σX = ||X||2 . Finally, the correlation is given by

ρX,Y = Cov(X, Y )/(σX σY ) = hX, Y i/( ||X||2 ||Y ||2 ) = cos θ

i.e. the correlation between two centered random variables can be interpreted as the cosine of angle
between them in L2 . In particular, two centered random variables are uncorrelated iff they are
orthogonal.
The Cauchy–Schwarz inequality is simply the statement that |E(XY )| ≤ ||X||2 ||Y ||2 , with
equality only if Y is a scalar multiple of X. If X, Y have zero mean, this amounts to saying
|Cov(X, Y )| ≤ σX σY , i.e. −1 ≤ ρX,Y ≤ 1, with equality only if Y is a scalar multiple of X. Since
ρX,Y = cos θ, this is not surprising.
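The cosine interpretation can be seen concretely on a finite uniform sample space; the following sketch is ours, with made-up data:

```python
# Correlation of centered random variables as the cosine of the angle in L^2
# (uniform probability on a four-point sample space).
import math

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

X = center([1.0, 3.0, 2.0, 6.0])
Y = center([2.0, 1.0, 4.0, 5.0])
inner = sum(a * b for a, b in zip(X, Y)) / len(X)    # <X,Y> = E[XY] = Cov(X,Y)
nX = math.sqrt(sum(a * a for a in X) / len(X))       # ||X||_2 = sigma_X
nY = math.sqrt(sum(b * b for b in Y) / len(Y))       # ||Y||_2 = sigma_Y
rho = inner / (nX * nY)                              # = cos(theta)
assert -1.0 <= rho <= 1.0                            # Cauchy-Schwarz
print(rho)
```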

Moreover, the mean E[X] of an L2 –random variable can be given a geometric interpretation:
The mean of X is that real number c∗ which lies closest to X in L2 , i.e.

c∗ = arg min_{c∈R} ||X − c||2

To see this, define f : R → R : c 7→ E[(X − c)2 ], where E[(X − c)2 ] = ||X − c||22 . To find the value
of c where f has a minimum, we solve f 0 (c) = 0 to obtain E[2(X − c)] = 0, and conclude that the
minimum occurs at c∗ = E[X]. This theme will be taken up again in the next chapter, when we
discuss conditional expectation.
The next exercise shows that we have a similar L1 –interpretation for the median of a random
variable.

Exercise 6.3.3 If X is a random variable, then a median (there may be more than one) of X is a
real number c∗ so that P(X ≤ c∗ ) = 1/2.
Assume that X has a density function hX (x). Show that if we define f (c) := E[|X − c|], then
f (c) = ∫_{−∞}^{c} (c − x) hX (x) dx + ∫_{c}^{∞} (x − c) hX (x) dx
     = 2 ∫_{−∞}^{c} (c − x) hX (x) dx + ∫_{−∞}^{∞} (x − c) hX (x) dx
     = 2 ∫_{−∞}^{c} (c − x) hX (x) dx + E[X] − c

Deduce that f has a minimum at a number c∗ which satisfies

P(X ≤ c∗ ) = ∫_{−∞}^{c∗} hX (x) dx = 1/2

Conclude that a median c∗ of X is a real number which lies closest to X in L1 , i.e.

c∗ = arg min_{c∈R} ||X − c||1
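Both variational characterizations (mean as L2-minimizer, median as L1-minimizer) can be checked by a brute-force grid search on a finite uniform sample; the sketch below is ours, with made-up data:

```python
# The mean minimizes c -> E[(X-c)^2]; a median minimizes c -> E|X-c|.
data = [1.0, 2.0, 2.0, 3.0, 10.0]          # uniform probability on 5 outcomes
mean = sum(data) / len(data)               # 3.6
median = sorted(data)[len(data) // 2]      # 2.0 (middle order statistic)

def L2(c): return sum((x - c) ** 2 for x in data) / len(data)
def L1(c): return sum(abs(x - c) for x in data) / len(data)

grid = [i / 100 for i in range(-500, 1500)]
best_L2 = min(grid, key=L2)
best_L1 = min(grid, key=L1)
assert abs(best_L2 - mean) < 0.01
assert abs(best_L1 - median) < 0.01
print(best_L2, best_L1)
```

Note how the outlier 10.0 pulls the L2-minimizer (the mean) upward but leaves the L1-minimizer (the median) untouched.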

6.4 Convergence of Random Variables


6.4.1 Modes of Convergence

We list here various forms of convergence:



Definition 6.4.1 (Modes of Convergence)


Let (Ω, F, P) be a probability space, and let X1 , X2 , . . . and X be random variables.

(a) Xn converges to X almost surely, written Xn → X a.s., provided that the set
{ω ∈ Ω : Xn (ω) 6→ X(ω)} is P–null, i.e.

P(Xn → X) = 1

(b) Xn converges to X in probability, written Xn →P X, provided that

lim_n P(|Xn − X| > ε) = 0   for all ε > 0

(c) Xn converges to X in Lp , or in pth mean, written Xn →Lp X, iff each Xn ∈ Lp (Ω, F, P), and ||Xn − X||p → 0, i.e. iff E|Xn |^p < +∞ for all n and

E|Xn − X|^p → 0 as n → +∞

Remarks 6.4.2 (a) Many of the above notions can be extended to random vectors. For example, if Xn , X : (Ω, F, P) → (R^d , B(R^d )), we say that Xn →P X iff limn P(||Xn − X|| > ε) = 0 for all ε > 0. Proofs of the results below can easily be extended to the more general case.
(b) Just a note of caution: The limits in the above notions of convergence are not unique, but unique only up to a.s.–equality. Thus, for example, if Xn →P X and X = Y a.s., then also Xn →P Y . The same goes for a.s. convergence and convergence in Lp .


6.4.2 Convergence in Probability


Note that convergence in pth mean is just ordinary convergence in the metric space Lp (with metric
d(f, g) = ||f − g||p ). Convergence in probability is also determined by a metric. First note the
following:

Lemma 6.4.3 Suppose that f : R+ → R+ is an increasing bounded continuous function with the
properties that f (0) = 0, and that f (x) > 0 whenever x > 0. Then:
Xn →P X   iff   E f (|Xn − X|) → 0

Proof: (⇒): Let K be a bound for f , and let ε > 0. Choose δ such that 0 ≤ f (x) < ε whenever
0 ≤ x ≤ δ. Then
E f (|Xn − X|) = E[ f (|Xn − X|); |Xn − X| > δ ] + E[ f (|Xn − X|); |Xn − X| ≤ δ ]
              ≤ K P(|Xn − X| > δ) + ε P(|Xn − X| ≤ δ)
Since by assumption P(|Xn − X| > δ) → 0 as n → ∞, we obtain lim sup_n E f (|Xn − X|) ≤ ε. Since ε was arbitrary, lim_n E f (|Xn − X|) = 0.

(⇐): Let ε > 0. Then

f (ε)I{|Xn −X|>ε} ≤ f (|Xn − X|)I{|Xn −X|>ε} ≤ f (|Xn − X|)

and so
0 ≤ f (ε) lim sup_n P(|Xn − X| > ε) ≤ lim_n E f (|Xn − X|) = 0

Since f (ε) > 0, it follows that limn P(|Xn − X| > ε) = 0. a

Using the function f (x) = x^p ∧ 1, we see that:

Proposition 6.4.4 If p ≥ 1, then Xn →P X iff E[|Xn − X|^p ∧ 1] → 0.

Observe that 0 ≤ |Xn − X|^p ∧ 1 ≤ 1, so |Xn − X|^p ∧ 1 is always P–integrable.


With p = 1, we see that
Xn →P X   iff   E[|Xn − X| ∧ 1] → 0
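As a tiny illustration (ours, not from the text), take the uniform measure on a large finite set as a stand-in for Lebesgue measure on [0, 1], and let Xn be n times the indicator of a set of probability 1/n. Then E[|Xn − 0| ∧ 1] = 1/n → 0, so Xn →P 0, even though E|Xn| stays equal to 1:

```python
# E[|X_n - 0| ∧ 1] for X_n = n * indicator of a set of probability 1/n,
# under the uniform measure on {0, ..., N-1}.
N = 1000
def d(Xn):                                    # d(X_n, 0) = E[|X_n| ∧ 1]
    return sum(min(abs(x), 1.0) for x in Xn) / N

for n in (2, 10, 100):
    Xn = [float(n) if i < N // n else 0.0 for i in range(N)]
    assert abs(d(Xn) - 1.0 / n) < 1e-12       # metric -> 0: convergence in probability
    assert abs(sum(Xn) / N - 1.0) < 1e-12     # but E|X_n| = 1: no L^1 convergence
print("checked n = 2, 10, 100")
```

This is essentially the counterexample of Remark 6.4.9(b) below, viewed through the bounded metric.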

6.4.3 Relationships between Different Modes of Convergence


We now investigate how the various modes of convergence are related. We begin with a simple
result:

Proposition 6.4.5 If hXn in is a sequence of random variables, and if

Σ_{n=1}^{∞} E|Xn | < ∞

then the series Σ_{n=1}^{∞} Xn converges a.s., and the limit is an integrable random variable. Moreover,

E[ Σ_{n=1}^{∞} Xn ] = Σ_{n=1}^{∞} E[Xn ]

Proof: Suppose that Σ_{n=1}^{∞} E|Xn | < ∞ and define Sn := Σ_{k=1}^{n} |Xk | and Tn := Σ_{k=1}^{n} Xk . We must
show that hTn in converges almost surely to an integrable random variable T .
Now clearly hSn in is an increasing sequence of non–negative random variables. Define S :=
limn Sn , but note that S(ω) may be +∞ for some ω ∈ Ω. By the Monotone Convergence Theorem
E[S] = lim_n E[Sn ] = lim_n Σ_{k=1}^{n} E[|Xk |] = Σ_{k=1}^{∞} E[|Xk |] < ∞

Hence S is integrable. It follows that P(S = ∞) = 0 (by Proposition 4.1.9), so that for almost all ω ∈ Ω the series Σ_{n=1}^{∞} Xn (ω) is absolutely convergent, and hence convergent. Thus T := Σ_{k=1}^{∞} Xk = lim_n Tn exists a.s.

Now since |Tn | ≤ Sn ≤ S, and S is integrable, the Dominated Convergence Theorem ensures
that

E[ Σ_{n=1}^{∞} Xn ] = E[lim_n Tn ] = lim_n E[Tn ] = lim_n Σ_{k=1}^{n} E[Xk ] = Σ_{n=1}^{∞} E[Xn ]   a

Proposition 6.4.6 Xn →P X iff every subsequence hXnk ik has a subsubsequence hXnkj ij such that Xnkj → X a.s.
In particular, if Xn → X a.s., then Xn →P X.

Proof: (⇒): Fix a subsequence hXnk ik of hXn in . Then certainly also Xnk →P X, so we may choose
a subsubsequence hXnkj ij such that

E[|Xnkj − X| ∧ 1] < 2^{−j}

Then Σ_{j=1}^{∞} E[|Xnkj − X| ∧ 1] < ∞. Thus by Proposition 6.4.5 we have that Σj (|Xnkj − X| ∧ 1) converges a.s. Now if a series converges, the terms of the series must converge to zero, so |Xnkj − X| → 0 a.s., i.e. Xnkj → X a.s. as j → ∞.
(⇐): We use Propn. 6.4.4 with p = 1. Suppose that Xn does not converge to X in probability, i.e. that lim sup_n E[|Xn − X| ∧ 1] ≥ ε for some ε > 0. Choose a subsequence hXnk ik such that E[|Xnk − X| ∧ 1] ≥ ε/2 for all k. If hXnkj ij
is any subsubsequence, we cannot have Xnkj → X a.s., for then |Xnkj − X| ∧ 1 → 0 a.s., and hence
E[|Xnkj − X| ∧ 1] → 0 by the Dominated Convergence Theorem. a

Corollary 6.4.7 If Xn →P X and f : R → R is a.s. continuous at X, then also f (Xn ) →P f (X).

Proof: Given a subsequence hXnk ik , choose a subsubsequence such that Xnkj → X a.s. Then
f (Xnkj ) → f (X) a.s., because f is a.s. continuous at X. By Propn. 6.4.6, f (Xn ) →P f (X). a

Proposition 6.4.8 If Xn →Lp X for p ≥ 1, then Xn →P X.

Proof: Since |Xn − X|^p ∧ 1 ≤ |Xn − X|^p , we see that E[|Xn − X|^p ∧ 1] → 0 whenever E|Xn − X|^p → 0. By Propn. 6.4.4, we see that Xn →P X whenever Xn →Lp X. a

Remarks 6.4.9 So far we have seen that a.s.–convergence and Lp –convergence both imply con-
vergence in probability, and that convergence in probability implies weak convergence. Without
further assumptions, this is the best we can do:
(a) Convergence in Lp does not imply almost sure convergence: On the unit interval with Lebesgue
measure, consider the following sequence of intervals:

[0, 1], [0, 1/2], [1/2, 1], [0, 1/3], [1/3, 2/3], [2/3, 1], [0, 1/4], [1/4, 2/4], [2/4, 3/4], [3/4, 1], [0, 1/5], . . .

Let Xn be the indicator function of the nth interval on this list. It is clear that for sufficiently large n, ∫ Xn^p dλ is arbitrarily small, and thus that Xn converges to 0 in Lp for all p ≥ 1. However, Xn does not converge pointwise anywhere: For any x, there are infinitely many intervals on the list which contain x, and infinitely many which don't. It follows that lim sup_n Xn = 1, lim inf_n Xn = 0, i.e. that hXn in is nowhere convergent.

(b) Almost sure convergence does not imply convergence in Lp : On the unit interval with Lebesgue measure, let Xn = n · I_{[0, (1/n)^p ]} . Then Xn → 0 a.s., but E|Xn − 0|^p = 1 for all n ∈ N, so Xn does not converge to 0 in pth mean.

(c) Convergence in probability does not imply convergence in Lp or almost sure convergence: This
follows from the previous two examples. For instance, (a) is an example of a sequence which
converges in mean, and thus in probability, but not almost surely.
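The "typewriter" sequence of Remark (a) can be simulated directly. The sketch below (ours; the cutoff k < 60 is arbitrary) exhibits shrinking L^1 norms together with infinitely many returns to the value 1 at a fixed point:

```python
# The typewriter sequence: indicators of [j/k, (j+1)/k], k = 1, 2, 3, ...
intervals = [(j / k, (j + 1) / k) for k in range(1, 60) for j in range(k)]

def X(n, x):                             # n-th function of the sequence at x
    a, b = intervals[n]
    return 1.0 if a <= x <= b else 0.0

lengths = [b - a for a, b in intervals]  # = E|X_n - 0| under Lebesgue measure
assert lengths[-1] < 0.02                # the L^1 norms shrink toward 0

x = 0.3                                  # an arbitrary fixed point of (0, 1)
hits = [n for n in range(len(intervals)) if X(n, x) == 1.0]
assert len(hits) >= 50                   # x keeps landing inside an interval
print(lengths[-1], len(hits))
```

Every pass through the unit interval covers x at least once, so X_n(x) takes the value 1 infinitely often while the norms ||X_n||_1 = 1/k tend to 0.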

Exercise 6.4.10 (a) We show that rapid convergence in probability implies a.s. convergence.
Suppose that hXn in is a sequence of random variables. Say that hXn in converges to X rapidly
in probability if and only if

Σ_{n=1}^{∞} P(|Xn − X| > ε) < ∞   for every ε > 0

Show that, in that case, Xn → X a.s.

(b) Use (a) to prove that if Xn →P X, then hXn in has a subsequence hXnk ik such that Xnk → X a.s.

[Hint: (a) Suppose that Xn does not converge to X a.s. Explain why there is ε > 0 such that P(|Xn − X| > ε, i.o.) > 0.
Now use a Borel–Cantelli Lemma.
(b) Explain why we may choose an increasing sequence of natural numbers hnk ik such that P(|Xnk − X| > 2^{−k} ) < 2^{−k} for all k ∈ N. Given ε > 0, choose k0 such that 2^{−k0} < ε. Explain why
Σ_{k≥k0} P(|Xnk − X| > ε) < ∞. Now use (a).] 
Chapter 7

Conditional Expectation

7.1 Definition of Conditional Expectation


The concept of conditional expectation, in its modern form due to Kolmogorov, is quite abstract,
but it is one of the most important concepts in probability theory. We therefore begin with some
examples to familiarize ourselves with the idea of conditioning, and to provide some motivation.

7.1.1 Conditioning on an Event


In Section 3.1, we introduced the notion of the conditional probability P(B|A) that B occurs, given
that A has occurred:
P(B|A) := P(B ∩ A) / P(A)
Suppose that X is a random variable and that A is an event with positive probability. Before we know that A has occurred, we have a best estimate of what X will be, namely EX. If we are told that A has occurred, however, we will revise our expectation. For example, the expected value if a fair die is rolled is 3.5. However, if we are told that the outcome is even, we revise our estimate: the expected value is now 4. If we are told that the outcome is odd, the expected value is 3. Essentially, when we are told that an event A has occurred, we revise our probability measure: Each outcome outside A is now assigned measure 0, whereas each event inside A has its probability scaled up by dividing by P(A): P(B|A) = P(B)/P(A) for all B ⊆ A. We use the new measure P(·|A) on the sample space A to calculate our revised expectation: The expectation of X given A is
E[X|A] := ∫_A X dP(·|A) = (1/P(A)) ∫_A X dP = E[X; A] / P(A)
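The die example above is easy to compute by hand, but it can also serve as a minimal code sketch (ours, not from the text) of the formula E[X|A] = E[X; A]/P(A):

```python
# E[X|A] = E[X; A] / P(A) for a fair die: X = outcome, A = "outcome is even/odd".
omega = [1, 2, 3, 4, 5, 6]                 # fair die, P({w}) = 1/6
def cond_exp(X, A):                        # X: dict w -> value, A: event (a set)
    pA = len(A) / len(omega)               # P(A) under the uniform measure
    return sum(X[w] for w in A) / len(omega) / pA   # E[X; A] / P(A)

X = {w: w for w in omega}
assert cond_exp(X, {2, 4, 6}) == 4.0       # revised estimate given "even"
assert cond_exp(X, {1, 3, 5}) == 3.0       # revised estimate given "odd"
print(cond_exp(X, {2, 4, 6}), cond_exp(X, {1, 3, 5}))
```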

7.1.2 Conditioning on a σ–Algebra in a Discrete Probability Space


Suppose that we are working with a discrete probability space (Ω, F, P), where Ω = {ωn : n = 1, 2, . . . }, and each P{ωn } > 0. If G is a σ–subalgebra of F, then there is a partition B1 , B2 , . . .


which generates G, i.e. such that G = σ(B1 , B2 , . . . ). In the previous subsection, we defined E[X|G] for any event G ∈ G of positive probability. In particular, cn := E[X|Bn ] has been defined for each block Bn .
Consider now a random variable Z defined by

Z(ω) = cn when ω ∈ Bn

i.e. Z is a random variable which takes the constant value cn on the block Bn :
Z = Σn cn IBn

It is then clear that


The random variable Z is measurable with respect to the σ–algebra G.
Z is clearly an approximation of X. In fact, Z is the best approximation of X which is measurable
with respect to the σ–algebra G, where “best” is meant in a sense which we shall make fully precise
later.
Example 7.1.1 A fair die is rolled, and you get an amount equal to the outcome ω, but only if
it is even. Let X be your winnings, i.e. X(ω) = ωI{2,4,6} (ω). Let G = σ({2, 4, 6}, {1, 3, 5}). We
calculate the random variable E[X|G]:

E[X|G](ω) = E[X|{2, 4, 6}] = E[X; {2, 4, 6}] / P({2, 4, 6}) = ( (2 + 4 + 6)/6 ) / (3/6) = 4   if ω is even,
E[X|G](ω) = E[X|{1, 3, 5}] = E[X; {1, 3, 5}] / P({1, 3, 5}) = ( (0 + 0 + 0)/6 ) / (3/6) = 0   if ω is odd.


Since Z := E[X|G] is a random variable, we can try to integrate it. If Z(ω) = cn when ω ∈ Bn ,
i.e. if cn = E[X|Bn ], then

E[Z; Bn ] = cn · E[IBn ]                  because Z has value cn on Bn
          = E[X|Bn ] P(Bn )
          = ( E[X; Bn ] / P(Bn ) ) · P(Bn )
          = E[X; Bn ]
Arbitrary σ–algebras on arbitrary probability spaces need not be generated by countable partitions, i.e. there may be no "blocks". However, we can use an "averaging" property to make the above calculation independent of a generating partition. Since E[X; A] = ∫_A X dP is the integral — a kind of "average" — of the random variable X over the set A, it follows that X and Z = E[X|G] have the same integrals — "averages" — on each block. Now each G ∈ G is a union of blocks, i.e. G = ∪k Bnk for some sequence hBnk ik . Thus, assuming the integrals exist,

∫_G X dP = Σk ∫_{Bnk} X dP = Σk ∫_{Bnk} Z dP = ∫_G Z dP

i.e. E[Z; G] = E[X; G] for all G ∈ G.


We have shown:

Theorem 7.1.2 Let X be an integrable random variable on a discrete probability space


(Ω, F, P), and let G be a σ–subalgebra of F. Then the conditional expectation E[X|G] is a
random variable with the properties that

(a) E[X|G] is G–measurable.


(b) ∫_G E[X|G] dP = ∫_G X dP for all G ∈ G.

7.1.3 Conditioning on a σ–Algebra in a General Probability Space


To define E[X|G] on a general probability space, not necessarily discrete, we turn things upside
down, and make Theorem 7.1.2 into a definition:
Definition and Theorem 7.1.3 (Kolmogorov)
Suppose that (Ω, F, P) is a probability space and that X is a random variable in
L1 (Ω, F, P). Let G be a sub–σ–algebra of F. Then there exists a random variable Z
such that

(i) Z is G–measurable.

(ii) Z ∈ L1 (Ω, F, P), i.e. E(|Z|) < +∞.

(iii) For every set G ∈ G, we have


E[Z; G] = E[X; G],   i.e.   ∫_G Z dP = ∫_G X dP

Moreover, if Z′ is a random variable satisfying (i), (ii), (iii), then Z = Z′ a.s. Any random variable Z with the properties (i), (ii), (iii) is called a version of the conditional expectation of X given G. We write

Z = E[X|G] a.s.   or   Z = EG X a.s.

Definition 7.1.4 We define conditional expectation w.r.t. general random variables in the following manner:

E[X|Y ] := E[X|σ(Y )]
E[X|Y ] := E[X|σ(Y )]
To prove that conditional expectations exist, we give a geometric argument, involving approxi-
mation in a Hilbert space.
Before we start the proof, recall that a Hilbert space V is a vector space which is equipped with an inner product, which we will denote hv1 , v2 i. An inner product automatically induces a norm (length) and an angle:

||v|| = hv, vi^(1/2)        cos θ = hv1 , v2 i / ( ||v1 || ||v2 || )

Here θ is the angle between v1 and v2 . We say that v1 , v2 are orthogonal if hv1 , v2 i = 0. Hilbert
spaces are also complete, i.e. every Cauchy sequence in V converges (to a vector in V ).
Suppose that W is a complete subspace of V . We then have the notion of orthogonal projection onto W : Given any vector v ∈ V , there exists a unique decomposition

v = v^|| + v^⊥

with the following properties:

(1) v^|| ∈ W .
(2) v^⊥ ⊥ W , i.e. hv^⊥, wi = 0 for all w ∈ W .
(3) ||v − v^|| || = inf{||v − w|| : w ∈ W }.
Thus v^|| is the vector in W which is the best approximation of v: It lies at least as close to v as any other w ∈ W . v^|| is called the orthogonal projection of v onto W .
Recall also that L2 (Ω, F, P) is a Hilbert space, with inner product hX, Y i = EXY and induced norm ||X||2 = (E X^2 )^(1/2) . (We ignore here the small difference between L2 and L2 , i.e. between functions and their a.e.–equivalence classes.)

Proof of Thm. 7.1.3: First assume that X ∈ L2 (Ω, F, P). Note that L2 (Ω, G, P) is a closed
subspace of L2 (Ω, F, P), and thus there exists a decomposition
X =Z +Y where Z ∈ L2 (Ω, G, P) and Y ⊥ L2 (Ω, G, P)
Moreover, ||X − Z||2 = inf{||X − U ||2 : U ∈ L2 (Ω, G, P)}. Now Z is clearly G–measurable. Also, if
G ∈ G, then IG ∈ L2 (Ω, G) and so Y ⊥ IG . Hence
E[Z; G] = hZ, IG i = hX, IG i = E[X; G] all G ∈ G
It follows that Z = E[X|G] a.s.
For X ∈ L1 (Ω, F, P), we use an approximation argument. First assume that X ≥ 0, and for
n ∈ N, define Xn := X ∧ n. Then Xn ↑ X, and each Xn ∈ L2 (Ω, F, P). By the above, there are
Zn ∈ L2 (Ω, G, P) such that Zn = E[Xn |G] a.s. Next, note that if n ≤ m, then Xn ≤ Xm , and
thus Zn ≤ Zm a.s.: For if ε > 0, and Gε := {Zn − Zm > ε}, then Gε ∈ G, so that 0 ≤ ε · P(Gε ) ≤ E[Zn − Zm ; Gε ] = E[Xn − Xm ; Gε ] ≤ 0. Hence P(Gε ) = 0 for all ε > 0. Now {Zn > Zm } = ∪_{k∈N} G_{1/k} ,
and hence P(Zn > Zm ) = 0, i.e. Zn ≤ Zm a.s., for each pair n ≤ m. Taking the (countable)
intersection over all such pairs yields P(Zn is an increasing sequence) = 1, i.e. the sequence (Zn )n
is increasing a.s. Define Z = lim supn Zn . Then Z is G-measurable, and Zn ↑ Z a.s. If G ∈ G, then
by two applications of the MCT we have
E[Z; G] = lim_n E[Zn ; G] = lim_n E[Xn ; G] = E[X; G]

Hence Z = E[X|G] a.s.


The existence of E[X|G] for integrable X follows by decomposition into positive and negative
parts.
The a.s. uniqueness of E[X|G] is straightforward: If Z, Z′ are two versions of E[X|G], and ε > 0, then Gε := {Z − Z′ > ε} ∈ G, and hence 0 ≤ ε · P(Gε ) ≤ E[Z − Z′ ; Gε ] = E[X − X; Gε ] = 0. Arguing as above, we see that P(Z > Z′ ) = 0. By symmetry, P(Z′ > Z) = 0 as well, i.e. Z = Z′ a.s. a

Remarks 7.1.5 (a) If G = {∅, Ω}, the algebra of zero information, then

E[X|G] = EX

Recall that only the constant functions are G—measurable, and it is obvious that the best
constant approximation to X is EX. However, we must be precise, and show that EX is
a version of the conditional expectation. Since EX is just a number, we can regard it as a
constant function, so that EX is G–measurable. Moreover
∫_∅ X dP = ∫_∅ EX dP = 0   and   ∫_Ω X dP = ∫_Ω EX dP = EX

and thus ∫_G X dP = ∫_G EX dP for all G ∈ G.

(b) If X is G–measurable, then E[X|G] = X. Again, this is obvious: If X is G–measurable, then G


contains all the information needed to determine X. The best G–measurable “approximation”
of X is therefore X itself. Check the definition!

(c) If B ∈ G and P(B) > 0, then E[E[X|G]|B] = E[X|B]. To see this, note that
E[E[X|G]|B] = (1/P(B)) ∫_B E[X|G] dP = (1/P(B)) ∫_B X dP = E[X|B]

Thus the random variables X and E[X|G] have the same conditional expectations if we condition
over events in G. 

Example 7.1.6 Let (Ω, F, P) be the probability space which models the rolling of a fair die, and
let G be the σ–algebra which contains the information whether the outcome is odd or even, i.e. G =
σ({1, 3, 5}, {2, 4, 6}). Let X be a random variable with X(ω) = ω 2 . We want to determine E[X|G].
Now E[X|G] is G–measurable, and therefore constant on the sets A = {1, 3, 5} and B = {2, 4, 6}.
The integral of X over the set A is

E[X; A] = (1/6)(1^2 + 3^2 + 5^2 ) = 35/6

Since E[X|G] is constant on the set A, with value cA say, we must have
∫_A cA dP = ∫_A E[X|G] dP = ∫_A X dP = 35/6

which yields cA = 35/3. Similarly the value cB of E[X|G] on B must be cB = 56/3. We thus have

E[X|G] = (35/3) I{1,3,5} + (56/3) I{2,4,6}

Check that this satisfies all the requirements of a conditional expectation. 
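The block-by-block recipe of this example can be sketched in code (ours, not from the text; the function name cond_exp_partition is made up):

```python
# E[X|G] for G generated by the partition {odd, even}, with X(w) = w^2,
# on a fair die.  On each block B, E[X|G] is the constant E[X; B] / P(B).
omega = [1, 2, 3, 4, 5, 6]
blocks = [{1, 3, 5}, {2, 4, 6}]

def cond_exp_partition(X, blocks):
    out = {}
    for B in blocks:
        c = sum(X[w] for w in B) / len(B)   # uniform measure: E[X; B] / P(B)
        for w in B:
            out[w] = c
    return out

Z = cond_exp_partition({w: w**2 for w in omega}, blocks)
assert abs(Z[1] - 35 / 3) < 1e-12           # value on the odd block
assert abs(Z[2] - 56 / 3) < 1e-12           # value on the even block
print(Z[1], Z[2])
```

By construction Z is constant on each block, hence G-measurable, and has the same integral as X over each block, i.e. it satisfies (a) and (b) of Theorem 7.1.2.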



Example 7.1.7 Take Ω = (0, 1] with the σ–algebra of Borel sets and with P the Lebesgue measure
on (0, 1]. Define two random variables X, Y as follows:

X(x) = 3x^2        Y (x) = 1 if 0 < x ≤ 1/2,   4x if 1/2 < x ≤ 1

We want to find a version of E[X|Y ]. First we need to find the σ–algebra generated by Y , i.e. we
must describe the family of sets Y^{−1}(B), B ∈ B. Now Y^{−1}({1}) = (0, 1/2], which shows that this half–open interval is a smallest non–empty set in σ(Y ). This makes sense: Y cannot distinguish between any of the elements of (0, 1/2], and therefore neither can σ(Y ). Moreover, if 2 < a < b ≤ 4, then Y^{−1}(a, b) = (a/4, b/4) ⊆ (1/2, 1]. It is therefore clear that every open subset of (1/2, 1] belongs to σ(Y ), and thus that the family B(1/2, 1] of Borel subsets of (1/2, 1] is a subset of σ(Y ). It is now easy to see that A ∈ σ(Y ) iff there is B ∈ B(1/2, 1] such that either A = B or A = (0, 1/2] ∪ B. This completely describes the family σ(Y ). Since E[X|Y ] has to be σ(Y )–measurable, it must be constant on (0, 1/2]. Also, since (0, 1/2] ∈ σ(Y ), we must have

∫_{(0,1/2]} E[X|Y ] dP = ∫_{(0,1/2]} X dP = ∫_0^{1/2} 3x^2 dx = 1/8

so that the random variable E[X|Y ] must take the constant value 1/4 for x ∈ (0, 1/2]. The σ–algebra σ(Y ) can distinguish between all the points in the interval (1/2, 1]: For example, if we know that Y = 3, then we know that the outcome is x = 3/4. Thus if we know the value of Y for values 2 < Y ≤ 4, then we know the outcome, and if we know the outcome, we know the value of X. We therefore expect the best σ(Y )–measurable approximation of X over (1/2, 1] to be X itself, i.e. we expect E[X|Y ](x) = 3x^2 for 1/2 < x ≤ 1. Putting this together, define a random variable

Z(x) = 1/4 if 0 < x ≤ 1/2,   3x^2 if 1/2 < x ≤ 1

We just need to show that Z is a version of E[X|Y ]. By similar arguments as for Y , it is clear
that σ(Z) = σ(Y ), and thus that Z is σ(Y )–measurable. Now suppose that A ∈ σ(Y ). Then
either A = B for some B ∈ B(1/2, 1] or A = (0, 1/2] ∪ B. In the former case we obviously have E[Z; A] = E[X; A], because Z = X on the interval (1/2, 1]. In the latter case, we have

E[Z; A] = E[Z; (0, 1/2]] + E[Z; B] = 1/8 + E[X; B] = E[X; A]
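The key computation of this example — that X and Z have the same integral over the atom (0, 1/2] — can be checked numerically. The sketch below (ours) uses crude midpoint Riemann sums in place of the Lebesgue integrals:

```python
# Numerical check for Example 7.1.7: Z = E[X|Y] agrees with X "on average"
# over the atom (0, 1/2], and equals X on (1/2, 1].
N = 100000
xs = [(i + 0.5) / N for i in range(N)]          # midpoint grid on (0, 1]

X = lambda x: 3 * x**2
Z = lambda x: 0.25 if x <= 0.5 else 3 * x**2    # the candidate version of E[X|Y]

int_X_atom = sum(X(x) for x in xs if x <= 0.5) / N   # ~ int_0^{1/2} 3x^2 dx = 1/8
int_Z_atom = sum(Z(x) for x in xs if x <= 0.5) / N   # = 0.25 * (1/2) = 1/8
assert abs(int_X_atom - 0.125) < 1e-3
assert abs(int_Z_atom - 0.125) < 1e-3
print(int_X_atom, int_Z_atom)
```

On (1/2, 1] the two functions coincide pointwise, so equality of integrals over every A ∈ σ(Y) follows as in the text.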



7.2 Properties of Conditional Expectation


Theorem 7.2.1 (Properties of Conditional Expectation)
The following are true for random variables on a probability space (Ω, F, P) whenever the
expressions occurring inside a conditional expectation are integrable.
 
(a) E[ E[X|G] ] = EX;

(b) If X is G–measurable, then E[X|G] = X a.s.

(c) LINEARITY: E[a1 X1 + a2 X2 |G] = a1 E[X1 |G] + a2 E[X2 |G] a.s.

(d) POSITIVITY: If X ≥ 0, then E[X|G] ≥ 0 a.s.

(e) cMCT: If 0 ≤ Xn ↑ X, then E[Xn |G] ↑ E[X|G] a.s.

(f ) cFATOU: If Xn ≥ 0, then E[lim inf n Xn |G] ≤ lim inf n E[Xn |G] a.s.

(g) cDCT: If |Xn | ≤ Y (all n ∈ N) for some integrable Y , and if Xn → X, then
E[Xn |G] → E[X|G] a.s.

(h) PROJECTION: E[X · E[Y |G]] = E[E[X|G] · Y ] = E[E[X|G] · E[Y |G]].

(i) If Y is G–measurable, then E[Y X|G] = Y E[X|G] a.s.

(j) TOWER: If H ⊆ G, then E[E[X|G]|H] = E[E[X|H]|G] = E[X|H].

(k) INDEPENDENCE: If H is independent of σ(X) ∨ G, then E[X|G ∨ H] = E[X|G]
a.s.

Exercise 7.2.2 Prove Thm. 7.2.1(a), (b), (c), (d).


[Hint: (c) means that if Yk are versions of E[Xk |G] for k = 1, 2, then a1 Y1 + a2 Y2 is a version of
E[a1 X1 + a2 X2 |G].
For (d), let Z = E[X|G] a.s., and note that {Z < 0} = ⋃_{n∈N} {Z < −1/n} ∈ G.]
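When G is generated by a finite partition, E[X|G] is just the block average of X, and properties such as (a) and the TOWER property (j) can be verified exactly. The following Python sketch (a toy example, not from the text) does this with exact rational arithmetic:

```python
from fractions import Fraction

omega = range(6)
P = {w: Fraction(1, 6) for w in omega}           # uniform probability
X = {w: Fraction(w * w) for w in omega}          # X(w) = w²

def cond_exp(V, blocks):
    """E[V|G] where G is generated by the finite partition `blocks`."""
    Y = {}
    for B in blocks:
        # block average of V, weighted by P
        avg = sum(V[w] * P[w] for w in B) / sum(P[w] for w in B)
        Y.update({w: avg for w in B})
    return Y

G = [{0, 1}, {2, 3}, {4, 5}]                     # fine partition
H = [{0, 1, 2, 3}, {4, 5}]                       # coarser partition, σ(H) ⊆ σ(G)

EXG = cond_exp(X, G)
# (a): E[E[X|G]] = EX
assert sum(EXG[w] * P[w] for w in omega) == sum(X[w] * P[w] for w in omega)
# (j) TOWER: E[E[X|G]|H] = E[X|H], computed exactly
assert cond_exp(EXG, H) == cond_exp(X, H)
```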


Proof of Thm. 7.2.1(e)–(k):


(e): Suppose that 0 ≤ Xn ↑ X, and define Yn := E[Xn |G] a.s. By (d), (Yn ) is increasing a.s.
Define Y = lim sup_n Yn , so that Y is G–measurable and Yn ↑ Y a.s. If G ∈ G, then by the MCT,
E[X; G] = lim_n E[Xn ; G] = lim_n E[Yn ; G] = E[Y ; G]. Hence Y = E[X|G] a.s.
(f): Let Zn := inf_{k≥n} Xk , so that Zn ↑ lim inf_n Xn . Since Zn ≤ Xk whenever k ≥ n, we
have E[Zn |G] ≤ E[Xk |G] a.s. whenever k ≥ n, and hence E[Zn |G] ≤ inf_{k≥n} E[Xk |G] a.s. Now by
cMCT,

E[lim inf_n Xn |G] = lim_n E[Zn |G] ≤ lim_n inf_{k≥n} E[Xk |G] = lim inf_n E[Xn |G]

(g): Y ± Xn are non–negative random variables, so by cFATOU,

E[Y |G] + lim inf_n (±E[Xn |G]) = lim inf_n E[Y ± Xn |G] ≥ E[lim inf_n (Y ± Xn )|G] = E[Y |G] ± E[X|G] a.s.

Since E[Y |G] is integrable, it is finite a.s., and hence can be cancelled to yield lim inf_n (±E[Xn |G]) ≥
±E[X|G], which implies

E[X|G] ≤ lim inf_n E[Xn |G] ≤ lim sup_n E[Xn |G] ≤ E[X|G] a.s.

(h): This follows from the usual properties of projections if X, Y ∈ L2 (Ω, F, P): For suppose that
X = X∥ + X⊥ , Y = Y∥ + Y⊥ are decompositions of X, Y into components parallel and perpendicular
to L2 (Ω, G, P), so that X∥ = E[X|G], Y∥ = E[Y |G] (by the second proof of Thm. 7.1.3). Then

E[X · E[Y |G]] = ⟨X, Y∥⟩ = ⟨X∥, Y∥⟩ = E[E[X|G] · E[Y |G]]

because ⟨X⊥ , Y∥⟩ = 0. If X, Y are non–negative (but not necessarily square–integrable), we may define Xn := X ∧ n, Yn :=
Y ∧ n. Then 0 ≤ Xn ↑ X and 0 ≤ Yn ↑ Y , and Xn , Yn ∈ L2 (Ω, F, P). It follows by the MCT and
cMCT that

E[X · E[Y |G]] = lim_n E[Xn · E[Yn |G]] = lim_n E[E[Xn |G] · E[Yn |G]] = E[E[X|G] · E[Y |G]]

(i): If Y = IG is an indicator function with G ∈ G, and G0 ∈ G, then

E[Y E[X|G]; G0 ] = E[E[X|G]; G ∩ G0 ] = E[X; G ∩ G0 ] = E[Y X; G0 ]

Hence Y E[X|G] is a version of E[Y X|G]. The result now follows by linearity and cMCT.
(j): Consider first the case where X ∈ L2 (Ω, F, P). Since L2 (Ω, H, P) ⊆ L2 (Ω, G, P) ⊆ L2 (Ω, F, P)
are closed Hilbert subspaces, the result follows from the fact that projecting onto the larger subspace
and then onto the smaller one is the same as projecting directly onto the smaller one. Alternatively, let

Y := E[X|G] a.s.     Z := E[E[X|G]|H] = E[Y |H] a.s.

If H ∈ H ⊆ G, then
E[Z; H] = E[Y ; H] = E[X; H]

and hence Z is a version of E[X|H], i.e. E[E[X|G]|H] = E[X|H] a.s.
The fact that E[E[X|H]|G] = E[X|H] a.s. follows directly from (b), since E[X|H] is H–measurable, and hence G–measurable.
(k): Let Y := E[X|G]. Since Y is certainly G ∨ H–measurable, we must show that E[Y ; F ] =
E[X; F ] for all F ∈ G ∨ H. Now let C := {G ∩ H : G ∈ G, H ∈ H}, and let D := {F ∈ G ∨ H :
E[Y ; F ] = E[X; F ]}. First note that C ⊆ D: For if G ∈ G, H ∈ H, then E[X; G∩H] = E[XIG ]E[IH ],
by independence, and so
E[X; G ∩ H] = E[X; G]E[IH ] = E[Y ; G]E[IH ] = E[Y ; G ∩ H]
since Y IG is independent of H. It is straightforward to verify that C is a π–system that generates
G ∨ H, and that D is a λ–system. Hence by Dynkin’s Lemma (Thm. 1.6.3), D = G ∨ H.
□
Proposition 7.2.3 (Jensen’s inequality)
Suppose that g : U → R is a convex function on an open interval U ⊆ R, and that X is
a random variable with values in U (a.s.) such that both X and g(X) have finite expected
values. Then
E[g(X)|G] ≥ g (E[X|G])

Proof: We use notation and results from Remarks 4.7.4. Let v ∈ U , and let D− (v) = lim_{u↑v} ∆(u, v)
and D+ (v) = lim_{w↓v} ∆(v, w). Then D− (v), D+ (v) both exist, and D− (v) ≤ D+ (v). Now suppose that
m is a real number satisfying D− (v) ≤ m ≤ D+ (v), and that x ∈ U . We consider two cases: If (i)
x ≤ v, then ∆(x, v) ≤ D− (v) (since ∆(u, v) increases as u ↑ v) and thus ∆(x, v) ≤ m. It follows
that g(x) ≥ m(x − v) + g(v). Next, if (ii) x ≥ v, then ∆(v, x) ≥ D+ (v) (because ∆(v, w) decreases
as w ↓ v) and thus ∆(v, x) ≥ m. It follows that g(x) ≥ m(x − v) + g(v). Hence, in either case, we
have
g(x) ≥ m(x − v) + g(v)

for any v ∈ U , any x ∈ U , and any D− (v) ≤ m ≤ D+ (v).


We are now ready to prove Jensen's inequality: Put v = E[X|G] and m = D− (E[X|G]); note that m is then G–measurable. Then

g(X) ≥ m(X − E[X|G]) + g(E[X|G]) a.s.

If we now take conditional expectations on both sides, pulling the G–measurable factor m out by (i), then

E[g(X)|G] ≥ m(E[X|G] − E[X|G]) + E[g(E[X|G])|G] = g(E[X|G])

Some notation: we define

E[X|G|H] := E[E[X|G]|H]

We end this section with some simple examples:

Examples 7.2.4 (a) Suppose that X is a random variable in Lp (Ω, F, P), where p ≥ 1, and let Y
be a version of E(X|G). Then Y ∈ Lp (Ω, F, P) as well. This is because g(x) = |x|p is convex.
By Jensen’s inequality, we therefore have |E(X|G)|p ≤ E(|X|p |G) a.s., i.e. |Y |p ≤ E(|X|p |G). It
follows that E|Y |p ≤ E[E(|X|p |G)] = E|X|p < +∞, by (I).

(b) Suppose that X ∈ L2 (Ω, F, P) (i.e. that var(X) exists). If Y is a version of E(X|G), then
Y ∈ L2 (Ω, F, P) as well, by (a), and EY 2 ≤ EX 2 . Since EY = EX (by (I)), we thus have

var(Y ) = EY 2 − (EY )2 ≤ EX 2 − (EX)2 = var(X)

Thus var(Y ) ≤ var(X). This reflects the fact that Y , being cruder, can’t vary as much as X
can.
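Example (b) can be checked exactly on a small discrete space. The sketch below (illustrative, not from the text) computes E[X|G] for a partition-generated σ–algebra and verifies both var(E[X|G]) ≤ var(X) and the pointwise Jensen inequality E[X|G]² ≤ E[X²|G]:

```python
from fractions import Fraction

P = {w: Fraction(1, 6) for w in range(6)}
X = {w: Fraction((-1) ** w * w) for w in P}       # X = 0, -1, 2, -3, 4, -5
blocks = [{0, 1}, {2, 3}, {4, 5}]                 # partition generating G

def cond_exp(V, blocks):
    """E[V|G] for the σ-algebra generated by a finite partition."""
    Y = {}
    for B in blocks:
        avg = sum(V[w] * P[w] for w in B) / sum(P[w] for w in B)
        Y.update({w: avg for w in B})
    return Y

def var(V):
    m = sum(V[w] * P[w] for w in V)
    return sum((V[w] - m) ** 2 * P[w] for w in V)

Y = cond_exp(X, blocks)
assert var(Y) <= var(X)                           # var(E[X|G]) ≤ var(X)

# Jensen with g(x) = x², pointwise: E[X|G]² ≤ E[X²|G]
EX2G = cond_exp({w: X[w] ** 2 for w in X}, blocks)
assert all(Y[w] ** 2 <= EX2G[w] for w in P)
```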


7.3 The Radon–Nikodým Theorem


We begin with some definitions:

Definition 7.3.1 Let ν, µ be measures on a measurable space (Ω, F).

(i) We say that ν is absolutely continuous w.r.t. µ, and write ν ≪ µ, iff µ(A) = 0 implies
ν(A) = 0 for all A ∈ F.

(ii) µ, ν are equivalent iff ν ≪ µ and µ ≪ ν.

(iii) We say that µ, ν are mutually singular, and write µ ⊥ ν, iff there exists A ∈ F such
that µ(A) = 0 = ν(Ac ).

Remarks 7.3.2 By definition, two measures are equivalent iff they have the same null sets. It is
easy to show that, in that case, they have the same sets of positive measure, and (assuming they
are probability measures) they have the same sets of measure 1. 

Suppose that (Ω, F, µ) is a measure space, and that f ∈ F + . Recall from Propn. 4.5.1 that
there is a measure ν on (Ω, F), defined by
Z
ν(A) = f dµ
A


The map f is called the density, or Radon–Nikodým derivative, of ν w.r.t. µ, and is also denoted dν/dµ.
Furthermore, we showed that

∫ g dν = ∫ gf dµ

for any ν–integrable function g (cf. Propn. 4.5.2, the Chain Rule).
It should be clear that ν ≪ µ.
The Radon–Nikodým Theorem (below) states that this way of constructing an absolutely con-
tinuous measure is the only way to do so: If ν ≪ µ, then ν has a density, i.e. then ν(A) = ∫_A f dµ
for some non–negative measurable f .
We leave the following proposition as an exercise:

Proposition 7.3.3 (a) Let µ, ν, η be σ–finite measures on a measurable space (Ω, F), and
suppose that dν/dµ and dη/dν exist. Then dη/dµ exists, and

dη/dµ = (dη/dν) · (dν/dµ)

(b) If dν/dµ exists and dν/dµ > 0 µ–a.e., then dµ/dν exists and

dµ/dν = 1/(dν/dµ)   µ–a.e.

where dµ/dν may be defined to be an arbitrary constant (e.g. 0) where dν/dµ = 0.
In particular, µ, ν are equivalent measures.
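The chain rule in (a) can be illustrated numerically. In the sketch below (an assumed example, not from the text), µ is Lebesgue measure on (0, 1), dν/dµ = 2x and dη/dν = 3x/2, so the chain rule predicts dη/dµ = 3x²:

```python
def integrate(f, a, b, n=100_000):
    """Midpoint Riemann sum of f over (a, b)."""
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

dnu_dmu = lambda x: 2 * x          # density of ν w.r.t. Lebesgue measure µ
deta_dnu = lambda x: 1.5 * x       # density of η w.r.t. ν
deta_dmu = lambda x: deta_dnu(x) * dnu_dmu(x)   # chain rule: 3x²

# η(A) for A = (0.2, 0.7), computed via dη/dµ, should equal ∫_A 3x² dx.
eta_A = integrate(deta_dmu, 0.2, 0.7)
assert abs(eta_A - (0.7 ** 3 - 0.2 ** 3)) < 1e-6
```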

Here is the main result of this section:

Theorem 7.3.4 (Radon–Nikodým)

Suppose that ν, µ are σ–finite measures on (Ω, F) and that ν ≪ µ. Then dν/dµ exists, i.e.
there exists a µ–a.e. unique f ∈ mF + such that ν = f · µ.

Proof: We prove the result for the case where µ, ν are finite measures on (Ω, F) with ν ≪ µ.
If ν(Ω) = 0, then ν(A) = ∫_A 0 dµ for all A ∈ F, i.e. dν/dµ = 0. We may therefore assume that
ν(Ω) > 0, from which it follows that also µ(Ω) > 0.
Define the probability space (Ω̃, F̃, P) by

Ω̃ := Ω × {0, 1}     F̃ := {(A × {0}) ∪ (B × {1}) : A, B ∈ F}

and
P((A × {0}) ∪ (B × {1})) := (1/2)[µ(A)/µ(Ω) + ν(B)/ν(Ω)]

(It is straightforward to verify that (Ω̃, F̃, P) is a probability space.)
Now define a random variable X and a σ–subalgebra G̃ ⊆ F̃ by

X := IΩ×{1}     G̃ := {A × {0, 1} : A ∈ F}

Let Z := E[X|G̃], and note that

Z(ω, 0) = Z(ω, 1) for all ω ∈ Ω

since Z is G̃–measurable. Therefore we may define a function g : (Ω, F) → (R, B(R)) by

g(ω) := Z(ω, j)     j = 0, 1

Note that g = Z ◦ ιj , where ιj : (Ω, F) → (Ω̃, F̃) : ω 7→ (ω, j) is measurable. Hence g is F–
measurable. Furthermore, we may take 0 ≤ g ≤ 1, because X is an indicator function.
By definition of conditional expectation, we have E[X; A × {0, 1}] = E[Z; A × {0, 1}] for any
A ∈ F. Now clearly E[X; A × {0, 1}] = P(A × {1}) = ν(A)/2ν(Ω). On the other hand

E[Z; A × {0, 1}] = ∫_A g d((1/2)[µ/µ(Ω) + ν/ν(Ω)])

We may conclude that

∫_A g d((1/2)[µ/µ(Ω) + ν/ν(Ω)]) = ν(A)/2ν(Ω)     for all A ∈ F

and thus that

(1/2µ(Ω)) ∫_A g dµ + (1/2ν(Ω)) ∫_A g dν = (1/2ν(Ω)) ∫_A 1 dν

i.e. that

(1/2µ(Ω)) ∫_A g dµ = (1/2ν(Ω)) ∫_A (1 − g) dν     (∗)

Define therefore a new measure η on (Ω, F) by

η(A) := (1/2µ(Ω)) ∫_A g dµ = (1/2ν(Ω)) ∫_A (1 − g) dν

Then
dη/dµ = g/2µ(Ω)     dη/dν = (1 − g)/2ν(Ω)

and thus η ≪ µ and η ≪ ν.
With A := {ω ∈ Ω : g(ω) = 1} in (∗), we see that (1/2ν(Ω)) ∫_A (1 − g) dν = 0, and hence that
(1/2µ(Ω)) ∫_A g dµ = 0 also. It follows that µ({ω ∈ Ω : g(ω) = 1}) = 0, and since ν ≪ µ, that
ν({ω ∈ Ω : g(ω) = 1}) = 0. In particular ν({ω ∈ Ω : 1 − g(ω) = 0}) = 0, and hence dη/dν > 0 ν–a.e.
It follows that dν/dη exists, and that dν/dη = 1/(dη/dν) — cf. Propn. 7.3.3.
Now define f : (Ω, F) → (R, B(R)) by

f (ω) := (ν(Ω)/µ(Ω)) · g(ω)/(1 − g(ω))  if g(ω) < 1,  and f (ω) := 0  otherwise

Then

∫_A f dµ = ∫_A [g/2µ(Ω)] / [(1 − g)/2ν(Ω)] dµ = ∫_A (dη/dµ)(dν/dη) dµ = ∫_A (dν/dµ) dµ = ν(A)

and thus f = dν/dµ. □

Exercise 7.3.5 Extend the above proof to the case where µ, ν are σ–finite. 
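On a finite sample space the theorem is elementary, and the density can be written down directly: f(ω) = ν({ω})/µ({ω}) wherever µ({ω}) > 0. The following sketch (a toy example, not from the text) verifies ν(A) = ∫_A f dµ with exact rational arithmetic:

```python
from fractions import Fraction as F

# µ and ν on Ω = {1,2,3,4}; ν ≪ µ because ν vanishes wherever µ does.
mu = {1: F(1, 2), 2: F(1, 4), 3: F(1, 4), 4: F(0)}
nu = {1: F(1, 8), 2: F(3, 8), 3: F(1, 2), 4: F(0)}

# The density: f(ω) = ν({ω})/µ({ω}) on {µ > 0}, arbitrary (say 0) elsewhere.
f = {w: (nu[w] / mu[w] if mu[w] > 0 else F(0)) for w in mu}

# ν(A) = ∫_A f dµ, checked for several sets A ⊆ Ω:
for A in [{1}, {2, 3}, {1, 2, 3, 4}]:
    assert sum(nu[w] for w in A) == sum(f[w] * mu[w] for w in A)
```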
Appendix A

Logic and Sets

A.1 Logic
We introduce here a formal language for talking about mathematical objects. This language is
very precise, and unambiguous — properties which are largely absent from spoken languages such
as English, but obviously essential for mathematics. But, as a result, this language is rather
restricted in scope. The reason we use it is to make certain statements amenable to logical analysis.
The purpose of logical analysis is to decide whether a particular sentence/expression (e.g. about
mathematical objects) is true (T) or false (F). A sentence/expression that is either true or false
(but not both!) is called a statement.

Example A.1.1 Here are some typical examples of statements:


• 1 + 1 = 3.

• The equation x2 + 2x + 1 = 0 has a real root.

• Either x2 + a = 0 has a real root, or a > 0.

• There exist infinitely many prime numbers.

• Every continuous function is differentiable.


Note that a mathematical statement need not be true. 

More complicated statements in our formal language are built up from a collection of symbols,
including amongst others
• Symbols for objects, operations and relations;

• Logical Connectives;

• Quantifiers;
We will briefly discuss each of these in turn. None of this material is difficult, though it may take
a little while to get used to.


A.1.1 Symbols denoting Objects, Operations and Relations


When doing mathematics, we use symbols to denote certain mathematical objects, operations and
relations. For example, the expression
x + 3 ≤ √π
contains the following symbols:
(i) Symbols denoting fixed objects, namely the constants 3 and π;
(ii) A symbol denoting a variable object, namely x;

(iii) Symbols denoting operations, namely +, √;
(iv) A symbol denoting a relationship, namely ≤;
So our language will contain symbols for
• Variables
Typically we use the symbols x, y, z, x1 , x2 , x3 . . .
• Relation symbols
For example, if we want to talk about partial orderings, we will want a symbol ≤; if we want
to talk about sets, we will want symbols ∈ and ⊆.
The identity relation (the relation of being equal) will be denoted by =.
• Function/Operation symbols
For example, if we want to talk about arithmetic, we will want binary function symbols +, ×.
We may want unary function symbols − and ⁻¹. If we want to talk about sets, we will want
binary function symbols ∩, ∪, and unary function symbols ᶜ (complement) and P (power set);
• Constant symbols
These are specially named elements, and are often regarded as nullary function symbols. For
example, if we want to talk about addition, a distinguished element denoted by 0 plays an
important role. If we want to talk about sets, the set ∅ deserves its own name.
A formal language will generally not contain all of the above non–logical symbols, only those
needed to talk about the domain of discourse. The language will also have brackets (, ), [, ], etc.

A.1.2 Logical Connectives


Once we are able to make basic statements such as 1 > 0 and x = 3, we are able to combine them
using the logical connectives and, or, implies (then), not to make new statements such as
(1 > 0) and (x = 3); If x > 0 then y = 1; x ≱ 0
∧ and
¬ not
→ implies, then
∨ or
↔ if and only if

In probability theory, the set operations of union, intersection and complementation can be
interpreted as the logical connectives or, and and not.
In our formal language, the logical connectives have very precise meanings: If φ, ψ denote
statements, then

φ ∧ ψ is true ⇐⇒ both φ, ψ are true.


φ ∨ ψ is true ⇐⇒ at least one of φ, ψ is true, perhaps both.
φ → ψ is true ⇐⇒ whenever φ is true, so is ψ
i.e. it is not the case that φ is true but ψ is false.
φ ↔ ψ is true ⇐⇒ both φ → ψ and ψ → φ are true
i.e. iff φ, ψ are simultaneously true, or simultaneously false.
¬φ is true ⇐⇒ φ is false.

Here is a truth table for the logical connectives:

φ ψ φ∧ψ φ∨ψ φ→ψ φ↔ψ ¬φ


T T T T T T F
T F F T F F F
F T F T T F T
F F F F T T T

This means, for example, that if φ is true and ψ is false — in the second row of the table — then
φ ∧ ψ is false, φ ∨ ψ is true, φ → ψ is true, etc.
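The whole table can be generated mechanically, which is a useful way to internalise the definitions. A small Python sketch (illustrative only):

```python
from itertools import product

def implies(p, q):
    # φ→ψ is false only in the single case φ = T, ψ = F
    return (not p) or q

rows = []
for p, q in product([True, False], repeat=2):
    rows.append((p, q, p and q, p or q, implies(p, q), p == q, not p))

# Second row of the table: φ=T, ψ=F gives φ∧ψ=F, φ∨ψ=T, φ→ψ=F, φ↔ψ=F, ¬φ=F
assert rows[1] == (True, False, False, True, False, False, False)
```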
Now it is extremely important to note that the logical use of and ∧, or ∨, and implies →,
though related to their common usage in English, is certainly not identical to it. In particular the
truth value T or F of an expression such as φ ∧ ψ, φ ∨ ψ, φ → ψ etc. depends only on the truth
values of φ and ψ, and not on any meaning that the statements φ, ψ might possess! Let us discuss
some of the pitfalls:

φ ψ φ∧ψ
T T T
• And, ∧: T F F
F T F
F F F
To say that φ ∧ ψ is true simply means that both φ and ψ are true. It does not assert any connection
(causal or otherwise) between φ and ψ. This is not typically true in English. With the
English and, the following sentences have rather different meanings, but with the logical and
they mean the same thing:

1. Alice got drunk and failed her test.


2. Alice failed her test and got drunk.

φ ψ φ∨ψ
T T T
• Or, ∨: T F T
F T T
F F F
φ ∨ ψ is true precisely when at least one of φ, ψ is true, possibly both. In particular, it is not
exclusive–or (“either. . . , or. . . ”). Thus the statement

(1 > 0) ∨ (5 is a prime number)

is true.
φ ψ φ→ψ
T T T
• Implies, Then, →: T F F
F T T
F F T
The logical (or material) implication is likely to present you with the most difficulties, as it
diverges considerably from its meaning in natural language. In English usage, implies (or
then) usually involves a causal connection, as in “If it is raining, then it is wet outside.” It is
wet because of the rain. But such a connection is irrelevant for the logical then. As we said
before, the truth value of the statement φ → ψ depends only on the truth values of φ and ψ.
Now the one thing that is certain is that a true statement cannot imply a false statement.
Thus T → F must be false, and we define:

φ → ψ is false if and only if φ is true but ψ is false.

Turning this around, we see that:

φ → ψ is true if and only if, whenever φ is true, then so is ψ.

Note that a false statement can obviously imply a false statement. This is because any
statement implies itself, i.e. φ → φ. Indeed, if φ is true, then φ is true. So φ → φ is true even
when φ is false. But a false statement can also imply a true statement, e.g. “If the moon is
made of green cheese, then the moon has mass”.
Thus there are severe differences between the English usage and the mathematical usage of
implies. For example, the statement

(1 > 0) → (5 is a prime number)

is true. Of course, the reason that 5 is prime is not because of the fact that 1 > 0!! There is
no causal connection. Indeed, (1 < 0) → (5 is prime) is true also!
We repeat: A logical φ → ψ statement is false only when φ is true and ψ is false — just look
at the truth table.

– In particular, if ψ is true, then φ → ψ is also true, no matter what φ might be.


– Even more surprisingly, if φ is false, then φ → ψ is true, i.e. a false statement implies
any other statement! In particular

(0 = 1) → (The Moon is made of cheese)

is true.

A.1.3 Quantifiers
Many mathematical statements assert the existence of a mathematical object with certain proper-
ties. For example to say that
x2 − 1 = 0 has a real root
is to say that there exists a real number c such that c2 − 1 = 0.
Other mathematical statements assert that something is true for all objects (of a prespecified type),
for example
For every real number x, x2 ≥ 0.
We therefore introduce the following symbols for quantifiers:

∀ For all
∃ There exists

A quantifier always occurs in conjunction with a variable, i.e. as ∀x or as ∃x. Thus if φ(x) is a
statement about x, then

∀xφ(x) is true iff the statement φ(x) is true for every x

Frequently, if we want to restrict the domain to a particular set X, we may also write ∀x ∈ X φ(x)
or ∃x ∈ X φ(x). Thus

(∃x ∈ X)φ(x) is true iff there is at least one x ∈ X for which the statement φ(x) is true

Thus the statement ∃x ∈ R(x2 − 1 = 0) asserts that the equation x2 − 1 = 0 has a real root.
The statement ∀x ∈ R(x2 ≥ 0) asserts that the square of any real number is non–negative.

Exercise A.1.2 Decide if the following sentences about real numbers are true or false:

(a) ∃x ∈ R(x2 = −1)

(b) ∃x ∈ N(4x = 1)

(c) ∃x ∈ R(4x = 1)

(d) ∀x ∈ R ∃y ∈ R(x ≤ y)

(e) ∃y ∈ R ∀x ∈ R(x ≤ y)

(f) ∃y ∈ [0, 1] ∀x ∈ [0, 1](x ≤ y)

(g) ∀x ∈ R ∀y ∈ R[xy = 0 → (x = 0 ∨ y = 0)]

(h) ∀x ∈ R ∀y ∈ R ∃z ∈ R[x + z = y]

(i) ∃z ∈ R ∀x ∈ R ∀y ∈ R[x + z = y]
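Over a finite domain, ∀ and ∃ correspond to Python's all and any, which makes a few of these statements easy to experiment with. A finite stand-in cannot settle the statements about all of R, so the sketch below is only for building intuition:

```python
D = range(-10, 11)                              # finite "domain" standing in for R

# (a) ∃x (x² = −1): false over D (and over the reals).
assert not any(x * x == -1 for x in D)

# (d) ∀x ∃y (x ≤ y): true — take y = x.
assert all(any(x <= y for y in D) for x in D)

# (e) ∃y ∀x (x ≤ y): true on D only because D has a maximum (y = 10);
# over all of R it is false — the order of quantifiers matters.
assert any(all(x <= y for x in D) for y in D)

# (g) ∀x ∀y (xy = 0 → (x = 0 ∨ y = 0)): true; note φ→ψ is ¬φ ∨ ψ.
assert all(not (x * y == 0) or x == 0 or y == 0 for x in D for y in D)
```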

Note that we have the following equivalence of statements:


 
¬(∀xϕ(x)) ⇐⇒ ∃x(¬ϕ(x))

For if it isn't the case that the statement ϕ(x) is true for every x, then there is at least one x
for which the statement ϕ(x) is false, and thus for which ¬ϕ(x) is true. Thus a negation sign can
“creep” past a quantifier, but it flips the quantifier in the process. For example,

¬[∀x∃y(y > x)] ⇐⇒ ∃x¬[∃y(y > x)]
⇐⇒ ∃x∀y(y ≯ x)

One more thing: The variable x in a statement of the form ∀xφ(x) or ∃xφ(x) is unimportant, i.e.
the meaning of the statement remains the same if we change the variable (provided that the new
variable does not already occur in the statement φ). This is just like what happened for definite
integrals: For example, we have
Z b Z b
f (x) dx = f (y) dy
a a
Just so, we have

∀xφ(x) ⇐⇒ ∀yφ(y) and ∃xφ(x) ⇐⇒ ∃yφ(y)

provided y does not already occur in φ.

A.2 Sets
In the early twentieth century, the following principle was established:
All mathematical objects are sets.
All mathematical notions can be expressed as relationships between sets.

Intuitively, a set is just a collection of objects.

If A is a set and x is some mathematical object, then we say

x∈A (x is an element of A)

if x is amongst the objects that are collected in A.


A set is characterized entirely by its elements. Two sets are the same if and only if they have
the same elements:
A = B ⇐⇒ ∀x[x ∈ A ↔ x ∈ B]
Instead of set, we will also say collection, family or class. Instead of saying x is an element of A,
we may also say x is a member of A or x belongs to A.
We say that a set A is a subset of a set B if and only if every element of A belongs to B
A⊆B ⇐⇒ ∀x[x ∈ A → x ∈ B]
Thus
A=B iff (A ⊆ B) ∧ (B ⊆ A)
We say that A is a proper subset of B if A ⊆ B, but A ≠ B. We also write B ⊇ A to mean A ⊆ B.
There are two ways to represent a set:
(i) By listing its elements: A = {ai : i ∈ I}
(ii) By some defining property: A = {x : φ(x)}
The following sets have special symbols associated with them:
• The empty set ∅ = {} = {x : x ≠ x}.
• N := {1, 2, 3, . . . } is the set of natural numbers.
• Z+ := {0, 1, 2, 3, . . . } is the set of non–negative integers.
• Z := {. . . , −2, −1, 0, 1, 2, . . . } is the set of integers.
• Q := {p/q : p ∈ Z, q ∈ N} is the set of rational numbers.
• R is the set of real numbers.
• C is the set of complex numbers.

A.2.1 Union and intersection


The symbols ∪, ∩ denote, respectively, the union and intersection of two sets:
A ∪ B = {x : x ∈ A ∨ x ∈ B}
A ∩ B = {x : x ∈ A ∧ x ∈ B}
We have here considered union and intersection as binary operations, involving just two sets.
Frequently, however, we may need to consider these as infinitary operations: We can, for example,
take the union of infinitely many sets, e.g.

⋃_{n=1}^∞ An = A1 ∪ A2 ∪ A3 ∪ . . .

or ⋃_{n∈N} An , etc. We therefore define the union and intersection of a family of sets as follows:

Definition A.2.1 (Union, intersection and product of a family of sets)


If A = {Ai : i ∈ I} is a family of sets, we may define

(a) the union
⋃A = ⋃_{i∈I} Ai = {x : x ∈ Ai for some i ∈ I}

(b) the intersection
⋂A = ⋂_{i∈I} Ai = {x : x ∈ Ai for all i ∈ I}

We will frequently write ⋃ Ai or ⋃_i Ai instead of ⋃_{i∈I} Ai . We will also write ⋃_{n=1}^∞ An instead
of ⋃_{n∈N} An . The same holds for ⋂.

Remarks A.2.2 Note that

(i) ⋃{A, B} = A ∪ B

(ii) ∏{A, B, C} = A × B × C

(iii) ⋂{X1 , X2 , . . . , Xn } = X1 ∩ X2 ∩ · · · ∩ Xn

etc. 

A.2.2 Set difference, complementation and symmetric difference


If A, B are sets, we define the set difference of A and B by

A − B = {x : x ∈ A ∧ x ∉ B}

We define the symmetric difference of A, B by

A∆B = (A − B) ∪ (B − A) = (A ∪ B) − (A ∩ B)

Often, we will be working with subsets of some universal set Ω. If A ⊆ Ω, we define the
complement of A by
Ac = Ω − A

Note that if A, B ⊆ Ω, then


A − B = A ∩ Bc

A.2.3 Set algebra


Note the following laws:

• Idempotent laws:
A∪A=A A∩A=A

• Commutative laws:

A∪B =B∪A A∩B =B∩A A∆B = B∆A

• Associative laws:

A∪(B ∪C) = (A∪B)∪C A∩(B ∩C) = (A∩B)∩C A∆(B∆C) = (A∆B)∆C

• Distributive laws:

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)     A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

A ∩ ⋃_{i∈I} Bi = ⋃_{i∈I} (A ∩ Bi )     A ∪ ⋂_{i∈I} Bi = ⋂_{i∈I} (A ∪ Bi )

(A∆B) ∩ C = (A ∩ C)∆(B ∩ C)

• Absorption laws:
A ∩ (A ∪ B) = A A ∪ (A ∩ B) = A

• Complementation laws:

A ∩ Ac = ∅ A ∪ Ac = Ω (the universal set)

(Ac )c = A (A∆B)c = Ac ∆B

• De Morgan’s laws:

(A ∩ B)c = Ac ∪ B c     (A ∪ B)c = Ac ∩ B c

(⋃_{i∈I} Ai )c = ⋂_{i∈I} Aci     (⋂_{i∈I} Ai )c = ⋃_{i∈I} Aci

• Set difference laws:

A − (B ∪ C) = (A − B) ∩ (A − C)     A − (B ∩ C) = (A − B) ∪ (A − C)

A − ⋃_{i∈I} Bi = ⋂_{i∈I} (A − Bi )     A − ⋂_{i∈I} Bi = ⋃_{i∈I} (A − Bi )

• Symmetric difference laws:

A∆A = ∅ A∆∅ = A A∆Ω = Ac
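All of these identities are easy to spot-check on small concrete sets, since Python's set type supports ∩, ∪, − and ∆ directly (&, |, -, ^). An illustrative check, not a proof:

```python
Omega = set(range(10))                      # a small universal set
A, B, C = {1, 2, 3, 4}, {3, 4, 5, 6}, {2, 4, 6, 8}
comp = lambda S: Omega - S                  # complement within Ω

assert A & (B | C) == (A & B) | (A & C)     # distributive law
assert comp(A & B) == comp(A) | comp(B)     # De Morgan
assert comp(A | B) == comp(A) & comp(B)     # De Morgan
assert A - (B | C) == (A - B) & (A - C)     # set difference law
assert A ^ A == set() and A ^ set() == A    # symmetric difference laws
assert A ^ Omega == comp(A)
```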



A.2.4 Cartesian Products


A set is completely determined by its elements. The order in which those elements are arranged
does not matter. For example, {a, b} = {b, a}. When we want the order to matter, we have to deal
with ordered tuples. An ordered pair is denoted by (a, b), and should be thought of as a collection
containing a and b, in that order. Thus (a, b) ≠ (b, a) (unless a = b). Note that
(a, b) = (c, d) ⇐⇒ a = c and b = d
Generally, an ordered n–tuple is denoted by (a1 , a2 , . . . , an ), and should be thought of as a collection
containing a1 , a2 , . . . , an , in that order.
The pair (a, b) is usually defined to be the set {{a}, {a, b}}. You can check that this definition yields the required
property that (a, b) = (c, d) iff a = c and b = d.
(a, b, c) is then defined to be (a, (b, c)) (which is just the set {{a}, {a, {{b}, {b, c}}}}), etc. This is in keeping with
the notion that all mathematical objects should be sets. On first encounter, however, you might find this arbitrary,
clumsy, and unnecessary, and you wouldn’t be far wrong: The main thing that you need to keep in mind is that an
ordered tuple is a collection in which the order matters.

Using ordered tuples, we can define one more way of making new sets from old:

Definition A.2.3 (Cartesian product)

(a) Suppose that A1 , A2 , . . . , An are sets. The cartesian product of A1 , . . . , An is the set
of all n–tuples (a1 , . . . , an ), with each ak ∈ Ak .

A1 × A2 × · · · × An = {(a1 , a2 , . . . , an ) : ak ∈ Ak for k = 1, 2, . . . , n}

(b) If A = {Ai : i ∈ I} is a family of sets, we may define the cartesian product


∏A = ∏_{i∈I} Ai = {(ai )I : ai ∈ Ai for all i ∈ I}

Here (ai )I is a generalized tuple, indexed by I.


In essence, (ai )I is a function with domain I and range ⋃_{i∈I} Ai .

Exercise A.2.4 Suppose that A, B ⊆ R are defined by


A := [1, 2] ∪ {3} B := {1, 2} ∪ [3, 4)
Draw a diagram of A × B in the Euclidean plane. 

We will identify the sets (A × B) × C and A × (B × C) with A × B × C, although, strictly


speaking, they are not equal.
For example, ((a, b), c) is an element of the first set, but not of the second or third. (a, (b, c)) belongs to the second,
but not to the first or third. (a, b, c) belongs to the third, but not to the first two. However, we shall simply identify
(a, (b, c)), ((a, b), c) and (a, b, c), i.e. we shall not distinguish between them. After all, all that matters is the order of
a, b, c and that is the same in each of these tuples.

Example A.2.5 • Observe that a relation R between elements a ∈ A and b ∈ B is just a


subset of A × B. For example, the ≤ relation on the reals is a subset of R × R.

The set ≤ is the set {(x, y) ∈ R × R : x ≤ y}

• A function f : A → B is a special kind of relation: a ∈ A and b ∈ B are related if and only if


f (a) = b. Thus f ⊆ A × B.

The function f is the set {(a, b) ∈ A × B : f (a) = b}

• In probability theory, Cartesian products are used to combine random experiments. The
experiment of tossing a coin is modelled by the sample space ΩC := {T, H}. The experiment of
rolling a die is modelled by the sample space ΩD := {1, 2, 3, 4, 5, 6}. The combined experiment
of tossing a coin and then rolling a die is modelled by the sample space

Ω := ΩC × ΩD = {(T, 1), (T, 2), . . . , (T, 6), (H, 1), (H, 2), . . . , (H, 6)}
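The combined sample space above can be built directly with itertools.product (an illustrative sketch):

```python
from itertools import product

Omega_C = ["T", "H"]                 # coin toss
Omega_D = [1, 2, 3, 4, 5, 6]         # die roll
Omega = list(product(Omega_C, Omega_D))

assert len(Omega) == 12              # |Ω_C × Ω_D| = 2 · 6
assert ("H", 3) in Omega and ("T", 6) in Omega
```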

A.3 Functions Operating On Sets


Recall that a function f : A → B is invertible if and only if it is a bijection. The inverse function,
denoted f −1 is a function f −1 : B → A with the properties that

f −1 (f (a)) = a all a ∈ A f (f −1 (b)) = b all b ∈ B

i.e.
f −1 ◦ f = 1A f ◦ f −1 = 1B
where 1A : A → A : a 7→ a is the identity function (which maps an element a to itself), and similarly
1B : B → B : b 7→ b.
The above is possible only for bijective functions f . We now introduce a similar notion which
will work for any function f : A → B. However, instead of operating on elements, these operate
on sets: With every function f : A → B (not necessarily invertible), we can associate two new
functions between the power sets of A and B

f [·] : P(A) → P(B) : A0 7→ {b ∈ B : There is a0 ∈ A0 such that f (a0 ) = b} where A0 ⊆ A


f −1 [·] : P(B) → P(A) : B 0 7→ {a ∈ A : f (a) ∈ B 0 } where B 0 ⊆ B

Thus f [·] assigns to each subset A0 of A a subset f [A0 ] ⊆ B. Similarly, f −1 [·] transforms each subset
B 0 of B into a subset f −1 [B 0 ] ⊆ A.
We will, for the moment, use square brackets to distinguish the various functions, but will drop
this convention later. Which function is meant will be clear from context. We shall also call f [A0 ]
the direct image of A0 along f , and f −1 [B 0 ] the inverse image or pullback of B 0 along f . Note that

f [A0 ] = set of all images of a ∈ A0



whereas
f −1 [B 0 ] = set of all preimages of b ∈ B 0

Inverse images play a very important role in mathematics. It is therefore useful to remember
the following:

a ∈ f −1 [B 0 ] if and only if f (a) ∈ B 0


b ∈ f [A0 ] if and only if b = f (a0 ) for some a0 ∈ A0

For probability theory, the following result is very important:

Proposition A.3.1 (Set operations commute with pullback)


Inverse images preserve the set operations: Let f : A → B, and suppose that G, H, Gi (i ∈
I) are subsets of B. Then

(a) If G ⊆ H, then f −1 [G] ⊆ f −1 [H];

(b) f −1 [G ∩ H] = f −1 [G] ∩ f −1 [H]; in fact f −1 [⋂_{i∈I} Gi ] = ⋂_{i∈I} f −1 [Gi ];

(c) f −1 [G ∪ H] = f −1 [G] ∪ f −1 [H]; in fact f −1 [⋃_{i∈I} Gi ] = ⋃_{i∈I} f −1 [Gi ];

(d) f −1 [G − H] = f −1 [G] − f −1 [H];

Exercise A.3.2 Prove the preceding proposition. 

Direct images are not quite so well behaved:

Exercise A.3.3 Let f : A → B, and suppose that G, H ⊆ A.

(a) Suppose that G ⊆ H. Show that f [G] ⊆ f [H];

(b) Show that f [G ∪ H] = f [G] ∪ f [H];

(c) Show that f [G ∩ H] ⊆ f [G] ∩ f [H];

(d) Give an example to show that we may not have f [G ∩ H] = f [G] ∩ f [H];

(e) Show that f [G] − f [H] ⊆ f [G − H] ⊆ f [G];

(f) Give an example to show, in (e), that both ⊆’s may fail to be =’s.
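The contrast between direct and inverse images is easy to see concretely. The sketch below (illustrative, not from the text) implements f[·] and f⁻¹[·] for finite sets and exhibits the failure in (d) with the non-injective map f(x) = x²:

```python
def image(f, A0):
    """Direct image f[A0] = set of all images of a ∈ A0."""
    return {f(a) for a in A0}

def preimage(f, dom, B0):
    """Inverse image f⁻¹[B0] = {a : f(a) ∈ B0}."""
    return {a for a in dom if f(a) in B0}

f = lambda x: x * x                          # not injective on the integers
dom = {-2, -1, 0, 1, 2}
G, H = {-2, -1}, {1, 2}

# Pullback commutes with intersection (Propn. A.3.1(b)):
assert preimage(f, dom, {1} & {1, 4}) == preimage(f, dom, {1}) & preimage(f, dom, {1, 4})

# Direct image need not: f[G ∩ H] = f[∅] = ∅, but f[G] ∩ f[H] = {1, 4}.
assert image(f, G & H) == set()
assert image(f, G) & image(f, H) == {1, 4}
```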



A.4 Equivalence Relations


The relation of equality satisfies the following properties:
• = is reflexive: x = x.
• = is symmetric. If x = y, then y = x.
• = is transitive: If x = y and y = z, then x = z.
A relation which satisfies these same properties is called an equivalence relation. To be precise:

Definition A.4.1 A binary relation R on a set X is an equivalence relation if and only if

(i) R is reflexive, i.e. ∀x ∈ X (xRx).

(ii) R is symmetric, i.e. ∀x, y ∈ X (xRy → yRx).

(iii) R is transitive, i.e. ∀x, y, z ∈ X (xRy ∧ yRz → xRz).

An equivalence relation R can be thought of as a generalization of the identity relation: If xRy,


then x, y are in a certain sense identified, i.e. regarded as “equal” for the purpose at hand. When
we collect together all the elements which are made “equal” we get equivalence classes:
Definition A.4.2 If R is an equivalence relation on a set X and x ∈ X, then the equivalence class
of x modulo R is the set defined by:

[x]R := {y ∈ X : xRy}


We may write just [x] if R is understood from context.
The following device yields a fruitful way of looking at equivalence relations:
Definition A.4.3 A family P of non–empty sets is called a partition of a set X if and only if:
(i) ⋃{P : P ∈ P} = X.

(ii) Any two distinct members of P are disjoint, i.e. P, Q ∈ P and P ≠ Q implies P ∩ Q = ∅.

The members P ∈ P are sometimes called the blocks of the partition. Observe that (i) of the above
definition states that each x ∈ X belongs to at least one block, whereas (ii) states that no x ∈ X
belongs to more than one block. Thus each x ∈ X belongs to exactly one block.

Exercise A.4.4 (a) Every equivalence relation on X induces a partition of X: Let R be an equiv-
alence relation on a set X. Define a family of sets PR by

PR := {[x]R : x ∈ X}

Show that PR is a partition of X.



(b) Every partition of a set X induces an equivalence relation on X: Let P be a partition of X.


Define a binary relation RP on X by

x RP y if and only if ∃P ∈ P (x ∈ P ∧ y ∈ P )

(i.e. x RP y if and only if x, y belong to the same block of P.) Show that RP is an equivalence
relation. Further show that the blocks of P are precisely the equivalence classes of RP , i.e.
that [x]RP = P if and only if x ∈ P .

(c) Show that the constructions in (a), (b) above are inverses of each other, in the following sense:
Starting from an equivalence relation R, the construction in (a) yields a partition PR . If we
then apply the construction in (b) to the partition PR , we get an equivalence relation RPR .
This relation is precisely the original equivalence relation R, i.e. RPR = R.
Similarly, PRP = P.


If R is an equivalence relation on a set X, then the set of equivalence classes — which we above
denoted by PR — is usually denoted by X/R. The equivalence class of an element x — above
denoted by [x]R — is often denoted by x/R.
Here is another reason why equivalence relations are ubiquitous in mathematics:

Exercise A.4.5 (a) Every function induces an equivalence relation on its domain: Suppose that
f : X → Y is a function. Define a binary relation R on X by

xRy iff f (x) = f (y)

The relation R is called the kernel of f , and denoted R = ker f . Show that ker f is an
equivalence relation.

(b) Assuming the Axiom of Choice, every equivalence relation on a set X is induced by a function
with domain X: Show that if R is an equivalence relation, then there is a function f with
domain X such that R = ker f .
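The grouping in Exercise A.4.5(a) has a very concrete computational counterpart. The following sketch (illustrative Python, not part of the exercises; the particular set X and function f are arbitrary choices) groups a finite set X by f-values, so that the blocks produced are exactly the equivalence classes of ker f:

```python
from collections import defaultdict

def kernel_partition(X, f):
    """Return the partition of the finite set X into the equivalence
    classes of ker f: x and y share a block iff f(x) == f(y)."""
    blocks = defaultdict(list)
    for x in X:
        blocks[f(x)].append(x)
    return list(blocks.values())

# Example: X = {0, ..., 8} and f(x) = x % 3, so ker f is congruence
# modulo 3 and the blocks are the three residue classes.
partition = kernel_partition(range(9), lambda x: x % 3)
# partition == [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```

Note that every element of X lands in exactly one block, mirroring the discussion after Definition A.4.3.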

Appendix B

Convergence

B.1 Convergence of Sequences


B.1.1 “Infinitely Often” and “Eventually”
Here are two related ideas which will help elucidate the notion of convergence:
Let P be a property that a real number may (or may not) have. We write P (x) if x has the
property P . [For example, P could be the property of being positive, so that P (1.23), but ¬P (−π).
Or Q could be the property of being irrational, in which case ¬Q(1.23), whereas Q(−π).] Now
suppose that hxn in is a sequence in R, and that P is a property:
• We say hxn in has property P infinitely often iff there are infinitely many n such that P (xn ) is true.

• We say hxn in has property P eventually iff P (xn ) is true for all n from some point onwards.
These are intuitive descriptions, not formal definitions — we’ll come to that. But first, some
examples:
Examples B.1.1 (1) The sequence 1, 2, 3, 4, 5, . . . is prime infinitely often: Infinitely many terms
are prime numbers. It is also even infinitely often. It is eventually greater than 1010 .

(2) The sequence −2, −1, 0, 1, 2, 3, . . . is positive eventually: From the fourth term onwards, all
terms are positive (i.e. > 0).

(3) Let xn = sin(nπ/2). Then hxn in is strictly positive infinitely often, strictly negative infinitely often, and zero infinitely often.
(4) Let xn = 2 + (−1)n /n. Then |xn − 2| < 1/1000 eventually.
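Example (4) can be checked numerically. The sketch below (illustrative only; the helper name and the finite horizon are our own choices, since "eventually" is really a statement about the infinite tail) finds the least N from which the property holds up to the horizon, using exact rationals to avoid floating-point rounding at the boundary:

```python
from fractions import Fraction

def eventual_threshold(x, prop, horizon):
    """Least N with prop(x(n)) true for all N <= n <= horizon: a
    finite-horizon stand-in for 'eventually'."""
    N = horizon + 1
    for n in range(horizon, 0, -1):
        if prop(x(n)):
            N = n
        else:
            break
    return N

# x_n = 2 + (-1)**n / n, so |x_n - 2| = 1/n, which is < 1/1000
# exactly when n > 1000.
x = lambda n: 2 + Fraction((-1) ** n, n)
N = eventual_threshold(x, lambda v: abs(v - 2) < Fraction(1, 1000), horizon=10_000)
# N == 1001
```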


Remarks B.1.2 • First note that hxn in has property P eventually if and only if there exists an N ∈ N such that every term from the N th onwards has property P . Formally,

(∃N ∈ N) (∀n ≥ N ) [xn has property P ] (†)


• Next note that if hxn in has property P infinitely often, then the following is true:

Given any natural number N ∈ N, there is a natural number n ≥ N such that xn satisfies property P , i.e.

(∀N ∈ N) (∃n ≥ N ) [xn has property P ] (∗)

For if this is not the case, then there is some N such that no n ≥ N has property P . Thus if
xn does have property P , then n < N .
But then there are only finitely many xn which have property P — at most those xn for
n = 1, 2, . . . , N − 1!! This contradicts the assumption that hxn in has property P infinitely
often.
It follows that hxn in has property P infinitely often if and only if (∗) is true.

• Finally, observe that infinitely often and eventually are closely related: If it is not the case that
a sequence hxn in has property P infinitely often, then eventually hxn in must have property
¬P .

¬(∀N )(∃n ≥ N ) [xn has property P ] ≡ (∃N )(∀n ≥ N ) [xn has property ¬P ]

The above remarks contain a set of formal definitions:

Definition B.1.3 (a) We say that a sequence hxn in has property P infinitely often if and
only if
(∀N ∈ N) (∃n ∈ N) [n ≥ N ∧ xn has property P ]

(b) We say that a sequence hxn in has property P eventually if and only if

(∃N ∈ N) (∀n ∈ N) [n ≥ N → xn has property P ]

B.1.2 Formal Definition of Convergence of Sequences


We first define what it means for a sequence hxn in of non–negative real numbers to converge to
zero. Intuitively:

To say that xn → 0 means that hxn i is “small” eventually,


for any measure of “smallness”.

The notion “small” is subjective, so we will demand that it holds for absolutely anybody’s idea of
“small”. Specifically, suppose you define “small” by specifying some number ε > 0 and saying “A

non–negative number x is small iff x < ε”. To say that hxn in is eventually small then means that
from some point onwards all the xn ’s are small, i.e

∃N ∀n ≥ N [xn < ε]

This must be true no matter what gauge ε > 0 of “smallness” you use. Thus:

If hxn in is a sequence of non–negative real numbers, we say

xn → 0 ⇐⇒ ∀ε > 0 ∃N ∀n ≥ N [xn < ε]

We also write
lim xn = 0
n→∞

Thus xn → 0 iff given any ε > 0 it is possible to find a natural number N such that

xn < ε whenever n ≥ N

The number N typically depends on ε: The smaller ε > 0, the greater N usually has to be. It is
now rather simple to define convergence of arbitrary sequences in R: To say that xn → x means
that the distance between xn and x converges to 0, i.e.

xn → x ⇔ |xn − x| → 0

Now the distance |xn − x| between xn and x is non–negative, so we already know what |xn − x| → 0:
It means ∀ε > 0 ∃N ∀n ≥ N [|xn − x| < ε]. Thus

Definition B.1.4 If hxn in is a sequence of real numbers, we say

xn → x ⇐⇒ ∀ε > 0 ∃N ∀n ≥ N [|xn − x| < ε]

We then say that hxn in is a convergent sequence, with limit x. We also write

x = lim xn for xn → x
n

A sequence which is not convergent is said to be divergent.

Thus xn → x if and only if, given any ε > 0, the distance between xn and x is eventually < ε, i.e.
all but finitely many terms xn lie within a distance of ε of x.
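The dependence of N on ε can be made concrete. For the sequence xn = 1/n, which converges to 0, any N > 1/ε witnesses the definition. The sketch below (illustrative only; the generous choice N = ⌈1/ε⌉ + 1 is ours, picked to be safely above 1/ε even after floating-point rounding) spot-checks this for a few tolerances:

```python
import math

def x(n):
    return 1.0 / n                     # x_n = 1/n converges to 0

def witness_N(eps):
    """An N such that n >= N implies 1/n < eps; ceil(1/eps) + 1 is a
    slightly generous choice, guarding against rounding in 1/eps."""
    return math.ceil(1 / eps) + 1

for eps in (0.5, 0.01, 1e-4):
    N = witness_N(eps)
    # We cannot test infinitely many n, but 1/n only shrinks past N.
    assert all(abs(x(n)) < eps for n in range(N, N + 1000))
```

As the text notes, the smaller ε is, the larger N must be: ε = 0.5 is witnessed by N = 3, while ε = 10⁻⁴ needs N = 10001.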

B.2 Sup and Inf


We begin with some definitions:

Definition B.2.1 Let A ⊆ R.

(a) We say that an element u ∈ R is an upper bound for A if and only if

∀a ∈ A(a ≤ u)

(b) Similarly, we say that l ∈ R is an lower bound for A if and only if

∀a ∈ A(l ≤ a)

(c) We say that A is bounded if and only if it has both an upper bound and a lower bound.

(d) We say that u0 is the supremum, or least upper bound of A if and only if the following
hold:

(i) u0 is an upper bound of A;


(ii) If u is any upper bound of A, then u0 ≤ u.

We write
u0 = sup A or u0 = l.u.b.(A)

(e) We say that l0 is the infimum, or greatest lower bound of A if and only if the following
hold:

(i) l0 is a lower bound of A;


(ii) If l is any lower bound of A, then l ≤ l0 .

We write
l0 = inf A or l0 = g.l.b.(A)

(f) We say that u0 is the maximum of A, and write u0 = max A, if and only if

u0 ∈ A and u0 = sup A

(g) Similarly, we say that l0 is the minimum of A, denoted l0 = min A, if and only if

l0 ∈ A and l0 = inf A

Notation: If A := {xn : n ∈ N}, we may write supn xn instead of sup A.

Examples B.2.2 (1) In (R, ≤), we have:

(a) 1 = max[0, 1] = sup[0, 1]; 0 = min[0, 1] = inf[0, 1]



(b) 1 = sup[0, 1), but max[0, 1) does not exist.

(c) If A = {x ∈ R : x2 ≤ 2}, then sup A = √2.

(d) If A ≠ ∅, then inf A ≤ sup A.

(e) If A = ∅, then A has no supremum and no infimum. First note that every u ∈ R is an upper bound for A — for if u is not an upper bound, then there must be an a ∈ A such that u < a. But A is empty, so there is no such a!
Similarly, every real number is also a lower bound for A. We therefore write

sup ∅ = −∞        inf ∅ = +∞


It is easy to see that


Lemma B.2.3 If ∅ 6= A ⊆ B ⊆ R, then

sup(A) ≤ sup(B) inf(A) ≥ inf(B)

The following trivial observation is nonetheless very useful: If x < sup(A), then x is smaller
than the least upper bound of A, and hence x is not an upper bound of A (else it would be an
upper bound that is less than the least upper bound). It follows that there is a ∈ A such that x < a.
Thus:
Lemma B.2.4 Let A ⊆ R.

(a) x < sup(A) iff there is a ∈ A such that x < a.

(b) x > inf(A) iff there is a ∈ A such that x > a.

Lemma B.2.5 Suppose that A ⊆ R, and define −A := {−x : x ∈ A}. Then

sup(−A) = − inf(A)

Proof: Let u be an upper bound for −A. Then u ≥ −x for every x ∈ A, and hence −u ≤ x for
every x ∈ A, i.e. −u is a lower bound for A. Similarly, if l is a lower bound for A, then −l is an
upper bound for −A.
Now let u0 := sup(−A). We must show that −u0 = inf(A). Since u0 is an upper bound for
−A, we see that −u0 is a lower bound for A. Furthermore, if l is another lower bound for A, then
−l is an upper bound for −A, and hence −l ≥ u0 (because u0 is the least upper bound for −A, so u0 is no greater than the upper bound −l). It follows that −u0 ≥ l, i.e. −u0 is a lower bound of A which is at least as great as any other lower bound l. So −u0 = inf(A). a
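For finite sets, sup and inf are just max and min, so Lemma B.2.5 is easy to spot-check numerically (an illustrative sketch; the particular set A is an arbitrary choice):

```python
A = {-3.5, -1.0, 2.0, 7.25}
neg_A = {-a for a in A}          # the set -A = {-x : x in A}

# sup(-A) = -inf(A), and by the symmetric argument inf(-A) = -sup(A).
assert max(neg_A) == -min(A)     # 3.5 == -(-3.5)
assert min(neg_A) == -max(A)     # -7.25 == -(7.25)
```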

The existence of suprema and infima is taken as a fundamental axiom of the real number system:
Completeness Axiom:
Every non–empty A ⊆ R which is bounded above has a supremum.
Every non–empty A ⊆ R which is bounded below has an infimum.

It is the Completeness Axiom which often guarantees the existence of limits:

Lemma B.2.6 A bounded monotone sequence converges.

(a) If hxn i is a bounded increasing sequence, and a := supn xn , then xn → a.

(b) If hxn i is a bounded decreasing sequence, and b := inf n xn , then xn → b.

Proof: We prove only (a), as (b) is similar. We are given an increasing sequence hxn i, with
a := supn xn . Let ε > 0. We must show that there is N ∈ N such that |xn − a| < ε whenever
n ≥ N . Observe that xn ≤ a (as a = supn xn ), so that |xn − a| = a − xn . Now a − ε < a, so by
Lemma B.2.4 there is N such that a − ε < xN . If n ≥ N , then xn ≥ xN > a − ε, since hxn i is
increasing. It follows that ∀n ≥ N (a − xn < ε), and thus that ∀n ≥ N (|xn − a| < ε). a

Next, we show that every sequence of real numbers has a monotone subsequence. For the
purpose of the proof, we briefly introduce some non–standard terminology. Let hxn in be a sequence
of real numbers. Imagine that you are walking along a landscape, and that xn is your height above
sea level at time n. Call xn a vista if you can see the whole landscape ahead of you, i.e. if xn ≥ xm
for all m ≥ n. Thus if hxn in is decreasing, then each xn is a vista, whereas if hxn in is increasing,
there are no vistas at all. If xn := 1 + (−1)n /n, then every even point x2n is a vista.

Theorem B.2.7 Every sequence of real numbers has a monotone subsequence.

Proof: We consider two cases: Either (1) hxn in has infinitely many vistas, or (2) it has only
finitely many. In case (1), let xn1 , xn2 , xn3 , . . . be the subsequence of vistas, in order of increasing
subscript. Note that
xn1 ≥ xn2 ≥ xn3 ≥ · · · ≥ xnk ≥ . . .

is a decreasing subsequence of hxn in .


In case (2), hxn in has only finitely many vistas, so there is an N ∈ N such that there are no
vistas beyond point xN , i.e. if n ≥ N , then xn is not a vista. Now construct a subsequence as
follows. Let n1 = N . Since xn1 is not a vista, there is n2 > n1 such that xn2 > xn1 . Since xn2 is
not a vista, there is n3 > n2 such that xn3 > xn2 . Continuing in this way, we obtain an increasing
sequence
xn1 < xn2 < xn3 < · · · < xnk < · · ·  a
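The "vista" argument is effectively an algorithm. On a finite prefix of a sequence, xn is a vista of the prefix exactly when it is at least the maximum of all remaining terms, which a single right-to-left scan detects (illustrative Python; on a finite prefix this only approximates the vistas of the infinite sequence, though for the example from the text the two agree):

```python
def vistas(xs):
    """0-based indices n with xs[n] >= xs[m] for all m >= n within xs,
    found by a right-to-left scan tracking the running suffix maximum."""
    out, best = [], float("-inf")
    for n in range(len(xs) - 1, -1, -1):
        if xs[n] >= best:
            out.append(n)
            best = xs[n]
    return out[::-1]

# x_n = 1 + (-1)**n / n for n = 1, ..., 20: the vistas are exactly the
# even-n points, as claimed above (here: odd 0-based list indices).
xs = [1 + (-1) ** n / n for n in range(1, 21)]
# vistas(xs) == [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
```

A decreasing list makes every index a vista, while in an increasing list only the last index of the prefix survives, matching the two cases in the proof.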

Theorem B.2.8 (Bolzano–Weierstrass) Every bounded sequence of real numbers has a convergent subsequence.

Proof: By Theorem B.2.7, any sequence hxn i has a monotone subsequence. If hxn i is bounded,
then so is the subsequence. But a bounded monotone sequence converges, by Lemma B.2.6. a

B.3 lim sup and lim inf


Suppose that hxn in is a bounded sequence in R. Construct two new sequences as follows:

yn = sup{xm : m ≥ n} = supm≥n xm          zn = inf{xm : m ≥ n} = inf m≥n xm

Because hxn i is bounded, yn and zn exist (i.e. are finite real numbers), by the Completeness Axiom.
Example B.3.1 Suppose, for example, that xn = (−1)n /n for n ≥ 1. Then

y1 = sup{−1, 1/2, −1/3, 1/4, −1/5, . . .} = 1/2
y2 = sup{1/2, −1/3, 1/4, −1/5, . . .} = 1/2
y3 = sup{−1/3, 1/4, −1/5, . . .} = 1/4
y4 = sup{1/4, −1/5, . . .} = 1/4

i.e. hyn i is the sequence 1/2, 1/2, 1/4, 1/4, 1/6, 1/6, 1/8, . . . .
Similarly, you can check that hzn i is the sequence −1, −1/3, −1/3, −1/5, −1/5, −1/7, −1/7, −1/9, . . . . 
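These tail suprema and infima are easy to reproduce numerically. On a finite horizon M the supremum sup over m ≥ n is replaced by a maximum over n ≤ m ≤ M (an illustrative sketch; the horizon is our own stand-in for the infinite tail, and because the terms of this sequence shrink in absolute value, a modest M already yields the exact values for small n):

```python
from fractions import Fraction

def x(m):                      # x_m = (-1)**m / m, m >= 1, exact rationals
    return Fraction((-1) ** m, m)

M = 1000                       # finite horizon standing in for the tail

def y(n):                      # y_n = sup_{m >= n} x_m (exact here)
    return max(x(m) for m in range(n, M + 1))

def z(n):                      # z_n = inf_{m >= n} x_m
    return min(x(m) for m in range(n, M + 1))

# Reproduces the example: y_1 = y_2 = 1/2, y_3 = y_4 = 1/4,
# and z_1 = -1, z_2 = z_3 = -1/3.
```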

Observe that by Lemma B.2.3 we see that hyn i is a bounded decreasing sequence, and that hzn i is
a bounded increasing sequence.
By Lemma B.2.6, it follows that hyn i converges, and that limn yn = inf n yn . Similarly, limn zn =
supn zn exists also. We now define lim supn xn = limn yn , and lim inf n xn = limn zn :

Definition B.3.2 Let hxn i be a sequence in R. We define the limit superior of hxn i by

lim supn xn = limn→∞ supm≥n xm

where we adopt the convention that if hxn i is not bounded above, we set lim supn xn = +∞.
Similarly, we define the limit inferior of hxn i by

lim inf n xn = limn→∞ inf m≥n xm

where we adopt the convention that if hxn i is not bounded below, we set lim inf n xn = −∞.

Let’s analyse the notions of lim sup and lim inf. Let hxn in be a sequence, and let yn :=
supm≥n xm and zn := inf m≥n xm .
• If lim supn xn > a, then lim yn > a. Hence yn > a for all n (since hyn in is decreasing). It
follows that, for all n, supm≥n xm > a, i.e. that there exists m ≥ n such that xm > a. Thus

lim supn xn > a =⇒ ∀n ∈ N ∃m ≥ n (xm > a)

i.e.
lim supn xn > a =⇒ xn > a infinitely often

• On the other hand, if xn ≥ a infinitely often, then for every n there is m ≥ n such that
xm ≥ a. Thus yn := supm≥n xm ≥ a for all n, and hence lim supn xn = limn yn ≥ a, i.e.

xn ≥ a infinitely often =⇒ lim supn xn ≥ a

• From the logical equivalence of ϕ → ψ and ¬ψ → ¬ϕ, and using facts like ¬(xn > a, i.o.) ≡
(xn ≤ a, ev.), we see that

xn ≤ a eventually =⇒ lim supn xn ≤ a

and
lim supn xn < a =⇒ xn < a eventually

• By Lemmas B.2.5 and B.2.6, we have lim inf n xn = supn inf m≥n xm = − inf n supm≥n (−xm ) =
− lim supn (−xn ). Thus we need to prove a result only for lim sup, in order to get immediately
a corresponding result for lim inf. Similar statements therefore hold for lim inf.
Summarizing in a box:

lim supn xn > a =⇒ xn > a infinitely often
xn ≥ a infinitely often =⇒ lim supn xn ≥ a
xn ≤ a eventually =⇒ lim supn xn ≤ a
lim supn xn < a =⇒ xn < a eventually
lim inf n xn < a =⇒ xn < a infinitely often
xn ≤ a infinitely often =⇒ lim inf n xn ≤ a
xn ≥ a eventually =⇒ lim inf n xn ≥ a
lim inf n xn > a =⇒ xn > a eventually

If you understand the implications in the box, you understand lim sup and lim inf.

Though a bounded sequence hxn in may not have a limit, it always has a lim sup and a lim inf.
When hxn in does converge, the three notions coincide, and conversely, as we shall see next. Note
that always lim supn xn ≥ lim inf n xn :

Proposition B.3.3 Suppose that hxn in is a bounded sequence of real numbers. Then
hxn in converges if and only if lim supn xn = lim inf n xn . In that case, limn xn =
lim supn xn = lim inf n xn .

Proof: (⇒): Suppose that xn → x, and let ε > 0. Then |xn − x| < ε eventually, and thus in
particular xn ≤ x + ε eventually. Thus lim supn xn ≤ x + ε.
Similarly x − ε ≤ xn eventually, and thus lim inf n xn ≥ x − ε.
It follows that for all ε > 0, we have
x − ε ≤ lim inf n xn ≤ lim supn xn ≤ x + ε

Since ε was arbitrary, we must have lim inf n xn = lim supn xn = x.


(⇐): Suppose that lim inf n xn = lim supn xn =: x, and let ε > 0 be arbitrary. Then lim inf n xn >
x − ε, so xn > x − ε eventually. Similarly, lim supn xn < x + ε, so xn < x + ε eventually. Combining,
we see that x − ε < xn < x + ε eventually, i.e. that |xn − x| < ε eventually. a

One more trivial observation:


Proposition B.3.4 xn ↛ x if and only if lim supn |xn − x| ≥ ε for some ε > 0.
Proof: If xn ↛ x, then ¬∀ε > 0 (|xn − x| < ε, ev.), so ∃ε > 0 (|xn − x| ≥ ε, i.o.), and hence lim supn |xn − x| ≥ ε. Conversely, if lim supn |xn − x| ≥ ε for some ε > 0, then |xn − x| > ε/2 infinitely often, so it is not the case that |xn − x| < ε/2 eventually, i.e. xn ↛ x. a

B.4 Cauchy Sequences and Convergence


Intuitively, a sequence hxn in in R is a Cauchy sequence if its terms lie eventually arbitrarily close
to each other. This means that from some point onwards, any two terms are “close”. If all terms
lie closer and closer together, there should be some point that they are all clustering around, and
that point should be the limit of the sequence hxn in .
All this “ought” and “should” needs to be made precise.

Definition B.4.1 A sequence hxn in in R is called a Cauchy sequence if and only if for
every ε > 0 there is an N ∈ N such that

|xn − xm | < ε whenever n, m ≥ N

i.e. if and only if


∀ε > 0 ∃N ∈ N ∀n, m ≥ N [|xn − xm | < ε]

Remarks B.4.2 (a) Note that all terms from some point onwards need to be within ε of each
other, not just successive terms. Thus, for example, if N = 100, then not just do we have
|x100 − x101 | < ε, but also |x301 − x15 673 428 | < ε.
(b) A neat way to characterize Cauchy sequences is as follows:
hxn in is a Cauchy sequence ⇐⇒ limN →∞ supn≥N |xn − xN | = 0

Here (⇒) is obvious. (⇐) follows by the triangle inequality: Given ε > 0, choose N such that
supk≥N |xk − xN | < ε/2. Then for n, m ≥ N we have
|xn − xm | ≤ |xn − xN | + |xN − xm | ≤ 2 supk≥N |xk − xN | < ε



Example B.4.3 The sequence h1 + (−1)n 2−n in is Cauchy. Indeed, given ε > 0, we may choose
N ∈ N such that 2−N < ε/2. If n, m > N , then (by the triangle inequality)

|(1 + (−1)n 2−n ) − (1 + (−1)m 2−m )| ≤ 2−n + 2−m ≤ 2−N + 2−N < ε
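The characterization in Remark B.4.2(b) can be watched in action for this example. The sketch below (illustrative only; the finite horizon is our stand-in for the supremum over all n ≥ N) computes sup over N ≤ n ≤ horizon of |xn − xN| for growing N and sees it shrink toward 0:

```python
def x(n):
    return 1 + (-1) ** n * 2.0 ** (-n)   # the Cauchy sequence of Example B.4.3

def tail_deviation(N, horizon=60):
    """max_{N <= n <= horizon} |x_n - x_N|: a finite-horizon version of
    the quantity sup_{n >= N} |x_n - x_N| from Remark B.4.2(b)."""
    return max(abs(x(n) - x(N)) for n in range(N, horizon + 1))

devs = [tail_deviation(N) for N in (1, 5, 10, 20)]
# The deviations decrease strictly and become tiny, reflecting the
# Cauchy property of the sequence.
```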


Lemma B.4.4 Every convergent sequence is a Cauchy sequence

Proof: Suppose that xn → x, and that we are given ε > 0. We must find N such that |xn −xm | < ε
whenever n, m > N .
Now because xn → x there is N ∈ N such that |xn − x| < ε/2 whenever n ≥ N . In particular, if n, m ≥ N , then

|xn − xm | ≤ |xn − x| + |x − xm | < ε/2 + ε/2 = ε
Hence hxn in is a Cauchy sequence. a

More importantly, the converse is true: Any Cauchy sequence in R is convergent. To prove this,
we will need a number of lemmas. We shall prove:

• Every Cauchy sequence is bounded.

• Every bounded sequence has a convergent subsequence.

• If a Cauchy sequence hxn in has a convergent subsequence, then hxn in is itself convergent.

Actually, the second point has already been proved. It is the Bolzano–Weierstrass theorem (Theo-
rem B.2.8). Thus we need only prove the first and the last point.

Lemma B.4.5 If hxn in is a Cauchy sequence in R, then hxn in is bounded.

Proof: Choose N ∈ N such that |xn − xm | < 1 whenever n, m ≥ N . (This is possible, because
hxn in is Cauchy — we have taken ε = 1.) Now define

K = max{|x1 |, |x2 |, . . . , |xN |} + 1

We show that K is a bound for hxn in , i.e. that |xn | ≤ K for all n ∈ N.
Consider separately the two cases (i) n ≤ N , and (ii) n > N . In case (i), we obviously have |xn | ≤ K, by definition of K. Suppose therefore that n > N . In that case, both n and N are ≥ N , and thus
|xn | ≤ |xn − xN | + |xN | ≤ 1 + |xN | ≤ K
which finishes case (ii). a

Lemma B.4.6 If hxn in is a Cauchy sequence, and if hxn in has a convergent subsequence, then
hxn in itself converges.

Proof: Suppose that hxnk ik is a subsequence of the Cauchy sequence hxn in , and that xnk → x (as
k → ∞). We show that xn → x (as n → ∞).
So let ε > 0. We must show that there is N ∈ N such that |xn − x| < ε whenever n ≥ N . Now
because hxn in is a Cauchy sequence, we can find an N1 such that

n, m ≥ N1 implies |xn − xm | < ε/2

Because xnk → x, we can find a K such that

k ≥ K implies |xnk − x| < ε/2

Now define N = max{N1 , nK }, and let n ≥ N . Choose k such that nk ≥ N . Then (i)
n, nk ≥ N1 , and (ii) k ≥ K (because nk ≥ N ≥ nK ). It follows that
|xn − x| ≤ |xn − xnk | + |xnk − x| < ε/2 + ε/2 = ε

whenever n ≥ N . a

Theorem B.4.7 Let hxn in be a sequence in R. Then hxn in converges if and only if it is
a Cauchy sequence.

Proof: (⇒) is Lemma B.4.4.


(⇐): If hxn in is a Cauchy sequence, then it is bounded (by Lemma B.4.5). Hence it has a convergent
subsequence (by Theorem B.2.8). It follows that hxn in converges (by Lemma B.4.6). a
