
Topics in Mathematics 2

Gerald Trutnau
Seoul National University
Fall Term 2024

Uncorrected version

This text is a summary of the lecture


Topics in Mathematics 2
held at Seoul National University
(Fall Term 2024)
Please email all misprints and mistakes to me at
[email protected]

2
Contents

1 Basic Notions 5
1 Probability spaces . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Discrete models . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Transformations of probability spaces . . . . . . . . . . . . . . 18
4 Random variables . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6 Variance and Covariance . . . . . . . . . . . . . . . . . . . . . 30
7 The (strong and the weak) law of large numbers . . . . . . . . 33
8 Convergence and uniform integrability . . . . . . . . . . . . . . 39
9 Distribution of random variables . . . . . . . . . . . . . . . . . 47
10 Weak convergence of probability measures . . . . . . . . . . . 51
11 Dynkin-systems and Uniqueness of probability measures . . . . 57

2 Independence 63
1 Independent events . . . . . . . . . . . . . . . . . . . . . . . . 63
2 Independent random variables . . . . . . . . . . . . . . . . . . 69
3 Kolmogorov’s law of large numbers . . . . . . . . . . . . . . . 71
4 Joint distribution and convolution . . . . . . . . . . . . . . . . 79
5 Characteristic functions . . . . . . . . . . . . . . . . . . . . . 87
6 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . 89

3 Conditional probabilities 101


1 Elementary definitions . . . . . . . . . . . . . . . . . . . . . . 101
2 Transition probabilities and Fubini’s theorem . . . . . . . . . . 107
2.1 Examples and Applications . . . . . . . . . . . . . . . 112
3 Stochastic models in discrete time . . . . . . . . . . . . . . . . 114
3.1 The canonical model . . . . . . . . . . . . . . . . . . . 115
3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 119
4 Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

3
Bibliography

1. Bauer, H., Probability Theory, de Gruyter, 1996.

2. Bauer, H., Measure and Integration Theory, de Gruyter, 1996, ISBN 3-11-016719-0.

3. Billingsley, P., Probability and Measure, third edition, Wiley, 1995, ISBN 0-471-00710-2.

4. Billingsley, P., Convergence of Probability Measures, Wiley, 1999.

5. Chung, K. L., A Course in Probability Theory, third edition, Academic Press.

6. Dudley, R. M., Real Analysis and Probability, Cambridge University Press, 2002.

7. Feller, W., An Introduction to Probability Theory and Its Applications, Vol. 1 & 2, Wiley, 1950.

8. Halmos, P. R., Measure Theory, Springer, 1974.

9. Jacod, J.; Protter, P., Probability Essentials, second edition, Universitext, Springer, ISBN 3-540-43871-8.

10. Klenke, A., Probability Theory. A Comprehensive Course, Universitext, Springer, ISBN 978-1-84800-047-6.

11. Shiryaev, A. N., Probability, Springer, 1996.

4
1 Basic Notions

1 Probability spaces
Probability theory is the mathematical theory of randomness. The basic notion is
that of a random experiment, an experiment whose outcome is not predictable in advance
and can only be determined by performing it and then observing the outcome.
Probability theory tries to quantify the possible outcomes by attaching a
probability to every event. This is of importance for example for an insurance
company when asking the question what is a fair price of an insurance against
events like fire or death that are events that can happen but need not happen.
The set of all possible outcomes of a random experiment is denoted by Ω.
The set Ω may be finite, denumerable or even uncountable.
Example 1.1. Examples of random experiments and corresponding Ω:
(i) Coin tossing: The possible outcomes of tossing a coin are either “head”
or “tail”. Denoting one outcome by “0” and the other one by “1”, the set
of all possible outcomes is given by Ω = {0, 1}.

(ii) Tossing a coin n times: In this case any sequence of zeros and ones
(alias heads or tails) of length n is considered as one possible outcome;
hence

Ω = {(x_1, x_2, ..., x_n) | x_i ∈ {0, 1}} =: {0, 1}^n

is the space of all possible outcomes.

(iii) Tossing a coin infinitely many times: In this case

Ω = {(x_i)_{i∈N} | x_i ∈ {0, 1}} =: {0, 1}^N.

In this case Ω is uncountable in contrast to the previous examples. We
can define a surjection from Ω onto the set [0, 1] ⊂ R using the binary
expansion

x = Σ_{i=1}^∞ x_i 2^{-i}.

5
(iv) A random number between 0 and 1: Ω = [0, 1].

(v) Continuous stochastic processes, e.g. Brownian motion on R:


Any continuous real-valued function defined on [0, 1] ⊂ R is a possible
outcome. In this case Ω = C([0, 1]).

[Figure: a sample path t ↦ ω(t) of a continuous function on [0, 1].]
Events:
Reasonable subsets A ⊂ Ω for which it makes sense to calculate the probability
are called events (a precise definition will be given in Definition 1.3 below).
If we consider an event A and observe ω ∈ A in a random experiment, we say
that A has occurred.

• elementary events: A = {ω} for some ω ∈ Ω

• the impossible event A = ∅ (never occurs) and the certain event A = Ω


(always occurs)

• “A does not occur” Ac = Ω \ A

Combination of events:

A_1 ∪ A_2,  ⋃_i A_i     "at least one of the events A_i occurs"

A_1 ∩ A_2,  ⋂_i A_i     "all of the events A_i occur"

lim sup_{n→∞} A_n := ⋂_n ⋃_{m≥n} A_m     "infinitely many of the A_m occur"

lim inf_{n→∞} A_n := ⋃_n ⋂_{m≥n} A_m     "all but finitely many of the A_m occur"

Example 1.2. (i) Coin tossing, "1 occurs": A = {1}.

(ii) Tossing a coin n times, "tossing k ones":

A = {(x_1, ..., x_n) ∈ {0, 1}^n | Σ_{i=1}^n x_i = k}.

(iii) Tossing a coin infinitely many times, "relative frequency of 1 equals p":

A = {(x_i)_{i∈N} ∈ {0, 1}^N | lim_{n→∞} (1/n) Σ_{i=1}^n x_i = p}.

(iv) A random number between 0 and 1, "number ∈ [a, b]": A = [a, b] ⊂ Ω = [0, 1].

(v) Continuous stochastic processes, "exceeding level c":

A = {ω ∈ C([0, 1]) | max_{0≤t≤1} ω(t) > c}.

[Figure: a path ω(t) crossing the level c.]

7
Let Ω be countable. A probability distribution function p on Ω is a function

p : Ω → [0, 1] with Σ_{ω∈Ω} p(ω) = 1.

Given any subset A ⊂ Ω, its probability P(A) can then be defined by simply
adding up

P(A) = Σ_{ω∈A} p(ω).

In the uncountable case, however, there is no reasonable way of adding up an


uncountable set of numbers. There is no way to build a reasonable theory
by starting with probability functions specifying the probability of individual
outcomes. The best way out is to specify directly the probability of events.
In the uncountable case it is not possible in general to consider the power set
P(Ω), i.e. the collection of all subsets of Ω (including the empty set ∅ and the
whole set Ω) but only a certain subclass A. On the other hand A should satisfy
some minimal requirements specified in the following definition:

Definition 1.3. A ⊂ P(Ω) is called a σ-algebra, if

(i) Ω ∈ A,

(ii) A ∈ A implies A^c ∈ A,

(iii) A_i ∈ A, i ∈ N, implies ⋃_{i∈N} A_i ∈ A.

Remark 1.4. (i) Let A be a σ-algebra. Then:

• ∅ = Ω^c ∈ A.

• A_i ∈ A, i ∈ N, implies

⋂_{i∈N} A_i = (⋃_{i∈N} A_i^c)^c ∈ A.

• A_1, ..., A_n ∈ A implies

⋃_{i=1}^n A_i ∈ A and ⋂_{i=1}^n A_i ∈ A

(consider A_m = ∅ for all m > n, resp. A_m = Ω for all m > n).

• A_i ∈ A, i ∈ N, implies

⋂_n ⋃_{m≥n} A_m ∈ A and ⋃_n ⋂_{m≥n} A_m ∈ A.

(ii) The power set P(Ω) is a σ-algebra.

(iii) Let I be an index set (not necessarily countable) and for any i ∈ I, let
A_i be a σ-algebra. Then ⋂_{i∈I} A_i := {A ⊂ Ω | A ∈ A_i for any i ∈ I} is
again a σ-algebra.

(iv) Typical construction of a σ-algebra: Let A_0 ≠ ∅ be a class of events. Then

σ(A_0) := ⋂_{B σ-algebra in Ω, A_0 ⊂ B} B

is the smallest σ-algebra containing A_0. σ(A_0) is called the σ-algebra
generated by A_0.
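On a finite Ω the generation procedure in (iv) can be carried out explicitly: closing A_0 under complements and unions until nothing new appears produces exactly σ(A_0). The following Python sketch is only an illustration (the set family and names are arbitrary choices, not part of the lecture).

```python
from itertools import combinations

def generated_sigma_algebra(omega, a0):
    """Smallest sigma-algebra on the finite set `omega` containing the family `a0`."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(a) for a in a0}
    changed = True
    while changed:
        changed = False
        current = list(family)
        # close under complements
        for a in current:
            c = omega - a
            if c not in family:
                family.add(c); changed = True
        # close under pairwise (hence all finite) unions
        for a, b in combinations(current, 2):
            u = a | b
            if u not in family:
                family.add(u); changed = True
    return family

# Example: Omega = {1,2,3,4}, A0 = {{1}, {1,2}}; the atoms are {1}, {2}, {3,4}
sigma = generated_sigma_algebra({1, 2, 3, 4}, [{1}, {1, 2}])
print(len(sigma))                       # 8 = 2^3 sets
print(sorted(map(sorted, sigma)))
```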

Example 1.5. Let Ω be a topological space, and Ao be the collection of open


subsets of Ω. Then B(Ω) := σ(Ao ) is called the Borel-σ-algebra of Ω, or
σ-algebra of Borel-subsets.
Examples of Borel-subsets: closed sets, countable unions of closed sets, etc.
Note: not every subset of a topological space is a Borel-subset, e.g. B(R) ≠ P(R).
Definition 1.6. Let Ω ≠ ∅ and A ⊂ P(Ω) a σ-algebra. A mapping P : A →
[0, 1] is called a probability measure (on (Ω, A)) if:

• P(Ω) = 1

• P(⋃_{i∈N} A_i) = Σ_{i=1}^∞ P(A_i)     ("σ-additivity")

for all pairwise disjoint A_i ∈ A, i ∈ N.

In this case (Ω, A, P) is called a probability space and A ∈ A an event. The
pair (Ω, A) of a set Ω together with a σ-algebra is called a measurable space.

9
Example 1.7. (i) Coin tossing: Let A := P(Ω) = {∅, {0}, {1}, {0, 1}}.
Tossing a fair coin means "head" and "tail" have equal probability 1/2, hence:

P({0}) := P({1}) := 1/2,   P(∅) := 0,   P({0, 1}) := 1   (note that {0, 1} = Ω).

(iii) Tossing a coin infinitely many times: Ω = {0, 1}^N.
Let A := σ(A_0) where

A_0 := {B ⊂ Ω | ∃ n ∈ N and B_0 ∈ P({0, 1}^n), such that B = B_0 × {0, 1} × {0, 1} × ...}.

An event B is contained in A_0 if it depends on finitely many tosses.
Fix x̄_1, ..., x̄_n ∈ {0, 1} and define

P({(x_1, x_2, ...) ∈ {0, 1}^N | x_1 = x̄_1, ..., x_n = x̄_n}) := 2^{-n}

(note that the set on the left-hand side belongs to A_0). P can be extended to a probability measure on A = σ(A_0). For this
probability measure we have that

P({(x_1, x_2, ...) ∈ {0, 1}^N | lim_{n→∞} (1/n) Σ_{i=1}^n x_i = 1/2}) = 1.

(Proof: Later!)

(v) Continuous stochastic processes: Ω = C([0, 1]), P = Wiener measure
on (Ω, B(Ω)) ("Brownian motion"). For fixed t_0 ∈ (0, ∞) and α, β ∈ R,
α < β, we have that

P({ω | ω(t_0) ∈ [α, β]}) := (1/√(2πt_0)) ∫_α^β e^{-x²/(2t_0)} dx

(Gaussian or normal distribution).

What is B(Ω) and how to construct the Wiener measure? An answer
to this question is given in a course on stochastic processes or stochastic
differential equations!

What is now the probability P({ω | max_{0≤t≤1} ω(t) > c})? Answer to this
question: see also a course on stochastic processes or stochastic differential
equations!

10
[Figure: a Brownian path t ↦ ω(t) with ω(t_0) ∈ [α, β].]
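The claim in Example 1.7(iii), that the relative frequency of ones equals 1/2 with probability one, can be illustrated numerically. A minimal Python sketch (illustrative only; the number of tosses and the seed are arbitrary choices):

```python
import random

random.seed(0)
n_tosses = 100_000
tosses = [random.randint(0, 1) for _ in range(n_tosses)]

# running relative frequency of ones
ones = 0
for n, x in enumerate(tosses, start=1):
    ones += x
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:>6}: relative frequency = {ones / n:.4f}")
# the printed frequencies approach 1/2 as n grows
```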
Remark 1.8. Let (Ω, A, P) be a probability space, and let A_1, ..., A_n ∈ A be
pairwise disjoint. Then

P(⋃_{i≤n} A_i) = Σ_{i=1}^n P(A_i)     (P is additive)

(simply let A_m = ∅ for all m > n). In particular:

A, B ∈ A, A ⊂ B ⇒ P(B) = P(A) + P(B \ A) ⇒ P(B \ A) = P(B) − P(A),

and P(A^c) = P(Ω \ A) = P(Ω) − P(A) = 1 − P(A).
P is subadditive, that is, for A, B ∈ A

P(A ∪ B) = P(A ∪ (B \ (A ∩ B))) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B),     (1.1)

and by induction one obtains Sylvester's formula:
Let I be a finite index set, A_i, i ∈ I, be a collection of subsets in A (not
necessarily disjoint). Then (below, |J| denotes the number of elements in J)

P(⋃_{i∈I} A_i) = Σ_{J⊂I, J≠∅} (−1)^{|J|−1} · P(⋂_{j∈J} A_j)     (1.2)

and, for I = {1, ..., n},

= Σ_{k=1}^n (−1)^{k−1} · Σ_{1≤i_1<···<i_k≤n} P(A_{i_1} ∩ ··· ∩ A_{i_k}).

11
Proposition 1.9. Let A be a σ-algebra, and P : A → R_+ := [0, ∞) be a
mapping with P(Ω) = 1. Then the following are equivalent:

(i) P is a probability measure.

(ii) P is additive and continuous from below, that is, A_i ∈ A, i ∈ N, with
A_i ⊂ A_{i+1} for all i ∈ N, implies

P(⋃_{i∈N} A_i) = lim_{i→∞} P(A_i).

(iii) P is additive and continuous from above, that is, A_i ∈ A, i ∈ N, with
A_i ⊃ A_{i+1} for all i ∈ N, implies

P(⋂_{i∈N} A_i) = lim_{i→∞} P(A_i).

Corollary 1.10 (σ-subadditivity). Let (Ω, A, P) be a probability space and A_i,
i ∈ N, be a sequence of subsets in A (not necessarily pairwise disjoint). Then:

P(⋃_{i=1}^∞ A_i) ≤ Σ_{i=1}^∞ P(A_i).

Proof.

P(⋃_{i=1}^∞ A_i) =(1.9) lim_{n→∞} P(⋃_{i=1}^n A_i) ≤(1.1) lim_{n→∞} Σ_{i=1}^n P(A_i) = Σ_{i=1}^∞ P(A_i).

Lemma 1.11 (Borel-Cantelli). Let (Ω, A, P) be a probability space and A_i ∈
A, i ∈ N. Then

Σ_{i=1}^∞ P(A_i) < ∞  ⇒  P(⋂_{n∈N} ⋃_{m≥n} A_m) = 0,

where ⋂_{n∈N} ⋃_{m≥n} A_m =: lim sup_{n→∞} A_n.

Proof. Since

⋃_{m≥n} A_m ↓ ⋂_{n∈N} ⋃_{m≥n} A_m as n → ∞,

the continuity from above of P implies that

P(lim sup_{n→∞} A_n) =(1.9) lim_{n→∞} P(⋃_{m≥n} A_m) ≤(1.10) lim_{n→∞} Σ_{m=n}^∞ P(A_m) = 0,

since Σ_{m=1}^∞ P(A_m) < ∞.
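A quick numerical illustration of Lemma 1.11 (a sketch only; the choice P(A_n) = 1/n², the independence used in the simulation, the truncation level and the sample size are arbitrary): if events with summable probabilities are simulated, every generated realization lies in only finitely many of them.

```python
import random

random.seed(1)
N, trials = 10_000, 1_000            # truncation level and number of sample points
counts = []
for _ in range(trials):
    # simulate indicators of A_n with P(A_n) = 1/n^2 (summable)
    k = sum(1 for n in range(1, N + 1) if random.random() < 1.0 / n**2)
    counts.append(k)

print(max(counts))                   # every realization hit only finitely many A_n
print(sum(counts) / trials)          # close to sum 1/n^2 = pi^2/6 ~ 1.64
```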

Example 1.12. (i) Uniform distribution on [0, 1]: Let Ω = [0, 1] and A
be the Borel-σ-algebra on Ω (= σ({[a, b] | 0 ≤ a ≤ b ≤ 1})). Let P
be the restriction of the Lebesgue measure to the Borel subset [0, 1] of
R. Then (Ω, A, P) is a probability space. The probability measure P is
called the uniform distribution on [0, 1], since P([a, b]) = b − a for any
0 ≤ a ≤ b ≤ 1 (translation invariance).

(ii) Dirac-measure: Let Ω ≠ ∅ and ω_0 ∈ Ω. Let A be an arbitrary σ-algebra
on Ω (e.g. A = P(Ω)). Then

P(A) := 1_A(ω_0) := 1 if ω_0 ∈ A,  0 if ω_0 ∉ A,

defines a probability measure on A. P is called the Dirac-measure in ω_0,
denoted by P = δ_{ω_0} or P = ε_{ω_0}.

(iii) Convex combinations of probability measures: Let Ω ≠ ∅ and A be
a σ-algebra of subsets of Ω. Let I be a countable index set. Let P_i, i ∈ I,
be a family of probability measures on (Ω, A), and α_i ∈ [0, 1], i ∈ I, be
such that Σ_{i∈I} α_i = 1. Then P := Σ_{i∈I} α_i · P_i is again a probability
measure on (Ω, A).
This holds in particular for

P := Σ_{i∈I} α_i · δ_{ω_i}

if ω_i ∈ Ω, i ∈ I.

2 Discrete models
Throughout the whole section

• Ω ≠ ∅ is countable (i.e. finite or denumerable),

• A = P(Ω), and

• ω ∈ Ω (more precisely {ω} ⊂ Ω) is an elementary event.

Proposition 2.1. (i) Let p : Ω → [0, 1] be a function with Σ_{ω∈Ω} p(ω) = 1
(p is called a probability distribution function). Then

P(A) := Σ_{ω∈A} p(ω)   ∀ A ⊂ Ω

defines a probability measure on (Ω, A).

(ii) Every probability measure P on (Ω, A) is of this form, with p(ω) := P({ω})
for all ω ∈ Ω.

Proof. (i)

P = Σ_{ω∈Ω} p(ω) · δ_ω.

(ii) Exercise.

Example 2.2 (Laplace probability space). Fundamental example in the discrete
case that forms the basis of many other discrete models.
Let Ω be a nonempty finite set (that is, 0 < |Ω| < ∞). Define

p(ω) = 1/|Ω|   ∀ ω ∈ Ω.

Then

P(A) = |A|/|Ω| = (number of elements in A)/(number of elements in Ω).

Hence measure theoretic problems reduce to combinatorial problems in the
discrete case.
P is called the uniform distribution on Ω, because every elementary event ω ∈ Ω
has the same probability 1/|Ω|.

14
Example 2.3. (i) Random permutations: Let M := {1, ..., n} and Ω :=
all permutations of M. Then |Ω| = n!. Let P be the uniform distribution
on Ω.
Problem: What is the probability P("at least one fixed point")?
Consider the event A_i := {ω | ω(i) = i} (fixed point at position i). Since
P(A_{i_1} ∩ ··· ∩ A_{i_k}) = (n − k)!/n! (k positions are fixed), Sylvester's
formula (cf. (1.2)) implies that

P("at least one fixed point") = P(⋃_{i=1}^n A_i)
= Σ_{k=1}^n (−1)^{k+1} · Σ_{1≤i_1<···<i_k≤n} P(A_{i_1} ∩ ··· ∩ A_{i_k})
= Σ_{k=1}^n (−1)^{k+1} · (n choose k) · (n − k)!/n! = − Σ_{k=1}^n (−1)^k/k!.

Consequently,

P("no fixed point") = 1 + Σ_{k=1}^n (−1)^k/k! = Σ_{k=0}^n (−1)^k/k! → e^{-1} as n → ∞,

and thus for all k ∈ {0, ..., n}:

P("exactly k fixed points") = (1/n!) · (n choose k) · (n − k)! Σ_{j=0}^{n−k} (−1)^j/j! = (1/k!) Σ_{j=0}^{n−k} (−1)^j/j!

(here 1/n! accounts for all possible outcomes, (n choose k) for the choice of the k fixed points,
and (n − k)! Σ_{j=0}^{n−k} (−1)^j/j! counts the permutations of the remaining n − k positions
without fixed points).

Asymptotics as n → ∞:

P("exactly k fixed points") = (1/k!) Σ_{j=0}^{n−k} (−1)^j/j! → (1/k!) · e^{-1}

(the number of fixed points is asymptotically Poisson distributed with pa-
rameter λ = 1).

The Poisson distribution with parameter λ > 0 on N ∪ {0} is given by

π_λ := Σ_{j=0}^∞ e^{-λ} (λ^j/j!) · δ_j.
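As a numerical check of (i) (an illustrative sketch; the permutation length n and the number of samples are arbitrary), one can shuffle random permutations and compare the empirical distribution of the number of fixed points with the Poisson(1) weights e^{-1}/k!:

```python
import math
import random
from collections import Counter

random.seed(2)
n, samples = 20, 50_000
counts = Counter()
for _ in range(samples):
    perm = list(range(n))
    random.shuffle(perm)
    fixed = sum(1 for i, p in enumerate(perm) if i == p)
    counts[fixed] += 1

for k in range(5):
    empirical = counts[k] / samples
    poisson = math.exp(-1) / math.factorial(k)
    print(f"k = {k}: empirical {empirical:.4f}   Poisson(1) {poisson:.4f}")
```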

(ii) n experiments with state space S, |S| < ∞:

Ω := {ω = (x_1, ..., x_n) | x_i ∈ S},   |Ω| = |S|^n.

Let P be the uniform distribution on Ω.
Fix a subset S_0 ⊂ S, such that x_i ∈ S_0 is called a "success"; hence
p := |S_0|/|S| is the probability of success.
What is the probability of the event A_k = "(exactly) k successes", k = 0, ..., n?

P(A_k) = |A_k|/|Ω| = (n choose k) · |S_0|^k · |S \ S_0|^{n−k} / |S|^n = (n choose k) · p^k (1 − p)^{n−k}

(Binomial distribution with parameters n, p).

The Binomial distribution with parameters n, p on {0, ..., n} is given by

B(n, p) := β_{n,p} := Σ_{k=0}^n (n choose k) · p^k (1 − p)^{n−k} · δ_k,

where p = P("success in the i-th experiment"), i = 1, ..., n (see above).

Moreover, if we replace p by p_n := λ/n, λ > 0, then for k = 0, 1, 2, ... the
asymptotics of the binomial distribution as n → ∞ is given by

(n choose k) · (λ/n)^k · (1 − λ/n)^{n−k}
= (λ^k/k!) · [n(n − 1) ··· (n − k + 1)/n^k] · (1 − λ/n)^{n−k}
→ (λ^k/k!) · e^{-λ}   (k = 0, 1, 2, ...),

since the first bracket tends to 1 and (1 − λ/n)^{n−k} tends to e^{-λ}

(for n big and p small, the Poisson distribution with parameter λ = p · n
is a good approximation for B(n, p)).
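A short sketch of this Poisson approximation in code (illustrative; the values n, p with λ = np are arbitrary choices):

```python
from math import comb, exp, factorial

n, p = 200, 0.02                 # n large, p small, lam = n*p = 4
lam = n * p
for k in range(8):
    binom = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = exp(-lam) * lam**k / factorial(k)
    print(f"k = {k}: B(n,p) {binom:.4f}   Poisson(lam) {poisson:.4f}")
```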

(iii) Urn model (for example: opinion polls, samples, poker, lottery, ...):
We consider an urn containing N balls, K red and N − K black (N ≥ 2,
0 ≠ K ≠ N). Suppose that n ≤ N balls are sampled without replacement.
What is the probability that exactly k balls in the sample are red?
Typical application: suppose that a small lake contains an (unknown) number
N of fish. To estimate N one can do the following: K fish will be
marked red, and after that n (n ≤ N) fish are "sampled" from the
lake. If k is the number of marked fish in the sample, N̂ := K · n/k is an
estimate of the unknown number N. In this case the probability below
with N replaced by N̂ is also an estimate.
Model:
Let Ω be all subsets of {1, ..., N} having cardinality n, hence

Ω := {ω ∈ P({1, ..., N}) | |ω| = n},   |Ω| = (N choose n),

and let P be the uniform distribution on Ω. Consider the event A_k :=
"exactly k red". Then

|A_k| = (K choose k) · (N − K choose n − k),

so that

P(A_k) = (K choose k)(N − K choose n − k) / (N choose n)   (k = 0, ..., n)   (hypergeometric distribution).

Asymptotics for N → ∞, K → ∞ with K/N → p ∈ [0, 1] and n fixed:

P(A_k) → (n choose k) · p^k (1 − p)^{n−k}   (k = 0, ..., n)

(good approximation for N big and n/N small).

17
3 Transformations of probability spaces
Throughout this section let (Ω, A) and (Ω̃, Ã) be measurable spaces.

Definition 3.1. A mapping T : Ω → Ω̃ is called A/Ã-measurable (or simply
measurable), if T^{-1}(Ã) ∈ A for all Ã ∈ Ã.
Notation:

{T ∈ Ã} := T^{-1}(Ã) = {ω ∈ Ω | T(ω) ∈ Ã}.




Remark 3.2. (i) Clearly, if A := P(Ω), then every mapping T : Ω → Ω̃ is
measurable.

(ii) Sufficient criterion for measurability: suppose that Ã := σ(Ã_0) for some
collection of subsets Ã_0 ⊂ P(Ω̃). Then T is A/Ã-measurable if T^{-1}(Ã) ∈ A
for all Ã ∈ Ã_0.

(iii) Let Ω, Ω̃ be topological spaces, and A, Ã be the associated Borel σ-
algebras. Then:

T : Ω → Ω̃ is continuous ⇒ T is A/Ã-measurable.

(iv) Let (Ω_i, A_i), i = 1, 2, 3, be measurable spaces, and T_i : Ω_i → Ω_{i+1},
i = 1, 2, measurable mappings. Then:

T_2 ∘ T_1 is A_1/A_3-measurable.

Proof. (ii) {Ã ∈ P(Ω̃) | T^{-1}(Ã) ∈ A} is a σ-algebra containing Ã_0. Consequently,

σ(Ã_0) ⊂ {Ã ∈ P(Ω̃) | T^{-1}(Ã) ∈ A}.

(iii) Easy consequence of (ii).

(iv) Exercise.

Definition 3.3. Let T : Ω̄ → Ω be a mapping and let A be a σ-algebra of
subsets of Ω. The system

σ(T) := {T^{-1}(A) | A ∈ A}

is a σ-algebra of subsets of Ω̄; σ(T) is called the σ-algebra generated by T. More
precisely: σ(T) is the smallest σ-algebra Ā such that T is Ā/A-measurable.

18
Proposition 3.4. Let T : Ω → Ω̃ be A/Ã-measurable and P be a probability
measure on (Ω, A). Then

P̃(Ã) := T(P)(Ã) := P(T^{-1}(Ã)) =: P(T ∈ Ã),   Ã ∈ Ã,

defines a probability measure on (Ω̃, Ã). P̃ is called the induced measure on
(Ω̃, Ã) or the distribution of T under P.
Notation: P̃ = P ∘ T^{-1} or P̃ = T(P).

Proof. Clearly, P̃(Ã) ≥ 0 for all Ã ∈ Ã, P̃(∅) = 0 and P̃(Ω̃) = 1. For pairwise
disjoint Ã_i ∈ Ã, i ∈ N, the sets T^{-1}(Ã_i) are pairwise disjoint too and
T^{-1}(⋃_{i∈N} Ã_i) = ⋃_{i∈N} T^{-1}(Ã_i), hence, since P is σ-additive,

P̃(⋃_{i∈N} Ã_i) = P(T^{-1}(⋃_{i∈N} Ã_i)) = P(⋃_{i∈N} T^{-1}(Ã_i)) = Σ_{i=1}^∞ P(T^{-1}(Ã_i)) = Σ_{i=1}^∞ P̃(Ã_i).

Remark 3.5. Let T be as in Proposition 3.4, and T(Ω) be countable, so that
T(Ω) = {ω̃_i | i ∈ I}, with I finite or I = N. Then

P̃ = Σ_{i∈I} P(T = ω̃_i) · δ_{ω̃_i}.

Proof. For any Ã ∈ Ã we can write

{T ∈ Ã} = ⋃_{i∈I, ω̃_i∈Ã} {T = ω̃_i},

so that

P̃(Ã) = P(T ∈ Ã) = Σ_{i∈I} P(T = ω̃_i) · 1_Ã(ω̃_i) = (Σ_{i∈I} P(T = ω̃_i) · δ_{ω̃_i})(Ã).

Example 3.6. Infinitely many coin tosses: existence of a probability
measure. Let Ω := [0, 1] and A be the Borel σ-algebra on [0, 1]. Let P be the
restriction of the Lebesgue measure to [0, 1]. Let

Ω̃ := {ω̃ = (x_n)_{n∈N} | x_i ∈ {0, 1} ∀ i ∈ N} = {0, 1}^N.

Define X̃_i : Ω̃ → {0, 1} by

X̃_i((x_n)_{n∈N}) := x_i,   i ∈ N,

and let

Ã := σ({{X̃_i = 1} | i ∈ N}).

Note that Ã = σ(A_0), where A_0 is the algebra of cylindrical subsets of Example
1.7(iii). The binary expansion of some ω ∈ [0, 1] defines a mapping

T : Ω → Ω̃,   ω ↦ T(ω) = (T_1(ω), T_2(ω), ...).

[Figure: the graphs of T_1 and T_2 as {0, 1}-valued step functions on [0, 1]; T_1 jumps at 1/2, T_2 jumps at 1/4, 1/2, 3/4.]

(and similarly for T_3, T_4, ...). Note that T_i = X̃_i ∘ T for all i ∈ N. T is
A/Ã-measurable, since

T^{-1}({X̃_i = 1}) = {T_i = 1} = finite union of intervals ∈ A.

Define P̃ := P ∘ T^{-1}. For fixed (x_1, ..., x_n) ∈ {0, 1}^n we now obtain

P̃(X̃_1 = x_1, ..., X̃_n = x_n) = P̃(⋂_{i=1}^n {X̃_i = x_i}) = P(⋂_{i=1}^n T^{-1}({X̃_i = x_i})) = P(⋂_{i=1}^n {T_i = x_i})
= P(interval of length 2^{-n}) = 2^{-n}.

Hence, for any fixed n, P̃ coincides with the probability measure for n coin
tosses (= uniform distribution on binary sequences of length n). We have thus
shown the existence of a probability measure P̃ on (Ω̃, Ã) and solved a part of
the problem of 1.7.
Uniqueness of P̃ later!
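The construction can be mirrored numerically: drawing ω uniformly from [0, 1] and reading off its binary digits produces, in distribution, a sequence of fair coin tosses. A small Python sketch (illustrative only; the number of samples and of digits are arbitrary choices):

```python
import random

random.seed(3)

def binary_digits(omega, n):
    """First n binary digits T_1(omega), ..., T_n(omega) of omega in [0, 1)."""
    digits = []
    for _ in range(n):
        omega *= 2
        d = int(omega)        # 0 or 1
        digits.append(d)
        omega -= d
    return digits

samples, n = 100_000, 3
freq = [0] * n
for _ in range(samples):
    for i, d in enumerate(binary_digits(random.random(), n)):
        freq[i] += d

print([f / samples for f in freq])   # each close to 1/2, as P(T_i = 1) = 1/2
```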

20
4 Random variables
Let (Ω, A) be a measurable space and

R̄ := R ∪ {−∞, +∞},   B(R̄) := {B ⊂ R̄ | B ∩ R ∈ B(R)}.

Definition 4.1. A random variable (r.v.) on (Ω, A) is an A/B(R̄)-measurable
map X : Ω → R̄.
Remark: We will mainly consider real-valued r.v. X : Ω → R. In this case,
it is of course enough to check A/B(R)-measurability. For a r.v. we let

{X ≤ c} := {ω ∈ Ω | X(ω) ≤ c},   {X < c} := {ω ∈ Ω | X(ω) < c},   etc.

Remark 4.2. (i) X : Ω → R̄ is a random variable if for all c ∈ R, {X ≤ c} =
{X ∈ {−∞} ∪ (−∞, c]} ∈ A. In the same way, if X : Ω → R is
real-valued, it is enough to check {X ≤ c} = {X ∈ (−∞, c]} ∈ A for all
c ∈ R. In both cases, it even suffices to only consider c ∈ Q instead of
c ∈ R, or instead of Q any other set S ⊂ R which is dense in R.

(ii) If A = P(Ω), then every function from Ω to R̄ is a random variable on
(Ω, A).

(iii) Let X be a random variable on (Ω, A) with values in R (resp. R̄) and
h : R → R (resp. h : R̄ → R̄) be B(R)/B(R)-measurable (resp.
B(R̄)/B(R̄)-measurable). Then h(X) is a random variable too.
Examples: |X|, X², |X|^p, e^X, ...

(iv) The class of random variables on (Ω, A) is closed under the following
countable operations.
If X_1, X_2, ... are random variables, then

• Σ_{i=1}^n α_i · X_i (α_i ∈ R, n ∈ N), provided the sum of R̄-valued r.v.'s
makes sense (∞ − ∞ is not defined),

• sup_{i∈N} X_i, inf_{i∈N} X_i, in particular

X_1 ∧ X_2 := min(X_1, X_2),   X_1 ∨ X_2 := max(X_1, X_2),

• lim sup_{i→∞} X_i, lim inf_{i→∞} X_i (hence also lim_{i→∞} X_i, if it exists)

are random variables too.

Proof. (i) Obvious, since σ({{−∞} ∪ (−∞, c] : c ∈ R}) = B(R̄), resp.
σ({(−∞, c] : c ∈ R}) = B(R). This also holds if we replace c ∈ R by
c ∈ S where S ⊂ R is any dense subset of R. For instance, if S = Q,
then for c ∈ R we have

{X > c} = ⋃_{q∈Q, q>c} {X > q},

since for a real number x it holds that x > c ⇔ x > q for some q ∈ Q, q > c.

(ii) and (iii) Obvious.

(iv) For example:

• supremum:

{(sup_{i∈N} X_i) ≤ c} = ⋂_{i∈N} {X_i ≤ c} ∈ A.

• sum (assume X, Y to be real-valued r.v.'s): then for c ∈ R

{X + Y < c} = ⋃_{r∈Q} ({X < r} ∩ {Y < c − r}) ∈ A.

Important examples

Example 4.3. (i) Indicator functions of an event A ∈ A:

ω ↦ 1_A(ω) := 1 if ω ∈ A,  0 if ω ∉ A   (alternative notation I_A)

is a random variable, because

{1_A ≤ c} = ∅ if c < 0,   = A^c if 0 ≤ c < 1,   = Ω if c ≥ 1.

(ii) Simple random variables:

X = Σ_{i=1}^n c_i · 1_{A_i},   c_i ∈ R, A_i ∈ A.

Note: any finite-valued random variable is simple, because X(Ω) = {c_1, ..., c_n}
implies

X = X · 1_{⋃_{i=1}^n A_i} = Σ_i c_i 1_{A_i}   with A_i := {X = c_i} = X^{-1}({c_i}) ∈ A.

Proposition 4.4 (Structure of random variables on (Ω, A)). Let X be a random
variable on (Ω, A). Then:

(i) X = X^+ − X^−, with

X^+ := max(X, 0),   X^− := − min(X, 0)   (random variables!).

(ii) Let X ≥ 0. Then there exists a sequence of simple random variables
X_n, n ∈ N, with 0 ≤ X_n ≤ X_{n+1} and X = lim_{n→∞} X_n (in short:
0 ≤ X_n ↗ X).

Proof (of (ii)).

X_n := Σ_{i=0}^{n2^n − 1} (i/2^n) · 1_{{i/2^n ≤ X < (i+1)/2^n}} + n · 1_{{X ≥ n}}.
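The approximating sequence in (ii) is easy to evaluate pointwise. A minimal sketch (illustrative; the test value x standing for X(ω) is an arbitrary choice):

```python
def dyadic_approx(x, n):
    """Value X_n(omega) of the simple approximation in 4.4(ii), given X(omega) = x >= 0."""
    if x >= n:
        return float(n)
    # largest dyadic level i/2^n with i/2^n <= x < (i+1)/2^n
    return int(x * 2**n) / 2**n

x = 2.71828                      # some value X(omega)
for n in range(1, 6):
    print(n, dyadic_approx(x, n))  # increases monotonically to x; error <= 2^-n once n > x
```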

Let (Ω, A, P) be a probability space.

Definition 4.5. Let X be a random variable on (Ω, A) with

min(∫ X^+ dP, ∫ X^− dP) < ∞.     (1.3)

Then

E[X] := ∫ X dP (= ∫_Ω X dP)

is called the expectation of X (w.r.t. P).

23
Definition/Construction of the integral w.r.t. P:
Let X be a random variable.

1. If X = 1_A, A ∈ A, define

∫ X dP := P(A).

2. If X = Σ_{i=1}^n c_i · 1_{A_i}, c_i ∈ R, A_i ∈ A, define

∫ X dP := Σ_{i=1}^n c_i · P(A_i)

(have to show: independent of the particular representation of X!).

3. If X ≥ 0, then there exist simple X_n ≥ 0 (see 4.4) with X_n ↗ X. Define

∫ X dP := lim_{n→∞} ∫ X_n dP ∈ [0, ∞]

(have to show: independent of the particular choice of the X_n!).

4. For general X, decompose X = X^+ − X^− and define

E[X] = ∫ X dP := ∫ X^+ dP − ∫ X^− dP

(well-defined if (1.3) is satisfied, using the usual conventions for c ∈ R:
−∞ + c := c − ∞ := −∞, c + ∞ := ∞ + c := ∞).

Definition 4.6. (i) The set of all P-integrable random variables is defined
by

ℒ¹ := ℒ¹(Ω, A, P) := {X r.v. | E[|X|] < ∞}.

(ii) A property E of points ω ∈ Ω holds P-almost surely (P-a.s.), if there
exists a measurable zero set N, i.e. a set N ∈ A with P(N) = 0, such
that every ω ∈ Ω \ N has property E.

If

N := {X r.v. | X = 0 P-a.s.},

then the quotient space

L¹ := L¹(Ω, A, P) := ℒ¹/N   (X ∼ Y :⇔ X − Y ∈ N :⇔ X = Y P-a.s.)

is a Banach space w.r.t. the norm E[|X|].

Remark 4.7. Special case: X a random variable, X ≥ 0, X(Ω) countable. Then
(with the usual conventions (+∞) · 0 := (−∞) · 0 := 0, (+∞) · α := +∞ if
α > 0, (+∞) · α := −∞ if α < 0, etc.) we have, by steps "2." and "3." above,

E[X] = E[Σ_{x∈X(Ω)} x · 1_{{X=x}}] = Σ_{x∈X(Ω)} x · P(X = x).     (1.4)

Similarly for X not necessarily nonnegative, but with E[X] well-defined:

E[X] = Σ_{x∈X(Ω), x>0} x · P(X = x) − Σ_{x∈X(Ω), x<0} (−x) · P(X = x).

If, in addition, Ω is countable and X ≥ 0, then

X = Σ_{ω∈Ω} X(ω) · 1_{{ω}},   and

E[X] = Σ_{ω∈Ω} X(ω) · E[1_{{ω}}] = Σ_{ω∈Ω} X(ω) · P({ω}) = Σ_{ω∈Ω} p(ω) · X(ω),   where p(ω) := P({ω}).

Example 4.8. Infinitely many coin tosses with a fair coin: Let Ω =
{0, 1}^N, and A and P as in 3.6.

(i) Expectation of the i-th coin toss X_i((x_n)_{n∈N}) := x_i:

E[X_i] =(1.4) 1 · P(X_i = 1) + 0 · P(X_i = 0) = 1/2.

(ii) Expectation of the number of "successes": Let

S_n := X_1 + ··· + X_n = number of "successes" (= ones) in n tosses.

Then for k = 0, 1, ..., n

P(S_n = k) = Σ_{(x_1,...,x_n)∈{0,1}^n with x_1+···+x_n=k} P(X_1 = x_1, ..., X_n = x_n) = (n choose k) · 2^{-n}.

Hence

E[S_n] =(1.4) Σ_{k=0}^n k · P(S_n = k) = Σ_{k=1}^n k · (n choose k) · 2^{-n} = n/2.

Easier: once we have noticed that E[·] is linear (see next proposition):

E[S_n] = Σ_{i=1}^n E[X_i] =(i) n/2.

(iii) Waiting time until the first success: Let

T(ω) := min{n ∈ N | X_n(ω) = 1} = waiting time until the first success

(min ∅ := +∞; T is measurable — why?). Then

P(T = k) = P(X_1 = ··· = X_{k−1} = 0, X_k = 1) = 2^{-k},

and P(T = ∞) = 0, so that

E[T] =(1.4) Σ_{k=1}^∞ k · P(T = k) = Σ_{k=1}^∞ k · 2^{-k} = (1/2) Σ_{k=1}^∞ k · (1/2)^{k−1} = 2.

(Recall: 1/(1−q)² = d/dq (q/(1−q)) = d/dq Σ_{k=1}^∞ q^k = Σ_{k=1}^∞ k q^{k−1} for |q| < 1.)

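A quick empirical check of (iii) (a sketch only; the sample size and seed are arbitrary): simulating the waiting time until the first head of a fair coin and averaging gives a value close to 2.

```python
import random

random.seed(4)

def waiting_time():
    """Number of fair-coin tosses until the first 1 appears."""
    k = 1
    while random.randint(0, 1) == 0:
        k += 1
    return k

samples = 100_000
print(sum(waiting_time() for _ in range(samples)) / samples)   # approximately E[T] = 2
```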
Remark 4.9. X = Y P-a.s., i.e. P(X = Y ) = 1, implies E[X] = E[Y ].

26
Proposition 4.10. Let X, Y be r.v. satisfying (1.3). Then

(i) 0 ≤ X ≤ Y P-a.s. ⟹ 0 ≤ E[X] ≤ E[Y].

(ii) α, β ∈ R, Y ∈ L¹ ⟹ E[αX + βY] = α E[X] + β E[Y].

In particular, (i) implies: X ≤ Y P-a.s. ⟹ E[X] ≤ E[Y] (only (1.3) is
necessary!).

Proof. See textbooks on measure theory.

In addition, X ↦ E[X] is continuous w.r.t. monotone increasing sequences, i.e.
the following proposition holds:

Proposition 4.11 (monotone integration, B. Levi). Let X_n be random variables
with 0 ≤ X_1 ≤ X_2 ≤ ... . Then:

lim_{n→∞} E[X_n] = E[lim_{n→∞} X_n].

Proof. See textbooks on measure theory.
Corollary 4.12. Let X_n ≥ 0, n ∈ N, be random variables. Then

E[Σ_{n=1}^∞ X_n] = Σ_{n=1}^∞ E[X_n].

Lemma 4.13 (Fatou's lemma). Let X_n, n ∈ N, be a sequence of random variables
satisfying (1.3) and let Y ∈ L¹. Then

(i) X_n ≥ Y P-a.s. for all n ∈ N ⟹ E[lim inf_{n→∞} X_n] ≤ lim inf_{n→∞} E[X_n].

(ii) X_n ≤ Y P-a.s. for all n ∈ N ⟹ E[lim sup_{n→∞} X_n] ≥ lim sup_{n→∞} E[X_n].

(Remark: (i) and (ii) are mostly applied with Y = 0.)


Proof. (i) By Remark 4.9 and Corollary 5.6 below we may assume that |Y(ω)| <
∞ for all ω ∈ Ω. Since inf_{k≥n} X_k − Y satisfies (1.3),

E[lim inf_{n→∞} X_n] =(4.10(ii)) E[lim_{n→∞} inf_{k≥n} X_k − Y] + E[Y]
=(B. Levi) lim_{n→∞} E[inf_{k≥n} X_k − Y] + E[Y]
≤(4.10) lim_{n→∞} inf_{k≥n} E[X_k − Y] + E[Y] =(4.10(ii)) lim inf_{n→∞} E[X_n].

(ii) is similar to (i).

27
Proposition 4.14 (Lebesgue's dominated convergence theorem, DCT). Let
X_n, n ∈ N, be random variables and Y ∈ L¹ with |X_n| ≤ Y P-a.s. Suppose
that the pointwise limit lim_{n→∞} X_n exists P-a.s. Then

E[lim_{n→∞} X_n] = lim_{n→∞} E[X_n].

Proof. Since −Y ≤ X_n ≤ Y P-a.s. for any n and

lim inf_{n→∞} X_n = lim sup_{n→∞} X_n = lim_{n→∞} X_n   P-a.s.,

it follows that

E[lim_{n→∞} X_n] =(4.9) E[lim inf_{n→∞} X_n] ≤(Fatou) lim inf_{n→∞} E[X_n] ≤ lim sup_{n→∞} E[X_n]
≤(Fatou) E[lim sup_{n→∞} X_n] =(4.9) E[lim_{n→∞} X_n].

Example 4.15. Tossing a fair coin. Consider the following simple game: A
fair coin is thrown and the player can invest an arbitrary amount of KRW on
either "head" or "tail". If the right side shows up, the player gets twice his
investment back, otherwise nothing.
Suppose now a player plays the following bold strategy: he doubles his investment
until his first success. Assuming the initial investment was 1000 KRW,
the investment in the n-th round is given by

I_n = 2^{n−1} · 1_{{T > n−1}},

where T = waiting time until the first "1". Then

E[I_n] = 2^{n−1} · P(T > n − 1) = 2^{n−1} · (1/2)^{n−1} = 1,

whereas on the other hand lim_{n→∞} I_n = 0 P-a.s. (more precisely: for all
ω ≠ (0, 0, 0, ...)). So E[I_n] does not converge to the expectation of the limit;
the domination assumption of Proposition 4.14 fails here.

5 Inequalities
Let (Ω, A, P) be a probability space.

Proposition 5.1 (Jensen's inequality). Let h be a convex function defined on
some interval I ⊆ R, X ∈ L¹ with X(Ω) ⊂ I. Then E[X] ∈ I and

h(E[X]) ≤ E[h(X)].

Proof. W.l.o.g. we may assume that x_0 := E[X] lies in the interior of I (otherwise X is P-a.s.
equal to a constant function, for which the statement is trivial). Since h is
convex, there exists an affine linear function ℓ with ℓ(x_0) = h(x_0) and ℓ ≤ h
("support tangent line"). Consequently, by linearity and monotonicity of the expectation,

h(E[X]) = ℓ(E[X]) = E[ℓ(X)] ≤ E[h(X)].
Example 5.2.

E[X]² ≤ E[X²].

More generally, for 0 < p ≤ q:

(E[|X|^p])^{1/p} =: ‖X‖_p ≤ ‖X‖_q := (E[|X|^q])^{1/q}.

Proof. h(x) := |x|^{q/p} is convex. Since |X| ∧ n ∈ L¹ for n ∈ N, we obtain
that

(E[(|X| ∧ n)^p])^{q/p} ≤ E[(|X| ∧ n)^q],

which implies the assertion by B. Levi, taking the limit n → ∞.


Definition 5.3. For 1 ≤ p < ∞ let

ℒ^p := {X | X r.v. and E[|X|^p] < ∞}.

ℒ^p is called the set of p-integrable random variables.

Remark 5.4. (i) If 1 ≤ p ≤ q then ℒ^q ⊂ ℒ^p.

(ii) Let N = {X | X r.v. and X = 0 P-a.s.}, and p ≥ 1. Then N ⊂ ℒ^p is a
linear subspace and the quotient space

L^p := ℒ^p/N

is a Banach space w.r.t. ‖·‖_p (i.e. a complete normed vector space).

29
Proposition 5.5. Let X be a random variable, h : R̄ → [0, ∞] be increasing.
Then

h(c) P(X ≥ c) ≤ E[h(X)]   ∀ c > 0.

Proof.

h(c) P(X ≥ c) ≤ h(c) P(h(X) ≥ h(c)) = E[h(c) 1_{{h(X) ≥ h(c)}}] ≤ E[h(X)].

Corollary 5.6. (i) Markov inequality: Choose h(x) = x 1_{[0,∞]}(x) and replace
X by |X| in 5.5. Then

P(|X| ≥ c) ≤ (1/c) E[|X|]   ∀ c > 0.

In particular,

E[|X|] = 0 ⇒ |X| = 0 P-a.s.,
E[|X|] < ∞ ⇒ |X| < ∞ P-a.s.

(ii) Chebychev's inequality: Choose h(x) = x² 1_{[0,∞]}(x) and replace
X by |X − E[X]| in 5.5. Then X ∈ L² implies

P(|X − E[X]| ≥ c) ≤ (1/c²) E[(X − E[X])²] = var(X)/c².
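A small Monte Carlo sanity check of Chebychev's inequality (a sketch; the exponential test distribution, the sample size and the values of c are arbitrary choices):

```python
import random

random.seed(5)
samples = [random.expovariate(1.0) for _ in range(100_000)]   # mean 1, variance 1
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

for c in (1.0, 2.0, 3.0):
    lhs = sum(1 for x in samples if abs(x - mean) >= c) / len(samples)
    print(f"c = {c}: P(|X - E[X]| >= c) = {lhs:.4f} <= var/c^2 = {var / c**2:.4f}")
```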

6 Variance and Covariance

Let (Ω, A, P) be a probability space.
E[X] = "average value" of X(ω), ω ∈ Ω (prediction).

Remark 6.1. Let P be the uniform distribution on Ω = {ω_1, ..., ω_n}. Then

E[X] = (1/n) Σ_{i=1}^n X(ω_i) = arithmetic mean of X(ω_1), ..., X(ω_n).

Definition 6.2. Let X ∈ L¹. Then

var(X) := σ²(X) := E[(X − E[X])²] ∈ [0, ∞]

is called the variance of X (mean square prediction error).
The variance is a measure for the fluctuations of X around E[X]. It indicates
the risk that one takes when a prognosis is based on the expectation.
σ(X) := √var(X) is called the standard deviation.

Remark 6.3. (i)

var(X) = E[(X − E[X])²] = E[X²] − E[X]².

(ii) var(X) = 0 ⇔ P(X = E[X]) = 1, i.e. X behaves deterministically.

(iii) var(X) < ∞ ⇔ X ∈ L².

Definition 6.4. Let X, Y ∈ L². Then

cov(X, Y) := E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] · E[Y]     (1.5)

is called the covariance of X and Y. If in addition X, Y are not deterministic,
then

ϱ(X, Y) := cov(X, Y)/(σ(X) · σ(Y))

is called the correlation of X and Y. The correlation is independent of scaling,
i.e. ϱ(aX + b, cY + d) = ϱ(X, Y) if ac > 0 (use the formula in Remark 6.5(i)
below).

Remark 6.5 (properties of the covariance). (i) Let X ∈ L¹. Then

var(aX + b) = a² · var(X).

(ii) Let X, Y ∈ L². Then

var(X + Y) = var(X) + var(Y) + 2 cov(X, Y).

Definition 6.6. Two random variables X, Y ∈ L² are called uncorrelated if

cov(X, Y) = 0   (⇔ var(X + Y) = var(X) + var(Y)).     (1.6)

31
Proposition 6.7 (Cauchy-Schwarz). Let X, Y ∈ L². Then

X · Y ∈ L¹ and |cov(X, Y)| ≤ σ(X) · σ(Y).

In particular, ϱ(X, Y) ∈ [−1, 1].

Proof. Let X, Y ∈ L². Then X + Y ∈ L², hence

2 · XY = (X + Y)² − X² − Y² ∈ L¹.

cov(·, ·) : L² × L² → R, (X, Y) ↦ cov(X, Y), is bilinear, symmetric and
positive. Hence by the Cauchy-Schwarz inequality

|cov(X, Y)| ≤ cov(X, X)^{1/2} cov(Y, Y)^{1/2} = σ(X) · σ(Y).

Example 6.8. Tossing a coin with probability p ∈ [0, 1] for success:

Ω = {ω = (x_1, x_2, ...) | x_i ∈ {0, 1}} = {0, 1}^N,
X_i : Ω → {0, 1} with X_i((x_n)_{n∈N}) = x_i,   i ∈ N,
A = σ({{X_i = 1} | i ∈ N}).

Then there exists a unique probability measure P = P_p on (Ω, A) with

P_p(X_{i_1} = x_1, ..., X_{i_n} = x_n) = p^{Σ_{i=1}^n x_i} · (1 − p)^{n − Σ_{i=1}^n x_i}   for any 1 ≤ i_1 < ... < i_n

(existence for p = 1/2 in Example 3.6, existence for general p ≠ 1/2 later, uniqueness
later).
Then P(X_i = 1) = p and P(X_i = 1, X_j = 1) = p² for all i ≠ j. Consequently,

E[X_i] = p and var(X_i) = E[X_i²] − E[X_i]² = p − p² = p(1 − p),

and for i ≠ j

cov(X_i, X_j) = E[X_i X_j] − p² = 0,

so that X_1, X_2, ... are pairwise uncorrelated (in fact even independent, see
below).

Let S_n := X_1 + ··· + X_n be the number of successes. Then

E[S_n] = np and var(S_n) = np(1 − p).

If X := Σ_{n=1}^∞ 2^{-n} X_n, then by B. Levi

E[X] = E[Σ_{n=1}^∞ 2^{-n} X_n] = Σ_{n=1}^∞ E[2^{-n} X_n] = Σ_{n=1}^∞ 2^{-n} p = p,

and using Levi and the fact that X_1, X_2, ... are pairwise uncorrelated, we conclude
that

var(X) = Σ_{n=1}^∞ 2^{-2n} · p(1 − p) = (1/3) · p(1 − p).

Finally, let T be the waiting time until the first "success". Then

P_p(T = n) = P_p(X_1 = ··· = X_{n−1} = 0, X_n = 1) = (1 − p)^{n−1} p   (geometric distribution),

and hence

E[T] = E[Σ_{n=1}^∞ n · 1_{{T=n}}] = Σ_{n=1}^∞ n · P_p(T = n) = Σ_{n=1}^∞ n · (1 − p)^{n−1} p = 1/p

("derivative of the geometric series"), and analogously

var(T) = ··· = (1 − p)/p².
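A brief simulation check of the last two formulas (a sketch only; the value of p, the seed and the sample size are arbitrary choices):

```python
import random

random.seed(6)
p, samples = 0.3, 200_000

def waiting_time(p):
    """Number of Bernoulli(p) trials until the first success."""
    n = 1
    while random.random() >= p:
        n += 1
    return n

ts = [waiting_time(p) for _ in range(samples)]
mean = sum(ts) / samples
var = sum((t - mean) ** 2 for t in ts) / samples
print(mean, 1 / p)                 # empirical mean vs E[T] = 1/p
print(var, (1 - p) / p**2)         # empirical variance vs var(T) = (1-p)/p^2
```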

7 The (strong and the weak) law of large numbers

Let

• (Ω, A, P) be a probability space,

• X_1, X_2, ... ∈ L² r.v. with
  – X_i uncorrelated, i.e. cov(X_i, X_j) = 0 for i ≠ j,
  – uniformly bounded variances, i.e. sup_{i∈N} var(X_i) < ∞ (we write σ_i² := σ²(X_i) := var(X_i)).

Let

S_n := X_1 + ··· + X_n,

so that S_n(ω)/n is the arithmetic mean of the first n observations X_1(ω), ..., X_n(ω)
("empirical mean").
Our aim in this section is to show that randomness in the empirical mean
vanishes for increasing n, i.e.

S_n(ω)/n ∼ E[S_n]/n for n large,

resp.

S_n(ω)/n ∼ m for n large, if E[X_i] ≡ m.

Remark 7.1. W.l.o.g. we may assume that E[X_i] = 0 for all i, because
otherwise we consider X̃_i := X_i − E[X_i] ("centered"), which satisfies:

• X̃_i ∈ L²

• cov(X̃_i, X̃_j) = cov(X_i, X_j) = 0 for i ≠ j

• var(X̃_i) = var(X_i)

• S̃_n/n − E[S̃_n]/n = S_n/n − E[S_n]/n   (S̃_n := Σ_{i=1}^n X̃_i, "centered sum").

Proposition 7.2.

lim_{n→∞} E[(S_n/n − E[S_n]/n)²] = 0

(resp. lim_{n→∞} E[(S_n/n − m)²] = 0 if E[X_i] ≡ m).

Proof.

E[(S_n/n − E[S_n]/n)²] = var(S_n/n) = (1/n²) · var(S_n) =(Bienaymé) (1/n²) Σ_{i=1}^n σ_i² ≤ (1/n) · const. → 0 as n → ∞.

Remark 7.3. Mere functional analytic fact: in the Hilbert space L² (with
the scalar product (X, Y) = E[X · Y] and norm ‖·‖ = (·, ·)^{1/2}), the
arithmetic mean of uniformly bounded, orthogonal vectors converges to zero:

‖S_n/n‖² = (1/n²) · ‖S_n‖² = (1/n²) Σ_{i=1}^n ‖X_i‖² → 0 as n → ∞.

Chebychev’s inequality immediately implies the following:


Proposition 7.4 ("Weak law of large numbers"). Let X_1, X_2, ... ∈ L²(Ω, A, P)
be uncorrelated r.v. with uniformly bounded variances and E[X_i] = m ∀ i. Then
for all ε > 0:

lim_{n→∞} P(|S_n/n − m| > ε) = 0

("convergence in probability of S_n/n to m").

Proof.

P(|S_n/n − m| > ε) ≤ (1/ε²) · var(S_n/n) → 0 as n → ∞.

Example 7.5. Bernoulli experiments with parameter p ∈ [0, 1]: Let X_i(ω) =
x_i and P_p(X_i = 1) = p, hence E_p[X_i] = p and var(X_i) = p(1 − p) (≤ 1/4).
Then, for the relative frequency S_n/n of "1",

P_p(|S_n/n − p| > ε) → 0 as n → ∞     (1.7)

(J. Bernoulli: Ars Conjectandi).

Interpretation of the success probability p as relative frequency.
Problem: to infer p from (1.7) one uses a probabilistic statement w.r.t. a
probability measure P_p that is defined with the help of p.

Example 7.6. Application: uniform approximation of f ∈ C([0, 1]) with
Bernstein polynomials.
Let p ∈ [0, 1]. Then by the transformation theorem (see assignments)

B_n(p) := Σ_{k=0}^n f(k/n) (n choose k) p^k (1 − p)^{n−k} = Σ_{k=0}^n f(k/n) · P_p(S_n = k) = E_p[f(S_n/n)].

Let ε > 0: f is uniformly continuous on [0, 1], so there exists δ = δ(ε) > 0 such that

sup_{x,y∈[0,1], |x−y|≤δ} |f(x) − f(y)| ≤ ε.

Now

|B_n(p) − f(p)| = |E_p[f(S_n/n) − f(p)]| ≤ E_p[|f(S_n/n) − f(p)|]
= E_p[|f(S_n/n) − f(p)| · 1_{{|S_n/n − p| ≤ δ}}] + E_p[|f(S_n/n) − f(p)| · 1_{{|S_n/n − p| > δ}}]
≤ ε · P_p(|S_n/n − p| ≤ δ) + 2‖f‖_∞ P_p(|S_n/n − p| > δ),

where P_p(|S_n/n − p| > δ) ≤ p(1 − p)/(δ²n) ≤ 1/(4δ²n). Consequently,

lim sup_{n→∞} ‖B_n − f‖_∞ ≤ ε ∀ ε > 0, hence lim_{n→∞} ‖B_n − f‖_∞ = 0.
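A short sketch of this approximation in code (illustrative; the target function f, the degrees n and the evaluation grid are arbitrary choices):

```python
from math import comb

def bernstein(f, n, p):
    """n-th Bernstein polynomial of f evaluated at p."""
    return sum(f(k / n) * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

f = lambda x: abs(x - 0.5)          # continuous but not smooth
for n in (10, 100, 1000):
    err = max(abs(bernstein(f, n, i / 200) - f(i / 200)) for i in range(201))
    print(n, err)                    # sup-norm error decreases as n grows
```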

From convergence in probability to P-a.s. convergence:

Lemma 7.7. Let Z_1, Z_2, ... be r.v. on (Ω, A, P). Then

P(lim sup_{n→∞} {|Z_n| ≥ ε}) = 0 ∀ ε > 0 ⟺ P(lim_{n→∞} Z_n = 0) = 1.

In particular, using the Lemma of Borel-Cantelli (Lemma 1.11), we obtain that

Σ_{n=1}^∞ P(|Z_n| > ε) < ∞ ∀ ε > 0     ("fast convergence in probability to 0")

implies

P({ω | lim_{n→∞} Z_n(ω) = 0}) = 1     ("almost sure convergence to 0").

Proof. "⇒": Define N_k := lim sup_{n→∞} {|Z_n| ≥ 1/k} and N := ⋃_{k≥1} N_k. Then
N ∈ A and by assumption P(N) = 0. Let ω ∈ N^c (= Ω \ N). Then ω ∈ N_k^c
for any k ∈ N, i.e. for any k ∈ N we have

|Z_n(ω)| < 1/k for all but finitely many n ∈ N,

thus lim_{n→∞} Z_n(ω) = 0.

"⇐": Suppose there exists a measurable N with P(N) = 0 and Z_n(ω) → 0 as n → ∞ for all ω ∈ Ω \ N.
Let ε > 0 be arbitrary and A_n := {|Z_n| > ε}. If ω ∈ N^c, then ω ∈ A_n^c for all but
finitely many n, thus N^c ⊂ ⋃_n ⋂_{m≥n} A_m^c = (⋂_n ⋃_{m≥n} A_m)^c and so P(lim sup_{n→∞} A_n) = 0.

Proposition 7.8 ("Strong law of large numbers"). Let X_1, X_2, ... ∈ L²(Ω, A, P)
be uncorrelated with sup_{i∈N} σ²(X_i) = c < ∞. Then:

lim_{n→∞} (S_n/n − E[S_n]/n) = 0 P-a.s.

(resp., if E[X_i] = m: lim_{n→∞} S_n/n = m P-a.s.).

Proof. Again w.l.o.g. we may assume that E[X_i] = 0 (otherwise consider
X̃_i := X_i − E[X_i]).

Step 1: "Fast convergence in probability towards 0 along the subsequence n_k = k²":
For all ε > 0,

P(|S_{k²}/k²| > ε) ≤(Chebychev) (1/(ε²k⁴)) var(S_{k²}) ≤ c/(ε²k²).

Consequently, Lemma 7.7 implies that

lim_{k→∞} S_{k²}(ω)/k² = 0 ∀ ω ∉ N_1, with P(N_1) = 0.

Step 2: Let D_k := max_{k² ≤ l < (k+1)²} |S_l − S_{k²}|. We show fast convergence in
probability of D_k/k² to 0. For all ε > 0:

P(D_k/k² > ε) = P(⋃_{l=k²+1}^{k²+2k} {|S_l − S_{k²}| > εk²})
≤ Σ_{l=k²+1}^{k²+2k} P(|S_l − S_{k²}| > εk²)
≤(Chebychev) Σ_{l=k²+1}^{k²+2k} (l − k²) · c/(ε²k⁴)   (and l − k² ≤ 2k)
≤ (2k)(2k) · c/(ε²k⁴) = 4c/(ε²k²).

Lemma 7.7 now implies that

lim_{k→∞} D_k(ω)/k² = 0 ∀ ω ∉ N_2, with P(N_2) = 0.

Step 3: For n ∈ N and k = k(n) ∈ N with k² ≤ n < (k+1)² we obtain that

|S_n(ω)/n| ≤ (|S_{k²}(ω)| + D_k(ω))/k² → 0 as n → ∞, ∀ ω ∉ N_1 ∪ N_2.

Example 7.9. Bernoulli experiments with p ∈ [0, 1]:

(1/n) Σ_{i=1}^n X_i → p P_p-a.s.   (E. Borel 1909; improves Bernoulli's result).

Consider the experiment of tossing a fair coin (p = 1/2) and let Y_i := 2X_i − 1. Then

S_n = Y_1 + ··· + Y_n = position of a particle undergoing a "random walk" on Z.

Increasing refinement (+ rescaling and linear interpolation) of the random walk
yields the Brownian motion.

The strong law of large numbers implies that S_n(ω)/n → 0 P-a.s.
In particular, fluctuations grow slower than linearly.

A precise description of the order of the fluctuations is provided by the law of the
iterated logarithm:

lim sup_{n→∞} S_n(ω)/√(2n log log n) = +1 P-a.s.,
lim inf_{n→∞} S_n(ω)/√(2n log log n) = −1 P-a.s.
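A small simulation sketch of these two scalings for the simple random walk (illustrative only; the path length, the number of paths and the truncation are arbitrary choices, and finite paths can only hint at the asymptotic constants):

```python
import math
import random

random.seed(7)
n_steps = 100_000
for _ in range(3):                      # a few independent paths
    s, ratio_max = 0, 0.0
    for n in range(1, n_steps + 1):
        s += random.choice((-1, 1))
        if n > 10:                      # log log n needs n > e
            ratio_max = max(ratio_max, abs(s) / math.sqrt(2 * n * math.log(math.log(n))))
    print(f"S_n/n = {s / n_steps:+.5f}   max_n |S_n|/sqrt(2 n log log n) = {ratio_max:.2f}")
# S_n/n ends close to 0, while the iterated-logarithm ratio stays of order 1
```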

8 Convergence and uniform integrability

Definition 8.1. Let X, X_1, X_2, ... be r.v. on (Ω, A, P).

(i) L^p-convergence (p ≥ 1):

lim_{n→∞} E[|X_n − X|^p] = 0

(alternative notation: lim_{n→∞} ‖X_n − X‖_p = 0).

(ii) Convergence in probability:

∀ ε > 0: lim_{n→∞} P(|X_n − X| > ε) = 0.

(iii) P-a.s. convergence:

P(lim_{n→∞} X_n = X) = 1.

Proposition 8.2 (Dependencies between the three types of convergence).
(i) ⇒ (ii) and (iii) ⇒ (ii); conversely, (ii) ⇒ (iii) along some subsequence, and
(iii) ⇒ (i) if sup_{n∈N} |X_n| ∈ L^p (resp. if (|X_n|^p)_{n∈N} is uniformly integrable).

Proof. (i)⇒(ii): Chebychev's inequality implies:

P(|X_n − X| > ε) ≤ E[|X_n − X|^p]/ε^p.

(iii)⇒(ii):

{lim_{n→∞} X_n = X} = ⋂_{k=1}^∞ ⋃_{m=1}^∞ ⋂_{n≥m} {|X_n − X| ≤ 1/k} =: ⋂_{k=1}^∞ A_k.

Then P(lim_{n→∞} X_n = X) = 1 implies P(A_k) = 1 for all k ∈ N.
Continuity of P from below (cf. Proposition 1.9) implies that

1 = P(A_k) =(1.9) lim_{m→∞} P(⋂_{n≥m} {|X_n − X| ≤ 1/k})
≤ lim inf_{m→∞} P(|X_m − X| ≤ 1/k) ≤ lim sup_{m→∞} P(|X_m − X| ≤ 1/k) ≤ 1.

Consequently,

lim_{m→∞} P(|X_m − X| > 1/k) = 0.

(iii)⇒(i): Y := sup_{n∈N} |X_n| ∈ L^p and lim_{n→∞} X_n = X P-a.s. imply |X| ≤ Y.
In particular, |X_n − X|^p ≤ 2^p Y^p ∈ L¹.
lim_{n→∞} |X_n − X|^p = 0 P-a.s. together with Lebesgue's dominated convergence
now implies

lim_{n→∞} E[|X_n − X|^p] = 0.

(ii)⇒(iii): For each k ∈ N there exists n_k ∈ N with

P(|X_{n_k} − X| > 1/k) < (1/2)^k.

(We may and do choose n_k such that n_k is strictly increasing in k.) Then
for ε > 0,

Σ_{k≥1} P(|X_{n_k} − X| > ε)
= Σ_{k: 1/k ≥ ε} P(|X_{n_k} − X| > ε) + Σ_{k: 1/k < ε} P(|X_{n_k} − X| > ε)
≤ const. + Σ_{k: 1/k < ε} P(|X_{n_k} − X| > 1/k) ≤ const. + 1 < ∞.

Lemma 7.7 ("fast convergence in probability") now implies lim_{k→∞} X_{n_k} =
X P-a.s.
Remark 8.3. The diagram can be complemented as follows:

• (ii)⇒(i) holds if sup_{n∈N} |X_n| ∈ L^p (resp. (|X_n|^p)_{n∈N} uniformly integrable) (see
Proposition 8.5 and Remark 8.8 below).

• In general (i)⇏(iii) and (iii)⇏(i) (hence (ii)⇏(i) too). For examples, see
Exercises.

Definition 8.4. Let I be an index set. A family (X_i)_{i∈I} ⊂ L¹ of r.v. is called
uniformly integrable if

lim_{c→∞} sup_{i∈I} ∫_{{|X_i|>c}} |X_i| dP = 0.

Note that by Lebesgue's theorem ∫_{{|X_i|>c}} |X_i| dP = E[1_{{|X_i|>c}} · |X_i|] →
0 for any fixed i ∈ I as c → ∞, but that uniform integrability requires this
convergence to be uniform in i ∈ I.

41
The next proposition is the definitive version of Lebesgue's theorem on dominated
convergence.

Proposition 8.5 (Vitali convergence theorem). Let X_n ∈ L¹, n ≥ 1, and X
be a r.v. Then the following statements are equivalent:

(i) lim_{n→∞} X_n = X in L¹.

(ii) lim_{n→∞} X_n = X in probability and (X_n)_{n∈N} is uniformly integrable.

Corollary 8.6. lim_{n→∞} X_n = X P-a.s. and (X_n)_{n∈N} uniformly integrable
implies

lim_{n→∞} E[|X_n − X|] = 0, hence lim_{n→∞} E[X_n] = E[X].

Lemma 8.7 (ε-δ criterion). Let (X_i)_{i∈I} ⊂ L¹. Then the following statements
are equivalent:

(i) (X_i)_{i∈I} is uniformly integrable.

(ii) sup_{i∈I} E[|X_i|] < ∞ and ∀ ε > 0 ∃ δ > 0 such that

A ∈ A and P(A) < δ ⟹ ∫_A |X_i| dP < ε ∀ i ∈ I.

Proof. (i)⇒(ii): There exists c > 0 such that sup_{i∈I} ∫_{{|X_i|>c}} |X_i| dP ≤ 1. Consequently,

sup_{i∈I} ∫ |X_i| dP = sup_{i∈I} (∫_{{|X_i|≤c}} |X_i| dP + ∫_{{|X_i|>c}} |X_i| dP) ≤ c + 1 < ∞.

Let ε > 0. Then there exists c > 0 such that

sup_{i∈I} ∫_{{|X_i|>c}} |X_i| dP < ε/2.

For δ := ε/(2c) and A ∈ A with P(A) < δ we now conclude

∫_A |X_i| dP = ∫_{A∩{|X_i|≤c}} |X_i| dP + ∫_{A∩{|X_i|>c}} |X_i| dP
≤ c ∫_A dP + ∫_{{|X_i|>c}} |X_i| dP < c · P(A) + ε/2 < ε.

(ii)⇒(i): Let ε > 0 and δ be as in (ii). Using Markov's inequality (and the two
properties in (ii)), we get for any i ∈ I

P(|X_i| > c) ≤ (1/c) · E[|X_i|] < δ, if c > (sup_{i∈I} E[|X_i|] + 1)/δ,

hence ∫_{{|X_i|>c}} |X_i| dP < ε ∀ i ∈ I.

Remark 8.8. (i) Existence of a dominating integrable r.v. implies uniform integrability:
if |X_i| ≤ Y ∈ L¹ ∀ i ∈ I, then

∫_{{|X_i|>c}} |X_i| dP ≤ ∫_{{Y>c}} Y dP = E[1_{{Y>c}} · Y] →(DCT) 0 as c ↗ ∞,

since 1_{{Y>c}} · Y → 0 P-a.s. as c → ∞ (Markov's inequality).
In particular, I finite ⇒ (X_i)_{i∈I} ⊂ L¹ uniformly integrable.

(ii) Let (X_i)_{i∈I}, (Y_i)_{i∈I} be uniformly integrable, α, β ∈ R. Then
(αX_i + βY_i)_{i∈I} is uniformly integrable (see Exercises).

Proof of Proposition 8.5. (i)⇒(ii): see Exercises. (Hint: Use Lemma 8.7.)

(ii)⇒(i): a) X ∈ L¹, because there exists a subsequence (n_k) such that
lim_{k→∞} X_{n_k} = X P-a.s., so that

E[|X|] = E[lim inf_{k→∞} |X_{n_k}|] ≤(Fatou) lim inf_{k→∞} E[|X_{n_k}|] ≤ sup_{n∈N} E[|X_n|] < ∞.

b) W.l.o.g. X = 0 (because (X_n)_{n∈N} uniformly integrable implies
(X_n − X)_{n∈N} uniformly integrable too by Remark 8.8(ii), and the
following argument does not change when applied to (X_n − X)_{n∈N}
instead of (X_n)_{n∈N}).

Let ε > 0. Then there exists δ > 0 such that for all A ∈ A with
P(A) < δ it follows that ∫_A |X_n| dP < ε/2 for any n ∈ N.
Since X_n → 0 in probability, there exists n_0 ∈ N such that
P(|X_n| ≥ ε/2) < δ ∀ n ≥ n_0. Hence, for n ≥ n_0

E[|X_n|] = ∫_{{|X_n|<ε/2}} |X_n| dP + ∫_{{|X_n|≥ε/2}} |X_n| dP < ε/2 + ε/2 = ε,

and thus lim_{n→∞} E[|X_n|] = 0.

Corollary 8.9. lim_{n→∞} X_n = X in probability and (|X_n|^p)_{n∈N} uniformly
integrable, p > 0, imply

lim_{n→∞} X_n = X in L^p.

Proof. lim_{n→∞} |X_n − X|^p = 0 in probability, and since

|X_n − X|^p ≤ 2^p (|X_n|^p + |X|^p),

where the right-hand side is uniformly integrable, (|X_n − X|^p)_{n∈N} is uniformly
integrable too. Proposition 8.5 implies

lim_{n→∞} E[|X_n − X|^p] = 0.

Proposition 8.10. Let g : [0, ∞) → [0, ∞) be measurable with lim_{x→∞} g(x)/x = ∞. Then

sup_{i∈I} E[g(|X_i|)] < ∞ ⇒ (X_i)_{i∈I} uniformly integrable.

Proof. Let ε > 0. Choose c > 0 such that g(x)/x ≥ (1/ε)(sup_{i∈I} E[g(|X_i|)] + 1)
for x ≥ c. Then for all i ∈ I

∫_{{|X_i|≥c}} |X_i| dP = ∫_{{|X_i|≥c}} g(|X_i|) · (|X_i|/g(|X_i|)) dP
≤ (ε/(sup_{j∈I} E[g(|X_j|)] + 1)) · ∫_{{|X_i|≥c}} g(|X_i|) dP ≤ ε.

44
Example 8.11. (i) p > 1, sup_i E[|X_i|^p] < ∞ ⇒ (X_i)_{i∈I} uniformly integrable.

(ii) ("finite entropy condition")

sup_{i∈I} E[|X_i| · log^+ |X_i|] < ∞ ⇒ (X_i)_{i∈I} uniformly integrable.     (1.8)

Example 8.12. Application to the strong law of large numbers. Let
X_1, X_2, ... be r.v. in L¹(Ω, A, P) with E[X_i] = m for all i ∈ N. Suppose that

S_n/n = (1/n) Σ_{i=1}^n X_i → m P-a.s. as n → ∞.     (1.9)

Question: Which additional condition implies L¹-convergence in (1.9)?

(We have seen: if sup_{i∈N} E[X_i²] < ∞ and the (X_i)_{i∈N} are uncorrelated, then (1.9)
holds (cf. Proposition 7.8). In this case even L²-convergence holds in (1.9)
by Proposition 7.2 ("L²-case"). Later we will see that (1.9) always holds if
the X_i ∈ L¹ are pairwise independent and identically distributed.)

Answer to the question:

sup_{i∈N} E[|X_i| · log^+ |X_i|] < ∞ implies lim_{n→∞} S_n/n = m in L¹.

Proof: g(x) := x · log^+(x) is monotone increasing and convex. Consequently,
by monotonicity and convexity,

E[g(|S_n|/n)] ≤ E[g((1/n) Σ_{i=1}^n |X_i|)] ≤ (1/n) Σ_{i=1}^n E[g(|X_i|)] ≤ sup_{1≤i≤n} E[g(|X_i|)] ∀ n,

and so

sup_{n∈N} E[g(|S_n|/n)] ≤ sup_{i∈N} E[g(|X_i|)] < ∞.

Consequently, (S_n/n)_{n∈N} is uniformly integrable and (1.9) holds. Thus by
Proposition 8.5

lim_{n→∞} S_n/n = m in L¹.
45
One complementary remark concerning Lebesgue's dominated convergence
theorem.

Proposition 8.13. Let X_n ≥ 0, lim_{n→∞} X_n = X P-a.s. (or in probability). Then

lim_{n→∞} X_n = X in L¹ ⇔ lim_{n→∞} E[X_n] = E[X] and E[X] < ∞.

Proof. "⇒": Obvious.

"⇐": Write

X + X_n = X ∨ X_n + X ∧ X_n,   where X ∨ X_n := sup{X, X_n}, X ∧ X_n := inf{X, X_n}.

Then, by Lebesgue,

lim_{n→∞} E[X ∧ X_n] = E[X],

and thus

lim_{n→∞} E[X ∨ X_n] = E[X].

Now |X_n − X| = (X ∨ X_n) − (X ∧ X_n) implies

lim_{n→∞} E[|X_n − X|] = E[X] − E[X] = 0.

L^p-completeness

Proposition 8.14 (L^p-completeness, Riesz-Fischer). Let 1 ≤ p < ∞ and
X_n ∈ L^p with

lim_{n,m→∞} ∫ |X_n − X_m|^p dP = 0.

Then there exists a r.v. X ∈ L^p such that

(i) lim_{k→∞} X_{n_k} = X P-a.s. along some subsequence,

(ii) lim_{n→∞} X_n = X in L^p.

Proof. See textbooks on measure theory.

46
9 Distribution of random variables
Let (Ω, A, P) be a probability space, and X : Ω → R̄ be a r.v.
Let µ be the distribution of X (under P), i.e., µ(A) = P(X ∈ A) for all
A ∈ B(R̄).
Assume that P(X ∈ R) = 1 (in particular, X is P-a.s. finite, and µ is a
probability measure on (R, B(R))).

Definition 9.1. The function F : R → [0, 1], defined by

F(b) := P(X ≤ b) = µ((−∞, b]),   b ∈ R,     (1.10)

is called the distribution function of X resp. µ.

Proposition 9.2. (i) F is
monotone increasing: a ≤ b ⇒ F(a) ≤ F(b),
right continuous: F(a) = lim_{b↘a} F(b),
normalized: lim_{a↘−∞} F(a) = 0, lim_{b↗+∞} F(b) = 1.

(ii) To any function F as in (i) there exists a unique probability measure µ on
(R, B(R)) with (1.10), and there exist P and X as in (1.10).

Proof. (i) Monotonicity is obvious.
Right continuity: if b ↘ a, then (−∞, b] ↘ (−∞, a], hence by continuity
of µ from above (cf. Proposition 1.9):

F(a) = µ((−∞, a]) =(1.9) lim_{b↘a} µ((−∞, b]) = lim_{b↘a} F(b).

Similarly, (−∞, a] ↘ ∅ if a ↘ −∞ (resp. (−∞, b] ↗ R if b ↗ ∞), and
thus

lim_{a↘−∞} F(a) = lim_{a↘−∞} µ((−∞, a]) = 0   (resp. lim_{b↗∞} F(b) = lim_{b↗∞} µ((−∞, b]) = 1).

(ii) Existence: Let λ be the Lebesgue measure on (0, 1). Define the "inverse
function" G of F : R → [0, 1] by

G : (0, 1) → R,   G(y) := inf{x ∈ R | F(x) ≥ y}.

Since 0 < y < F(x) ⇒ G(y) ≤ x, we have

(0, F(x)) ⊂ {G ≤ x}.

By definition, G(y) ≤ x ⇒ ∃ x_n ↘ x with F(x_n) ≥ y, hence by
right-continuity F(x) ≥ y, so that

{G ≤ x} ⊂ (0, F(x)].

Combining both inclusions we obtain that

(0, F(x)) ⊂ {G ≤ x} ⊂ (0, F(x)],

so that G is measurable.
Let µ := G(λ) = λ ∘ G^{-1} (a probability measure on (R, B(R))). Then

µ((−∞, x]) = λ({G ≤ x}) = λ((0, F(x)]) = F(x)   ∀ x ∈ R.

Uniqueness: later.
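The construction in (ii) is exactly the inverse transform sampling method used in practice: feeding uniform (0, 1) values into G = "F^{-1}" produces samples with distribution function F. A minimal sketch for the exponential distribution of Example 9.7(ii) below (illustrative; the parameter α and the sample size are arbitrary choices):

```python
import math
import random

random.seed(8)
alpha = 2.0                        # exponential parameter
# F(x) = 1 - exp(-alpha x) for x >= 0, hence G(y) = -log(1 - y)/alpha
samples = [-math.log(1.0 - random.random()) / alpha for _ in range(100_000)]

print(sum(samples) / len(samples))                       # ~ 1/alpha = 0.5 (mean)
x = 1.0
print(sum(1 for s in samples if s <= x) / len(samples))  # ~ F(1) = 1 - e^{-2} ~ 0.865
```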

Remark 9.3. (i) Let µ be an arbitrary probability measure on (R, B(R))
with distribution function F. Then there exists a probability space (Ω, A, P)
and a random variable X : Ω → R such that µ = P ∘ X^{-1} (i.e.
µ can be "simulated"). Indeed, we just have to choose (Ω, A, P) =
((0, 1), B((0, 1)), λ) and X = G = "F^{-1}" (λ, G as in the proof of Proposition 9.2(ii)).

(ii) Some authors define the distribution function F by F(x) := µ((−∞, x)).
In this case F is left continuous, not right continuous.

Remark 9.4. (i) Let F be the distribution function of µ and x ∈ R. Then

F(x) − F(x−) = lim_{n↗∞} µ((x − 1/n, x]) = µ({x})

is called the step height of F in x. In particular:

F continuous ⇔ ∀ x ∈ R : µ({x}) = 0   ("µ is continuous").

(ii) Since F as in (i) is monotone increasing and bounded, F has at most
countably many points of discontinuity.
48
Definition 9.5. (i) F (resp. µ) is called discrete, if there exists a countable
set S ⊂ R with µ(S) = 1. In this case, µ is uniquely determined by the
weights µ({x}), x ∈ S, and F is a step function of the following type:

F(x) = Σ_{y∈S, y≤x} µ({y}).

(ii) F (resp. µ) is called absolutely continuous, if there exists a measurable
function f ≥ 0, called the "density", such that

F(x) = ∫_{−∞}^x f(t) dt,     (1.11)

(resp., for all A ∈ B(R):

µ(A) = ∫_A f(t) dt = ∫_{−∞}^∞ 1_A · f dt).     (1.12)

In particular ∫_{−∞}^{+∞} f(t) dt = 1.

Remark 9.6. (i) Every measurable function f ≥ 0 with ∫_{−∞}^{+∞} f(t) dt = 1
defines a probability measure on (R, B(R)) by A ↦ ∫_A f(t) dt.

(ii) In the previous definition "(1.11)⇒(1.12)", because A ↦ µ(A) = ∫_A f(t) dt
defines a probability measure on (R, B(R)) with distribution function F.
Thus from the uniqueness in 9.2(ii), we know that (1.12) must hold if the
distribution function F satisfies (1.11).
Conversely, if µ satisfies (1.12), then clearly its distribution function satisfies (1.11).

Example 9.7. (i) Uniform distribution on [a, b]. Let f := (1/(b − a)) · 1_{[a,b]}.
The associated absolutely continuous distribution function is given by

F(x) = ∫_{−∞}^x f(t) dt = 0 if x ≤ a,   (x − a)/(b − a) if x ∈ [a, b],   1 if x ≥ b

(continuous analogue of the discrete uniform distribution on a finite set).

(ii) (Continuous) exponential distribution with parameter α > 0:

f(x) := α e^{−αx} if x ≥ 0, and 0 if x < 0,
F(x) = ∫_{−∞}^x f(t) dt = 1 − e^{−αx} if x ≥ 0, and 0 if x < 0

(continuous analogue of the geometric distribution on N ∪ {0}:

∫_k^{k+1} f(x) dx = F(k + 1) − F(k) = e^{−αk}(1 − e^{−α}) = (1 − p)^k p = P(X = k) with p := 1 − e^{−α}).

(iii) Normal distribution N(m, σ²), m ∈ R, σ² > 0:

f_{m,σ²}(x) = (1/√(2πσ²)) · e^{−(x−m)²/(2σ²)}.

The associated distribution function is given by

F_{m,σ²}(x) = (1/√(2πσ²)) ∫_{−∞}^x e^{−(y−m)²/(2σ²)} dy
= (substituting z = (y − m)/σ) = (1/√(2π)) ∫_{−∞}^{(x−m)/σ} e^{−z²/2} dz = F_{0,1}((x − m)/σ).

Φ := F_{0,1} is called the distribution function of the standard normal distribution N(0, 1).
The expectation E[X] (or more generally E[h(X)]) can be calculated with the
help of the distribution µ of X:

Proposition 9.8. Let h ≥ 0 be measurable. Then

E[h(X)] = ∫_{−∞}^{+∞} h(x) µ(dx)
= ∫_{−∞}^{+∞} h(x) · f(x) dx   if µ is absolutely continuous with density f,
= Σ_{x∈S} h(x) · µ({x})   if µ is discrete, i.e. µ(S) = 1 and S countable.
50
Proof. See Assignments.

Example 9.9. Let X be N(m, σ²)-distributed. Then

E[X] = ∫ x · f_{m,σ²}(x) dx = m + ∫ (x − m) · f_{m,σ²}(x) dx = m,

since the last integral vanishes by symmetry. The p-th central moment of X is given by

E[|X − m|^p] = ∫ |x − m|^p · f_{m,σ²}(x) dx = ∫ |x|^p · f_{0,σ²}(x) dx
= 2 ∫_0^∞ x^p · (1/√(2πσ²)) · e^{−x²/(2σ²)} dx
= (1/√π) · 2^{p/2} · σ^p ∫_0^∞ y^{(p+1)/2 − 1} e^{−y} dy   (substituting y = x²/(2σ²))
= (1/√π) · 2^{p/2} · σ^p · Γ((p+1)/2).

In particular:

p = 1: E[|X − m|] = σ · √(2/π),
p = 2: E[|X − m|²] = σ²,
p = 3: E[|X − m|³] = 2^{3/2} · σ³/√π,
p = 4: E[|X − m|⁴] = 3σ⁴.

10 Weak convergence of probability measures

Let S be a metric space, S = B(S) be the Borel σ-algebra on S, and µ, µ_n,
n ∈ N, be probability measures on (S, S).
What is a reasonable notion of convergence of the sequence µ_n to µ? The
notion of "pointwise convergence" (= strong convergence) in the sense that
µ_n(A) → µ(A) as n → ∞ for all A ∈ S is too strong for many applications.

Definition 10.1. Let µ and µ_n, n ∈ N, be probability measures on (S, S).
The sequence (µ_n) converges to µ weakly if for all f ∈ C_b(S) (= the space of
bounded continuous functions on S) it follows that

∫ f dµ_n → ∫ f dµ as n → ∞.

By Example 11.10 below, weak limits are unique.


Example 10.2. (i) x_n → x in S implies δ_{x_n} → δ_x weakly as n → ∞.

(ii) Let S := R¹ and µ_n := N(0, 1/n). Then µ_n → δ_0 weakly, since for all
f ∈ C_b(R¹)

∫ f dµ_n = ∫ f(x) · (1/√(2π/n)) · e^{−x²/(2/n)} dx
= (substituting x = y/√n) = ∫ f(y/√n) · (1/√(2π)) · e^{−y²/2} dy
→ f(0) = ∫ f dδ_0 as n → ∞,

by Lebesgue's dominated convergence theorem.

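A quick numerical illustration of (ii) (a sketch only; the test function, the values of n, the seed and the Monte Carlo sample size are arbitrary choices): estimating ∫ f dµ_n for µ_n = N(0, 1/n) shows the integrals approaching f(0).

```python
import math
import random

random.seed(9)
f = lambda x: math.cos(x)          # a bounded continuous test function, f(0) = 1

for n in (1, 10, 100, 1000):
    sd = 1.0 / math.sqrt(n)
    est = sum(f(random.gauss(0.0, sd)) for _ in range(50_000)) / 50_000
    print(f"n = {n:>4}: integral of f d mu_n ~ {est:.4f}")   # tends to f(0) = 1
```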
Proposition 10.3 (Portmanteau theorem). Let S be a metric space with
metric d. Then the following statements are equivalent:

(i) µ_n → µ weakly.

(ii) ∫ f dµ_n → ∫ f dµ as n → ∞ for all f bounded and uniformly continuous (w.r.t. d).

(iii) lim sup_{n→∞} µ_n(F) ≤ µ(F) for all F ⊂ S closed.

(iv) lim inf_{n→∞} µ_n(G) ≥ µ(G) for all G ⊂ S open.

(v) lim_{n→∞} µ_n(A) = µ(A) for all µ-continuity sets A, i.e. for all A ∈ S with
µ(∂A) = 0.

(vi) ∫ f dµ_n → ∫ f dµ as n → ∞ for all f bounded, measurable and µ-a.s. continuous.

Proof. (i)⇒(ii), (iii)⇔(iv), (vi)⇒(i) are obvious.

52
(ii)⇒(iii): Let F ⊂ S be closed and define d(x, F) := inf_{y∈F} d(x, y), x ∈ S.
The sets

G_m := {x ∈ S | d(x, F) < 1/m},   m ∈ N,   are open,

and G_m ↘ ⋂_m G_m = F, hence µ(G_m) ↘ µ(F). In particular: ∀ ε > 0
there exists some m_0 = m_0(ε) ∈ N with

µ(G_{m_0}) < µ(F) + ε.

Define

φ(x) := 1 if x ≤ 0,   1 − x if x ∈ [0, 1],   0 if x ≥ 1,

and let f_{m_0} := φ(m_0 · d(·, F)).

f_{m_0} is Lipschitz continuous, hence uniformly continuous. Moreover, 0 ≤
f_{m_0} ≤ 1, f_{m_0} = 0 on G_{m_0}^c and f_{m_0} = 1 on F, and thus

lim sup_{n→∞} µ_n(F) ≤ lim sup_{n→∞} ∫ f_{m_0} dµ_n =(ii) ∫ f_{m_0} dµ ≤ µ(G_{m_0}) < µ(F) + ε.

(Idea: Construct uniformly continuous functions f_m = φ(m · d(·, F)),
m ∈ N, such that

1_F ≤ f_m ≤ 1_{G_m} ↘ 1_F as m ↗ ∞.)

(iii)⇒(v): For a subset A ⊂ S we denote the closure by Ā, the interior by Å,
and the boundary by ∂A. Let A be such that µ(Ā \ Å) = µ(∂A) = 0.
Then

µ(A) = µ(Å) ≤(iv) lim inf_{n→∞} µ_n(Å) ≤ lim inf_{n→∞} µ_n(A) ≤ lim sup_{n→∞} µ_n(A)
≤(iii) lim sup_{n→∞} µ_n(Ā) ≤ µ(Ā) = µ(A).

(v)⇒(vi): Let f be as in (vi). The distribution function F(x) = µ({f ≤ x})
has at most countably many jumps. Thus D := { x ∈ R | µ({f = x}) ≠ 0 } is at
most countable, and so R \ D ⊂ R is dense. By denseness and since f is bounded:
for any ε > 0 we find c_0 < · · · < c_m ∈ R \ D with

    |c_{k+1} - c_k| ≤ ε  for all k = 0, ..., m-1    and    c_0 ≤ f < c_m.

Let Ak := {f ∈ [ck , ck+1 )}, k = 0, ..., m − 1. Then Ak is a µ-continuity


set, because
• ∂Ak ⊂ {f = ck } ∪ {f = ck+1 } ∪ Df , where

Df := {x ∈ R | f is not continuous at x}

and µ({f = ck } ∪ {f = ck+1 } ∪ Df ) = 0 since ci 6∈ D and Df has


zero µ-measure by assumption. (Note: Df is measurable !)

Proof. Let ω ∈ (Ω \ D_f) ∩ ∂f^{-1}([c_k, c_{k+1})). Then

    ∀ε > 0 ∃δ > 0 such that f(B(ω, δ)) ⊂ B(f(ω), ε)                               (1.13)

and

    ∀δ̃ > 0 :  B(ω, δ̃) ∩ f^{-1}([c_k, c_{k+1})) ≠ ∅   and
               B(ω, δ̃) ∩ f^{-1}(R \ [c_k, c_{k+1})) ≠ ∅.                          (1.14)

Choosing δ̃ = δ in (1.14) and applying f, we get

    f(B(ω, δ) ∩ f^{-1}([c_k, c_{k+1}))) ≠ ∅ ≠ f(B(ω, δ) ∩ f^{-1}(R \ [c_k, c_{k+1}))),

thus by (1.13)

    B(f(ω), ε) ∩ [c_k, c_{k+1}) ≠ ∅ ≠ B(f(ω), ε) ∩ (R \ [c_k, c_{k+1})).

Since ε > 0 is arbitrary, we get f(ω) ∈ ∂[c_k, c_{k+1}) = {c_k, c_{k+1}}, and so

    ∂A_k = ( (Ω \ D_f) ∩ ∂f^{-1}([c_k, c_{k+1})) )  ∪  ( D_f ∩ ∂f^{-1}([c_k, c_{k+1})) ),

where the first set is contained in {f = c_k} ∪ {f = c_{k+1}} and the second in D_f.

Let g := Σ_{k=0}^{m-1} c_k · 1_{A_k}. Then ||f - g||_∞ ≤ ε and

    | ∫ f dµ - ∫ f dµ_n |
      ≤ ∫ |f - g| dµ + | ∫ g dµ - ∫ g dµ_n | + ∫ |g - f| dµ_n
      ≤ 2ε + Σ_{k=0}^{m-1} |c_k| · | µ(A_k) - µ_n(A_k) |  →  2ε      (n → ∞, by (v)).

Corollary 10.4. Let X, Xn , n ∈ N, be measurable mappings from (Ω, A, P)


to (S, S) with distributions µ, µn , n ∈ N. Then:
n→∞ n→∞
Xn −−−→ X in probability ⇒ µn −−−→ µ weakly

Here, limn→∞ Xn = X in probability, if limn→∞ P(d(X, Xn ) > δ) = 0 for all


δ > 0.

Proof. Let f ∈ Cb (S) be uniformly continuous and ε > 0. Then there exists a
δ = δ(ε) > 0 such that:
x, y ∈ S with d(x, y) 6 δ implies |f (x) − f (y)| ≤ ε
Hence
    | ∫ f dµ - ∫ f dµ_n | = | E[f(X)] - E[f(X_n)] |
      ≤ ∫_{{d(X,X_n) ≤ δ}} |f(X) - f(X_n)| dP + ∫_{{d(X,X_n) > δ}} |f(X) - f(X_n)| dP
      ≤ ε + 2||f||_∞ · P(d(X_n, X) > δ),

and the last term tends to 0 as n → ∞.

Corollary 10.5. Let S = R^1 and let µ, µ_n, n ∈ N, be probability measures on
(R, B(R)) with distribution functions F, F_n. Then the following statements
are equivalent:

(i) µ_n → µ vaguely (n → ∞), i.e. lim_{n→∞} ∫ f dµ_n = ∫ f dµ for all f ∈ C_0(R)
    (= the space of continuous functions with compact support)

n→∞
(ii) µn −−−→ µ weakly

n→∞
(iii) Fn (x) −−−→ F (x) for all x where F is continuous.

n→∞
(iv) µn (a, b] −−−→ µ (a, b] for all (a, b] with µ({a}) = µ({b}) = 0.
 

Proof. (i)⇒(ii): Exercise.

(ii)⇒(iii): Let x be such that F is continuous in x. Then µ({x}) = 0, which


implies by the Portmanteau theorem:

 n→∞ 
Fn (x) = µn (−∞, x] −−−→ µ (−∞, x] = F (x).

(iii)⇒(iv): Let (a, b] be such that µ({a}) = µ({b}) = 0 then F is continuous


in a and b and thus

 (iii)
µ (a, b] = F (b) − F (a) = lim Fn (b) − lim Fn (a)
n→∞ n→∞

= lim µn (a, b] .
n→∞

(iv)⇒(i): Let D := x ∈ R µ({x}) 6= 0 . Then D is at most countable,



hence R \ D ⊂ R dense. Let f ∈ C0 (R), then f is uniformly continuous.
Hence for any ε > 0 we find δ = δ(ε) > 0 such that

x, y ∈ R and |x − y| ≤ δ =⇒ |f (x) − f (y)| ≤ ε

and we can find c0 < · · · < cm ∈ R \ D such that supp(f ) ⊂ (c0 , cm ]


and |c_{k+1} - c_k| ≤ δ for k = 0, ..., m-1. Consequently

    || f - Σ_{k=1}^{m} f(c_{k-1}) · 1_{(c_{k-1},c_k]} ||_∞ ≤ sup_{1≤k≤m} sup_{x∈[c_{k-1},c_k]} |f(x) - f(c_{k-1})| ≤ ε,

so with g := Σ_{k=1}^{m} f(c_{k-1}) · 1_{(c_{k-1},c_k]} we get

    | ∫ f dµ - ∫ f dµ_n |
      ≤ ∫ |f - g| dµ + | ∫ g dµ - ∫ g dµ_n | + ∫ |f - g| dµ_n
      ≤ 2ε + Σ_{k=1}^{m} |f(c_{k-1})| · | µ((c_{k-1}, c_k]) - µ_n((c_{k-1}, c_k]) |  →  2ε      (n → ∞, by (iv)).

11 Dynkin-systems and Uniqueness of


probability measures
Let Ω 6= ∅.

Definition 11.1. A collection of subsets D ⊂ P(Ω) is called a Dynkin-system,


if:

(i) Ω ∈ D.

(ii) A ∈ D ⇒ Ac ∈ D.

(iii) Ai ∈ D, i ∈ N, pairwise disjoint, then


[
Ai ∈ D.
i∈N

Example 11.2. (i) Every σ-Algebra A ⊂ P(Ω) is a Dynkin-system

(ii) Let P1 , P2 be probability measures on (Ω, A). Then

D := A ∈ A P1 (A) = P2 (A)


is a Dynkin-system.

Remark 11.3. (i) Let D be a Dynkin-system. Then

A, B ∈ D , A ⊂ B ⇒ B \ A = (B c ∪ A)c ∈ D

(ii) Every Dynkin-system which is closed under finite intersections (short no-
tation: ∩-stable), is a σ-algebra, because:
(a) A, B ∈ D  ⇒  A ∪ B = A ∪ ( B \ (A ∩ B) ) ∈ D,
    since A ∩ B ∈ D by assumption (∩-stability) and B \ (A ∩ B) ∈ D by (i).

(b) A_i ∈ D, i ∈ N
    ⇒  ∪_{i∈N} A_i = ∪_{i∈N} ( A_i \ ∪_{n=1}^{i-1} A_n ) = ∪_{i∈N} ( A_i ∩ (∪_{n=1}^{i-1} A_n)^c ) ∈ D,
    since these sets are pairwise disjoint and lie in D (∪_{n=1}^{i-1} A_n ∈ D by (a),
    hence its complement is in D, and the intersection with A_i is in D by assumption).

Proposition 11.4. Let B ⊂ P(Ω) be a ∩-stable collection of subsets (i.e.


A, B ∈ B ⇒ A ∩ B ∈ B). Then

σ(B) = D(B) ,

where
\
D(B) := D
D Dynkin-system
B⊂D

is called the Dynkin-system generated by B.


Proof. Since σ(B) is a Dynkin system that contains B, we obtain σ(B) ⊃
D(B).
In order to show σ(B) ⊂ D(B), it is enough to show that D(B) is a σ-algebra.
By Remark 11.3(ii) it is enough to show that D(B) is ∩-stable. For arbitrary
D ∈ D(B) set
DD := {Q ∈ P(Ω) | Q ∩ D ∈ D(B)}.
Then DD is a Dynkin system. In particular, if D(B) ⊂ DD , then D(B) is
∩-stable and we are done. Now

E∈B =⇒ B ⊂ DE =⇒ D(B) ⊂ DE
B is ∩-stable

=⇒ E ∩ D = D ∩ E ∈ D(B) =⇒ E ∈ DD .
Thus B ⊂ DD , and so D(B) ⊂ DD , which concludes the proof.

Proposition 11.5 (Uniqueness of probability measures). Let P1 , P2 be prob-


ability measures on (Ω, A), and B ⊂ A be a ∩-stable collection of subsets.
Then:

P1 (A) = P2 (A) for all A ∈ B ⇒ P1 = P2 on σ(B).

Proof. The collection of subsets

D := A ∈ A P1 (A) = P2 (A)


is a Dynkin-system containing B. Consequently,


11.4
σ(B) = D(B) ⊂ D.

Example 11.6. (i) For p ∈ [0, 1] the probability measure Pp on (Ω :=


{0, 1}N , A) is uniquely determined by
    P_p(X_1 = x_1, . . . , X_n = x_n) = p^k (1-p)^{n-k},   with k := Σ_{i=1}^{n} x_i,

for all x1 , . . . , xn ∈ {0, 1}, n ∈ N, because the collection of cylindrical


sets

∅, {X1 = x1 , . . . , Xn = xn }, n ∈ N, x1 , . . . , xn ∈ {0, 1}

is ∩-stable, generating A (cf. Example 1.7).


(Existence of P_p for p = 1/2: see Example 3.6. Existence for p ∈ (0, 1) \ {1/2}: later.)

(ii) A probability measure on (R, B(R)) is  uniquely determined through its


distribution function F (:= µ (−∞, · ] ), because

µ (a, b] = F (b) − F (a),

and the collection of intervals (a, b], a, b ∈ R, is ∩-stable, generating


B(R).

Definition 11.7. Let H be a vector space of real-valued bounded functions
on Ω. H is called a monotone vector space (MVS), if:

(i) 1 = 1Ω ∈ H (constants are in H)

(ii) Let fn ∈ H, n ∈ N with 0 ≤ f1 ≤ · · · ≤ fn % f , f bounded. Then


f ∈ H.

Lemma 11.8. A MVS H is closed under uniform convergence.


n→∞
Proof. Let fn ∈ H, n ∈ N fn −→ f uniformly on Ω, i.e.
n→∞
kf − fn k∞,Ω := sup |f (ω) − fn (ω)| −→ 0.
ω∈Ω

W.l.o.g. f ≥ 0 (otherwise consider f + kf k∞,Ω and fn + kfn k∞,Ω ). There is


a subsequence (fnk )k∈N , such that

    ε_k := ||f_{n_k} - f_{n_{k+1}}||_{∞,Ω},   k ≥ 1,

satisfies Σ_{k≥1} ε_k < ∞. Thus a_k := Σ_{n=k}^{∞} ε_n ↘ 0 as k → ∞. Put

    g_k := f_{n_k} + (-a_k + 2a_1) ∈ H      (f_{n_k} ∈ H, -a_k + 2a_1 constant).

Then

(1) g_{k+1} - g_k = f_{n_{k+1}} - f_{n_k} + (a_k - a_{k+1}) ≥ 0, since a_k - a_{k+1} = ε_k   (so g_k ↗)

(2) g_k ↗ f + 2a_1, and f + 2a_1 is bounded

(3) g_1 = f_{n_1} + a_1 = f_{n_1} + Σ_{k=1}^{∞} ||f_{n_k} - f_{n_{k+1}}||_{∞,Ω} ≥ f_{n_1} - Σ_{k=1}^{∞} (f_{n_k} - f_{n_{k+1}}) = f ≥ 0.

(1) − (3) ⇒ (gk )k≥1 fulfills assumptions of Definition 11.7 (ii).

⇒ lim gk = f + 2a1 ∈ H ⇒ f ∈ H.
k→∞

Notations: If M is a class of functions on a set Ω, then

σ(M) := “smallest σ-algebra for which all f ∈ M are measurable”

σ(M)b := “bounded σ(M)-measurable functions”

The next theorem plays the same role in measure theory and probability theory
as the Stone-Weierstrass Theorem in analysis.

Proposition 11.9. (Monotone class theorem (in multiplicative form)) Let


M be a class of bounded functions on Ω which is closed under multiplication
(a so-called multiplicative class), i.e f, g ∈ M ⇒ f · g ∈ M. Let H be a MVS
on Ω, and M ⊂ H. Then σ(M)b ⊂ H.

Proof. Let M0 = span(M, 1) = smallest subspace of H containing M and


1. Then M0 is an algebra (i.e. a linear space and a multiplicative class) and
σ(M0 ) = σ(M). Let M0 be the uniform closure of M0 w.r.t. k · k∞,Ω . Then
M0 is again an algebra, σ(M0 ) = σ(M0 ) = σ(M), and M0 ⊂ H by Lemma
11.8.

Claim: f ∈ M0 ⇒ f ∧ α ∈ M0 , ∀α ∈ R.

Proof: L := kf k∞,Ω . Weierstrass ⇒ ∀ε > 0 ∃Pε polynomial s.t. |x ∧


α − Pε (x)| < ε, ∀x ∈ [−L, L]. Since Pε (f ) ∈ M0 , we get f ∧ α ∈ M0 .
Every f ∈ σ(M0 )b is the uniform limit of σ(M0 )-elementary functions (f + , f − ∈
σ(M0 )b can be uniformly approximated by σ(M0 )-measurable elementary func-
tions by Proposition 4.4 (ii)). Since H is closed under uniform convergence, it
suffices hence to show that 1A ∈ H for any A ∈ σ(M0 ). In other words, we
have to show

σ(M0 ) = {A ∈ σ(M0 ) | 1A ∈ H} =: S.

“⊃”: Clear
“⊂”: S is a Dynkin system. Put

E := {A ∈ σ(M0 ) | ∃fn ∈ M0 , fn ≥ 0, n ≥ 1 : fn % 1A }.

E is closed under intersections ⇒ σ(E) = D(E). Since H is a MVS and


M0 ⊂ H we have: A ∈ E ⇒ 1A ∈ H. Thus E ⊂ S, and then

σ(E) = D(E) ⊂ S ⊂ σ(M0 ). (1.15)

For f ∈ M_0, α ∈ R, we have:

    0 ≤ f_n := n · (f - α)^+ ∧ 1 ↗ 1_{{f > α}}   (n → ∞),   f_n ∈ M_0   ⇒   {f > α} ∈ E,

and such sets generate σ(M_0), hence σ(M_0) ⊂ σ(E), which gives equalities in (1.15).

Example 11.10. Let µ1 , µ2 be probability measures on a metric space (S, d)


with Borel σ-algebra B(S). Suppose
Z Z
f dµ1 = f dµ2 ∀f ∈ Cb (S),
S S

where Cb (S) denotes the continuous and bounded functions on S. Then µ1 =


µ2 .

Proof. H := { f ∈ B(S)_b | ∫ f dµ_1 = ∫ f dµ_2 } is a MVS, and

    C_b(S) ⊂ H  ⇒  H ⊃ σ(C_b(S))_b       (by Theorem 11.9).

How big is σ(Cb (S)) ? Clearly σ(Cb (S)) ⊂ B(S) since every continuous func-
tion on S is measurable w.r.t. B(S). Let F ⊂ S be closed and d(x, F ) :=
inf y∈F d(x, y), x ∈ S. Then d(·, F ) is Lipschitz continuous and so f :=
d(·, F ) ∧ 1 ∈ Cb (S). Moreover

F = {f = 0} ∈ σ(Cb (S)).

Hence B(S) ⊂ σ(Cb (S)). In particular 1A ∈ σ(Cb (S))b for all A ∈ B(S),
hence µ1 = µ2 .

Example 11.11. Let µ be a probability measure on a metric space (S, d) with


Borel σ-algebra B(S). Then Cb (S) is dense in L1 (µ).

Proof. One shows that

    { f ∈ L^1(µ) | ∃ (f_n)_{n∈N} ⊂ C_b(S) with lim_{n→∞} ∫_S |f - f_n| dµ = 0 }

is a MVS and concludes as in Example 11.10.

2 Independence

1 Independent events
Let (Ω, A, P) be a probability space.

Definition 1.1. The events Ai ∈ A, i ∈ I, are said to be independent (w.r.t.


P ), if for any finite subset J ⊂ I
\  Y
P Aj = P(Aj ).
j∈J j∈J

A family of collections of subsets Bi ⊂ A, i ∈ I, is said to be independent,


if for all finite subsets J ⊂ I and for all subsets Aj ∈ Bj , j ∈ J
\  Y
P Aj = P(Aj ).
j∈J j∈J

Proposition 1.2. Let Bi , i ∈ I, be independent collections of subsets that are


closed under intersections. Then:

(i) σ(Bi ), i ∈ I, are independent.

(ii) Let Jk , k ∈ K, be a partition of the index set I. Then the σ-algebras


[ 
σ Bi , k ∈ K,
i∈Jk

are independent.

Proof. (i) Let J ⊂ I, J finite, be of the form J = {j1 , . . . , jn }. Let Aj1 ∈


σ(Bj1 ), . . . , Ajn ∈ σ(Bjn ).
We have to show that

P(Aj1 ∩ · · · ∩ Ajn ) = P(Aj1 ) · · · P(Ajn ). (2.1)

To this end suppose first that Aj2 ∈ Bj2 , . . . , Ajn ∈ Bjn , and define

Dj1 := A ∈ σ(Bj1 ) P(A ∩ Aj2 ∩ · · · ∩ Ajn )




= P(A) · P(Aj2 ) · · · P(Ajn ) .

Then Dj1 is a Dynkin system (!) containing Bj1 . Proposition 1.11.4 now
implies

σ(Bj1 ) = D(Bj1 ) ⊂ Dj1 ,

hence σ(Bj1 ) = Dj1 . Iterating the above argument for Dj2 , Dj3 , implies
(2.1).

(ii) For k ∈ K define


n\ o
Ck := Aj J ⊂ Jk , J finite, Aj ∈ Bj .
j∈J

Then Ck is closed under intersections and the Ck , k ∈ K, are independent,


because: given k1 , . . . , kn ∈ K and finite subsets J 1 ⊂ Jk1 , . . . , J n ⊂ Jkn ,
then
 \  \  Bind.
i ,i∈I n
Y \ 
P Ai ∩ · · · ∩ Ai = P Ai .
i∈J 1 i∈J n j=1 i∈J j
| {z } | {z }
∈Ck1 ∈Ckn
S 
(i) now implies that σ(Ck ), k ∈ K is independent. But Ck ⊂ σ i∈Jk Bi
S 
since Ck only contains finite intersections of events in σ i∈Jk Bi and
Ck ⊃ Bi . Hence
S
i∈Jk
[ 
σ(Ck ) = σ Bi , k ∈ K,
i∈Jk

which implies the assertion.

Example 1.3. Let Ai ∈ A, i ∈ I, be independent. Then σ({Ai }) =


{∅, Ai , Aci , Ω}, i ∈ I, is an independent family of σ-algebras, but the collection
of events {Ai , Aci | i ∈ I}, is in general not independent.

Remark 1.4. Pairwise independence does not imply independence in general.
Example: Consider two tosses of a fair coin, i.e.

    Ω := { (i, k) | i, k ∈ {0, 1} },    P := uniform distribution on Ω.

Consider the events

    A := "first toss shows 1" = { (1, 0), (1, 1) },
    B := "second toss shows 1" = { (0, 1), (1, 1) },
    C := "first and second toss agree" = { (0, 0), (1, 1) }.

Then P(A) = P(B) = P(C) = 1/2 and A, B, C are pairwise independent, since

    P(A ∩ B) = P(B ∩ C) = P(C ∩ A) = 1/4
             = P(A) · P(B) = P(A) · P(C) = P(B) · P(C).

But on the other hand

    P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A) · P(B) · P(C).
Example 1.5. Independent 0-1-experiments with success probability
p ∈ [0, 1]. Let Ω := {0, 1}N , Xi (ω) := xi and ω := (xi )i∈N . Let Pp be a
probability measure on A := σ {Xi = 1}, i = 1, 2, . . . , with

(i) Pp (Xi = 1) = p (hence Pp (Xi = 0) = Pp ({Xi = 1}c ) = 1 − p).

(ii) {Xi = 1}, i ∈ N, are independent w.r.t. Pp .

Then for any x_1, . . . , x_n ∈ {0, 1}:

    P_p(X_{i_1} = x_1, . . . , X_{i_n} = x_n) = ∏_{j=1}^{n} P_p(X_{i_j} = x_j) = p^k (1-p)^{n-k}

(the first equality by (ii) and Example 1.3, the second by (i)), where k := Σ_{i=1}^{n} x_i.
Hence P_p is uniquely determined by (i) and (ii).

Proposition 1.6 (Kolmogorov’s Zero-One Law). Let Bn , n ∈ N, be indepen-


dent σ-algebras, and
    B_∞ := ∩_{n=1}^{∞} σ( ∪_{m=n}^{∞} B_m )

be the tail σ-algebra (resp. σ-algebra of terminal events). Then

P(A) ∈ {0, 1} ∀ A ∈ B∞

i.e., P is deterministic on B∞ .

Illustration: Independent 0-1-experiments


Let B_i = σ({X_i = 1}). Then

    B_∞ = ∩_{n∈N} σ( ∪_{m≥n} B_m )

is the σ-algebra containing the events of the remote future, e.g.

    lim sup_{i→∞} {X_i = 1} = {"infinitely many 1's"}

or

    { ω ∈ {0, 1}^N | lim_{n→∞} (1/n) Σ_{i=1}^{n} X_i(ω) =: lim_{n→∞} S_n(ω)/n exists }.

Proof of the Zero-One Law. Proposition 1.2 implies that for all n ≥ 2

    B_1, B_2, . . . , B_{n-1}, σ( ∪_{m=n}^{∞} B_m )

are independent. Since B_∞ ⊂ σ( ∪_{m≥n} B_m ), this implies that for all n ≥ 2

B1 , B2 , . . . , Bn−1 , B∞

are independent. By definition this implies that

B∞ , Bn , n ∈ N are independent

and now Proposition 1.2(ii) implies that


    σ( ∪_{n∈N} B_n )   and   B_∞

are independent. Since B_∞ ⊂ σ( ∪_{n≥1} B_n ) we finally obtain that B_∞ and B_∞
are independent. The conclusion now follows from the next lemma.

Lemma 1.7. Let B ⊂ A be a σ-algebra such that B is independent from B.
Then

P(A) ∈ {0, 1} ∀A ∈ B.

Proof. For all A ∈ B

P(A) = P(A ∩ A) = P(A) · P(A) = P(A)2 .

Hence P(A) = 0 or P(A) = 1.

For any sequence An , n ∈ N, of independent events in A, Kolmogorov’s


Zero-One Law implies in particular for
    A_∞ := ∩_{n∈N} ∪_{m≥n} A_m =: lim sup_{n→∞} A_n

that P(A∞ ) ∈ {0, 1}.


Proof: The σ-algebras Bn := σ({An }) = {∅, Ω, An , Acn }, n ∈ N, are indepen-
dent by Proposition 1.2 and A∞ ∈ B∞ .

Lemma 1.8 (Borel-Cantelli). (i) Let A_i ∈ A, i ∈ N. Then

    Σ_{i=1}^{∞} P(A_i) < ∞   ⇒   P( lim sup_{i→∞} A_i ) = 0.

(ii) Assume that A_i ∈ A, i ∈ N, are independent. Then

    Σ_{i=1}^{∞} P(A_i) = ∞   ⇒   P( lim sup_{i→∞} A_i ) = 1.

Proof. (i) See Lemma 1.1.11.

(ii) It suffices to show that

         P( ∪_{m=n}^{∞} A_m ) = 1,   resp.   P( ∩_{m=n}^{∞} A_m^c ) = 0,   for all n.

     The last equality follows from the fact that

         P( ∩_{m=n}^{∞} A_m^c ) = lim_{k→∞} P( ∩_{m=n}^{n+k} A_m^c )
                                = lim_{k→∞} ∏_{m=n}^{n+k} P(A_m^c)            (by independence)
                                = lim_{k→∞} ∏_{m=n}^{n+k} (1 - P(A_m))
                                ≤ lim_{k→∞} exp( - Σ_{m=n}^{n+k} P(A_m) ) = 0,

     where we used the inequality 1 - α ≤ e^{-α} for all α ∈ [0, 1].

Example 1.9. Independent 0-1-experiments with success probability


p ∈ (0, 1). Let (x1 , . . . , xN ) ∈ {0, 1}N (“binary text of length N ”).

Pp (“text occurs”) = ?

To calculate this probability we partition the infinite sequence ω = (yn ) into


blocks of length N

    ω = (y_1, . . . , y_N, y_{N+1}, . . . , y_{2N}, . . .) ∈ Ω := {0, 1}^N,

where the first block is (y_1, . . . , y_N), the second block is (y_{N+1}, . . . , y_{2N}), etc.,
each of length N, and consider the events A_i = "text occurs in the i-th block". Clearly,
A_i, i ∈ N, are independent events (!) by Proposition 1.2(ii) with equal probability

    P_p(A_i) = p^K (1-p)^{N-K} =: α > 0,

where K := Σ_{i=1}^{N} x_i is the total number of ones in the text. In particular,
Σ_{i=1}^{∞} P_p(A_i) = Σ_{i=1}^{∞} α = ∞, and now Borel-Cantelli implies P_p(A_∞) = 1, where

    A_∞ = lim sup_{i→∞} A_i := "text occurs infinitely many times".

Moreover: since the indicator functions 1A1 , 1A2 , . . . are uncorrelated (since
A1 , A2 , . . . are independent) with uniformly bounded variances, the strong law
of large numbers implies that
    (1/n) Σ_{i=1}^{n} 1_{A_i} → E[1_{A_1}] = α      P_p-a.s.,

i.e. the relative frequency of the given text in blocks of the infinite sequence is
strictly positive.
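A small simulation sketch of this example (illustrative only; the pattern, p, the number of blocks and the seed are arbitrary choices): the relative frequency of blocks showing the given text is close to α = p^K (1 - p)^{N-K}.

    import random

    def block_frequency(pattern, p=0.3, n_blocks=200_000, seed=2):
        rng = random.Random(seed)
        N = len(pattern)
        hits = 0
        for _ in range(n_blocks):
            block = tuple(1 if rng.random() < p else 0 for _ in range(N))
            hits += (block == pattern)
        return hits / n_blocks

    pattern = (1, 0, 1, 1)
    K = sum(pattern)
    alpha = 0.3 ** K * 0.7 ** (len(pattern) - K)
    print(block_frequency(pattern), "vs alpha =", alpha)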

2 Independent random variables


Let (Ω, A, P) be a probability space.

Definition 2.1. A family Xi , i ∈ I, of r.v. on (Ω, A, P) is said to be indepen-


dent, if the σ-algebras
   
σ(Xi ) := Xi−1 B(R̄) = {Xi ∈ A} A ∈ B(R̄) , i ∈ I,

are independent, i.e. for all finite subsets J ⊂ I and any Borel subsets Aj ∈
B(R̄)
\  Y
P {Xj ∈ Aj } = P(Xj ∈ Aj ).
j∈J j∈J

Remark 2.2. Let Xi , i ∈ I, be independent and hi : R̄ → R̄, i ∈ I,


B(R̄)/B(R̄)-measurable. Then Yi := hi (Xi ), i ∈ I, are again independent,
because σ (Yi ) ⊂ σ (Xi ) for all i ∈ I.

Proposition 2.3. Let X1 , . . . , Xn be independent r.v., ≥ 0. Then

E[X1 · . . . · Xn ] = E[X1 ] · . . . · E[Xn ].

Proof. W.l.o.g. n = 2. (Proof of the general case by induction, using the fact
that X1 · . . . · Xn−1 and Xn are independent , since X1 · . . . · Xn−1 is
 measurable
w.r.t σ σ(X1 ) ∪ · · · ∪ σ(Xn−1 ) , and σ σ(X1 ) ∪ · · · ∪ σ(Xn−1 ) and σ(Xn )

are independent by Proposition 1.2.)

It therefore suffices to consider two independent r.v. X, Y ≥ 0, and we have
to show that

E[XY ] = E[X] · E[Y ]. (2.2)

W.l.o.g. X, Y simple (for general X and Y there exist increasing sequences of


simple r.v. Xn (resp. Yn ), which are σ(X)-measurable, resp. σ(Y )-measurable,
converging pointwise to X, resp. Y . Then E[Xn Yn ] = E[Xn ] · E[Yn ] for all n
implies (2.2) using monotone integration).
But for X, Y simple, i.e.

    X = Σ_{i=1}^{m} α_i 1_{A_i}    and    Y = Σ_{j=1}^{n} β_j 1_{B_j},

with α_i, β_j ≥ 0 and A_i ∈ σ(X) resp. B_j ∈ σ(Y), it follows that

    E[XY] = Σ_{i,j} α_i β_j · P(A_i ∩ B_j)
          = Σ_{i,j} α_i β_j · P(A_i) · P(B_j) = E[X] · E[Y].

Corollary 2.4. X, Y independent, X, Y ∈ L1

⇒ XY ∈ L1 and E[XY ] = E[X] · E[Y ] .

Proof. Let ε1 , ε2 ∈ {+, −}. Then X ε1 and Y ε2 are independent by Remark 2.2
and nonnegative. Proposition 2.3 implies

E[X ε1 · Y ε2 ] = E[X ε1 ] · E[Y ε2 ].

In particular X ε1 · Y ε2 in L1 , because E[X ε1 ] · E[Y ε2 ] < ∞. Hence

X · Y = X + · Y + + X − · Y − − (X + · Y − + X − · Y + ) ∈ L1 ,

and E[XY ] = E[X] · E[Y ].

Remark 2.5. (i) In general the converse to the above corollary does not hold:
For example let X be N (0, 1)-distributed and Y = X 2 . Then X and Y
are not independent, but

E[XY ] = E[X 3 ] = E[X] · E[Y ] = 0 .

Moreover, although X and Y are uncorrelated we have that E[X + Y + ] 6=


E[X + ] · E[Y + ] (exercise). Thus X + and Y + are not uncorrelated, and
thus uncorrelation is not inherited like independence by composition with
measurable functions (cf. Remark 2.2)

(ii)
X, Y ∈ L2 independent ⇒ X, Y uncorrelated
because

cov(X, Y ) = E[XY ] − E[X] · E[Y ] = 0 .
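The following short Python sketch (illustrative, with arbitrary distributions and sample size) checks E[XY] ≈ E[X]·E[Y] for independent X, Y, and illustrates Remark 2.5(i): X ~ N(0,1) and Y = X^2 are uncorrelated although they are not independent.

    import random

    rng = random.Random(10)
    xs = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
    ys = [rng.expovariate(2.0) for _ in range(200_000)]   # independent of xs

    mean = lambda v: sum(v) / len(v)
    print(mean([x * y for x, y in zip(xs, ys)]), mean(xs) * mean(ys))  # approximately equal
    # X and Y := X^2 are uncorrelated but not independent:
    print(mean([x ** 3 for x in xs]), mean(xs) * mean([x ** 2 for x in xs]))  # both near 0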

Corollary 2.6 (to the strong law of large numbers). Let X1 , X2 , . . . ∈ L2 be


independent with supi∈N var(Xi ) < ∞. Then
    lim_{n→∞} (1/n) Σ_{i=1}^{n} ( X_i - E[X_i] ) = 0      P-a.s.

If E[X_i] ≡ m, then lim_{n→∞} (1/n) Σ_{i=1}^{n} X_i = m   P-a.s.

3 Kolmogorov’s law of large numbers


Proposition 3.1 (Kolmogorov, 1930). Let X1 , X2 , . . . ∈ L1 be independent,
identically distributed (i.i.d.), m = E[Xi ]. Then
n
1X n→∞
Xi −−−→ m P-a.s.
n
| i=1
{z }
empirical
mean

Proposition 3.1 follows from the following more general result:

Proposition 3.2 (Etemadi, 1981). Let X1 , X2 , . . . ∈ L1 be pairwise indepen-
dent, identically distributed, m = E[Xi ]. Then

n
1X n→∞
Xi −−−→ m P-a.s.
n
i=1

Proof. W.l.o.g. Xi > 0. Otherwise consider X1+ , X2+ , . . . (also ∈ L1 , pairwise


independent, identically distributed) and X1− , X2− , . . . (also ∈ L1 , pairwise
independent, identically distributed).

1. Replace Xi by X̃i := 1{Xi <i} Xi .


Clearly,
(
x if x < i
X̃i = hi (Xi ) with hi (x) :=
0 if x > i

Then X̃1 , X̃2 , . . . are pairwise independent by Remark 2.2. For the proof it
Pn
is now sufficient to show that for S̃n := i=1 X̃i we have that

S̃n n→∞
−−−→ m P-a.s.
n

Indeed,

    Σ_{n=1}^{∞} P(X_n ≠ X̃_n) = Σ_{n=1}^{∞} P(X_n ≥ n) = Σ_{n=1}^{∞} P(X_1 ≥ n)
      = Σ_{n=1}^{∞} Σ_{k=n}^{∞} P(X_1 ∈ [k, k+1)) = Σ_{k=1}^{∞} k · P(X_1 ∈ [k, k+1))
      = Σ_{k=1}^{∞} E[ k · 1_{{X_1 ∈ [k,k+1)}} ] ≤ E[X_1] < ∞

(using k · 1_{{X_1 ∈ [k,k+1)}} ≤ X_1 · 1_{{X_1 ∈ [k,k+1)}})

implies by the Borel-Cantelli lemma

P(Xn 6= X̃n infinitely often) = 0 .

2. Reduce the proof to convergence along the subsequence kn = bαn c (=
largest natural number ≤ αn ), α > 1.
We will show in Step 3. that

S̃kn − E[S̃kn ] n→∞


−−−→ 0 P-a.s. (2.3)
kn
This will imply the assertion of the Proposition, because
    i→∞
E[X̃i ] = E 1{Xi <i} · Xi = E 1{X1 <i} · X1 % E[X1 ](= m)

hence
kn
1 1 X n→∞
· E[S̃kn ] = E[X̃i ] −−−→ m,
kn kn
i=1

and thus by (2.3)


1 n→∞
· S̃kn (ω) −−−→ m ∀ω ∈ Nαc where Nα ∈ A with P(Nα ) = 0.
kn
It follows for l ∈ N ∩ [kn , kn+1 ), n big, and ω ∈ Nαc

kn S̃k S̃l S̃k kn+1


· n (ω) 6 (ω) 6 n+1 (ω) · .
kn+1 kn l kn+1 kn
| {z } | {z } | {z } | {z }
n→∞ n→∞ n→∞ n→∞
−−−→ 1
α
−−−→m −−−→m −−−→α

Hence for all ω ∈


/ Nα

1 S̃l (ω) S̃ (ω)


· m 6 lim inf 6 lim sup l 6 α · m.
α l→∞ l l→∞ l

Finally choose a sequence αn & 1. Then for all ω in


\ [
Nαc n = Ω \ Nαn
n>1 n>1

we get

S̃l (ω)
lim = m.
l→∞ l

3. Due to Lemma 1.7.7 it suffices for the proof of (2.3) to show that

∞  !
X S̃kn − E[S̃kn ]
∀ε > 0 : P > ε < ∞,
kn
n=1

(fast convergence in probability to 0).

Pairwise independence of the X̃i implies that the X̃i pairwise uncorrelated,
hence

! kn
S̃kn − E[S̃kn ] 1 1 X
P >ε 6 2 2 · var(S̃kn ) = 2 2 var(X̃i )
kn kn ε kn ε
i=1
kn
1 X 
E (X̃i )2 .

6 2 2
kn ε
i=1

It therefore suffices to show that

∞  kn 
X 1 X  2
 X 1  2

s := 2
E (X̃ i ) = · E (X̃ i ) < ∞.
kn kn2
n=1 i=1 (n,i)∈N2 ,
i6kn

To this end note that

∞  X 
X 1
· E (X̃i )2 .
 
s=
kn2
i=1 n : kn >i

We will show in the following that there exists a constant c such that

X 1 c
6 . (2.4)
kn2 i2
n : kn >i

This will then imply that
(2.4) ∞ ∞
X 1  2
 X 1  2

s 6 c · E (X̃i ) = c · E 1{X <i} · X 1
i2 2 1
i
i=1 i=1
∞  i 
X 1 X 2

6 c l · P X1 ∈ [l − 1, l)
i2
i=1 l=1
∞ ∞
X  !
X 1
l2 ·

= c ·P X1 ∈ [l − 1, l)
i2
l=1 i=l
| {z }
62l−1

X X∞
  
6 2c l · P X1 ∈ [l − 1, l) = 2c E l · 1{X1 ∈[l−1,l)}
l=1 l=1
| {z }
6(X1 +1)·1{X1 ∈[l−1,l)}

6 2c · E[X1 ] + 1 < ∞,
where we used the fact that
∞ ∞ ∞  
X 1 1 X 1 1 X 1 1 1 1 2
6 + = + − = + 6 .
i2 l2 (i − 1)i l2 i−1 i l2 l l
i=l i=l+1 i=l+1

It remains to show (2.4). To this end note that


bαn c = kn 6 αn < kn + 1
 
α>1 α−1
n n n−1
⇒ kn > α − 1 > α − α = αn .
α
| {z }
=:cα

Let ni be the smallest natural number satisfying kni = bαni c > i, hence
αni > i, then
X 1 X 1 1 α2 c−2 1
6 c−2
α = c−2
α · · α −2(ni −1)
6 α
· .
kn2 α2n 1 − α−2 1 − α−2 i2
n : kn >i n>ni

Corollary 3.3. Let X1 , X2 , . . . be pairwise independent, identically distributed


with Xi > 0. Then
    lim_{n→∞} (1/n) Σ_{i=1}^{n} X_i = E[X_1] ∈ [0, ∞]      P-a.s.

Proof. W.l.o.g. E[X_1] = ∞ (otherwise just apply Proposition 3.2). Then (by
Proposition 3.2) (1/n) Σ_{i=1}^{n} (X_i(ω) ∧ N) → E[X_1 ∧ N] as n → ∞, P-a.s. for all N, hence

    (1/n) Σ_{i=1}^{n} X_i ≥ (1/n) Σ_{i=1}^{n} (X_i ∧ N) → E[X_1 ∧ N] ↗ E[X_1]    (n → ∞, then N → ∞)    P-a.s.,

which implies lim inf_{n→∞} (1/n) Σ_{i=1}^{n} X_i = ∞ P-a.s. and the statement of the
corollary follows.
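A minimal numerical illustration of the strong law of large numbers (not part of the notes; Exp(1) is an arbitrary choice of distribution): the running means approach E[X_1] = 1.

    import random

    def running_means(n=100_000, seed=3):
        rng = random.Random(seed)
        s = 0.0
        for i in range(1, n + 1):
            s += rng.expovariate(1.0)       # X_i ~ Exp(1), E[X_i] = 1
            if i in (10, 100, 1000, 10_000, 100_000):
                print(i, s / i)

    running_means()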

Example 3.4. Growth in random media Let Y1 , Y2 , . . . be i.i.d., Yi > 0,


with m := E[Yi ] (existence of such a sequence later!)
Define X0 = 1 and inductively Xn := Xn−1 · Yn
Clearly, Xn = Y1 · . . . · Yn and E[Xn ] = E[Y1 ] · . . . · E[Yn ] = mn , hence

+∞
 if m > 1 exponential growth (supercritical)
E[Xn ] → 1 if m = 1 critical
if m < 1 exponential decay (subcritical)

0

What will be the long-time behaviour of Xn (ω)?


Surprisingly, in the supercritical case m > 1, one may observe that limn→∞ Xn =
0 with positive probability.
Explanation: Suppose that log Yi ∈ L1 . Then by Kolmogorov’s LLN (Propo-
sition 3.1)
n
1 1X n→∞
log Xn = log Yi −−−→ E[log Y1 ] =: α P-a.s.
n n
i=1

and

α < 0: ∃ ε > 0 with α + ε < 0, so that Xn (ω) 6 en(α+ε) ∀ n > n0 (ω), hence
P-a.s. exponential decay
α > 0: ∃ ε > 0 with α − ε > 0, so that Xn (ω) > en(α−ε) ∀ n > n0 (ω), hence
P-a.s. exponential growth
Note that by Jensen’s inequality

α = E[log Y1 ] 6 log E[Y1 ],


| {z }
=m

and typically the inequality is strict, i.e. α < log m, so that it might happen
that α < 0 although m > 1 (!)
Illustration: as a particular example consider the following model.
Let X_0 := 1 be the capital at time 0. At time n-1 invest (1/2)X_{n-1} and win
c·(1/2)X_{n-1} or 0, both with probability 1/2, where c > 0 is a constant. Then

    X_n = (1/2)X_{n-1} (not invested) + gain/loss,   where the gain/loss is
          c·(1/2)X_{n-1} with prob. 1/2 and 0 with prob. 1/2,

thus

    X_n = (1/2)(1+c)X_{n-1} with prob. 1/2,   X_n = (1/2)X_{n-1} with prob. 1/2,
    i.e. X_n = X_{n-1} Y_n

with

    Y_n := (1/2)(1+c) with prob. 1/2,   Y_n := 1/2 with prob. 1/2,

so that E[Y_i] = (1/4)(1+c) + 1/4 = (c+2)/4 (supercritical if c > 2).
On the other hand

    E[log Y_1] = (1/2) · ( log((1/2)(1+c)) + log(1/2) ) = (1/2) · log((1+c)/4) < 0   for c < 3.
Hence X_n → 0 P-a.s. with exponential rate for c < 3, whereas at the same
time for c > 2, E[X_n] = m^n ↗ ∞ with exponential rate.
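A short simulation sketch of the investment model (parameters chosen for illustration; c = 2.5 lies in the regime 2 < c < 3 discussed above): log X_n drifts to -∞ although E[Y_1] = (c+2)/4 > 1.

    import math
    import random

    def simulate_capital(c=2.5, n_steps=2000, seed=4):
        rng = random.Random(seed)
        log_x = 0.0                                   # work with log X_n to avoid underflow
        for _ in range(n_steps):
            y = 0.5 * (1 + c) if rng.random() < 0.5 else 0.5
            log_x += math.log(y)
        return log_x

    print("log X_n after 2000 steps:", simulate_capital())      # strongly negative
    print("E[log Y_1] =", 0.5 * math.log((1 + 2.5) / 4))        # about -0.067 < 0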

Back to Kolmogorov’s law of large numbers: let X1 , X2 , . . . ∈ L1 i.i.d. with


m := E[Xi ]. Then
    (1/n) Σ_{i=1}^{n} X_i(ω) → E[X_1]    (n → ∞)    P-a.s.,

which corresponds to the empirical measurement of the expectation E[X_1]. We
want a similar statement for the distribution P ∘ X_1^{-1}. For this, define the
"random measure"

    ρ_n(ω, A) := (1/n) Σ_{i=1}^{n} 1_A(X_i(ω)) = "relative frequency of visits in A".

Then
n
1X
%n (ω, · ) = δXi (ω)
n
i=1

is a probability measure on R, B(R) (for fixed ω) and it is called the empirical



distribution of the first n observations.

Proposition 3.5. (“Fundamental Theorem of Statistics”) Let X1 , X2 , . . . be


i.i.d. (or merely pairwise independent, identically distributed). Then it holds for
P-almost every ω ∈ Ω
n→∞
%n (ω, · ) −−−→ µ := P ◦ X1−1 weakly,

which corresponds to the empirical measurement of the theoretical distribution


µ.

Proof. Clearly, Kolmogorov’s law of large numbers implies that for any x ∈ R
    F_n(ω, x) := ρ_n(ω, (-∞, x]) = (1/n) Σ_{i=1}^{n} 1_{(-∞,x]}(X_i(ω))
               → E[1_{(-∞,x]}(X_1)] = P(X_1 ≤ x) = µ((-∞, x]) =: F(x)

P-a.s., hence for every ω ∈


/ N (x) for some P-null set N (x).
Then
[
N := N (r).
r∈Q

is a P-null set too, and for all x ∈ R and all s, r ∈ Q with s < x < r and
ω∈ / N:

F (s) = lim Fn (ω, s) 6 lim inf Fn (ω, x)


n→∞ n→∞

6 lim sup Fn (ω, x) 6 lim Fn (ω, r) = F (r).


n→∞ n→∞

Hence, if F is continuous at x, then for ω ∈


/N

lim Fn (ω, x) = F (x).


n→∞

Now the assertion follows from Corollary 10.5.
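As an illustration of Proposition 3.5 (a sketch with an arbitrary choice of distribution and sample size), the empirical distribution function of i.i.d. Exp(1) samples is compared with the true distribution function F(x) = 1 - e^{-x}.

    import math
    import random

    def empirical_cdf_demo(n=50_000, seed=5):
        rng = random.Random(seed)
        xs = [rng.expovariate(1.0) for _ in range(n)]
        for x in (0.5, 1.0, 2.0):
            fn = sum(1 for v in xs if v <= x) / n     # F_n(x) = fraction of observations <= x
            print(x, fn, 1 - math.exp(-x))

    empirical_cdf_demo()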

4 Joint distribution and convolution
Let Xi ∈ L1 i.i.d. Kolmogorov’s law of large numbers implies that
n
1X n→∞
Xi (ω) −−−→ E[X1 ] P-a.s.
n
|i=1 {z }
=:Sn

hence
    ∫ f(x) d(P ∘ (S_n/n)^{-1})(x) = E[ f(S_n/n) ]
      → f(E[X_1]) = ∫ f(x) dδ_{E[X_1]}(x)     (n → ∞, by Lebesgue)     for all f ∈ C_b(R),

i.e., the distribution of S_n/n converges weakly to δ_{E[X_1]}. This is not surprising,
because at least for X_i ∈ L^2

    var(S_n/n) = (1/n^2) Σ_{i=1}^{n} var(X_i) = var(X_1)/n → 0    (n → ∞).

We will see later that if we rescale S_n appropriately, namely S_n/√n, so that
var(S_n/√n) = var(X_1), then the sequence of distributions of S_n/√n is asymptotically
distributed as a normal distribution N(√n E[X_1], var(X_1)). In particular, if S_n
is centered, i.e. E[S_n] = 0, or equivalently E[X_1] = 0, then P ∘ (S_n/√n)^{-1} →
N(0, var(X_1)) weakly.
One problem in this context is: how to calculate the distribution of Sn ? Since
Sn is a function of X1 , . . . , Xn , we need to consider their joint distribution in
the sense of the following definition:

Definition 4.1. Let X1 , . . . , Xn be real-valued r.v. on (Ω, A, P). Then the


distribution µ̄ := P ◦ X̄ −1 of the transformation

X̄ : Ω → Rn ,

ω 7→ X̄(ω) := X1 (ω), . . . , Xn (ω)

under P is said to be the joint distribution of X1 , . . . , Xn .


Note that µ̄ is a probability measure on Rn , B(Rn ) with µ̄(Ā) = P(X̄ ∈ Ā)
for all Ā ∈ B(Rn ).

Remark 4.2. (i) µ̄ is well-defined, because X̄ : Ω → Rn is A/B(Rn )-
measurable.
Proof:

B(Rn ) = σ A1 × · · · × An Ai ∈ B(R)
 
 
(= σ A1 × · · · × An Ai = (−∞, xi ] , xi ∈ R )

and if Ai ∈ B(R), i = 1, ..., n, then


n
\
X̄ −1
(A1 × · · · × An ) = {Xi ∈ Ai } ∈A
| {z }
i=1 ∈A

which implies the measurability of the transformation X̄.

(ii) Proposition 1.11.5 implies that µ̄ is uniquely determined by


n
\ 
µ̄(A1 × · · · × An ) = P {Xi ∈ Ai } .
i=1

Example 4.3. (i) Let X, Y be r.v., uniformly distributed on [0, 1]. Then
• X, Y independent ⇒ joint distribution = uniform distribution on
[0, 1]2
• X = Y ⇒ joint distribution = uniform distribution on the diagonal

(ii) Let X, Y be independent, N (m, σ 2 ) distributed. The following Proposition


4.5 shows that the joint distribution of X and Y has the density

         f(x, y) = (1/(2πσ^2)) · exp( -(1/(2σ^2)) · ( (x-m)^2 + (y-m)^2 ) ),

     which is a particular example of a 2-dimensional normal distribution.

     In the case m = 0 it follows that

         R := √(X^2 + Y^2),    Φ := arctan(Y/X),

     are independent, Φ has a uniform distribution on (-π/2, π/2), and R has the density

         r ↦ (r/σ^2) · exp(-r^2/(2σ^2))  if r ≥ 0,    0  if r < 0.

Definition 4.4. (Products of probability spaces) The product of measurable


spaces (Ωi , Ai ), i = 1, . . . n, is defined as the measurable space (Ω, A) given
by
Ω := Ω1 × . . . × Ωn
endowed with the smallest σ-algebra
A := σ({A1 × . . . × An | Ai ∈ Ai , 1 ≤ i ≤ n})
generated by measurable cylindrical sets. A is said to be the product σ-algebra
Nn
of Ai (notation: i=1 Ai ).
Let Pi , i = 1, . . . , n, be probability measures on (Ωi , Ai ). Then there exists
a unique probability measure P on the product space (Ω, A) satisfying
P(A1 × · · · × An ) = P1 (A1 ) · . . . · Pn (An )
for every measurable cylindrical set. P is called the product measure of Pi .
Nn
Notation: i=1 Pi (uniqueness of P follows from 1.11.5, existence later!)
Proposition 4.5. Let X1 , . . . , Xn be r.v. on (Ω, A, P) with distributions
µ1 , . . . , µn and joint distribution µ̄. Then
    X_1, . . . , X_n independent   ⇔   µ̄ = ⊗_{i=1}^{n} µ_i ,

(i.e. µ̄(A_1 × · · · × A_n) = ∏_{i=1}^{n} µ_i(A_i) if A_i ∈ B(R)).
In this case:
(i) µ̄ is uniquely determined by µ1 , . . . , µn .

(ii)

         ∫ ϕ(x_1, . . . , x_n) dµ̄(x_1, . . . , x_n)
           = ∫ ( · · · ( ∫ ϕ(x_1, . . . , x_n) µ_{i_1}(dx_{i_1}) ) · · · ) µ_{i_n}(dx_{i_n})

     for all B(R^n)-measurable functions ϕ : R^n → R̄ with ϕ ≥ 0 or ϕ µ̄-
integrable.

(iii) If µi is absolutely continuous with density fi (i.e. µi (dxi ) = fi (xi )dxi ),


i = 1, . . . , n, then µ̄ is absolutely continuous with density
n
Y
f¯(x̄) := fi (xi ).
i=1

(i.e. µ̄(dx̄) = f¯(x̄)dx̄)


Proof. The equivalence is obvious.
(i) Obvious from part (ii) of the previous Remark 4.2.

(ii) It is enough to show the identity in (ii) for any ϕ ∈ B(Rn )b . From this it
easily extends to ϕ as stated. Now, the set of ϕ ∈ B(Rn )b for which the
identity in (ii) holds forms a monotone vector space which contains the
multiplicative class of functions ϕ(x1 , . . . , xn ) = 1A1 ×···×An (x1 , . . . , xn ),
Ai ∈ B(R), 1 ≤ i ≤ n. Since B(Rn ) is generated by sets of the form
A_1 × · · · × A_n, the identity in (ii) follows for all ϕ ∈ B(R^n)_b by the
monotone class theorem (Theorem 11.9).

(iii) f¯ is nonnegative and measurable on Rn w.r.t. B(Rn ), and


Z n Z
Y
f¯(x̄) dx̄ = fi (xi ) dxi = 1.
Rn i=1 R

Hence,
Z
µ̌(A) := f¯(x̄) dx̄, A ∈ B(Rn ),
A

defines a probability measure on (Rn , B(Rn )). For A1 , . . . , An ∈ B(R)


it follows that
n
Y n Z
Y
µ̄(A1 × · · · × An ) = µi (Ai ) = fi (xi ) dxi
i=1 i=1 Ai
Z
(ii)
= 1A1 ×···×An (x̄) · f¯(x̄) dx̄ = µ̌(A1 × · · · × An ).

Hence µ̄ = µ̌ by 1.11.5.

Let X1 , . . . , Xn be independent, Sn := X1 + · · · + Xn
How to calculate the distribution of Sn with the help of the distribution of
Xi ?
In the following denote by Tx : R1 → R1 , y 7→ x + y, the translation by
x ∈ R.
Proposition 4.6. Let X1 , X2 be independent r.v. with distributions µ1 , µ2 .
Then:
(i) The distribution of X1 + X2 is given by the convolution
         µ_1 ∗ µ_2 := ∫ µ_1(dx_1) (µ_2 ∘ T_{x_1}^{-1}),   i.e.

         (µ_1 ∗ µ_2)(A) = ∫∫ 1_A(x_1 + x_2) µ_1(dx_1) µ_2(dx_2)
                        = ∫ µ_1(dx_1) µ_2(A - x_1)    for all A ∈ B(R^1).

(ii) If one of the distributions µ1 , µ2 is absolutely continuous, e.g. µ2 with


density f2 , then µ1 ∗ µ2 is absolutely continuous again with density
         f(x) := ∫ µ_1(dx_1) f_2(x - x_1)
               ( = ∫ f_1(x_1) · f_2(x - x_1) dx_1 =: (f_1 ∗ f_2)(x)   if µ_1 = f_1 dx_1 ).

Proof. (i) Let A ∈ B(R), and define Ā := (x1 , x2 ) ∈ R2 x1 + x2 ∈ A .



Then

P(X1 + X2 ∈ A) = P (X1 , X2 ) ∈ Ā = (µ1 ⊗ µ2 )(Ā)
ZZ
= 1Ā (x1 , x2 ) d(µ1 ⊗ µ2 )(x1 , x2 )
ZZ
= 1A (x1 + x2 ) d(µ1 ⊗ µ2 )(x1 , x2 )
Z Z 
= 1A−x1 (x2 ) µ2 (dx2 ) µ1 (dx1 )
Z
= µ2 (A − x1 ) µ1 (dx1 ) = (µ1 ∗ µ2 )(A).

(ii)
Z Z Z
(µ1 ∗ µ2 )(A) = µ1 (dx1 ) µ2 (A − x1 ) = µ1 (dx1 ) f2 (x2 ) dx2
A−x1
change of variable Z Z
x−x1 =x2
= µ1 (dx1 ) f2 (x − x1 ) dx
A
Z Z 
4.5
= µ1 (dx1 ) f2 (x − x1 ) dx.
A

Example 4.7.

(i) Let X1 , X2 be independent r.v. with Poisson-distribution πλ1 and πλ2 .


Then X1 + X2 has Poisson-distribution πλ1 +λ2 , because
    (π_{λ_1} ∗ π_{λ_2})(n) = Σ_{k=0}^{n} π_{λ_1}(k) · π_{λ_2}(n-k)
                           = e^{-(λ_1+λ_2)} Σ_{k=0}^{n} (λ_1^k / k!) · (λ_2^{n-k} / (n-k)!)
                           = e^{-(λ_1+λ_2)} · (1/n!) Σ_{k=0}^{n} (n choose k) λ_1^k λ_2^{n-k}
                           = e^{-(λ_1+λ_2)} · (λ_1+λ_2)^n / n!.

(ii) Let X1 , X2 be independent r.v. with normal distributions N (mi , σi2 ), i =


1, 2. Then X1 +X2 has normal distribution N (m1 +m2 , σ12 +σ22 ), because
fm1 +m2 ,σ12 +σ22 = fm1 ,σ12 ∗ fm2 ,σ22 (Exercise!)

(iii) The Gamma distribution Γα,p is defined through its density γα,p given by
         γ_{α,p}(x) = (1/Γ(p)) · α^p x^{p-1} e^{-αx}   if x > 0,      γ_{α,p}(x) = 0   if x ≤ 0.

If X1 , X2 are independent with distribution Γα,pi , i = 1, 2, then X1 + X2


has distribution Γα,p1 +p2 . (Exercise!)
In the particular case pi = 1: The sum Sn = T1 + . . . + Tn of indepen-
dent r.v. Ti with exponential distribution with parameter α has Gamma-
distribution Γα,n , i.e.
         γ_{α,n}(x) = (α^n / (n-1)!) · e^{-αx} x^{n-1}   if x > 0,      γ_{α,n}(x) = 0   if x ≤ 0.

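Part (i) of this example can be checked numerically; the following sketch (parameters and the evaluation point are arbitrary) computes the convolution of two Poisson weight sequences at a single point and compares it with the Poisson(λ_1 + λ_2) weight.

    import math

    def poisson_pmf(lam, k):
        return math.exp(-lam) * lam ** k / math.factorial(k)

    l1, l2, n = 1.3, 2.1, 4
    conv = sum(poisson_pmf(l1, k) * poisson_pmf(l2, n - k) for k in range(n + 1))
    print(conv, poisson_pmf(l1 + l2, n))     # the two numbers agree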
Example 4.8 (The waiting time paradox). Let T1 , T2 , . . . be independent, ex-
ponentially distributed waiting times (e.g. time between reception of two phone
calls in a call center, or two arrivals/departures of buses at a bus station) with
parameter α > 0, so that in particular
    E[T_i] = ∫_0^∞ x · αe^{-αx} dx = · · · = 1/α.

Set Sn := T1 + · · · + Tn , S0 := 0 and fix some time t > 0. Let X denote


the time-interval from the preceding event to t, and Y denote the time-interval
from t to the next event, i.e. for some n ∈ N ∪ {0}, we have X = t − Sn and
Y = Sn+1 − t. More precisely

    X = t - Σ_{k=1}^{∞} 1_{{S_k ≤ t}} T_k     and     Y = Σ_{k=1}^{∞} 1_{{S_{k-1} ≤ t}} T_k - t.

[Picture: the waiting times T_1, T_2, . . . laid out on the time axis; X is the distance from
the last event before t to t, and Y the distance from t to the next event.]
Question: How long on average is the waiting time from t until the next event
(“phone call will be received”, “bus will arrive/depart”), i.e. how big is E[Y ] ?

Naive guess: E[Y] = 1/(2α), which is wrong.

Correct answer: E[Y] = 1/α, and

    E[X] = (1/α)(1 - e^{-αt}) ≈ 1/α    for large t.
More precisely:

(i) Y has exponential distribution with parameter α.

(ii) X has exponential distribution with parameter α, “compressed to” [0, t],
i.e.

P(X > s) = e−αs ∀ 0 6 s 6 t,


P(X = t) = e−αt ;

In particular,

    E[X] = ∫_0^t s · αe^{-αs} ds + t · e^{-αt} = · · · = (1/α)(1 - e^{-αt}),

and [t - X, t + Y] has on average length (1/α)(2 - e^{-αt})   (≈ 2 · 1/α for large t).

(iii) X, Y are independent.

Descriptive: Choosing t “randomly” (i.e. for different ω ∈ Ω, t may be in a dif-


ferent interval [Sn (ω), Sn+1 (ω))) it is more likely to pick a large waiting interval.

Proof. Let us first determine the joint distribution of X and Y : Fix 0 6 x 6 t


and y > 0. Then for Sn := T1 + · · · + Tn , S0 := 0:

P(X > x, Y > y)


[  
= P {t − Sn > x, Sn+1 − t > y}
n≥0

X 
= P(T1 > y + t) + P Sn 6 t − x, Tn+1 > y + t − Sn
n=1
∞ ZZ
X
= e−α(t+y) + 1[0,t−x]×[y+t−s,∞) (s, r) · γα,n (s) · αe−αr ds dr
n=1
∞ Z
X t−x
−α(t+y)
=e + γα,n (s) · e−α(y+t−s) ds
n=1 0
 Z t−x ∞
X 
−α(t+y) αs
=e 1+ e γα,n (s) ds
0
|n=1 {z }

 Z t−x 
= e−α(t+y) 1 + αeαs ds
0
−α(t+y)
=e ·e α(t−x)
= e−αy · e−αx .

Consequently:

(i) Setting x = 0 we get: Y is exponentially distributed with parameter α.

(ii) Setting y = 0 we get: X is exponentially distributed with parameter α,
     compressed to [0, t].

(iii) X, Y are independent.
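A simulation sketch of the waiting time paradox (α, t, the number of runs and the seed are arbitrary choices): the estimated E[Y] stays close to 1/α, not 1/(2α), and E[X] ≈ (1 - e^{-αt})/α.

    import random

    def waiting_times(alpha=1.0, t=10.0, n_runs=100_000, seed=6):
        rng = random.Random(seed)
        total_X = total_Y = 0.0
        for _ in range(n_runs):
            s = 0.0
            while True:
                s_next = s + rng.expovariate(alpha)
                if s_next > t:
                    total_X += t - s          # time since the last event before t
                    total_Y += s_next - t     # time until the next event after t
                    break
                s = s_next
        print("E[X] ~", total_X / n_runs, " E[Y] ~", total_Y / n_runs, " 1/alpha =", 1 / alpha)

    waiting_times()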

5 Characteristic functions
Let M1+ (Rn ) be the set of all probability measures on (Rn , B(Rn )).
For given µ ∈ M1+ (Rn ) define its characteristic function as the complex-
valued function µ̂ : Rn → C defined by
    µ̂(u) := ∫ e^{i⟨u,y⟩} µ(dy) := ∫ cos(⟨u,y⟩) µ(dy) + i ∫ sin(⟨u,y⟩) µ(dy).

Proposition 5.1. Let µ ∈ M1+ (Rn ). Then

(i) µ̂(0) = 1.

(ii) |µ̂| 6 1.

(iii) µ̂(−u) = µ̂(u).

(iv) µ̂ is uniformly continuous.

(v) µ̂ is positive definite, i.e. for all c1 , . . . , cm ∈ C, u1 , . . . , um ∈ Rn , m > 1:


    Σ_{j,k=1}^{m} c_j c̄_k · µ̂(u_j - u_k) ≥ 0.

Proof. Exercise.

Proposition 5.2 (Uniqueness theorem). Let µ1 , µ2 ∈ M1+ (Rn ) with µ̂1 = µ̂2 .
Then µ1 = µ2 .

Proposition 5.3 (Bochner’s theorem). Let ϕ : Rn → C be a continuous,


positive definite function with ϕ(0) = 1. Then there exists one (and only one)
µ ∈ M1+ (Rn ) with µ̂ = ϕ.

Proposition 5.4 (Lévy’s continuity theorem). Let (µm )m∈N be a sequence in
M1+ (Rn ). Then
(i) limm→∞ µm = µ weakly implies limm→∞ µ̂m = µ̂ uniformly on every
compact subset of Rn .

(ii) Conversely, if (µ̂m )m∈N converges pointwise to some function ϕ : Rn → C


which is continuous at u = 0, then there exists a unique µ ∈ M1+ (Rn )
such that µ̂ = ϕ and limm→∞ µm = µ weakly.
Proof. For the proofs or references where to find the proofs of the three previous
propositions see Klenke, Theorems 15.8, 15.29, and 15.23.
Let (Ω, A, P) be a probability space and X̄ : Ω → Rn be A/B(Rn )-
measurable. Let PX̄ (:= P ◦ X̄ −1 ) be the distribution of X̄. Then
    ϕ_X̄(u) := P̂_X̄(u) = ∫ e^{i⟨u,y⟩} P_X̄(dy) = ∫ e^{i⟨u,X̄⟩} dP = E[ e^{i⟨u,X̄⟩} ]

is called the characteristic function of X̄.


Remark 5.5. X_1, . . . , X_n are independent if and only if

    P̂_{(X_1,...,X_n)}(u_1, . . . , u_n) = (⊗_{j=1}^{n} P_{X_j})^(u_1, . . . , u_n)    ( = ∏_{j=1}^{n} P̂_{X_j}(u_j) ),

(here P̂_{(X_1,...,X_n)} = ϕ_X̄ and P̂_{X_j}(u_j) = ϕ_{X_j}(u_j)), i.e.

    P̂_{(X_1,...,X_n)} = ∏_{j=1}^{n} P̂_{X_j} ∘ Pr_j ,    where Pr_j(u) = u_j .

Proposition 5.6. Let X_1, . . . , X_n be independent r.v., α ∈ R and S := α Σ_{k=1}^{n} X_k.
Then for all u ∈ R:

    ϕ_S(u) = ∏_{k=1}^{n} ϕ_{X_k}(αu).

Proof.

    ϕ_S(u) = ∫ e^{iuS} dP = ∫ ∏_{k=1}^{n} e^{iαuX_k} dP = ∏_{k=1}^{n} ∫ e^{iαuX_k} dP     (by independence)
           = ∏_{k=1}^{n} ϕ_{X_k}(αu).

Proposition 5.7. For all u ∈ R^n:

    (1/(2π))^{n/2} ∫ e^{i⟨u,y⟩} e^{-||y||^2/2} dy = e^{-||u||^2/2}.

Proof. See Theorem 15.12 in Klenke.

Example 5.8. (i) δ̂a (u) = eiua for all a, u ∈ R.


(ii) Let µ := Σ_{i=0}^{∞} α_i δ_{a_i}   (α_i ≥ 0, Σ_{i=0}^{∞} α_i = 1). Then

         µ̂(u) = Σ_{i=0}^{∞} α_i e^{iua_i},    u ∈ R.

     Special cases:

     a) Binomial distribution β_n^p = Σ_{k=0}^{n} (n choose k) p^k q^{n-k} δ_k,  q = 1 - p,  p ∈ [0, 1].
        Then for all u ∈ R:

            β̂_n^p(u) = Σ_{k=0}^{n} (n choose k) p^k q^{n-k} · e^{iuk} = (q + pe^{iu})^n.

     b) Poisson distribution π_α = Σ_{n=0}^{∞} e^{-α} (α^n/n!) δ_n. Then for all u ∈ R:

            π̂_α(u) = e^{-α} Σ_{n=0}^{∞} (α^n/n!) · e^{iun} = e^{-α} Σ_{n=0}^{∞} (αe^{iu})^n/n! = e^{α(e^{iu}-1)}.
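Case b) can be verified numerically; the sketch below (α and u arbitrary, series truncated at n = 100) compares the directly summed characteristic function of π_α with the closed form e^{α(e^{iu}-1)}.

    import cmath
    import math

    def poisson_cf(alpha, u, n_max=100):
        return sum(math.exp(-alpha) * alpha ** n / math.factorial(n) * cmath.exp(1j * u * n)
                   for n in range(n_max + 1))

    alpha, u = 2.0, 1.7
    print(poisson_cf(alpha, u), cmath.exp(alpha * (cmath.exp(1j * u) - 1)))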

6 Central limit theorem


Definition 6.1. Let X1 , X2 , . . . ∈ L2 be independent r.v.’s with strictly posi-
tive variances, Sn := X1 + · · · + Xn and

    S_n^* := (S_n - E[S_n]) / √(var(S_n))        ("standardized sum")

In particular E[Sn∗ ] = 0 and var(Sn∗ ) = 1.

The sequence X1 , X2 , . . . of r.v.’s is said to have the central limit property
(CLP), if

    lim_{n→∞} P_{S_n^*} = N(0, 1) weakly,

or equivalently

    lim_{n→∞} P(S_n^* ≤ b) = (1/√(2π)) ∫_{-∞}^{b} e^{-x^2/2} dx = Φ(b),    for all b ∈ R.

Proposition 6.2. (Central limit theorem) Let X1 , X2 , . . . ∈ L2 be independent


r.v., σn2 := var(Xn ) > 0 and
    s_n := ( Σ_{k=1}^{n} σ_k^2 )^{1/2}.

Assume that (X_n)_{n∈N} satisfies Lindeberg's condition

    lim_{n→∞} Σ_{k=1}^{n} ∫_{{ |X_k - E[X_k]| > ε s_n }} ( (X_k - E[X_k]) / s_n )^2 dP = 0    for all ε > 0,      (L)

where the sum on the left-hand side is denoted by L_n(ε).

Then (Xn )n∈N has the CLP.

Remark 6.3. (i) (Xn )n∈N i.i.d. ⇒ (Xn )n∈N satisfies (L).
Proof: Let m := E[Xn ], σ 2 := var(Xn ). Then s2n = nσ 2 , so that
Z Lebesgue
−2 2 n→∞
Ln (ε) = σ √
(X1 − m) dP −−−→ 0.
{|X1 −m|>ε nσ}

(ii) The following stronger condition, known as Lyapunov’s condition, is often


easier to check (typically with δ = 1):
    ∃ δ > 0 :   lim_{n→∞}  ( Σ_{k=1}^{n} E[ |X_k - E[X_k]|^{2+δ} ] ) / s_n^{2+δ} = 0.       (Lya)

To see that Lyapunov’s condition implies Lindeberg’s condition note that
for all ε > 0:
    |X_k - E[X_k]| / (ε s_n) > 1   ⇒   |X_k - E[X_k]|^{2+δ} / (ε s_n)^δ ≥ |X_k - E[X_k]|^2,

and therefore

    L_n(ε) ≤ (1/ε^δ) · ( Σ_{k=1}^{n} E[ |X_k - E[X_k]|^{2+δ} ] ) / s_n^{2+δ}.

(iii) Let (Xn ) be bounded and suppose that sn → ∞. Then (Xn ) satisfies
Lyapunov’s condition for any δ > 0, because
α
|Xk | 6
2
⇒ Xk − E[Xk ] 6 α
n h i n h i
X 2+δ X 2 δ
E Xk − E[Xk ] E Xk − E[Xk ] α
k=1 k=1
⇒ 6
s2+δ
n s2n sδn
 δ n i  α δ
α 1 X
h
2
= · E Xk − E[Xk ] = .
sn s2n sn
|k=1 {z }
=s2n

Lemma 6.4. Suppose that (Xn ) satisfies Lindeberg’s condition. Then


σk
lim max = 0. (2.5)
n→∞ 16k6n sn

Proof. For all 1 ≤ k ≤ n, n ∈ N and ε > 0, we have

    (σ_k / s_n)^2 = ∫ ( (X_k - E[X_k]) / s_n )^2 dP
                  ≤ ∫_{{ |X_k - E[X_k]| > ε s_n }} ( (X_k - E[X_k]) / s_n )^2 dP + ε^2
                  ≤ L_n(ε) + ε^2.

The proof of Proposition 6.2 requires some further preparations.
Lemma 6.5. For all t ∈ R and n ∈ N:

    | e^{it} - 1 - it/1! - (it)^2/2! - ... - (it)^{n-1}/(n-1)! | ≤ |t|^n / n!.

Proof. Define f(t) := e^{it}, then f^{(k)}(t) = i^k e^{it}. Then Taylor series expansion
around t = 0, applied to real and imaginary part, implies that

    e^{it} - 1 - it/1! - ... - (it)^{n-1}/(n-1)! = R_n(t)

with

    |R_n(t)| = | (1/(n-1)!) ∫_0^t (t-s)^{n-1} i^n e^{is} ds | ≤ (1/(n-1)!) ∫_0^{|t|} s^{n-1} ds = |t|^n / n!.

Proposition 6.6. Let X ∈ L^2. Then ϕ_X(u) = ∫ e^{iuX} dP is two times
continuously differentiable with

    ϕ_X'(u) = i ∫ X e^{iuX} dP ,      ϕ_X''(u) = - ∫ X^2 e^{iuX} dP .

In particular

    ϕ_X'(0) = i · E[X],    ϕ_X''(0) = - E[X^2],    |ϕ_X''| ≤ E[X^2].

Moreover, for all u ∈ R

    ϕ_X(u) = 1 + iu · E[X] + (1/2) · θ(u) u^2 · E[X^2]

with |θ(u)| ≤ 1 and θ(u) ∈ C.
Proof. Clearly,

(eiuX )0 = iX · eiuX , (eiuX )00 = −X 2 eiuX , |eiuX | = 1 .

Now, Lebesgue’s dominated convergence theorem implies all assertions up to


the last one. For the proof of the last assertion note that the previous lemma
implies in the case n = 2, t = uX that
1 2 2
|eiuX − 1 − iuX| 6 ·u X .
2

Hence

    | ϕ_X(u) - 1 - iu · E[X] | = | ∫ (e^{iuX} - 1 - iuX) dP | ≤ (1/2) · u^2 · E[X^2].

Now define θ(u) := 0 if u^2 E[X^2] = 0, and
θ(u) := ( ϕ_X(u) - 1 - iu · E[X] ) / ( (1/2) · u^2 · E[X^2] ) otherwise.

From now on assume that X1 , X2 , · · · ∈ L2 are independent and


    E[X_n] = 0,    σ_n^2 := var(X_n) > 0,    s_n = ( Σ_{k=1}^{n} σ_k^2 )^{1/2},    n ≥ 1.

Proposition 6.7. Suppose that the following two conditions (F) and (b) hold:
    (F)   lim_{n→∞} max_{1≤k≤n} σ_k / s_n = 0          (Feller's condition)

    (b)   lim_{n→∞} Σ_{k=1}^{n} ( ϕ_{X_k}(u/s_n) - 1 ) = - (1/2) u^2    for all u ∈ R.

Then (Xn ) has the CLP.


Proof. It is sufficient to show that
n
Y u 1 2
lim ϕXk = e− 2 u , ∀u ∈ R, (2.6)
n→∞ sn
k=1
Pn
because for Sn∗ = 1
sn k=1 Xk we have by Proposition 5.6 that
n
Y u
ϕSn∗ (u) = ϕXk ,
sn
k=1

n→∞ 1 2
and ϕSn∗ (u) −−−→ e− 2 u (= N
\ (0, 1)(u) by Proposition 5.7) pointwise as well
as N
\ (0, 1)(u) continuous at u = 0, implies by Lévy’s continuity theorem 5.4
and the Uniqueness theorem 5.2 that limn→∞ PSn∗ = N (0, 1) weakly.
For the proof of (2.6) we need to show that for all u ∈ R
n n
u Y u
Y  
lim ϕXk − exp ϕXk −1 = 0.
n→∞ sn sn
k=1
|k=1 P {z }
= exp( ...) → exp(− 12 u2 ) by (b)

To this end fix u ∈ R and note that |ϕXk | 6 1, hence

u u
   
exp ϕXk −1 = exp Re ϕXk − 1 6 1.
sn sn

Furthermore, for a1 , . . . , an , b1 , . . . , bn ∈ z ∈ C |z| 6 1




n
Y n
Y
ak − bk
k=1 k=1

= (a1 − b1 ) · a2 · · · an + b1 · (a2 − b2 ) · a3 · · · an + . . .
+ b1 · · · bn−1 · (an − bn )
n
X
6 |ak − bk |.
k=1

Consequently,
n n
u Y u
Y  
ϕXk − exp ϕXk −1
sn sn
k=1 k=1
n
u u
X  
6 ϕXk − exp ϕXk −1 =: Dn .
sn sn
k=1

If we define zk := ϕXk ( sun ) − 1, we can write

n
X
Dn = |zk + 1 − ezk |.
k=1

Fix ε ∈ (0, 12 ]. Then

|z + 1 − ez | 6 |z|2 6 ε|z| ∀ |z| 6 ε .

Note that E[Xk ] = 0 and E[Xk2 ] = σk2 . The previous proposition now implies
that for all k
u u 1 u u 2 1 u 2 2
|zk | = ϕXk −1 = i ·E[Xk ]+ ·θ( ) ·E[Xk2 ] 6 σk ,
sn sn 2 sn sn 2 sn

and moreover by (F) we can find n0 ∈ N such that for all n > n0 and 1 6 k 6 n
1 u 2 2
σk < ε.
2 sn
Hence for all n > n0
n n
X u2 X σk2 u2
Dn 6 ε |zk | 6 ε = ε · .
2 s2n 2
k=1 k=1

Consequently, limn→∞ Dn = 0.
Proof of Proposition 6.2. W.l.o.g. assume that E[Xn ] = 0 for all n ∈ N. We
will use Proposition 6.7. Since (L) ⇒ (F) by Lemma 6.4 it remains to show (b)
of Proposition 6.7. We will show that Lindeberg’s condition implies (b), i.e. we
show that (L) implies
n 
u 1
X 
lim ϕXk − 1 = − · u2 .
n→∞ sn 2
k=1

Let u ∈ R, n ∈ N, 1 6 k 6 n. By Lemma 6.5, we get


3
u u 1 u2 1 u
 
Yk := exp i · · Xk − 1 − i · · Xk + · 2 · Xk2 6 · Xk ,
sn sn 2 sn 6 sn
| {z }
E[... ]=0

and by the triangle inequality and Lemma 6.5 again, we get

u u 1 u2 u2
 
Yk 6 exp i · · Xk − 1 − i · · Xk + · 2 · Xk2 6 2 · Xk2 .
sn sn 2 sn sn

Then
n  n 
u 1 u2 2
X  1 X u

2
ϕXk −1 + ·u = ϕXk − 1 + · 2 · σk
sn 2 sn 2 sn
k=1 k=1

n Z 
u u 1 u2
X 
· Xk + · 2 · Xk2 dP

6 exp i · · Xk − 1 − i ·
sn sn 2 sn
k=1 | {z }
E[... ]=0
n
X
6 E[Yk ],
k=1

and for any ε > 0
Z Z
E[Yk ] = Yk dP + Yk dP
{|Xk |>εsn } {|Xk |<εsn }

u2 |u|3
Z Z
6 2 Xk2 dP + 3 |Xk |3 dP.
sn {|Xk |>εsn } 6sn {|Xk |<εsn }

Note that
σk2
Z Z
1 3ε
|Xk | dP 6 2 Xk2 dP = ε · ,
s3n {|Xk |<εsn } sn s2n

so that we obtain
n n Z n
X
2
X X k 2 |u|3 X σk2
E[Yk ] 6 u dP + ·ε
X
{| snk |>ε} sn 6 s2n
k=1 k=1
|k=1{z }
=1

|u|3
= u2 Ln (ε) + · ε.
6
Consequently
n
X
lim E[Yk ] = 0 ,
n→∞
k=1
and thus
n h
u 1
X i 
lim ϕXk − 1 + · u2 = 0.
n→∞ sn 2
k=1

Example 6.8 (Applications). (i) "Ruin probability"


Consider a portfolio of n contracts of a risk insurance (e.g. car insurance,
fire insurance, health insurance, ...). Let X_i ≥ 0 be the claim size (or claim
severity) of the i-th contract, 1 ≤ i ≤ n. We assume that X_1, . . . , X_n ∈ L^2
are i.i.d. with m := E[X_i] and σ^2 := var(X_i).
Suppose the insurance holder has to pay the following premium

    Π := m + λσ^2  =  average claim size + safety loading.

After some fixed amount of time:

    Income: nΠ         Expenditures: S_n = Σ_{i=1}^{n} X_i.

Suppose that K is the initial capital of the insurance company. What is


the probability P(R), where

R := {Sn > K + nΠ} denotes the ruin ?

We assume here that:


• No interest rate.
• Payments due only at the end of the time period.
Let

    S_n^* := (S_n - nm) / (√n σ).

The central limit theorem implies for large n that S_n^* ~ N(0, 1), so that

    P(R) = P( S_n^* > (K + nΠ - nm)/(√n σ) ) = P( S_n^* > (K + nλσ^2)/(√n σ) )
         ≈ 1 - Φ( (K + nλσ^2)/(√n σ) )        (by the CLT),

where the argument (K + nλσ^2)/(√n σ) tends to ∞ as n → ∞,

where Φ denotes the distribution function of the standard normal distribu-


tion. Note that the ruin probability decreases with an increasing number
of contracts.
Example
Assume that n = 2000, σ = 60, λ = 0.5‰.

(a) K = 0 ⇒ P(R) ≈ 1 − Φ(1.342) ≈ 9%.


(b) K = 1500 ⇒ P(R) ≈ 3%.

How large do we have to choose n in order to let the probability of ruin P(R)
fall below 1‰?
Answer: we need Φ(λσ√n) ≥ 0.999 (for K = 0), hence n ≥ 10 611; these figures are
checked in the short numerical sketch at the end of this example.

(ii) Stirling’s formula


Remark: Stirling proved the following formula
    n! ≈ √(2π) n^{n+1/2} e^{-n}                                    (2.7)

in the year 1730 and De Moivre used it in his proof of the CLT for Bernoulli
experiments (1733).
Conversely, in 1977, Weng provided an independent proof of the formula,
using the CLT (note that we did not use Stirling’s formula in our proof of
the CLT). Here is Weng’s proof:
Let X1 , X2 , . . . be i.i.d. with distribution π1 (Poisson distribution with
parameter 1), i.e.,

−1
X 1
PXn = e δk .
k!
k=0

Then Sn := X1 + · · · + Xn has Poisson distribution πn , i.e.,



−n
X nk
PSn = e δk ,
k!
k=0

and in particular E[Sn ] = var(Sn ) = n. Thus

Sn − n
Sn∗ = √ ,
n

so that Sn∗ = tn ◦ Sn for tn (x) := √ .


x−n
n
Then
Z Z
= E f (Sn∗ ) = E (f ◦ tn )(Sn ) =
   
f dPSn∗ f ◦ tn d PSn
|{z}
=πn

In particular, for

f∞ (x) := x−

= (−x) ∨ 0

it follows that
Z Z x − n
f∞ dPSn∗ = f∞ √ πn (dx)
R n
|
(
{z }
= 0 if x>n
n−x
= √
n
if x6n

n
−n
X nk n−k
=e · √
k! n
k=0 | {z }
=f∞ ( k−n
√ )
n

n
e−n X nk (n − k)
 
= √ · n+
n k!
k=1
n
e−n X nk+1 nk
  
= √ · n+ −
n k! (k − 1)!
|k=1 {z }
n+1 1
= n n! − n0! (telescoping sum)
1
e−n · nn+ 2
= .
n!
Moreover,
Z Z 0 0
1 − x2
2
1 x2 1
f∞ dN (0, 1) = √ (−x)·e dx = √ ·e− 2 =√ .
2π −∞ 2π −∞ 2π

Hence, Stirling’s formula (2.7) would follow, once we have shown that
Z Z
n→∞
f∞ dPSn∗ −−−→ f∞ dN (0, 1). (2.8)

Note that this is not implied by the weak convergence in the CLT since
f∞ is continuous but unbounded. Hence, we consider for given m ∈ N

fm := f∞ ∧ m ∈ Cb (R) .

The CLT now implies that


Z Z
n→∞
fm dPSn∗ −−−→ fm dN (0, 1).

Define gm := f∞ − fm (≥ 0). (2.8) then follows from a "3ε-argument",
once we have shown that
Z
1
(0 6) gm dPSn∗ 6 ∀m,
m
Z
1
(0 6) gm dN (0, 1) 6 ∀m.
m

The first inequality follows from


Z Z

gm dPSn∗ = |x| − m dPSn∗
]−∞,−m[
Z
6 |x| dPSn∗
]−∞,−m]
|x|
>1
Z
m 1
6 x2 dPSn∗
m ]−∞,−m]
1
6 · var(Sn∗ ) ,
m | {z }
=1

the second inequality can be shown similarly.
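The following short Python sketch (not part of the lecture notes) checks the numbers from both parts of this example: the ruin probabilities of part (i), with Φ computed via the error function, and the ratio in Stirling's formula (2.7) for a moderate n.

    import math

    def Phi(x):                      # standard normal distribution function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def ruin_probability(n, K, sigma=60.0, lam=0.0005):
        return 1.0 - Phi((K + n * lam * sigma ** 2) / (math.sqrt(n) * sigma))

    print(ruin_probability(2000, 0))        # about 0.09
    print(ruin_probability(2000, 1500))     # about 0.03
    print(ruin_probability(10_611, 0))      # about 0.001

    n = 20                                   # Stirling (2.7): the ratio is close to 1
    print(math.factorial(n) / (math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)))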

3 Conditional probabilities

1 Elementary definitions
Let (Ω, A, P) be a probability space.

Definition 1.1. Let B ∈ A with P(B) > 0. Then

P(A ∩ B)
P(A | B) := , A ∈ A,
P(B)

is called the conditional probability of A given B. In the case P(B) = 0 we


simply define P(A | B) := 0. For P(B) > 0 the probability measure

PB := P( · | B)

on (Ω, A) is called the conditional distribution given B.

Remark 1.2. (i) P(A) is called the a priori probability of A.


P(A | B) is called the a posteriori probability of A, given the information
that B occurred.

(ii) In the case of Laplace experiments

|A ∩ B|
P(A | B) = = proportion of elements in A that are in B.
|B|

(iii) If A and B are disjoint (hence A ∩ B = ∅), then P(A | B) = 0.

(iv) If A and B are independent, then

P(A) · P(B)
P(A | B) = = P(A).
P(B)

Example 1.3. (i) Suppose that a family has two children. Consider the fol-
lowing two events: B := "at least one boy" and A := "two boys". Then
P(A | B) = 13 , because

Ω = (boy, boy), (girl, boy), (boy, girl), (girl, girl) ,
P = uniform distribution,

and thus
|A ∩ B| 1
P(A | B) = = .
|B| 3

(ii) Let X1 , X2 be independent r.v. with Poisson distribution with parameters


λ1 , λ2 . Then
    P(X_1 = k | X_1 + X_2 = n) = 0   if k > n,      and   = ?   if 0 ≤ k ≤ n.

According to Example 4.7 X1 +X2 has Poisson distribution with parameter


λ := λ1 + λ2 . Consequently,

    P(X_1 = k | X_1 + X_2 = n) = P(X_1 = k, X_2 = n-k) / P(X_1 + X_2 = n)
      = ( e^{-λ_1} λ_1^k/k! · e^{-λ_2} λ_2^{n-k}/(n-k)! ) / ( e^{-λ} λ^n/n! )
      = (n choose k) · (λ_1/λ)^k (λ_2/λ)^{n-k},

i.e., P(X_1 ∈ · | X_1 + X_2 = n) is the binomial distribution with parameters
n and p = λ_1/(λ_1 + λ_2).

(iii) Consider n independent 0-1-experiments X1 , . . . , Xn with success proba-


bility p ∈ (0, 1). Let

Sn := X1 + . . . + Xn

and

Xi : Ω := {0, 1}n → {0, 1},


(x1 , . . . , xn ) 7→ xi .

For given (x1 , . . . , xn ) ∈ {0, 1}n and fixed k ∈ {0, . . . , n}

P(X1 = x1 , . . . , Xn = xn | Sn = k)
if i xi 6= k
( P
0
= pk (1−p)n−k n −1
otherwise

=
(nk)pk (1−p)n−k k

Thus the conditional distribution ν := P (X1 , ..., Xn ) ∈ · | Sn = k is



the uniform distribution on
n n
X o
Ωk := (x1 , . . . , xn ) xi = k ,
i=1

n
because |Ωk | = and for all (x1 , ...xn ) ∈ Ωk we have ν({(x1 , ...xn )}) =

k
n −1
.

k

Proposition 1.4. (Formula for total probability) Let B1 , . . . , Bn be dis-


Sn
joint, Bi ∈ A ∀ 1 ≤ i ≤ n. Then for all A ⊂ i=1 Bi , A ∈ A:
n
X
P(A) = P(A | Bi ) · P(Bi ).
i=1

Proof. Clearly, A = ∪ i6n (A ∩ Bi ). Consequently,


n n n
X X P(A ∩ Bi ) X
P(A) = P(A ∩ Bi ) = P(Bi ) = P(A | Bi )P(Bi ) .
P(Bi )
i=1 i:P(Bi )>0 i=1

Example 1.5. (Simpson’s paradox)


Consider applications of male (M ) and female (F ) students at a university
in the United States
Applications accepted
M 2084 1036 PM (A) ≈ 0.49
F 1067 349 PF (A) ≈ 0.33

Is this an example of discrimination against female students? A closer look at the
biggest four faculties B_1, . . . , B_4:

male female
Appl. acc. PM (A | Bi ) Appl. acc. PF (A | Bi )
B1 826 551 0.67 108 89 0.82
B2 560 353 0.63 25 17 0.68
B3 325 110 0.34 593 219 0.37
B4 373 22 0.06 341 24 0.07
2084 1036 1067 349
It follows that for all four faculties the probability of being accepted was higher
for female students than it was for male students:

PM (A | Bi ) < PF (A | Bi ).

Nevertheless, the preference turns into its opposite when looking at the total
probability of admission:

    P_F(A) = Σ_{i=1}^{4} P_F(A | B_i) · P_F(B_i)  <  Σ_{i=1}^{4} P_M(A | B_i) · P_M(B_i) = P_M(A).

For an explanation consider the distributions of applications:


    P_M(B_1) = |B_1 ∩ M| / |M| = 826/2084 ≈ 4/10,
    P_F(B_1) = |B_1 ∩ F| / |F| = 108/1067 ≈ 1/10,
etc. and observe that male students mainly applied at faculties with a high
probability of admission, whereas female students mainly applied at faculties
with a low probability of admission.
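The reversal can be reproduced directly from the four-faculty table; the sketch below recomputes the per-faculty admission rates and the total admission rates from the raw counts given above.

    male   = {"B1": (826, 551), "B2": (560, 353), "B3": (325, 110), "B4": (373, 22)}
    female = {"B1": (108,  89), "B2": ( 25,  17), "B3": (593, 219), "B4": (341, 24)}

    def admission_rate(table):
        applications = sum(a for a, _ in table.values())
        accepted = sum(c for _, c in table.values())
        return accepted / applications

    for faculty in male:
        print(faculty, male[faculty][1] / male[faculty][0], female[faculty][1] / female[faculty][0])
    print("total:", admission_rate(male), admission_rate(female))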

Proposition 1.6 (Bayes’ theorem). Let B1 , . . . , Bn ∈ A be disjoint with


P(B_i) > 0 for i = 1, . . . , n. Let A ∈ A, A ⊂ ∪_{i=1}^{n} B_i with P(A) > 0. Then:

    P(B_i | A) = P(A | B_i) · P(B_i) / Σ_{j=1}^{n} P(A | B_j) · P(B_j).

Proof.
P(A ∩ Bi ) 1.4 P(A | Bi ) · P(Bi )
P(Bi | A) = = n .
P(A) X
P(A | Bj ) · P(Bj )
j=1

Example 1.7 (A posteriori probabilities in medical tests). Suppose that one out
of 145 persons of the same age have the disease D, i.e. the a priori probability
of having D is P(D) = 145
1
.
Suppose now that a medical test for D is given which detects D in 96 % of
all cases, i.e.
P(positive | D) = 0.96 .
However, the test also is positive in 6% of the cases, where the person does not
have D, i.e.
P(positive | Dc ) = 0.06 .
Suppose now that the test is positive. What is the a posteriori probability of
actually having D?
So we are interested in the conditional probability P(D | positive):
    P(D | positive) = P(positive | D) · P(D) / ( P(positive | D) · P(D) + P(positive | D^c) · P(D^c) )    (by 1.6)
                    = (0.96 · 1/145) / (0.96 · 1/145 + 0.06 · 144/145)
                    = 1 / (1 + (6/96) · 144) = 1/10.
Note: in only one out of ten cases, a person with a positive result actually has
D.
Another conditional probability of interest in this context is the probability of
not having D, once the test is negative, i.e., P(Dc | negative):
    P(D^c | negative) = P(negative | D^c) · P(D^c) / ( P(negative | D) · P(D) + P(negative | D^c) · P(D^c) )
                      = (0.94 · 144/145) / (0.04 · 1/145 + 0.94 · 144/145)
                      = (94 · 144) / (4 + 94 · 144) ≈ 0.9997.
Note: The two conditional probabilities interchange if the a priori probability
of not having D is low (e.g. 1/145). If the risk of having D is high and one wants
to test whether or not one has D, the a posteriori probability of not having D,
given that the test was negative, is only 0.1.
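The two a posteriori probabilities of this example can be recomputed directly from Bayes' rule; the following sketch uses exactly the numbers given in the text.

    p_D = 1 / 145
    p_pos_given_D = 0.96
    p_pos_given_notD = 0.06

    p_pos = p_pos_given_D * p_D + p_pos_given_notD * (1 - p_D)
    p_D_given_pos = p_pos_given_D * p_D / p_pos
    p_notD_given_neg = (1 - p_pos_given_notD) * (1 - p_D) / (
        (1 - p_pos_given_D) * p_D + (1 - p_pos_given_notD) * (1 - p_D))
    print(p_D_given_pos, p_notD_given_neg)   # 0.1 and about 0.9997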

Example 1.8 (computing total probabilities with conditional probabilities). Let
S be a finite set, Ω := S n+1 , n ∈ N, and P be a probability measure on Ω.
Let Xi : Ω → S, i = 0, . . . , n, be the canonical projections Xi (ω) := xi for
ω = (x0 , . . . , xn ).
If we interpret 0, 1, . . . , n as time points, then (Xi )06i6n may be seen as a
stochastic process and X0 (ω), . . . , Xn (ω) is said to be a sample path (or a
trajectory ) of the process.
For all ω ∈ Ω we either have P ({ω}) = 0 or

P({ω}) = P(X0 = x0 , . . . , Xn = xn )
= P(X0 = x0 , . . . , Xn−1 = xn−1 )
· P(Xn = xn | X0 = x0 , . . . , Xn−1 = xn−1 )
..
.
= P(X0 = x0 )
· P(X1 = x1 | X0 = x0 )
· P(X2 = x2 | X0 = x0 , X1 = x1 )
···
· P(Xn = xn | X0 = x0 , . . . , Xn−1 = xn−1 ).

Note: P({ω}) 6= 0 implies P(X0 = x0 , . . . , Xk = xk ) 6= 0 for all k ∈


{0, . . . , n}.
Conclusion: A probability measure P on Ω is uniquely determined by the
following:

Initial distribution: µ := P ◦ X0−1

Transition probabilities: the conditional distributions

P(Xk = xk | X0 = x0 , . . . , Xk−1 = xk−1 )

for any k ∈ {1, . . . , n} and (x0 , . . . , xk ) ∈ S (k+1) .

Existence of P for given initial distribution and given transition probabilities is


shown in Section 3.3.

Example 1.9. A stochastic process is called a Markov chain, if

P(Xk = xk | X0 = x0 , . . . , Xk−1 = xk−1 ) = P(Xk = xk | Xk−1 = xk−1 ),


i.e., if the transition probabilities for Xk only depend on Xk−1 .
If we interpret by Xk−1 the “present”, by Xk the “future” and by “X0 , . . . , Xk−2 ”
the past, then we can state the Markov property as: given the “present”, the
“future” of the Markov chain is independent of the “past”.
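For illustration, here is a minimal sketch of how a path x_0, . . . , x_n of a Markov chain on a finite state space can be sampled from an initial distribution µ and a transition matrix (cf. Example 1.8); the two-state µ and K below are arbitrary choices.

    import random

    def sample_path(mu, K, n, seed=7):
        rng = random.Random(seed)
        def draw(weights):
            u, acc = rng.random(), 0.0
            for state, w in enumerate(weights):
                acc += w
                if u <= acc:
                    return state
            return len(weights) - 1
        path = [draw(mu)]                   # x_0 ~ mu
        for _ in range(n):
            path.append(draw(K[path[-1]]))  # x_k ~ K(x_{k-1}, .)
        return path

    mu = [0.5, 0.5]                      # initial distribution on S = {0, 1}
    K = [[0.9, 0.1], [0.2, 0.8]]         # transition matrix (rows sum to 1)
    print(sample_path(mu, K, 20))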

2 Transition probabilities and Fubini’s


theorem
Let (S1 , S1 ) and (S2 , S2 ) be measurable spaces.
Definition 2.1. A mapping
K : S1 × S2 → [0, 1]
(x1 , A2 ) 7→ K(x1 , A2 )

is said to be a transition probability (from (S1 , S1 ) to (S2 , S2 )), if


(i) ∀x1 ∈ S1 : K(x1 , · ) is a probability measure on (S2 , S2 ).

(ii) ∀A2 ∈ S2 : K( · , A2 ) is S1 -measurable.


Example 2.2. (i) For given probability measure µ on (S2 , S2 ) define

K(x1 , · ) := µ ∀ x1 ∈ S1 no coupling!

(ii) Let T : S1 → S2 be a S1 /S2 -measurable mapping, and

K(x1 , · ) := δT (x1 ) ∀x1 ∈ S1 .

(iii) Stochastic matrices Let S1 , S2 be countable and Si = P(Si ), i = 1, 2.


In this case, any transition probability from (S1 , S1 ) to (S2 , S2 ) is given by

K(x1 , x2 ) := K x1 , {x2 } , x1 ∈ S1 , x2 ∈ S2 ,

where K : S_1 × S_2 → [0, 1] is a mapping such that Σ_{x_2∈S_2} K(x_1, x_2) = 1 for all
x_1 ∈ S_1. Consequently, K can be identified with a stochas-
tic matrix, or a transition matrix, i.e. a matrix with nonnegative entries
and row sums equal to one.

Example 2.3. (i) Transition probabilities of the random walk on Zd
S1 = S2 = S := Zd with S := P(Zd )

    K(x, ·) := (1/(2d)) Σ_{y∈N(x)} δ_y ,    x ∈ Z^d,

where

    N(x) := { y ∈ Z^d | ||x - y|| = 1 }


denotes the set of nearest neighbours of x.

(ii) Ehrenfest model: Consider a box containing N balls. The box is divided into two parts (“left” and “right”). A ball is selected randomly and put into the other half.
At the “microscopic level” the state space is S := {0, 1}^N, with x = (x1, . . . , xN) ∈ S defined by
$$x_i := \begin{cases} 1 & \text{if the } i\text{th ball is contained in the “left” half} \\ 0 & \text{if the } i\text{th ball is contained in the “right” half,} \end{cases}$$
and the transition probability for x = (x1, . . . , xN) ∈ S is given by
$$K(x, \cdot) := \frac{1}{N} \sum_{i=1}^{N} \delta_{(x_1, \dots, x_{i-1},\, 1 - x_i,\, x_{i+1}, \dots, x_N)}.$$
At the “macroscopic level” the state space is S := {0, . . . , N}, where j ∈ S denotes the number of balls contained in the left half. The transition probabilities are given by
$$K(j, \cdot) := \frac{N - j}{N} \cdot \delta_{j+1} + \frac{j}{N} \cdot \delta_{j-1}.$$
(iii) Transition probabilities of the Ornstein-Uhlenbeck process: S = S1 = S2 = R, K(x, ·) := N(αx, σ²) with α ∈ R, σ² > 0.
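The following sketch (Python, not part of the lecture notes; all function names are ad hoc) samples one step from two of these kernels, the simple random walk on Z^d and the macroscopic Ehrenfest chain:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_walk_step(x, rng):
    """One step of the random walk on Z^d: K(x, .) puts mass 1/(2d) on each nearest neighbour."""
    d = len(x)
    i = rng.integers(d)              # pick a coordinate uniformly
    y = np.array(x, dtype=int)
    y[i] += rng.choice([-1, 1])      # and a direction
    return y

def ehrenfest_step(j, N, rng):
    """Macroscopic Ehrenfest chain: j -> j+1 with prob. (N-j)/N, j -> j-1 with prob. j/N."""
    return j + 1 if rng.random() < (N - j) / N else j - 1

print(random_walk_step((0, 0, 0), rng))   # a nearest neighbour of the origin in Z^3
print(ehrenfest_step(10, 20, rng))        # 9 or 11
```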

We now turn to Fubini's theorem. To this end, let µ1 be a probability measure on (S1, S1) and K(·, ·) be a transition probability from (S1, S1) to (S2, S2).
Our aim is to construct a probability measure P (:= µ1 ⊗ K) on the product space (Ω, A), where
$$\Omega := S_1 \times S_2, \qquad A := S_1 \otimes S_2 := \sigma(X_1, X_2) = \sigma\big(\{A_1 \times A_2 \mid A_1 \in S_1,\, A_2 \in S_2\}\big),$$
and
$$X_i : \Omega = S_1 \times S_2 \to S_i, \quad (x_1, x_2) \mapsto x_i, \qquad i = 1, 2,$$
satisfying
$$P(A_1 \times A_2) = \int_{A_1} K(x_1, A_2)\, \mu_1(dx_1)$$
for all A1 ∈ S1 and A2 ∈ S2.
Proposition 2.4 (Fubini). Let µ1 be a probability measure on (S1, S1), K a transition probability from (S1, S1) to (S2, S2), and
$$\Omega := S_1 \times S_2, \tag{3.1}$$
$$A := \sigma\big(\{A_1 \times A_2 \mid A_i \in S_i\}\big) =: S_1 \otimes S_2. \tag{3.2}$$
Then there exists a probability measure P (=: µ1 ⊗ K) on (Ω, A) such that for all A-measurable functions f ≥ 0
$$\int f \, dP = \int \left( \int f(x_1, x_2)\, K(x_1, dx_2) \right) \mu_1(dx_1), \tag{3.3}$$
in particular, for all A ∈ A
$$P(A) = \int K(x_1, A_{x_1})\, \mu_1(dx_1). \tag{3.4}$$
Here
$$A_{x_1} = \{x_2 \in S_2 \mid (x_1, x_2) \in A\}$$
is called the section of A by x1. Furthermore, for A1 ∈ S1, A2 ∈ S2:
$$P(A_1 \times A_2) = \int_{A_1} K(x_1, A_2)\, \mu_1(dx_1), \tag{3.5}$$
and P is uniquely determined by (3.5).
[Figure: a set A ⊂ S1 × S2, with S1 on the horizontal and S2 on the vertical axis; the section of A by x1 consists of the three pieces A_{x_1}^{(1)}, A_{x_1}^{(2)}, A_{x_1}^{(3)}, the section by x2 of the two pieces A_{x_2}^{(1)}, A_{x_2}^{(2)}.]

Note:
$$A_{x_1} = A_{x_1}^{(1)} \cup A_{x_1}^{(2)} \cup A_{x_1}^{(3)} \quad \text{and} \quad A_{x_2} = A_{x_2}^{(1)} \cup A_{x_2}^{(2)}.$$
Proof. Uniqueness: Clearly, the collection of cylindrical sets A1 × A2 with Ai ∈ Si is stable under intersections and generates A, so that the uniqueness now follows from Proposition 1.11.5.
Existence: For given x1 ∈ S1 let
$$\varphi_{x_1}(x_2) := (x_1, x_2).$$
φ_{x_1} : S2 → Ω is measurable, because for A1 ∈ S1, A2 ∈ S2
$$\varphi_{x_1}^{-1}(A_1 \times A_2) = \begin{cases} \emptyset & \text{if } x_1 \notin A_1 \\ A_2 & \text{if } x_1 \in A_1. \end{cases}$$
It follows that for any A-measurable f : Ω → R and any x1 ∈ S1, the mapping
$$f_{x_1} := f \circ \varphi_{x_1} : S_2 \to \mathbb{R}, \quad x_2 \mapsto f(x_1, x_2),$$
is S2 /B(R)-measurable.
Suppose now that f ≥ 0 is A-measurable. Then
$$x_1 \mapsto \int f(x_1, x_2)\, K(x_1, dx_2) = \int f_{x_1}(x_2)\, K(x_1, dx_2) \tag{3.6}$$
is well-defined.
We will show in the following that this function is S1-measurable. We prove the assertion for f = 1_A, A ∈ A, first. For general f the measurability then follows by measure-theoretic induction.
Note that for f = 1_A we have that
$$\int \underbrace{1_A(x_1, x_2)}_{=\, 1_{A_{x_1}}(x_2)} K(x_1, dx_2) = K(x_1, A_{x_1}).$$
Hence, in the following we consider
$$D := \big\{ A \in A \;\big|\; x_1 \mapsto K(x_1, A_{x_1}) \text{ is } S_1\text{-measurable} \big\}.$$
D is a Dynkin system (!) and contains all cylindrical sets A = A1 × A2 with Ai ∈ Si, because
$$K\big(x_1, (A_1 \times A_2)_{x_1}\big) = 1_{A_1}(x_1) \cdot K(x_1, A_2).$$
Since measurable cylindrical sets are stable under intersections, we conclude that D = A.
It follows that for all nonnegative A-measurable functions f : Ω → R, the integral
$$\int \left( \int f(x_1, x_2)\, K(x_1, dx_2) \right) \mu(dx_1)$$
is well-defined.
For all A ∈ A we can now define
$$P(A) := \int \left( \int \underbrace{1_A(x_1, x_2)}_{=\, 1_{A_{x_1}}(x_2)} K(x_1, dx_2) \right) \mu(dx_1) = \int K(x_1, A_{x_1})\, \mu(dx_1).$$
P is a probability measure on (Ω, A), because
$$P(\Omega) = \int K(x_1, S_2)\, \mu(dx_1) = \int 1\, \mu(dx_1) = 1.$$
For the proof of the σ-additivity, let A1, A2, . . . be pairwise disjoint subsets in A. It follows that for all x1 ∈ S1 the subsets (A1)_{x_1}, (A2)_{x_1}, . . . are pairwise disjoint too, hence
$$P\Big(\bigcup_{n \in \mathbb{N}} A_n\Big) = \int K\Big(x_1, \Big(\bigcup_{n \in \mathbb{N}} A_n\Big)_{x_1}\Big)\, \mu(dx_1) = \int \sum_{n=1}^{\infty} K\big(x_1, (A_n)_{x_1}\big)\, \mu(dx_1) = \sum_{n=1}^{\infty} \int K\big(x_1, (A_n)_{x_1}\big)\, \mu(dx_1) = \sum_{n=1}^{\infty} P(A_n).$$
In the second equality we used that K(x1, ·) is a probability measure for all x1, and in the third equality we used monotone integration.
Finally, (3.3) follows from measure-theoretic induction.
2.1 Examples and Applications
Remark 2.5. The classical Fubini theorem is a particular case of Proposi-
tion 2.4: K(x1 , · ) = µ2 . In this case, the measure µ1 ⊗ K, constructed in
Fubini’s theorem, is called the product measure of µ1 and µ2 and is denoted by
µ1 ⊗ µ2 . Moreover, in this case
$$\int f \, dP = \int \left( \int f(x_1, x_2)\, \mu_2(dx_2) \right) \mu_1(dx_1).$$
Remark 2.6 (Marginal distributions). Let Xi : Ω → Si, i = 1, 2, be the natural projections Xi(x1, x2) := xi. The distributions of Xi under the measure µ1 ⊗ K are called the marginal distributions and they are given by
$$(P \circ X_1^{-1})(A_1) = P(X_1 \in A_1) = P(A_1 \times S_2) = \int_{A_1} \underbrace{K(x_1, S_2)}_{=1}\, \mu_1(dx_1) = \mu_1(A_1)$$
and
$$(P \circ X_2^{-1})(A_2) = P(X_2 \in A_2) = P(S_1 \times A_2) = \int K(x_1, A_2)\, \mu_1(dx_1) =: (\mu_1 K)(A_2).$$
So, the marginal distributions are
$$P \circ X_1^{-1} = \mu_1, \qquad P \circ X_2^{-1} = \mu_1 K.$$
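For finite S1 and S2 the construction of µ1 ⊗ K and its marginals reduces to elementary matrix operations. The following sketch (Python, not part of the lecture notes; the concrete numbers are made up) builds P[x1, x2] = µ1({x1}) · K(x1, {x2}) and checks that the marginals are µ1 and µ1K:

```python
import numpy as np

# Hypothetical finite example: S1 = {0, 1}, S2 = {0, 1, 2}.
mu1 = np.array([0.3, 0.7])               # probability measure on S1
K = np.array([[0.2, 0.5, 0.3],           # K[x1, x2] = K(x1, {x2});
              [0.6, 0.1, 0.3]])          # each row is a probability measure on S2

P = mu1[:, None] * K                     # joint measure mu1 (x) K on S1 x S2

print(P.sum())                           # 1.0: P is a probability measure
print(P.sum(axis=1), mu1)                # marginal of X1 equals mu1
print(P.sum(axis=0), mu1 @ K)            # marginal of X2 equals mu1 K
```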

Definition 2.7. Let S1 = S2 = S and S1 = S2 = S. A probability measure µ on (S, S) is said to be an equilibrium distribution for K (or invariant distribution under K) if µ = µK.
Example 2.8. (i) Ehrenfest model (macroscopic): Let S = {0, 1, . . . , N} and
$$K(y, \cdot) = \frac{y}{N} \cdot \delta_{y-1} + \frac{N - y}{N} \cdot \delta_{y+1}.$$
In this case, the binomial distribution µ with parameters N and 1/2 is an equilibrium distribution, because for any x ∈ S
$$\begin{aligned}
(\mu K)(\{x\}) &= \sum_{y \in S} \mu(\{y\}) \cdot K(y, \{x\}) = \mu(\{x+1\}) \cdot \frac{x+1}{N} + \mu(\{x-1\}) \cdot \frac{N-(x-1)}{N} \\
&= 2^{-N} \binom{N}{x+1} \cdot \frac{x+1}{N} + 2^{-N} \binom{N}{x-1} \cdot \frac{N-(x-1)}{N} \\
&= 2^{-N} \left[ \binom{N-1}{x} + \binom{N-1}{x-1} \right] = 2^{-N} \cdot \binom{N}{x} = \mu(\{x\}).
\end{aligned}$$
(ii) Ornstein-Uhlenbeck process: Let S = R and K(x, ·) = N(αx, σ²) with |α| < 1. Then
$$\mu = N\Big(0, \frac{\sigma^2}{1 - \alpha^2}\Big)$$
is an equilibrium distribution. (Exercise.)
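Both invariance claims can be checked numerically. The following sketch (Python, not part of the lecture notes; the Ornstein-Uhlenbeck part is only an approximate Monte Carlo check) verifies µK = µ for the Ehrenfest chain and compares variances for the Ornstein-Uhlenbeck kernel:

```python
import numpy as np
from math import comb

# (i) Ehrenfest (macroscopic): mu = Binomial(N, 1/2) satisfies mu K = mu.
N = 20
mu = np.array([comb(N, x) * 2.0**(-N) for x in range(N + 1)])
K = np.zeros((N + 1, N + 1))
for y in range(N + 1):
    if y > 0:
        K[y, y - 1] = y / N          # K(y, {y-1}) = y/N
    if y < N:
        K[y, y + 1] = (N - y) / N    # K(y, {y+1}) = (N-y)/N
print(np.allclose(mu @ K, mu))       # True

# (ii) Ornstein-Uhlenbeck: if X ~ N(0, s^2/(1-a^2)) and X' ~ N(aX, s^2),
# then X' has variance s^2/(1-a^2) again.
rng = np.random.default_rng(1)
a, s = 0.8, 1.0
x = rng.normal(0.0, s / np.sqrt(1 - a**2), size=10**6)
x_next = rng.normal(a * x, s)
print(x_next.var(), s**2 / (1 - a**2))   # both close to 2.78
```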

We now turn to the converse problem: given a probability measure P on the product space (Ω, A), can we “disintegrate” P, i.e., can we find a probability measure µ1 on (S1, S1) and a transition probability K from S1 to S2 such that
$$P = \mu_1 \otimes K\,?$$
Answer: In most cases yes, e.g. if S1 and S2 are Polish spaces (i.e., topological spaces with a countable basis whose topology is induced by some complete metric), using conditional expectations (see below).
Example 2.9. In the particular case when S1 is countable (and S1 = P(S1)), we can disintegrate P explicitly as follows. Necessarily, µ1 has to be the distribution of the projection X1 onto the first coordinate. To define the kernel K, let ν be any probability measure on (S2, S2) and define
$$K(x_1, A_2) := \begin{cases} P(X_2 \in A_2 \mid X_1 = x_1) & \text{if } \underbrace{\mu_1(\{x_1\})}_{=\,P(X_1 = x_1)} > 0 \\[2mm] \nu(A_2) & \text{if } \mu_1(\{x_1\}) = 0. \end{cases}$$
Then
$$\begin{aligned}
P(A_1 \times A_2) &= P(X_1 \in A_1, X_2 \in A_2) = \sum_{x_1 \in A_1} P(X_1 = x_1, X_2 \in A_2) \\
&= \sum_{\substack{x_1 \in A_1, \\ \mu_1(\{x_1\}) > 0}} P(X_1 = x_1) \cdot P(X_2 \in A_2 \mid X_1 = x_1) \\
&= \sum_{x_1 \in A_1} \mu_1(\{x_1\}) \cdot K(x_1, A_2) = \int_{A_1} K(x_1, A_2)\, \mu_1(dx_1) = (\mu_1 \otimes K)(A_1 \times A_2),
\end{aligned}$$
hence P = µ1 ⊗ K.
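A sketch of this disintegration for a finite joint distribution (Python, not part of the lecture notes; the joint matrix and ν below are arbitrary made-up choices) recovers µ1 and K from P and reassembles P = µ1 ⊗ K:

```python
import numpy as np

# Hypothetical joint distribution P on S1 x S2 with S1 = {0, 1, 2}, S2 = {0, 1}.
P = np.array([[0.10, 0.20],
              [0.00, 0.00],            # mu1({1}) = 0
              [0.30, 0.40]])

mu1 = P.sum(axis=1)                    # distribution of X1
nu = np.array([0.5, 0.5])              # arbitrary probability measure on S2

# K(x1, {x2}) = P(X2 = x2 | X1 = x1) where mu1({x1}) > 0, and nu(.) otherwise.
safe = np.where(mu1[:, None] > 0, mu1[:, None], 1.0)
K = np.where(mu1[:, None] > 0, P / safe, nu)

print(np.allclose(mu1[:, None] * K, P))   # True: P = mu1 (x) K
```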

3 The canonical model for the evolution of a stochastic system in discrete time

Consider the following situation: suppose we are given
• measurable spaces (Si, Si), i = 0, 1, 2, . . ., and we define
$$S^n := S_0 \times S_1 \times \cdots \times S_n, \qquad \mathcal{S}^n := \mathcal{S}_0 \otimes \mathcal{S}_1 \otimes \cdots \otimes \mathcal{S}_n = \sigma\big(\{A_0 \times \cdots \times A_n \mid A_i \in \mathcal{S}_i\}\big),$$

• – an initial distribution µ0 on (S0, S0),
  – transition probabilities $K_n\big((x_0, \dots, x_{n-1}), dx_n\big)$ from $(S^{n-1}, \mathcal{S}^{n-1})$ to $(S_n, \mathcal{S}_n)$, n = 1, 2, . . ..
Using Fubini's theorem, we can then define probability measures P^n on S^n, n = 0, 1, 2, . . ., as follows:
$$P^0 := \mu_0 \ \text{ on } S_0, \qquad P^n := P^{n-1} \otimes K_n \ \text{ on } S^n = S^{n-1} \times S_n.$$
Note that Fubini's theorem (see Proposition 2.4) implies that for any $\mathcal{S}^n$-measurable function f : S^n → R_+:
$$\begin{aligned}
\int f \, dP^n &= \int P^{n-1}\big(d(x_0, \dots, x_{n-1})\big) \int K_n\big((x_0, \dots, x_{n-1}), dx_n\big)\, f(x_0, \dots, x_{n-1}, x_n) \\
&= \cdots \\
&= \int \mu_0(dx_0) \int K_1(x_0, dx_1) \cdots \int K_n\big((x_0, \dots, x_{n-1}), dx_n\big)\, f(x_0, \dots, x_n).
\end{aligned}$$
3.1 The canonical model
Let Ω := S0 × S1 × · · · be the set of all paths (or trajectories) ω = (x0, x1, . . .) with xi ∈ Si, and
$$X_n(\omega) := x_n \quad \text{(projection onto the } n\text{th coordinate)},$$
$$A_n := \sigma(X_0, \dots, X_n) \subset A, \qquad A := \sigma(X_0, X_1, \dots) = \sigma\Big(\bigcup_{n=1}^{\infty} A_n\Big).$$
Our main goal in this section is to construct a probability measure P on (Ω, A) satisfying
$$\int f(X_0, \dots, X_n)\, dP = \int f \, dP^n \qquad \forall\, n = 0, 1, 2, \dots,$$
i.e. the “finite dimensional distributions” P ◦ (X0, . . . , Xn)^{-1} of P, that is, the joint distributions of X0, . . . , Xn under P, are given by P^n for any n ≥ 0.
Proposition 3.1 (Ionescu-Tulcea). There exists a unique probability measure P on (Ω, A) such that for all n ≥ 0 and all $\mathcal{S}^n$-measurable functions f : S^n → R_+:
$$\int_{S^n} f \, dP_{(X_0, \dots, X_n)} = \int_{\Omega} f(X_0, \dots, X_n)\, dP = \int_{S^n} f \, dP^n. \tag{3.7}$$
In other words: there exists a unique P on (Ω, A) such that P^n = P ◦ (X0, . . . , Xn)^{-1}.
Proof. Uniqueness: Obvious, because the collection of finite cylindrical subsets
$$E := \Big\{ \bigcap_{i=0}^{n} \{X_i \in A_i\} \;\Big|\; n \geq 0,\ A_i \in \mathcal{S}_i \Big\}$$
is closed under intersections and generates A.

Existence: Let A ∈ An , hence

A = (X0 , . . . , Xn )−1 (An ) for some An ∈ Sn

In particular 1A = 1An (X0 , . . . , Xn ). Therefore, in order to have (3.7) we must


define

P(A) := Pn (An ) . (3.8)

We have to check that P is well-defined. Indeed, since A ∈ An ⊂ An+1 , we can


also write A = (X0 , . . . , Xn+1 )−1 (An+1 ) where An+1 = An × Sn+1 ∈ An+1 .
But

Pn+1 (An+1 ) = Pn+1 (An × Sn+1 )


Z
Kn+1 (x0 , . . . , xn ), Sn+1 dPn = Pn (An ) .

=
An | {z }
=1

It follows that P is well-defined by (3.8) on $B = \bigcup_{n=0}^{\infty} A_n$. B is an algebra (i.e., a collection of subsets of Ω containing Ω that is closed under complements and finite (!) unions), and P is finitely additive on B, since P is (σ-)additive on A_n for every n. To extend P to a σ-additive probability measure on A = σ(B) with the help of Carathéodory's extension theorem, it suffices now to show that P is ∅-continuous, i.e., the following condition is satisfied:
$$B_n \in B, \quad B_n \searrow \emptyset \quad \Longrightarrow \quad P(B_n) \xrightarrow{\,n \to \infty\,} 0.$$
(For Carathéodory's extension theorem see textbooks on measure theory, or Theorem 1.41 in Klenke.)
W.l.o.g. B_0 = Ω and B_n ∈ A_n (if B_n ∈ A_m \ A_n with m > n, just repeat B_{n−1} m times!). Then
$$B_n = A^n \times S_{n+1} \times S_{n+2} \times \cdots \quad \text{for some } A^n \in \mathcal{S}^n$$
with
$$A^{n+1} \subset A^n \times S_{n+1},$$
and we have to show that
$$P(B_n) = P^n(A^n) \xrightarrow{\,n \to \infty\,} 0$$
(i.e., $\inf_n P^n(A^n) = 0$).

In order to do so, we will assume

inf Pn (An ) > 0 .


n∈N

and show that this implies



\
Bn 6= ∅ .
n=0

Note that
$$P^n(A^n) = \int \mu_0(dx_0)\, f_{0,n}(x_0)$$
with
$$f_{0,n}(x_0) := \int K_1(x_0, dx_1) \cdots \int K_n\big((x_0, \dots, x_{n-1}), dx_n\big)\, 1_{A^n}(x_0, \dots, x_n).$$
It is easy to see that the sequence (f_{0,n})_{n∈N} is decreasing, because
$$\int K_{n+1}\big((x_0, \dots, x_n), dx_{n+1}\big)\, 1_{A^{n+1}}(x_0, \dots, x_{n+1}) \leq \int K_{n+1}\big((x_0, \dots, x_n), dx_{n+1}\big)\, 1_{A^n \times S_{n+1}}(x_0, \dots, x_{n+1}) = 1_{A^n}(x_0, \dots, x_n),$$
hence
$$f_{0,n+1}(x_0) = \int K_1(x_0, dx_1) \cdots \int K_{n+1}\big((x_0, \dots, x_n), dx_{n+1}\big)\, 1_{A^{n+1}}(x_0, \dots, x_{n+1}) \leq \int K_1(x_0, dx_1) \cdots \int K_n\big((x_0, \dots, x_{n-1}), dx_n\big)\, 1_{A^n}(x_0, \dots, x_n) = f_{0,n}(x_0).$$
In particular, using Lebesgue's theorem,
$$\int \inf_{n \in \mathbb{N}} f_{0,n}\, d\mu_0 = \inf_{n \in \mathbb{N}} \int f_{0,n}\, d\mu_0 = \inf_{n \in \mathbb{N}} P^n(A^n) > 0.$$
Therefore we can find some x̄0 ∈ S0 with
$$\inf_{n \in \mathbb{N}} f_{0,n}(\bar{x}_0) > 0.$$
On the other hand we can write
$$f_{0,n}(\bar{x}_0) = \int K_1(\bar{x}_0, dx_1)\, f_{1,n}(x_1)$$
with
$$f_{1,n}(x_1) := \int K_2\big((\bar{x}_0, x_1), dx_2\big) \cdots \int K_n\big((\bar{x}_0, x_1, \dots, x_{n-1}), dx_n\big)\, 1_{A^n}(\bar{x}_0, x_1, \dots, x_n).$$
Using the same argument as above (now with µ1 = K1(x̄0, ·)) we can find some x̄1 ∈ S1 with
$$\inf_{n \in \mathbb{N}} f_{1,n}(\bar{x}_1) > 0.$$
Iterating this procedure, we find for any i = 0, 1, . . . some x̄i ∈ Si such that for all m ≥ 1
$$\inf_{n \in \mathbb{N}} \int K_m\big((\bar{x}_0, \dots, \bar{x}_{m-1}), dx_m\big) \cdots \int K_n\big((\bar{x}_0, \dots, \bar{x}_{m-1}, x_m, \dots, x_{n-1}), dx_n\big)\, 1_{A^n}(\bar{x}_0, \dots, \bar{x}_{m-1}, x_m, \dots, x_n) > 0.$$
In particular, if m = n,
$$0 < \int K_m\big((\bar{x}_0, \dots, \bar{x}_{m-1}), dx_m\big)\, 1_{A^m}(\bar{x}_0, \dots, \bar{x}_{m-1}, x_m) \leq 1_{A^{m-1}}(\bar{x}_0, \dots, \bar{x}_{m-1}),$$
so that (x̄0, . . . , x̄_{m−1}) ∈ A^{m−1} for all m ≥ 1. Thus
$$\bar{\omega} := (\bar{x}_0, \bar{x}_1, \dots) \in B_{m-1} = A^{m-1} \times S_m \times S_{m+1} \times \cdots \quad \forall\, m \geq 1,$$
i.e.
$$\bar{\omega} \in \bigcap_{m=0}^{\infty} B_m.$$
Hence the assertion is proven.
Definition 3.2. Suppose that (Si, Si) = (S, S) for all i = 0, 1, 2, . . .. Then (Xn)_{n≥0} on (Ω, A, P) (with P as in the previous proposition) is called a stochastic process (in discrete time) with state space (S, S), initial distribution µ0 and transition probabilities (Kn(·, ·))_{n∈N}.
3.2 Examples
1) Infinite product measures: Consider the situation of Definition 3.2 and let
$$K_n\big((x_0, \dots, x_{n-1}), \cdot\big) = \mu_n, \quad n \in \mathbb{N} \quad \text{(independent of } (x_0, \dots, x_{n-1})\text{!)}$$
Then
$$\bigotimes_{n=0}^{\infty} \mu_n := P$$
is called the product measure associated with µ0, µ1, . . . .
For all n ≥ 0 and A0, . . . , An ∈ S we have that
$$P(X_0 \in A_0, \dots, X_n \in A_n) \overset{\text{I.-T.}}{=} P^n(A_0 \times \cdots \times A_n) = \int \mu_0(dx_0) \int \mu_1(dx_1) \cdots \int \mu_n(dx_n)\, 1_{A_0 \times \cdots \times A_n}(x_0, \dots, x_n) = \mu_0(A_0) \cdot \mu_1(A_1) \cdots \mu_n(A_n).$$
In particular, P_{X_n} = µn for all n, and the canonical projections X0, X1, . . . are independent. We thus have the following:
Proposition 3.3. Let (µn)_{n≥0} be a sequence of probability measures on a measurable space (S, S). Then there exists a probability space (Ω, A, P) and a sequence (Xn)_{n≥0} of independent r.v. with P_{X_n} = µn for all n ≥ 0.
We have thus proven in particular the existence of a probability space modelling infinitely many independent 0-1-experiments!
2) Markov chains:
time-inhomogeneous, if
$$K_n\big((x_0, \dots, x_{n-1}), \cdot\big) = \tilde{K}_n(x_{n-1}, \cdot);$$
time-homogeneous, if in addition K̃n = K for all n.
For a given initial distribution µ and transition probabilities (K̃n)_{n≥1} (resp. K) there exists a unique probability measure P on (Ω, A), which is said to be the canonical model for the time evolution of a time-inhomogeneous (resp. time-homogeneous) Markov chain.
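In the time-homogeneous case with a finite state space, the canonical model can be simulated directly. A minimal sketch (Python, not part of the lecture notes; the initial distribution and transition matrix are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_path(mu, K, n, rng):
    """Sample (X_0, ..., X_n): X_0 ~ mu, then X_k ~ K(X_{k-1}, .) for k = 1, ..., n."""
    states = np.arange(len(mu))
    path = [rng.choice(states, p=mu)]
    for _ in range(n):
        path.append(rng.choice(states, p=K[path[-1]]))
    return path

mu = np.array([1.0, 0.0, 0.0])           # start in state 0
K = np.array([[0.5, 0.5, 0.0],           # K[x, y] = K(x, {y})
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
print(sample_path(mu, K, 10, rng))
```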

Example 3.4. Let S = R, β > 0, x0 ∈ R \ {0}, µ0 = δ_{x_0} and K(x, ·) = N(0, βx²) for x ≠ 0, K(0, ·) = δ0. For which β does the sequence (Xn) converge and what is its limit?
For n ≥ 1
$$E[X_n^2] \overset{\text{I.-T.}}{=} \int x_n^2 \, P^n\big(d(x_0, \dots, x_n)\big) = \int \underbrace{\left( \int x_n^2 \, K(x_{n-1}, dx_n) \right)}_{=\, \beta x_{n-1}^2, \text{ since } K(x_{n-1}, dx_n) = N(0, \beta x_{n-1}^2)} P^{n-1}\big(d(x_0, \dots, x_{n-1})\big) = \beta \cdot E[X_{n-1}^2] = \cdots = \beta^n x_0^2.$$
If β < 1 it follows that
$$E\Big[\sum_{n=1}^{\infty} X_n^2\Big] = \sum_{n=1}^{\infty} E[X_n^2] = \sum_{n=1}^{\infty} \beta^n x_0^2 < \infty,$$
hence $\sum_{n=1}^{\infty} X_n^2 < \infty$ P-a.s., and therefore
$$\lim_{n \to \infty} X_n = 0 \quad P\text{-a.s.}$$
A similar calculation as above for the first absolute moment yields
$$E\big[|X_n|\big] = \cdots = \sqrt{\tfrac{2}{\pi} \cdot \beta} \cdot E\big[|X_{n-1}|\big] = \cdots = \Big(\tfrac{2}{\pi} \cdot \beta\Big)^{\frac{n}{2}} \cdot \underbrace{E\big[|X_0|\big]}_{=\,|x_0|},$$
because
$$\int |x_n|\, K(x_{n-1}, dx_n) = \sqrt{\tfrac{2}{\pi} \cdot \beta}\, |x_{n-1}|.$$
Consequently,
$$E\Big[\sum_{n=1}^{\infty} |X_n|\Big] = \sum_{n=1}^{\infty} \Big(\tfrac{2}{\pi} \cdot \beta\Big)^{\frac{n}{2}} \cdot |x_0|,$$
so that also for β < π/2:
$$\lim_{n \to \infty} X_n = 0 \quad P\text{-a.s.}$$
In fact, if we define
$$\beta_0 := \exp\left( -\frac{4}{\sqrt{2\pi}} \int_0^{\infty} \log x \cdot e^{-\frac{x^2}{2}}\, dx \right) = 2e^C \approx 3.56,$$
where
$$C := \lim_{n \to \infty} \Big( \sum_{k=1}^{n} \frac{1}{k} - \log n \Big) \approx 0.577$$
denotes the Euler-Mascheroni constant, it follows that
$$\forall\, \beta < \beta_0: \quad X_n \xrightarrow{\,n \to \infty\,} 0 \quad P\text{-a.s. with exponential rate,}$$
$$\forall\, \beta > \beta_0: \quad |X_n| \xrightarrow{\,n \to \infty\,} \infty \quad P\text{-a.s. with exponential rate.}$$
Proof. It is easy to see that for all n: Xn ≠ 0 P-a.s. For n ∈ N we can then define
$$Y_n := \begin{cases} \dfrac{X_n}{X_{n-1}} & \text{on } \{X_{n-1} \neq 0\} \\[2mm] 0 & \text{on } \{X_{n-1} = 0\}. \end{cases}$$
Then Y1, Y2, . . . are independent r.v. with (identical) distribution N(0, β), because for all measurable functions f : R^n → R_+
$$\begin{aligned}
\int f(Y_1, \dots, Y_n)\, dP
&\overset{\text{I.-T.}}{=} \int f\Big(\frac{x_1}{x_0}, \dots, \frac{x_n}{x_{n-1}}\Big) \cdot \Big(\frac{1}{2\pi\beta}\Big)^{\frac{n}{2}} \cdot \Big(\frac{1}{x_0^2 \cdots x_{n-1}^2}\Big)^{\frac{1}{2}} \cdot \exp\Big(-\frac{x_1^2}{2\beta x_0^2} - \cdots - \frac{x_n^2}{2\beta x_{n-1}^2}\Big)\, dx_1 \dots dx_n \\
&= \int f(y_1, \dots, y_n) \cdot \Big(\frac{1}{2\pi\beta}\Big)^{\frac{n}{2}} \cdot \exp\Big(-\frac{y_1^2 + \cdots + y_n^2}{2\beta}\Big)\, dy_1 \dots dy_n.
\end{aligned}$$
Note that
$$|X_n| = |x_0| \cdot |Y_1| \cdots |Y_n|$$
and thus
$$\frac{1}{n} \cdot \log|X_n| = \frac{1}{n} \cdot \log|x_0| + \frac{1}{n} \sum_{i=1}^{n} \log|Y_i|.$$
Note that (log|Yi|)_{i∈N} are independent and identically distributed with
$$E\big[\log|Y_i|\big] = 2 \cdot \frac{1}{\sqrt{2\pi\beta}} \int_0^{\infty} \log x \cdot e^{-\frac{x^2}{2\beta}}\, dx.$$
Kolmogorov's law of large numbers now implies that
$$\lim_{n \to \infty} \frac{1}{n} \cdot \log|X_n| = \frac{2}{\sqrt{2\pi\beta}} \int_0^{\infty} \log x \cdot e^{-\frac{x^2}{2\beta}}\, dx \quad P\text{-a.s.}$$
Consequently, |Xn| → 0 with exponential rate as n → ∞ if this limit is < 0, and |Xn| → ∞ with exponential rate if it is > 0.
Note that
$$\frac{2}{\sqrt{2\pi\beta}} \int_0^{\infty} \log x \cdot e^{-\frac{x^2}{2\beta}}\, dx \;\overset{y = x/\sqrt{\beta}}{=}\; \frac{2}{\sqrt{2\pi}} \int_0^{\infty} \log(\sqrt{\beta}\, y) \cdot e^{-\frac{y^2}{2}}\, dy = \frac{1}{2} \cdot \log \beta + \frac{2}{\sqrt{2\pi}} \int_0^{\infty} \log y \cdot e^{-\frac{y^2}{2}}\, dy \;<\; 0 \quad \Longleftrightarrow \quad \beta < \beta_0.$$
It remains to check that
$$-\frac{4}{\sqrt{2\pi}} \int_0^{\infty} \log x \cdot e^{-\frac{x^2}{2}}\, dx = \log 2 + C,$$
where C is the Euler-Mascheroni constant (Exercise!).
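A Monte Carlo sketch of this dichotomy (Python, not part of the lecture notes; it only checks the statement empirically): it simulates X_n = x_0 · Y_1 ··· Y_n with Y_i i.i.d. N(0, β) and compares (1/n) log|X_n| with the limit (1/2) log(β/β_0) obtained above:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rate(beta, n=200_000):
    """(1/n) log|X_n| for X_n = Y_1 * ... * Y_n with Y_i ~ N(0, beta) and x_0 = 1."""
    y = rng.normal(0.0, np.sqrt(beta), size=n)
    return np.log(np.abs(y)).mean()

gamma = 0.5772156649                      # Euler-Mascheroni constant
beta0 = 2 * np.exp(gamma)                 # approx. 3.56
for beta in (1.0, beta0, 5.0):
    print(beta, empirical_rate(beta), 0.5 * np.log(beta / beta0))
# negative rate (decay) for beta < beta0, roughly 0 at beta0, positive (growth) above beta0
```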
Example 3.5. Consider independent 0-1-experiments with success probability p ∈ [0, 1], but suppose that p is unknown. In the canonical model:
$$S_i := \{0, 1\},\ i \in \mathbb{N}; \qquad \Omega := \{0, 1\}^{\mathbb{N}};$$
$$X_i : \Omega \to \{0, 1\},\ i \in \mathbb{N}, \text{ projections};$$
$$\mu_i := p\delta_1 + (1 - p)\delta_0,\ i \in \mathbb{N}; \qquad P_p := \bigotimes_{i=1}^{\infty} \mu_i.$$
A_n and A are defined as above. Since p is unknown, we choose an a priori distribution µ on ([0, 1], B([0, 1])) (as a distribution for the unknown parameter p).
Claim: K(p, ·) := P_p(·) is a transition probability from ([0, 1], B([0, 1])) to (Ω, A).
Proof. We only need to show that for given A ∈ A the mapping p ↦ P_p(A) is measurable on [0, 1]. To this end define
$$D := \big\{ A \in A \;\big|\; p \mapsto P_p(A) \text{ is } B([0,1])\text{-measurable} \big\}.$$
Then D is a Dynkin system and contains all finite cylindrical sets
$$\{X_1 = x_1, \dots, X_n = x_n\}, \quad n \in \mathbb{N},\ x_1, \dots, x_n \in \{0, 1\},$$
because
$$P_p(X_1 = x_1, \dots, X_n = x_n) = p^{\sum_{i=1}^n x_i} (1 - p)^{n - \sum_{i=1}^n x_i}$$
is measurable (even continuous) in p!
The claim now follows from the fact that the finite cylindrical sets are closed under intersections and generate A.
Let P̄ := µ ⊗ K on Ω̄ := [0, 1] × Ω with B([0, 1]) ⊗ A. Using Remark 2.6 it follows that P̄ has marginal distributions µ and
$$P(\cdot) := \int P_p(\cdot)\, \mu(dp) \tag{3.9}$$
on (Ω, A). The integral can be seen as a mixture of the P_p according to the a priori distribution µ.
Note: The Xi are no longer independent under P!
We now calculate the initial distribution PX1 and the transition probabilities in
the particular case where µ is the Lebesgue measure (i.e., the uniform distribu-
tion on the unknown parameter p):
$$P \circ X_1^{-1} = \int \big( p\delta_1 + (1 - p)\delta_0 \big)(\cdot)\, \mu(dp) = \int p\, \mu(dp) \cdot \delta_1 + \int (1 - p)\, \mu(dp) \cdot \delta_0 = \frac{1}{2} \cdot \delta_1 + \frac{1}{2} \cdot \delta_0.$$
For given n ∈ N and x1, . . . , xn ∈ {0, 1} with $k := \sum_{i=1}^n x_i$ it follows that
$$\begin{aligned}
P(X_{n+1} = 1 \mid X_1 = x_1, \dots, X_n = x_n)
&= \frac{P(X_{n+1} = 1, X_n = x_n, \dots, X_1 = x_1)}{P(X_n = x_n, \dots, X_1 = x_1)} \overset{(3.9)}{=} \frac{\int p^{k+1}(1 - p)^{n-k}\, \mu(dp)}{\int p^k (1 - p)^{n-k}\, \mu(dp)} \\
&= \frac{\Gamma(k+2)\Gamma(n-k+1)}{\Gamma(n+3)} \cdot \frac{\Gamma(n+2)}{\Gamma(k+1)\Gamma(n-k+1)} = \frac{k+1}{n+2} \\
&= \underbrace{\Big(1 - \frac{n}{n+2}\Big) \cdot \frac{1}{2} + \frac{n}{n+2} \cdot \frac{k}{n}}_{\text{convex combination}}.
\end{aligned}$$
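This rule of succession can be checked by simulation. The sketch below (Python, not part of the lecture notes) draws p uniformly, then a 0-1 sequence under P_p, and estimates P(X_{n+1} = 1 | k successes among X_1, . . . , X_n):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, trials = 5, 3, 200_000

hits = total = 0
for _ in range(trials):
    p = rng.random()                     # a priori: p ~ Uniform[0, 1]
    x = rng.random(n + 1) < p            # X_1, ..., X_{n+1} i.i.d. Bernoulli(p) given p
    if x[:n].sum() == k:                 # condition on k successes among the first n
        total += 1
        hits += x[n]
print(hits / total, (k + 1) / (n + 2))   # both approx. 4/7 = 0.571
```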

Proposition 3.6. Let P be a probability measure on (Ω, A) (“canonical model”, see Section 3.1), and
$$\mu_n := P \circ X_n^{-1}, \quad n \in \mathbb{N}_0.$$
Then:
$$X_n,\ n \in \mathbb{N}_0, \text{ independent} \quad \Longleftrightarrow \quad P = \bigotimes_{n=0}^{\infty} \mu_n \quad \Big(\text{i.e. } P^n = \bigotimes_{k=0}^{n} \mu_k \ \ \forall\, n \geq 0\Big).$$
Proof. Let $\tilde{P} := \bigotimes_{n=0}^{\infty} \mu_n$. Then P = P̃ if and only if for all n ∈ N0 and all A0 ∈ S0, . . . , An ∈ Sn
$$P(X_0 \in A_0, \dots, X_n \in A_n) = \tilde{P}(X_0 \in A_0, \dots, X_n \in A_n) = \prod_{i=0}^{n} \mu_i(A_i) = \prod_{i=0}^{n} P(X_i \in A_i),$$
which is the case if and only if Xn, n ∈ N0, are independent.
Definition 3.7. Let Si := S, i ∈ N0, (Ω, A) be the canonical model and P be a probability measure on (Ω, A). In particular, (Xn)_{n≥0} is a stochastic process in the sense of Definition 3.2. Let J ⊂ N0, |J| < ∞. Then the distribution of (Xj)_{j∈J} under P,
$$\mu_J := P \circ \big((X_i)_{i \in J}\big)^{-1},$$
is said to be the finite dimensional distribution (w.r.t. J) on $(S^J, \mathcal{S}^J)$.
Remark 3.8. P is uniquely determined by its finite-dimensional distributions, resp. by
$$\mu_{\{0, \dots, n\}}, \quad n \in \mathbb{N}.$$
4 Stationarity
Let (S, S) be a measurable space, Ω = S^{N_0} and (Ω, A) be the associated canonical model. Let P be a probability measure on (Ω, A).
Definition 4.1. The mapping T : Ω → Ω, defined by
$$\omega = (x_0, x_1, \dots) \mapsto T\omega := (x_1, x_2, \dots),$$
is called the shift-operator on Ω.
Remark 4.2. For all n ∈ N0, A0, . . . , An ∈ S,
$$T^{-1}\big(\{X_0 \in A_0, \dots, X_n \in A_n\}\big) = \{X_1 \in A_0, \dots, X_{n+1} \in A_n\}.$$
In particular: T is A/A-measurable.
Definition 4.3. The measure P is said to be stationary (or shift-invariant) if
$$P \circ T^{-1} = P.$$
Proposition 4.4. The measure P is stationary if and only if for all k, n ∈ N0:
$$\mu_{\{0, \dots, n\}} = \mu_{\{k, \dots, k+n\}}.$$
Proof.
$$\begin{aligned}
P \circ T^{-1} = P \quad &\Longleftrightarrow \quad P \circ T^{-k} = P \quad \forall\, k \in \mathbb{N}_0 \\
&\overset{3.8}{\Longleftrightarrow} \quad (P \circ T^{-k}) \circ (X_0, \dots, X_n)^{-1} = P \circ (X_0, \dots, X_n)^{-1} \quad \forall\, k, n \in \mathbb{N}_0 \\
&\Longleftrightarrow \quad \mu_{\{k, \dots, n+k\}} = \mu_{\{0, \dots, n\}}.
\end{aligned}$$
Remark 4.5. (i) The last proposition implies in the particular case
$$P = \bigotimes_{n=0}^{\infty} \mu_n \quad \text{with } \mu_n := P \circ X_n^{-1}$$
that
$$P \text{ stationary} \quad \Longleftrightarrow \quad \mu_n = \mu_0 \quad \forall\, n \geq 0.$$
(ii) If $P = \bigotimes_{n=0}^{\infty} \mu_n$ as in (i), hence X0, X1, X2, . . . independent, Kolmogorov's zero-one law implies that P = 0 − 1 on the tail-field
$$A^* := \bigcap_{n \geq 0} \sigma(X_n, X_{n+1}, \dots).$$
Proposition 4.6. Let $P = \bigotimes_{n=0}^{\infty} \mu_n$, µn := P ◦ X_n^{-1}, n ∈ N0. Then P is ergodic, i.e.
$$P = 0 - 1 \ \text{ on } \ I := \{A \in A \mid T^{-1}(A) = A\}.$$
I is called the σ-algebra of shift-invariant sets.
Proof. Using part (ii) of the previous remark, it suffices to show that I ⊂ A*. But
$$A \in I \ \Longrightarrow\ A = T^{-n}(A) \in \sigma(X_n, X_{n+1}, \dots) \quad \forall\, n \in \mathbb{N} \ \Longrightarrow\ A \in A^*.$$