CH 3
Gerald Trutnau
Seoul National University
Fall Term 2024
Non-Corrected version
Contents
1 Basic Notions
1 Probability spaces
2 Discrete models
3 Transformations of probability spaces
4 Random variables
5 Inequalities
6 Variance and Covariance
7 The (strong and the weak) law of large numbers
8 Convergence and uniform integrability
9 Distribution of random variables
10 Weak convergence of probability measures
11 Dynkin-systems and Uniqueness of probability measures
2 Independence
1 Independent events
2 Independent random variables
3 Kolmogorov's law of large numbers
4 Joint distribution and convolution
5 Characteristic functions
6 Central limit theorem
1 Basic Notions
1 Probability spaces
Probability theory is the mathematical theory of randomness. The basic notion is that of a random experiment: an experiment whose outcome is not predictable and can only be determined by performing it and observing the outcome.
Probability theory tries to quantify the possible outcomes by attaching a probability to every event. This is important, for example, for an insurance company asking what a fair price is for an insurance against events like fire or death, which can happen but need not happen.
The set of all possible outcomes of a random experiment is denoted by Ω.
The set Ω may be finite, denumerable or even uncountable.
Example 1.1. Examples of random experiments and corresponding Ω:
(i) Coin tossing: The possible outcomes of tossing a coin are either “head”
or “tail”. Denoting one outcome by “0” and the other one by “1”, the set
of all possible outcomes is given by Ω = {0, 1}.
(ii) Tossing a coin n times: In this case any sequence of zeros and ones
(alias heads or tails) of length n are considered as one possible outcome;
hence
Ω = {(x1, x2, . . . , xn) | xi ∈ {0, 1}} =: {0, 1}^n.
(iv) A random number between 0 and 1: Ω = [0, 1].
Events:
Reasonable subsets A ⊂ Ω for which it makes sense to calculate the probabil-
ity are called events (a precise definition will be given in Definition 1.3 below).
If we consider an event A and observe ω ∈ A in a random experiment, we say that A has occurred.
Combination of events:
A1 ∪ A2, ⋃_i Ai   "at least one of the events Ai occurs"

A1 ∩ A2, ⋂_i Ai   "all of the events Ai occur"
lim sup_{n→∞} An := ⋂_n ⋃_{m≥n} Am   "infinitely many of the Am occur"

A = {(x1, . . . , xn) ∈ {0, 1}^n | Σ_{i=1}^n xi = k}.
(iv) A random number between 0 and 1: "number ∈ [a, b]", A = [a, b] ⊂ Ω = [0, 1].
A = {ω ∈ C([0, 1]) | max_{0≤t≤1} ω(t) > c}.
Let Ω be countable. A probability distribution function p on Ω is a function

p : Ω → [0, 1] with Σ_{ω∈Ω} p(ω) = 1.

Given any subset A ⊂ Ω, its probability P(A) can then be defined by simply adding up

P(A) = Σ_{ω∈A} p(ω).
(i) Ω ∈ A,
(ii) A ∈ A implies Ac ∈ A,
(iii) Ai ∈ A, i ∈ N, implies ⋃_{i∈N} Ai ∈ A.
• A1, . . . , An ∈ A implies

⋃_{i=1}^n Ai ∈ A and ⋂_{i=1}^n Ai ∈ A

• Ai ∈ A, i ∈ N, implies

⋂_n ⋃_{m≥n} Am ∈ A and ⋃_n ⋂_{m≥n} Am ∈ A.
(iii) Let I be an index set (not necessarily countable) and for any i ∈ I, let Ai be a σ-algebra. Then ⋂_{i∈I} Ai := {A ⊂ Ω | A ∈ Ai for any i ∈ I} is again a σ-algebra.
• P(Ω) = 1

• P(⋃_{i∈N} Ai) = Σ_{i=1}^∞ P(Ai) for pairwise disjoint Ai ∈ A ("σ-additivity")
Example 1.7. (i) Coin tossing: Let A := P(Ω) = {∅, {0}, {1}, {0, 1}}. Tossing a fair coin means "head" and "tail" have equal probability 1/2, hence:

P({0}) := P({1}) := 1/2, P(∅) := 0, P({0, 1}) := 1 (note that {0, 1} = Ω).

P({(x1, x2, . . . ) ∈ {0, 1}^N | x1 = x̄1, . . . , xn = x̄n}) := 2^{−n},

where the event on the left-hand side belongs to A0. (Proof: later!)
Remark 1.8. Let (Ω, A, P) be a probability space, and let A1, . . . , An ∈ A be pairwise disjoint. Then

P(⋃_{i≤n} Ai) = Σ_{i=1}^n P(Ai) (P is additive)
Proposition 1.9. Let A be a σ-algebra, and P : A → R+ := [0, ∞) be a
mapping with P(Ω) = 1. Then the following are equivalent:
Proof.

P(⋃_{i=1}^∞ Ai) = lim_{n→∞} P(⋃_{i=1}^n Ai) ≤ lim_{n→∞} Σ_{i=1}^n P(Ai) = Σ_{i=1}^∞ P(Ai),

where the first equality holds by 1.9 and the inequality by (1.1).
Proof. Since

⋃_{m≥n} Am ↘ ⋂_{n∈N} ⋃_{m≥n} Am as n → ∞,
the continuity from above of P implies that

P(lim sup_{n→∞} An) = lim_{n→∞} P(⋃_{m≥n} Am) ≤ lim_{n→∞} Σ_{m=n}^∞ P(Am) = 0,

since Σ_{m=1}^∞ P(Am) < ∞. (The equality holds by 1.10, the inequality by 1.9.)
Example 1.12. (i) Uniform distribution on [0, 1]: Let Ω = [0, 1] and A be the Borel-σ-algebra on Ω (= σ({[a, b] | 0 ≤ a ≤ b ≤ 1})). Let P be the restriction of the Lebesgue measure to the Borel subset [0, 1] of R. Then (Ω, A, P) is a probability space. The probability measure P is called the uniform distribution on [0, 1], since P([a, b]) = b − a for any 0 ≤ a ≤ b ≤ 1 (translation invariance).
2 Discrete models
Throughout the whole section
• Ω ≠ ∅ countable (i.e. finite or denumerable)

• A = P(Ω) and
(ii) Every probability measure P on (Ω, A) is of this form, with p(ω) := P({ω})
for all ω ∈ Ω.
Proof. (i)

P = Σ_{ω∈Ω} p(ω) · δ_ω.
(ii) Exercise.
p(ω) = 1/|Ω| ∀ ω ∈ Ω.

Then
Example 2.3. (i) Random permutations: Let M := {1, . . . , n} and Ω := all permutations of M. Then |Ω| = n!. Let P be the uniform distribution on Ω.
Problem: What is the probability P(“at least one fixed point”)?
Consider the event Ai := {ω | ω(i) = i} (fixed point at position i). Then
Sylvester's formula (cf. (1.2)) implies that

P("at least one fixed point") = P(⋃_{i=1}^n Ai)
= Σ_{k=1}^n (−1)^{k+1} · Σ_{1≤i1<···<ik≤n} P(Ai1 ∩ · · · ∩ Aik),

where P(Ai1 ∩ · · · ∩ Aik) = (n−k)!/n! (k positions are fixed), hence

= Σ_{k=1}^n (−1)^{k+1} · C(n, k) · (n−k)!/n! = − Σ_{k=1}^n (−1)^k/k!.
Consequently,

P("no fixed point") = 1 + Σ_{k=1}^n (−1)^k/k! = Σ_{k=0}^n (−1)^k/k! → e^{−1} as n → ∞.

Asymptotics as n → ∞:

P("exactly k fixed points") = (1/k!) Σ_{j=0}^{n−k} (−1)^j/j! → (1/k!) · e^{−1}.
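The 1/e limit can be checked numerically. The following sketch (the function names are ours, not from the notes) computes the partial sums above and compares them with a Monte Carlo estimate over uniformly random permutations:

```python
import math
import random

def prob_no_fixed_point(n):
    # P("no fixed point") = sum_{k=0}^{n} (-1)^k / k!, a partial sum of e^{-1}
    return sum((-1) ** k / math.factorial(k) for k in range(n + 1))

def simulate_no_fixed_point(n, trials, seed=0):
    # fraction of uniformly random permutations of {0, ..., n-1} without a fixed point
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        if all(perm[i] != i for i in range(n)):
            count += 1
    return count / trials

print(prob_no_fixed_point(10))          # very close to 1/e already for n = 10
print(simulate_no_fixed_point(10, 20000))
```

Both values agree with e^{−1} ≈ 0.3679 up to the respective truncation and sampling error.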
The Poisson distribution with parameter λ > 0 on N ∪ {0} is given by

π_λ := e^{−λ} Σ_{j=0}^∞ (λ^j/j!) · δ_j.
Ω := {ω = (x1, . . . , xn) | xi ∈ S}, |Ω| = |S|^n.

B(n, p)({k}) → (λ^k/k!) · e^{−λ} as n → ∞ with λ = p · n fixed (k = 0, 1, 2, . . .)
(for n big and p small, the Poisson distribution with parameter λ = p · n
is a good approximation for B(n, p)).
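This approximation can be made concrete with a small numerical sketch (helper names are ours), comparing the B(n, p) weights with π_λ for fixed λ = n·p:

```python
import math

def binom_pmf(n, p, k):
    # B(n, p)({k}) = C(n, k) p^k (1-p)^(n-k)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    # pi_lambda({k}) = e^{-lam} lam^k / k!
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 2.0
errs = {}
for n in (10, 100, 1000):
    p = lam / n   # n large, p small, lam = n*p fixed
    errs[n] = max(abs(binom_pmf(n, p, k) - poisson_pmf(lam, k)) for k in range(11))
print(errs)
```

The maximal pointwise error shrinks as n grows, illustrating why the Poisson law is used for "rare events".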
(iii) Urn model (for example: opinion polls, samples, poker, lottery, . . . ):
We consider an urn containing N balls, K red and N − K black (N ≥ 2, 0 ≠ K ≠ N). Suppose that n ≤ N balls are sampled without replacement. What is the probability that exactly k balls in the sample are red?
Typical application: suppose that a small lake contains an (unknown) number N of fish. To estimate N one can do the following: K fish are marked red and after that n (n ≤ N) fish are "sampled" from the lake. If k is the number of marked fish in the sample, N̂ := K · n/k is an estimate of the unknown number N. In this case the probability below with N replaced by N̂ is also an estimate.
Model:
Let Ω be all subsets of {1, . . . , N} having cardinality n, hence

Ω := {ω ∈ P({1, . . . , N}) | |ω| = n}, |Ω| = C(N, n),

so that

P(Ak) = C(K, k) · C(N−K, n−k) / C(N, n) (k = 0, . . . , n) (hypergeometric distribution).
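As a hedged illustration (the concrete parameter values below are ours), the hypergeometric weights and the capture–recapture estimate N̂ = K · n/k can be computed directly:

```python
import math

def hypergeom_pmf(N, K, n, k):
    # P(exactly k red) = C(K, k) C(N-K, n-k) / C(N, n)
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

# the weights sum to 1 (Vandermonde's identity)
N, K, n = 50, 20, 10
total = sum(hypergeom_pmf(N, K, n, k) for k in range(n + 1))
print(total)

# capture-recapture: K marked fish, k marked among n sampled
K, n, k = 100, 60, 12
print(K * n / k)   # estimate N̂ of the population size
```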
3 Transformations of probability spaces
Throughout this section let (Ω, A) and (Ω̃, Ã) be measurable spaces.
Definition 3.1. A mapping T : Ω → Ω̃ is called A/Ã-measurable (or simply
measurable), if T −1 (Ã) ∈ A for all à ∈ Ã.
Notation:
(ii) Sufficient criterion for measurability: suppose that à := σ(Ã0 ) for some
collection of subsets Ã0 ⊂ P(Ω̃). Then T is A/Ã-measurable, if T −1 (Ã) ∈
A for all à ∈ Ã0 .
(iii) Let Ω, Ω̃ be topological spaces, and A, Ã be the associated Borel σ-
algebras. Then:
T : Ω → Ω̃ is continuous ⇒ T is A/Ã-measurable.
T2 ◦ T1 is A1 /A3 -measurable.
(iv) Exercise.
Definition 3.3. Let T : Ω̄ → Ω be a mapping and let A be a σ-algebra of subsets of Ω. The system

σ(T) := {T^{−1}(A) | A ∈ A}
Proposition 3.4. Let T : Ω → Ω̃ be A/Ã-measurable and P be a probability
measure on (Ω, A). Then
Proof. Clearly, P̃(Ã) ≥ 0 for all Ã ∈ Ã, P̃(∅) = 0 and P̃(Ω̃) = 1. For pairwise disjoint Ãi ∈ Ã, i ∈ N, the sets T^{−1}(Ãi) are pairwise disjoint too, hence

P̃(⋃_{i∈N} Ãi) = P(T^{−1}(⋃_{i∈N} Ãi)) = P(⋃_{i∈N} T^{−1}(Ãi)) = Σ_{i=1}^∞ P(T^{−1}(Ãi)) = Σ_{i=1}^∞ P̃(Ãi),

using that P is σ-additive.
so that

P̃(Ã) = P(T ∈ Ã) = Σ_{i∈I} P(T = ω̃i) · 1_Ã(ω̃i) = Σ_{i∈I} P(T = ω̃i) · δ_{ω̃i}(Ã).
Define X̃i : Ω̃ → {0, 1} by

X̃i((xn)_{n∈N}) := xi, i ∈ N,

and let

Ã := σ({X̃i = 1} | i ∈ N).
4 Random variables
Let (Ω, A) be a measurable space and
R̄ := R ∪ {−∞, +∞}, B(R̄) = {B ⊂ R̄ | B ∩ R ∈ B(R)}.
(iii) Let X be a random variable on (Ω, A) with values in R (resp. R̄) and
h : R → R (resp. h : R̄ → R̄) be B(R)/B(R)-measurable (resp.
B(R̄)/B(R̄)-measurable). Then h(X) is a random variable too.
Examples: |X|, X 2 , |X|p , eX , . . .
(iv) The class of random variables on (Ω, A) is closed under the following
countable operations.
If X1 , X2 , . . . are random variables, then
• Σ_{i=1}^n αi · Xi (αi ∈ R, n ∈ N), provided the sum of R̄-valued r.v.'s makes sense (∞ − ∞ is not defined),

• sup_{i∈N} Xi, inf_{i∈N} Xi, in particular

X1 ∧ X2 := min(X1, X2), X1 ∨ X2 := max(X1, X2),

• lim sup_{i→∞} Xi, lim inf_{i→∞} Xi (hence also lim_{i→∞} Xi, if it exists)

are random variables too.
since for a real number x it holds that x > c ⇔ x > q for some q ∈ Q with q > c.
Important examples
(ii) Simple random variables:

X = Σ_{i=1}^n ci · 1_{Ai}, ci ∈ R, Ai ∈ A,
(i) X = X + − X − , with
(ii) Let X > 0. Then there exists a sequence of simple random variables
Xn , n ∈ N, with 0 ≤ Xn 6 Xn+1 and X = limn→∞ Xn (in short:
0 ≤ Xn % X).
Then

E[X] := ∫ X dP = ∫_Ω X dP
Definition/Construction of the integral w.r.t. P:
Let X be a random variable.

1. If X = 1_A, A ∈ A, define

∫ X dP := P(A).

2. If X = Σ_{i=1}^n ci · 1_{Ai}, ci ∈ R, Ai ∈ A, define

∫ X dP := Σ_{i=1}^n ci · P(Ai).

E[X] = ∫ X dP := ∫ X^+ dP − ∫ X^− dP.
Definition 4.6. (i) The set of all P-integrable random variables is defined
by
If

N := {X r.v. | X = 0 P-a.s.},

then

L1 := L1(Ω, A, P) := 𝓛1/N (X ∼ Y :⇔ X − Y ∈ N :⇔ X = Y P-a.s.)

is a Banach space w.r.t. the norm E[|X|].
By "2." and "3." above,

E[X] = E[Σ_{x∈X(Ω)} x · 1_{X=x}] = Σ_{x∈X(Ω)} x · P(X = x). (1.4)

E[X] = Σ_{ω∈Ω} X(ω) · E[1_{ω}] = Σ_{ω∈Ω} X(ω) · P({ω}) = Σ_{ω∈Ω} p(ω) · X(ω), where p(ω) := P({ω}).
Example 4.8. Infinitely many coin tosses with a fair coin: Let Ω = {0, 1}^N, A and P as in 3.6. Then by (1.4),

E[Xi] = 1 · P(Xi = 1) + 0 · P(Xi = 0) = 1/2.
(ii) Expectation of the number of "successes":

Sn := X1 + · · · + Xn = number of "successes" (= ones) in n tosses.

Then for k = 0, 1, . . . , n

P(Sn = k) = Σ_{(x1,...,xn)∈{0,1}^n, x1+···+xn=k} P(X1 = x1, . . . , Xn = xn) = C(n, k) · 2^{−n}.

Hence by (1.4)

E[Sn] = Σ_{k=0}^n k · P(Sn = k) = Σ_{k=1}^n k · C(n, k) · 2^{−n} = n/2.
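The identity E[Sn] = n/2 can be verified directly from the pmf; a short sketch (function names are ours):

```python
import math

def pmf_Sn(n, k):
    # P(S_n = k) = C(n, k) * 2^(-n) for n fair-coin tosses
    return math.comb(n, k) * 2.0 ** (-n)

def expectation_Sn(n):
    # E[S_n] = sum_k k * P(S_n = k), which equals n/2
    return sum(k * pmf_Sn(n, k) for k in range(n + 1))

for n in (1, 5, 20):
    print(n, expectation_Sn(n))
```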
(Recall: 1/(1−q)^2 = d/dq (1/(1−q)) = d/dq Σ_{k=0}^∞ q^k = Σ_{k=1}^∞ k q^{k−1}.)
Proposition 4.10. Let X, Y be r.v. satisfying (1.3). Then

(i) 0 ≤ X ≤ Y P-a.s. ⇒ 0 ≤ E[X] ≤ E[Y].

(ii) Xn ≤ Y P-a.s. for all n ∈ N ⇒ E[lim sup_{n→∞} Xn] ≥ lim sup_{n→∞} E[Xn].

By B. Levi,

= lim_{n→∞} E[inf_{k≥n} Xk − Y] + E[Y] ≤ lim inf_{n→∞} E[Xn − Y] + E[Y] = lim inf_{n→∞} E[Xn],

where the inequality uses 4.10.
Proposition 4.14 (Lebesgue's dominated convergence theorem, DCT). Let Xn, n ∈ N, be random variables and Y ∈ L1 with |Xn| ≤ Y P-a.s. Suppose that the pointwise limit lim_{n→∞} Xn exists P-a.s. Then

E[lim_{n→∞} Xn] = lim_{n→∞} E[Xn].

it follows

E[lim_{n→∞} Xn] = E[lim inf_{n→∞} Xn] ≤ lim inf_{n→∞} E[Xn] ≤ lim sup_{n→∞} E[Xn] ≤ E[lim sup_{n→∞} Xn] = E[lim_{n→∞} Xn],

where the equalities hold by 4.9 and the outer inequalities by Fatou.
Example 4.15. Tossing a fair coin Consider the following simple game: A
fair coin is thrown and the player can invest an arbitrary amount of KRW on
either “head” or “tail”. If the right side shows up, the player gets twice his
investment back, otherwise nothing.
Suppose now a player plays the following bold strategy: he doubles his invest-
ment until his first success. Assuming the initial investment was 1000 KRW,
the investment in the nth round is given by
whereas on the other hand limn→∞ In = 0 P-a.s. (more precisely: for all
ω 6= (0, 0, 0, . . . )).
5 Inequalities
Let (Ω, A, P) be a probability space.
Proposition 5.1 (Jensen's inequality). Let h be a convex function defined on some interval I ⊆ R, X in L1 with X(Ω) ⊂ I. Then E[X] ∈ I and

h(E[X]) ≤ E[h(X)].

E[X]^2 ≤ E[X^2]. More generally, for 0 < p ≤ q:

E[|X|^p]^{1/p} ≤ E[|X|^q]^{1/q}, i.e. ‖X‖p ≤ ‖X‖q.

Proof. h(x) := |x|^{q/p} is convex. Since |X| ∧ n ∈ L1 for n ∈ N, we obtain that

E[(|X| ∧ n)^p]^{q/p} ≤ E[(|X| ∧ n)^q],
Proposition 5.5. Let X be a random variable, h : R̄ → [0, ∞] be increasing. Then

h(c) P(X ≥ c) ≤ E[h(X)] ∀ c > 0.

Proof.

h(c) P(X ≥ c) ≤ h(c) P(h(X) ≥ h(c)) = E[h(c) 1_{h(X)≥h(c)}] ≤ E[h(X)].
Corollary 5.6. (i) Markov inequality: Choose h(x) = x 1_{[0,∞)}(x) and replace X by |X| in 5.5. Then

P(|X| ≥ c) ≤ (1/c) E[|X|] ∀ c > 0.

In particular,

E[|X|] = 0 ⇒ |X| = 0 P-a.s.,
E[|X|] < ∞ ⇒ |X| < ∞ P-a.s.

(ii) Chebychev's inequality: Choose h(x) = x^2 1_{[0,∞)}(x) and replace X by |X − E[X]| in 5.5. Then X ∈ L2 implies

P(|X − E[X]| ≥ c) ≤ (1/c^2) E[(X − E[X])^2] = var(X)/c^2.
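A quick numerical check of Chebychev's bound for a uniform random variable (an illustration with arbitrarily chosen parameters, not taken from the notes):

```python
import random
import statistics

rng = random.Random(42)
xs = [rng.uniform(0.0, 1.0) for _ in range(10000)]   # X uniform on [0, 1]
m = statistics.fmean(xs)
v = statistics.pvariance(xs, mu=m)

c = 0.4
empirical = sum(1 for x in xs if abs(x - m) >= c) / len(xs)
bound = v / c ** 2   # Chebychev: P(|X - E[X]| >= c) <= var(X)/c^2
print(empirical, bound)
```

Here the true probability is 0.2 while the bound is roughly 0.52; Chebychev's inequality is crude but always valid.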
is called the variance of X (mean square prediction error).
The variance is a measure for the fluctuations of X around E[X]. It indicates the risk that one takes when a prognosis is based on the expectation.
σ(X) := √var(X) is called the standard deviation.

(ii) var(X) = 0 ⇔ P(X = E[X]) = 1, i.e. X behaves deterministically.

var(aX + b) = a^2 · var(X). (1.6)

cov(X, Y) = 0 ⇔ var(X + Y) = var(X) + var(Y).
Proposition 6.7 (Cauchy-Schwarz). Let X and Y ∈ L2. Then

2 · XY = (X + Y)^2 − X^2 − Y^2 ∈ L1.
and for i 6= j
cov(Xi , Xj ) = E[Xi Xj ] − p2 = 0,
so that X1 , X2 , . . . are pairwise uncorrelated (in fact even independent, see
below).
Let Sn := X1 + · · · + Xn be the number of successes. Then

and using Levi and the fact that X1, X2, . . . are pairwise uncorrelated, we conclude that

var(X) = Σ_{n=1}^∞ 2^{−2n} · p(1−p) = (1/3) · p(1−p).
Finally, let T be the waiting time until the first "success". Then

Pp(T = n) = Pp(X1 = · · · = X_{n−1} = 0, Xn = 1) = (1 − p)^{n−1} p (geometric distribution),

then

E[T] = E[Σ_{n=1}^∞ n · 1_{T=n}] = Σ_{n=1}^∞ n · Pp(T = n) = Σ_{n=1}^∞ n · (1 − p)^{n−1} p = 1/p

("derivative of the geometric series"), and analogously

var(T) = · · · = (1 − p)/p^2.
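The formulas E[T] = 1/p and var(T) = (1−p)/p² can be checked by simulating the waiting time directly (a sketch; the helper name is ours):

```python
import random

def waiting_time(rng, p):
    # toss until the first success; T has P(T = n) = (1-p)^(n-1) * p
    n = 1
    while rng.random() >= p:
        n += 1
    return n

rng = random.Random(1)
p = 0.25
samples = [waiting_time(rng, p) for _ in range(50000)]
mean = sum(samples) / len(samples)
var = sum((t - mean) ** 2 for t in samples) / len(samples)
print(mean)   # near 1/p = 4
print(var)    # near (1-p)/p^2 = 12
```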
• (Ω, A, P) be a probability space

• X1, X2, . . . ∈ L2 r.v. with

– Xi uncorrelated, i.e. cov(Xi, Xj) = 0 for i ≠ j

– uniformly bounded variances, i.e. sup_{i∈N} σi^2 < ∞, where σi^2 := σ^2(Xi) := var(Xi).
Let

Sn := X1 + · · · + Xn,

so that Sn(ω)/n is the arithmetic mean of the first n observations X1(ω), . . . , Xn(ω) ("empirical mean").
Our aim in this section is to show that randomness in the empirical mean vanishes for increasing n, i.e.

Sn(ω)/n ∼ m for n large, if E[Xi] ≡ m.
Remark 7.1. W.l.o.g. we may assume that E[Xi] = 0 for all i, because otherwise consider X̃i := Xi − E[Xi] ("centered"), which satisfies:

• X̃i ∈ L2

• var(X̃i) = var(Xi)

• S̃n/n − E[S̃n]/n = Sn/n − E[Sn]/n (S̃n := Σ_{i=1}^n X̃i, "centered sum")
Proposition 7.2.

lim_{n→∞} E[(Sn/n − E[Sn]/n)^2] = 0,

(resp. lim_{n→∞} E[(Sn/n − m)^2] = 0 if E[Xi] ≡ m).
Proof.

E[(Sn/n − E[Sn]/n)^2] = var(Sn/n) = (1/n^2) · var(Sn) = (1/n^2) Σ_{i=1}^n σi^2 ≤ (1/n) · const. → 0 as n → ∞,

where the third equality holds by Bienaymé.
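The O(1/n) decay of the mean square error in Proposition 7.2 can be observed numerically for fair-coin tosses, where the exact value is var(X1)/n = 1/(4n) (a sketch; the helper name is ours):

```python
import random

def mean_square_error(n, trials, seed=0):
    # estimate E[(S_n/n - m)^2] for fair-coin tosses, m = 1/2
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sn = sum(rng.randrange(2) for _ in range(n))
        total += (sn / n - 0.5) ** 2
    return total / trials

for n in (10, 100, 1000):
    # exact value: 1/(4n)
    print(n, mean_square_error(n, 2000))
```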
Example 7.6. Application: uniform approximation of f ∈ C([0, 1]) with Bernstein polynomials.
Let p ∈ [0, 1]. Then by the transformation theorem (see assignments)

Bn(p) := Σ_{k=0}^n f(k/n) · C(n, k) p^k (1−p)^{n−k} = Σ_{k=0}^n f(k/n) · Pp[Sn = k] = Ep[f(Sn/n)].

Now

|Bn(p) − f(p)| = |Ep[f(Sn/n) − f(p)]| ≤ Ep[|f(Sn/n) − f(p)|]
= Ep[|f(Sn/n) − f(p)| · 1_{|Sn/n − p| ≤ δ}] + Ep[|f(Sn/n) − f(p)| · 1_{|Sn/n − p| > δ}]
≤ ε · Pp(|Sn/n − p| ≤ δ) + 2‖f‖∞ · Pp(|Sn/n − p| > δ),

where Pp(|Sn/n − p| > δ) ≤ (1/(δ^2 n)) · p(1−p) ≤ 1/(4δ^2 n) by Chebychev.
Consequently, Bn → f uniformly on [0, 1] as n → ∞.
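A sketch of the Bernstein approximation for a concrete non-smooth f (the grid, the test function and the sample sizes are arbitrary choices of ours):

```python
import math

def bernstein(f, n, p):
    # B_n(p) = sum_k f(k/n) C(n,k) p^k (1-p)^(n-k) = E_p[f(S_n/n)]
    return sum(
        f(k / n) * math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n + 1)
    )

f = lambda x: abs(x - 0.5)   # continuous, but not differentiable at 1/2
errs = {
    n: max(abs(bernstein(f, n, i / 50) - f(i / 50)) for i in range(51))
    for n in (10, 100, 1000)
}
print(errs)
```

The maximal error on the grid decreases with n, in line with the uniform convergence proved above.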
implies

Zn(ω) < 1/k for all but finitely many n ∈ N,

hence

lim_{k→∞} S_{k^2}(ω)/k^2 = 0 ∀ ω ∉ N1 with P(N1) = 0.
2. Step: Let Dk := max_{k^2 ≤ l < (k+1)^2} |S_l − S_{k^2}|. We show fast convergence in probability of Dk/k^2 to 0. For all ε > 0:

P(Dk/k^2 > ε) = P(⋃_{l=k^2+1}^{k^2+2k} {|S_l − S_{k^2}| > ε k^2})
≤ Σ_{l=k^2+1}^{k^2+2k} P(|S_l − S_{k^2}| > ε k^2)
≤ Σ_{l=k^2+1}^{k^2+2k} (l − k^2) · c/(ε^2 k^4) (by Chebychev, and l − k^2 ≤ 2k)
≤ (2k)(2k) · c/(ε^2 k^4) = 4c/(ε^2 k^2).

Lemma 7.7 now implies that

lim_{k→∞} Dk(ω)/k^2 = 0 ∀ ω ∉ N2 with P(N2) = 0.
Sn = Y1 + · · · + Yn = position of a particle undergoing a "random walk" on Z.

Increasing refinement (+ rescaling and linear interpolation) of the random walk yields the Brownian motion.
The strong law of large numbers implies that Sn(ω)/n → 0 P-a.s.
In particular, fluctuations grow slower than linearly.

lim sup_{n→∞} Sn(ω)/√(2n log log n) = +1 P-a.s.,
lim inf_{n→∞} Sn(ω)/√(2n log log n) = −1 P-a.s.
(iii) P-a.s. convergence:

P(lim_{n→∞} Xn = X) = 1.
(i) ⇒ (ii); (ii) ⇒ (iii) along some subsequence; (iii) ⇒ (i) if sup_{n∈N} |Xn| ∈ Lp (resp. if the |Xn|^p are uniformly integrable).
(iii)⇒(ii):

{lim_{n→∞} Xn = X} = ⋂_{k=1}^∞ ⋃_{m=1}^∞ ⋂_{n≥m} {|Xn − X| ≤ 1/k},

where we set Ak := ⋃_{m=1}^∞ ⋂_{n≥m} {|Xn − X| ≤ 1/k}. By 1.9,

1 = P(Ak) = lim_{m→∞} P(⋂_{n≥m} {|Xn − X| ≤ 1/k}) ≤ lim inf_{m→∞} P(|Xm − X| ≤ 1/k) ≤ lim sup_{m→∞} P(|Xm − X| ≤ 1/k) ≤ 1.

Consequently,

lim_{m→∞} P(|Xm − X| > 1/k) = 0.
(iii)⇒(i): Y := sup_{n∈N} |Xn| ∈ Lp and lim_{n→∞} Xn = X P-a.s. imply |X| ≤ Y. In particular, |Xn − X|^p ≤ 2^p Y^p ∈ L1. Since lim_{n→∞} |Xn − X|^p = 0 P-a.s., Lebesgue's dominated convergence now implies

lim_{n→∞} E[|Xn − X|^p] = 0.
• in general (i) ⇏ (iii) and (iii) ⇏ (i) (hence (ii) ⇏ (i) too). For examples, see Exercises.
Definition 8.4. Let I be an index set. A family (Xi)_{i∈I} ⊂ L1 of r.v. is called uniformly integrable if

lim_{c→∞} sup_{i∈I} ∫_{|Xi|>c} |Xi| dP = 0.

Note that by Lebesgue's theorem ∫_{|Xi|>c} |Xi| dP = E[1_{|Xi|>c} · |Xi|] → 0 as c → ∞ for each fixed i.
The next Proposition is the definitive version of Lebesgue's theorem on dominated convergence.

Proposition 8.5 (Vitali convergence theorem). Let Xn ∈ L1, n ≥ 1, and X be r.v. Then the following statements are equivalent:

(i) lim_{n→∞} Xn = X in L1.
Lemma 8.7 (ε-δ criterion). Let (Xi)_{i∈I} ⊂ L1. Then the following statements are equivalent:

(i) (Xi)_{i∈I} is uniformly integrable.

≤ c + 1 < ∞.

For δ := ε/(2c) and A ∈ A with P(A) < δ we now conclude

∫_A |Xi| dP = ∫_{A∩{|Xi|<c}} |Xi| dP + ∫_{A∩{|Xi|≥c}} |Xi| dP ≤ c ∫_A dP + ∫_{|Xi|≥c} |Xi| dP < c · P(A) + ε/2 < ε.
(ii)⇒(i): Let ε > 0 and δ be as in (ii). Using Markov's inequality (and the two properties in (ii)), we get for any i ∈ I

P(|Xi| > c) ≤ (1/c) · E[|Xi|] < δ, if c > (sup_{i∈I} E[|Xi|] + 1)/δ,

hence ∫_{|Xi|>c} |Xi| dP < ε ∀ i ∈ I.
Remark 8.8. (i) Existence of a dominating integrable r.v. implies uniform integrability: |Xi| ≤ Y ∈ L1 ∀ i ∈ I

⇒ ∫_{|Xi|>c} |Xi| dP ≤ ∫_{Y>c} Y dP = E[1_{Y>c} · Y] → 0 as c ↗ ∞ by DCT,

since 1_{Y>c} · Y → 0 P-a.s. as c → ∞ (Markov's inequality).
In particular, I finite ⇒ (Xi)_{i∈I} ⊂ L1 uniformly integrable.
(see Exercises).
Proof of Proposition 8.5. (i)⇒(ii): see Exercises. (Hint: Use Lemma 8.7.)
By Fatou,

E[|X|] = E[lim inf_{k→∞} |X_{nk}|] ≤ lim inf_{k→∞} E[|X_{nk}|] ≤ sup_{n∈N} E[|Xn|] < ∞.
Let ε > 0. Then there exists δ > 0 such that for all A ∈ A with P(A) < δ it follows that ∫_A |Xn| dP < ε/2 for any n ∈ N.

E[|Xn|] = ∫_{|Xn|<ε/2} |Xn| dP + ∫_{|Xn|≥ε/2} |Xn| dP < ε,

where the first integral is ≤ ε/2 and the second is < ε/2.
⇒ lim Xn = X in Lp .
n→∞
Proposition 8.10. Let g : [0, ∞) → [0, ∞) be measurable with lim_{x→∞} g(x)/x = ∞. Then every family (Xi)_{i∈I} with sup_{i∈I} E[g(|Xi|)] < ∞ is uniformly integrable.

Proof. Let ε > 0. Choose c > 0 such that x/g(x) < ε/(sup_{i∈I} E[g(|Xi|)] + 1) for x ≥ c. Then for all i ∈ I
∫_{|Xi|≥c} |Xi| dP = ∫_{|Xi|≥c} g(|Xi|) · (|Xi|/g(|Xi|)) dP ≤ (ε/(sup_{j∈I} E[g(|Xj|)] + 1)) · ∫_{|Xi|≥c} g(|Xi|) dP ≤ ε.
Example 8.11. (i) p > 1, sup_i E[|Xi|^p] < ∞ ⇒ (Xi)_{i∈I} uniformly integrable.

Proof: g(x) := x · log^+(x) is monotone increasing and convex. Consequently,

E[g(|Sn|/n)] ≤ E[g((1/n) Σ_{i=1}^n |Xi|)] ≤ (1/n) Σ_{i=1}^n E[g(|Xi|)] ≤ sup_{1≤i≤n} E[g(|Xi|)] ∀ n,

using monotonicity and convexity, and so

sup_{n∈N} E[g(|Sn|/n)] ≤ sup_{i∈N} E[g(|Xi|)] < ∞.

Consequently, (Sn/n)_{n∈N} is uniformly integrable and (1.9) holds. Thus by Proposition 8.5

lim_{n→∞} Sn/n = m in L1.
One complementary remark concerning Lebesgue's dominated convergence theorem.

lim_{n→∞} Xn = X in L1

"⇐":

X + Xn = X ∨ Xn + X ∧ Xn, where X ∨ Xn := sup{X, Xn} and X ∧ Xn := inf{X, Xn}.

Then by Lebesgue

lim_{n→∞} E[X ∧ Xn] = E[X]

and thus
Proposition 8.14 (Lp-completeness, Riesz-Fischer). Let 1 ≤ p < ∞ and Xn ∈ Lp with

lim_{n,m→∞} ∫ |Xn − Xm|^p dP = 0.

(ii) lim_{n→∞} Xn = X in Lp.
9 Distribution of random variables
Let (Ω, A, P) be a probability space, and X : Ω → R̄ be a r.v.
Let µ be the distribution of X (under P), i.e., µ(A) = P(X ∈ A) for all A ∈ B(R̄).
Assume that P(X ∈ R) = 1 (in particular, X is P-a.s. finite, and µ is a probability measure on (R, B(R))).

F(b) := P(X ≤ b) = µ((−∞, b]), b ∈ R, (1.10)
(ii) Existence: Let λ be the Lebesgue measure on (0, 1). Define the "inverse function" G of F : R → [0, 1] by

G : (0, 1) → R, G(y) := inf{x ∈ R | F(x) ≥ y}.
Since 0 < y ≤ F(x) ⇒ G(y) ≤ x, we have

(0, F(x)] ⊂ {G ≤ x},

so that G is measurable.
Let µ := G(λ) = λ ∘ G^{−1} (probability measure on (R, B(R))). Then

µ((−∞, x]) = λ({G ≤ x}) = λ((0, F(x)]) = F(x) ∀ x ∈ R.

Uniqueness: later.
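The construction above is the inverse transform method. A hedged sketch (the bisection helper is ours; it assumes F is increasing and right-continuous), applied to an exponential distribution function:

```python
import math
import random

def G(y, F, lo=-1e9, hi=1e9, iters=100):
    # generalized inverse G(y) = inf{x : F(x) >= y}, computed by bisection
    for _ in range(iters):
        mid = (lo + hi) / 2
        if F(mid) >= y:
            hi = mid
        else:
            lo = mid
    return hi

alpha = 2.0
F = lambda x: 1.0 - math.exp(-alpha * x) if x >= 0 else 0.0

rng = random.Random(0)
samples = [G(rng.random(), F) for _ in range(10000)]
print(sum(samples) / len(samples))   # near 1/alpha = 0.5
```

Feeding uniform numbers on (0, 1) into G produces samples with distribution function F, exactly as in the existence proof.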
Definition 9.5. (i) F (resp. µ) is called discrete, if there exists a countable set S ⊂ R with µ(S) = 1. In this case, µ is uniquely determined by the weights µ({x}), x ∈ S, and F is a step function of the following type:

F(x) = Σ_{y∈S, y≤x} µ({y}).
(ii) (Continuous) exponential distribution with parameter α > 0:

f(x) := α e^{−αx} if x ≥ 0, 0 if x < 0,

F(x) = ∫_{−∞}^x f(t) dt = 1 − e^{−αx} if x ≥ 0, 0 if x < 0.
= (1 − p)k p = P (X = k).)
Proof. See Assignments.
In particular:

p = 1 : E[|X − m|] = σ · √(2/π)
p = 2 : E[|X − m|^2] = σ^2
p = 3 : E[|X − m|^3] = 2√2 · σ^3/√π
p = 4 : E[|X − m|^4] = 3σ^4.
Definition 10.1. Let µ and µn, n ∈ N, be probability measures on (S, S). The sequence (µn) converges to µ weakly if for all f ∈ Cb(S) (= the space of bounded continuous functions on S) it follows that

∫ f dµn → ∫ f dµ as n → ∞.
(i) µn → µ weakly

(ii) ∫ f dµn → ∫ f dµ as n → ∞ for all f bounded and uniformly continuous (w.r.t. d)

(iii) lim sup_{n→∞} µn(F) ≤ µ(F) for all F ⊂ S closed

(iv) lim inf_{n→∞} µn(G) ≥ µ(G) for all G ⊂ S open

(v) lim_{n→∞} µn(A) = µ(A) for all µ-continuity sets A, i.e. ∀ A ∈ S with µ(∂A) = 0

(vi) ∫ f dµn → ∫ f dµ as n → ∞ for all f bounded, measurable and µ-a.s. continuous.
(ii)⇒(iii): Let F ⊂ S be closed and define d(x, F) := inf_{y∈F} d(x, y), x ∈ S. The sets

Gm := {x ∈ S | d(x, F) < 1/m}, m ∈ N, are open.

Define

ϕ(x) := 1 if x ≤ 0, 1 − x if x ∈ [0, 1], 0 if x ≥ 1.
(iii)⇒(v): For a subset A ⊂ S we denote the closure by Ā, the interior by Å, and the boundary by ∂A. Let A be such that µ(Ā \ Å) = µ(∂A) = 0. Then

µ(A) = µ(Å) ≤ lim inf_{n→∞} µn(Å) ≤ lim inf_{n→∞} µn(A) ≤ lim sup_{n→∞} µn(A) ≤ lim sup_{n→∞} µn(Ā) ≤ µ(Ā) = µ(A),

using (iv) for the first inequality and (iii) for the last.
(v)⇒(vi): Let f be as in (vi). The distribution function

F(x) = µ({f ≤ x})

has at most countably many jumps. Thus D := {x ∈ R | µ({f = x}) ≠ 0} is at most countable, and so R \ D ⊂ R is dense. By denseness and since f is bounded: for any ε > 0 we find c0 < · · · < cm ∈ R \ D with

Df := {x ∈ R | f is not continuous at x}

and thus by (1.13)

B(f(ω), ε) ∩ [ck, ck+1) ≠ ∅ ≠ B(f(ω), ε) ∩ (R \ [ck, ck+1)).
Let g := Σ_{k=0}^{m−1} ck · 1_{Ak}. Then ‖f − g‖∞ ≤ ε and

|∫ f dµ − ∫ f dµn| ≤ ∫ |f − g| dµ + |∫ g dµ − ∫ g dµn| + ∫ |g − f| dµn ≤ 2ε + Σ_{k=0}^{m−1} |ck| · |µ(Ak) − µn(Ak)| → 2ε as n → ∞, by (v).
Proof. Let f ∈ Cb(S) be uniformly continuous and ε > 0. Then there exists a δ = δ(ε) > 0 such that x, y ∈ S with d(x, y) ≤ δ implies |f(x) − f(y)| ≤ ε. Hence

|∫ f dµ − ∫ f dµn| = |E[f(X)] − E[f(Xn)]|
≤ ∫_{d(X,Xn)≤δ} |f(X) − f(Xn)| dP + ∫_{d(X,Xn)>δ} |f(X) − f(Xn)| dP
≤ ε + 2‖f‖∞ · P(d(Xn, X) > δ),

and P(d(Xn, X) > δ) → 0 as n → ∞.
(ii) µn → µ weakly as n → ∞

(iii) Fn(x) → F(x) as n → ∞ for all x where F is continuous.

(iv) µn((a, b]) → µ((a, b]) as n → ∞ for all (a, b] with µ({a}) = µ({b}) = 0.

Fn(x) = µn((−∞, x]) → µ((−∞, x]) = F(x) as n → ∞.

By (iii),

µ((a, b]) = F(b) − F(a) = lim_{n→∞} Fn(b) − lim_{n→∞} Fn(a) = lim_{n→∞} µn((a, b]).
‖f − Σ_{k=1}^m f(c_{k−1}) · 1_{(c_{k−1}, ck]}‖∞ ≤ sup_{1≤k≤m} sup_{x∈[c_{k−1}, ck]} |f(x) − f(c_{k−1})| ≤ ε,

where we set g := Σ_{k=1}^m f(c_{k−1}) · 1_{(c_{k−1}, ck]},
and so

|∫ f dµ − ∫ f dµn| ≤ ∫ |f − g| dµ + |∫ g dµ − ∫ g dµn| + ∫ |f − g| dµn ≤ 2ε + Σ_{k=1}^m |f(c_{k−1})| · |µ((c_{k−1}, ck]) − µn((c_{k−1}, ck])| → 2ε as n → ∞, by (iv).
(i) Ω ∈ D.

(ii) A ∈ D ⇒ A^c ∈ D.

(iii) Ai ∈ D, i ∈ N, pairwise disjoint ⇒ ⋃_{i∈N} Ai ∈ D.

D := {A ∈ A | P1(A) = P2(A)}

is a Dynkin-system.
Remark 11.3. (i) Let D be a Dynkin-system. Then

A, B ∈ D, A ⊂ B ⇒ B \ A = (B^c ∪ A)^c ∈ D.

(ii) Every Dynkin-system which is closed under finite intersections (short notation: ∩-stable) is a σ-algebra, because:

(a) A, B ∈ D ⇒ A ∪ B = A ∪ (B \ (A ∩ B)) ∈ D, since A ∩ B ∈ D by assumption, B \ (A ∩ B) ∈ D by (i), and the union is disjoint.

(b) Ai ∈ D, i ∈ N ⇒ ⋃_{i∈N} Ai = ⋃_{i∈N} (Ai \ ⋃_{n=1}^{i−1} An) = ⋃_{i∈N} (Ai ∩ (⋃_{n=1}^{i−1} An)^c) ∈ D, since the sets Ai \ ⋃_{n=1}^{i−1} An are pairwise disjoint and belong to D by assumption and (a).
σ(B) = D(B),

where

D(B) := ⋂_{D Dynkin-system, B⊂D} D.
E ∈ B ⇒ B ⊂ DE ⇒ D(B) ⊂ DE, since B is ∩-stable,

⇒ E ∩ D = D ∩ E ∈ D(B) ⇒ E ∈ DD.

Thus B ⊂ DD, and so D(B) ⊂ DD, which concludes the proof.
D := {A ∈ A | P1(A) = P2(A)}

∅, {X1 = x1, . . . , Xn = xn}, n ∈ N, x1, . . . , xn ∈ {0, 1}
Definition 11.7. Let H be a vector space of real-valued bounded functions
on Ω. H is called a monotone vector space (MVS), if:
Then

lim_{k→∞} gk = f + 2a · 1 ∈ H ⇒ f ∈ H.
σ(M) := “smallest σ-algebra for which all f ∈ M are measurable”
The next theorem plays the same role in measure theory and probability theory
as the Stone-Weierstrass Theorem in analysis.
Claim: f ∈ M0 ⇒ f ∧ α ∈ M0 , ∀α ∈ R.
σ(M0 ) = {A ∈ σ(M0 ) | 1A ∈ H} =: S.
“⊃”: Clear
“⊂”: S is a Dynkin system. Put
E := {A ∈ σ(M0 ) | ∃fn ∈ M0 , fn ≥ 0, n ≥ 1 : fn % 1A }.
For f ∈ M0, α ∈ R, we have:

0 ≤ n · (f − α)^+ ∧ 1 =: fn ∈ M0 and fn ↗ 1_{f>α} as n → ∞, hence {f > α} ∈ E,

and such sets generate σ(M0).
How big is σ(Cb(S))? Clearly σ(Cb(S)) ⊂ B(S), since every continuous function on S is measurable w.r.t. B(S). Let F ⊂ S be closed and d(x, F) := inf_{y∈F} d(x, y), x ∈ S. Then d(·, F) is Lipschitz continuous and so f := d(·, F) ∧ 1 ∈ Cb(S). Moreover

F = {f = 0} ∈ σ(Cb(S)).

Hence B(S) ⊂ σ(Cb(S)). In particular 1A is σ(Cb(S))-measurable for all A ∈ B(S), hence µ1 = µ2.
2 Independence
1 Independent events
Let (Ω, A, P) be a probability space.
are independent.
To this end suppose first that Aj2 ∈ Bj2 , . . . , Ajn ∈ Bjn , and define
Then Dj1 is a Dynkin system (!) containing Bj1 . Proposition 1.11.4 now
implies
hence σ(Bj1) = Dj1. Iterating the above argument for Dj2, Dj3, . . . implies (2.1).
Remark 1.4. Pairwise independence does not imply independence in general.
Example: Consider two tosses with a fair coin, i.e.

Ω := {(i, k) | i, k ∈ {0, 1}}, P := uniform distribution.
be the tail σ-algebra (resp. σ-algebra of terminal events). Then
P(A) ∈ {0, 1} ∀ A ∈ B∞
i.e., P is deterministic on B∞ .
Proof of the Zero-One Law. Proposition 1.2 implies that for all n ≥ 2

B1, B2, . . . , B_{n−1}, σ(⋃_{m=n}^∞ Bm)

are independent. Since B∞ ⊂ σ(⋃_{m≥n} Bm), this implies that for all n ≥ 2

B1, B2, . . . , B_{n−1}, B∞

are independent, hence B∞, Bn, n ∈ N, are independent.
Lemma 1.7. Let B ⊂ A be a σ-algebra such that B is independent from B.
Then
P(A) ∈ {0, 1} ∀A ∈ B.
P(⋂_{m=n}^∞ Am^c) = lim_{k→∞} P(⋂_{m=n}^{n+k} Am^c)
= lim_{k→∞} Π_{m=n}^{n+k} P(Am^c) (by independence)
= lim_{k→∞} Π_{m=n}^{n+k} (1 − P(Am))
≤ lim_{k→∞} exp(− Σ_{m=n}^{n+k} P(Am)) = 0,
Pp (“text occurs”) = ?
and consider the events Ai = “text occurs in the ith block”. Clearly, Ai , i ∈ N,
are independent events (!) by Proposition 1.2(ii) with equal probability
Moreover: since the indicator functions 1A1 , 1A2 , . . . are uncorrelated (since
A1 , A2 , . . . are independent) with uniformly bounded variances, the strong law
of large numbers implies that
(1/n) Σ_{i=1}^n 1_{Ai} → E[1_{A1}] = α Pp-a.s.,
i.e. the relative frequency of the given text in blocks of the infinite sequence is
strictly positive.
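This relative-frequency statement can be simulated with a short pattern of fair-coin letters, where α = 2^{−N} (a sketch; the pattern, seed and block count are arbitrary choices of ours):

```python
import random

rng = random.Random(7)
pattern = (1, 0, 1)
N = len(pattern)
alpha = 0.5 ** N   # P(a block of N fair-coin letters equals the pattern)

blocks = 40000
hits = sum(
    tuple(rng.randrange(2) for _ in range(N)) == pattern
    for _ in range(blocks)
)
freq = hits / blocks
print(freq, alpha)
```

The empirical block frequency stabilizes at α > 0, as the strong law predicts.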
are independent, i.e. for all finite subsets J ⊂ I and any Borel subsets Aj ∈ B(R̄)

P(⋂_{j∈J} {Xj ∈ Aj}) = Π_{j∈J} P(Xj ∈ Aj).
Proof. W.l.o.g. n = 2. (Proof of the general case by induction, using the fact that X1 · . . . · X_{n−1} and Xn are independent, since X1 · . . . · X_{n−1} is measurable w.r.t. σ(σ(X1) ∪ · · · ∪ σ(X_{n−1})), and σ(σ(X1) ∪ · · · ∪ σ(X_{n−1})) and σ(Xn) are independent by Proposition 1.2.)
It therefore suffices to consider two independent r.v. X, Y ≥ 0, and we have to show that E[X · Y] = E[X] · E[Y]. Suppose first that

X = Σ_{i=1}^m αi 1_{Ai} and Y = Σ_{j=1}^n βj 1_{Bj},
Proof. Let ε1 , ε2 ∈ {+, −}. Then X ε1 and Y ε2 are independent by Remark 2.2
and nonnegative. Proposition 2.3 implies
X · Y = X + · Y + + X − · Y − − (X + · Y − + X − · Y + ) ∈ L1 ,
Remark 2.5. (i) In general the converse to the above corollary does not hold: For example let X be N(0, 1)-distributed and Y = X^2. Then X and Y are not independent, but

(ii) X, Y ∈ L2 independent ⇒ X, Y uncorrelated, because

If E[Xi] ≡ m then lim_{n→∞} (1/n) Σ_{i=1}^n Xi = m P-a.s.
Proposition 3.2 (Etemadi, 1981). Let X1, X2, . . . ∈ L1 be pairwise independent, identically distributed, m = E[Xi]. Then

(1/n) Σ_{i=1}^n Xi → m P-a.s. as n → ∞.
Then X̃1, X̃2, . . . are pairwise independent by Remark 2.2. For the proof it is now sufficient to show that for S̃n := Σ_{i=1}^n X̃i we have that

S̃n/n → m P-a.s. as n → ∞.

Indeed,

Σ_{n=1}^∞ P(Xn ≠ X̃n) = Σ_{n=1}^∞ P(Xn ≥ n) = Σ_{n=1}^∞ P(X1 ≥ n)
= Σ_{n=1}^∞ Σ_{k=n}^∞ P(X1 ∈ [k, k+1)) = Σ_{k=1}^∞ k · P(X1 ∈ [k, k+1))
= E[Σ_{k=1}^∞ k · 1_{X1∈[k,k+1)}] ≤ E[X1] < ∞,

since k · 1_{X1∈[k,k+1)} ≤ X1 · 1_{X1∈[k,k+1)}.
2. Reduce the proof to convergence along the subsequence kn = ⌊α^n⌋ (= largest natural number ≤ α^n), α > 1.
We will show in Step 3 that (2.3) holds. Since

(1/kn) · E[S̃_{kn}] = (1/kn) Σ_{i=1}^{kn} E[X̃i] → m as n → ∞,

we get

lim_{l→∞} S̃l(ω)/l = m.
3. Due to Lemma 1.7.7 it suffices for the proof of (2.3) to show that

∀ ε > 0 : Σ_{n=1}^∞ P(|S̃_{kn} − E[S̃_{kn}]|/kn > ε) < ∞.

Pairwise independence of the X̃i implies that the X̃i are pairwise uncorrelated, hence

P(|S̃_{kn} − E[S̃_{kn}]|/kn > ε) ≤ (1/(kn^2 ε^2)) · var(S̃_{kn}) = (1/(kn^2 ε^2)) Σ_{i=1}^{kn} var(X̃i) ≤ (1/(kn^2 ε^2)) Σ_{i=1}^{kn} E[(X̃i)^2].

It is therefore enough to show that

s := Σ_{n=1}^∞ (1/kn^2) Σ_{i=1}^{kn} E[(X̃i)^2] = Σ_{(n,i)∈N^2, i≤kn} (1/kn^2) · E[(X̃i)^2] < ∞.

Interchanging the order of summation,

s = Σ_{i=1}^∞ (Σ_{n : kn≥i} 1/kn^2) · E[(X̃i)^2].

We will show in the following that there exists a constant c such that

Σ_{n : kn≥i} 1/kn^2 ≤ c/i^2. (2.4)
This will then imply that

s ≤ c · Σ_{i=1}^∞ (1/i^2) E[(X̃i)^2] = c · Σ_{i=1}^∞ (1/i^2) E[1_{X1<i} · X1^2]
≤ c · Σ_{i=1}^∞ (1/i^2) Σ_{l=1}^i l^2 · P(X1 ∈ [l−1, l))
= c · Σ_{l=1}^∞ l^2 · P(X1 ∈ [l−1, l)) · (Σ_{i=l}^∞ 1/i^2)
≤ 2c · Σ_{l=1}^∞ l · P(X1 ∈ [l−1, l)) = 2c · E[Σ_{l=1}^∞ l · 1_{X1∈[l−1,l)}]
≤ 2c · (E[X1] + 1) < ∞,

where we used l · 1_{X1∈[l−1,l)} ≤ (X1 + 1) · 1_{X1∈[l−1,l)} and the fact that

Σ_{i=l}^∞ 1/i^2 ≤ 1/l^2 + Σ_{i=l+1}^∞ 1/((i−1)i) = 1/l^2 + Σ_{i=l+1}^∞ (1/(i−1) − 1/i) = 1/l^2 + 1/l ≤ 2/l.
Let ni be the smallest natural number satisfying k_{ni} = ⌊α^{ni}⌋ ≥ i, hence α^{ni} ≥ i. Since kn = ⌊α^n⌋ ≥ α^n/2, we then get

Σ_{n : kn≥i} 1/kn^2 ≤ 4 Σ_{n≥ni} α^{−2n} = 4 · α^{−2ni}/(1 − α^{−2}) ≤ (4/(1 − α^{−2})) · (1/i^2),

which proves (2.4).
Proof. W.l.o.g. E[X1] = ∞ (otherwise just apply Proposition 3.2). Then (by Proposition 3.2) (1/n) Σ_{i=1}^n (Xi(ω) ∧ N) → E[X1 ∧ N] P-a.s. as n → ∞, for all N, hence

(1/n) Σ_{i=1}^n Xi ≥ (1/n) Σ_{i=1}^n Xi ∧ N → E[X1 ∧ N] ↗ E[X1] P-a.s. (first n → ∞, then N → ∞),

which implies lim inf_{n→∞} (1/n) Σ_{i=1}^n Xi = ∞ P-a.s. and the statement of the corollary follows.
and

α < 0: ∃ ε > 0 with α + ε < 0, so that Xn(ω) ≤ e^{n(α+ε)} ∀ n ≥ n0(ω), hence P-a.s. exponential decay;

α > 0: ∃ ε > 0 with α − ε > 0, so that Xn(ω) ≥ e^{n(α−ε)} ∀ n ≥ n0(ω), hence P-a.s. exponential growth.

Note that by Jensen's inequality α = E[log Y1] ≤ log E[Y1] = log m, and typically the inequality is strict, i.e. α < log m, so that it might happen that α < 0 although m > 1 (!)
Illustration: as a particular example consider the following model.
Let $X_0:=1$ be the capital at time 0. At time $n-1$ invest $\frac12 X_{n-1}$ and win $c\cdot\frac12 X_{n-1}$ or $0$, both with probability $\frac12$, where $c>0$ is a constant. Then
\[
X_n = \underbrace{\tfrac12 X_{n-1}}_{\text{not invested}} + \underbrace{\begin{cases} c\cdot\tfrac12 X_{n-1} & \text{with prob. } \tfrac12\\[2pt] 0 & \text{with prob. } \tfrac12,\end{cases}}_{\text{gain/loss}}
\]
thus
\[
X_n = \begin{cases} \tfrac12(1+c)\,X_{n-1} & \text{with prob. } \tfrac12\\[2pt] \tfrac12\,X_{n-1} & \text{with prob. } \tfrac12,\end{cases}
\;=\; X_{n-1}\,Y_n
\]
with
\[
Y_n := \begin{cases} \tfrac12(1+c) & \text{with prob. } \tfrac12\\[2pt] \tfrac12 & \text{with prob. } \tfrac12,\end{cases}
\]
so that $E[Y_i]=\frac14(1+c)+\frac14=\frac{c+2}{4}$ (supercritical if $c>2$).
On the other hand
\[
E[\log Y_1] = \tfrac12\cdot\log\Big(\tfrac12(1+c)\Big)+\tfrac12\log\tfrac12 = \tfrac12\cdot\log\frac{1+c}{4}\;\overset{c<3}{<}\;0.
\]
Hence $X_n\xrightarrow{n\to\infty}0$ $P$-a.s. with exponential rate for $c<3$, whereas at the same time for $c>2$, $E[X_n]=m^n\nearrow\infty$ with exponential rate.
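The dichotomy can be illustrated numerically; a minimal simulation sketch, where $c=2.5$ is a hypothetical choice in the regime $2<c<3$ (mean explodes, typical paths die out):

```python
import random

def simulate(c, n_steps, n_paths, seed=0):
    """Simulate X_n = X_{n-1} * Y_n with Y_n = (1+c)/2 or 1/2, each w.p. 1/2."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        x = 1.0
        for _ in range(n_steps):
            y = 0.5 * (1 + c) if rng.random() < 0.5 else 0.5
            x *= y
        finals.append(x)
    return finals

finals = simulate(c=2.5, n_steps=200, n_paths=2000)
mean = sum(finals) / len(finals)
median = sorted(finals)[len(finals) // 2]
# E[Y] = (c+2)/4 = 1.125 > 1, so E[X_200] grows exponentially, yet
# E[log Y_1] = 0.5*log((1+c)/4) = 0.5*log(0.875) < 0, so a typical path dies out:
print(mean, median)
```

The sample mean is driven by a few rare huge paths, while the median (a "typical" path) is essentially zero, exactly as the a.s. statement predicts.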
Then
\[
\varrho_n(\omega,\,\cdot\,) = \frac{1}{n}\sum_{i=1}^{n}\delta_{X_i(\omega)}
\]
Proof. Clearly, Kolmogorov's law of large numbers implies that for any $x\in\mathbb{R}$
\[
F_n(\omega,x) := \varrho_n\big(\omega,(-\infty,x]\big) = \frac{1}{n}\sum_{i=1}^{n}1_{(-\infty,x]}\big(X_i(\omega)\big)
\longrightarrow E\big[1_{(-\infty,x]}(X_1)\big] = P(X_1\le x) = \mu\big((-\infty,x]\big) =: F(x)
\]
is a $P$-null set too, and for all $x\in\mathbb{R}$ and all $s,r\in\mathbb{Q}$ with $s<x<r$ and $\omega\notin N$:
4 Joint distribution and convolution
Let $X_i\in\mathcal{L}^1$ i.i.d. Kolmogorov's law of large numbers implies that
\[
\frac{1}{n}\underbrace{\sum_{i=1}^{n}X_i(\omega)}_{=:S_n}\;\xrightarrow{n\to\infty}\;E[X_1]\quad P\text{-a.s.},
\]
hence
\[
\int f(x)\,d\big(P\circ(\tfrac{S_n}{n})^{-1}\big)(x) = E\Big[f\Big(\frac{S_n}{n}\Big)\Big]
\;\overset{\text{(Lebesgue)}}{\longrightarrow}\; f\big(E[X_1]\big) = \int f(x)\,d\delta_{E[X_1]}(x)\qquad\forall\,f\in C_b(\mathbb{R}),
\]
i.e., the distribution of $\frac{S_n}{n}$ converges weakly to $\delta_{E[X_1]}$. This is not surprising, because at least for $X_i\in\mathcal{L}^2$
\[
\operatorname{var}\Big(\frac{S_n}{n}\Big) = \frac{1}{n^2}\sum_{i=1}^{n}\underbrace{\operatorname{var}(X_i)}_{=\operatorname{var}(X_1)}\;\xrightarrow{n\to\infty}\;0.
\]
\[
\bar X:\Omega\to\mathbb{R}^n,\qquad \omega\mapsto\bar X(\omega):=\big(X_1(\omega),\dots,X_n(\omega)\big)
\]
Remark 4.2. (i) $\bar\mu$ is well-defined, because $\bar X:\Omega\to\mathbb{R}^n$ is $\mathcal{A}/\mathcal{B}(\mathbb{R}^n)$-measurable.
Proof:
\[
\mathcal{B}(\mathbb{R}^n) = \sigma\big(\big\{A_1\times\cdots\times A_n \,\big|\, A_i\in\mathcal{B}(\mathbb{R})\big\}\big)
\;\Big(= \sigma\big(\big\{A_1\times\cdots\times A_n \,\big|\, A_i=(-\infty,x_i],\ x_i\in\mathbb{R}\big\}\big)\Big)
\]
Example 4.3. (i) Let X, Y be r.v., uniformly distributed on [0, 1]. Then
• X, Y independent ⇒ joint distribution = uniform distribution on
[0, 1]2
• X = Y ⇒ joint distribution = uniform distribution on the diagonal
are independent and
$\Phi$ has a uniform distribution on $\big(-\frac{\pi}{2},\frac{\pi}{2}\big)$,
$R$ has a density
\[
r\mapsto\begin{cases}\dfrac{r}{\sigma^2}\exp\Big(-\dfrac{r^2}{2\sigma^2}\Big) & \text{if } r\ge 0\\[4pt] 0 & \text{if } r<0.\end{cases}
\]
(ii)
\[
\int\varphi(x_1,\dots,x_n)\,d\bar\mu(x_1,\dots,x_n)
= \int\cdots\Big(\int\varphi(x_1,\dots,x_n)\,\mu_{i_1}(dx_{i_1})\Big)\cdots\,\mu_{i_n}(dx_{i_n})
\]
for all $\mathcal{B}(\mathbb{R}^n)$-measurable functions $\varphi:\mathbb{R}^n\to\bar{\mathbb{R}}$ with $\varphi\ge 0$ or $\varphi$ $\bar\mu$-integrable.
(ii) It is enough to show the identity in (ii) for any ϕ ∈ B(Rn )b . From this it
easily extends to ϕ as stated. Now, the set of ϕ ∈ B(Rn )b for which the
identity in (ii) holds forms a monotone vector space which contains the
multiplicative class of functions ϕ(x1 , . . . , xn ) = 1A1 ×···×An (x1 , . . . , xn ),
Ai ∈ B(R), 1 ≤ i ≤ n. Since B(Rn ) is generated by sets of the form
A1 × · · · × An , the identity in (ii) follows for all ϕ ∈ B(Rn )b by the
monotone class theorem (Theorem 11.9).
Hence,
\[
\check\mu(A) := \int_A \bar f(\bar x)\,d\bar x,\qquad A\in\mathcal{B}(\mathbb{R}^n),
\]
so that $\bar\mu=\check\mu$ by 1.11.5.
Let $X_1,\dots,X_n$ be independent, $S_n := X_1+\cdots+X_n$.
How can the distribution of $S_n$ be calculated from the distributions of the $X_i$?
In the following denote by $T_x:\mathbb{R}^1\to\mathbb{R}^1$, $y\mapsto x+y$, the translation by $x\in\mathbb{R}$.
Proposition 4.6. Let $X_1,X_2$ be independent r.v. with distributions $\mu_1,\mu_2$. Then:
(i) The distribution of $X_1+X_2$ is given by the convolution
\[
\mu_1*\mu_2 := \int \mu_1(dx_1)\,\big(\mu_2\circ T_{x_1}^{-1}\big),
\]
i.e.
\[
(\mu_1*\mu_2)(A) = \iint 1_A(x_1+x_2)\,\mu_1(dx_1)\,\mu_2(dx_2)
= \int \mu_1(dx_1)\,\mu_2(A-x_1)\qquad\forall\,A\in\mathcal{B}(\mathbb{R}^1)\,.
\]
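The convolution formula can be tried out on a discrete example; the following sketch computes the distribution of the sum of two fair dice (an illustrative choice, not from the lecture) via $(\mu_1*\mu_2)(\{s\})=\sum_{x}\mu_1(\{x\})\,\mu_2(\{s-x\})$:

```python
# Distribution of the sum of two independent fair dice via the discrete
# convolution (mu1 * mu2)({s}) = sum_x mu1({x}) * mu2({s - x}).
die = {x: 1 / 6 for x in range(1, 7)}

def convolve(mu1, mu2):
    out = {}
    for x1, p1 in mu1.items():
        for x2, p2 in mu2.items():
            out[x1 + x2] = out.get(x1 + x2, 0.0) + p1 * p2
    return out

two_dice = convolve(die, die)
print(two_dice[7])   # 6/36, the most likely sum
```

The resulting dictionary is again a probability distribution, reflecting that the convolution of two probability measures is a probability measure.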
(ii)
\[
(\mu_1*\mu_2)(A) = \int\mu_1(dx_1)\,\mu_2(A-x_1) = \int\mu_1(dx_1)\int_{A-x_1} f_2(x_2)\,dx_2
\]
\[
\overset{\substack{\text{change of variable}\\ x-x_1=x_2}}{=} \int\mu_1(dx_1)\int_{A} f_2(x-x_1)\,dx
\;\overset{4.5}{=}\; \int_A\Big(\int f_2(x-x_1)\,\mu_1(dx_1)\Big)\,dx.
\]
Example 4.7.
(iii) The Gamma distribution $\Gamma_{\alpha,p}$ is defined through its density $\gamma_{\alpha,p}$ given by
\[
\gamma_{\alpha,p}(x) = \begin{cases}\dfrac{1}{\Gamma(p)}\cdot\alpha^p x^{p-1}e^{-\alpha x} & \text{if } x>0\\[4pt] 0 & \text{if } x\le 0\end{cases}
\]
Example 4.8 (The waiting time paradox). Let T1 , T2 , . . . be independent, ex-
ponentially distributed waiting times (e.g. time between reception of two phone
calls in a call center, or two arrivals/departures of buses at a bus station) with
parameter α > 0, so that in particular
\[
E[T_i] = \int_0^\infty x\cdot\alpha e^{-\alpha x}\,dx = \cdots = \frac{1}{\alpha}\,.
\]
(Timeline: the waiting times $T_1,T_2,\dots$ partition the time axis; $X$ denotes the time from the last event before $t$ until $t$, and $Y$ the time from $t$ until the next event.)
Question: How long on average is the waiting time from t until the next event
(“phone call will be received”, “bus will arrive/depart”), i.e. how big is E[Y ] ?
\[
E[X] = \frac{1}{\alpha}\big(1-e^{-\alpha t}\big) \approx \frac{1}{\alpha}\quad\text{for large } t\,.
\]
More precisely:
(ii) X has exponential distribution with parameter α, “compressed to” [0, t],
i.e.
In particular,
\[
E[X] = \int_0^t s\cdot\alpha e^{-\alpha s}\,ds + t\cdot e^{-\alpha t} = \cdots = \frac{1}{\alpha}\big(1-e^{-\alpha t}\big)
\]
Consequently:
(i) Setting $x=0$ we get: $Y$ is exponentially distributed with parameter $\alpha$.
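The memorylessness behind this example is easy to probe by simulation; a minimal sketch, with hypothetical values $\alpha=1$, $t=10$:

```python
import random

def residual_and_age(alpha, t, n_runs, seed=1):
    """Simulate exponential waiting times; return (mean age X, mean residual Y)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_runs):
        s = 0.0                          # time of the last event so far
        while True:
            w = rng.expovariate(alpha)
            if s + w > t:
                ys.append(s + w - t)     # Y: time from t until the next event
                xs.append(t - s)         # X: time from the last event before t to t
                break
            s += w
    return sum(xs) / n_runs, sum(ys) / n_runs

ex, ey = residual_and_age(alpha=1.0, t=10.0, n_runs=20000)
print(ex, ey)  # both close to 1/alpha = 1 for large t
```

Although $t$ falls "into the middle" of a waiting interval, the residual waiting time $Y$ still has mean $1/\alpha$: this is the waiting time paradox.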
5 Characteristic functions
Let $M_1^+(\mathbb{R}^n)$ be the set of all probability measures on $(\mathbb{R}^n,\mathcal{B}(\mathbb{R}^n))$.
For given $\mu\in M_1^+(\mathbb{R}^n)$ define its characteristic function as the complex-valued function $\hat\mu:\mathbb{R}^n\to\mathbb{C}$ defined by
\[
\hat\mu(u) := \int e^{i\langle u,y\rangle}\,\mu(dy) := \int\cos\big(\langle u,y\rangle\big)\,\mu(dy) + i\int\sin\big(\langle u,y\rangle\big)\,\mu(dy)\,.
\]
(i) µ̂(0) = 1.
(ii) |µ̂| 6 1.
Proof. Exercise.
Proposition 5.2 (Uniqueness theorem). Let µ1 , µ2 ∈ M1+ (Rn ) with µ̂1 = µ̂2 .
Then µ1 = µ2 .
Proposition 5.4 (Lévy’s continuity theorem). Let (µm )m∈N be a sequence in
M1+ (Rn ). Then
(i) limm→∞ µm = µ weakly implies limm→∞ µ̂m = µ̂ uniformly on every
compact subset of Rn .
Proposition 5.7. For all $u\in\mathbb{R}^n$:
\[
\Big(\frac{1}{2\pi}\Big)^{\frac{n}{2}}\int e^{i\langle u,y\rangle}\,e^{-\frac12\|y\|^2}\,dy = e^{-\frac12\|u\|^2}.
\]
Special cases:
a) Binomial distribution $\beta_{n,p} = \sum_{k=0}^{n}\binom{n}{k}p^k q^{n-k}\delta_k$, $q=1-p$, $p\in[0,1]$. Then for all $u\in\mathbb{R}$:
\[
\hat\beta_{n,p}(u) = \sum_{k=0}^{n}\binom{n}{k}p^k q^{n-k}\cdot e^{iuk} = \big(q+pe^{iu}\big)^n.
\]
b) Poisson distribution $\pi_\alpha = \sum_{n=0}^{\infty}e^{-\alpha}\frac{\alpha^n}{n!}\delta_n$. Then for all $u\in\mathbb{R}$:
\[
\hat\pi_\alpha(u) = e^{-\alpha}\cdot\sum_{n=0}^{\infty}\underbrace{\frac{\alpha^n}{n!}\,e^{iun}}_{=\frac{(\alpha e^{iu})^n}{n!}} = e^{\alpha(e^{iu}-1)}.
\]
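Both closed forms can be verified numerically against the defining sums; a small sketch (the parameters $n=10$, $p=0.3$, $\alpha=2.5$ and the truncation of the Poisson series are arbitrary illustrative choices):

```python
import cmath
from math import comb, exp, factorial

def phi_binomial_sum(n, p, u):
    """Characteristic function of beta_{n,p} computed from the defining sum."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) * cmath.exp(1j * u * k)
               for k in range(n + 1))

def phi_binomial_closed(n, p, u):
    q = 1 - p
    return (q + p * cmath.exp(1j * u))**n

def phi_poisson_sum(alpha, u, terms=100):
    """Truncated series for the Poisson characteristic function."""
    return exp(-alpha) * sum(alpha**k / factorial(k) * cmath.exp(1j * u * k)
                             for k in range(terms))

def phi_poisson_closed(alpha, u):
    return cmath.exp(alpha * (cmath.exp(1j * u) - 1))

u = 0.7
assert abs(phi_binomial_sum(10, 0.3, u) - phi_binomial_closed(10, 0.3, u)) < 1e-12
assert abs(phi_poisson_sum(2.5, u) - phi_poisson_closed(2.5, u)) < 1e-12
```

Both checks also confirm the general facts $\hat\mu(0)=1$ and $|\hat\mu|\le 1$ from above.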
\[
S_n^* := \frac{S_n - E[S_n]}{\sqrt{\operatorname{var}(S_n)}}\qquad(\text{“standardized sum”})
\]
The sequence $X_1,X_2,\dots$ of r.v.'s is said to have the central limit property (CLP), if
or equivalently
\[
\lim_{n\to\infty} P(S_n^*\le b) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{b} e^{-\frac{x^2}{2}}\,dx = \Phi(b)\,,\qquad\forall\,b\in\mathbb{R}.
\]
Remark 6.3. (i) $(X_n)_{n\in\mathbb{N}}$ i.i.d. $\Rightarrow$ $(X_n)_{n\in\mathbb{N}}$ satisfies (L).
Proof: Let $m := E[X_n]$, $\sigma^2 := \operatorname{var}(X_n)$. Then $s_n^2 = n\sigma^2$, so that
\[
L_n(\varepsilon) = \sigma^{-2}\int_{\{|X_1-m|>\varepsilon\sqrt{n}\,\sigma\}}(X_1-m)^2\,dP\;\overset{\text{Lebesgue}}{\longrightarrow}\;0\quad(n\to\infty).
\]
To see that Lyapunov's condition implies Lindeberg's condition note that for all $\varepsilon>0$:
\[
\frac{\big|X_k - E[X_k]\big|}{\varepsilon s_n} > 1
\;\Rightarrow\;
\frac{\big|X_k - E[X_k]\big|^{2+\delta}}{(\varepsilon s_n)^{\delta}} \ge \big|X_k - E[X_k]\big|^2
\]
and therefore
\[
L_n(\varepsilon) \le \frac{1}{\varepsilon^\delta}\cdot\frac{\displaystyle\sum_{k=1}^{n}E\Big[\big|X_k-E[X_k]\big|^{2+\delta}\Big]}{s_n^{2+\delta}}\,.
\]
(iii) Let $(X_n)$ be bounded and suppose that $s_n\to\infty$. Then $(X_n)$ satisfies Lyapunov's condition for any $\delta>0$, because
\[
|X_k|\le\frac{\alpha}{2}
\;\Rightarrow\;
\big|X_k-E[X_k]\big|\le\alpha
\]
\[
\Rightarrow\;
\frac{\displaystyle\sum_{k=1}^{n}E\Big[\big|X_k-E[X_k]\big|^{2+\delta}\Big]}{s_n^{2+\delta}}
\le \frac{\displaystyle\sum_{k=1}^{n}E\Big[\big|X_k-E[X_k]\big|^{2}\Big]\,\alpha^\delta}{s_n^2\,s_n^\delta}
= \Big(\frac{\alpha}{s_n}\Big)^{\delta}\cdot\frac{1}{s_n^2}\underbrace{\sum_{k=1}^{n}E\Big[\big|X_k-E[X_k]\big|^2\Big]}_{=s_n^2}
= \Big(\frac{\alpha}{s_n}\Big)^{\delta}.
\]
$\le L_n(\varepsilon)+\varepsilon^2$.
The proof of Proposition 6.2 requires some further preparations.
Lemma 6.5. For all $t\in\mathbb{R}$ and $n\in\mathbb{N}$:
\[
\Big|e^{it}-1-\frac{it}{1!}-\frac{(it)^2}{2!}-\dots-\frac{(it)^{n-1}}{(n-1)!}\Big| \le \frac{|t|^n}{n!}\,.
\]
Proof. Define $f(t):=e^{it}$, then $f^{(k)}(t)=i^k e^{it}$. Then Taylor series expansion around $t=0$, applied to real and imaginary part, implies that
\[
e^{it}-1-\frac{it}{1!}-\dots-\frac{(it)^{n-1}}{(n-1)!} = R_n(t)
\]
with
\[
\big|R_n(t)\big| = \Big|\frac{1}{(n-1)!}\int_0^t (t-s)^{n-1}\,i^n e^{is}\,ds\Big|
\le \frac{1}{(n-1)!}\int_0^{|t|} s^{n-1}\,ds = \frac{|t|^n}{n!}\,.
\]
In particular
Hence
\[
\big|\varphi_X(u) - 1 - iu\cdot E[X]\big| = \Big|\int\big(e^{iuX}-1-iuX\big)\,dP\Big| \le \frac12\cdot u^2\cdot E[X^2].
\]
Now define $\theta(u) := 0$ if $u^2 E[X^2]=0$, and $\theta(u) := \dfrac{\varphi_X(u)-1-iu\cdot E[X]}{\frac12\cdot u^2\cdot E[X^2]}$ otherwise.
Proposition 6.7. Suppose that the following two conditions (F) and (b) hold:
\[
(F)\qquad \lim_{n\to\infty}\,\max_{1\le k\le n}\frac{\sigma_k}{s_n} = 0\qquad(\text{Feller's condition})
\]
\[
(b)\qquad \lim_{n\to\infty}\sum_{k=1}^{n}\Big(\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big) = -\frac12 u^2\qquad\forall\,u\in\mathbb{R}.
\]
and $\varphi_{S_n^*}(u)\xrightarrow{n\to\infty}e^{-\frac12 u^2}$ $\big(=\widehat{N(0,1)}(u)$ by Proposition 5.7$\big)$ pointwise, as well as $\widehat{N(0,1)}$ continuous at $u=0$, implies by Lévy's continuity theorem 5.4 and the Uniqueness theorem 5.2 that $\lim_{n\to\infty}P_{S_n^*}=N(0,1)$ weakly.
For the proof of (2.6) we need to show that for all $u\in\mathbb{R}$
\[
\lim_{n\to\infty}\bigg(\prod_{k=1}^{n}\varphi_{X_k}\Big(\frac{u}{s_n}\Big)
- \underbrace{\prod_{k=1}^{n}\exp\Big(\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big)}_{=\exp(\sum\dots)\;\to\;\exp(-\frac12 u^2)\ \text{by (b)}}\bigg) = 0.
\]
To this end fix $u\in\mathbb{R}$ and note that $|\varphi_{X_k}|\le 1$, hence
\[
\Big|\exp\Big(\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big)\Big| = \exp\Big(\operatorname{Re}\,\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big) \le 1.
\]
\[
\Big|\prod_{k=1}^{n}a_k - \prod_{k=1}^{n}b_k\Big|
= \big|(a_1-b_1)\cdot a_2\cdots a_n + b_1\cdot(a_2-b_2)\cdot a_3\cdots a_n + \dots + b_1\cdots b_{n-1}\cdot(a_n-b_n)\big|
\le \sum_{k=1}^{n}|a_k-b_k|.
\]
Consequently,
\[
\Big|\prod_{k=1}^{n}\varphi_{X_k}\Big(\frac{u}{s_n}\Big) - \prod_{k=1}^{n}\exp\Big(\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big)\Big|
\le \sum_{k=1}^{n}\Big|\varphi_{X_k}\Big(\frac{u}{s_n}\Big) - \exp\Big(\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big)\Big| =: D_n\,.
\]
Writing $z_k := \varphi_{X_k}\big(\frac{u}{s_n}\big)-1$, this reads
\[
D_n = \sum_{k=1}^{n}\big|z_k + 1 - e^{z_k}\big|.
\]
Note that $E[X_k]=0$ and $E[X_k^2]=\sigma_k^2$. The previous proposition now implies that for all $k$
\[
|z_k| = \Big|\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big|
= \Big|\,i\,\frac{u}{s_n}\cdot E[X_k] + \frac12\,\theta\Big(\frac{u}{s_n}\Big)\Big(\frac{u}{s_n}\Big)^2\cdot E[X_k^2]\Big|
\le \frac12\Big(\frac{u}{s_n}\Big)^2\sigma_k^2\,,
\]
and moreover by (F) we can find $n_0\in\mathbb{N}$ such that for all $n\ge n_0$ and $1\le k\le n$
\[
\frac12\Big(\frac{u}{s_n}\Big)^2\sigma_k^2 < \varepsilon.
\]
Hence for all $n\ge n_0$
\[
D_n \le \varepsilon\sum_{k=1}^{n}|z_k| \le \varepsilon\,\frac{u^2}{2}\sum_{k=1}^{n}\frac{\sigma_k^2}{s_n^2} = \varepsilon\cdot\frac{u^2}{2}\,.
\]
Consequently, $\lim_{n\to\infty}D_n = 0$.
Proof of Proposition 6.2. W.l.o.g. assume that $E[X_n]=0$ for all $n\in\mathbb{N}$. We will use Proposition 6.7. Since (L) $\Rightarrow$ (F) by Lemma 6.4 it remains to show (b) of Proposition 6.7. We will show that Lindeberg's condition implies (b), i.e. we show that (L) implies
\[
\lim_{n\to\infty}\sum_{k=1}^{n}\Big(\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big) = -\frac12\cdot u^2.
\]
Define
\[
Y_k := \Big|\exp\Big(i\cdot\frac{u}{s_n}\cdot X_k\Big)-1-i\cdot\frac{u}{s_n}\cdot X_k+\frac12\cdot\frac{u^2}{s_n^2}\cdot X_k^2\Big| \;\le\; \frac{u^2}{s_n^2}\cdot X_k^2\,.
\]
Then
\[
\Big|\sum_{k=1}^{n}\Big(\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big)+\frac12\cdot u^2\Big|
= \Big|\sum_{k=1}^{n}\Big(\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1+\frac12\cdot\frac{u^2}{s_n^2}\cdot\sigma_k^2\Big)\Big|
\]
\[
\le \sum_{k=1}^{n}\int\Big|\exp\Big(i\cdot\frac{u}{s_n}\cdot X_k\Big)-1-\underbrace{i\cdot\frac{u}{s_n}\cdot X_k}_{E[\,\cdots\,]=0}+\frac12\cdot\frac{u^2}{s_n^2}\cdot X_k^2\Big|\,dP
\le \sum_{k=1}^{n}E[Y_k],
\]
and for any $\varepsilon>0$
\[
E[Y_k] = \int_{\{|X_k|\ge\varepsilon s_n\}}Y_k\,dP + \int_{\{|X_k|<\varepsilon s_n\}}Y_k\,dP
\le \frac{u^2}{s_n^2}\int_{\{|X_k|\ge\varepsilon s_n\}}X_k^2\,dP + \frac{|u|^3}{6\,s_n^3}\int_{\{|X_k|<\varepsilon s_n\}}|X_k|^3\,dP.
\]
Note that
\[
\frac{1}{s_n^3}\int_{\{|X_k|<\varepsilon s_n\}}|X_k|^3\,dP \le \frac{\varepsilon}{s_n^2}\int X_k^2\,dP = \varepsilon\cdot\frac{\sigma_k^2}{s_n^2}\,,
\]
so that we obtain
\[
\sum_{k=1}^{n}E[Y_k] \le u^2\sum_{k=1}^{n}\int_{\{|\frac{X_k}{s_n}|\ge\varepsilon\}}\Big(\frac{X_k}{s_n}\Big)^2 dP + \frac{|u|^3}{6}\cdot\varepsilon\cdot\underbrace{\sum_{k=1}^{n}\frac{\sigma_k^2}{s_n^2}}_{=1}
= u^2\,L_n(\varepsilon) + \frac{|u|^3}{6}\cdot\varepsilon.
\]
Consequently
\[
\lim_{n\to\infty}\sum_{k=1}^{n}E[Y_k] = 0\,,
\]
and thus
\[
\lim_{n\to\infty}\Big|\sum_{k=1}^{n}\Big(\varphi_{X_k}\Big(\frac{u}{s_n}\Big)-1\Big)+\frac12\cdot u^2\Big| = 0.
\]
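The convergence of characteristic functions driving the proof can be observed concretely; a sketch for i.i.d. fair $\pm1$ variables (a special case chosen purely for illustration), where $s_n^2=n$ and $\varphi_{S_n^*}(u)=\cos(u/\sqrt{n})^n$:

```python
import math

# For X_i i.i.d. uniform on {-1, +1}: E[X_i] = 0, s_n^2 = n, and
# phi_{S_n^*}(u) = E[exp(i u S_n / sqrt(n))] = cos(u / sqrt(n))**n (real-valued).
def phi_std_sum(n, u):
    return math.cos(u / math.sqrt(n)) ** n

u = 1.3
target = math.exp(-u**2 / 2)   # characteristic function of N(0,1) at u
errors = [abs(phi_std_sum(n, u) - target) for n in (10, 100, 10000)]
print(errors)  # decreasing toward 0
```

The errors shrink roughly like $1/n$, consistent with the expansion used in the proof, where the neglected terms are of third order.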
\[
\Pi := m + \lambda\sigma^2 = \text{average claim size} + \text{safety loading}.
\]
After some fixed amount of time:
Income: $n\Pi$
Expenditures: $S_n = \sum_{i=1}^{n}X_i$.
The central limit theorem implies for large $n$ that $S_n^*\sim N(0,1)$ approximately, so that
\[
P(R) = P\Big(S_n^* > \frac{K+n\Pi-nm}{\sqrt{n}\,\sigma}\Big) = P\Big(S_n^* > \frac{K+n\lambda\sigma^2}{\sqrt{n}\,\sigma}\Big)
\;\overset{\text{CLT}}{\approx}\; 1-\Phi\Big(\underbrace{\frac{K+n\lambda\sigma^2}{\sqrt{n}\,\sigma}}_{\longrightarrow\,\infty\ (n\to\infty)}\Big),
\]
How large do we have to choose $n$ in order to let the probability of ruin $P(R)$ fall below 1‰?
Answer: we need $\Phi(\dots)\ge 0.999$, hence $n\ge 10\,611$.
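The threshold can be found by direct search once $\Phi$ is available; the sketch below uses hypothetical parameters $K$, $\lambda$, $\sigma$ (NOT the elided figures behind $n\ge 10\,611$):

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf, expressed via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def min_contracts(K, lam, sigma, level=0.999, n_max=10**6):
    """Smallest n with Phi((K + n*lam*sigma^2) / (sqrt(n)*sigma)) >= level."""
    for n in range(1, n_max):
        if Phi((K + n * lam * sigma**2) / (sqrt(n) * sigma)) >= level:
            return n
    return None

# Hypothetical figures (initial capital K, loading lam, claim s.d. sigma):
n_star = min_contracts(K=10.0, lam=0.001, sigma=30.0)
print(n_star)
```

The search works because, for these parameters, the argument of $\Phi$ grows like $\sqrt{n}$ for large $n$: pooling many independent risks makes ruin arbitrarily unlikely.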
in the year 1730 and De Moivre used it in his proof of the CLT for Bernoulli
experiments (1733).
Conversely, in 1977, Weng provided an independent proof of the formula,
using the CLT (note that we did not use Stirling’s formula in our proof of
the CLT). Here is Weng’s proof:
Let $X_1,X_2,\dots$ be i.i.d. with distribution $\pi_1$ (Poisson distribution with parameter 1), i.e.,
\[
P_{X_n} = e^{-1}\sum_{k=0}^{\infty}\frac{1}{k!}\,\delta_k\,.
\]
\[
S_n^* = \frac{S_n-n}{\sqrt{n}}\,,
\]
In particular, for
\[
f_\infty(x) := x^- = (-x)\vee 0
\]
it follows that
\[
\int f_\infty\,dP_{S_n^*}
= \int_{\mathbb{R}}\underbrace{f_\infty\Big(\frac{x-n}{\sqrt{n}}\Big)}_{=\,\begin{cases}0 & \text{if } x>n\\ \frac{n-x}{\sqrt{n}} & \text{if } x\le n\end{cases}}\,\pi_n(dx)
= e^{-n}\sum_{k=0}^{n}\frac{n^k}{k!}\cdot\underbrace{\frac{n-k}{\sqrt{n}}}_{=f_\infty\left(\frac{k-n}{\sqrt{n}}\right)}
\]
\[
= \frac{e^{-n}}{\sqrt{n}}\cdot\Big(n+\sum_{k=1}^{n}\frac{n^k\,(n-k)}{k!}\Big)
= \frac{e^{-n}}{\sqrt{n}}\cdot\Big(n+\underbrace{\sum_{k=1}^{n}\Big(\frac{n^{k+1}}{k!}-\frac{n^k}{(k-1)!}\Big)}_{=\,\frac{n^{n+1}}{n!}-\frac{n}{0!}\ \text{(telescoping sum)}}\Big)
= \frac{e^{-n}\cdot n^{n+\frac12}}{n!}\,.
\]
Moreover,
\[
\int f_\infty\,dN(0,1) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{0}(-x)\cdot e^{-\frac{x^2}{2}}\,dx
= \frac{1}{\sqrt{2\pi}}\Big[e^{-\frac{x^2}{2}}\Big]_{-\infty}^{0} = \frac{1}{\sqrt{2\pi}}\,.
\]
Hence, Stirling's formula (2.7) would follow, once we have shown that
\[
\int f_\infty\,dP_{S_n^*}\;\xrightarrow{n\to\infty}\;\int f_\infty\,dN(0,1). \tag{2.8}
\]
Note that this is not implied by the weak convergence in the CLT, since $f_\infty$ is continuous but unbounded. Hence, we consider for given $m\in\mathbb{N}$
\[
f_m := f_\infty\wedge m \,\in C_b(\mathbb{R})\,.
\]
Define $g_m := f_\infty - f_m\;(\ge 0)$. (2.8) then follows from a "3ε-argument", once we have shown that
\[
(0\le)\;\int g_m\,dP_{S_n^*} \le \frac{1}{m}\quad\forall\,m,
\qquad
(0\le)\;\int g_m\,dN(0,1) \le \frac{1}{m}\quad\forall\,m.
\]
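The identity $\int f_\infty\,dP_{S_n^*}=e^{-n}n^{n+1/2}/n!$ and its limit $1/\sqrt{2\pi}$ can be checked numerically; a sketch (computed in log-space only to avoid float overflow for large $n$):

```python
from math import exp, log, pi, sqrt

def ratio(n):
    """e^{-n} * n^{n + 1/2} / n!, evaluated via logarithms."""
    log_r = -n + (n + 0.5) * log(n) - sum(log(k) for k in range(1, n + 1))
    return exp(log_r)

target = 1 / sqrt(2 * pi)      # = 0.39894..., the claimed limit
print(ratio(10), ratio(1000), target)
```

That `ratio(n)` tends to $1/\sqrt{2\pi}$ is precisely Stirling's formula $n!\sim\sqrt{2\pi n}\,(n/e)^n$ in disguise.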
3 Conditional probabilities
1 Elementary definitions
Let (Ω, A, P) be a probability space.
\[
P(A\mid B) := \frac{P(A\cap B)}{P(B)}\,,\qquad A\in\mathcal{A},
\]
\[
P_B := P(\,\cdot\mid B)
\]
\[
P(A\mid B) = \frac{|A\cap B|}{|B|} = \text{proportion of the elements of } B \text{ that lie in } A.
\]
\[
P(A\mid B) = \frac{P(A)\cdot P(B)}{P(B)} = P(A).
\]
Example 1.3. (i) Suppose that a family has two children. Consider the fol-
lowing two events: B := "at least one boy" and A := "two boys". Then
$P(A\mid B)=\frac13$, because
\[
\Omega = \big\{(\text{boy},\text{boy}),(\text{girl},\text{boy}),(\text{boy},\text{girl}),(\text{girl},\text{girl})\big\},\qquad P = \text{uniform distribution},
\]
and thus
\[
P(A\mid B) = \frac{|A\cap B|}{|B|} = \frac{1}{3}\,.
\]
\[
P(X_1=k\mid X_1+X_2=n) = \frac{P(X_1=k,\,X_2=n-k)}{P(X_1+X_2=n)}
= \frac{e^{-\lambda_1}\frac{\lambda_1^k}{k!}\cdot e^{-\lambda_2}\frac{\lambda_2^{n-k}}{(n-k)!}}{e^{-\lambda}\frac{\lambda^n}{n!}}
= \binom{n}{k}\cdot\Big(\frac{\lambda_1}{\lambda}\Big)^k\Big(\frac{\lambda_2}{\lambda}\Big)^{n-k},
\]
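The identity (conditional distribution of $X_1$ given the sum is binomial with success probability $\lambda_1/\lambda$) can be verified numerically; the rates $\lambda_1=2$, $\lambda_2=3$ below are arbitrary illustrative choices:

```python
from math import comb, exp, factorial

lam1, lam2 = 2.0, 3.0        # hypothetical Poisson rates
lam = lam1 + lam2
n = 7

for k in range(n + 1):
    # left-hand side: P(X1=k, X2=n-k) / P(X1+X2=n), with X1+X2 ~ Poisson(lam)
    joint = (exp(-lam1) * lam1**k / factorial(k)
             * exp(-lam2) * lam2**(n - k) / factorial(n - k))
    total = exp(-lam) * lam**n / factorial(n)
    lhs = joint / total
    # right-hand side: Binomial(n, lam1/lam) weight at k
    rhs = comb(n, k) * (lam1 / lam)**k * (lam2 / lam)**(n - k)
    assert abs(lhs - rhs) < 1e-12
```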
Sn := X1 + . . . + Xn
and
For given $(x_1,\dots,x_n)\in\{0,1\}^n$ and fixed $k\in\{0,\dots,n\}$
\[
P(X_1=x_1,\dots,X_n=x_n\mid S_n=k)
= \begin{cases}
0 & \text{if } \sum_i x_i\ne k\\[4pt]
\dfrac{p^k(1-p)^{n-k}}{\binom{n}{k}p^k(1-p)^{n-k}} = \dbinom{n}{k}^{-1} & \text{otherwise,}
\end{cases}
\]
because $|\Omega_k|=\binom{n}{k}$ and for all $(x_1,\dots,x_n)\in\Omega_k$ we have $\nu\big(\{(x_1,\dots,x_n)\}\big)=\binom{n}{k}^{-1}$.
                 male                          female
          Appl.   acc.   P_M(A|B_i)     Appl.   acc.   P_F(A|B_i)
B1         826    551      0.67          108     89      0.82
B2         560    353      0.63           25     17      0.68
B3         325    110      0.34          593    219      0.37
B4         373     22      0.06          341     24      0.07
Total     2084   1036                   1067    349

It follows that for all four faculties the probability of being accepted was higher for female students than it was for male students:
\[
P_M(A\mid B_i) < P_F(A\mid B_i).
\]
Nevertheless, the preference turns into its opposite if looking at the total probability of admission:
\[
P_F(A) = \sum_{i=1}^{4}P_F(A\mid B_i)\cdot P_F(B_i) < \sum_{i=1}^{4}P_M(A\mid B_i)\cdot P_M(B_i) = P_M(A).
\]
\[
P_F(B_1) = \frac{|B_1\cap F|}{|F|} = \frac{108}{1067}\approx\frac{1}{10}\,,
\]
etc., and observe that male students mainly applied at faculties with a high probability of admission, whereas female students mainly applied at faculties with a low probability of admission.
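The aggregation effect (Simpson's paradox) can be replayed directly from the table above:

```python
# Admissions data from the table: faculty -> (applicants, accepted).
male   = {"B1": (826, 551), "B2": (560, 353), "B3": (325, 110), "B4": (373, 22)}
female = {"B1": (108, 89),  "B2": (25, 17),   "B3": (593, 219), "B4": (341, 24)}

# Per-faculty acceptance rates are higher for female students in every faculty...
for b in male:
    assert male[b][1] / male[b][0] < female[b][1] / female[b][0]

# ...yet the total acceptance rate is higher for male students:
pm = sum(a for _, a in male.values()) / sum(n for n, _ in male.values())
pf = sum(a for _, a in female.values()) / sum(n for n, _ in female.values())
print(pm, pf)  # pm = 1036/2084 > pf = 349/1067
```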
Proof.
\[
P(B_i\mid A) = \frac{P(A\cap B_i)}{P(A)} \overset{1.4}{=} \frac{P(A\mid B_i)\cdot P(B_i)}{\displaystyle\sum_{j=1}^{n}P(A\mid B_j)\cdot P(B_j)}\,.
\]
Example 1.7 (A posteriori probabilities in medical tests). Suppose that one out of 145 persons of the same age has the disease $D$, i.e. the a priori probability of having $D$ is $P(D)=\frac{1}{145}$.
Suppose now that a medical test for D is given which detects D in 96 % of
all cases, i.e.
P(positive | D) = 0.96 .
However, the test is also positive in 6 % of the cases where the person does not have $D$, i.e.
\[
P(\text{positive}\mid D^c) = 0.06\,.
\]
Suppose now that the test is positive. What is the a posteriori probability of
actually having D?
So we are interested in the conditional probability P(D | positive):
\[
P(D\mid\text{positive}) \overset{1.6}{=} \frac{P(\text{positive}\mid D)\cdot P(D)}{P(\text{positive}\mid D)\cdot P(D)+P(\text{positive}\mid D^c)\cdot P(D^c)}
= \frac{0.96\cdot\frac{1}{145}}{0.96\cdot\frac{1}{145}+0.06\cdot\frac{144}{145}}
= \frac{1}{1+6\cdot\frac{144}{96}} = \frac{1}{10}\,.
\]
Note: in only one out of ten cases, a person with a positive result actually has
D.
Another conditional probability of interest in this context is the probability of
not having D, once the test is negative, i.e., P(Dc | negative):
\[
P(D^c\mid\text{negative}) = \frac{P(\text{negative}\mid D^c)\cdot P(D^c)}{P(\text{negative}\mid D)\cdot P(D)+P(\text{negative}\mid D^c)\cdot P(D^c)}
= \frac{0.94\cdot\frac{144}{145}}{0.04\cdot\frac{1}{145}+0.94\cdot\frac{144}{145}}
= \frac{94\cdot 144}{4+94\cdot 144}\approx 0.9997.
\]
Note: The two conditional probabilities interchange if the a priori probability of not having $D$ is low (e.g. $\frac{1}{145}$). If the risk of having $D$ is high and one wants to test whether or not one has $D$, the a posteriori probability of not having $D$, given that the test was negative, is only 0.1.
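Both a posteriori probabilities follow from one application of Bayes' formula; a small sketch (the helper name `posterior` is ours, and in the second call the roles of "positive"/"negative" are swapped):

```python
def posterior(prior, true_rate, false_rate):
    """P(hypothesis | evidence) by Bayes' formula (Proposition 1.6):
    true_rate  = P(evidence | hypothesis),
    false_rate = P(evidence | complement)."""
    num = true_rate * prior
    return num / (num + false_rate * (1 - prior))

# P(D | positive): prior 1/145, P(pos|D) = 0.96, P(pos|D^c) = 0.06
p_d_pos = posterior(1 / 145, 0.96, 0.06)
# P(D^c | negative): prior 144/145, P(neg|D^c) = 0.94, P(neg|D) = 0.04
p_dc_neg = posterior(144 / 145, 0.94, 0.04)
print(p_d_pos, p_dc_neg)  # 0.1 and about 0.9997
```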
Example 1.8 (computing total probabilities with conditional probabilities). Let
S be a finite set, Ω := S n+1 , n ∈ N, and P be a probability measure on Ω.
Let Xi : Ω → S, i = 0, . . . , n, be the canonical projections Xi (ω) := xi for
ω = (x0 , . . . , xn ).
If we interpret 0, 1, . . . , n as time points, then (Xi )06i6n may be seen as a
stochastic process and X0 (ω), . . . , Xn (ω) is said to be a sample path (or a
trajectory ) of the process.
For all $\omega\in\Omega$ we either have $P(\{\omega\})=0$ or
\[
\begin{aligned}
P(\{\omega\}) &= P(X_0=x_0,\dots,X_n=x_n)\\
&= P(X_0=x_0,\dots,X_{n-1}=x_{n-1})\cdot P(X_n=x_n\mid X_0=x_0,\dots,X_{n-1}=x_{n-1})\\
&\;\;\vdots\\
&= P(X_0=x_0)\cdot P(X_1=x_1\mid X_0=x_0)\cdot P(X_2=x_2\mid X_0=x_0,X_1=x_1)\\
&\qquad\cdots\, P(X_n=x_n\mid X_0=x_0,\dots,X_{n-1}=x_{n-1}).
\end{aligned}
\]
Example 1.9. A stochastic process is called a Markov chain, if
$K(x_1,\,\cdot\,) := \mu\quad\forall\,x_1\in S_1$ (no coupling!)
Example 2.3. (i) Transition probabilities of the random walk on $\mathbb{Z}^d$:
$S_1 = S_2 = S := \mathbb{Z}^d$ with $\mathcal{S} := \mathcal{P}(\mathbb{Z}^d)$,
\[
K(x,\,\cdot\,) := \frac{1}{2d}\sum_{y\in N(x)}\delta_y\,,\qquad x\in\mathbb{Z}^d,
\]
with
\[
N(x) := \big\{y\in\mathbb{Z}^d \,\big|\, \|x-y\|=1\big\}
\]
(ii) Ehrenfest model. Consider a box containing $N$ balls. The box is divided into two parts ("left" and "right"). A ball is selected randomly and put into the other half.
On the "microscopic level" the state space is $S := \{0,1\}^N$ with $x=(x_1,\dots,x_N)\in S$ defined by
\[
x_i := \begin{cases}1 & \text{if the $i$th ball is contained in the “left” half}\\ 0 & \text{if the $i$th ball is contained in the “right” half}\end{cases}
\]
\[
K(x,\,\cdot\,) := \frac{1}{N}\sum_{i=1}^{N}\delta_{(x_1,\dots,x_{i-1},\,1-x_i,\,x_{i+1},\dots,x_N)}\,.
\]
On the macroscopic level (the number $j$ of balls in the "left" half) the transition probabilities are
\[
K(j,\,\cdot\,) := \frac{N-j}{N}\cdot\delta_{j+1} + \frac{j}{N}\cdot\delta_{j-1}\,.
\]
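The macroscopic chain is easy to simulate; the sketch below records occupation frequencies, which approach the binomial distribution with parameters $N$ and $\frac12$ (shown to be stationary later in this chapter); $N=10$ and the step count are arbitrary choices:

```python
import random

def ehrenfest(N, steps, seed=0):
    """Macroscopic Ehrenfest chain: j -> j+1 w.p. (N-j)/N, j -> j-1 w.p. j/N."""
    rng = random.Random(seed)
    j = 0                        # start with all balls in the "right" half
    visits = [0] * (N + 1)
    for _ in range(steps):
        j = j + 1 if rng.random() < (N - j) / N else j - 1
        visits[j] += 1
    return visits

N = 10
visits = ehrenfest(N, steps=200_000)
freq = [v / sum(visits) for v in visits]
print(freq[N // 2])  # close to binom(10,5)/2**10 = 0.2461...
```

Extreme states (all balls on one side) are visited exponentially rarely, which is the probabilistic resolution of the apparent conflict with recurrence.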
Our aim is to construct a probability measure $P$ $(:=\mu_1\otimes K)$ on the product space $(\Omega,\mathcal{A})$, where
\[
\Omega := S_1\times S_2\,,
\qquad
\mathcal{A} := \mathcal{S}_1\otimes\mathcal{S}_2 := \sigma(X_1,X_2) \overset{!}{=} \sigma\big(\{A_1\times A_2\mid A_1\in\mathcal{S}_1,\,A_2\in\mathcal{S}_2\}\big),
\]
and
\[
X_i:\Omega=S_1\times S_2\to S_i\,,\qquad (x_1,x_2)\mapsto x_i\,,\qquad i=1,2,
\]
satisfying
\[
P(A_1\times A_2) = \int_{A_1}K(x_1,A_2)\,\mu_1(dx_1)\,.
\]
\[
\Omega := S_1\times S_2\,,\tag{3.1}
\]
\[
\mathcal{A} := \sigma\big(\{A_1\times A_2\mid A_i\in\mathcal{S}_i\}\big) =: \mathcal{S}_1\otimes\mathcal{S}_2\,.\tag{3.2}
\]
Then there exists a probability measure $P$ $(=:\mu_1\otimes K)$ on $(\Omega,\mathcal{A})$, such that for all $\mathcal{A}$-measurable functions $f\ge 0$
\[
\int_\Omega f\,dP = \iint f(x_1,x_2)\,K(x_1,dx_2)\,\mu_1(dx_1),\tag{3.3}
\]
Here
(Figure: a set $A\subset S_1\times S_2$ with its sections; the $x_1$-section consists of the pieces $A_{x_1}^{(1)},A_{x_1}^{(2)},A_{x_1}^{(3)}$ and the $x_2$-section of the pieces $A_{x_2}^{(1)},A_{x_2}^{(2)}$.)
Note
\[
A_{x_1} = A_{x_1}^{(1)}\cup A_{x_1}^{(2)}\cup A_{x_1}^{(3)}
\quad\text{and}\quad
A_{x_2} = A_{x_2}^{(1)}\cup A_{x_2}^{(2)}\,.
\]
is $\mathcal{S}_2/\mathcal{B}(\mathbb{R})$-measurable.
Suppose now that $f\ge 0$ is $\mathcal{A}$-measurable. Then
\[
x_1\mapsto\int f(x_1,x_2)\,K(x_1,dx_2) = \int f_{x_1}(x_2)\,K(x_1,dx_2)\tag{3.6}
\]
is well-defined.
We will show in the following that this function is $\mathcal{S}_1$-measurable. We will prove the assertion for $f=1_A$, $A\in\mathcal{A}$, first. For general $f$ the measurability then follows by measure-theoretic induction.
Note that for $f=1_A$ we have that
\[
\int\underbrace{1_A(x_1,x_2)}_{=1_{A_{x_1}}(x_2)}\,K(x_1,dx_2) = K(x_1,A_{x_1})\,.
\]
For all $A\in\mathcal{A}$ we can now define
\[
P(A) := \int\Big(\int\underbrace{1_A(x_1,x_2)}_{=1_{A_{x_1}}(x_2)}\,K(x_1,dx_2)\Big)\mu(dx_1) = \int K(x_1,A_{x_1})\,\mu(dx_1)\,.
\]
For the proof of the σ-additivity, let A1 , A2 , . . . be pairwise disjoint subsets in
A. It follows that for all x1 ∈ S1 the subsets (A1 )x1 , (A2 )x1 , . . . are pairwise
disjoint too, hence
\[
P\Big(\bigcup_{n\in\mathbb{N}}A_n\Big) = \int K\Big(x_1,\Big(\bigcup_{n\in\mathbb{N}}A_n\Big)_{x_1}\Big)\,\mu(dx_1)
= \int\sum_{n=1}^{\infty}K\big(x_1,(A_n)_{x_1}\big)\,\mu(dx_1)
= \sum_{n=1}^{\infty}\int K\big(x_1,(A_n)_{x_1}\big)\,\mu(dx_1) = \sum_{n=1}^{\infty}P(A_n).
\]
In the second equality we used that K(x1 , ·) is a probability measure for all x1
and in the third equality we used monotone integration.
Finally, (3.3) follows from measure-theoretic induction.
and
\[
(P\circ X_2^{-1})(A_2) = P(X_2\in A_2) = P(S_1\times A_2)
= \int K(x_1,A_2)\,\mu_1(dx_1) =: (\mu_1 K)(A_2)\,.
\]
So, the marginal distributions are
\[
P\circ X_1^{-1} = \mu_1\,,\qquad P\circ X_2^{-1} = \mu_1 K\,.
\]
\[
\begin{aligned}
&= \mu(\{x+1\})\cdot\frac{x+1}{N} + \mu(\{x-1\})\cdot\frac{N-(x-1)}{N}\\
&= 2^{-N}\binom{N}{x+1}\cdot\frac{x+1}{N} + 2^{-N}\binom{N}{x-1}\cdot\frac{N-(x-1)}{N}\\
&= 2^{-N}\bigg[\binom{N-1}{x}+\binom{N-1}{x-1}\bigg] = 2^{-N}\cdot\binom{N}{x}\\
&= \mu(\{x\}).
\end{aligned}
\]
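The invariance computed above can be checked mechanically for a concrete $N$ ($N=12$ below is an arbitrary choice):

```python
from math import comb

# Check that mu({x}) = 2^{-N} binom(N, x) is invariant for the macroscopic
# Ehrenfest kernel K(j, .) = (N-j)/N * delta_{j+1} + j/N * delta_{j-1}.
N = 12
mu = [comb(N, x) / 2**N for x in range(N + 1)]
muK = [0.0] * (N + 1)
for j in range(N + 1):
    if j < N:
        muK[j + 1] += mu[j] * (N - j) / N    # mass flowing up from j
    if j > 0:
        muK[j - 1] += mu[j] * j / N          # mass flowing down from j
assert all(abs(muK[x] - mu[x]) < 1e-12 for x in range(N + 1))
```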
\[
\mu = N\Big(0,\frac{\sigma^2}{1-\alpha^2}\Big)
\]
\[
P = \mu_1\otimes K\;?
\]
Answer: In most cases yes, e.g. if $S_1$ and $S_2$ are Polish spaces (i.e., topological spaces having a countable basis, whose topology is induced by some complete metric), using conditional expectations (see below).
Example 2.9. In the particular case, when S1 is countable (and S1 = P(S1 )),
we can disintegrate P explicitly as follows: Necessarily, µ1 has to be the distri-
bution of the projection X1 onto the first coordinate. To define the kernel K,
let ν be any probability measure on (S2 , S2 ) and define
\[
K(x_1,A_2) := \begin{cases}
P(X_2\in A_2\mid X_1=x_1) & \text{if } \underbrace{\mu_1(\{x_1\})}_{=P(X_1=x_1)}>0\\[8pt]
\nu(A_2) & \text{if } \mu_1(\{x_1\})=0.
\end{cases}
\]
Then
\[
\begin{aligned}
P(A_1\times A_2) &= P(X_1\in A_1,\,X_2\in A_2) = \sum_{x_1\in A_1}P(X_1=x_1,\,X_2\in A_2)\\
&= \sum_{\substack{x_1\in A_1,\\ \mu_1(\{x_1\})>0}}P(X_1=x_1)\cdot P(X_2\in A_2\mid X_1=x_1)\\
&= \sum_{x_1\in A_1}\mu_1(\{x_1\})\cdot K(x_1,A_2) = \int_{A_1}K(x_1,A_2)\,\mu_1(dx_1)\\
&= (\mu_1\otimes K)(A_1\times A_2),
\end{aligned}
\]
hence $P=\mu_1\otimes K$.
\[
S^n := S_0\times S_1\times\cdots\times S_n\,,
\qquad
\mathcal{S}^n := \mathcal{S}_0\otimes\mathcal{S}_1\otimes\cdots\otimes\mathcal{S}_n = \sigma\big(\{A_0\times\cdots\times A_n\mid A_i\in\mathcal{S}_i\}\big).
\]
• – an initial distribution $\mu_0$ on $(S_0,\mathcal{S}_0)$
  – transition probabilities $K_n\big((x_0,\dots,x_{n-1}),dx_n\big)$
\[
P^0 := \mu_0 \text{ on } S_0\,,
\qquad
P^n := P^{n-1}\otimes K_n \text{ on } S^n = S^{n-1}\times S_n
\]
Note that Fubini's theorem (see Proposition 2.4) implies that for any $\mathcal{S}^n$-measurable function $f:S^n\to\mathbb{R}_+$:
\[
\int f\,dP^n
= \iint P^{n-1}\big(d(x_0,\dots,x_{n-1})\big)\,K_n\big((x_0,\dots,x_{n-1}),dx_n\big)\,f(x_0,\dots,x_{n-1},x_n)
= \cdots
= \int\mu_0(dx_0)\int K_1(x_0,dx_1)\cdots\int K_n\big((x_0,\dots,x_{n-1}),dx_n\big)\,f(x_0,\dots,x_n).
\]
\[
\mathcal{A} := \sigma(X_0,X_1,\dots) = \sigma\Big(\bigcup_{n=1}^{\infty}\mathcal{A}_n\Big).
\]
Our main goal in this section is to construct a probability measure $P$ on $(\Omega,\mathcal{A})$ satisfying
\[
\int f(X_0,\dots,X_n)\,dP = \int f\,dP^n\qquad\forall\,n=0,1,2,\dots,
\]
i.e. the "finite dimensional distributions" $P\circ(X_0,\dots,X_n)^{-1}$ (the joint distributions of $X_0,\dots,X_n$ under $P$) are given by $P^n$ for any $n\ge 0$.
It follows that $P$ is well-defined by (3.8) on $\mathcal{B}=\bigcup_{n=0}^{\infty}\mathcal{A}_n$. $\mathcal{B}$ is an algebra (i.e., a collection of subsets of $\Omega$ containing $\Omega$, that is closed under complements and finite (!) unions), and $P$ is finitely additive on $\mathcal{B}$, since $P$ is ($\sigma$-)additive on $\mathcal{A}_n$ for every $n$. To extend $P$ to a $\sigma$-additive probability measure on $\mathcal{A}=\sigma(\mathcal{B})$ with the help of Caratheodory's extension theorem, it suffices now to show that $P$ is $\emptyset$-continuous, i.e., the following condition is satisfied:
\[
B_n\in\mathcal{B}\,,\ B_n\searrow\emptyset
\;\Rightarrow\;
P(B_n)\xrightarrow{n\to\infty}0\,.
\]
with
\[
A^{n+1}\subset A^n\times S_{n+1}
\]
and we have to show that
\[
P(B_n) = P^n(A^n)\xrightarrow{n\to\infty}0\,.
\]
Note that
\[
P^n(A^n) = \int\mu_0(dx_0)\,f_{0,n}(x_0)
\]
with
\[
f_{0,n}(x_0) := \int K_1(x_0,dx_1)\cdots\int K_n\big((x_0,\dots,x_{n-1}),dx_n\big)\,1_{A^n}(x_0,\dots,x_n).
\]
It is easy to see that the sequence $(f_{0,n})_{n\in\mathbb{N}}$ is decreasing, because
\[
\int K_{n+1}\big((x_0,\dots,x_n),dx_{n+1}\big)\,1_{A^{n+1}}(x_0,\dots,x_{n+1})
\le \int K_{n+1}\big((x_0,\dots,x_n),dx_{n+1}\big)\,1_{A^n\times S_{n+1}}(x_0,\dots,x_{n+1})
= 1_{A^n}(x_0,\dots,x_n)\,,
\]
hence
\[
f_{0,n+1}(x_0) = \int K_1(x_0,dx_1)\cdots\int K_{n+1}\big((x_0,\dots,x_n),dx_{n+1}\big)\,1_{A^{n+1}}(x_0,\dots,x_{n+1})
\le \int K_1(x_0,dx_1)\cdots\int K_n\big((x_0,\dots,x_{n-1}),dx_n\big)\,1_{A^n}(x_0,\dots,x_n)
= f_{0,n}(x_0)\,.
\]
with
\[
f_{1,n}(x_1) := \int K_2\big((\bar x_0,x_1),dx_2\big)\cdots\int K_n\big((\bar x_0,x_1,\dots,x_{n-1}),dx_n\big)\,1_{A^n}(\bar x_0,x_1,\dots,x_n)\,.
\]
Using the same argument as above (now with µ1 = K1 (x̄0 , ·)) we can find some
x̄1 ∈ S1 with
Iterating this procedure, we find for any $i=0,1,\dots$ some $\bar x_i\in S_i$ such that for all $m\ge 1$
\[
\inf_{n\in\mathbb{N}}\int K_m\big((\bar x_0,\dots,\bar x_{m-1}),dx_m\big)\cdots\int K_n\big((\bar x_0,\dots,\bar x_{m-1},x_m,\dots,x_{n-1}),dx_n\big)\,1_{A^n}(\bar x_0,\dots,\bar x_{m-1},x_m,\dots,x_n) > 0\,.
\]
In particular, if $m=n$,
\[
0 < \int K_m\big((\bar x_0,\dots,\bar x_{m-1}),dx_m\big)\,1_{A^m}(\bar x_0,\dots,\bar x_{m-1},x_m)\,,
\]
i.e.
\[
\bar\omega\in\bigcap_{m=0}^{\infty}B_m\,.
\]
3.2 Examples
1) Infinite product measures: Consider the situation of Definition 3.2 and
let
Then
\[
\bigotimes_{n=0}^{\infty}\mu_n := P
\]
\[
P(X_0\in A_0,\dots,X_n\in A_n) \overset{\text{I.-T.}}{=} P^n(A_0\times\cdots\times A_n)
= \int\mu_0(dx_0)\int\mu_1(dx_1)\cdots\int\mu_n(dx_n)\,1_{A_0\times\cdots\times A_n}(x_0,\dots,x_n)
\]
2) Markov chains: time-inhomogeneous,
\[
K_n\big((x_0,\dots,x_{n-1}),\,\cdot\,\big) = \tilde K_n(x_{n-1},\,\cdot\,)
\]
For $n\ge 1$
\[
E[X_n^2] \overset{\text{I.-T.}}{=} \int x_n^2\,P^n\big(d(x_0,\dots,x_n)\big)
= \int\underbrace{\int x_n^2\,K(x_{n-1},dx_n)}_{=\,\beta x_{n-1}^2,\ \text{since }K(x_{n-1},dx_n)=N(0,\beta x_{n-1}^2)}\,P^{n-1}\big(d(x_0,\dots,x_{n-1})\big)
= \beta\cdot E[X_{n-1}^2] = \cdots = \beta^n x_0^2\,.
\]
hence $\sum_{n=1}^{\infty}X_n^2<\infty$ $P$-a.s., and therefore
\[
\lim_{n\to\infty}X_n = 0\quad P\text{-a.s.}
\]
because
\[
\int|X_n|\,K(x_{n-1},dx_n) = \sqrt{\frac{2}{\pi}}\cdot\sqrt{\beta}\,|x_{n-1}|\,.
\]
Consequently,
\[
E\Big[\sum_{n=1}^{\infty}|X_n|\Big] = \sum_{n=1}^{\infty}\Big(\frac{2}{\pi}\cdot\beta\Big)^{\frac{n}{2}}\cdot|x_0|\,,
\]
\[
\lim_{n\to\infty}X_n = 0\quad P\text{-a.s.}
\]
In fact, if we define
\[
\beta_0 := \exp\Big(-\frac{4}{\sqrt{2\pi}}\int_0^{\infty}\log x\cdot e^{-\frac{x^2}{2}}\,dx\Big) = 2e^{C}\approx 3.56,
\]
where
\[
C := \lim_{n\to\infty}\Big(\sum_{k=1}^{n}\frac{1}{k}-\log n\Big)\approx 0.577
\]
Proof. It is easy to see that for all $n$: $X_n\ne 0$ $P$-a.s. For $n\in\mathbb{N}$ we can then define
\[
Y_n := \begin{cases}\dfrac{X_n}{X_{n-1}} & \text{on } \{X_{n-1}\ne 0\}\\[4pt] 0 & \text{on } \{X_{n-1}=0\}.\end{cases}
\]
Then $Y_1,Y_2,\dots$ are independent r.v. with (identical) distribution $N(0,\beta)$, because for all measurable functions $f:\mathbb{R}^n\to\mathbb{R}_+$
\[
\begin{aligned}
\int f(Y_1,\dots,Y_n)\,dP
&\overset{\text{I.-T.}}{=} \int f\Big(\frac{x_1}{x_0},\dots,\frac{x_n}{x_{n-1}}\Big)\cdot\Big(\frac{1}{2\pi\beta}\Big)^{\frac{n}{2}}\cdot\Big(\frac{1}{x_0^2\cdots x_{n-1}^2}\Big)^{\frac12}\\
&\qquad\cdot\exp\Big(-\frac{x_1^2}{2\beta x_0^2}-\cdots-\frac{x_n^2}{2\beta x_{n-1}^2}\Big)\,dx_1\dots dx_n\\
&= \int f(y_1,\dots,y_n)\cdot\Big(\frac{1}{2\pi\beta}\Big)^{\frac{n}{2}}\cdot\exp\Big(-\frac{y_1^2+\cdots+y_n^2}{2\beta}\Big)\,dy_1\dots dy_n\,.
\end{aligned}
\]
Note that
and thus
\[
\frac{1}{n}\cdot\log|X_n| = \frac{1}{n}\cdot\log|x_0| + \frac{1}{n}\sum_{i=1}^{n}\log|Y_i|\,.
\]
Note that $(\log|Y_i|)_{i\in\mathbb{N}}$ are independent and identically distributed with
\[
E\big[\log|Y_i|\big] = 2\cdot\frac{1}{\sqrt{2\pi\beta}}\int_0^{\infty}\log x\cdot e^{-\frac{x^2}{2\beta}}\,dx\,.
\]
Kolmogorov's law of large numbers now implies that
\[
\lim_{n\to\infty}\frac{1}{n}\cdot\log|X_n| = \frac{2}{\sqrt{2\pi\beta}}\int_0^{\infty}\log x\cdot e^{-\frac{x^2}{2\beta}}\,dx\quad P\text{-a.s.}
\]
Consequently,
\[
|X_n|\xrightarrow{n\to\infty}0 \text{ with exponential rate, if } \int\cdots<0\,,
\qquad
|X_n|\xrightarrow{n\to\infty}\infty \text{ with exponential rate, if } \int\cdots>0\,.
\]
Note that
\[
\frac{2}{\sqrt{2\pi\beta}}\int_0^{\infty}\log x\cdot e^{-\frac{x^2}{2\beta}}\,dx
\;\overset{y=\frac{x}{\sqrt{\beta}}}{=}\; \frac{2}{\sqrt{2\pi}}\int_0^{\infty}\log\big(\sqrt{\beta}\,y\big)\cdot e^{-\frac{y^2}{2}}\,dy
= \frac12\cdot\log\beta + \frac{2}{\sqrt{2\pi}}\int_0^{\infty}\log y\cdot e^{-\frac{y^2}{2}}\,dy
\;<\;0
\quad\Longleftrightarrow\quad \beta<\beta_0\,.
\]
It remains to check that
\[
-\frac{4}{\sqrt{2\pi}}\int_0^{\infty}\log x\cdot e^{-\frac{x^2}{2}}\,dx = \log 2 + C\,,
\]
where $C$ is the Euler–Mascheroni constant (Exercise!).
Example 3.5. Consider independent 0-1-experiments with success probability $p\in[0,1]$, but suppose that $p$ is unknown. In the canonical model:
\[
S_i := \{0,1\},\ i\in\mathbb{N};\qquad \Omega := \{0,1\}^{\mathbb{N}},
\qquad
X_i:\Omega\to\{0,1\},\ i\in\mathbb{N},\ \text{projections},
\]
\[
\mu_i := p\delta_1+(1-p)\delta_0,\ i\in\mathbb{N};\qquad P_p := \bigotimes_{i=1}^{\infty}\mu_i
\]
An and A are defined as above.
Since p is unknown, we choose an a priori distribution µ on [0, 1], B([0, 1])
(as a distribution for the unknown parameter p).
Claim: K(p, · ) := Pp ( · ) is a transition probability from [0, 1], B([0, 1]) to
(Ω, A).
Proof. We only need to show that for given A ∈ A the mapping p 7→ Pp (A) is
measurable on [0, 1]. To this end define
{X1 = x1 , . . . , Xn = xn }, n ∈ N, x1 , . . . , xn ∈ {0, 1} ,
because
\[
P_p(X_1=x_1,\dots,X_n=x_n) = p^{\sum_{i=1}^{n}x_i}\,(1-p)^{n-\sum_{i=1}^{n}x_i}
\]
on (Ω, A). The integral can be seen as a mixture of $P_p$ according to the a priori
distribution µ.
Note: The Xi are no longer independent under P!
We now calculate the initial distribution PX1 and the transition probabilities in
the particular case where µ is the Lebesgue measure (i.e., the uniform distribu-
tion on the unknown parameter p):
\[
P\circ X_1^{-1} = \int\big(p\delta_1+(1-p)\delta_0\big)(\,\cdot\,)\,\mu(dp)
= \int p\,\mu(dp)\cdot\delta_1 + \int(1-p)\,\mu(dp)\cdot\delta_0
= \frac12\cdot\delta_1 + \frac12\cdot\delta_0\,.
\]
For given $n\in\mathbb{N}$ and $x_1,\dots,x_n\in\{0,1\}$ with $k:=\sum_{i=1}^{n}x_i$ it follows that
\[
P(X_{n+1}=1\mid X_1=x_1,\dots,X_n=x_n)
= \frac{P(X_{n+1}=1,\,X_n=x_n,\dots,X_1=x_1)}{P(X_n=x_n,\dots,X_1=x_1)}
\overset{(3.9)}{=} \frac{\displaystyle\int p^{k+1}(1-p)^{n-k}\,\mu(dp)}{\displaystyle\int p^{k}(1-p)^{n-k}\,\mu(dp)}
\]
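With $\mu$ the Lebesgue measure this ratio of Beta integrals evaluates to $\frac{k+1}{n+2}$, Laplace's rule of succession; a quadrature sketch (midpoint rule; $n=10$, $k=7$ are hypothetical values):

```python
def beta_integral(a, b, steps=200_000):
    """Midpoint-rule approximation of the integral of p^a (1-p)^b over [0, 1]."""
    h = 1.0 / steps
    return sum(((i + 0.5) * h)**a * (1 - (i + 0.5) * h)**b
               for i in range(steps)) * h

n, k = 10, 7
ratio = beta_integral(k + 1, n - k) / beta_integral(k, n - k)
print(ratio, (k + 1) / (n + 2))  # both about 0.6667
```

Note how the predictive probability of a further success depends on all past observations only through $k$, reflecting that the $X_i$ are exchangeable but no longer independent under $P$.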
\[
\mu_n := P\circ X_n^{-1}\,,\qquad n\in\mathbb{N}_0\,.
\]
Then:
\[
X_n,\ n\in\mathbb{N},\ \text{independent}
\quad\Longleftrightarrow\quad
P = \bigotimes_{n=0}^{\infty}\mu_n
\quad\Big(\text{i.e. } P^n=\bigotimes_{k=0}^{n}\mu_k\ \ \forall\,n\ge 0\Big).
\]
Proof. Let $\tilde P := \bigotimes_{n=0}^{\infty}\mu_n$. Then
\[
P = \tilde P
\quad\Longleftrightarrow\quad
P(X_0\in A_0,\dots,X_n\in A_n) = \tilde P(X_0\in A_0,\dots,X_n\in A_n)
= \prod_{i=0}^{n}\mu_i(A_i) = \prod_{i=0}^{n}P(X_i\in A_i),
\]
Definition 3.7. Let Si := S, i ∈ N0 , (Ω, A) be the canonical model and P be
a probability measure on (Ω, A). In particular, (Xn )n>0 is a stochastic process
in the sense of Definition 3.2. Let J ⊂ N0 , |J| < ∞. Then the distribution of
(Xj )j∈J under P
µ{0,...,n} , n ∈ N.
4 Stationarity
Let (S, S) be a measurable space, Ω = S N0 and (Ω, A) be the associated
canonical model. Let P be a probability measure on (Ω, A).
$\omega = (x_0,x_1,\dots)\;\mapsto\;T\omega := (x_1,x_2,\dots)$
In particular: T is A/A-measurable
P ◦ T −1 = P .
µ{0,...,n} = µ{k,...,k+n} .
Proof.
\[
P\circ T^{-1} = P
\;\Longleftrightarrow\;
P\circ T^{-k} = P\quad\forall\,k\in\mathbb{N}_0
\;\overset{3.8}{\Longleftrightarrow}\;
(P\circ T^{-k})\circ(X_0,\dots,X_n)^{-1} = P\circ(X_0,\dots,X_n)^{-1}\quad\forall\,k,n\in\mathbb{N}_0
\;\Longleftrightarrow\;
\mu_{\{k,\dots,n+k\}} = \mu_{\{0,\dots,n\}}\,.
\]
Remark 4.5. (i) The last proposition implies in the particular case
\[
P = \bigotimes_{n=0}^{\infty}\mu_n\quad\text{with }\mu_n := P\circ X_n^{-1}
\]
that
\[
P \text{ stationary}\quad\Longleftrightarrow\quad\mu_n=\mu_0\ \ \forall\,n\ge 0.
\]
ergodic, i.e.
\[
P = 0\text{--}1 \ \text{ on } \ \mathcal{I} := \{A\in\mathcal{A}\mid T^{-1}(A)=A\}\,.
\]
I is called the σ-algebra of shift-invariant sets.
Proof. Using part (ii) of the previous remark, it suffices to show that I ⊂ A∗ .
But