
Chapter 5

Elements of Probability Theory

The purpose of this chapter is to summarize some important concepts and results in proba-
bility theory. Of particular interest to us are the limit theorems which are powerful tools to
analyze the convergence behaviors of econometric estimators and test statistics. These prop-
erties are the core of the asymptotic analysis in subsequent chapters. For a more complete
and thorough treatment of probability theory see Davidson (1994) and other probability
textbooks, such as Ash (1972) and Billingsley (1979). Bierens (1994), Gallant (1997) and
White (2001) also provide concise coverages of the topics in this chapter. Many results here
are taken freely from the references cited above; we will not refer to them again in the text
unless it is necessary.

5.1 Probability Space and Random Variables

5.1.1 Probability Space

The probability space associated with a random experiment is determined by three compo-
nents: the outcome space Ω whose element ω is an outcome of the experiment, a collection
of events F whose elements are subsets of Ω, and a probability measure IP assigned to the
elements in F.

Given the subset A of Ω, its complement is defined as A^c = {ω ∈ Ω : ω ∉ A}. In


the probability space (Ω, F, IP), F is a σ-algebra (σ-field) in the sense that it satisfies the
following requirements.

1. Ω ∈ F.

2. If A ∈ F, then A^c ∈ F.

3. If A1 , A2 , . . . are in F, then ∪_{n=1}^∞ An ∈ F.


The first and second properties together imply that Ω^c = ∅ is also in F. Combining the
second and third properties we have from de Morgan's law that

(∪_{n=1}^∞ An)^c = ∩_{n=1}^∞ An^c ∈ F.

A σ-algebra is thus closed under complementation, countable union and countable intersection.

The probability measure IP : F → [0, 1] is a real-valued set function satisfying the
following axioms.

1. IP(Ω) = 1.

2. IP(A) ≥ 0 for all A ∈ F.


3. If A1 , A2 , . . . ∈ F are disjoint, then IP(∪_{n=1}^∞ An) = Σ_{n=1}^∞ IP(An).

From these axioms we easily deduce that IP(∅) = 0, IP(A^c) = 1 − IP(A), IP(A) ≤ IP(B) if
A ⊆ B, and

IP(A ∪ B) = IP(A) + IP(B) − IP(A ∩ B).

Moreover, if {An } is an increasing (decreasing) sequence in F with the limiting set A, then
lim_{n→∞} IP(An) = IP(A).

Let C be a collection of subsets of Ω. The intersection of all the σ-algebras that contain
C is the smallest σ-algebra containing C; see Exercise 5.1. This σ-algebra is referred to as
the σ-algebra generated by C, denoted as σ(C). When Ω = R, the Borel field is the σ-algebra
generated by all open intervals (a, b) in R, usually denoted as B. Note that open intervals,
closed intervals [a, b], half-open intervals (a, b] or half lines (−∞, b] can be obtained from
each other by taking complement, union and/or intersection. For example,
(a, b] = ∩_{n=1}^∞ (a, b + 1/n),    (a, b) = ∪_{n=1}^∞ (a, b − 1/n].

Thus, the collection of all closed intervals (half-open intervals, half lines) generates the same
Borel field. This is why open intervals, closed intervals, half-open intervals and half lines
are also known as Borel sets. The Borel field on Rd , denoted as B d , is generated by all open
hypercubes:

(a1 , b1 ) × (a2 , b2 ) × · · · × (ad , bd ).

Equivalently, B d can be generated by all closed hypercubes:

[a1 , b1 ] × [a2 , b2 ] × · · · × [ad , bd ],


or by

(−∞, b1 ] × (−∞, b2 ] × · · · × (−∞, bd ].

The sets that generate the Borel field B d are all Borel sets.

5.1.2 Random Variables

A random variable z defined on (Ω, F, IP) is a function z : Ω → R such that for every B in
the Borel field B, the inverse image of B under z is in F, i.e.,

z −1 (B) = {ω : z(ω) ∈ B} ∈ F.

We also say that z is an F/B-measurable (or simply F-measurable) function. Non-measurable


functions are very exceptional in practice and hence are not of general interest. Given the
random outcome ω, the resulting value z(ω) is known as a realization of z. The realiza-
tion of z varies with ω and hence is governed by the random mechanism of the underlying
experiment.

An Rd -valued random variable (random vector) z defined on (Ω, F, IP) is a function
z : Ω → Rd such that for every B ∈ B d ,

z −1 (B) = {ω : z(ω) ∈ B} ∈ F;

that is, z is an F/B d -measurable function. Given the random vector z, its inverse images
z −1 (B) form a σ-algebra, denoted as σ(z). This σ-algebra must be in F, and it is the
smallest σ-algebra contained in F such that z is measurable. This is known as the σ-
algebra generated by z or, more intuitively, the information set associated with z.

A function g : R → R is said to be B-measurable or Borel measurable if

{ζ ∈ R : g(ζ) ≤ b} ∈ B for every b ∈ R.

If z is a random variable defined on (Ω, F, IP), then g(z) is also a random variable defined
on the same probability space provided that g is Borel measurable. Note that the func-
tions we usually encounter (e.g., continuous functions and integrable functions) are Borel
measurable. Similarly, for the d-dimensional random vector z, g(z) is a random variable
provided that g is B d -measurable.

Recall from Section 2.1 that the joint distribution function of z is the non-decreasing,
right-continuous function Fz such that for ζ = (ζ1 , . . . , ζd)′ ∈ Rd ,

Fz (ζ) = IP{ω ∈ Ω : z1 (ω) ≤ ζ1 , . . . , zd (ω) ≤ ζd },


with

lim_{ζ1→−∞, ..., ζd→−∞} Fz(ζ) = 0,    lim_{ζ1→∞, ..., ζd→∞} Fz(ζ) = 1.

The marginal distribution function of the i th component of z is such that

Fzi (ζi ) = IP{ω ∈ Ω : zi (ω) ≤ ζi } = Fz (∞, . . . , ∞, ζi , ∞, . . . , ∞).

Note that while IP is a set function defined on F, the distribution function of z is a point
function defined on Rd .
Two random variables y and z are said to be (pairwise) independent if, and only if, for
any Borel sets B1 and B2 ,

IP(y ∈ B1 and z ∈ B2 ) = IP(y ∈ B1 ) IP(z ∈ B2 ).

This immediately leads to the standard definition of independence: y and z are independent
if, and only if, their joint distribution is the product of their marginal distributions, as in
Section 2.1. A sequence of random variables {zi } is said to be totally independent if
IP(∩_{all i} {zi ∈ Bi}) = ∏_{all i} IP(zi ∈ Bi),

for any Borel sets Bi . In what follows, a totally independent sequence will be referred to as
an independent sequence or a sequence of independent variables for convenience. For an
independent sequence, we have the following generalization of Lemma 2.1.

Lemma 5.1 Let {zi } be a sequence of independent random variables and hi , i = 1, 2, . . .,


be Borel-measurable functions. Then {hi (zi )} is also a sequence of independent random
variables.

5.1.3 Moments and Norms

The expectation of the i th element of z is

IE(zi) = ∫_Ω zi(ω) dIP(ω),

where the right-hand side is a Lebesgue integral. In view of the distribution function defined
above, a change of ω causes the realization of z to change, so that

IE(zi) = ∫_{Rd} ζi dFz(ζ) = ∫_R ζi dFzi(ζi),

where Fzi is the marginal distribution function of the i th component of z, as defined in


Section 2.2. For the Borel measurable function g of z,
IE[g(z)] = ∫_Ω g(z(ω)) dIP(ω) = ∫_{Rd} g(ζ) dFz(ζ).


Other moments, such as variance and covariance, can also be defined as Lebesgue integrals
with respect to the probability measure; see Section 2.2.

A function g is said to be convex on a set S if for any a ∈ [0, 1] and any x, y in S,

g(ax + (1 − a)y) ≤ ag(x) + (1 − a)g(y);

g is concave on S if the inequality above is reversed. For example, g(x) = x² is convex,


and g(x) = log x for x > 0 is concave. The result below is concerned with convex (concave)
transformations of random variables.

Lemma 5.2 (Jensen) For the Borel measurable function g that is convex on the support
of the integrable random variable z, suppose that g(z) is also integrable. Then,

g(IE(z)) ≤ IE[g(z)];

the inequality reverses if g is concave.

For the random variable z with finite p th moment, let ‖z‖p = [IE(|z|^p)]^{1/p} denote its
Lp -norm. Also define the inner product of two square integrable random variables zi and
zj as their cross moment:

⟨zi , zj⟩ = IE(zi zj).

Then, the L2 -norm can be obtained from the inner product as ‖zi‖2 = ⟨zi , zi⟩^{1/2}. It is easily
seen that for any c > 0 and p > 0,

c^p IP(|z| ≥ c) = c^p ∫_R 1{ζ:|ζ|≥c} dFz(ζ) ≤ ∫_{ζ:|ζ|≥c} |ζ|^p dFz(ζ) ≤ IE|z|^p,

where 1{ζ:|ζ|≥c} is the indicator function which equals one if |ζ| ≥ c and equals zero other-
wise. This establishes the following result.

Lemma 5.3 (Markov) Let z be a random variable with finite p th moment. Then,

IP(|z| ≥ c) ≤ IE|z|^p / c^p,

where c is a positive real number.

For p = 2, Lemma 5.3 is also known as the Chebyshev inequality. If c is small such that
IE|z|^p / c^p > 1, Markov's inequality is trivial. When c becomes large, the probability that z
assumes very extreme values vanishes at the rate c^{−p}.
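
As a quick numerical illustration, the following minimal Python sketch (the exponential distribution, the sample size, and the values of p and c are arbitrary choices made only for this illustration) compares a Monte Carlo estimate of IP(|z| ≥ c) with the Markov bound IE|z|^p / c^p.

    import numpy as np

    # Minimal sketch: compare IP(|z| >= c) with the Markov bound IE|z|^p / c^p.
    rng = np.random.default_rng(0)
    z = rng.exponential(scale=1.0, size=1_000_000)   # an integrable, nonnegative random variable
    p, c = 2.0, 3.0
    prob = np.mean(np.abs(z) >= c)                   # Monte Carlo estimate of IP(|z| >= c)
    bound = np.mean(np.abs(z) ** p) / c ** p         # Monte Carlo estimate of IE|z|^p / c^p
    print(prob, bound)                               # the bound is larger, as Lemma 5.3 requires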

Another useful result in probability theory is stated below without proof.


Lemma 5.4 (Hölder) Let y be a random variable with finite p th moment (p > 1) and z
a random variable with finite q th moment (q = p/(p − 1)). Then, IE|yz| ≤ ‖y‖p ‖z‖q.

For p = 2, we have IE|yz| ≤ ‖y‖2 ‖z‖2. By noting that |IE(yz)| ≤ IE|yz|, we immediately
have the next result; cf. Lemma 2.3.

Lemma 5.5 (Cauchy-Schwartz) Let y and z be two square integrable random variables.
Then, |IE(yz)| ≤ ‖y‖2 ‖z‖2.

Let y = 1 and x = |z|^p. Then for q > p and r = q/p, Hölder's inequality also ensures that

IE|z|^p ≤ ‖x‖r ‖y‖_{r/(r−1)} = [IE(|z|^{pr})]^{1/r} = [IE(|z|^q)]^{p/q}.

This shows that when a random variable has finite q th moment, it must also have finite
p th moment for any p < q, as stated below.

Lemma 5.6 (Liapunov) Let z be a random variable with finite q th moment. Then for
p < q, ‖z‖p ≤ ‖z‖q.

The inequality below states that the Lp -norm of a finite sum is no greater than the sum of
the individual Lp -norms.

Lemma 5.7 (Minkowski) Let zi , i = 1, . . . , n, be random variables with finite p th moment
(p ≥ 1). Then, ‖Σ_{i=1}^n zi‖p ≤ Σ_{i=1}^n ‖zi‖p.

When there are only two random variables in the sum, this is just the triangle inequality
for Lp -norms; see also Exercise 5.2.
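
The moment inequalities above are easy to check numerically. The following minimal Python sketch (the normal and gamma distributions and the chosen exponents are arbitrary illustrative choices) verifies Jensen's, the Cauchy-Schwartz, Liapunov's and Minkowski's inequalities by Monte Carlo.

    import numpy as np

    # Minimal sketch: Monte Carlo checks of the inequalities in Lemmas 5.2 and 5.5-5.7.
    rng = np.random.default_rng(1)
    y = rng.normal(size=500_000)
    z = rng.gamma(shape=2.0, scale=1.0, size=500_000)     # positive, so log z is defined

    def lp_norm(x, p):
        """Lp-norm ||x||_p = [IE(|x|^p)]^{1/p}, estimated by a sample average."""
        return np.mean(np.abs(x) ** p) ** (1.0 / p)

    print(np.log(np.mean(z)) >= np.mean(np.log(z)))               # Jensen with the concave g = log
    print(abs(np.mean(y * z)) <= lp_norm(y, 2) * lp_norm(z, 2))   # Cauchy-Schwartz
    print(lp_norm(z, 2) <= lp_norm(z, 4))                         # Liapunov: ||z||_2 <= ||z||_4
    print(lp_norm(y + z, 3) <= lp_norm(y, 3) + lp_norm(z, 3))     # Minkowski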

5.2 Conditional Distributions and Moments


Given two events A and B in F, if it is known that B has occurred, the outcome space is
restricted to B, so that the outcomes of A must be in A ∩ B. The likelihood of A is thus
characterized by the conditional probability

IP(A | B) = IP(A ∩ B)/ IP(B),

for IP(B) ≠ 0. It can be shown that IP(·|B) satisfies the axioms for probability measures;
see Exercise 5.3. This concept is readily extended to construct the conditional density function
and the conditional distribution function.


5.2.1 Conditional Distributions

Let y and z denote two integrable random vectors with the joint density function fz,y and
marginal density functions fz and fy . For fy(η) ≠ 0, define the conditional density function
of z given y = η as

fz|y(ζ | y = η) = fz,y(ζ, η) / fy(η),

which is clearly non-negative whenever it is defined. This function also integrates to one
on Rd because

∫_{Rd} fz|y(ζ | y = η) dζ = (1/fy(η)) ∫_{Rd} fz,y(ζ, η) dζ = (1/fy(η)) fy(η) = 1.

Thus, fz|y is a legitimate density function. For example, the bivariate density function of
two random variables z and y forms a surface on the zy-plane. By fixing y = η, we obtain a
cross section (slice) under this surface. Dividing the joint density by the marginal density
fy (η) amounts to adjusting the height of this slice so that the resulting area integrates to
one.

Given the conditional density function fz|y , we have for A ∈ B d ,

IP(z ∈ A | y = η) = ∫_A fz|y(ζ | y = η) dζ.

Note that this conditional probability is defined even when IP(y = η) may be zero. In
particular, when

A = (−∞, ζ1 ] × · · · × (−∞, ζd ],

we obtain the conditional distribution function:

Fz|y (ζ | y = η) = IP(z1 ≤ ζ1 , . . . , zd ≤ ζd | y = η).

When z and y are independent, the conditional density (distribution) simply reduces to
the unconditional density (distribution).

5.2.2 Conditional Moments

Analogous to unconditional expectation, the conditional expectation of the integrable ran-


dom variable zi given the information y = η is

IE(zi | y = η) = ∫_R ζi dFz|y(ζi | y = η);

the conditional expectation of the random vector z is IE(z | y = η) which is defined


elementwise. By allowing y to vary across all possible values η, we obtain the conditional


expectation function IE(z | y) whose value depends on η, the realization of y. Thus,


IE(z | y) is a function of y and hence also a random vector.

More generally, the conditional expectation can be defined by taking a suitable σ-algebra
as the conditioning set. Let G be a sub-σ-algebra of F. The conditional expectation
IE(z | G) is the integrable and G-measurable random variable satisfying
∫_G IE(z | G) dIP = ∫_G z dIP,    for all G ∈ G.

This definition basically says that the conditional expectation with respect to G is such that
its weighted sum is the same as that of z over any G in G. Suppose that G is the trivial
σ-algebra {Ω, ∅}, i.e., the smallest σ-algebra that contains no extra information from any
random vectors. For the conditional expectation with respect to the trivial σ-algebra, it is
readily seen that it must be a constant c with probability one so as to be measurable with
respect to {Ω, ∅}. Then,

IE(z) = ∫_Ω z dIP = ∫_Ω c dIP = c.

That is, the conditional expectation with respect to the trivial σ-algebra is the unconditional
expectation IE(z). Consider now G = σ(y), the σ-algebra generated by y. We also write

IE(z | y) = IE[z | σ(y)],

which is interpreted as the prediction of z given all the information associated with y.

Similar to unconditional expectations, conditional expectations are monotonic: if z ≥ x


with probability one, then IE(z | G) ≥ IE(x | G) with probability one; in particular, if z is
non-negative with probability one, then IE(z | G) ≥ 0 with probability one. Moreover, if z
is independent of y, then IE(z | y) = IE(z). For example, when z is a constant vector c
which is independent of any random variable, IE(z | y) = c. The linearity result below is
analogous to Lemma 2.2 for unconditional expectations.

Lemma 5.8 Let z (d × 1) and y (c × 1) be integrable random vectors and A (n × d) and


B (n × c) be non-stochastic matrices. Then with probability one,

IE(Az + By | G) = A IE(z | G) + B IE(y | G).

If b (n × 1) is a non-stochastic vector, IE(Az + b | G) = A IE(z | G) + b with probability


one.

From the definition of conditional expectation, we immediately have


IE[IE(z | G)] = ∫_Ω IE(z | G) dIP = ∫_Ω z dIP = IE(z);


this is known as the law of iterated expectations. This result also suggests that if conditional
expectations are taken sequentially with respect to a collection of nested σ-algebras, only
the smallest σ-algebra matters. For example, for k random vectors y 1 , . . . , y k ,

IE[IE(z | y 1 , . . . , y k ) | y 1 , . . . , y k−1 ] = IE(z | y 1 , . . . , y k−1 ).

This result is formally stated below.

Lemma 5.9 (Law of Iterated Expectations) Let G and H be two sub-σ-algebras of F


such that G ⊆ H. Then for the integrable random vector z,

IE[IE(z | H) | G] = IE[IE(z | G) | H] = IE(z | G);

in particular, IE[IE(z | G)] = IE(z).

If z is G-measurable, all the information resulting from z is already contained in G, so that
z can be treated as “known” in IE(z | G) and taken out of the conditional expectation.
That is, IE(z | G) = z with probability one. Hence,

IE(zx′ | G) = z IE(x′ | G).

In particular, z can be taken out from the conditional expectation when z itself is a con-
ditioning variable. This result is generalized as follows.

Lemma 5.10 Let z be a G-measurable random vector. Then for any Borel-measurable
function g,

IE[g(z)x | G] = g(z) IE(x | G),

with probability one.

Two square integrable random variables z and y are said to be orthogonal if their inner
product IE(zy) = 0. This definition allows us to discuss orthogonal projection in the space
of square integrable random vectors. Let z be a square integrable random variable and z̃
be a G-measurable random variable. Then, by Lemma 5.9 (law of iterated expectations)
and Lemma 5.10,

IE{[z − IE(z | G)] z̃} = IE(IE{[z − IE(z | G)] z̃ | G}) = IE[z̃ IE(z | G) − z̃ IE(z | G)] = 0.


That is, the difference between z and its conditional expectation IE(z | G) must be or-
thogonal to any G-measurable random variable. It can then be seen that for any square
integrable, G-measurable random variable z̃,

IE(z − z̃)² = IE[z − IE(z | G) + IE(z | G) − z̃]²
           = IE[z − IE(z | G)]² + IE[IE(z | G) − z̃]²
           ≥ IE[z − IE(z | G)]²,

where in the second equality the cross-product term vanishes because both IE(z | G) and z̃
are G-measurable and hence orthogonal to z − IE(z | G). That is, among all G-measurable
random variables that are also square integrable, IE(z | G) is the closest to z in terms of
the L2 -norm. This shows that IE(z | G) is the orthogonal projection of z onto the space of
all G-measurable, square integrable random variables.

Lemma 5.11 Let z be a square integrable random variable. Then

IE[z − IE(z | G)]² ≤ IE(z − z̃)²,

for any square integrable, G-measurable random variable z̃.

In particular, let G = σ(y), where y is a square integrable random vector. Lemma 5.11
implies that

IE{z − IE[z | σ(y)]}² ≤ IE[z − h(y)]²,

for any Borel-measurable function h such that h(y) is also square integrable. Thus, IE[z |
σ(y)] minimizes the L2 -norm ‖z − h(y)‖2, and its difference from z is orthogonal to any
function of y that is also square integrable. We may then say that, given all the information
generated from y, IE[z | σ(y)] is the “best approximation” of z in terms of the L2 -norm (or
simply the best L2 predictor).
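
To illustrate, the following minimal Python sketch (a simulated model z = y² + u with u independent of y; all distributional choices are only for illustration) compares the mean squared error of the conditional mean IE(z | y) = y² with that of the best linear predictor of z based on y.

    import numpy as np

    # Minimal sketch: IE(z | y) attains the smallest L2 error among functions of y.
    rng = np.random.default_rng(2)
    y = rng.normal(size=1_000_000)
    z = y ** 2 + rng.normal(size=y.size)          # by construction, IE(z | y) = y^2

    cond_mean = y ** 2                            # the conditional expectation of z given y
    b = np.cov(y, z)[0, 1] / np.var(y)            # best linear predictor a + b*y
    a = z.mean() - b * y.mean()
    linear = a + b * y

    print(np.mean((z - cond_mean) ** 2))          # ~ 1, the variance of the noise u
    print(np.mean((z - linear) ** 2))             # ~ 3, strictly larger, as Lemma 5.11 implies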

The conditional variance-covariance matrix of z given y is

var(z | y) = IE{[z − IE(z | y)][z − IE(z | y)]′ | y} = IE(zz′ | y) − IE(z | y) IE(z | y)′.

Similar to the unconditional variance-covariance matrix, we have for a non-stochastic matrix A
and vector b,

var(Az + b | y) = A var(z | y) A′,


which is nonsingular provided that A has full row rank and var(z | y) is positive definite.
It can also be shown that

var(z) = IE[var(z | y)] + var[IE(z | y)];

see Exercise 5.4. That is, the variance of z can be expressed as the sum of two components:
the mean of its conditional variance and the variance of its conditional mean. This is also
known as the decomposition of analysis of variance.

Example 5.12 Suppose that (y′ x′)′ is distributed as a multivariate normal random vector:

    [ y ]       ( [ µy ]   [ Σy    Σ′xy ] )
    [ x ]  ∼  N ( [ µx ] , [ Σxy   Σx   ] ).

It is easy to see that the conditional density function of y given x, obtained from dividing
the multivariate normal density function of y and x by the normal density of x, is also
normal with the conditional mean

IE(y | x) = µy + Σ′xy Σx^{-1} (x − µx),

and the conditional variance-covariance matrix

var(y | x) = var(y) − var[IE(y | x)] = Σy − Σ′xy Σx^{-1} Σxy .

Note that IE(y | x) is a linear function of x and that var(y | x) does not vary with x.
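
A minimal Python sketch of Example 5.12 in the scalar case (the means, variances and the conditioning value are arbitrary illustrative numbers) compares the sample mean and variance of y in a thin slice around a fixed value of x with the formulas above.

    import numpy as np

    # Minimal sketch of Example 5.12 with scalar y and x.
    rng = np.random.default_rng(3)
    mu_y, mu_x = 1.0, 2.0
    var_y, var_x, cov_xy = 2.0, 1.5, 1.0
    draws = rng.multivariate_normal([mu_y, mu_x],
                                    [[var_y, cov_xy], [cov_xy, var_x]], size=2_000_000)
    y, x = draws[:, 0], draws[:, 1]

    x0 = 3.0
    sel = np.abs(x - x0) < 0.02                   # condition on x being close to x0
    print(y[sel].mean(), mu_y + cov_xy / var_x * (x0 - mu_x))    # conditional mean
    print(y[sel].var(), var_y - cov_xy ** 2 / var_x)             # conditional variance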

5.3 Modes of Convergence


Consider now a sequence of random variables {zn(ω)}n=1,2,... defined on the probability
space (Ω, F, IP). For a given ω, {zn(ω)} is a realization (a sequence of sample values) indexed
by n; for a given n, zn is a random variable which assumes different values depending on ω.
In this section we will discuss various modes of convergence for sequences of random
variables.

5.3.1 Almost Sure Convergence

We first introduce the concept of almost sure convergence (convergence with probability
one). Suppose that {zn } is a sequence of random variables and z is a random variable,
all defined on the probability space (Ω, F, IP). The sequence {zn } is said to converge to z
almost surely if, and only if,

IP(ω : zn (ω) → z(ω) as n → ∞) = 1,


denoted as zn −a.s.→ z or zn → z a.s. Note that for a given ω, the realization zn(ω) may
or may not converge to z(ω). Almost sure convergence requires that zn (ω) → z(ω) for
almost all ω ∈ Ω, except for those ω in a set with probability zero. That is, almost all the
realizations zn (ω) will be eventually close to z(ω) for all n sufficiently large; the event that
zn will not approach z is improbable. When z n and z are both Rd -valued, almost sure
convergence is defined elementwise. That is, z n → z a.s. if every element of z n converges
almost surely to the corresponding element of z.
The following result shows that continuous transformation preserves almost sure con-
vergence.

Lemma 5.13 Let g : R → R be a function continuous on Sg ⊆ R.

[a] If zn −a.s.→ z, where z is a random variable such that IP(z ∈ Sg) = 1, then g(zn) −a.s.→ g(z).

[b] If zn −a.s.→ c, where c is a real number at which g is continuous, then g(zn) −a.s.→ g(c).

Proof: Let Ω0 = {ω : zn (ω) → z(ω)} and Ω1 = {ω : z(ω) ∈ Sg }. Thus, for ω ∈ (Ω0 ∩ Ω1 ),


continuity of g ensures that g(zn (ω)) → g(z(ω)). Note that

(Ω0 ∩ Ω1)^c = Ω0^c ∪ Ω1^c,

which has probability zero because IP(Ω0^c) = IP(Ω1^c) = 0. It follows that Ω0 ∩ Ω1 has
probability one. This proves that g(zn ) → g(z) with probability one. The second assertion
is just a special case of the first result. 2
Lemma 5.13 is easily generalized to Rd -valued random variables. For example, z n −a.s.→ z
implies

z1,n + z2,n −a.s.→ z1 + z2 ,
z1,n z2,n −a.s.→ z1 z2 ,
z1,n² + z2,n² −a.s.→ z1² + z2²,

where z1,n , z2,n are two elements of z n and z1 , z2 are the corresponding elements of z. Also,
provided that z2 ≠ 0 with probability one, z1,n /z2,n → z1 /z2 a.s.

5.3.2 Convergence in Probability

A convergence concept that is weaker than almost sure convergence is convergence in prob-
ability. A sequence of random variables {zn } is said to converge to z in probability if for
every ε > 0,

lim_{n→∞} IP(ω : |zn(ω) − z(ω)| > ε) = 0,


or equivalently,

lim_{n→∞} IP(ω : |zn(ω) − z(ω)| ≤ ε) = 1,

denoted as zn −IP→ z or zn → z in probability. We also say that z is the probability limit of
zn , denoted as plim zn = z. In particular, if the probability limit of zn is a constant c, all
the probability mass of zn will concentrate around c when n becomes large. For Rd -valued
random variables z n and z, convergence in probability is also defined elementwise.

In the definition of convergence in probability, the events Ωn(ε) = {ω : |zn(ω) − z(ω)| ≤ ε}
vary with n, and convergence refers to the probabilities of such events, pn = IP(Ωn(ε)),
rather than to the random variables zn themselves. By contrast, almost sure convergence is related
directly to the behaviors of the random variables. For convergence in probability, the event Ωn
that zn will be close to z becomes highly likely when n tends to infinity, or its complement
(that zn will deviate from z by a certain distance) becomes highly unlikely when n tends to
infinity. Whether zn(ω) converges to z(ω) is not of any concern in convergence in probability.

More specifically, let Ω0 denote the set of ω such that zn(ω) converges to z(ω). For
ω ∈ Ω0 , there is some m such that ω is in Ωn(ε) for all n > m. That is,

Ω0 ⊆ ∪_{m=1}^∞ ∩_{n=m}^∞ Ωn(ε) ∈ F.

As ∩_{n=m}^∞ Ωn(ε) is also in F and non-decreasing in m, it follows that

IP(Ω0) ≤ IP(∪_{m=1}^∞ ∩_{n=m}^∞ Ωn(ε)) = lim_{m→∞} IP(∩_{n=m}^∞ Ωn(ε)) ≤ lim_{m→∞} IP(Ωm(ε)).

This inequality proves that almost sure convergence implies convergence in probability, but
the converse is not true in general. We state this result below.
Lemma 5.14 If zn −a.s.→ z, then zn −IP→ z.

The following well-known example shows that when there is convergence in probability,
the random variables themselves may not even converge for any ω.

Example 5.15 Let Ω = [0, 1] and IP be the Lebesgue measure (i.e., IP{(a, b]} = b − a
for (a, b] ⊆ [0, 1]). Consider the sequence {In } of intervals [0, 1], [0, 1/2), [1/2, 1], [0, 1/3),
[1/3, 2/3), [2/3, 1], . . . , and let zn = 1_{In} be the indicator function of In : zn(ω) = 1 if ω ∈ In
and zn(ω) = 0 otherwise. When n tends to infinity, In shrinks toward a singleton, which has
Lebesgue measure zero. For 0 < ε < 1, we then have

IP(|zn | > ε) = IP(In ) → 0,


which shows zn −IP→ 0. On the other hand, it is easy to see that each ω ∈ [0, 1] must be
covered by infinitely many intervals. Thus, given any ω ∈ [0, 1], zn (ω) = 1 for infinitely
many n, and hence zn (ω) does not converge to zero. Note that convergence in probability
permits zn to deviate from the probability limit infinitely often, but almost sure convergence
does not, except for those ω in the set of probability zero. 2
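
A minimal Python sketch of Example 5.15 (the number of interval blocks and the chosen ω are arbitrary) enumerates the sliding intervals In, showing that their lengths, and hence IP(|zn| > ε), shrink to zero while any fixed ω keeps falling into infinitely many In.

    # Minimal sketch of Example 5.15: z_n is the indicator of a sliding interval I_n.
    def intervals(n_blocks):
        """Enumerate [(j-1)/k, j/k) for k = 1, ..., n_blocks and j = 1, ..., k."""
        return [((j - 1) / k, j / k) for k in range(1, n_blocks + 1) for j in range(1, k + 1)]

    omega = 0.7                                    # a fixed outcome in [0, 1]
    lengths, hits = [], 0
    for a, b in intervals(200):
        lengths.append(b - a)                      # = IP(I_n) = IP(|z_n| > eps) for 0 < eps < 1
        hits += (a <= omega < b)

    print(lengths[-1])   # ~ 1/200: the probabilities vanish, so z_n -> 0 in probability
    print(hits)          # ~ 200: z_n(omega) = 1 infinitely often, so z_n(omega) has no limit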

Intuitively, when zn has finite variance such that var(zn ) vanishes asymptotically, the
distribution of zn would shrink toward its mean IE(zn ). If, in addition, IE(zn ) tends to
a constant c (or IE(zn ) = c), then zn ought to be degenerate at c in the limit. These
observations suggest the following sufficient conditions for convergence in probability; see
Exercises 5.5 and 5.6. In many cases, it is easier to establish convergence in probability by
verifying these conditions.

Lemma 5.16 Let {zn } be a sequence of square integrable random variables. If IE(zn) → c
and var(zn) → 0, then zn −IP→ c.

Analogous to Lemma 5.13, continuous functions also preserve convergence in probability.

Lemma 5.17 Let g : R → R be a function continuous on Sg ⊆ R.

[a] If zn −IP→ z, where z is a random variable such that IP(z ∈ Sg) = 1, then g(zn) −IP→ g(z).

[b] (Slutsky) If zn −IP→ c, where c is a real number at which g is continuous, then
g(zn) −IP→ g(c).

Proof: By the continuity of g, for each ε > 0, we can find a δ > 0 such that

{ω : |zn(ω) − z(ω)| ≤ δ} ∩ {ω : z(ω) ∈ Sg} ⊆ {ω : |g(zn(ω)) − g(z(ω))| ≤ ε}.

Taking complements of both sides and noting that the complement of {ω : z(ω) ∈ Sg}
has probability zero, we have

IP(|g(zn) − g(z)| > ε) ≤ IP(|zn − z| > δ).

As zn converges to z in probability, the right-hand side converges to zero and so does the
left-hand side. 2


Lemma 5.17 is readily generalized to Rd -valued random variables. For instance, z n −IP→ z
implies

z1,n + z2,n −IP→ z1 + z2 ,
z1,n z2,n −IP→ z1 z2 ,
z1,n² + z2,n² −IP→ z1² + z2²,

where z1,n , z2,n are two elements of z n and z1 , z2 are the corresponding elements of z. Also,
provided that z2 ≠ 0 with probability one, z1,n /z2,n −IP→ z1 /z2 .

5.3.3 Convergence in Distribution

Another convergence mode, known as convergence in distribution or convergence in law,


concerns the behavior of the distribution functions of random variables. Let Fzn and Fz be
the distribution functions of zn and z, respectively. A sequence of random variables {zn }
is said to converge to z in distribution, denoted as zn −D→ z, if

lim_{n→∞} Fzn(ζ) = Fz(ζ),

for every continuity point ζ of Fz . That is, regardless of the distributions of zn , convergence
in distribution ensures that Fzn will be arbitrarily close to Fz for all n sufficiently large.
The distribution Fz is thus known as the limiting distribution of zn . We also say that zn is
asymptotically distributed as Fz , denoted as zn ∼A Fz .
For random vectors {z n } and z, z n −D→ z if the joint distributions Fzn converge to Fz
for every continuity point ζ of Fz . It is, however, more cumbersome to show convergence
in distribution for a sequence of random vectors. The so-called Cramér-Wold device allows
us to transform this multivariate convergence problem to a univariate one. This result is
stated below without proof.

Lemma 5.18 (Cramér-Wold Device) Let {z n } be a sequence of random vectors in Rd .
Then z n −D→ z if and only if α′z n −D→ α′z for every α ∈ Rd such that α′α = 1.

There is also a uni-directional relationship between convergence in probability and con-
vergence in distribution. To see this, note that for some arbitrary ε > 0 and a continuity
point ζ of Fz , we have

IP(zn ≤ ζ) = IP({zn ≤ ζ} ∩ {|zn − z| ≤ ε}) + IP({zn ≤ ζ} ∩ {|zn − z| > ε})
           ≤ IP(z ≤ ζ + ε) + IP(|zn − z| > ε).


Similarly,

IP(z ≤ ζ − ε) ≤ IP(zn ≤ ζ) + IP(|zn − z| > ε).

If zn −IP→ z, then by passing to the limit and noting that ε is arbitrary, the inequalities
above imply

lim_{n→∞} IP(zn ≤ ζ) = IP(z ≤ ζ).

That is, Fzn (ζ) → Fz (ζ). The converse is not true in general, however.

When zn converges in distribution to a real number c, it is not difficult to show that zn


also converges to c in probability. In this case, these two convergence modes are equivalent.
To be sure, note that a real number c can be viewed as a degenerate random variable with
the distribution function

F(ζ) = 0 for ζ < c,    and    F(ζ) = 1 for ζ ≥ c,

which is a step function with a jump point at c. When zn −D→ c, all the probability mass
of zn will concentrate at c as n becomes large; this is precisely what zn −IP→ c means. More
formally, for any ε > 0,

IP(|zn − c| > ε) = 1 − [Fzn(c + ε) − Fzn((c − ε)−)],

where (c − ε)− denotes the point adjacent to and less than c − ε. Now, zn −D→ c implies
that Fzn(c + ε) − Fzn((c − ε)−) converges to one, so that IP(|zn − c| > ε) converges to zero.
We summarize these results below.

Lemma 5.19 If zn −IP→ z, then zn −D→ z. For a constant c, zn −IP→ c is equivalent to
zn −D→ c.

The continuous mapping theorem below asserts that continuous functions preserve con-
vergence in distribution; cf. Lemmas 5.13 and 5.17.

Lemma 5.20 (Continuous Mapping Theorem) Let g : R → R be a function continuous
almost everywhere on R, except for at most countably many points. If zn −D→ z, then
g(zn) −D→ g(z).

For example, if zn converges in distribution to the standard normal random variable, the
limiting distribution of zn² is χ²(1). Generalizing this result to Rd -valued random variables,


we can see that when z n converges in distribution to the d-dimensional standard normal
random variable, the limiting distribution of z′n z n is χ²(d).

Two sequences of random variables {yn } and {zn } are said to be asymptotically equiv-
alent if their difference yn − zn converges to zero in probability. Intuitively, the limiting
distributions of two asymptotically equivalent sequences, if they exist, ought to be the same.
This is stated in the next result without proof.

Lemma 5.21 Let {yn } and {zn } be two sequences of random vectors such that yn − zn −IP→ 0.
If zn −D→ z, then yn −D→ z.

The next result is concerned with two sequences of random variables such that one converges
in distribution and the other converges in probability.

Lemma 5.22 If yn converges in probability to a constant c and zn converges in distribution
to z, then yn + zn −D→ c + z, yn zn −D→ cz, and zn /yn −D→ z/c if c ≠ 0.

5.4 Stochastic Order Notations


It is typical to use order notations to describe the behavior of a sequence of numbers,
whether it converges or not. Let {cn } denote a sequence of positive real numbers.

1. Given a sequence {bn }, we say that bn is (at most) of order cn , denoted as bn = O(cn),
if there exists a ∆ < ∞ such that |bn|/cn ≤ ∆ for all sufficiently large n. When cn
diverges, bn cannot diverge faster than cn ; when cn converges to zero, the rate of
convergence of bn is no slower than that of cn . For example, the polynomial a + bn
is O(n), and the partial sum of a bounded sequence, Σ_{i=1}^n bi , is O(n). Note that an
O(1) sequence is a bounded sequence.

2. Given a sequence {bn }, we say that bn is of smaller order than cn , denoted as bn =
o(cn), if bn /cn → 0. When cn diverges, bn must diverge slower than cn ; when cn
converges to zero, the rate of convergence of bn should be faster than that of cn . For
example, the polynomial a + bn is o(n^{1+δ}) for any δ > 0, and the partial sum Σ_{i=1}^n α^i,
|α| < 1, is o(n). Note that an o(1) sequence is a sequence that converges to zero.

If bn is a vector (matrix), bn is said to be O(cn ) (o(cn )) if every element of bn is O(cn )


(o(cn )). It is also easy to verify the following results; see Exercise 5.8.

Lemma 5.23 Let {an } and {bn } be two non-stochastic sequences.

(a) If an = O(n^r) and bn = O(n^s), then an bn = O(n^{r+s}) and an + bn = O(n^{max(r,s)}).


(b) If an = o(n^r) and bn = o(n^s), then an bn = o(n^{r+s}) and an + bn = o(n^{max(r,s)}).

(c) If an = O(n^r) and bn = o(n^s), then an bn = o(n^{r+s}) and an + bn = O(n^{max(r,s)}).

The order notations can be easily extended to describe the behavior of sequences of
random variables. A sequence of random variables {zn } is said to be Oa.s.(cn) (or O(cn)
almost surely) if zn /cn is O(1) a.s., and it is said to be OIP(cn) (or O(cn) in probability) if
for every ε > 0, there is some ∆ such that

IP(|zn |/cn ≥ ∆) ≤ ε,

for all n sufficiently large. Similarly, {zn } is oa.s.(cn) (or o(cn) almost surely) if zn /cn −a.s.→ 0,
and it is oIP(cn) (or o(cn) in probability) if zn /cn −IP→ 0.

If {zn } is Oa.s.(1) (oa.s.(1)), we say that zn is bounded (vanishing) almost surely; if {zn } is
OIP(1) (oIP(1)), zn is bounded (vanishing) in probability. Note that Lemma 5.23 also holds
for stochastic order notations. In particular, if a sequence of random variables is bounded
almost surely (in probability) and another sequence of random variables is vanishing almost
surely (in probability), the products of their corresponding elements are vanishing almost
surely (in probability). That is, if yn = Oa.s.(1) and zn = oa.s.(1), then yn zn is oa.s.(1).
D
When zn −→ z, we know that zn does not converge in probability to z in general, but
more can be said about the behavior of zn . Let ζ be a continuity point of Fz . Then for any
D
 > 0, we can choose a sufficiently large ζ such that IP(|z| > ζ) < /2. As zn −→ z, we can
also choose n large enough such that

IP(|zn | > ζ) − IP(|z| > ζ) < /2,

which implies IP(|zn | > ζ) < . This leads to the following conclusion.

Lemma 5.24 Let {zn } be a sequence of random vectors such that zn −D→ z. Then
zn = OIP(1).

5.5 Law of Large Numbers


We first discuss the law of large numbers which is concerned with the averaging behavior
of random variables. Intuitively, a sequence of random variables obeys a law of large
numbers when its sample average essentially follows its mean behavior; random irregularities
(deviations from the mean) are “wiped out” in the limit by averaging. When a law of large
numbers holds almost surely, it is a strong law of large numbers (SLLN); when a law of large


numbers holds in probability, it is a weak law of large numbers (WLLN). For a sequence of
random vectors (matrices), a SLLN (WLLN) is defined elementwise.

There are different versions of the SLLN (WLLN) for various types of random variables.
Below is a well known SLLN for i.i.d. random variables.

Lemma 5.25 (Kolmogorov) Let {zt } be a sequence of i.i.d. random variables with mean
µo . Then,

(1/T) Σ_{t=1}^T zt −a.s.→ µo .

This result asserts that, when zt have a finite, common mean µo , the sample average of
zt is essentially close to µo , a non-stochastic number. Note, however, that i.i.d. random
variables need not obey Kolmogorov’s SLLN if they do not have a finite mean; for instance,
Lemma 5.25 does not apply to i.i.d. Cauchy random variables. As almost sure convergence
implies convergence in probability, the same condition in Lemma 5.25 ensures that {zt }
also obeys a WLLN.
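
A minimal Python sketch (sample sizes and distributions are arbitrary illustrative choices) contrasts the sample averages of i.i.d. exponential variables, which settle down near the mean 2 as Lemma 5.25 predicts, with those of i.i.d. Cauchy variables, which keep fluctuating.

    import numpy as np

    # Minimal sketch: Kolmogorov's SLLN applies to i.i.d. variables with a finite mean,
    # but not to i.i.d. Cauchy variables, which have no finite mean.
    rng = np.random.default_rng(4)
    for T in (10**3, 10**5, 10**7):
        expo = rng.exponential(scale=2.0, size=T)      # mean 2
        cauchy = rng.standard_cauchy(size=T)           # no finite mean
        print(T, expo.mean(), cauchy.mean())
    # The exponential averages approach 2; the Cauchy averages do not settle down.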

When {zt } is a sequence of independent random variables with possibly heterogeneous


distributions, it may still obey a SLLN (WLLN) under a stronger condition.

Lemma 5.26 (Markov) Let {zt } be a sequence of independent random variables with non-
degenerate distributions such that for some δ > 0, IE|zt|^{1+δ} is bounded for all t. Then,

(1/T) Σ_{t=1}^T [zt − IE(zt)] −a.s.→ 0.

Comparing to Kolmogorov's SLLN, Lemma 5.26 requires a stronger moment condition:
bounded (1 + δ) th moment, yet zt need not have a common mean. This SLLN indicates
that the sample average of zt eventually behaves like the average of IE(zt). Note that the
average of IE(zt) may or may not converge; if it does converge to, say, µ∗ , then

(1/T) Σ_{t=1}^T zt −a.s.→ lim_{T→∞} (1/T) Σ_{t=1}^T IE(zt) =: µ∗ .

Finally, as non-stochastic numbers can be viewed as independent random variables with


degenerate distributions, it is understood that a non-stochastic sequence obeys a SLLN if
its sample average converges.

The following example shows that a sequence of correlated random variables may also
obey a WLLN.


Example 5.27 Suppose that yt is generated as a weakly stationary AR(1) process:

yt = αo yt−1 + ut ,    |αo | < 1,

where ut are i.i.d. random variables with mean zero and variance σu². In view of Section 4.3,
we have IE(yt) = 0, var(yt) = σu²/(1 − αo²), and

cov(yt , yt−j) = αo^j σu²/(1 − αo²).

These results imply that IE(T^{-1} Σ_{t=1}^T yt) = 0 and

var(Σ_{t=1}^T yt) = Σ_{t=1}^T var(yt) + 2 Σ_{τ=1}^{T−1} (T − τ) cov(yt , yt−τ)
                 ≤ Σ_{t=1}^T var(yt) + 2T Σ_{τ=1}^{T−1} |cov(yt , yt−τ)|
                 = O(T).

The latter result shows that var(T^{-1} Σ_{t=1}^T yt) = O(T^{-1}), which converges to zero as T
approaches infinity. It follows from Lemma 5.16 that

(1/T) Σ_{t=1}^T yt −IP→ 0;

that is, {yt } obeys a WLLN. It can be seen that a key condition in the proof above is
that the variance of Σ_{t=1}^T yt does not grow too rapidly (it is O(T)). The facts that yt
has a constant variance and that cov(yt , yt−j) goes to zero exponentially fast as j tends to
infinity are sufficient for this condition. This WLLN result is readily generalized to weakly
stationary AR(p) processes. 2
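
A minimal Python sketch of Example 5.27 (with αo = 0.8, standard normal innovations, and an arbitrary number of replications, all chosen only for illustration) simulates the AR(1) sample mean many times and shows that its variance shrinks roughly in proportion to 1/T, which is the key condition used above together with Lemma 5.16.

    import numpy as np

    # Minimal sketch of Example 5.27: the AR(1) sample mean obeys a WLLN.
    rng = np.random.default_rng(5)
    alpha, n_rep = 0.8, 2000

    def ar1_sample_means(T):
        u = rng.normal(size=(n_rep, T))
        y = np.zeros((n_rep, T))
        y[:, 0] = u[:, 0]
        for t in range(1, T):
            y[:, t] = alpha * y[:, t - 1] + u[:, t]
        return y.mean(axis=1)                      # one sample mean per replication

    for T in (200, 800, 3200):
        print(T, ar1_sample_means(T).var())        # shrinks by roughly 1/4 as T quadruples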

The example above shows that it may be quite cumbersome to establish a WLLN for
weakly stationary processes. The lemma below gives a strong law for correlated random
variables and is convenient in practice; see Davidson (1994, p. 326) for a more general result.

Lemma 5.28 Let yt = Σ_{i=0}^∞ πi u_{t−i}, where ut are i.i.d. random variables with mean zero
and variance σu². If Σ_{i=−∞}^∞ |πi | < ∞, then T^{-1} Σ_{t=1}^T yt −a.s.→ 0.

In Example 5.27, yt = Σ_{i=0}^∞ αo^i u_{t−i} with |αo | < 1, so that Σ_{i=0}^∞ |αo|^i < ∞. Hence,
Lemma 5.28 ensures that the average of yt also converges to its mean (zero) almost surely.
If yt = zt − µo , then the average of zt converges to IE(zt ) = µo almost surely. Comparing


to Example 5.27, Lemma 5.28 is quite general and applicable to any process that can be
expressed as an MA process with absolutely summable weights.

From Lemmas 5.25, 5.26 and 5.28 we can see that a SLLN (WLLN) does not always
hold. The random variables in a sequence must be “well behaved” (i.e., satisfying certain
regularity conditions) to ensure a SLLN (WLLN). In particular, the sufficient conditions for
a SLLN (WLLN) usually regulate the moments and dependence structure of random vari-
ables. Intuitively, random variables without a bounded moment of a certain order may exhibit aberrant
behavior, so that their random irregularities cannot be completely averaged out. For random
variables with strong correlations over time, the variation of their partial sums may grow
too rapidly and cannot be eliminated by simple averaging. More generally, it is also possible
for a sequence of weakly dependent and heterogeneously distributed random variables to
obey a SLLN (WLLN). This usually requires even stronger conditions on their moments
and dependence structure. To avoid technicality, we will not give a SLLN (WLLN) for such
general sequences but refer to White (2001) and Davidson (1994) for details. The following
examples illustrate why a SLLN (WLLN) may fail to hold.

Example 5.29 Consider the sequences {t} and {t²}, t = 1, 2, . . .. It is well known that

Σ_{t=1}^T t = T(T + 1)/2,    Σ_{t=1}^T t² = T(T + 1)(2T + 1)/6.

Hence, T^{-1} Σ_{t=1}^T t and T^{-1} Σ_{t=1}^T t² both diverge. 2

Example 5.30 Suppose that ut are i.i.d. random variables with mean zero and variance
σu². Thus, T^{-1} Σ_{t=1}^T ut −a.s.→ 0 by Kolmogorov's SLLN (Lemma 5.25). Consider now {tut }.
This sequence does not have bounded (1 + δ) th moment because IE|tut|^{1+δ} grows with t,
and therefore it does not obey Markov's SLLN (Lemma 5.26). Moreover, note that

var(Σ_{t=1}^T t ut) = Σ_{t=1}^T t² var(ut) = σu² T(T + 1)(2T + 1)/6.

By Exercise 5.9, Σ_{t=1}^T t ut = OIP(T^{3/2}). It follows that T^{-1} Σ_{t=1}^T t ut = OIP(T^{1/2}), which
shows that {tut } does not obey a WLLN. 2

Example 5.31 Suppose that yt is generated as a random walk:

yt = yt−1 + ut ,    t = 1, 2, . . . ,

with y0 = 0, where ut are i.i.d. random variables with mean zero and variance σu². Clearly,

yt = Σ_{i=1}^t ui ,


which has mean zero and unbounded variance tσu². For s < t, write

yt = ys + Σ_{i=s+1}^t ui = ys + vt−s ,

where vt−s = Σ_{i=s+1}^t ui is independent of ys . We then have

cov(yt , ys) = IE(ys²) = sσu²,

for t > s. Consequently,

var(Σ_{t=1}^T yt) = Σ_{t=1}^T var(yt) + 2 Σ_{τ=1}^{T−1} Σ_{t=τ+1}^T cov(yt , yt−τ).

It can be verified that the first term on the right-hand side is

Σ_{t=1}^T var(yt) = Σ_{t=1}^T tσu² = O(T²),

and that the second term is

2 Σ_{τ=1}^{T−1} Σ_{t=τ+1}^T cov(yt , yt−τ) = 2 Σ_{τ=1}^{T−1} Σ_{t=τ+1}^T (t − τ)σu² = O(T³).

Thus, var(Σ_{t=1}^T yt) = O(T³), so that Σ_{t=1}^T yt = OIP(T^{3/2}) by Exercise 5.9. This shows that

(1/T) Σ_{t=1}^T yt = OIP(T^{1/2}),

which diverges in probability. This shows that when {yt } is a random walk, it does not obey
a WLLN. In this case, yt have unbounded variances and strong correlations over time. Due
to these correlations, the variation of the partial sum of yt grows much too fast. (Recall
that the variance of Σ_{t=1}^T yt is only O(T) in Example 5.27.) The conclusions above will not
be altered when {ut } is a white noise or a weakly stationary process. 2
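
A minimal Python sketch of Example 5.31 (the innovation distribution, sample sizes and number of replications are arbitrary) simulates random walks and shows that the spread of T^{-1} Σ_{t=1}^T yt grows like T^{1/2}, so the sample average does not settle down.

    import numpy as np

    # Minimal sketch of Example 5.31: the random-walk sample mean is O_IP(T^{1/2}).
    rng = np.random.default_rng(6)
    n_rep = 5000
    for T in (100, 400, 1600):
        u = rng.normal(size=(n_rep, T))
        y = u.cumsum(axis=1)                       # y_t = u_1 + ... + u_t with y_0 = 0
        ybar = y.mean(axis=1)                      # T^{-1} sum_t y_t, one per replication
        print(T, ybar.std(), (ybar / np.sqrt(T)).std())
    # The first printed spread doubles as T quadruples (rate T^{1/2});
    # the rescaled version stays roughly constant.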

Example 5.32 Suppose that yt is generated as a random walk:

yt = yt−1 + ut , t = 1, 2, . . . ,

with y0 = 0, as in Example 5.31. Then, the sequence {yt−1 ut } has mean zero and

var(yt−1 ut) = IE(yt−1²) IE(ut²) = (t − 1)σu⁴.

More interestingly, it can be seen that for s < t,

cov(yt−1 ut , ys−1 us) = IE(yt−1 ys−1 us) IE(ut) = 0.


We then have

var(Σ_{t=1}^T yt−1 ut) = Σ_{t=1}^T var(yt−1 ut) = Σ_{t=1}^T (t − 1)σu⁴ = O(T²),

and Σ_{t=1}^T yt−1 ut = OIP(T). Note, however, that var(T^{-1} Σ_{t=1}^T yt−1 ut) converges to σu⁴/2,
rather than 0. Thus, T^{-1} Σ_{t=1}^T yt−1 ut cannot behave like a non-stochastic number in the
limit. This shows that {yt−1 ut } does not obey a WLLN, even though its partial sums are
OIP(T). 2

In the asymptotic analysis of econometric estimators and test statistics, we usually en-
counter functions of several random variables, e.g., the product of two random variables.
In some cases, it is easy to find sufficient conditions ensuring a SLLN (WLLN) for these
functions. For example, suppose that zt = xt yt , where {xt } and {yt } are two mutually
independent sequences of independent random variables, each with bounded (2 + δ) th mo-
ment. Then, zt are also independent random variables and have bounded (1 + δ) th moment
by the Cauchy-Schwartz inequality. Lemma 5.26 then provides the SLLN for {zt }. When
{xt } and {yt } are two sequences of correlated (or weakly dependent) random variables, it
is more cumbersome to find suitable conditions on xt and yt that ensure a SLLN (WLLN).

In what follows, a sequence of integrable random variables zt is said to obey a SLLN if

(1/T) Σ_{t=1}^T [zt − IE(zt)] −a.s.→ 0;    (5.1)

it is said to obey a WLLN if the almost sure convergence above is replaced by convergence
in probability. When IE(zt) is a constant µo , (5.1) simplifies to

(1/T) Σ_{t=1}^T zt −a.s.→ µo .

In our analysis, we may only invoke this generic SLLN (WLLN).

5.6 Uniform Law of Large Numbers


It is also common to deal with functions of random variables and model parameters. For
example, q(zt (ω); θ) is a random variable for a given parameter θ, and it is a function
of θ for a given ω. When θ is fixed, we may impose conditions on q and zt such that
{q(zt (ω); θ)} obeys a SLLN (WLLN), as discussed in Section 5.5. When θ assumes values
in the parameter space Θ, a SLLN (WLLN) that does not depend on θ is then needed.


More specifically, suppose that {q(zt ; θ)} obeys a SLLN for each θ ∈ Θ:

QT(ω; θ) = (1/T) Σ_{t=1}^T q(zt(ω); θ) −a.s.→ Q(θ),

where Q(θ) is a non-stochastic function of θ. As this convergence behavior may depend on θ,
Ω0^c(θ) = {ω : QT(ω; θ) ↛ Q(θ)} varies with θ. When Θ is an interval of R, ∪_{θ∈Θ} Ω0^c(θ) is an
uncountable union of non-convergence sets and hence may not have probability zero, even
though each Ω0^c(θ) does. Thus, the event that QT(ω; θ) → Q(θ) for all θ, i.e., ∩_{θ∈Θ} Ω0(θ),
may occur with probability less than one. In fact, the union of all Ω0^c(θ) may not even be
in F (only countable unions of the elements in F are guaranteed to be in F). If so, we
cannot conclude anything regarding the convergence of QT(ω; θ). Worse still is when θ also
depends on T, as in the case where θ is replaced by an estimator θ̃T . There may not exist
a finite T∗ such that QT(ω; θ̃T) is arbitrarily close to Q(θ̃T) for all T > T∗.

These observations suggest that we should study convergence that is uniform on the
parameter space Θ. In particular, QT (ω; θ) converges to Q(θ) uniformly in θ almost surely
(in probability) if the largest possible difference satisfies

sup_{θ∈Θ} |QT(θ) − Q(θ)| → 0,    a.s. (in probability).

In what follows we always assume that this supremum is a random variable for all T.
The example below, similar to Example 2.14 of Davidson (1994), illustrates the difference
between uniform and pointwise convergence.

Example 5.33 Let zt be i.i.d. random variables with zero mean and

qT(zt(ω); θ) = zt(ω) + Tθ,        0 ≤ θ ≤ 1/(2T),
qT(zt(ω); θ) = zt(ω) + 1 − Tθ,    1/(2T) < θ ≤ 1/T,
qT(zt(ω); θ) = zt(ω),             1/T < θ < ∞.

Observe that for θ ≥ 1/T and for θ = 0,

QT(ω; θ) = (1/T) Σ_{t=1}^T qT(zt ; θ) = (1/T) Σ_{t=1}^T zt ,

which converges to zero almost surely by Kolmogorov's SLLN. Thus, for any given θ, we can
always choose T large enough such that QT(ω; θ) −a.s.→ 0, where 0 is the pointwise limit. On
the other hand, it can be seen that Θ = [0, ∞) and

sup_{θ∈Θ} |QT(ω; θ)| = |z̄T + 1/2| −a.s.→ 1/2,

so that the uniform limit is different from the pointwise limit. 2
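
A minimal Python sketch of Example 5.33 (sample sizes are arbitrary; the supremum is evaluated analytically, using the fact that the triangular term ranges over [0, 1/2] with its peak at θ = 1/(2T)) shows the pointwise values shrinking to zero while the supremum over θ stays near 1/2.

    import numpy as np

    # Minimal sketch of Example 5.33: pointwise limit 0, but uniform limit 1/2.
    rng = np.random.default_rng(7)
    for T in (10, 1_000, 100_000):
        zbar = rng.normal(size=T).mean()                     # (1/T) sum_t z_t
        at_fixed_theta = abs(zbar)                           # Q_T at any fixed theta >= 1/T (or theta = 0)
        sup_over_theta = max(abs(zbar + 0.5), abs(zbar))     # the triangular term peaks at 1/2
        print(T, at_fixed_theta, sup_over_theta)
    # The pointwise values tend to 0, but the supremum over theta tends to 1/2.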


Let zT t denote the t th random variable in a sample of T variables. These random


variables are indexed by both T and t and form a triangular array. In this array, there is
only one random variable z11 when T = 1, there are two random variables z21 and z22 when
T = 2, there are three random variables z31 , z32 and z33 when T = 3, and so on. If this
array does not depend on T , it is simply a sequence of random variables. We now consider
a triangular array of functions qT t(z t ; θ), t = 1, 2, . . . , T, where z t are integrable random
vectors and θ is the parameter vector taking values in the parameter space Θ ⊆ Rm . For
notational simplicity, we will not explicitly write ω in the functions. We say that {qT t(z t ; θ)}
obeys a strong uniform law of large numbers (SULLN) if

sup_{θ∈Θ} | (1/T) Σ_{t=1}^T [qT t(z t ; θ) − IE(qT t(z t ; θ))] | −a.s.→ 0,    (5.2)

cf. (5.1). Similarly, {qT t(z t ; θ)} is said to obey a weak uniform law of large numbers
(WULLN) if the convergence condition above holds in probability. If qT t is an Rm -valued
function, the SULLN (WULLN) is defined elementwise.

We have seen that pointwise convergence alone does not imply uniform convergence.
An interesting question one would ask is: What are the additional conditions required to
guarantee uniform convergence? Let

QT(θ) = (1/T) Σ_{t=1}^T [qT t(z t ; θ) − IE(qT t(z t ; θ))].

Suppose that QT satisfies the following Lipschitz-type continuity requirement: for θ and
θ† in Θ,

|QT(θ) − QT(θ†)| ≤ CT ‖θ − θ†‖    a.s.,

where ‖ · ‖ denotes the Euclidean norm, and CT is a random variable that is bounded almost surely
and does not depend on θ. Under this condition, QT(θ†) can be made arbitrarily close to
QT(θ), provided that θ† is sufficiently close to θ. Using the triangle inequality and taking
the supremum over θ we have

sup_{θ∈Θ} |QT(θ)| ≤ sup_{θ∈Θ} |QT(θ) − QT(θ†)| + |QT(θ†)|.

Let ∆ denote an almost sure bound of CT . Then given any ε > 0, choosing θ† such that
‖θ − θ†‖ < ε/(2∆) implies

sup_{θ∈Θ} |QT(θ) − QT(θ†)| ≤ CT ε/(2∆) ≤ ε/2,


uniformly in T . Moreover, because QT(θ) converges to 0 almost surely for each θ in Θ,
|QT(θ†)| is also less than ε/2 for sufficiently large T. Consequently,

sup_{θ∈Θ} |QT(θ)| ≤ ε,

for all T sufficiently large. As these results hold almost surely, we have a SULLN for QT (θ);
the conditions ensuring a WULLN are analogous.

Lemma 5.34 Suppose that for each θ ∈ Θ, {qT t (z t ; θ)} obeys a SLLN (WLLN) and that
for θ, θ † ∈ Θ,

|QT(θ) − QT(θ†)| ≤ CT ‖θ − θ†‖    a.s.,

where CT is a random variable bounded almost surely (in probability) and does not depend
on θ. Then, {qT t (z t ; θ)} obeys a SULLN (WULLN).

Lemma 5.34 is quite convenient for establishing a SULLN (WULLN) because it requires
only two conditions. First, the random functions must obey a standard SLLN (WLLN) for
each θ in the parameter space. Second, the function qT t must satisfy a Lipschitz-type
continuity condition which amounts to requiring qT t to be sufficiently “smooth” in the
second argument. Note, however, that CT being bounded almost surely may imply that
the random variables in qT t are also bounded almost surely. This requirement is much too
restrictive in applications. Hence, a SULLN may not be readily obtained from Lemma 5.34.
On the other hand, a WULLN is practically more plausible because the requirement that CT
is OIP (1) is much weaker. For example, the boundedness of IE |CT | is sufficient for CT being
OIP (1) by Markov’s inequality. For more specific conditions ensuring these requirements we
refer to Gallant and White (1988) and Bierens (1994).

5.7 Central Limit Theorem


The central limit theorem (CLT) is another important result in probability theory. When a
CLT holds, the distributions of suitably normalized averages of random variables are close
to the standard normal distribution in the limit, regardless of the original distributions of
these random variables. This is a very powerful result in applications because, as far as
the approximation of normalized sample averages is concerned, only the standard normal
distribution matters.

There are also different versions of CLT for various types of random variables. The
following CLT applies to i.i.d. random variables.


Lemma 5.35 (Lindeberg-Lévy) Let {zt } be a sequence of i.i.d. random variables with
mean µo and variance σo² > 0. Then,

√T (z̄T − µo)/σo −D→ N(0, 1).

A sequence of i.i.d. random variables need not obey this CLT if they do not have a finite
variance, e.g., random variables with t(2) distribution. Comparing to Lemma 5.25, one
can immediately see that the Lindeberg-Lévy CLT requires a stronger condition (i.e., finite
variance) than does Kolmogorov’s SLLN.

Remark: In this case, z̄T converges to µo in probability, and its variance σo²/T vanishes
when T tends to infinity. To prevent a degenerate distribution in the limit, it is natural
to consider the normalized average T^{1/2}(z̄T − µo), which has a constant variance σo² for all
T. This explains why the normalizing factor T^{1/2} is needed. For a normalizing factor T^a
with a < 1/2, the normalized average still converges to zero because its variance vanishes
in the limit. For a normalizing factor T^a with a > 1/2, the normalized average diverges. In
both cases, the resulting normalized averages cannot have a well-behaved, non-degenerate
distribution in the limit. Thus, when {zt } obeys a CLT, z̄T is said to converge to µo at the
rate T^{-1/2}.
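
A minimal Python sketch (exponential summands, so that the summand distribution is clearly non-normal; sample size and number of replications are arbitrary) computes the standardized sample mean across many replications and checks that it behaves like a standard normal variable.

    import numpy as np

    # Minimal sketch of Lemma 5.35 with i.i.d. Exponential(1) summands (mean 1, variance 1).
    rng = np.random.default_rng(8)
    T, n_rep = 500, 20_000
    z = rng.exponential(scale=1.0, size=(n_rep, T))
    stat = np.sqrt(T) * (z.mean(axis=1) - 1.0) / 1.0       # sqrt(T) (zbar - mu) / sigma

    print(stat.mean(), stat.var())                         # ~ 0 and ~ 1
    print(np.mean(stat <= 1.645), np.mean(np.abs(stat) <= 1.96))   # ~ 0.95 and ~ 0.95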

Independent random variables may also obey a CLT. Below is a version
of Liapunov's CLT for independent (but not necessarily identically distributed) random
variables.

Lemma 5.36 Let {zT t } be a triangular array of independent random variables with mean
µT t and variance σT t² > 0 such that

σ̄T² = (1/T) Σ_{t=1}^T σT t² → σo² > 0.

If for some δ > 0, IE|zT t|^{2+δ} are bounded for all t, then

√T (z̄T − µ̄T)/σo −D→ N(0, 1).

Note that this result requires a stronger condition (bounded (2 + δ) th moment) than does
Markov’s SLLN, Lemma 5.26. Comparing to Lindeberg-Lévy’s CLT, Lemma 5.36 allows
mean and variance to vary with t at the expense of a stronger moment condition.

The sufficient conditions for a CLT are similar to but usually stronger than those for
a WLLN. In particular, the random variables that obey a CLT have bounded moment up


to some higher order and are asymptotically independent with dependence vanishing suf-
ficiently fast. Moreover, every random variable must also be asymptotically negligible, in
the sense that no random variable is influential in affecting the partial sums. Although
we will not specify the regularity conditions explicitly, we note that weakly stationary AR
and MA processes obey a CLT in general. A sequence of weakly dependent and heteroge-
neously distributed random variables may also obey a CLT, depending on its moment and
dependence structure. The following examples show that a CLT may not always hold.

Example 5.37 Suppose that {ut } is a sequence of independent random variables with
mean zero, variance σu², and bounded (2 + δ) th moment. From Example 5.30, we know
that var(Σ_{t=1}^T t ut) is O(T³), which implies that the variance of T^{-1/2} Σ_{t=1}^T t ut diverges at the
rate O(T²). On the other hand, observe that

var( T^{-1/2} Σ_{t=1}^T (t/T) ut ) = [T(T + 1)(2T + 1)/(6T³)] σu² → σu²/3.

It follows from Lemma 5.36 that

[√3/(T^{1/2} σu)] Σ_{t=1}^T (t/T) ut −D→ N(0, 1).

These results show that {(t/T)ut } obeys a CLT, whereas {tut } does not. 2
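
A minimal Python sketch of Example 5.37 (normal ut and arbitrary sample sizes, for illustration only) confirms that the variance of T^{-1/2} Σ_{t=1}^T (t/T) ut is close to σu²/3 and that the normalized sum behaves like a standard normal variable.

    import numpy as np

    # Minimal sketch of Example 5.37: {(t/T) u_t} obeys a CLT with limiting variance sigma_u^2 / 3.
    rng = np.random.default_rng(9)
    T, n_rep, sigma_u = 500, 20_000, 1.0
    u = rng.normal(scale=sigma_u, size=(n_rep, T))
    w = np.arange(1, T + 1) / T                            # weights t/T
    s = (u * w).sum(axis=1) / np.sqrt(T)                   # T^{-1/2} sum_t (t/T) u_t

    print(s.var())                                         # ~ sigma_u^2 / 3
    stat = np.sqrt(3) * s / sigma_u
    print(stat.var(), np.mean(np.abs(stat) <= 1.96))       # ~ 1 and ~ 0.95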

Example 5.38 Suppose that yt is generated as a random walk:

yt = yt−1 + ut , t = 1, 2, . . . ,

with y0 = 0, where the ut are i.i.d. random variables with mean zero and variance σu². From
Example 5.31 we have seen that the yt have unbounded variances and strong correlations over
time; hence, they do not obey a CLT. Example 5.32 also suggests that {yt−1 ut} does not
obey a CLT. □

In many applications, we usually encounter an array of functions of random variables


and would like to know if it obeys a CLT. Let {zT t } denote a triangular array of functions
of random variables. Establishing a CLT may not be too difficult when {zT t } is determined
by sequences of independent random variables, but it is technically more involved when
{zT t } depends on sequences of correlated (or weakly dependent) random variables. In what
follows, the array of square integrable random variables zT t is said to obey a CLT if
    (1/(σo√T)) Σ_{t=1}^T [z_{Tt} − IE(z_{Tt})] = √T (z̄T − µ̄T) / σo  →^D  N(0, 1),        (5.3)

where z̄T = T^{-1} Σ_{t=1}^T z_{Tt}, µ̄T = IE(z̄T), and

    σT² = var( T^{-1/2} Σ_{t=1}^T z_{Tt} ) → σo² > 0.

Note that this definition requires neither IE(z_{Tt}) nor var(z_{Tt}) to be a constant. If IE(z_{Tt})
is the constant µo, (5.3) would read:

    √T (z̄T − µo) / σo  →^D  N(0, 1),

as we usually see in other textbooks.

Consider an array of square integrable random vectors z T t in Rd . Let z̄ T denote the


average of z T t , µ̄T = IE(z̄ T ), and
    ΣT = var( T^{-1/2} Σ_{t=1}^T z_{Tt} ) → Σo,

a positive definite matrix. Using the Cramér-Wold device (Lemma 5.18), {z T t } is said to
obey a multivariate CLT, in the sense that
    Σo^{-1/2} (1/√T) Σ_{t=1}^T [z_{Tt} − IE(z_{Tt})] = Σo^{-1/2} √T (z̄T − µ̄T)  →^D  N(0, I d),

if {α′z_{Tt}} obeys a CLT, for any α ∈ Rd such that α′α = 1.

5.8 Functional Central Limit Theorem


In this section, we consider a generalization of the concept of random variables and discuss
the related limit theorem.

5.8.1 Stochastic Processes

Let T be a nonempty subset of R and (Ω, F, IP) be the probability space on which the Rd -
valued random variables z t , t ∈ T , are defined. Also let (Rd )T denote the collection of
all Rd -valued functions on T , which is also a product space of copies of Rd . For example,
when d = 1 and T = {1, . . . , k}, a real function on T is just a k-tuple (z1 , . . . , zk ), i.e.,
(R){1,...,k} = Rk ; when d = 1 and T is an interval, (R)T contains all real functions on that
interval. A d-dimensional stochastic process with the index set T is a measurable mapping
z : Ω 7→ (Rd )T such that

z(ω) = {z t (ω), t ∈ T }.

For each t ∈ T , z t (·) is an Rd -valued random variable; for each ω, z(ω) is a sample path
(realization) of z, which is an Rd -valued function on T . Therefore, a stochastic process is
understood as a collection of random variables or a random function on the index set. The
random sequence encountered in the preceding sections is just a stochastic process whose
index set is the set of integers.

In what follows, for the stochastic process z, we will write z(t, ·) or simply z(t) in place
of z t (·). Thus, z with a subscript (say, z n ) denotes a process in a sequence of stochastic
processes. To signify the index set T , we may also write z as {z(t, ·), t ∈ T }. The finite-
dimensional distributions of {z(t, ·), t ∈ T } are

    IP(z_{t1} ≤ a1, . . . , z_{tn} ≤ an) = F_{t1,...,tn}(a1, . . . , an),

where {t1 , . . . , tn } is any subset of T and ai ∈ Rd . A stochastic process is said to be


stationary if its finite-dimensional distributions are invariant under index displacement:

    F_{t1+s,...,tn+s}(a1, . . . , an) = F_{t1,...,tn}(a1, . . . , an).

A stochastic process is said to be Gaussian if its finite-dimensional distributions are all


(multivariate) normal distributions.

The process {w(t), t ∈ [0, ∞)} is the standard Wiener process (also known as the
standard Brownian motion) if it has continuous sample paths almost surely and satisfies
the following properties.

(i) IP( w(0) = 0 ) = 1.

(ii) For 0 ≤ t0 ≤ t1 ≤ · · · ≤ tk,

        IP( w(ti) − w(t_{i−1}) ∈ Bi, i ≤ k ) = Π_{i≤k} IP( w(ti) − w(t_{i−1}) ∈ Bi ),

    where the Bi are Borel sets.

(iii) For 0 ≤ s < t, w(t) − w(s) is normally distributed with mean zero and variance t − s.

By (i), this process must start from the origin with probability one. The second property
requires that non-overlapping increments of w be independent. By property (iii), every
increment of w is normally distributed with variance depending on the time difference; in
particular, w(t) is normally distributed with mean zero and variance t. This implies that
for r ≤ t,

    cov( w(r), w(t) ) = IE[ w(r)( w(t) − w(r) ) ] + IE[ w(r)² ] = r,

where IE[w(r)(w(t) − w(r))] = 0 because of independent increments.

The d-dimensional, standard Wiener process w is the process consisting of d mutually
independent, standard Wiener processes. Thus, w still starts from the origin with
probability one, has independent increments, and

w(t) − w(s) ∼ N (0, (t − s) I d ).

In view of the preceding paragraph, we have the following results for w.

Lemma 5.39 Let w be the d-dimensional, standard Wiener process.

(i) w(t) ∼ N (0, t I d ).

(ii) cov(w(r), w(t)) = min(r, t) I d .
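
The two properties in Lemma 5.39 are easy to check numerically. The sketch below (in Python; the grid size, the replication count, and the particular points r = 0.3 and t = 0.8 are arbitrary illustrative choices) approximates a scalar standard Wiener process on [0, 1] by cumulative sums of independent N(0, 1/n) increments.

    import numpy as np

    rng = np.random.default_rng(2)
    n, R = 1000, 20000                                     # grid points on [0, 1] and simulated paths
    dw = rng.standard_normal((R, n)) * np.sqrt(1.0 / n)    # independent N(0, 1/n) increments
    w = np.cumsum(dw, axis=1)                              # w(k/n), k = 1, ..., n, for each path
    i_r, i_t = int(0.3 * n) - 1, int(0.8 * n) - 1          # grid indices for r = 0.3 and t = 0.8
    print(w[:, i_t].var())                                 # approx. 0.8 = t
    print(np.cov(w[:, i_r], w[:, i_t])[0, 1])              # approx. 0.3 = min(r, t)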

We also note that, although the sample paths of the Wiener process are a.s. continuous,
they are highly irregular. To see this, define wc (t) = w(c2 t)/c for c > 0. It can be shown
that wc is also a standard Wiener process (Exercise 5.11). Note that wc (1/c) = w(c)/c,
where w(c)/c is the slope of the chord between w(c) and w(0). If we choose a c large enough
such that w(c)/c > 1, then the slope of the chord between wc (1/c) and wc (0) is

    wc(1/c) / (1/c) = [w(c)/c] / (1/c) = w(c) > c.

This shows that the sample path of wc has a slope exceeding c and hence must experience a
large change on the very small interval (0, 1/c). In fact, it can be shown that almost all the
sample paths of w are nowhere differentiable; see e.g., Billingsley (1979, p. 450). Intuitively,
the difference quotient [w(t + h) − w(t)]/h is distributed as N (0, 1/|h|). As its variance
diverges to infinity as h tends to zero, the difference quotient cannot converge to a finite
limit with positive probability.

We may also construct different processes using the standard Wiener process. In par-
ticular, the process w0 on [0, 1] with w0 (t) = w(t) − tw(1) is known as the Brownian
bridge or the tied down Brownian motion. It is easily seen that w0 (0) = w0 (1) = 0 with
probability one so that the Brownian bridge starts from zero and must return to zero at
t = 1. Moreover, IE[w0 (t)] = 0, and for r < t,

    cov( w0(r), w0(t) ) = cov( w(r) − r·w(1), w(t) − t·w(1) ) = r(1 − t) I d;

in particular, var(w0 (t)) = t(1 − t) I d, which reaches its maximum at t = 1/2.
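
Continuing the simulation sketch above (here for a scalar Wiener process, with the same kind of illustrative grid and replication choices), the Brownian bridge can be formed directly from simulated Wiener paths, and its variance profile t(1 − t), peaking at 1/4 when t = 1/2, is easy to confirm.

    import numpy as np

    rng = np.random.default_rng(3)
    n, R = 1000, 20000
    t = np.arange(1, n + 1) / n
    w = np.cumsum(rng.standard_normal((R, n)) * np.sqrt(1.0 / n), axis=1)  # approximate Wiener paths
    w0 = w - t * w[:, [-1]]                      # Brownian bridge: w0(t) = w(t) - t w(1)
    var_t = w0.var(axis=0)
    print(var_t[n // 2 - 1])                     # approx. 1/4, the maximum, at t = 1/2
    print(np.abs(var_t - t * (1 - t)).max())     # uniformly close to t(1 - t)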

5.8.2 Weak Convergence

Let S be a metric space and S be the Borel σ-algebra generated by the open sets in S. If
for every bounded, continuous real function f on S we have
    ∫ f(s) dIPn(s) → ∫ f(s) dIP(s),

where {IPn } and IP are probability measures on (S, S), we say that IPn converges weakly
to IP and write IPn ⇒ IP. For the random elements z n and z in S with the distributions
induced by IPn and IP, respectively, we say that {z n} converges in distribution to z, also
denoted as z n →^D z, if IPn ⇒ IP. Note that z n and z here may be random functions.
When z n and z are all Rd -valued random variables, IPn ⇒ IP reduces to the usual notion
of convergence in distribution, as in Section 5.3.3. When z n and z are d-dimensional
stochastic processes, z n →^D z implies that all the finite-dimensional distributions of z n
converge to the corresponding distributions of z. To distinguish between the convergence
in distribution of random variables and that of random functions, we shall, in what follows,
denote the latter as z n ⇒ z.

Let S and S′ be two metric spaces with respective Borel σ-algebras S and S′. Also let
g : S 7→ S′ be a measurable mapping. Then each probability measure IP on (S, S) induces
a unique probability measure IP∗ on (S′, S′) via

    IP∗(A′) = IP(g⁻¹(A′)),    A′ ∈ S′.

If g is continuous almost everywhere on S, then for every bounded, continuous f on S′,
f ◦ g is also bounded and continuous on S. IPn ⇒ IP now implies that

    ∫ f ◦ g(s) dIPn(s) → ∫ f ◦ g(s) dIP(s),

which is equivalent to

    ∫ f(a) dIP∗n(a) → ∫ f(a) dIP∗(a).

This proves that IP∗n ⇒ IP∗ . This result is also known as the continuous mapping theorem;
cf. Lemma 5.20.

Lemma 5.40 (Continuous Mapping Theorem) Let g : Rd 7→ R be a function continuous
almost everywhere on Rd (e.g., except at most countably many points). If z n ⇒ z, then
g(z n) ⇒ g(z).

For example, when zn ⇒ z and h(z) = sup_{0≤t≤1} z(t),

    sup_{0≤t≤1} zn(t) ⇒ sup_{0≤t≤1} z(t),

and when h(z) = ∫₀¹ z(t) dt,

    ∫₀¹ zn(t) dt ⇒ ∫₀¹ z(t) dt.
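
It can be verified by a direct calculation that, when the limit process is the standard Wiener process w, the second functional has a simple limiting distribution: since w is Gaussian with cov(w(r), w(s)) = min(r, s), the integral ∫₀¹ w(t) dt is also Gaussian with mean zero and variance

    ∫₀¹ ∫₀¹ min(r, s) dr ds = 1/3,

so that zn ⇒ w together with the continuous mapping theorem yields ∫₀¹ zn(t) dt →^D N(0, 1/3).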

5.8.3 Functional Central Limit Theorem

A sequence of random variables {ζi } is said to obey a functional central limit theorem
(FCLT) if its normalized partial sums zn converge in distribution to the standard Wiener
process w, i.e., zn ⇒ w. The FCLT, also known as the invariance principle, ensures that the
limiting behavior of the normalized partial sums of ζi is governed by the standard Wiener
process, regardless of the original distributions of ζi .

To see how the FCLT works, we consider the i.i.d. sequence {ζi} with mean zero and
variance σ². The partial sum of the ζi is sn = ζ1 + · · · + ζn, and it can be normalized as
zn(i/n) = (σ√n)⁻¹ si. For t ∈ [(i − 1)/n, i/n), define the constant interpolation of zn(i/n)
as

    zn(t) = zn((i − 1)/n) = s_{[nt]} / (σ√n),

where [nt] is the largest integer less than or equal to nt, so that [nt] = i − 1. It
can be seen that the sample paths of zn are right continuous with left-hand limits, i.e.,
zn(t+) = zn(t) and zn(t−) = lim_{r↑t} zn(r). Such sample paths are also known as cadlag (an
abbreviation of the French “continue à droite, limite à gauche”) functions. The interpolated
process zn is thus a random element of D[0, 1], the space of all cadlag functions. In view of
the discussion of Section 5.8.2, we may study the weak convergence property of {zn }.
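
A minimal construction of this interpolated process, as a sketch in Python (the choice of centered exponential draws and of n is purely illustrative):

    import numpy as np

    rng = np.random.default_rng(4)
    n, sigma = 500, 1.0
    zeta = rng.exponential(size=n) - 1.0     # i.i.d. draws with mean 0 and variance 1
    s = np.cumsum(zeta)                      # partial sums s_1, ..., s_n

    def z_n(t):
        # cadlag interpolation: z_n(t) = s_[nt] / (sigma * sqrt(n)), with z_n(t) = 0 for t < 1/n
        k = int(np.floor(n * t))
        return 0.0 if k == 0 else s[k - 1] / (sigma * np.sqrt(n))

    print(z_n(0.25), z_n(0.5), z_n(1.0))     # one realization evaluated at a few points

Replicating this construction many times and evaluating z_n at a fixed t reproduces, approximately, the N(0, t) finite-dimensional distributions discussed next.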

We shall only discuss convergence of the finite-dimensional distributions of zn . First


note that, as n tends to infinity, we have from Lindeberg-Lévy’s CLT that
    s_{[nt]} / (σ√n) = ([nt]/n)^{1/2} · s_{[nt]} / (σ√[nt])  →^D  √t · N(0, 1),

which is just N(0, t), the distribution of w(t). That is, zn(t) →^D w(t). For r < t, we have

    (zn(r), zn(t) − zn(r))  →^D  (w(r), w(t) − w(r)),

from which we deduce that (zn(r), zn(t)) →^D (w(r), w(t)). Proceeding along the same lines,
we can show that all the finite-dimensional distributions of zn converge to the corresponding
distributions of the standard Wiener process. Although merely proving convergence of the
finite-dimensional distributions is not sufficient for zn ⇒ w, it should help in understanding
the intuition of the FCLT. To arrive at zn ⇒ w, it is also required that the probability measures
induced by zn be “well behaved”; we omit the details.

In view of the discussion above, we are now ready to state an FCLT for i.i.d. random
variables.

Lemma 5.41 (Donsker) Let ζt be i.i.d. random variables with mean µo and variance
σo2 > 0 and zT be the stochastic process with
    zT(r) = (1/(σo√T)) Σ_{t=1}^{[T r]} (ζt − µo),    r ∈ [0, 1].

Then, zT ⇒ w as T → ∞.

We observe from Lemma 5.41 that, when r = 1,

    zT(1) = √T (ζ̄T − µo) / σo  →^D  N(0, 1),

where ζ̄T = Σ_{t=1}^T ζt / T. This is precisely the conclusion of Lemma 5.35 and shows that
Donsker’s FCLT can be viewed as a generalization of Lindeberg-Lévy’s CLT. The FCLT


below applies to independent random variables and is a generalization of Liapunov’s CLT
(Lemma 5.36); see White (2001).

Lemma 5.42 Let ζt be independent random variables with mean µt and variance σt2 > 0
such that
    σ̄T² = (1/T) Σ_{t=1}^T σt² → σo² > 0.

Also let zT be the stochastic process with


    zT(r) = (1/(σo√T)) Σ_{t=1}^{[T r]} (ζt − µt),    r ∈ [0, 1].

If for some δ > 0, IE|ζt|^{2+δ} are bounded for all t, then zT ⇒ w as T → ∞.

More generally, let ζt be (possibly dependent and heterogeneously distributed) random


variables with mean µt and variance σt2 > 0. Define the long-run variance of ζt as
    σ∗² = lim_{T→∞} var( (1/√T) Σ_{t=1}^T ζt ),

and assume σ∗2 exists and is positive. We say that {ζt } obeys an FCLT if zT ⇒ w as T → ∞,
where zT is the stochastic process with
    zT(r) = (1/(σ∗√T)) Σ_{t=1}^{[T r]} (ζt − µt),    r ∈ [0, 1].

When ζt are independent random variables, cov(ζt, ζs) = 0 for all t ≠ s, so that σ∗² = σo².
Then the generic FCLT above leads to the conclusion of Lemma 5.42.
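
To illustrate how serial dependence enters the long-run variance, consider the simple illustrative case of an MA(1) process, ζt = εt + θ εt−1, with εt i.i.d., mean zero and variance σε². Then γ0 = var(ζt) = (1 + θ²)σε², γ1 = cov(ζt, ζt−1) = θσε², and γj = 0 for j > 1, so that

    σ∗² = lim_{T→∞} var( (1/√T) Σ_{t=1}^T ζt ) = γ0 + 2γ1 = (1 + θ)² σε²,

which differs from var(ζt) whenever θ ≠ 0.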

Let ζ t be d-dimensional random variables with mean µt and variance-covariance matrices Σt.
Define the long-run variance-covariance matrix of ζ t as

    Σ∗ = lim_{T→∞} (1/T) IE[ ( Σ_{t=1}^T (ζ t − µt) ) ( Σ_{t=1}^T (ζ t − µt) )′ ],

and assume that Σ∗ exists and is positive definite. We say that {ζ t } obeys a (multivariate)
FCLT if z T ⇒ w as T → ∞, where z T is the d-dimensional stochastic process with

    z T(r) = (1/√T) Σ∗^{-1/2} Σ_{t=1}^{[T r]} (ζ t − µt),    r ∈ [0, 1],

and w is the d-dimensional, standard Wiener process. Although no sufficient conditions


will be provided, we note that an FCLT may hold for weakly dependent and heterogeneously
distributed data, provided that they satisfy some regularity conditions; see Davidson (1994)
and White (2001) for details.

Example 5.43 Suppose that yt is generated as a random walk:

yt = yt−1 + ut , t = 1, 2, . . . ,

with y0 = 0, where ut are i.i.d. random variables with mean zero and variance σu². As {ut}
obeys Donsker’s FCLT and y_{[T r]} = Σ_{t=1}^{[T r]} ut is a partial sum of the ut, we have

    T^{-3/2} Σ_{t=1}^T yt = σu Σ_{t=1}^T ∫_{t/T}^{(t+1)/T} y_{[T r]} / (σu√T) dr ⇒ σu ∫₀¹ w(r) dr,

where the right-hand side is a random variable. This result also verifies that Σ_{t=1}^T yt is
OIP(T^{3/2}), as stated in Example 5.31. Similarly,

    T^{-2} Σ_{t=1}^T yt² = (1/T) Σ_{t=1}^T (yt/√T)² ⇒ σu² ∫₀¹ w(r)² dr,

so that Σ_{t=1}^T yt² is OIP(T²). It is clear that these results remain valid as long as ut obey an
FCLT (but need not be i.i.d. or even independent). □
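
The weak limits in this example can be checked by simulation. In the Python sketch below (the normal errors, the grid of sample sizes, and the replication count are illustrative choices), the Monte Carlo variance of T^{-3/2} Σ yt and the Monte Carlo mean of T^{-2} Σ yt² settle near the corresponding moments of σu ∫₀¹ w(r) dr and σu² ∫₀¹ w(r)² dr, namely σu²/3 and σu²/2.

    import numpy as np

    rng = np.random.default_rng(5)
    sigma_u, R = 1.0, 10000
    for T in (200, 800, 3200):
        u = rng.standard_normal((R, T)) * sigma_u
        y = np.cumsum(u, axis=1)                   # random walk paths y_1, ..., y_T
        a = y.sum(axis=1) / T**1.5                 # T^{-3/2} sum of y_t
        b = (y**2).sum(axis=1) / T**2              # T^{-2} sum of y_t^2
        print(T, a.var(), b.mean())                # approx. sigma_u^2 / 3 and sigma_u^2 / 2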

Exercises

5.1 Let C be a collection of subsets of Ω. Show that the intersection of all the σ-algebras
on Ω that contain C is the smallest σ-algebra containing C.

5.2 Let x and y be two random variables with finite pth moment (p > 1). Prove the
following triangle inequality:

    ‖x + y‖_p ≤ ‖x‖_p + ‖y‖_p.

Hint: Write IE|x + y|^p = IE(|x + y| · |x + y|^{p−1}) and apply Hölder’s inequality.

5.3 In the probability space (Ω, F, IP) suppose that we know the event B in F has oc-
curred. Show that the conditional probability IP(·|B) satisfies the axioms for proba-
bility measures.

5.4 Prove that for the square integrable random vectors z and y,

var(z) = IE[var(z | y)] + var(IE(z | y)).

5.5 A sequence of square integrable random variables {zn } is said to converge to a random
variable z in L2 (in quadratic mean) if

IE(zn − z)² → 0.

Prove that L2 convergence implies convergence in probability.


Hint: Apply Chebychev’s inequality.

5.6 Show that a sequence of square integrable random variables {zn } converges to a
constant c in L2 if and only if IE(zn ) → c and var(zn ) → 0.
5.7 Prove that z T ∼^A N(0, I) if, and only if, λ′z T ∼^A N(0, 1) for all λ′λ = 1.

5.8 Prove Lemma 5.23.

5.9 Suppose that IE(zn²) = O(cn), where {cn} is a sequence of positive real numbers. Show
that zn = OIP(cn^{1/2}).

5.10 Suppose that yt is generated as a Gaussian random walk:

yt = yt−1 + ut , t = 1, 2, . . . ,

with y0 = 0, where ut are i.i.d. normal random variables with mean zero and variance
σu². Show that Σ_{t=1}^T yt² is OIP(T²).

5.11 Let w be a standard Wiener process and define wc as wc(t) = w(c²t)/c, where c > 0.
Show that wc is also a standard Wiener process.

5.12 Let w be a standard Wiener process and w0 a Brownian bridge. Suppose that x(t) =
w(t + r) − w(r) for a given r > 0 and y(t) = (1 + t) w0 (t/(1 + t)), t ∈ [0, ∞). Show
that both x and y are standard Wiener processes.

References

Ash, Robert B. (1972). Real Analysis and Probability, New York, NY: Academic Press.

Bierens, Herman J. (1994). Topics in Advanced Econometrics, New York, NY: Cambridge
University Press.

Billingsley, Patrick (1979). Probability and Measure, New York, NY: John Wiley and Sons.

Davidson, James (1994). Stochastic Limit Theory, New York, NY: Oxford University Press.

Gallant, A. Ronald (1997). An Introduction to Econometric Theory, Princeton, NJ: Princeton
University Press.

Gallant, A. Ronald and Halbert White (1988). A Unified Theory of Estimation and Infer-
ence for Nonlinear Dynamic Models, Oxford, UK: Basil Blackwell.

White, Halbert (2001). Asymptotic Theory for Econometricians, revised edition, Orlando,
FL: Academic Press.
