Regression Analysis: Properties of Least Squares Estimators
EC1630 Handout, Chapter 18
Abstract
We consider the regression model $Y_i = X_i'\beta + u_i$, $i = 1, \ldots, n$. This note summarizes the results for the asymptotic analysis of the least squares estimator, $\hat\beta$, that make it possible to test: (i) hypotheses about the individual coefficients of $\beta$; and (ii) hypotheses that involve several coefficients, such as $R\beta = r$, where $R$ is a known matrix and $r$ is a known vector.
Regression Analysis and Least Squares Estimators
Consider the regression model
$$Y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + u_i = X_i'\beta + u_i, \qquad i = 1, \ldots, n, \qquad \text{[Matrix: } Y = X\beta + U\text{]} \qquad (1)$$
with $\beta = (\beta_0\ \beta_1\ \ldots\ \beta_k)'$, $X_i = (1\ x_{1i}\ \ldots\ x_{ki})'$, $Y = (Y_1\ \ldots\ Y_n)'$, and $X = (X_1\ \ldots\ X_n)'$. The least squares (LS) estimator is given as the solution to
$$\min_{\beta} S(\beta) = \min_{\beta} \sum_{i=1}^n u_i^2 = \min_{\beta} \sum_{i=1}^n \big(Y_i - X_i'\beta\big)^2. \qquad \text{[Matrix: } \min_{\beta}\,(Y - X\beta)'(Y - X\beta)\text{]}$$
The least squares estimator results from solving the first-order conditions for a minimum:
$$\begin{array}{rcl}
\dfrac{\partial}{\partial \beta_0} S(\beta) = 0 & & -2\sum_{i=1}^n \big(Y_i - X_i'\beta\big) = 0 \\[4pt]
\dfrac{\partial}{\partial \beta_1} S(\beta) = 0 & & -2\sum_{i=1}^n x_{1i}\big(Y_i - X_i'\beta\big) = 0 \\[4pt]
\vdots & \Longleftrightarrow & \vdots \\[4pt]
\dfrac{\partial}{\partial \beta_k} S(\beta) = 0 & & -2\sum_{i=1}^n x_{ki}\big(Y_i - X_i'\beta\big) = 0
\end{array} \qquad (2)$$
which is equivalent to the system
$$\left.\begin{array}{rcl}
\sum_{i=1}^n Y_i &=& \sum_{i=1}^n X_i'\beta \\[2pt]
\sum_{i=1}^n x_{1i} Y_i &=& \sum_{i=1}^n x_{1i} X_i'\beta \\[2pt]
& \vdots & \\[2pt]
\sum_{i=1}^n x_{ki} Y_i &=& \sum_{i=1}^n x_{ki} X_i'\beta
\end{array}\right\} \quad\Longleftrightarrow\quad \sum_{i=1}^n X_i Y_i = \sum_{i=1}^n X_i X_i'\,\beta,$$
and the least squares estimator reads
$$\hat\beta = \left(\sum_{i=1}^n X_i X_i'\right)^{-1} \sum_{i=1}^n X_i Y_i. \qquad \text{[Matrix: } \hat\beta = (X'X)^{-1}X'Y\text{]} \qquad (3)$$
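To make the formula concrete, here is a minimal numpy sketch (not part of the original handout) that computes $\hat\beta = (X'X)^{-1}X'Y$ on simulated data; the sample size, regressors, and parameter values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a small data set: n observations, k = 2 regressors plus a constant.
n, k = 200, 2
x = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), x])          # X is n x (k+1); first column is the constant
beta_true = np.array([1.0, 2.0, -0.5])        # illustrative values
u = rng.normal(size=n)
Y = X @ beta_true + u

# Least squares estimator: beta_hat = (X'X)^{-1} X'Y, cf. equation (3).
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # solve() avoids forming the inverse explicitly
print(beta_hat)
```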
By substituting (1) into (3) we find that
$$\hat\beta = (X'X)^{-1}X'(X\beta + U) = \beta + (X'X)^{-1}X'U,$$
such that
$$\hat\beta - \beta = \left(\sum_{i=1}^n X_i X_i'\right)^{-1} \sum_{i=1}^n X_i u_i. \qquad \text{[Matrix: } \hat\beta - \beta = (X'X)^{-1}X'U\text{]} \qquad (4)$$
A1: $E(U \mid X) = 0$.

The expectation of $\hat\beta$ is then
$$E(\hat\beta) = \beta + E\!\left[\left(\sum_{i=1}^n X_i X_i'\right)^{-1} \sum_{i=1}^n X_i u_i\right] = \beta, \qquad \text{[Matrix: } E(\hat\beta) = \beta + E[(X'X)^{-1}X'U] = \beta\text{]} \qquad (5)$$
since, conditional on $X$, the term in brackets has expectation $(X'X)^{-1}X'E(U \mid X) = 0$ under A1.
We want to use the true finite-sample distribution of $\hat\beta$. Since we cannot always obtain this, we resort to the asymptotic distribution, which is likely to be a good approximation of the unknown finite-sample distribution if $n$ is large. We consider the distribution of $\hat\beta$ as $n \to \infty$, where $\hat\beta$ is the LS estimator (note that $\hat\beta$ depends on $n$, although this is not clear from the notation).
When applying asymptotic results one should keep the following in mind. The estimator, $\hat\beta_n$, has some (unknown) finite-sample distribution, which is approximated by its asymptotic distribution. The finite-sample distribution and the asymptotic distribution are (most likely) different, so we are making an error when we use the asymptotic distribution. However, when $n$ is large, the difference between the two distributions is likely to be small, and the asymptotic distribution is then a good approximation.
Asymptotic Analysis of $\hat\beta$
In addition to Assumption A1 we assume:
A2: $(Y_i, X_{1i}, \ldots, X_{ki})$, $i = 1, \ldots, n$, are iid.
A3: $(u_i, X_{1i}, \ldots, X_{ki})$ have finite fourth moments.
A4: $X'X$ has full rank. [Equivalent to: no perfect multicollinearity, or $\det(X'X) \neq 0$.]
We have already used A4 when we implicitly assumed that the inverse of $X'X$ was well defined. The two other assumptions will be used in the asymptotic analysis.
You are familiar with the law of large numbers and the central limit theorem for scalar variables. The results that concern vectors and matrices are straightforward extensions.
Theorem 1 (Multivariate LLN) Let $\{M_i\}$, $i = 1, 2, \ldots$, be a sequence of matrices whose elements are iid random variables. Then it holds that $\frac{1}{n}\sum_{i=1}^n M_i \xrightarrow{p} E(M_i)$.
So the univariate LLN is just a special case of the multivariate version. The same is
true for the multivariate CLT.
Theorem 2 (Multivariate CLT) Let $\{V_i\}$ be a sequence of $m$-dimensional random vectors that are iid, with mean $\mu_V = E(V_i)$ and covariance matrix $\Sigma_V = \operatorname{var}(V_i) = E[(V_i - \mu_V)(V_i - \mu_V)']$. Then it holds that $\frac{1}{\sqrt{n}}\sum_{i=1}^n (V_i - \mu_V) \xrightarrow{d} N_m(0, \Sigma_V)$.
Since $\{X_i\}$ is a sequence of iid random variables (vectors), it follows that $\{X_i X_i'\}$ is a sequence of iid random variables (matrices). So by the multivariate LLN it holds that $\frac{1}{n}\sum_{i=1}^n X_i X_i' \xrightarrow{p} Q_X \equiv E(X_i X_i')$. Similarly, $\{X_i u_i\}$ is a sequence of iid random variables (vectors), with expected value $E[X_i u_i] = E[E(X_i u_i \mid X_i)] = E[X_i E(u_i \mid X_i)] = E[X_i \cdot 0] = 0$. So by the multivariate central limit theorem we have that $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i u_i \xrightarrow{d} N_{k+1}(0, \Sigma_v)$, where $\Sigma_v \equiv \operatorname{var}(X_i u_i) = E(X_i X_i' u_i^2)$. Here we are implicitly using Assumption A3, which guarantees that the expected values $E(X_i X_i')$ and $E(X_i X_i' u_i^2)$ are well defined (finite).
Theorem 3 (Linear Transformation of Gaussian Variables) Let $Z \sim N_m(\mu, \Sigma)$, for some vector, $\mu$, ($m \times 1$), and some matrix, $\Sigma$, ($m \times m$). Let $A$ be an $l \times m$ matrix and $b$ an $l \times 1$ vector. Define the $l$-dimensional random variable $\tilde{Z} = AZ + b$. Then it holds that $\tilde{Z} \sim N_l(A\mu + b, A\Sigma A')$.
Theorem 4 (Asymptotic Linear Transformation of Gaussian Variables) Let $Z_n \xrightarrow{d} N_m(\mu, \Sigma)$, for some vector, $\mu$, ($m \times 1$), and some matrix, $\Sigma$, ($m \times m$). Let $A_n \xrightarrow{p} A$ and $b_n \xrightarrow{p} b$ for some constant $l \times m$ matrix, $A$, and some constant $l \times 1$ vector, $b$. Define the $l$-dimensional random variable $\tilde{Z}_n = A_n Z_n + b_n$. Then it holds that $\tilde{Z}_n \xrightarrow{d} N_l(A\mu + b, A\Sigma A')$.
In the present context, we can set $A_n = \left(\frac{1}{n}\sum_{i=1}^n X_i X_i'\right)^{-1}$, $b_n = 0$, and $Z_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n X_i u_i$, and from the theorem it follows that
$$\sqrt{n}(\hat\beta - \beta) = \underbrace{\left(\frac{1}{n}\sum_{i=1}^n X_i X_i'\right)^{-1}}_{\xrightarrow{p}\ Q_X^{-1}} \underbrace{\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i u_i}_{\xrightarrow{d}\ N_{k+1}(0,\,\Sigma_v)} \xrightarrow{d} N_{k+1}\big(0,\ Q_X^{-1}\Sigma_v Q_X^{-1}\big).$$
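This limit result can be illustrated by simulation. The sketch below (illustrative only, with an arbitrary heteroskedastic design of our choosing) draws many samples and checks that $\sqrt{n}(\hat\beta - \beta)$ is centered at zero with a stable covariance matrix, as the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 2000
beta_true = np.array([1.0, 2.0])

draws = np.empty((reps, 2))
for r in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    u = rng.normal(size=n) * (1 + 0.5 * np.abs(x))     # heteroskedastic errors with E(u|X) = 0
    Y = X @ beta_true + u
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    draws[r] = np.sqrt(n) * (beta_hat - beta_true)

# Across replications, sqrt(n)(beta_hat - beta) should have mean close to zero,
# and its sample covariance reflects the asymptotic covariance Q_X^{-1} Sigma_v Q_X^{-1}.
print(draws.mean(axis=0))
print(np.cov(draws, rowvar=False))
```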
The covariance matrix of $\sqrt{n}(\hat\beta - \beta)$, $\Sigma_{\sqrt{n}(\hat\beta - \beta)} = Q_X^{-1}\Sigma_v Q_X^{-1}$, can be estimated by $\hat\Sigma_{\sqrt{n}(\hat\beta - \beta)} = \hat{Q}_X^{-1}\hat\Sigma_v\hat{Q}_X^{-1}$, where
$$\hat{Q}_X \equiv \frac{1}{n}\sum_{i=1}^n X_i X_i' \qquad \text{and} \qquad \hat\Sigma_v \equiv \frac{1}{n-k-1}\sum_{i=1}^n X_i X_i'\hat{u}_i^2.$$
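As a practical illustration, one possible numpy implementation of $\hat{Q}_X$, $\hat\Sigma_v$, and the resulting estimate $\hat{Q}_X^{-1}\hat\Sigma_v\hat{Q}_X^{-1}$ is sketched below; the function name and interface are ours, not from the handout.

```python
import numpy as np

def robust_cov(X, Y):
    """Estimate Q_X^{-1} Sigma_v Q_X^{-1} using Q_hat and Sigma_v_hat as defined above."""
    n, p = X.shape                                   # p = k + 1 (constant included in X)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # least squares estimate
    u_hat = Y - X @ beta_hat                         # residuals
    Q_hat = (X.T @ X) / n                            # (1/n) sum X_i X_i'
    Sigma_v_hat = (X * u_hat[:, None] ** 2).T @ X / (n - p)   # 1/(n-k-1) sum X_i X_i' u_hat_i^2
    Q_inv = np.linalg.inv(Q_hat)
    return beta_hat, Q_inv @ Sigma_v_hat @ Q_inv     # estimate of Sigma_{sqrt(n)(beta_hat - beta)}
```

The divisor `n - p` corresponds to the $1/(n-k-1)$ degrees-of-freedom correction in $\hat\Sigma_v$ above.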
How can we be sure that $\hat\Sigma_{\sqrt{n}(\hat\beta - \beta)}$ is consistent for $\Sigma_{\sqrt{n}(\hat\beta - \beta)}$? We have already established that $\hat{Q}_X \xrightarrow{p} Q_X$, and since $\{X_i, Y_i\}$ is iid (Assumption A2), also $\{X_i X_i'(Y_i - X_i'\beta)^2\} = \{X_i X_i' u_i^2\}$ is iid, such that a LLN gives us $\frac{1}{n}\sum_{i=1}^n X_i X_i' u_i^2 \xrightarrow{p} E(X_i X_i' u_i^2)$. To establish that $\hat\Sigma_v \xrightarrow{p} \Sigma_v$, we first note that
$$\frac{1}{n}\sum_{i=1}^n X_i X_i'\hat{u}_i^2 = \frac{1}{n}\sum_{i=1}^n X_i X_i' u_i^2 + \frac{1}{n}\sum_{i=1}^n X_i X_i'(\hat{u}_i^2 - u_i^2),$$
and it can be shown that $\frac{1}{n}\sum_{i=1}^n X_i X_i'(\hat{u}_i^2 - u_i^2) \xrightarrow{p} 0$ (beyond the scope of EC1630), such that $\hat\Sigma_v = \frac{n}{n-k-1}\cdot\frac{1}{n}\sum_{i=1}^n X_i X_i'\hat{u}_i^2 \xrightarrow{p} E(X_i X_i' u_i^2) = \Sigma_v$, using $\frac{n}{n-k-1} \to 1$ as $n \to \infty$. Since the mapping $\{Q_X, \Sigma_v\} \mapsto Q_X^{-1}\Sigma_v Q_X^{-1}$ is continuous, we know that $\hat{Q}_X \xrightarrow{p} Q_X$ and $\hat\Sigma_v \xrightarrow{p} \Sigma_v$ imply that $\hat\Sigma_{\sqrt{n}(\hat\beta - \beta)} = \hat{Q}_X^{-1}\hat\Sigma_v\hat{Q}_X^{-1} \xrightarrow{p} Q_X^{-1}\Sigma_v Q_X^{-1} = \Sigma_{\sqrt{n}(\hat\beta - \beta)}$, as we wanted to show.
Multiplying $\sqrt{n}(\hat\beta - \beta)$ by $\frac{1}{\sqrt{n}}$ and adding $\beta$ shows that, asymptotically, $\hat\beta$ is normally distributed about $\beta$, with a covariance matrix that is given by $\Sigma_{\hat\beta} \equiv \frac{1}{n}\Sigma_{\sqrt{n}(\hat\beta - \beta)}$. In practice we will use the estimate, $\hat\Sigma_{\hat\beta} \equiv \frac{1}{n}\hat\Sigma_{\sqrt{n}(\hat\beta - \beta)}$.
2.1 Test About a Single Regression Coefficient
Consider the vector of regression coefficients, $\beta = (\beta_0, \ldots, \beta_k)'$, and suppose that we are interested in the $j$th coefficient, $\beta_j$. We can let $d = (0, \ldots, 0, 1, 0, \ldots, 0)'$ denote the $j$th unit-vector (the vector which has 1 as its $j$th element and zeros otherwise). Then we note that
$$d'\sqrt{n}(\hat\beta - \beta) = \sqrt{n}(d'\hat\beta - d'\beta) = \sqrt{n}(\hat\beta_j - \beta_j),$$
and by Theorem 4 it follows that $\sqrt{n}(\hat\beta_j - \beta_j) = d'\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N_1(0,\ d'\Sigma_{\sqrt{n}(\hat\beta - \beta)}d)$. So for large $n$ it holds that
$$\hat\beta_j - \beta_j \stackrel{A}{\sim} N_1(0,\ d'\hat\Sigma_{\hat\beta}d),$$
which allows us to construct the t-statistic of the hypothesis $H_0: \beta_j = c$. It is given by
$$t_{\beta_j = c} = \frac{\hat\beta_j - c}{\sqrt{d'\hat\Sigma_{\hat\beta}d}},$$
which, for large $n$, is approximately distributed as a standard normal, $N(0,1)$. (For moderate values of $n$, it is typically better to use the t-distribution with $n-k-1$ degrees of freedom.)
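A hypothetical helper that computes this t-statistic from data could look as follows; it recomputes the robust covariance estimate from above and uses the $N(0,1)$ approximation for the p-value (scipy is assumed to be available, and the function name is illustrative).

```python
import numpy as np
from scipy import stats

def t_test(X, Y, j, c=0.0):
    """t-statistic for H0: beta_j = c, using the heteroskedasticity-robust variance above."""
    n, p = X.shape                                        # p = k + 1
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    u_hat = Y - X @ beta_hat
    Q_inv = np.linalg.inv(X.T @ X / n)
    Sigma_v_hat = (X * u_hat[:, None] ** 2).T @ X / (n - p)
    cov_beta_hat = Q_inv @ Sigma_v_hat @ Q_inv / n        # estimate of Sigma_{beta_hat}
    t = (beta_hat[j] - c) / np.sqrt(cov_beta_hat[j, j])
    p_value = 2 * stats.norm.sf(abs(t))                   # N(0,1) reference for large n
    return t, p_value
```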
2.2 Test About Multiple Regression Coefficients
To test hypotheses that involve multiple coefficients we need the following result.
Theorem 5 Let $Z \sim N_m(\mu, \Sigma)$, for some vector, $\mu$, ($m \times 1$), and some (full rank) matrix, $\Sigma$, ($m \times m$). Then it holds that
$$(Z - \mu)'\Sigma^{-1}(Z - \mu) \sim \chi^2_m.$$
Here we use $\chi^2_m$ to denote the chi-squared distribution with $m$ degrees of freedom. In our asymptotic analysis the result we need is the following.
Theorem 6 Let $Z_n \xrightarrow{d} N_m(\mu, \Sigma)$, for some vector, $\mu$, ($m \times 1$), and some (full rank) matrix, $\Sigma$, ($m \times m$). Suppose that $\hat\mu \xrightarrow{p} \mu$ and that $\hat\Sigma \xrightarrow{p} \Sigma$. Then it holds that
$$(Z_n - \hat\mu)'\hat\Sigma^{-1}(Z_n - \hat\mu) \xrightarrow{d} \chi^2_m.$$
In our setting we have established that $\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N_{k+1}(0,\ \Sigma_{\sqrt{n}(\hat\beta - \beta)})$ and $\hat\Sigma_{\sqrt{n}(\hat\beta - \beta)} \xrightarrow{p} \Sigma_{\sqrt{n}(\hat\beta - \beta)}$. Thus the theorem tells us that
$$\sqrt{n}(\hat\beta - \beta)'\,\hat\Sigma_{\sqrt{n}(\hat\beta - \beta)}^{-1}\,\sqrt{n}(\hat\beta - \beta) = (\hat\beta - \beta)'\left[\tfrac{1}{n}\hat\Sigma_{\sqrt{n}(\hat\beta - \beta)}\right]^{-1}(\hat\beta - \beta) = (\hat\beta - \beta)'\,\hat\Sigma_{\hat\beta}^{-1}\,(\hat\beta - \beta) \xrightarrow{d} \chi^2_{k+1}.$$
This enables us to test the hypothesis that the vector of regression parameters equals a particular vector, e.g., $H_0: \beta = \beta^o$. All we need to do is to compute $(\hat\beta - \beta^o)'\,\hat\Sigma_{\hat\beta}^{-1}\,(\hat\beta - \beta^o)$ and compare this (scalar) number to the quantile (e.g. the 95%-quantile) of the $\chi^2_{k+1}$-distribution.
An important distribution that is closely related to the $\chi^2$-distribution is the $F_{q,\infty}$-distribution. It is defined as follows. Suppose that $Z \sim \chi^2_q$; then $U = Z/q \sim F_{q,\infty}$. So an $F_{q,\infty}$ is simply a $\chi^2_q$ that has been divided by its degrees of freedom. So, should we prefer to use an F-test to test $H_0: \beta = \beta^o$, we would simply use that
$$F_{\beta = \beta^o} = \frac{(\hat\beta - \beta^o)'\,\hat\Sigma_{\hat\beta}^{-1}\,(\hat\beta - \beta^o)}{k+1} \stackrel{A}{\sim} F_{k+1,\infty},$$
where $F_{\beta = \beta^o}$ denotes the test statistic and $F_{k+1,\infty}$ represents the F-distribution with $(k+1, \infty)$ degrees of freedom. (An F-distribution has two degrees of freedom.)
Typically we are interested in more complicated hypotheses than $\beta_j = c$ or $\beta = \beta^o$. A general class of hypotheses can be formulated as $H_0: R\beta = r$, for some $q \times (k+1)$ matrix, $R$, and some $q \times 1$ vector, $r$.
How can we test hypotheses of this kind? First we note that Theorem 4 gives us that
$$R\sqrt{n}(\hat\beta - \beta) \xrightarrow{d} N_q\big(R\cdot 0,\ R\,\Sigma_{\sqrt{n}(\hat\beta - \beta)}R'\big) = N_q\big(0,\ R\,\Sigma_{\sqrt{n}(\hat\beta - \beta)}R'\big).$$
The left-hand side can be rewritten as $R\sqrt{n}(\hat\beta - \beta) = \sqrt{n}(R\hat\beta - R\beta) = \sqrt{n}[(R\hat\beta - r) - (R\beta - r)]$, which equals $\sqrt{n}(R\hat\beta - r)$ if the null hypothesis is true. So if we divide by $\sqrt{n}$ we get that $(R\hat\beta - r) \stackrel{A}{\sim} N_q(0,\ \tfrac{1}{n}R\,\Sigma_{\sqrt{n}(\hat\beta - \beta)}R') = N_q(0,\ R\,\Sigma_{\hat\beta}R')$. Thus by using Theorem 6 we can construct a $\chi^2$-test of the hypothesis $H_0: R\beta = r$, using the test statistic
$$(R\hat\beta - r)'\left[R\,\hat\Sigma_{\hat\beta}R'\right]^{-1}(R\hat\beta - r) \xrightarrow{d} \chi^2_q,$$
or the equivalent F-test, which is based on the statistic
$$F_{R\beta = r} = \frac{(R\hat\beta - r)'\left[R\,\hat\Sigma_{\hat\beta}R'\right]^{-1}(R\hat\beta - r)}{q} \stackrel{A}{\sim} F_{q,\infty}.$$
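For completeness, a numpy sketch of how the $\chi^2$/F-statistic for $H_0: R\beta = r$ might be computed from data is given below; the function name and interface are illustrative, not from the handout.

```python
import numpy as np

def wald_F(X, Y, R, r):
    """F-statistic for H0: R beta = r, using the robust covariance estimate of beta_hat."""
    n, p = X.shape                                        # p = k + 1
    q = R.shape[0]                                        # number of restrictions
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    u_hat = Y - X @ beta_hat
    Q_inv = np.linalg.inv(X.T @ X / n)
    Sigma_v_hat = (X * u_hat[:, None] ** 2).T @ X / (n - p)
    cov_beta_hat = Q_inv @ Sigma_v_hat @ Q_inv / n        # estimate of Sigma_{beta_hat}
    diff = R @ beta_hat - r
    chi2_stat = diff @ np.linalg.solve(R @ cov_beta_hat @ R.T, diff)
    return chi2_stat / q           # compare with F_{q,infinity} (or chi2_stat with chi^2_q)
```

Setting `R` to the identity matrix and `r` to $\beta^o$ reproduces the $F_{\beta = \beta^o}$ statistic from the previous display.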
Tables with critical values for the $\chi^2_m$-distribution can be found in S&W on page 645. The $F_{m_1,m_2}$-distribution is tabulated on pages 647-649, and the $F_{m,\infty}$-distribution (which you will use most frequently) is tabulated on page 646, and (conveniently) on the very last page of the book.
2.3 A Simple Example
Suppose that we have estimated
$$\hat\beta = \begin{pmatrix} 4.0 \\ 2.5 \\ -1.5 \end{pmatrix} \qquad \text{and} \qquad \hat\Sigma_{\hat\beta} = \begin{pmatrix} \tfrac{801}{40} & -\tfrac{27}{8} & 6 \\[3pt] -\tfrac{27}{8} & \tfrac{3}{4} & -\tfrac{9}{8} \\[3pt] 6 & -\tfrac{9}{8} & \tfrac{15}{8} \end{pmatrix}$$
and wanted to test the following hypotheses:
1. $H_1: \beta_3 = 0$. We note that $H_1$ is equivalent to
$$R_1\beta = r_1, \qquad \text{where } R_1 = (0,\ 0,\ 1) \text{ and } r_1 = 0.$$
We find that
$$R_1\hat\beta = -1.5 \qquad \text{and} \qquad R_1\hat\Sigma_{\hat\beta}R_1' = \tfrac{15}{8},$$
such that the F-statistic of $H_1$ is given by
$$F_{\beta_3 = 0} = (R_1\hat\beta - r_1)'\left[R_1\hat\Sigma_{\hat\beta}R_1'\right]^{-1}(R_1\hat\beta - r_1) = (-1.5)\left(\tfrac{15}{8}\right)^{-1}(-1.5) = 1.2 \stackrel{A}{\sim} F_{1,\infty}.$$
Note that when we test a single restriction, the F-statistic is the square of the t-statistic:
$$F_{\beta_3 = 0} = t^2_{\beta_3 = 0} = \frac{(\hat\beta_3 - 0)^2}{\widehat{\operatorname{var}}(\hat\beta_3)}.$$
2. $H_2: \beta_2 = \beta_3$. This hypothesis is equivalent to $\beta_2 - \beta_3 = 0$, so we set
$$R_2 = (0,\ 1,\ -1) \qquad \text{and} \qquad r_2 = 0,$$
and find
$$R_2\hat\beta = 2.5 - (-1.5) = 4 \qquad \text{and} \qquad R_2\hat\Sigma_{\hat\beta}R_2' = \tfrac{39}{8}.$$
So the F-test is given by
$$F_{\beta_2 = \beta_3} = (R_2\hat\beta - r_2)'\left[R_2\hat\Sigma_{\hat\beta}R_2'\right]^{-1}(R_2\hat\beta - r_2) = 4\left(\tfrac{39}{8}\right)^{-1}4 = 3.2821 \stackrel{A}{\sim} F_{1,\infty}.$$
3. $H_3: \beta_2 + \beta_3 = 0$. Here we set
$$R_3 = (0,\ 1,\ 1) \qquad \text{and} \qquad r_3 = 0.$$
We find
$$R_3\hat\beta = 2.5 - 1.5 = 1 \qquad \text{and} \qquad R_3\hat\Sigma_{\hat\beta}R_3' = \tfrac{3}{8},$$
and the F-test is given by
$$F_{\beta_2 + \beta_3 = 0} = 1\left(\tfrac{3}{8}\right)^{-1}1 = 2.6666 \stackrel{A}{\sim} F_{1,\infty}.$$
4. $H_4: \beta_2 = \beta_3 = 0$. Now we set
$$R_4 = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad \text{and} \qquad r_4 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$
Then
$$R_4\hat\beta = \begin{pmatrix} 2.5 \\ -1.5 \end{pmatrix} \qquad \text{and} \qquad R_4\hat\Sigma_{\hat\beta}R_4' = \begin{pmatrix} \tfrac{3}{4} & -\tfrac{9}{8} \\[3pt] -\tfrac{9}{8} & \tfrac{15}{8} \end{pmatrix},$$
such that
$$F_{\beta_2 = \beta_3 = 0} = \frac{1}{2}\begin{pmatrix} 2.5 & -1.5 \end{pmatrix}\begin{pmatrix} \tfrac{3}{4} & -\tfrac{9}{8} \\[3pt] -\tfrac{9}{8} & \tfrac{15}{8} \end{pmatrix}^{-1}\begin{pmatrix} 2.5 \\ -1.5 \end{pmatrix} = 17.666 \stackrel{A}{\sim} F_{2,\infty}.$$
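The four F-statistics above can be verified numerically. The short script below (not part of the handout) reproduces them from $\hat\beta$ and the lower-right $2\times 2$ block of $\hat\Sigma_{\hat\beta}$, which is all that the four restrictions use.

```python
import numpy as np

beta_hat = np.array([4.0, 2.5, -1.5])
# Only the lower-right 2x2 block of Sigma_hat enters the four tests below.
cov = np.array([[3/4, -9/8],
                [-9/8, 15/8]])

def F_stat(R, r, b, V):
    """F-statistic (R b - r)'[R V R']^{-1}(R b - r) / q for the restriction R b = r."""
    diff = R @ b - r
    return diff @ np.linalg.solve(R @ V @ R.T, diff) / R.shape[0]

b23, r0 = beta_hat[1:], np.zeros(1)
print(F_stat(np.array([[0.0, 1.0]]), r0, b23, cov))    # H1: beta_3 = 0        -> 1.2
print(F_stat(np.array([[1.0, -1.0]]), r0, b23, cov))   # H2: beta_2 = beta_3   -> 3.2821
print(F_stat(np.array([[1.0, 1.0]]), r0, b23, cov))    # H3: beta_2+beta_3 = 0 -> 2.6667
print(F_stat(np.eye(2), np.zeros(2), b23, cov))        # H4: beta_2=beta_3 = 0 -> 17.667
```

The printed values match the statistics 1.2, 3.2821, 2.6666, and 17.666 reported above (up to rounding).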
Calculus with Conditional Expectations
When Xi is random, the concept of conditional expectation is an important tool in the
analysis.
A conditional expected value is written $E(Y \mid X = x)$, and denotes the expected value of $Y$ when we know that $X = x$. (So $E(Y)$ is the expected value of $Y$ when we know nothing about other random variables.) When we write $E(Y \mid X)$, we think of it as a function of $X$. Since $X$ is random, $E(Y \mid X)$ is also random (in the general case).
Some properties of conditional expected values are listed below.
1. If $X$ and $Y$ are independent, then $E(Y \mid X) = E(Y)$.
2. If $\operatorname{cov}(X, Y) \neq 0$, then $E(Y \mid X)$ depends on $X$, in which case $E(Y \mid X)$ is a random variable (as $X$ is random). So it is meaningful to talk about the expected value of a conditional expected value.
3. $E(Y f(X) \mid X) = E(Y \mid X) f(X)$, for any function $f$. So (functions of) the variable we condition on can be taken outside the conditional expectation.
4. $E(Y) = E[E(Y \mid X)]$.
The properties of conditional expectations are very useful for our analysis. We shall
make use of arguments such as the following example.
Example 7 Suppose that $E(u \mid X) = 0$. Then we have that
$$E(uX) \stackrel{(4)}{=} E[E(uX \mid X)] \stackrel{(3)}{=} E[E(u \mid X)X] = E[0 \cdot X] = 0,$$
where the numbers above the equality signs refer to the properties listed above.
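A quick Monte Carlo check of Example 7 (not part of the handout): with errors whose conditional mean given $X$ is zero, but whose variance depends on $X$, the sample analogue of $E(uX)$ is close to zero. The design below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# u depends on X only through its variance, so E(u|X) = 0 still holds.
X = rng.normal(size=n)
u = rng.normal(size=n) * (1 + X**2)

# Sample analogue of E(uX): should be close to 0, as Example 7 predicts.
print(np.mean(u * X))
```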