Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
6 views96 pages

Mult 2023 Final 1

The document provides an overview of multivariate analysis focusing on sampling distributions, linear algebra concepts, and matrix properties. It discusses topics such as Gram-Schmidt orthogonalization, spectral decomposition of symmetric matrices, and the characterization of projection matrices. Additionally, it addresses rank-nullity generalization and the implications of eigenvalues in determining matrix definiteness and projections.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views96 pages

Mult 2023 Final 1

The document provides an overview of multivariate analysis focusing on sampling distributions, linear algebra concepts, and matrix properties. It discusses topics such as Gram-Schmidt orthogonalization, spectral decomposition of symmetric matrices, and the characterization of projection matrices. Additionally, it addresses rank-nullity generalization and the implications of eigenvalues in determining matrix definiteness and projections.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 96

MULTIVARIATE ANALYSIS: SAMPLING DISTRIBUTIONS

References:

1. C. R. Rao: Linear Statistical Inference & Its Applications

2. T. W. Anderson: An Introduction To Multivariate Statistical Analysis

3. N. L. Johnson & S. Kotz: Disributions in Statistics

4. R. A. Johnson & D. W. Wichern: Applied Multivariate Statistical Analysis

General notation: When K is a finite set, |K| denotes its cardinality; when a
real number, its modulus; when a non-scalar square matrix, its determinant.

1 Linear Algebra Recap:

First, a fixed notation: for p ∈ N, we shall write the standard (orthonormal) basis of
1 , 0, . . . , 0)′ for 1 ≤ j ≤ p.
Rp as {e1 , . . . , ep }, where ej = (0, 0, . . . , |{z}
˜ ˜ ˜ th
j
1. Nonsingular matrices with one row/column given: An often invoked fact we get
out of the way first. Recall that any linearly independent set of vectors in Rp
can be extended into a basis of Rp . So, if we start with one non-null vector
x1 ∈ Rp and extend the set {x1 } into a basis {x1 , x2 , . . . , xp }, then the matrix
˜ ˜ ˜ ˜ ˜
A = [x1 |x2 |· · · |xp ] (respectively, its transpose) is nonsingular with the given x1
˜ ˜ ˜ ˜
as its first row (respectively, column).

Obviously, if instead of one we had started with k linearly independent vectors


x1 , x2 , . . . , xk and extended their set into a basis, then using vectors in the basis
˜ ˜ ˜
as rows/columns would yield a matrix with initial k rows/columns given.

1
2. Gram-Schmidt orthogonalization: if x1 , x2 , . . . , xk are (ordered) linearly inde-
˜ ˜ ˜
pendent vectors in Rp (clearly p ≥ k), then the following (ordered) sequence
of orthogonal vectors is its Gram-Schmidt orthogonalization: u1 = x1 ; and
˜ ˜
recursively for 2 ≤ j ≤ k,
j−1
X ui
uj = xj − x′j ui · p˜
˜ ˜ i=1
˜ ˜ ∥ui ∥
˜

with the norm ∥·∥ being the Euclidean one: ∥u∥ := u′ u.
˜ ˜˜
The set has the same span as the original set. In fact, sp({u1 , . . . , uj }) =
˜ ˜
sp({x1 , . . . , xj }) ∀j = 1, . . . , k. That is because actually, if we write sp({x1 , . . . , xj })
˜ ˜ ˜ ˜
as Sj for 1 ≤ j ≤ k, then in fact for 2 ≤ j ≤ k,
j−1
X ui
x′j ui · p˜ = PSj−1 (xj ) and uj = xj − PSj−1 (xj ) = PSj−1 ⊥ (xj )

i=1
˜ ˜ ∥ui ∥ ˜ ˜ ˜ ˜ ˜
˜
so that inductively, sp{ui : 1 ≤ i ≤ j} = sp{Sj−1 ∪ {uj }} = sp{Sj−1 ∪ {xj }} =
˜ ˜ ˜
Sj .
n o
u
In effect, ∥˜uii ∥ : 1 ≤ j ≤ n is an orthonormal basis for Sj , 1 ≤ j ≤ k.
˜
3. Orthogonal matrices with one row/column given: Given one vector x1 ∈ Rp
˜
with norm ∥x1 ∥= 1, extend {x1 } into a basis and then run the Gram-Schmidt
˜ ˜
process on it to yield an orthonormal basis {x1 , x2 , . . . , xp }, of Rp . Then the
˜ ˜ ˜
matrix P with x1 , x2 , . . . , xp as rows (respectively, columns) is orthogonal with
˜ ˜ ˜
the given x1 as its first row (respectively, column).
˜
4. Spectral decomposition of symmetric matrices: If Ap×p is symmetric, then it is
expressible as PΛP′ where P is orthogonal and Λ is diagonal. Then, AP =
PΛ = [λ1 P1 |λ2 P2 | . . . |λp Pp ] where P = [P1 |P2 | . . . |Pp ] and Λ = diag(λ1 , λ2 , . . . , λp )
˜ ˜ ˜ ˜ ˜ ˜

2
which in particular means that for every j, APj = λj Pj ; so that λj is an eigen-
˜ ˜
value of A and Pj is an eigenvector for λj .
˜
Another way to express the same fact is that every symmetric matrix is diago-
nalizable by an orthogonal matrix.

In particular, each eigenvalue of A has same geometric and algebraic multiplic-


ities; and further, A possesses a complete set of orthogonal eigenvectors.

The rank of A is of course the number of non-zero eigenvalues; counted with


multiplicities.

5. Quadratic forms: definiteness of A as a quadratic form is determined by the


signs of the eigenvalues: if (1) all are non-negative then A is non-negative
definite (nnd); in that case if (1a) all are positive then A is positive definite
(pd) while if (1b) at least one is 0 then A is positive semidefinite (psd).

Non-positive definiteness, negative definiteness and negative semidefiniteness


are determined analogously: respectively as (2) λj ≤ 0 ∀ j, with either (2a)
λj < 0 ∀ j; or (2b) λj ≤ 0 ∀ j with λj = 0 for at east one j.

If (3) A has positive as well as negative eigenvalues then A is indefinite.

The only matrix that is nnd as well as npd is the null matrix 0.

6. “Square roots” of non-negative definite matrices: A symmetric matrix Bp×p is


nnd iff there exists a matrix C such that B = C′ C. Clearly r(C) = r(C′ C) =
r(B) for any such choice.

The “if” part is trivial: if so, then for any x ∈ Rp , x′ Bx = (Cx)′ (Cx) =
˜ ˜ ˜ ˜ ˜
∥Cx∥2 ≥ 0. For the “only if” part, let B = P′ ΛP be the spectral decomposition
˜
of B where P is orthogonal and Λ = diag(λ1 , λ2 , . . . , λp ) with λj ≥ 0 ∀ j (the

3
1 √ √ p 1
eigenvalues of B); then defining Λ 2 = diag( λ1 , λ2 , . . . , λp ) and C = Λ 2 P,
1
we get C′ C = P′ (Λ 2 )2 P = B.
1
More generally, we can choose C = QΛ 2 P for any orthogonal Q; the choice
Q = Ip gives the above example, but other useful choices exist. For instance,
the choice Q = P′ actually makes C symmetric and in fact nnd itself: in this
case C′ = C and B = C′ C = C2 .
1
We shall call C the square root of B and denote it by B 2 . Note that if B is pd
then so is C.

7. Quadratically constrained maximization of rank 1 quadratic forms: for a sym-


metric matrix B, we write B > 0 to indicate that B is p.d. For fixed Bp×p > 0
and y ∈ Rp \ {0p },
˜ ˜
(ℓ′ y)2
max ˜ ˜ = y ′ B−1 y, attained when ℓ ∝ B−1 y
ℓ∈R p \{0p } ℓ′ Bℓ ˜ ˜ ˜ ˜
˜ ˜ ˜ ˜
To prove this, we write B = C′ C where C is nonsingular. Now, take y 1 =
˜
(C−1 )′ y and define ℓ1 = Cℓ so that
˜ ˜ ˜
(ℓ′ y)2 (ℓ′ C′ (C−1 )′ y)2 (ℓ′1 y 1 )2
˜′ ˜ = ˜ = ˜ ′˜ ≤ y ′1 y 1
ℓ Bℓ ℓ′ C′ ℓ ˜ ℓ1 ℓ1 ˜ ˜
˜ ˜ ˜ ˜ ˜˜
by Cauchy-Schwartz inequality. For the equality case put ℓ1 ∝ y 1 = (C−1 )′ y
˜ ˜ ˜
yielding ℓ = C−1 ℓ1 ∝ (C′ C)−1 y = B−1 y.
˜ ˜ ˜ ˜
8. Projections: A p × p matrix B represents an (orthogonal) projection onto a
subspace S of Rp if for each x ∈ Rp , Bx ∈ S and x − Bx = (Ip − B)x ∈ S ⊥ ;
˜ ˜ ˜ ˜ ˜
so that every vector x ∈ Rp is decomposable as x = Bx + (I − B)x as the sum
˜ ˜ ˜ ˜
of two orthogonal vectors Bx and (I − B)x in S and S ⊥ respectively. Clearly
˜ ˜
then the matrix I − B represents the orthogonal projection onto S ⊥ .

4
Of course, if B is the projection matrix onto S, then S = {Bx : x ∈ Rp } equals
˜ ˜
what is called the column space of B and is denoted usually by C(B).

How to characterize projections? Since ∀ x, y ∈ Rp , Bx ∈ S and (I − B)y ∈ S ⊥ ,


˜ ˜ ˜ ˜
we have Bx ⊥ (I − B)y or equivalently, (Bx)′ (I − B)y = 0 ⇔ x′ (B′ − B′ B)y = 0
˜ ˜ ˜ ˜ ˜ ˜
and in particular, for every i, j ∈ {1, 2, . . . , p} taking x = ei and y = ej we see
˜ ˜ ˜ ˜
that the matrix B′ − B′ B has each entry 0 so that it is the null matrix.

Thus B′ = B′ B is symmetric, i.e. B = B′ = B′ B = B2 and thus B is also


idempotent.

Conversely, if B is symmetric and idempotent, then since ∀ x ∈ Rp , Bx ⊥


˜ ˜
(I − B)x, it follows that B is the projection onto S := C(B).
˜
Yet another characterization of projections is in terms of eigenvalues: if λ is an
eigenvalue of B then because λ2 is always an eigenvalue of B2 with the same
eigenvectors, we have when B = B2 that λ = λ2 or λ = 0 or 1. Conversely, if a
symmetric matrix B has only 0 and 1 as eigenvalues, then the diagonal matrix Λ
in its spectral representation B = PΛP′ is idempotent, so B2 = PΛP′ PΛP′ =
PΛ2 P′ = PΛP′ = B i.e. B is idempotent too, hence a projection matrix.

As a convention we shall gather the 1 eigenvalues first and 0 eigenvalues last in


the spectral decomposition of a projection matrix: if B is one suchwith rank
Ir 0r×(p−r)
r, we shall take the representation B = P   P′ where
0(p−r)×r 0(p−r)×(p−r)
P is orthogonal, as its spectral decomposition.

A general fact true for idempotent matrices of course holds for projection ma-
trices B: since rank of B is the number of non-zero eigenvalues (counted with
multiplicity) and the eigenvalues are just 0 and 1, the rank is just the multi-

5
plicity of 1, or the sum of the eigenvalues which equals its trace.

The projection onto a subspace S of Rn will be denoted by PS .

9. Given a general matrix A, the projection into C(A) equals B = A(A′ A)− A′
where the the g-inverse is symmetric: clearly B is symmetric; easy to show
B3 = B2 whence B has only 0 and 1 as eigenvalues whence B is idempotent.

So to show C(B) = C(A). Clearly ⊆ holds. So equality of dimensions i.e.


r(B) = r(A) is enough. But r(B) = tr(B) = tr[A′ A(A′ A)− ] = r (A′ A(A′ A)− )
because the last matrix is easily seen to be idempotent too; now whenever M
is a matrix then for any choice of g-inverse, r(M) ≥ r(MM− ) ≥ r(MM− M) =
r(M) so equality throughout; applying to M = A′ A we get r(B) = r(A′ A) =
r(A).

10. Rank-nullity generalization: Suppose S ⊆ Rn is a subspace and Tm×n is a matrix


with null space N (T ). Then, the dimension of the range of T |S (which is always
a subspace of Rm ) equals dim(S) − dim (S ∩ N (T )).

In the language of linear transformations: if T : Rn → Rm is linear and S is a


subspace of Rn with image T (S) = W (a subspace of Rm ) under T , then

dim(S) = dim(W ) + dim (S ∩ N (T ))

(S = Rn is the Rank-nullity theorem.)

To see why this holds, first treating the case when S ∩ N (T ) = {0n } (in Rn ),
˜
we easily conclude in this case that T |S is one-to-one, so that S and W are
isomorphic, hence have the same dimensions. So let S ∩ N (T ) be non-trivial,
say dim(S ∩ N (T )) = k ≥ 1 and let {x1 , x2 , . . . , xk } be a basis for it. Now
˜ ˜ ˜
discarding the trivial case W = {0m } when S = N (T ), let dim(W ) = r ≥ 1
˜
6
and let {w1 , . . . , wr } be a basis of W . For 1 ≤ j ≤ r, let y j ∈ S be such that
˜ ˜ ˜
T (y j ) = wj . Then we claim that B := {x1 , . . . , xk , y 1 , . . . , y r } forms a basis for
˜ ˜ ˜ ˜ ˜ ˜
S.

To check linear independence, first, clearly all vectors in B are distinct – no


xi can equal any y j for T (xi ) = 0m and T (y j ) = wj ̸= 0m . Now, if a linear
˜ P ˜ P˜ ˜ ˜ ˜ ˜
combination ki=1 ai xi + rj=1 bj y j = 0n , then since applying T we have 0m =
˜ P ˜ ˜ ˜
0m + rj=1 bj T (y j ) = r
P
j=1 b j w j so that b j = 0 ∀ j = 1, . . . , r. Now linear
˜ ˜ ˜
independence of {x1 , x2 , . . . , xk } forces ai = 0 ∀ i = 1, . . . , k also.
˜ ˜ ˜
To verify that B spans S, let x ∈ S; then T (x) ∈ W so ∃ scalars c1 , . . . , cr such
˜ ˜
that T (x) = rj=1 cj wj so that if we define y = rj=1 cj y j then T (x) = T (y).
P P
˜ ˜ ˜ ˜ ˜ ˜
But then this means T (x − y) = 0m ⇒ x − y ∈ S ∩ N (T ) and consequently,
˜ ˜ ˜ ˜ ˜
∃ scalars d1 , . . . , dk such that x − y = ki=1 di xi ; therefore x = ki=1 di xi +
P P
Pr ˜ ˜ ˜ ˜ ˜
c y
j=1 j j ∈ sp(B) as required.
˜
11. Kronecker products of matrices: recall that if Am×n and Bp×q are any two
matrices of arbitrary orders, then their Kronecker product A ⊗ B is defined as
the (mp × nq) matrix partitioned as
 
a B a12 B · · · a1n B
 11 
 a21 B a22 B · · · a2n B 
 
A ⊗ B = ((aij B)) = 
 .. .. . .. 

 . . . . . 
 
am1 B am2 B · · · amn B

Properties of Kronecker products we shall require:

(a) Distributivity over addition: If A and C can be added, then (A + C) ⊗


B = (A ⊗ B) + (C ⊗ B); while if B and C add, then A ⊗ (B + C) =
(A ⊗ B) + (A ⊗ C).

7
(b) Transposes: (A ⊗ B)′ = A′ ⊗ B′ .

(c) Products: if AC and BD are defined, then (A ⊗ B) · (C ⊗ D) = (AC) ⊗


(BD).

An outline of the justification: the (i, j)th submatrix of both sides is


P P
k aik Bckj D = ( k aik ckj )BD for every i and j.

12. Determinant andinverse of


 partitioned matrices: if a square matrix M is par-
A B
titioned as M =   where A and C are square, then
D C

|A||C − DA−1 B| if A is nonsingular

|M|= .
|C||A − BC−1 D| if C is nonsingular

Suppose M is symmetric and nonsingular; then both A and C are so also


(of appropriate orders); and D = B′ . Write F = A−1 B, G = BC−1 , E =
C − B′ F = C − B′ A−1 B and H = A − GB′ = A − BC−1 B′ . Note also that
since |M|= |A|·|E|= |H|·|C|, E and H are also nonsingular. Then
   
−1 −1 ′ −1 −1 −1
A + FE F −FE H −H G
M−1 =  = .
−E−1 F′ E−1 −G′ H−1 C−1 + G′ H−1 G

2 Random Vectors and Multivariate Distributions

We know that an Rp -valued random vector X = (X1 , X2 , . . . , Xp )′ has random vari-


˜
ables as components. We consider their distributions, which are probability measures
Q on (Rp , B(Rp )).

Let us use P as a common notation for underlying probabilities on the space where
relevant random variables, vectors etc. are defined.

8
2.1 Joint, marginal and conditional densities

In this course, we shall only encounter distributions Q with densities; meaning a


function f such that for every Borel A ⊆ Rp ,
RR R
Q(A) = P (X ∈ A) = ··· f (y1 , y2 , . . . , yp ) dy1 dy2 · · · dyp .
˜
A

In the language of absolute continuity and singularity, distributions with density are
absolutely continuous with respect to Lebesgue measure.

If X ∼ Q, the marginals of Q are the distributions of subsets of components of


˜
X. In particular, for 1 ≤ j ≤ p, the function
˜ Z Z Z
fXj (xj ) = · · · fX (x1 , x2 , . . . , xp ) dx1 · · · dxj−1 dxj+1 . . . dxp
˜

is the (1-dimensional) marginal density of Xj . More generally, for a subset {Xj1 , Xj2 , . . . , Xjm }
of {X1 , X2 , . . . , Xp }, the function
Z Z Z
fXj1 ,Xj2 ,...,Xjm (xj1 , xj2 , . . . , xjm ) = ··· fX (x1 , x2 , . . . , xp ) dxi1 dxi2 · · · dxip−m
˜

where {i1 , i2 , . . . , ip−m } = {1, 2, . . . , p}\{j1 , j2 , . . . , jm }; is the m-dimensional marginal


density for {Xj1 , Xj2 , . . . , Xjm }.

When fXj1 ,Xj2 ,...,Xjm (xj1 , xj2 , . . . , xjm ) > 0 we call, with {i1 , i2 , . . . , ip−m } = {1, 2, . . . , p}\
{j1 , j2 , . . . , jm }, the function defined on Rp−m by

(xi1 , xi2 , . . . , xip−m ) 7→ fXi ,Xi2 ,...,Xik−m |Xj1 ,Xj2 ,...,Xjm


(xi1 , xi2 , . . . , xip−m |xj1 , xj2 , . . . , xjm )
1

fX (x1 , x2 . . . , xp )
= ˜
fXj1 ,Xj2 ,...,Xjm (xj1 , xj2 , . . . , xjm )

as conditional density of (Xi1 , Xi2 , . . . , Xip−m ) given that {(Xj1 , Xj2 , . . . , Xjm ) = (xj1 , xj2 , . . . , xjm )};
which is indeed a density, and the distribution with this density we define as the

9
conditional distribution of (Xi1 , Xi2 , . . . , Xip−m ) given that {(Xj1 , Xj2 , . . . , Xjm ) =
(xj1 , xj2 , . . . , xjm )}. What is meant by that is this: for subsets B of Rp−m , we take
the quantity
Z

B
fXi ,Xi ,...,Xi
1 2 p−m |Xj1 ,Xj2 ,...,Xjm (xi1 , xi2 , . . . , xip−m |xj1 , xj2 , . . . , xjm ) dxi1 dxi2 · · · dxip−m

as the definition of the conditional probability that {(Xi1 , Xi2 , . . . , Xip−m ) ∈ B}, given
that {(Xj1 , Xj2 , . . . , Xjm ) = (xj1 , xj2 , . . . , xjm )}.

Of course, it is understood that this does not represent conditional probabilities


in the usual sense because the conditioning event itself has 0 probability. However,
the conditional distribution is a bonafide (p − m)-dimensional distribution.

2.2 Moments

We call a vector µ = (µ1 , . . . , µp )′ ∈ Rp the mean vector of X if E(Xj ) = µj for


˜ ˜
1 ≤ j ≤ p, it being tacit that each Xj is integrable. We denote the mean vector X
˜
by E(X).
˜ !!
When each Xj has finite second moment, the matrix Σ = σij with
1≤i,j≤p

σij = Cov(Xi , Xj ) = E[(Xi − E(Xi ))(Xj − E(Xj ))] = E[Xi Xj ] = µi µj

is called the dispersion matrix/covariance matrix of X and denoted by D(X). Specif-


˜ ˜
ically, the diagonal entries are the variances and we write σii as σi2 to denote the s.d.
of Xi by σi .

Since ∀ i, j, σij is the expected value of the (i, j)-th entry of the (random) matrix
(X − E(X)) (X − E(X))′ , we also represent D(X) as E (X − E(X)) (X − E(X))′
 
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
with the obvious interpretation of entrywise expectations; i.e. Σ = E[(X − µ)(X − µ)′ ].
˜ ˜ ˜ ˜
10
A couple of immediate observations we can make are that for a k × p matrix
A = ((aij )), if we define Y k×1 = (Y1 , . . . , Yk )′ = AX then E(Y ) = Aµ = AE(X)
˜ ˜ ˜ ˜ ˜
since E(Yi ) = E[ pj=1 aij Xj ] = pj=1 aij E(Xj ) for 1 ≤ i ≤ k and D(AX) = AΣA′ =
P P
˜
AD(X)A′ since for 1 ≤ t, s ≤ k,
˜
p p p
X X X
Cov(Ys , Yt ) = Cov( asi Xi , atj Xj ) = asi atj Cov(Xi , Xj ).
i=1 j=1 i,j=1

In particular, with E(X) = µ = (µ1 , . . . , µp )′ and D(X) = Σ = ((σij )), taking k = 1,


˜ ˜ ˜
for ℓ = (l1 , . . . , lp )′ ∈ Rp ,
˜
p
X
′ ′
E(ℓ X) = ℓ E(X) = lj µj and
˜˜ ˜ ˜ j=1
p
X
V (ℓ′ X) = ℓ′ D(X)ℓ = li lj σij .
˜˜ ˜ ˜ ˜ i,j=1

These two properties obviously characterize the mean vector and dispersion matrix
in terms of means and variances of (one-dimensional) random variables: taking ℓ = ej
˜ ˜
yields µj = E(Xj ) and σjj = V (Xj ); while for 1 ≤ i ̸= j ≤ p, taking ℓ = ei + ej ,
˜ ˜ ˜
observe 2σij = (σii + σjj + 2σij ) − (σii + σjj ) = ℓ′ Σℓ − (e′i Σei + e′j Σej ) = V (Xi + Xj ) −
˜ ˜ ˜ ˜ ˜ ˜
(V (Xi ) + V (Xj )) = 2Cov(Xi , Xj ).

More generally, for a random vector consisting of a subset of the components


of X, the mean vector and dispersion matrix are just obtained by reading off the
˜
corresponding entries of µ and Σ. Specifically, for 1 ≤ j1 < j2 < · · · < jk ≤ p; just
˜
observe that

(Xj1 , Xj2 , . . . , Xjk )′ = J X where Jk×p = [ej1 |ej2 | . . . |ejk ]′


˜ ˜ ˜ ˜

and J µ = (µj1 , . . . , µjk )′ while JΣJ ′ = [[σji ,jr ]]1≤i,r≤k .


˜

11
It follows that the dispersion matrix Σ is always non-negative definite. In fact
in cases where it is psd, X does not possess a density. That is because then Σ is
˜
singular, so ∃ ℓ ∈ Rp \ {0p } such that Σℓ = 0p ⇒ V (ℓ′ X) = ℓ′ Σℓ = 0; so that
˜ ˜ ˜ ˜ ˜˜ ˜ ˜
ℓ′ X ≡ E(ℓ′ X) = ℓ′ µ with probability 1. In other words, the distribution of X is
˜˜ ˜˜ ˜˜ ˜
concentrated on the set
{x ∈ Rp : ℓ′ x = ℓ′ µ},
˜ ˜˜ ˜˜
a hyperplane in Rp of dimension p − 1, having p-dimensional Lebesgue measure 0.

In other words, singularity of the dispersion matrix implies that the multivariate
distribution is itself singular with respect to Lebesgue measure.

If X and Y are jointly distributed respectively p- and q-dimensional random vec-


˜ ˜
tors with all components of both square-integrable, then the p × q matrix
 

 = E (X − E(X)) (Y − E(Y ))′


   
Cov(Xi , Yj )

1≤i≤p
 ˜ ˜ ˜ ˜
1≤j≤q

is sometimes called the covariance/cross-covariance matrix of X and Y and denoted


˜ ˜
Cov(X, Y ). Note that Cov(Y , X) = Cov(X, Y )′ . Both matrices are null when X and
˜ ˜ ˜ ˜ ˜ ˜ ˜
Y independent.
˜
When X and Y have the same dimensions, E(X ± Y ) = E(X)±E(Y ) and D(X ±
˜ ˜ ˜ ˜ ˜ ˜ ˜
Y ) = D(X) + D(Y ) ± (Cov(X, Y ) + Cov(Y , X)) and when X and Y independent,
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
D(X ± Y ) = D(X) + D(Y ).
˜ ˜ ˜ ˜
Clearly these observations generalize to any finite sum of random vectors of the
same dimension.

12
2.3 Correlation and regression

σij
When σi , σj > 0, the quantity ρij = σi σj
is called the correlation coefficient between
Xi and Xj ; intended to measure the degree of linear relationship between them. It lies
between -1 and 1. If ρij has a high absolute value, then using a linear (affine) function
of one of Xi and Xj to predict the other is advisable. The best affine functions turn
out to be
xj = µj + βji (xi − µi ) and xi = µi + βij (xj − µj )
σj
respectively for predicting Xj from Xi , and conversely, where βji = ρij σi
and βij is
defined analogously.

Note that ρij depends only upon σij , σii and σjj ; hence only upon the joint
marginal distribution of (Xi , Xj ) and the other components do not play any role.

However, sometimes we wish to use linear (affine) combinations of several variables


to predict one. To simplify notation, we assume our aim is to find scalars α, β2 , . . . , βp
to maximize ρX1 ,α+Ppj=2 βj Xj = ρX1 ,Ppj=2 βj Xj , discounting the trivial case that X1
equals such an affine function identically (i.e. with probability 1). In fact, we assume
Σ > 0 that rules out this, as well as other difficulties that could arise later. But then,
the correlation equals
Pp Pp s
Cov(X1 , j=2 βj Xj ) j=2 βj σ1j ±1 (β ′ Σ21 )2
ρX1 ,Ppj=2 βj Xj = q = q P = √
V (X1 ){V ( pj=2 βj Xj )}
P
σ11 pi,j βi βj σij σ11 β˜ ′ Σ22 β
˜ ˜
 
σ11 Σ12 1×(p−1)
where β = (β2 , . . . , βp )′ and Σ has been partitioned as Σ =  .
˜ Σ21 (p−1)×1 Σ22 (p−1)×(p−1)
Therefore, appealing to our earlier result, the maximum occurs when β ∝ Σ−1
22 Σ21 =:
−1 ˜
βopt , say; and the square of the maximum equals Σ12 Σ 22 Σ21
σ11
.
˜
13
This quantity is called the (squared) multiple correlation coefficient between X1
and (X2 , . . . , Xp ); denoted ρ21.23...p and the optimal solution βopt called the vector of
˜
(multiple) regression coefficients of X1 on X2 , . . . , Xp .

An alternative interpretation of the regression coefficients is that they minimize


p
X
V [X1 − βj Xj ] = V [(1, −β2 , . . . , −βp )X] = σ11 − 2β ′ Σ21 + β ′ Σ22 β
j=2
˜ ˜ ˜ ˜

[Exercise: show that the gradient vector of the map g : X 7→ X ′ AX where A is


˜ ˜ ˜
symmetric, is ∇g(X) = 2AX.]
˜ ˜
Setting the partial derivatives to 0 and writing the resulting normal equations in
vector form yields

0 = −2Σ21 + 2Σ22 β solving to β = Σ−1
22 Σ21 = βopt = (β2 opt, . . . , βp opt) , say.
˜ ˜ ˜ ˜
While the computations above dictate the choice of β2 , . . . , βp , there is also an
optimal choice for α that has great significance too: if we actually set out to min-
imize E[X1 − (α + pj=2 βj Xj )]2 ; then an additional normal equation would arise
P

from the partial derivative wrt α that would yield α = µ1 − pj=2 βj µj so that
P

E[X1 − (α + pj=2 βj Xj )] = 0; implying that with the optimal choice, E[X1 − (α +


P
Pp 2
Pp
j=2 βj opt Xj )] = V [X1 − j=2 βj opt Xj ].
Pp
Thus, µ1 + j=2 βj opt (Xj − µj ) is the best linear (really, affine) predictor of X1
based on X2 , . . . , Xp ; in the sense of minimizing the mean squared deviation between

X1 and such a function. It is convenient to write it as µ1 + βopt (X 2 − µ2 ); where
˜ ˜ ˜
X 2 := (X2 , . . . , Xp )′ and µ2 = E(X 2 ) = (µ2 , . . . , µp )′ .
˜ ˜ ˜
In fact, analogously to the bivariate case, the equation
p
X
x 1 = µ1 + βj opt (xj − µj )
j=2

14
is called the multiple linear regression equation of X1 on X2 , . . . , Xp ; and the variable

(X 2 − µ2 ) = X1 − µ1 + Σ12 Σ−1 −1 −1
  
X1 − µ1 + βopt 22 (X 2 − µ2 ) = X1 −Σ12 Σ22 X 2 − µ1 − Σ12 Σ22 µ2
˜ ˜ ˜ ˜ ˜ ˜ ˜
sometimes called the residual.

Note that the residual and the best linear predictor are actually uncorrelated:
′ ′
  
Cov X1 − µ1 + βopt (X 2 − µ2 ) , µ1 + βopt ( X 2 − µ2 )
˜ ˜  ˜ ˜ ˜ ˜
−1 −1
= Cov X1 , Σ12 Σ22 X 2 − V Σ12 Σ22 X 2
˜ ˜
−1 −1 −1
= Σ12 Σ22 Σ21 − Σ12 Σ22 Σ22 Σ22 Σ21 = 0

Further, the variance of the best linear predictor equals βopt ′


Σ22 βopt = Σ12 Σ−1
22 Σ21 ;
˜ ˜
and that of the residual is σ11 − Σ12 Σ−1
22 Σ21 . It is a good idea to take a relook at the

multiple correlation coefficient at this point.

2.4 Linear regression vs. conditional expectation

As indicated, linear regression is appropriate to predict a component X1 , say, from


the rest i.e. X2 , . . . , Xp of X using only affine functions; but what would we do if
˜
arbitrary functions were allowed instead? The hint is provided by the one of the
considerations that prompted the choice of the predictor: minimization of the mean
squared residual. So we would like to find a function h such that

E (X1 − h(X2 , . . . , Xp ))2 = min E (X1 − g(X2 , . . . , Xp ))2


g

assuming E(X12 ) < ∞, where g varies over all measurable functions with E (g(X2 , . . . , Xp ))2 <
∞.

Clearly, if we were allowed only constants for g, the minimum would occur when
h ≡ E(X1 ). Since when X2 , . . . , Xp are known, any function g(X2 , . . . , Xp ) of X2 , . . . , Xp

15
essentially acts like a constant so far as the conditional distribution of X1 given
X2 , . . . , Xp is concerned, applying that principle to it we see that if we take as h
the function
Z
h(x2 , . . . , xp ) = E(X1 |X2 = x2 , . . . , Xp = xp ) = xfX1 |X2 ,...,Xp (x|x2 , . . . , xp ) dx

when the conditional distribution has a density; then it follows that

E (X1 − h(X2 , . . . , Xp ))2 |X2 , . . . , Xp = min E (X1 − g(X2 , . . . , Xp ))2 |X2 , . . . , Xp .


   
g

In fact the LHS is called the conditional variance of X1 given X2 , . . . , Xp . Now using
the well-known fact that expectation of conditional expectations is the unconditional
expectation, it follows that h satisfies the requirement.

It is important to record that while for computing the best linear predictor, we
only need the mean vector and dispersion matrix of X, for computing the conditional
˜
mean we require typically the whole conditional distribution, which is really obtained
from the joint distribution. We shall see that in the extremely important multivariate
normal model, the two predictors actually coincide, so that the best overall predictor
of X1 actually turns out to be an affine function of X2 , . . . , Xp .

2.5 Partial correlation

Occasionally we like to consider what are known as partial correlation coefficients


between pairs of coordinate variables eliminating the effects of certain other variables.
For simplicity of notation, we take 2 ≤ q < p, 1 ≤ i ̸= j ≤ q and describe the
partial correlation coefficient between between Xi and Xj eliminating the effects of
Xq+1 , . . . , Xp . The natural approach is to regress both Xi and Xj on Xq+1 , . . . , Xp ,
and obtain the correlation coefficient between the two residuals.

16
Clearly the other variables than Xi , Xj and Xq+1 , . . . , Xp have no role to play. Let
us now call (Xq+1 , . . . , Xp )′ as X 2 , its mean vector as µ2 , and construct the dispersion
˜ ˜
matrix of the relevant coordinates:
 
σii σij σi,(q+1) σi,(q+2) . . . σip
 
σij σjj σj,(q+1) σj,(q+2) . . . σjp  
  

  σii σij Σi2 
 

 σi,(q+1) σj,(q+1)  
= σ 
, say,
  ij σjj Σj2 


 σi,(q+2) σ j,(q+2)  

 .. .. ((σst ))2≤s,t≤p 
 Σ2i Σ2j Σ22
 . . 
 
σip σjp

noting that Σ22 = D(X 2 ). Then we know that the two residuals are Ti = Xi − (µi +
˜
Σi2 Σ22 (X 2 − µ2 )) and Tj = Xj − (µj + Σj2 Σ−1
−1
22 (X 2 − µ2 )), with respective variances
˜ ˜ ˜ ˜
σii − Σi2 Σ−1
22 Σ 2i and σ jj − Σ j2 Σ−1
22 Σ 2j ; and their covariance is

Cov(Ti , Tj ) = Cov Xi − Σi2 Σ−1 −1



22 X 2 , Xj − Σj2 Σ22 X 2
˜ ˜
= σij − 2Σi2 Σ22 Σ2j + Σi2 Σ22 Σ2j = σij − Σi2 Σ−1
−1 −1
22 Σ2j

whence their correlation coefficient, called the partial correlation coefficient between
Xi and Xj fixing or eliminating Xq+1 , . . . , Xp is

σij − Σi2 Σ−1 22 Σ2j


ρij.(q+1),(q+2),...,p = q .
σii − Σi2 Σ−1 −1

22 Σ 2i · σ jj − Σ Σ
j2 22 Σ 2j

It is worthwhile to note that Cov(Ti , Tj ) equals exactly the (i, j)th entry of the q × q
matrix Σ11.2 := Σ22 − Σ12 Σ−1
22 Σ21 , where Σ has been partitioned as
 
Σ11 q×q Σ12 q×(p−q)
 .
Σ21 (p−q)×q Σ22 (p−q)×(p−q)

17
The entries of Σ11.2 are written as {σst.(q+1),(q+2),...,p : 1 ≤ s, t ≤ q} and thus
σij.(q+1),(q+2),...,p
ρij.(q+1),(q+2),...,p = √ .
σii.(q+1),(q+2),...,p σjj.(q+1),(q+2),...,p
In other words, ρij.(q+1),(q+2),...,p is obtained in the same way from Σ11.2 as σij is from
Σ.

It will be seen that Σ11.2 has a special significance in the context of multivariate
normal distributions: namely, it is the dispersion matrix of the conditional distribu-
tion of (X1 , . . . , Xq ) given Xq+1 , . . . , Xp , so that in that case the partial correlation
coefficients have a particular meaning: as the correlation coefficient of the conditional
distribution.

It is sometimes useful, particularly from computational considerations, to elimi-


nate the effects of the variables not all at one go but one by one; certain recurrence
relations arise that achieve this but we postpone their discussion.

2.6 Characteristic functions, uniqueness and continuity the-


orems

For a random vector X = (X1 , . . . , Xp ), the function ϕX : Rp → C defined by



˜  ˜
ϕX (t) = E(ei˜t X˜ ) = E ei(t1 X1 +···+tp Xp ) , where t = (t1 , . . . , tp ) ∈ Rp , is called the
˜˜ ˜
characteristic function, or chf for short, of X. Actually, the function takes values
˜
only in the unit disk {z : |z|≤ 1} in C; and can in fact be related to chfs of (1-
dimensional) random variables by

ϕX (t) = ϕt′ X (1)


˜˜ ˜ ˜

which relation also yields the following consequences of the corresponding 1-dimensional
results that we state without proof:

18
Theorem 2.1 (Uniqueness Theorem) If X and Y have the same chfs, then Y ∼
˜ ˜ ˜
X.
˜

While the definition and theory of convergence in distribution will not be covered
in this course, we nevertheless state the following analogue to the corresponding 1-
dimensional theorem:

Theorem 2.2 (Continuity Theorem) If the sequence (X n )n≥1 of p-dimensional


˜
random vectors have respective chfs determining the sequence (ϕn )n≥1 , then (X n )
˜ ˜
converges in distribution iff (ϕn ) converges pointwise to a function ϕ that is continuous
˜ ˜
at 0. In this case ϕ is the chf of the limiting distribution.
˜ ˜

We also refer to convergence in distribution as weak convergence. A nice way to


connect one-dimensional and multidimensional weak convergence is provided by the

Theorem 2.3 (Cramer-Wold Device) If X, (X n )n≥1 are p-dimensional random


d
˜d ˜
vectors then X n →X iff for every ℓ ∈ Rp , ℓ′ X n →ℓ′ X.
˜ ˜ ˜ ˜˜ ˜˜

The multivariate continuity theorem follows from its univariate version and the Cramer-
Wold device, and conversely.

3 Multivariate normal distribution

X = (X1 , X2 , . . . , Xp )′ a p-dimensional random vector, µ = (µ1 , µ2 , . . . , µp )′ ∈ Rp ,


˜ ˜
Σp×p = ((σij )) nnd.

Definition 1. X ∼ Np (µ, Σ) if ∀ ℓ ∈ Rp , ℓ′ X ∼ N ℓ′ µ, ℓ′ Σℓ .

˜ ˜ ˜ ˜˜ ˜˜ ˜ ˜

19
The Np (0, Ip ) is often referred to as the standard p-variate normal distribution.
˜
Some immediate consequences:

(1) Obviously, µ and Σ are the mean vector and dispersion matrix of X. In partic-
˜ ˜
ular, Σ is non-negative definite.

(2) The distribution is singular when Σ is positive semi-definite.



(3) ∀ j = 1, 2, . . . , p, Xj ∼ N (µj , σjj ). We often write σj = σjj .

(4) More generally, ∀ k ≥ 1, Ck×p real matrix, C X ∼ Nk C µ, CΣC ′ : ∀ ℓ ∈ Rk ,



˜ ˜ ˜
ℓ′ C X = (C ′ ℓ)′ X is a normal variable with mean (C ′ ℓ)′ µ = ℓ′ C µ and variance
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
(C ′ ℓ)′ Σ(C ′ ℓ) = ℓ′ (CΣC ′ )ℓ.
˜ ˜ ˜ ˜
(5) Marginals: In particular, whenever 1 ≤ j1 < · · · < jk ≤ p,
 
(Xj1 , Xj2 , . . . , Xjk )′ = [ej1 |ej2 | . . . |ejk ]′ X ∼ Nk (µj1 , . . . , µjk )′ , [[σji ,jr ]]1≤i,r≤k
˜ ˜ ˜ ˜

(6) Of course, prior reference to µ and Σ is really unnecessary, and we can give the
˜
equivalent
Definition 2. X has a multivariate normal distribution if ∀ ℓ ∈ Rp , ℓ′ X has a
˜ ˜ ˜˜
univariate normal distribution.

(7) So we can now say X has a multivariate normal distribution iff it follows
˜
Np (E(X), D(X)).
˜ ˜
(8) Even more generally, if Ck×p is a matrix and b a fixed vector in Rk , then C X+b ∼
˜ ˜ ˜
Nk C µ + b, CΣC ′ .

˜ ˜
(9) The chf of Np (µ, Σ) is: ϕX (t) = ϕt′ X (1) = exp it′ µ − 21 t′ Σt

˜ ˜˜ ˜ ˜ ˜˜ ˜ ˜
In particular, we claim that

20
(10) For 1 ≤ k ̸= j ≤ p, Xk and Xj are independent iff σkj = 0. For then,
 
1 2 2
ϕXk ,Xj (t1 , t2 ) = exp iµk t1 + iµj t2 − (σkk t1 + σjj t2 ) = ϕXk (t1 )ϕXj (t2 ).
2

Likewise, more generally, all the components of X are independent iff Σ is


˜
diagonal; for then,
p p p
!
X 1X 2
Y
ϕX (t) = exp i tj µj − σjj tj = ϕXj (tj )
˜ ˜
j=1
2 j=1 j=1

for the If part where t = (t1 , . . . , tp )′ . Clearly, only if part is trivial.


˜
(We’re using the multivariate version of the uniqueness theorem for character-
istic functions here.)

(11) Let us show that when Σ is positive definite, the distribution actually has
a strictly positive density on Rp (which would mean that the distribution is
actually mutually absolutely continuous with Lebesgue measure on Rp , B p )).
So the distribution is singular iff Σ is singular.

We assume the multivariate transformation of densities formula. Since Σ is


pd, ∃ B non-singular such that Σ = BB ′ . Define Y = B −1 X; then Y ∼
˜ ˜ ˜
Np B −1 µ, B −1 Σ(B −1 )′ ; in particular D(Y ) = B −1 BB ′ (B −1 )′ = Ip . This

˜ ˜
means that the components Y1 , . . . , Yp of Y are independent each with vari-
˜
ance 1. Further, if we write ν = B µ, then ν = (ν1 , . . . , νp )′ = E(Y ). Hence
−1
˜ ˜ ˜ ˜
Y has a joint density given by
˜
p    
Y 1 1 2 1 1 ′
fY (y) = √ exp − (yj − νj ) = p exp − (y − ν) (y − ν)
˜ ˜
j=1
2π 2 (2π) 2 2 ˜ ˜ ˜ ˜

and the transformation g : X 7→ Y = B −1 X is one-to-one on Rp with inverse


˜ ˜ ˜ 1
−1
given by g (y) = B y with Jacobian |B|≡ |Σ| 2 , clearly non-vanishing. Hence
˜ ˜
21
X has joint density
˜
 
−1 −1 1 1 −1 ′ −1
fX (x) = fY (B x) · |B| = 1 p exp − (B x − ν) (B x − ν)
˜ ˜ ˜ ˜ |Σ| 2 (2π) 2 2 ˜ ˜ ˜ ˜
 
1 1 −1 −1 ′ −1 −1
= 1 p exp − (B x − B µ) (B x − B µ)
|Σ| 2 (2π) 2 2 ˜ ˜ ˜ ˜
 
1 1
= 1 p exp − (x − µ)′ (BB ′ )−1 (x − µ)
|Σ| 2 (2π) 2 2 ˜ ˜ ˜ ˜
 
1 1 ′ −1
= 1 p exp − (x − µ) Σ (x − µ)
|Σ| 2 (2π) 2 2 ˜ ˜ ˜ ˜

which is obviously strictly positive on all of Rp .

(12) Conditional distributions: When Σ > 0, any principal submatrix (meaning one
keeping the same rows as columns) is also so. For if Σ1 := ((σij ))i,j∈J , where
∅ =
̸ J = {j1 , j2 , . . . , jk } ⊆ {1, 2, . . . , p} is one such with k < p, then for any
z = (z1 , z2 , . . . , zk )′ ∈ Rk \ {0k }, z ′ Σz = x′ Σx > 0 where x = (x1 , . . . , xp ) ∈ Rp
˜ ˜ ˜ ˜ ˜ ˜ ˜
is obtained by putting zi as xji for 1 ≤ i ≤ k; and taking 0 for xi if i ∈ / J. We
are assuming tacitly that j1 < · · · < jk are in increasing order.

Now, since Σ1 defined above is precisely the dispersion matrix of (Xj1 , Xj2 , . . . , Xjk )′
whose components are a subset of the components of X, it follows that any such
˜
subset has a nonsingular normal distribution in the appropriate dimension.

We wish to derive the conditional distribution of some q components of X given


˜
the remaining (p − q) components, where 1 ≤ q ≤ p − 1. Wlg assume those q
coordinates are X1 , . . . , Xq so the given variables are Xq+1 , . . . , Xp .

For convenience of notation, we shall write X, µ and Σ in partitioned form as


˜ ˜
     
X 1 q×1 µ1 q×1 Σ11 q×q Σ12 q×(p−q)
X= ˜ , µ= ˜ , Σ= 
˜ X 2 (p−q)×1 ˜ µ2 (p−q)×1 Σ21 (p−q)×q Σ22 (p−q)×(p−q)
˜ ˜

22
1 1 ′ −1

Now, X 2 has (marginal) density fX 2 (x2 ) = 1 p−q exp − 2 (x2 − µ2 ) Σ22 (x2 − µ2 ) ;
˜ ˜ ˜ |Σ22 | 2 (2π) 2 ˜ ˜ ˜ ˜
therefore given X 2 = x2 ∈ Rp−q , we have a conditional density for X 1 given by
˜ ˜  ˜
fX  x 1 
˜ ˜x2
fX |X (x1 |x2 ) = fX ˜(x )
˜1 ˜2 ˜ ˜
2
1 p
˜2 ˜ 1 p−q
= |Σ|− 2 (2π)− 2 exp − 2 (x − µ)′ Σ−1 (x − µ) |Σ22 | 2 (2π) 2 exp 21 (x2 − µ2 )′ Σ−1
1
 
22 (x2 − µ2 )
˜ ˜ ˜ ˜ ′
   ˜ ˜ ˜ ˜ 

|Σ|
− 12  x −µ x −µ 
= |Σ (2π) − 2q
exp − 1  ˜ 1 ˜ 1  Σ−1  ˜ 1 ˜ 1 −(x2 − µ2 )′ Σ−1 22 ( x 2 − µ 2 ) 
22 | 2
x2 − µ 2 x2 − µ 2 ˜ ˜ ˜ ˜ 
˜ ˜ ˜ ˜
 
x
for x1 ∈ Rq , where x =  ˜ 1 .
˜ ˜ x2
˜
By the expression for the determinant and inverse of a partitioned matrix in
Section 1, we know |Σ|= |Σ11 − Σ12 Σ−1
22 Σ21 |·|Σ22 |= |Σ11.2 ||Σ22 |, writing Σ11 −

Σ12 Σ−1 −1
22 Σ21 as Σ11.2 ; and further writing Σ12 Σ22 as B,
 
−1 −1
Σ11.2 −Σ11.2 B
Σ−1 =  
′ −1 −1 ′ −1
−B Σ11.2 Σ22 + B Σ11.2 B
so the exponent in the conditional density becomes, temporarily writing x1 − µ1
˜ ˜
as y and x2 − µ2 as z,
˜ ˜" ˜ ˜
 ′  #
1 y
− Σ−1 y − z ′ Σ−1 22 z
2 ˜z ˜z ˜ ˜
˜ ˜
1
= − y ′ Σ−1 ′ −1 ′ −1 ′ ′ −1 ′ −1

11.2 y − 2y Σ11.2 Bz + z Σ22 z + z B Σ11.2 Bz − z Σ22 z
2 ˜ ˜ ˜ ˜ ˜i ˜ ˜ ˜ ˜ ˜
1h ′ −1 
=− y − Bz Σ11.2 y − Bz
2 ˜ ˜ ˜ ˜
1 h ′ −1 i
=− x − µ1 − B(x2 − µ2 ) Σ11.2 x1 − µ1 − B(x2 − µ2 )
2 ˜1 ˜ ˜ ˜ ˜ ˜ ˜ ˜
and hence the conditional density becomes
 
− 21 − 2q 1 ′ −1 
|Σ11.2 | (2π) exp − x − {µ1 + B(x2 − µ2 )} Σ11.2 x1 − {µ1 + B(x2 − µ2 )}
2 ˜1 ˜ ˜ ˜ ˜ ˜ ˜ ˜
23

which is nothing but the density given for the Nq µ1 + B(X 2 − µ2 ), Σ11.2 dis-
˜ ˜ ˜
tribution.

However, the final result suggests an alternative derivation that turns out to
involve far fewer calculations with partitioned matrices and avoids inverting
them altogether: we define
      
Y 1 q×1 X − BX 2 Ip −B X
Y = ˜ = ˜1 ˜ =   ˜ 1  = C X, say,
˜ Y2 X2 0(p−q)×q Ip−q X2 ˜
˜ (p−q)×1 ˜ ˜
 
µ1 − Bµ2
when Y ∼ Np C µ, CΣC ′ . Now, we write C µ as  ˜

˜ ; and simplify
˜ ˜ ˜ µ2
D(Y ) = CΣC ′ as ˜
˜      
Ip −B Σ Σ12 I 0q×(p−q)
  ·  11 · p 
0(p−q)×q Ip−q Σ21 Σ22 −B′ Ip−q
   
Σ11 − BΣ21 Σ12 − BΣ22 I 0q×(p−q)
=  · p 
Σ21 Σ22 −B′ Ip−q
 
′ ′
Σ11 − BΣ21 − Σ12 B + BΣ22 B Σ12 − BΣ22
=  

Σ21 − Σ22 B Σ22

and observe that BΣ21 = Σ12 Σ−1 ′ ′ −1 −1


22 Σ21 = Σ12 B and BΣ22 B = Σ12 Σ22 Σ22 Σ22 Σ21 =

Σ12 Σ−1 ′
22 Σ21 too, while Σ12 − BΣ22 = Σ12 − Σ12 = 0q×(p−q) and Σ21 − Σ22 B =

0(p−q)×q ; so that D(Y ) finally turns out as


˜
   
−1
Σ − Σ12 Σ22 Σ21 0 Σ 0
 11  =  11.2 
0 Σ22 0 Σ22

which means that Y 1 ∼ Nq µ1 − Bµ2 , Σ11.2 independently of Y 2 . So the con-
˜ ˜ ˜ ˜
ditional distribution of Y 1 = X 1 − BX 2 given X 2 would be exactly its uncondi-
˜ ˜ ˜ ˜
24

tional distribution; namely, Nq µ1 − Bµ2 , Σ11.2 so that that of X 1 = Y 1 +BX 2
˜ ˜ ˜ ˜ ˜
would be Nq BX 2 + µ1 − Bµ2 , Σ11.2 or Nq µ1 + B(X 2 − µ2 ), Σ11.2 .
˜ ˜ ˜ ˜ ˜ ˜
Note that this alternative method works even if Σ11 is singular, only Σ22 needs
to be nonsingular.

We have already encountered the matrix B when q = 1: as the vector of (linear)


regression coefficients of X1 on X2 , . . . , Xp . In analogy with that, B is called
the (linear) regression matrix of X 1 on X 2 .
˜ ˜
The fact that E(X 1 |X 2 ) = µ1 + B(X 2 − µ2 ) now has the interpretation, referred
˜ ˜ ˜ ˜ ˜
to earlier in Subsection 2.4, that for the multivariate normal distribution, best
prediction of any component or components using other components is equiva-
lent to best linear prediction.

(13) An important observation is that the conditional dispersion matrix Σ11.2 ; i.e.
that of the conditional distribution given Y 2 , does not involve Y 2 . It is some-
˜ ˜
times called the residual dispersion matrix.

The form of the conditional dispersion matrix imparts the following meaning to
the partial correlation coefficients: e.g. since for 1 ≤ i ̸= j ≤ q, ρij.(q+1),(q+2),...,p
is the correlation coefficient obtained from Σ11.2 in the same way as correlation
coefficients are obtained from dispersion matrices in bivariate distributions, they
can be interpreted as conditional correlation coefficients between Xi and Xj
given Xq+1 , . . . , Xp .

(14) Clearly the regression matrix of X 2 on X 1 would be Σ21 Σ−1 11 and the conditional
˜ ˜
dispersion matrix would be Σ22.1 := Σ22 − Σ21 Σ−1 11 Σ12 . In fact, the conditional

distribution is Np−q µ2 + Σ21 Σ−1



11 (X 1 − µ1 ), Σ22.1 .
˜ ˜ ˜
(15) Let us describe at this point the recursive computational method for the par-

25
tial correlations referred to earlier. Suppose partial correlations given a sub-
set Xq+1 , . . . , Xp of the components of a multivariate normal random vector
X = (X1 , X2 , . . . , Xp )′ are already computed where 3 ≤ q ≤ p. Then for
˜
1 ≤ i ̸= j < q, the partial correlation coefficient ρij.q,(q+1),...,p can be computed
from those already obtained by using the following procedure.

The idea is to extract the conditional joint distribution of Xi and Xj given


Xq , Xq+1 , . . . , Xp , or more precisely, the dispersion matrices of the conditional
distribution, from that of (Xi , Xj , Xq ) given Xq+1 , . . . , Xp . Using an obvious
notation,

fi,j,q,(q+1),...,p fi,j,q,(q+1),...,p /f(q+1),...,p fi,j,q|(q+1),...,p


fij|q,(q+1),...,p = = =
fq,(q+1),...,p fq,(q+1),...,p /f(q+1),...,p fq|(q+1),...,p

In other words, from the trivariate normal conditional distribution of (Xi , Xj , Xq )


given Xq+1 , . . . , Xp , extracting further the conditional bivariate normal distri-
bution of Xi and Xj given Xq yields the required conditional distribution.

We need only the respective conditional dispersion matrices:


 
σ σij.(q+1),...,p σiq.(q+1),...,p
 ii.(q+1),...,p 
D(Xi , Xj , Xq |Xq+1 , . . . , Xp ) =  σij.(q+1),...,p σjj.(q+1),...,p σjq.(q+1),...,p
 

 
σiq.(q+1),...,p σjq.(q+1),...,p σqq.(q+1),...,p

So, it is enough to obtain a relationship between ordinary correlation coefficients


in trivariate distributions and its partial correlation coefficients.
 Consider a 
σ σ σ
 xx xy xz 
trivariate normal random vector (X, Y, Z)′ with dispersion matrix  σxy σyy σyz ;
 
 
σxz σyz σzz
from which the dispersion matrix of the conditional distribution of (X, Y ) given

26
Z has been extracted as
     
2
σxz σxz σyz
σ σ σ σxx − σxy −
 xx xy  −  xz  σzz −1
· [σxz σyz ]′ =  σzz
2
σzz 
σxz σyz σyz
σxy σyy σyz σxy − σzz
σyy − σzz

which yields for the partial correlation coefficient ρxy.z of (X, Y ) given Z the
formula
σxy σyz
σxy − σxz σyz √ − √ σxz · √
σzz σxx σyy σxx σzz σyy σyz ρxy − ρxz ρyz
q q = q q 2
=p p
σxx −
2
σxz
σyy −
2
σyz
1−
2
σxz
· 1−
σyz 1 − ρ2xz 1 − ρ2yz
σzz σzz σxx σzz σyy σzz

ρij.(q+1),...,p −ρiq.(q+1),...,p ρjq.(q+1),...,p


Thus ρij.q,(q+1),...,p = q q .
1−ρ2iq.(q+1),...,p 1−ρ2jq.(q+1),...,p

In fact these recursive relations still hold even if the distribution is not a mul-
tivariate normal.

4 Univariate Sampling Distributions

4.1 noncentral χ2
Pp
Let X ∼ Np µ, Ip . The distribution of Y = X ′ X =
 2
j=1 Xj is said to be a
˜ ˜ ˜ ˜
noncentral χ2 with degrees of freedom p and noncentrality parameter/ncp λ := µ′ µ =
2 ˜˜
µ . It is typically denoted by χ2p,λ .
˜
When µ = 0, the distribution is the central χ2p . Thus the central one is a special
˜ ˜
case of the noncentral one: χ2p = χ2p,0 .

Why does the distribution at all depend only on λ = µ′ µ?


˜˜
Assume µ ̸= 0. Define an orthogonal matrix P with first row proportional to µ′ ,
µ ˜ ˜  √  ˜
i.e. √˜λ . Then Z = PX ∼ Np Pµ, PIp P′ or ( λ, 0, . . . , 0)′ , Ip . So the distribution
˜ ˜ ˜
of Z depends only on λ; hence so does that of Y = X ′ X = Z ′ Z.
˜ ˜ ˜ ˜ ˜
27
This also yields the following important representation of a noncentral χ2 dis-

tribution: note that writing Z = (Z1 , Z2 , . . . , Zp )′ we have Z1 ∼ N ( λ, 1) so that
˜
Y1 := Z12 ∼ χ21,λ and Z2 , . . . , Zp ∼ N (0, 1) and Z1 , Z2 , . . . , Zp are all independent
implying Y2 := pj=2 Zj2 ∼ χ2p−1 independently of Z12 ; since Y = pj=1 Zj2 = Y1 + Y2
P P

it follows that a χ2p,λ r.v. can be expressed as the independent sum of a χ21,λ variable
and a χ2p−1 variable.

One of the most important properties of this distribution is that it can be seen as
a mixture (infinite) of central χ2 distributions. We first prove the same for Y1 := Z12 .

Since Z1 = ± Y 1 , we have

√ dy 1/2 √ d(−y 1/2 )


fY1 (y) = fZ1 ( y) + fZ1 (− y)
dy dy
−1/2 √ 2 √ 2
    
y 1 √ 1 √
= √ exp − ( y − λ) + exp − (− y − λ)
2 2π 2 2
−1/2
 h
y y+λ p p i
= √ exp − exp( yλ) + exp(− yλ)
2 2π 2
 ∞ √
y + λ X ( yλ)2n

1
= √ exp −
2πy 2 n=0
(2n)!
∞ y 1
−λ
X λn e− 2 y n+ 2 −1
=e 2 √
n=0
2π · (2n)!

Now, we use the so-called duplication formula: Γ(2n)Γ( 21 ) = 22n−1 Γ(n)Γ(n + 12 ) for
gamma functions, and get

√ 1 1 1 1
π(2n)! = Γ( )Γ(2n + 1) = 2nΓ(2n)Γ( ) = 22n nΓ(n)Γ(n + ) = 22n n! Γ(n + )
2 2 2 2

28
so that
∞ y 1
−λ
X λn e− 2 y n+ 2 −1
fY1 (y) = e 2 √
n=0
2 · 22n n! Γ(n + 12 )
∞ y 1
X
−λ ( λ2 )n e− 2 y n+ 2 −1
= e 2

n=0
n! 2n+ 12 Γ(n + 1 )
2

X λ ( λ2 )n
= e− 2 f 2 (y)
n=0
n! χ2n+1

Thus the distribution of Y1 is a Poi(λ/2) mixture of χ21 , χ23 , χ25 , . . .

We know Y2 ∼ χ2p−1 ; therefore Y = Y1 + Y2 is a Poi(λ/2) mixture of χ2p , χ2p+2 ,


χ2p+4 , . . . [to see this, fix x > 0, let Yn′ ∼ χ22n+1 independently of Y2 for n ≥ 0; so
that Yn′ + Y2 ∼ χ22n+p . Then

X λ ( λ2 )n
Prob(Y ≤ x) = Prob(Y1 ≤ x − Y2 ) = e− 2 Prob(Yn′ ≤ x − Y2 )
n=1
n!

( λ2 )n

−λ
X
= e 2 Prob(Yn′ + Y2 ≤ x).
n=1
n!

Thus pdf of Y is given by



X λ ( λ2 )n
fY (y) = e− 2 f 2 (y)
n=0
n! χ2n+p
1
Recall that the characteristic function of a χ2n distribution is n . Thus
(1−2it) 2
 n
λ λ 1 itλ
e 2 ( (1−2it) −1)
∞ n −λ ∞
X
−λ (λ/2) 1 e 2 X 2(1−2it) e (1−2it)
ϕY (t) = e 2 p = p = p = p

n=0
n! (1 − 2it)n+ 2 (1 − 2it) 2 n=0 n! (1 − 2it) 2 (1 − 2it) 2
Clearly the function has a power series representation that converges for |2it|= 2|t|<
1 i.e. for t ∈ (− 21 , 12 ). So the above representation also yields that the moment
generating function MY exists on (− 12 , 21 ); and is given by
tλ  
e (1−2t) 1 1
MY (t) = p ; |t|< (actually, at least for t ∈ −∞, )
(1 − 2t) 2 2 2

29
so the cumulant generating function (cgf) equals, at least on − 21 , 21 ,


∞ ∞ ∞
λt p X
n−1 n p X (2t)n X tn n−1  p
γY (t) = ln MY (t) = − ln(1−2t) = λ 2 t + = 2 λ+ n!
1 − 2t 2 n=1
2 n=1
n n=1
n! n

whence the cumulants are

n−1
 p
2 λ+ n! = 2n−1 (n − 1)! (nλ + p)
n

In particular the mean is p + λ and the variance is 2p + 4λ.

Additivity properties: If Y1 ∼ χ2n1 ,λ1 and Y2 ∼ χ2n2 ,λ2 independently, then Y :=


Y1 + Y2 ∼ χ2n1 +n2 ,λ1 +λ2 .
Pnt √
Proof is trivial: write Yt = j=1 Ztj2 where Zt1 ∼ N ( λt , 1) and Ztj ∼ N (0, 1)∀j ̸=
1, t = 1, 2, and the whole collection {Ztj : 1 ≤ j ≤ nt , t = 1, 2} is independent. Then
Y = Y1 + Y2 = t=1,2 nj=1 Ztj2 is a sum of (n1 + n2 ) independent normal variables
P P t

all with variance 1, hence a χ2n1 +n2 ,λ where λ = t=1,2 nj=1 E(Ztj2 ) = λ1 + λ2 .
P P t

Stochastic ordering and monotonicity of tail probabilities

If X and Y are two random variables, we say X is stochastically larger than Y or


equivalently, Y is stochastically smaller than X and write X ≥st Y or Y ≤st X, if

∀ t ∈ R, FX (t) ≤ FY (t) ∀ t ∈ R, F̄X (t) ≥ F̄Y (t)

where FX and FY are the respective cdfs of X and Y ; and F̄ = 1 − F for F = FX , FY .


Essentially, this means that X exceeds any value with at least as much probability
as Y does; thus the name.

Our goal is to see that increasing either the degrees of freedom or the ncp for a
noncentral χ2 variable makes it stochastically larger. First let us state

30
Lemma 4.1 If Y ∼ X + Z where Z is a non-negative r.v. independent of X, then
Y ≥st X.

Proof. For any t ∈ R,


Z ∞
Prob(Y > t) = Prob(X > t − Z) = Prob(X > t − z|Z = z) dFZ (z)
0
Z ∞
= Prob(X > t − z) dFZ (z) by independence
0
Z ∞
≥ Prob(X > t) dFZ (z) = Prob(X > t).
0

We can now conclude

Corollary 4.1 If Yj ∼ χ2nj ,λj for j = 1, 2 with n1 ≥ n2 , λ1 ≥ λ2 , then Y1 ≥st Y2 .

Proof. If n1 > n2 , get Y3 ∼ χ2n1 −n2 ,λ1 −λ2 independent of Y2 ; then Y3 ≥ 0 with
probability 1 and Y1 ∼ Y2 + Y3 by additivity.

So suppose n1 = n2 = n, say. First, consider the special case when n = 1; so that


p
we can write Yj = Zj2 with Zj ∼ N ( λj , 1) for j = 1, 2. Since for t > 0,
√ √ p √ p
Prob(Yj > t) = Prob(|Zj |> t) = Φ(− t − λj ) + 1 − Φ( t − λj ),

let us define, for fixed s > 0 the function g : [0, ∞) → [0, 1] by g(c) = Φ(−s − c) + 1 −
Φ(s−c) and show that it is an increasing function of c: g ′ (c) = −ϕ(−s−c)+ϕ(s−c) =

ϕ(|s − c|) − ϕ(s + c) > 0 since 0 ≤ |s − c|< s + c. Applying to s = t, we obtain the
needed result.

For the case when n > 1, choose Y3 ∼ χ2n−1 independent of Z1 and Z2 . Then
Yj ∼ Zj2 +Y3 . It is an easy exercise to conclude that Z12 ≥st Z22 ⇒ Z12 +Y3 ≥st Z22 +Y3

31
or Y1 ≥st Y2 .

In fact, we often use the special case above. In that case, the stochastic inequality
is actually strict on (0, ∞); i.e. it is implied that if y > 0 then Prob(Y1 > y) >
Prob(Y2 > y).

A nice application concerns testing for equality of means of several normal pop-
ulations with a known common variance: suppose k ≥ 2 and for 1 ≤ j ≤ k,
(Xj1 , . . . , Xj,nj ) is a random sample of size nj > 1 from N (µj , σ 2 ) with µj unknown;
but σ is known. We wish to test for H0 : µ1 = µ2 = · · · = µk against H1 : not H0 at
a given level α ∈ (0, 1). Naturally, the samples are independent.
Pk 1
Pk
Let n := j=1 nj . Define µ̄ = n j=1 nj µj , and for 1 ≤ j ≤ k, the jth sample
mean nj
σ2
 
1 X
X̄j := Xji ∼ N µj ,
nj i=1 nj

and note that X̄1 , X̄2 , . . . , X̄k are independent. So, the overall mean
k k nj
1X 1 XX σ2
X̄ = nj X̄j = Xji ∼ N (µ̄, ).
n j=1 n j=1 i=1 n

Pk ¯ ¯
j=1 nj (Xj −X)2
Our test statistic is Y = σ 2 whose distribution we proceed to derive.
Define
1 √ √ √ ′ 
X := n1 X̄1 , n2 X̄2 , . . . , nk X̄k ∼ Nk µ, Ik
˜ σ ˜
1 √ √ √ ′
where µ := σ n1 µ1 , n2 µ2 , . . . , nk µk .
˜
p ′
, n , . . . , nnk is a vector of norm 1, so we can construct an orthog-
p n1 p n2
Now, n

onal matrix P with this as the first row. Let Z := PX. Then Z ∼ Nk Pµ, Ik ⇒
˜ ˜ ˜ ˜

32
X ′ X = Z ′ Z ∼ χ2k,λ′ , where
˜ ˜ ˜ ˜
X k
λ′ = {E(Zj )}2 = (Pµ)′ (Pµ) = µ′ µ, where Z = (Z1 , Z2 , . . . , Zk )′ .
j=1 ˜ ˜ ˜˜ ˜
In fact, we are actually more interested in
k
X
Zj2 = (Z2 , . . . , Zk )′ (Z2 , . . . , Zk ) ∼ χ2k−1,λ , say,
j=2
Pk
where λ = j=2 {E(Zj )}2 = λ′ − {E(Z1 )}2 ; both of which we now compute:
r r r  k r √ X k √ √ 
n1 n2 nk 1 X nj √ n nj n nµ̄
Z1 = , ,..., X= nj X̄j = X̄j = X̄ ∼ N ,1 ;
n n n ˜ σ j=1 n σ j=1 n σ σ
so that ( k )
k
X 1 X 2
Zj2 = X ′ X − Z12 = 2 nj X̄j 2 − nX̄ = Y ∼ χ2k−1,λ
j=2
˜ ˜ σ j=1

and
 √ 2 k
! Pk
′ nµ̄ 1 X j=1 nj (µj − µ̄)2
λ=µµ− = 2 nj µ2j − nµ̄ 2
= ≥ 0.
˜˜ σ σ j=1
σ2
Then λ = 0 iff H0 is true. Thus under H0 , the distribution of Y becomes central, and
the CR {Y > y} with y := χ2k−1 (α), the upper 100α% cutoff point of the (central)
χ2k−1 distribution, then yields a size α test.

The significance of our work is that when H0 is false, we have λ > 0 =⇒


Prob(Y > y) > α; thus the test is (strictly) unbiased. In fact the power of the
test increases with λ.

4.2 noncentral t

Recall that T ∼ tn if ∃ Z, Y independent with Z ∼ N (0, 1) and Y ∼ χ2n such that


T = √Z . For µ ̸= 0, we say T follows a non-cental t distribution with n df and
Y /n

33
ncp µ, and write T ∼ tn,µ , if

Z √ 1
T = p = nZY − 2
Y /n

where Z ∼ N (µ, 1) and Y ∼ χ2n independently. Clearly, the central case can be
considered a particular case of the non-central one allowing µ = 0.

Various moments of T , e.g., can be worked out directly from the definition. Recall
that if X has a Gamma distribution G(α, λ) then E(X u ) < ∞ iff α + u > 0 and in
this case it equals Γ(α + u)/{λu Γ(u)}. In particular, if Y ∼ χ2n i.e. G n2 , 12 , then

2u Γ( n +u)
EY u = Γ 2n assuming u + n2 > 0.
(2)
This means ET m = nm/2 E(Z m )E(Y −m/2 ) exists for all m such that − m2 + n2 > 0,
Γ( n−m )
or m < n. Then, it equals nm/2 E(Z m ) 2m/2 Γ2 n . Assuming n > 2 and putting m = 1
(2)
and 2, we get

n Γ n−1
r 
2
E(T ) = µ ;
2 Γ n2
n

n Γ − 1 (µ2 + 1)n n(µ2 + 1)
E(T 2 ) = (µ2 + 1) 2 
= = and so
2 Γ n2 2 n2 − 1

n−2
n−1
 !2
2 2 Γ
n(µ + 1) nµ
V (T ) = E(T 2 ) − {E(T )}2 = − 2
.
n−2 2 Γ n2

The most important property of the noncentral t distribution is the monotonicity


of tail probabilities similar to the noncentral χ2 :

Lemma 4.2 If Tµ ∼ tn,µ with n fixed, then (i) ∀ t ∈ R, Prob(Tµ > t) is an increasing
function of µ; and (ii) ∀ t > 0, Prob(|Tµ |> t) is an increasing function of |µ|.

Proof. For (i), recall Tµ = √Zµ as defined (writing Zµ for Z to indicate the role of
Y /n

34
µ). Denoting by fY the density of Y , we see that
r ! Z ∞  r 
Y y
Prob(Tµ > t) = Prob Zµ > t = Prob Zµ > t |Y = y fY (y) dy
n 0 n
Z ∞  r 
y
= Prob Z > t fY (y) dy by independence of Z and Y
0 n
Z ∞ r 
y
= 1−Φ t − µ fY (y) dy
0 n
Z ∞  r 
y
= Φ µ−t fY (y) dy
0 n

and the integrand is clearly an increasing function of µ or every y ∈ [0, ∞); hence so
is the integral over y on [0, ∞). In fact since Φ is strictly increasing on R, if µ1 > µ2
then ∀ x ∈ R, Φ(µ1 − x) > Φ(µ2 − x) so Prob(Tµ1 > t) > Prob(Tµ2 > t) ∀ t ∈ R.

For (ii), we have to first establish that the function in question is indeed a function
of |µ|. Again, for t > 0,
r !
Y
p(µ) := Prob(|Tµ |> t) = Prob |Z|> t
n
Z ∞  r 
y
= Prob |Z|> t Y = y fY (y) dy
0 n
Z ∞  r 
y
= Prob |Z|> t fY (y) dy (by independence)
0 n
Z ∞ r   r 
y y
= 1−Φ t − µ + Φ −t − µ fY (y) dy
0 n n
Z ∞  r   r 
y y
= Φ µ−t +1−Φ µ+t fY (y) dy
0 n n

and the integrand can be seen to be an even function of µ; hence so is p, i.e.


p(−µ) = p(µ); thus it can be considered a function of |µ|. Now to show it is increasing
in |µ|, it is enough to show it is increasing on [0, ∞). Clearly p(·) is differentiable on

35
R, and

Z   r   r 
′ y y
p (µ) = ϕ µ−t −ϕ µ+t fY (y) dy
0 n n
and for µ ≥ 0, since t ≥ 0, so for each y ≥ 0 we have
r r  r   r 
y y y y
µ−t ≤ µ+t =⇒ ϕ µ − t ≥ϕ µ+t
n n n n

meaning the integrand, hence the integral over y on [0, ∞), is also non-negative. So
p is increasing on [0, ∞).

In fact, here too, both the functions can be shown to be strictly increasing. This
we leave as an exercise.

The consequence is similarly in terms of unbiasedness of a test: that of the t-test


for the mean of a normal population with unknown variance: let X1 , . . . , Xn be a
random sample from a N (µ, σ 2 ) population with both µ and σ unknown. We wish to
test for H0 : µ = µ0 vs. H1 : µ > µ0 or H2 : µ < µ0 or H3 : µ ̸= µ0 at a given level
α ∈ (0, 1). The test statistic is

n(X̄ − µ0 )
T :=
S
r
  Pn ¯2
n 2 j=1 (Xj −X)
where X̄ = n1 j=1 Xj ∼ N µ, σn and S := with (n − 1)S 2 /σ 2 ∼
P
n−1

χ2n−1 independently of X̄ so that



n(X̄ − µ0 )/σ
T = ∼ tn−1,µ′
S/σ

n(µ−µ0 )
with µ′ = σ
. As µ′ = 0 under H0 , µ′ > 0 under H1 and µ′ < 0 under H2 , so
we can use the C.R.s {T > tn−1 (α)} for H0 vs. H1 , {T < −tn−1 (α)} for H0 vs. H2
and {|T |> tn−1 (α/2)} for H0 vs. H3 and all three tests are unbiased; in fact strictly.

36
4.3 Noncentral F distribution

In analogy to the (central) F distribution with (m, n) degrees of freedom which is


Y1 /m
defined as that of F = Y2 /n
where Y1 ∼ χ2m and Y2 ∼ χ2n independently, we define
the noncentral F distribution Fm,n,λ with degrees of freedom (m, n) and noncentrality
parameter λ ≥ 0 as that of

Y1 /m
F = where Y1 ∼ χ2m,λ and Y2 ∼ χ2n independently.
Y2 /n

Again, λ = 0 corresponds to the central distribution.

From similar calculations as for the noncentral t, E(F u ) < ∞ iff − m2 < u < n
2
,
and given then by

Γ n2 − u
 n u  n u 
u
E(F ) = E(Y1u )E(Y2−u ) = E(Y1u )
2u Γ n2

m m

In particular,

Γ n2 − 1

n (m + λ)n 1 (m + λ)n
E(F ) = (m + λ) n
 = n
= , n>2
m 2Γ 2 m 2 2 −1 m(n − 2)
n

2 Γ 2 −2
 n 2
2
E(F ) = (2m + 4λ + (m + λ) ) 2 n 
m 2Γ 2
n2 (m2 + 2mλ + λ2 + m + 4λ) 1
= n
 n

m2 4 2
−1 · 2
−2
n2 (m2 + 2mλ + λ2 + 2m + 4λ)
= , n > 4,
m2 (n − 2)(n − 4)
V (F ) = E(F 2 ) − {E(F )}2 , n > 4.

Again, we can derive stochastic ordering and consequences for tests from that.
Suppose m, n are fixed, and Fλ ∼ Fm,n,λ for λ ≥ 0. Then, if λ1 > λ2 ≥ 0 we claim
Fλ1 ≥st Fλ2 . Essentially, this amounts to showing that ∀x > 0, g(λ) := Prob(Fλ > x)

37
Yλ /m
is an increasing function of λ. Writing Fλ = Y /n
with a slight modification of the
notation as in the definition and arguing as in the case of the noncentral t distribution,
we have
  Z ∞  
mxY mxy
g(λ) = Prob Yλ > = Prob Yλ > Y = y fY (y) dy
n 0 n
Z ∞  mxy 
= Prob Yλ > fY (y) dy
0 n

and the integrand being an increasing function of λ for every fixed y > 0, so is the
integral over y on [0, ∞) as earlier. Again, it is left as an exercise to show the function
g is actually strictly increasing on [0, ∞).

The consequence is in terms of the power function of the ANOVA F -test for equal-
ity of means in one-way classified data: let k ≥ 2 and for 1 ≤ j ≤ k, Xj1 , . . . , Xj,nj
a random sample of size nj > 1 from N (µj , σ 2 ) with µj unknown; as is σ. Our aim
is again to test for H0 : µ1 = µ2 = · · · = µk against H1 : not H0 at a given level
α ∈ (0, 1).

As in the case when σ was known, put n := kj=1 nj , µ̄ = n1 kj=1 nj µj , and for
P P
Pnj  2
 P Pnj
1 ≤ j ≤ k, X̄j := n1j i=1 Xji ∼ N µj , σnj , X̄ = n1 kj=1 nj X̄j = n1 kj=1 i=1
P
Xji
Pk ¯ ¯
j=1 nj (Xj −X)2
and the ‘between sum of squares’ Y1 = σ 2 whose distribution we know is
Pk 2
j=1 nj (µj −µ̄)
χ2k−1,λ with λ = σ 2 .

However, σ being unknown, we use the ‘within sum of squares’ to estimate it: for
Pnj S2
1 ≤ j ≤ k, put Sj2 := i=1 (Xji − X̄j )2 so that σj2 ∼ χ2nj −1 independently of X̄j and
among themselves; and so Y2 = σ12 kj=1 Sj2 satisfies
P

Y2 ∼ χ2Pk = χ2n−k
j=1 (nj −1)

independently of X̄1 , . . . , X̄k and hence of Y1 , which is a function of X̄1 , . . . , X̄k .

38
Y1 /(k−1)
Thus F := Y2 /(n−k)
∼ Fk−1,n−k,λ and λ = 0 iff H0 is true. So the C.R. {F > f }
where f = Fk−1,n−k (α), the upper 100α% cutoff point of the central F distribution
with (k − 1, n − k) d.f., is of size α. From what we have justified above, this test
is (in fact strictly) unbiased since any non-null distribution of the test statistic is
a noncentral F with the same d.f.; thus stochastically (strictly) larger than its null
central distribution.

4.4 Fisher-Cochran Theorem


Pp 2 2P
We have already seen that when X ∼ Np (µ, Ip ) then j=1 Xj ∼ χp, pj=1 µ2j with
˜ ˜
notation as before. Of course, it follows that for any choice of k, 1 ≤ k ≤ p and
1 ≤ j1 < · · · < jk ≤ p, ki=1 Xj2i ∼ χ2k,Pk µ2 .
P
i=1 ji

Since whenever P is an orthogonal matrix, Y := PX also has a p-variate normal


˜ ˜
distribution with Ip as dispersion matrix, a similar fact holds for sum of squares of
subsets of the components of Y . Note that such sum of squares are expressible as
˜
quadratic forms in the components of X: if Y = (Y1 , . . . , Yp )′ then for 1 ≤ k ≤ p and
Pk ˜ ˜
1 ≤ j1 < · · · < jk ≤ p, i=1 Yji = (QX) (QX) = X ′ Q′ QX = X ′ BX, say; where
2 ′
˜ ˜ ˜ ˜ ˜ ˜
Qk×p = [P (j1 ) |. . . |P (jk ) ]′ consists of the (j1 , . . . , jk )th rows of P.
˜ ˜
What sort of matrix is B? Clearly it is symmetric; in fact non-negative definite.
Also, since QQ′ = Ik , we have r(B) = r(Q) = k = tr(QQ′ ) = tr(Q′ Q) = tr(B); thus
B is idempotent; i.e. a projection matrix.

The Fisher-Cochran theorem characterizes quadratic forms in X1 , . . . , Xp to have


a χ2 distribution precisely in such cases, inter alia. As a first preparation, a general
fact:

39
Lemma 4.3 If X has a nonsingular p-variate normal distribution and Ap×p is sym-
˜
metric with Prob(X ′ AX ≥ 0) = 1, then A is non-negative definite.
˜ ˜

Proof. Define the function g : Rp → R by g(x) = x′ Ax; then g is continuous on


˜ ˜ ˜
Rp . So if A is not nnd, then ∃ x0 ∈ Rp such that g(x0 ) < 0. But then the open set
˜ ˜
p
G := {x ∈ R : g(x) < 0} is nonempty. Since every nonempty open set has positive
˜ ˜
(Lebesgue) measure, Prob(X ∈ G) > 0 so that Prob(g(X) < 0) = Prob(X ∈ G) > 0,
˜ ˜ ˜
a contradiction.
Actually, only the fact that the distribution has a strictly positive density throughout
on Rp is relevant here and the normality of the distribution is immaterial.

Next, we require a lemma that we may call the identifiability of linear combinations
of independent χ2 variables distributed as a χ2

Lemma 4.4 Let Y1 , Y2 , . . . , Ym be independent, Yj ∼ χ2kj ,λj and Y := m


P
j=1 δj Yj ∼

χ2k,λ with δj ̸= 0 for each j. Then k = m m


P P
j=1 kj , λ = j=1 λj and δ1 = δ2 = · · · =

δm = 1.

Proof. First, from the nonnegativity of Y it follows that the δj > 0 for all j.
 
(If there were a negative δl then Probability of Y < 0 would equal Prob δl Yl < − j̸=l δj Yj
P

which could be evaluated as


" ! # Z !
X X Y
E Prob δl Yl < − δj Yj {Yj : j ̸= l} = Prob. Yl > δl−1 δj y j fYj (yj )dyj > 0
j̸=l j̸=l j̸=l

because the integrand is strictly positive for every choice of {yj : j ̸= l} since the
support of Yl is all of (0, ∞). )

40
Next, comparing cfs,
itδj λj
itλ m
e 1−2it Y e 1−2itδj
=
(1 − 2it)k/2 j=1
(1 − 2itδj )kj /2

So that cross multiplying and squaring, for t ̸= 0 we have


( )
Pm δj λj k
(1 − 2it)k
λ
2i
( 1 −2i
)
− j=1
( 1 −2iδ
) = Q tk 1t − 2i
e t t j
m
= Pm
(1 − 2itδj )kj
kj
t j=1 kj m 1
Q
j=1 j=1 t − 2iδj

Now since the LHS has a finite (in fact real) non-zero limit as |t|→ ∞, so does the
RHS which is a rational function; so the degrees of the polynomials in the numerator
and denominator must match whence k = m
P
j=1 kj . Further, actually evaluating the

limit and taking (natural) logarithms, we see


m m m
Pm
−kj
Y X X
e−λ+ j=1 λj
= δj =⇒ λ − λj = kj ln δj
j=1 j=1 j=1

Further, the cgf of δj Yj is defined, and given by a power series, at least on (− 2δ1j , 2δ1j ).
So at least for |t|< minj 2δ1j ∧ 12 , the cgf γY of Y = m
P
j=1 δj Yj equals

∞ ∞
  m
( m )
X k X X X  kj
2n−1 tn λ + = γY (t) = γYj (δj t) = 2n−1 tn δjn λj +
n=1
n j=1 n=1 j=1
n

and comparing the cumulants by matching coefficients in both sides, we see that for
all n ≥ 1,
m
k X n kj
λ+ = δj (λj + ).
n j=1
n

The LHS is obviously bounded, so must therefore the RHS be, hence each δj must be
at most 1. So let δ1 = δ2 = δr = 1 and δj ∈ (0, 1) for r + 1 ≤ j ≤ m; we shall show
r = m.

41
We have ∀ n,
r   m  
k X kj X
n kj
λ+ = λj + + δj λ j +
n j=1
n j=r+1
n
Pr
and taking limit as n → ∞, we get λ = j=1 λj and plugging back, ∀ n,
r
X m
X
k + nλ = kj + nλ + δjn (kj + nλj )
j=1 j=r+1

Pm
whence, recalling k = j=1 kj , we have ∀ n,
m
X m
X
kj = δjn (kj + nλj ) → 0 as n → ∞.
j=r+1 j=r+1

Pr Pm
It follows that r = m and λ = j=1 λj = j=1 λj .

This immediately yields the extremely useful

Corollary 4.2 If X ∼ Np (µ, Ip ) and Y := X ′ AX ∼ χ2k,λ for some k and λ where A


˜ ˜ ˜ ˜
is symmetric, then A is a projection matrix of rank k and λ = µ′ Aµ.
˜ ˜

Proof. We have from the spectral decomposition of A that A = Pp×m Dm×m P′m×p
where m = r(A), D = diag(d1 , . . . , dm ) with the non-zero eigenvalues of A and
columns of P := [P1 |P2 |. . . |Pm ] are orthonormal. Then with Zm×1 := P′ X =
˜ ˜ ˜ ˜ ˜
(Z1 , . . . , Zm )′ , say, we see that D(Z) = P′ P = Im . Thus, Z1 , Z2 , . . . , Zm are in-
˜
dependent normal each with variance 1 and so Z12 , Z22 , . . . , Zm
2
are independent χ2
variables each with d.f. 1. Now
m
X
Y = Z ′ DZ = dj Zj2 ∼ χ2k,λ
˜ ˜ j=1

42
whence it follows from Lemma 4.4 that k = m = r(A) and d1 = d2 = · · · =
dm = 1 so that A = PP′ is idempotent; hence a projection. Finally we identify
λ= m 2 ′ ′ ′ ′ ′
P
j=1 E(Zj ) = [E(Z)] [E(Z)] = (P µ) (P µ) = µ Aµ.
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜

The next preparatory result concerns decompositions of the total sum of squares:

Lemma 4.5 If In = A1 + · · · + Ak where each Aj is symmetric, then TFAE:


(1) n = kj=1 nj where r(Aj ) = nj for j = 1, 2, . . . , k
P

(2) Each Aj is idempotent (and hence a projection)

(3) Ai Aj = 0 whenever i ̸= j

Proof. (3) =⇒ (2) is the easiest: ∀ j, Aj − A2j = Aj (I − Aj ) = Aj ·


P
i̸=j Ai = 0

For (1) ⇒ (3), get spectral decomposition Aj = Pj Dj P′j for each j; permuted
so that Dj = diag(dj1 , dj2 , . . . , djnj , 0, . . . , 0) whence Aj = Qj ∆j Q′j where Pj =
[Qj |Rj ] and ∆j = diag(dj1 , dj2 , . . . , djnj ) for each j. Note that the columns of Pj are
orthonormal vectors, so that ∀ j, Q′j Qj = Inj .

Define Qn×n := [Q1 |Q2 |· · · |Qk ] and

∆n×n := diag(d11 , d12 , . . . , d1n1 , d21 , d22 , . . . , d2n2 , . . . , dk1 , dk2 , . . . , dknk )
 
∆ 0 ... 0
 1 
 0 ∆2 . . . 0 
 
= 
 .. .. . ..  ,

 . . . . . 
 
0 0 . . . ∆k

so that
X X
In = Aj = Qj ∆j Q′j = Q∆Q′
j j

43
This shows Q is non-singular; so premultiply by Q−1 and postmultiply by Q′ −1 to
get

(Q′ Q)−1 = ∆ =⇒ Q′ Q = ∆−1 = diag(d−1 −1 −1 −1 −1 −1


11 , d12 , . . . , d1n1 , . . . , dk1 , dk2 , . . . , dknk )

i.e.
     
∆−1 0 ... 0 Q′1 Q1 Q′1 Q2 . . . Q′1 Qk In1 Q′1 Q2 . . . Q′1 Qk
 1     
 0 ∆−1
  ′
  Q2 Q1 Q′2 Q2 . . . Q′2 Qk
  ′
... 0   Q2 Q1 In2 . . . Q′2 Qk
 
2 = =

 .. .. .. .. .. .. .. .. ..
 
. .. .. ..
. .
    
 . . .   . . .   . . . 
     
0 0 . . . ∆−1
k Q′k Q1 Q′k Q2 . . . Q′k Qk Q′k Q1 Q′k Q2 . . . Ink

In particular, ∆j = Inj ∀ j and whenever i ̸= j, Q′i Qj = 0 so that Ai Aj =


Qi Q′i Qj Q′j = 0 as required. It is worthwhile to further note that in fact, ∆ = I
and Q is orthogonal.

To show (2)⇒(1), just note that since for each j Aj is idempotent, so r(Aj ) =
tr(Aj ) whence kj=1 r(Aj ) = kj=1 tr(Aj ) = tr( kj=1 Aj ) = tr(In ) = n.
P P P

A significant sidelight that emerges from the proof is that under the hypotheses

44
of the lemma, we have for each j = 1(1)k,
   
′ ′
Q Q
 1   1 
 ..   .. 
 .   . 
   
Q′ Aj Q =  Q′j  [Aj Q1 |· · · |Aj Qj |· · · |Aj Qk ] =  Q′j  [0|· · · |Aj Qj |· · · |0]
   
 ..   .. 
   
 .   . 
   
′ ′
Qk Qk
   
0 · · · Q′1 Aj Qj · · · 0 0 ··· 0 ··· 0
 .. . . .. .. ..   .. . . .. .. .. 
   
 . . . . .   . . . . . 
   

=  0 · · · Qj Aj Qj · · · 0  =  0 · · · Inj · · · 0  .
   
 .. . . .. .. ..   .. . . .. .. .. 
   
 . . . . .   . . . . . 
   

0 · · · Qk Aj Qj · · · 0 0 ··· 0 ··· 0

This is sometimes expressed by saying that the orthogonal matrix Q simultaneously


diagonalises each Aj , j = 1(1)k.

We are ready for


Theorem 4.1 (Fisher-Cochran Theorem) Suppose X ∼ Nn µ, In and In =
˜ ˜
A1 + · · · + Ak with each Aj symmetric. Then (1)–(3) are equivalent to
(4) Yj := X ′ Aj X ∼ χ2nj ,λj for some nj and λj for j = 1, 2, . . . , k. In this case,
˜ ˜
Y1 , Y2 , . . . , Yk are independent, and nj = r(Aj ) while λj = µ′ Aj µ for each j.
˜ ˜

Proof. First we show (1)–(3) =⇒ (4). Writing r(Aj ) = mj and Aj = Qj Q′j where
(Qj )n×mj has orthonormal columns; define as before Qn×n := [Q1 |Q2 |· · · |Qk ] which

45
we now know is orthogonal. Therefore Z := Q′ X ∼ Nn Q′ µ, In . Writing

˜ ˜ ˜
   
(Q′1 X)m1 ×1 (Z )
 ˜   ˜ 1 m1 ×1 
 (Q′2 X)m2 ×1   (Z2 )m2 ×1 
   
Z=  ˜.  =  ˜ ..
   , say; we see for each j,
˜  .
.   .


   

(Qk X)mk ×1 (Zk )mk ×1
˜ ˜
Yj = X ′ Aj X = (Q′j X)′ (Q′j X) = Zj′ Zj ∼ χ2mj ,λj
˜ ˜ ˜ ˜ ˜ ˜
with λj = [E(Zj )]′ [E(Zj )] = (Q′j µ)′ (Q′j µ) = µ′ Aj µ. Independence follows from the
˜ ˜ ˜ ˜ ˜ ˜
fact that for i ̸= j, Cov(Zi , Zj ) = Q′i Qj = 0mi ×mj (∵ Q is orthogonal) so that Zi and
˜ ˜ ˜
Zj are independent, whence so are Yi = Zi′ Zi and Yj = Zj′ Zj .
˜ ˜ ˜ ˜ ˜
(4) =⇒ (2): the idempotence of Aj for each j follows from Corollary 4.2.

Actually, the idempotence of the quadratic form that was seen to be necessary in
Corollary 4.2 we can now prove to be sufficient too, so we have a converse and can
conclude

Corollary 4.3 If X ∼ Np (µ, Ip ) and Y := X ′ AX where A is symmetric, then Y ∼


˜ ˜ ˜ ˜
χ2k,λ for some k iff A is idempotent.

Proof. The ‘Only if’ part was Corollary 4.2. Now if A is idempotent, then so is
Ip − A and I = A + (Ip − A); so the theorem applies with k = 2.

An interesting fact is that the result is in a way symmetric with respect to the two
summands Y1 := X ′ AX and Y2 := X ′ (I − A)X comprising the total sum of squares
˜ ˜ ˜ ˜
X ′ X, a χ2 variable, except for the parameters: Y1 has a χ2 distribution iff so does Y2 .
˜ ˜

46
The same fact also holds for any pair of non-negative random variables adding to
a χ2 :

Corollary 4.4 Suppose X ∼ (µ, Ip ) and Y1 := X ′ A1 X ∼ χ2m,λ1 while Y2 := X ′ A2 X ∼


˜ ˜ ˜ ˜ ˜ ˜
χ2n,λ2 with A1 , A2 symmetric and unequal. If Y1 ≥ Y2 with probability 1, then m > n,
λ1 ≥ λ2 and Y1 − Y2 ∼ χ2m−n,λ1 −λ2 independently of Y2 .

Proof. Since A1 is a projection and r(A1 ) = m, write its spectral decomposition as


  

Im 0m×(p−m) P
 
A1 = PDP′ = P1 p×m P2 p×(p−m)    1  = P1 P′1
0(p−m)×m 0(p−m)×(p−m) P′2
   

P1 X Z1
and put Z = P′ X =  ˜  =  ˜ , say; so that Y1 = Z1′ Z1 . Note that
˜ ˜ P′2 X Z2 ˜ ˜
˜ ˜
Z ∼ Np (P′ µ, Ip ).
˜ ˜  
A m×m C m×(p−m)
Now X = PZ so Y2 = Z ′ P′ A2 PZ = Z ′   Z, say. Then
˜ ˜ ˜ ˜ ˜ C ′
B(p−m)×(p−m) ˜
with probability 1,
 
Im − A −C
0 ≤ Y1 − Y2 = Z ′   Z = Z ′ QZsay,
˜ −C ′
−B ˜ ˜ ˜

which means by Lemma 4.3 that the above matrix Q = ((qij )) is nnd. Since any prin-
cipal submatrix of an nnd matrix has to be nnd, −B is nnd; but then because P′ A2 P
was also a projection matrix and thereby nnd, so is B. So B = 0(p−m)×(p−m) . An easy
conclusion is that C = 0m×(p−m) for if not, say qij ̸= 0 for some 1 ≤ i ≤ m and m+1 ≤
x , 0, 0, . . . , y , 0, 0, . . . , 0)′ ∈ Rp ,
j ≤ p, then because qjj = 0, with z = (0, 0, . . . , |{z}
˜ th
|{z}
i j th
′ 2
z Qz = qii x + 2qij xy cannot be nonnegative for every choice of x and y; in fact for
˜ ˜
qii x2
any choice of x ̸= 0, choosing y < − 2qij x
will do the job.

47
So Y2 = Z1′ AZ1 ∼ χ2n,λ2 ⇒ A is idempotent of rank n, because D(Z1 ) = Im . By
˜ ˜ ˜
assumption Y2 is not identically equal to Y1 = Z1′ Z, so n < m and Im − A is idempo-
˜ ˜
tent of rank m − n; so Y1 − Y2 = Z1′ (Im − A)Z1 ∼ χ2m−n,λ for some λ independently
˜ ˜
of Y2 and as λ2 + λ = λ1 so λ = λ1 − λ2 .

Essentially, the proof rests on establishing that Y2 as a quadratic form only involves
those variables whose sum of squares gives Y1 .

[Exercise. Can you think of a direct proof of this claim: if A, B are projections
and A−B is nnd, then A−B is also a projection? Hint: first show that S := C(B) ⊆
C(A) := T ; for if not, then finding v ∈ S \ T we have Bv = v but u := Av ̸= v ⇒
˜ ˜ ˜ ˜ ˜ ˜
u − v ̸= 0 so that v ′ (A − B)v = v ′ (u − v) = (u − (u − v))′ (u − v) = −∥u − v∥2 < 0,
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
since u ⊥ (v − u).]
˜ ˜ ˜
Recall the simultaneous diagonalizability of matrices yielding independent χ2 vari-
ables as quadratic forms. More generally, just two quadratic forms Y1 = X ′ A1 X and
˜ ˜
Y2 = X ′ A2 X that are independent χ2 variables are given by simultaneously diago-
˜ ˜
nalizable matrices A1 and A2 . To see this, decompose the total sum of squares X ′ X
˜ ˜
as Y1 + Y2 + [X ′ X − (Y1 + Y + 2)] = X ′ [A1 + A2 + (I − A1 − A2 )]X. Of course,
˜ ˜ ˜ ˜
independence simply manifests as A1 A2 = 0.

Extension to several independent χ2 quadratic forms is trivial.

48
4.5 Residual SS in Linear Models; restricted case and testing
linear hypotheses

Before beginning this section let us recall what projections mean geometrically: the
projection y of a vector x on a subspace S represents the vector in S nearest to x; i.e.
2 ˜
˜ ˜
y − x = minz∈S ∥z − x∥2 . For any choice of z ∈ S, since (y − x) ⊥ S, so
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
2 2 2
∥z − x∥2 = (z − y) + (y − x) = (y − x) + (z − y) + 2⟨z − y, y − x⟩
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
2 2
= (y − x) + (z − y) (∵ z − y ∈ S, (z − y) ⊥ (y − x)
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
2
≥ (y − x) .
˜ ˜

In the Gauss-Markov linear model Y n×1 = Xn×k βk×1 + ϵn×1 with ϵ ∼ Nn (0, σ 2 In );
˜ ˜ ˜ ˜ ˜
the LSE of β is
˜
βˆ := arg min∥Y − Xβ∥2
˜ ˜ ˜
2 2
and the minimum R0 := minβ ∥Y − Xβ∥ = ∥Y − Xβ∥ ˆ 2 is the residual sum of squares.
˜ ˜ ˜ ˜
Since {Xβ : β ∈ R } = C(X), the minimum is attained when Xβˆ is the projection
k
˜ ˜ ˜
of Y on C(X). Thus Xβˆ = P (Y ) = X(X′ X)− X′ Y ; and
˜ ˜ C(X) ˜ ˜
2
R02 = (In − P )Y = Y ′ AY , say,
C(X) ˜ ˜ ˜

where A is the projection onto {C(X)}⊥ ⊆ Rn , with rank n − r(X). Since Y ∼


 R2 ˜
Nn Xβ, σ 2 In , σ20 ∼ χ2n−r(X),λ where
˜  
1 ′ 1 ′ ′ 1
Xβ = 2 X′ X − X(X′ X)− X′ X β = 0.

λ = 2 {E(Y )} A{E(Y )} = 2 β X In − P
σ ˜ ˜ σ ˜ C(X) ˜ σ ˜

We sometimes need to the restrict the choice of the parameter β to satisfy a linear
˜
constraint of the form Hβ = γ for some given matrix Hr×k and vector γ ∈ C(H).
˜ ˜
49
Typically, this situation arises when we set this constraint up as a null hypothesis
H0 and wish to test it. Usually, we also assume R(H) ⊆ R(X) ensuring that the
components of Hβ are estimable.
˜
The restricted residual SS is

R12 := min ∥Y − Xβ∥2 .


β∈Rk :Hβ=γ ˜ ˜
˜ ˜ ˜

Now S := {β : Hβ = γ} = β0 + S0 where β0 is any particular solution of Hβ0 = γ,


˜ ˜ ˜ ˜ ˜ ˜ ˜
and S0 = {β : Hβ = 0} = N (H), the null space of H. Thus
˜ ˜ ˜
R12 = min∥Y − Xβ∥2 = min∥Y − Xβ0 − Xβ∥2 .
β∈S ˜ β∈S0 ˜
˜
˜ ˜
˜ ˜

The minimum is attained when Xβˆ is the projection of Y − Xβ0 on the following
˜ ˜ ˜
subspace of Rn :

2
W := {Xβ : Hβ = 0} = {Xβ : β ∈ S0 } with R12 = (In − PW )(Y − Xβ0 ) .
˜ ˜ ˜ ˜ ˜ ˜ ˜
Now Y − Xβ0 can easily be seen to have dispersion matrix σ 2 In too and mean vector
˜ ˜ ν(I −P )ν
ν = Xβ − Xβ0 ; whence R12 /σ 2 follows χ2 with df n − dim(W ) and ncp λ := ˜ n σ2 W ˜ .
˜ ˜ ˜
So we need to identify the dimension of W ; the range of the restrictionof X  to
X
S0 = N (H). Note that S0 ∩ N (X) = {β : Xβ = 0n } ∩ {β : Hβ = 0r } = N  .
˜ ˜ ˜ ˜ ˜ ˜ H
So
 
X
dim(W ) = dim(S0 ) − dim (S0 ∩ N (X)) = n(H) − n  
H
   
X X
= {k − r(H)} − {k − r  } = r   − r(H)
H H

50
which simplifies to r(X) − r(H) if R(H) ⊆ R(X), as is normally assumed. Thus the
df is n − dim(W ) = n − (r(X) − r(H)).

Also, since R12 ≥ R02 so (R12 − R02 )/σ 2 must have χ2 distribution independently of
R02 , from result proved earlier. Its df is n − (r(X) − r(H)) − (n − r(X)) = r(H); and
(R12 −R02 )/r(H) ν(In −PW )ν
it follows that F := R02 /(n−r(X))
has an F distribution; with ncp λ = ˜ σ2 ˜.

Note that in the context of testing, when the null hypothesis H0 : Hβ = γ is true,
˜ ˜
β − β0 ∈ N (H); so ν = X(β − β0 ) ∈ W =⇒ λ = 0. Thus the F -test applies and
˜ ˜ ˜ ˜ ˜
becomes unbiased.

4.6 Fisher-Cochran theorem: Σ ̸= I case

When Σ ̸= Ip but still Σ > 0, writing Σ = BB′ and taking Y = B−1 X ∼


˜ ˜
Np (B−1 µ, Ip ), we see that for a symmetric A, X ′ AX = Y ′ (B′ AB)Y has a χ2 distri-
˜ ˜ ˜ ˜ ˜
bution iff

B′ AB is idempotent ⇔ B′ AB · B′ AB = B′ AB

⇔ AΣA = A ⇔ AΣAΣ = AΣ ⇔ AΣ is idempotent

or equivalently, ΣA is idempotent. Clearly, the df equals r(B′ AB) = r(A) and the
ncp is given by [E(Y )]′ (B′ AB)[E(Y )] = µ′ Aµ.
˜ ˜ ˜ ˜ ˜ ˜
For independence of two such, say X A1 X and X ′ A2 X, note that we need B′ A1 BB′ A2 B =

˜ ˜ ˜ ˜
0 ⇔ A1 ΣA2 = 0.

51
5 Sampling from a normal population

Suppose we have a random sample X1 , . . . , XN from a Np (µ, Σ) population. We shall


˜ ˜ ˜
typically assume Σ > 0 (i.e. Σ is nonsingular) and N > p, for reasons to become
clear in the sequel. Define  
X1

 ˜ 

′ ′ ′ ′
 X2 
Xp×N := (X1 |X2 |. . . |XN ) and XN p×1 := vec(X) = (X1 |X2 |. . . |XN ) =  ˜. ,
˜ ˜ ˜ ˜ ˜ ˜ ˜ 
 .. 

 
XN
˜
i.e. X is the matrix with the sample observations as its columns, and X is the grand
˜
column vector in which they are stacked, one on top of another.

Note that whenever ℓ1 , . . . , ℓN are vectors in Rp , ℓ′1 X 1 , . . . , ℓ′N XN are independent


˜ ˜P ˜ ˜ ˜ ˜
normal variables, so their sum N ℓ ′
j=1 j jX is also normally distributed. Since for any
˜ ˜
N
ℓ ∈ RN p , ℓ′ X = ′ ′ ′ ′
P
j=1 ℓj Xj where ℓ := (ℓ1 , . . . , ℓN ) , we conclude that X has an
˜ ˜˜ ˜ ˜ ˜ ˜ ˜ ˜
N p-dimensional normal distribution.

Let us identify the mean vector and dispersion matrix:


   
µ Σ 0 ... 0
 ˜   
 0 Σ ... 0 
   
 µ 
E(X) =  ˜.  & D(X) =   .. .. . . .. 

˜  ..  ˜ . . 
 
 . .
   
µ 0 0 ... Σ
˜

Thus X ∼ NN p 1N ⊗ µ, IN ⊗ Σ .
˜ ˜ ˜

52
5.1 Distribution of sample mean vector & SSSP matrix: in-
dependence & the Wishart distribution

The sample mean vector is defined as


N
¯ 1 X
X := Xi
˜ N i=1 ˜
and the sample corrected sum of squares and products (SSSP) matrix as
N N
X ′ X
¯ ¯ ¯X
Xj Xj′ − N X ¯′

A := Xi − X Xi − X =
i=1
˜ ˜ ˜ ˜ j=1
˜ ˜ ˜ ˜

A A
The sample dispersion matrix is defined as S = N
or N −1
according to context.
2
Recall that in one dimension, X̄ ∼ N (µ, σN ) and A
σ2
∼ χ2N −1 independently. We
establish analogous properties here too.

Let L be an N × N orthogonal matrix with last column given by √1N 1N and


√ −1 √ ˜
Z = XL. Write Z as (Z1 |Z2 |. . . |ZN ) so that ZN = N X1N = N X; ¯ and likewise,
˜ ˜ ˜ ˜  ˜ ˜
Z
 ˜1 
 
 Z2 
Z := vec(Z) =  ˜. 
 = MX
˜  ..  ˜
 
ZN
˜
where M := L′ ⊗ Ip so that Z also has a (N p)-variate normal distribution with
˜
E Z = ME X = (L′ ⊗ Ip ) · 1N ⊗ µ = (L · 1N ) ⊗ Ip · µ
 
˜ ˜ √ ′ ˜ ˜ ˜ √ ′ ˜
′ ′ ′ ′
= (0, 0, . . . , 0, N ) ⊗ µ = 0p , 0p , . . . , 0p , N µ
˜ ˜ ˜ ˜ ˜
and

D(Z) = MD(X)M′ = M (IN ⊗ Σ) · (L ⊗ Ip ) = M (IN · L) ⊗ (Σ · Ip )


˜ ˜
= (L ⊗ Ip ) · (L ⊗ Σ) = (L′ · L′ ) ⊗ (Ip · Σ) = IN ⊗ Σ

53
√ 
So Z1 , Z2 , . . . , ZN −1 are iid Np (0p , Σ) and ZN ∼ Np N µ, Σ independently
˜ ˜ ˜ ˜ ˜ ˜
with them.

Note that ZZ′ = XLL′ X = XX′ , so A = XX′ − N X ¯X ¯ ′ = ZZ′ − ZN Z ′ =


N
PN −1 ˜ ˜ ˜ ˜
′ ¯
j=1 Zj Zj . Thus A is independent of ZN , or equivalently, of X. Further, the
˜ ˜ ˜ ˜
distribution of A depends only on Σ.

Definition 5.1 If Y 1 , Y 2 , . . . , Y n are iid Np (0, Σ) then the distribution of nj=1 Y j Y ′j


P
˜ ˜ ˜ ˜ ˜ ˜
is called the Wishart distribution with parameters n and Σ, denoted by Wp (n, Σ).

Thus the corrected sample SSSP matrix A of a random sample of size N from a
p-dimensional normal population with mean vector µ and dispersion matrix Σ, has a
˜
Wp (N − 1, Σ) distribution. We shall write n := N − 1 and note that N > p ⇔ n ≥ p.

Before proceeding further we make a crucial change in the notation. Since the
distribution of A depends only on the first N − 1 columns of Z; in fact A is a function
of those alone; we drop the last column and redefine Z as

Zp×n = (Z1 |Z2 |. . . |Zn ) = (Z1 |Z2 |. . . |ZN −1 )


˜ ˜ ˜ ˜ ˜ ˜

now consisting of n iid Np (0p , Σ) distributed columns, so that A = ZZ′ . Thus,


˜
with this changed notation, the definition of Wishart Wp (n, Σ) distribution can be
formulated as that of A = ZZ′ .

A generally useful lemma:

Lemma 5.1 If ℓ1 , ℓ2 , . . . , ℓm are m orthonormal vectors in Rn , then Zℓ1 , Zℓ2 , . . . , Zℓm


˜ ˜ ˜ ˜ ˜ ˜
are iid Np (0p , Σ).
˜

54
Proof. Put L := [ℓ1 , ℓ2 , . . . , ℓm ] and Y j = Zℓj for 1 ≤ j ≤ m so that
˜ ˜ ˜ ˜ ˜
Yp×m := [Y 1 | . . . | Y m ] = ZL.
˜ ˜
If we write Y mp×1 = [Y ′1 | . . . | Y ′m ]′ , then
˜ ˜ ˜
Y = (L′ ⊗ Ip )Z ∼ Nmp (0mp , (L′ ⊗ Ip )(In ⊗ Σ)(L′ ⊗ Ip )′ )
˜ ˜ ˜
i.e. D(Y ) = (L′ ⊗ Ip )(L ⊗ Σ) = (L′ L) ⊗ Σ = Im ⊗ Σ.
˜
[A possibly more transparent alternative proof goes as follows: clearly Y has an
˜
mp-dimensional normal distribution with mean vector 0; to identify the covariances we
   ˜
 with Ytj = n Ztu (ℓj )u ,
    P
write Z =  Ztj  and Y = Ytj
u=1
 
1≤t≤p
 
1≤t≤p ˜
1≤j≤n 1≤j≤m

so that
X X
E(Ytj Ysj ) = E(Ztu (ℓj )u (ℓj )v Zsv ) = (ℓj )2u E(Ztu Zsu ) (∵ u ̸= v ⇒ E(Ztu Zsv ) = 0)
u,v
˜ ˜ u
˜
2
= σst ∥ℓj ∥ = σst (∵ ∥ℓj ∥ = 1) ∀j
˜ ˜
for 1 ≤ s, t ≤ p and independence follows since if i ̸= j then
X X
E(Yti Ysj ) = E(Ztu (ℓi )u (ℓj )v Zsv ) = (ℓi )u (ℓj )u E(Ztu Zsu ) = σts (ℓ′i ℓj ) = 0 (∵ ℓi ⊥ ℓj )
u,v
˜ ˜ u
˜ ˜ ˜˜ ˜ ˜

for 1 ≤ s, t ≤ p.]

5.2 Properties of the Wishart distribution

Recall that if A ∼ Wp (n, Σ) then A = ZZ′ where Zp×n = [Z1 |Z2 |. . . Zn ] with
˜ ˜ ˜
Z1 , Z2 , . . . , Zn ∼ Np (0p , Σ) iid. Assume n ≥ p and Σ > 0.
˜ ˜ ˜ ˜
55
(1) Let us first show that in this case the Wp (n, Σ) Wishart distributed matrix A
is nonsingular, i.e. positive definite, with probability 1 assuming Σ is p.d.

Since A = ZZ′ , rank of A is same as that of Z. We check Prob(r(Z) = p) = 1.


Since Z1 ̸= 0 and because of independence and nonsingularity of Σ, the proba-
˜
bility for each j = 2, . . . , p that Zj will lie in the linear span of {Z1 , Z2 , . . . , Zj−1 }
˜ ˜ ˜ ˜
is 0, therefore with probability 1, {Z1 , Z2 , . . . , Zp } is linearly independent. It
˜ ˜ ˜
follows that with probability 1, C := [Z1 |Z2 |. . . |Zp ] is nonsingular; so its rows
˜ ˜ ˜
are l.i. as well.

But if any non-null linear combination of the rows of Z were to yield null vec-
tor, the same linear combination of the rows of C would do so too, rendering
C singular which happens only with probability 0. So with probability 1, the
rows of Z are l.i. implying Z to have rank p.

Of course, when n < p, r(A) = r(ZZ′ ) = r(Zp×n ) ≤ n < p. Also, when Σ is


singular clearly each Zj lies with probability 1 in a proper subspace S of Rp
˜
defined by S = {x ∈ Rp : ℓ′ x = 0} where ℓ ∈ Rp \ 0 is such that Σℓ = 0p ; thus
˜ ˜˜ ˜ ˜ ˜ ˜
with probability 1, C(Z) ⊆ S so that r(A) = r(Z) = dim(C(Z)) ≤ dim(S) < p
i.e. A is singular with probability 1.

(2) When p = 1, writing Σ as σ 2 we see that the W1 (n, σ 2 ) distribution is that of


a χ2n variable multiplied by σ 2 .

(3) E(A) = nj=1 E(Zj Zj′ ) = nΣ: this means that in the context of estimation of
P
˜ ˜
Σ, An
is unbiased for Σ.

(4) Additivity property: if A ∼ Wp (n, Σ) and B ∼ Wp (m, Σ) are independent, then


A + B ∼ Wp (m + n, Σ). That is because writing A = Zp×n Z′ and B = Yp×m Y′

56
in the usual fashion, we can take Z and Y independent, put W := [Z|Y]p×(m+n)
with iid Np (0, Σ) columns, and A + B = ZZ′ + YY′ = WW′ ∼ Wp (m + n, Σ).
˜
(5) For any k and any k × p matrix H, HAH′ ∼ Wk (n, HΣH′ ): let Y j = HZj ∼
˜ ˜
Nk (0k , HΣH′ ), iid for 1 ≤ j ≤ n, and Y = [Y 1 |. . . |Y n ] = HZ; then HAH′ =
˜ ˜ ˜
YY′ .

(6) In particular, assuming Σ > 0, for any ℓ ∈ Rp \ {0p }, ℓ′ Aℓ ∼ W1 (n, ℓ′ Σℓ) i.e.
˜ ˜ ˜ ˜ ˜ ˜
ℓ′ Aℓ 2
ℓ˜′ Σ˜ℓ
∼ χn .
˜ ˜
(7) The identifiabillity lemma i.e. Lemma 4.4 extends in a nice way to independent
Wishart matrices. If A1 , A2 , . . . , Ak are independent with Aj ∼ Wp (nj , Σ) and
δ1 , δ2 , . . . , δk ̸= 0 with A := kj=1 δj Aj ∼ Wp (n, Σ), then δ1 = · · · = δk = 1 and
P

ℓ′ A ℓ
n = n1 + · · · + nk . That is because for any ℓ ∈ Rp \ {0p }, we have Yj := ˜ℓ′ Σj˜ℓ ∼
˜ ˜ ˜ ˜
ℓ′ Aℓ
χ2nj , and Y := kj=1 δj Yj = ℓ˜′ Σ˜ℓ ∼ χ2n .
P

˜ ˜P
[added 02.06.2023] Even if A = kj=1 δj Aj ∼ Wp (n, Φ), we can prove Φ = cΣ
for some c > 0 and δj = c ∀ j. First, note cA ∼ Wp (n, cΣ) for any c > 0.
ℓ′ A ℓ
Again fixing ℓ ∈ Rp \ {0p } and defining Yj = ˜ℓ′ Σj˜ℓ ∼ χ2nj , we have
˜ ˜ ˜ ˜
k
ℓ′ Aℓ X ℓ′ Σℓ
Y := ˜ ′ ˜ = δj ˜′ ˜ Yj ∼ χ2n
ℓ Φℓ j=1
ℓ Φℓ
˜ ˜ ˜ ˜
ℓ ′ Σℓ ℓ′ Σℓ
and so n = kj=1 nj and δj ˜ℓ′ Φ˜ℓ = 1 ∀ j = 1, . . . , k. But fixing any j, ˜ℓ′ Φ˜ℓ can
P

˜ ˜ ˜ ˜
be a constant cj := δ1j free of ℓ only if Σ = cj Φ; but obviously then cj cannot
˜
vary with j; so let cj ≡ 1c . Thus δj = c for each j and Φ = c1j Σ = cΣ.

(8) ZCZ′ ∼ Wp (k, Φ) (where Cn×n is symmetric) iff Φ = cΣ and 1c C is idempotent


of rank k, for some c > 0.

For the if part, let r := r(C); write 1c C = QQ′ where Qn×r = [Q1 |. . . |Qr ] has
˜ ˜
57
orthonormal columns. Take Y j = ZQj , 1 ≤ j ≤ r; then Y 1 , Y 2 , . . . , Y r are iid
˜ ˜ ˜ ˜ ˜
Np (0, Σ). So writing Y = [Y 1 | · · · | Y r ], we have ZCZ′ = cYY′ ∼ Wp (r, Σ).
˜ ˜ ˜
For only if part, again put r := r(C) and write C = PDP′ where Pn×r =
[P1 |P2 |. . . |Pr ] has orthonormal columns and D = diag(δ1 , δ2 , . . . , δr ) so that C
˜ ˜ ˜ Pr
has non-zero eigenvalues δ1 , δ2 , . . . , δr . Then ZCZ′ = j=1 δj (ZPj )(ZPj ) =

Pr ˜ ˜

j=1 δj Y j Y j where Y j = ZPj , 1 ≤ j ≤ r.
˜ ˜ ˜ ˜
Again we know that {Y j : 1 ≤ j ≤ r} are iid Np (0, Σ). Thus Y 1 Y ′1 , . . . , Y r Y ′r
˜P ˜ ˜ ˜ ˜ ˜
are iid Wp (1, Σ) while rj=1 δj Y j Y ′j ∼ Wp (k, Φ). Therefore by the extension of
˜ ˜
the identifiability lemma, k = r and ∃ c > 0 such that Φ = cΣ and δ1 = δ2 =
· · · = δk = c. So 1c C = kj=1 Pj Pj′ is idempotent of rank k.
P
˜ ˜

5.3 The Wishart density

This derivation is due to Profs. Malay Ghosh & Bimal K. Sinha (The American
Statistician, 2002). Let us first understand on what space we are looking to obtain a
density: since Σ and with probability 1, A, are p × p p.d. matrices, the appropriate
space is the open subset

p(p+1)
Θp := {(a11 , a21 , a22 , . . . , ap1 , . . . , app ) ∈ R 2 : the symmetric matrix ((aij ))1≤i,j≤p is p.d.}

p(p+1)
of R 2 that is also the parametric space, and we obtain a density of A also on that
space.

We have A = ZZ′ where Z = [Z1 |. . . |Zn ] where Z1 , . . . , Zn are iid Np (0p , Σ); with
˜ ˜ ˜ ˜ ˜′ 
U
 ˜1 
 . 
n ≥ p and Σ > 0. Writing U 1 , . . . , U p for the rows of Z, we have Z =  .. .
˜ ˜  
U ′p
˜
58
Denote our target density by gΣ . We first treat the case Σ = Ip . For this, we
proceed recursively, denoting for each j = 1, 2, . . . p the j × j top left principal minor
of A by A[j] ; i.e.
A[j] = [[ail ]]1≤i,l≤j = Z[j] Z′[j]
 
U ′1
 ˜ 
 . 
where Z[j] =  ..  takes the first j rows of Z as its. Note that A[p] = A. Also
 

Uj
˜
denote A[0] = 1.

It will be convenient to partition Z[j] in the following obvious way:


     

Z[j−1] Z Z Z[j−1] U j A[j−1] Y j
Z[j] =   =⇒ A[j] =  [j−1] [j−1] ˜ = ˜ 
U ′j U ′j Z′[j−1] U ′j U j Y ′j ajj
˜ ˜ ˜ ˜ ˜
where Y j := (aj1 , . . . , aj j−1 ) and also to introduce notations for ‘the’ joint conditional
˜
densities of Y j and ajj ; i.e. the variables occurring in the last row of A[j] ; given Z[j−1]
˜
i.e. the variables {Zil : 1 ≤ i ≤ j − 1, 1 ≤ l ≤ n} occurring in the set U1 , . . . , Uj−1 .
Let us denote this by gj|Σ .

The idea is to develop a formula for gj|Ip ; prove that it is actually free of Z[j−1]
except through A[j−1] ; and thus also represents the conditional joint density given
A[j−1] . Now use it recursively over j = 1, 2, . . . , p so that our final target could be
expressed as

gIp = g1|Ip (a11 ) g2|Ip (a21 , a22 |a11 ) · · · gp|Ip (ap1 , ap2 , . . . , app |A[p−1] ). (5.1)

So fix 2 ≤ j ≤ p.

Note in the special Σ = Ip case that the entries Zij of Z are iid, each with a
standard normal distribution. Therefore the random vector U j ∼ Nn (0n , In ) inde-
˜ ˜
59
pendently of Z[j−1] , and it follows that conditionally,

Y j = Z[j−1] U j ∼ Nj−1 Z[j−1] 0n , Z[j−1] In Z′[j−1]


 
≡ Nj−1 0j−1 , A[j−1] .
˜ ˜ ˜ ˜

The next step is to get the conditional joint distribution including ajj , which we
actually obtain in a slightly roundabout way: we actually show that

ajj.j−1 := ajj − Y ′j A−1 ′ ′ ′ ′ −1


[j−1] Y j = U j U j − U j Z[j−1] (Z[j−1] Z[j−1] ) Z[j−1] U j
˜ ˜ ˜ ˜ ˜ ˜

and Y j are conditionally independent; and ajj|j−1 ∼ χ2n−j+1 conditionally. The latter
˜
is because
ajj.j−1 = ∥(In − P )U j ∥2 ∼ χ2n−r(Z[j−1] )
C(Z ′
[j−1]
) ˜
i.e. χ2n−j+1 while the conditional cross-covariance matrix between Y j = Z[j−1] U j and
˜ ˜
V j := (In − P )U j being
˜ C(Z′[j−1] ) ˜
Z[j−1] (In − Z′[j−1] (Z[j−1] Z′[j−1] )−1 Z[j−1] ) = Z[j−1] − Z[j−1] = 0(j−1)×n

we conclude the conditional independence of Y j = Z[j−1] U j and V j ; from which


˜ ˜ ˜
follows that of Y j and ∥V j ∥2 = ajj|j−1 . Thus ‘the’ conditional joint density of Y j and
˜ ˜ ˜
ajj.j−1 is just the product of their respective (conditional) marginal densities.

Now note that the Jacobian for the map (Y ′j , ajj ) 7→ (Y ′j , ajj.j−1 ) is 1 for every j;
˜ ˜
in fact the Jacobian matrix itself is Ij . Thus the conditional joint densities of (Y ′j , ajj )
˜
and (Y ′j , ajj.j−1 ) given Z[j−1] , or equivalently, given A[j−1] , are exactly the same gj|Ip .
˜

60
It follows that for (y j , ajj ) in the appropriate domain (i.e. so that A[j] is p.d.),
˜
h i
exp − 12 y ′j A−1
 (n−j+1)/2−1
y exp − 12 ajj.j−1 ajj.j−1

[j−1] j
gj|Ip (y j , a) = j−1
˜ ˜1 · n−j+1
2 2 Γ n−j+1

˜ (2π) 2 |A[j−1] | 2 2
h i
1 ′ −1
exp − 2 (y j A[j−1] y j + ajj.j−1 )  |A[j] |  n−j−1 2
= n j−1
˜ ˜ 1
·
2 2 π 2 Γ n−j+1 |A[j−1] | 2 |A[j−1] |
2
 ajj  n−j−1
exp − 2 |A[j] | 2
= n j−1 n−j .
2 2 π 2 Γ n−j+1

2
|A[j−1] | 2

Hence the product in equation (5.1) above reduces to


h P i
1 p
exp − 2 j=1 ajj p n−j−1 n−p−1
exp − 12 tr(A) |A| 2
 
Y |A[j] | 2
np Pp j−1 Qp
n−j+1
· n−j = np p(p−1) Qp
n−j+1

2 2 π j=1 2 j=1 Γ 2 j=1 |A[j−1] | 2 2 2 π 4
j=1 Γ 2

Σ general p.d. case: Now an ingenious trick is used exploiting the sufficiency of
A for the distribution of Z (or equivalently Z). Consider the joint density of the
˜
p
columns of Z: for z 1 , z 2 , . . . , z n ∈ R ,
˜ ˜ ˜
Y e− 2 (˜zj Σ−1˜zj )
n 1 ′ 
1

−n
= K|Σ| exp − E , say;
2

j=1
(2π)p/2 |Σ|1/2 2

for constant K where E is (−2)× the exponent in the density. Then


n
X n
 X
tr z ′j Σ−1 z j = tr Σ−1 z j z ′j

E = tr(E) =
j=1
˜ ˜ j=1
˜˜
n
! n
!
X X
−1 −1
Σ z j z ′j = tr Σ ( z j z ′j ) = tr Σ−1 A

= tr
j=1
˜˜ j=1
˜˜

So the joint density becomes, writing z = (z ′1 , . . . , z ′n )′ ,


˜ ˜ ˜
 
−n 1 −1
fΣ (z) = K|Σ| exp − tr(Σ A) ,
2
˜ 2

61
and one could invoke either factorization criterion or common properties of exponen-
tial families to conclude. As this means that the conditional distribution of Z given
˜
A is free of Σ; denoting by h its density, we can write the joint density as

fΣ (z) = gΣ (A)hZ|A (z|A)


˜ ˜ ˜

which implies that we can write, choosing any z ∈ Rnp with ZZ ′ = A,


˜
fΣ (z) fΣ (z)
gΣ (A) = ˜ = ˜
hZ|A (z|A) fIp (z)/gIp (A)
˜ n˜ ˜
|Σ|− 2 exp − 12 tr(Σ−1 A) |A|(n−p−1)/2 exp − 21 tr(A)
   
= × np/2 p(p−1)/4 Qp
exp − 21 tr(A)
 
2 π i=1 Γ ((n − i + 1)/2)

|A|(n−p−1)/2 e− 2 tr(Σ A)
1 −1

= np/2 p(p−1)/4 n/2 Qp


2 π |Σ| i=1 Γ ((n − i + 1)/2)

Note. The standard derivations, although typically more complicated, yield cer-
tain byproducts. E.g. the so-called LU-factorization, also called the Bartlett de-
composition for the Wishart matrix immediately yields the distributions of ajj.j−1 for
2 ≤ j ≤ p and thereby that of the sample generalized variance, etc.

5.4 Further properties of, and sampling distributions based


on, the Wishart

(1) Distribution of A11.2q×q :

Recall that A = ZZ′ where Z = (Z1 |Z2 |. . . |Zn ) with Z1 , Z2 , . . . , Zn are iid
˜ ˜ ˜ ˜ ˜ ˜
Np (0p , Σ).
˜
Since Σ > 0 and n ≥ p ⇒ A > 0 with probability 1, so for 1 ≤ q ≤ p − 1,

62
partitioning A as
 
A11 A12
q×q q×(p−q)
A=
 

A21 A22
(p−q)×q (p−q)×(p−q)

we see with probability 1, A11 and A22 are positive definite, and so is with
probability 1 the matrix

A11.2 = A11 − A12 A−1


22 A21

e.g. because |A|= |A11.2 |·|A22 |. Now note that writing


 
(1)
Zj
 ˜q×1   
(i) (i)
Zj =  , 1 ≤ j ≤ n, and Z(i) = Z1 |Z2 |. . . |Zn(i) , i = 1, 2,
 

˜  Z (2)  ˜ ˜ ˜
j
˜
p−q×1

 
(1)
Zq×n
we partition Z as   with Aij = Z(i) Z(j)′ , 1 ≤ i, j ≤ 2.
(2)
Z(p−q)×n
Now we know
 
(1) (2) (2)
(Zj |Zj ) ∼ Nq BZj , Σ11.2 ∀ j = 1, 2, . . . , n
˜ ˜ ˜
(1) (2)
with B := Σ12 Σ−1 (2)
22 . Write Yj = Zj − BZj , 1 ≤ j ≤ n. Then given Z , the
˜ ˜ ˜
Yj are iid Nq (0, Σ11.2 ) for j = 1, 2, . . . , n. Write
˜ ˜
Yq×n = (Y1 |Y2 |. . . |Yn ) = Z(1) − BZ(2)
˜ ˜ ˜
so that conditionally given Z(2) , YY′ ∼ Wq (n, Σ11.2 ).

Now note that the partition of A means

′ ′ ′ ′ ′
A11.2 = Z(1) Z(1) − Z(1) Z(2) (Z(2) Z(2) )−1 Z(2) Z(1) = Z(1) TZ(1)

63
′ ′
where temporarily we write T := In − Z(2) (Z(2) Z(2) )−1 Z(2) . Observe that
 


Z(2) T = 0(p−q)×n or equivalently, TZ(2) = 0n×(p−q) .

We now claim that YTY′ = A11.2 also, and prove this by noting that

 h ′ ′
i
YTY′ = Z(1) − BZ(2) T Z(1) − Z(2) B′


′ ′ ′ ′
= Z(1) TZ(1) − BZ(2) TZ(1) − Z(1) TZ(2) B′ + BZ(2) TZ(2) B′

= A11.2

since the last three terms are all null. It remains to observe that T is a symmet-
ric idempotent, hence projection, matrix. In fact it represents the projection
onto C(Z(2) ′ ); hence its rank equals (n − r(Z(2) )) = (n − (p − q)) with proba-
bility 1. Since conditionally given Z(2) , we know YY′ ∼ Wq (n, Σ11.2 ); therefore
A11.2 = YTY′ ∼ Wq (n − (p − q), Σ11.2 ) conditionally, by Property 8. But the
conditional distribution does not depend on Z(2) , so it must be the unconditional
distribution as well, independently of Z(2) .

(2) Specializing the above to the case q = 1 has several important ramifications. In
this case, let us temporarily denote the 1×(p−1) vector A12 = (a12 , . . . , a1p ) by
a1 . Then a11.2 = a11 − a1 A−1 ′
22 a1 has a W1 (n − (p − 1), σ11.2 ) distribution where
˜ ˜ ˜
σ11.2 = σ11 − σ1 Σ−1 ′
22 σ1 with σ1 = Σ12 = (σ12 , . . . , σ1p ).
˜ ˜ ˜ 1×(p−1)
This also means that σa11.2
11.2
∼ χ2n−p+1 independently of A22 .

(3) Generalized variance: When A is the SSSP matrix of a random sample of size
N = n + 1 from a normal population with dispersion matrix Σ, the determinant
|A|
|A
n
|= np
is called the generalized variance of the sample. It estimates the
population generalized variance |Σ|. We assume Σ > 0 and prove

64
Proposition 5.1 ||A|Σ| has the distribution of the product of p independent χ
2

variables with respective dfs n, n − 1, . . . , n − p + 1.

Proof. Use induction on p: writing ||A| a11.2 |A22 |


Σ| = σ11.2 · |Σ22 | , the first factor has
a χ2n−p+1 distribution independently of the second. Now apply the induction
hypothesis on the second factor, noting A22 ∼ Wp−1 (n, Σ22 ).
This leads to
|A|
Corollary 5.1 n(n−1)···(n−p+1)
is an unbiased estimator of |Σ|.

In fact it is the UMVUE, as established in the sequel.

(4) We now see another interpretation of a11.2


σ11.2
: note that if we write Σ−1 =

σ (ij) then
1≤i,j≤p

|Σ22 | |Σ22 | −1
σ (11) = = = σ11.2
|Σ| σ11.2 · |Σ22 |
and likewise, a(11) = a−1 where A−1 = a(ij)

11.2 . Therefore we conclude
1≤i,j≤p
σ (11)
a(11)
∼ χ2n−p+1 , independently of A22 . More generally, clearly for any j =
σ (jj)
2, . . . , p also, a(jj)
∼ χ2n−p+1 . Can we generalize this further?

We can. Observe that a(jj) = e′j A−1 ej and σ (jj) = e′j Σ−1 ej for 1 ≤ j ≤ p. So
−1
e′j Σ ej
˜ ˜ ˜ ˜
for each j, ˜e′ A−1 e˜j ∼ χ2n−p+1 . We claim that ej in the above can be replaced
˜
j
˜ ˜
by any ℓ ̸= 0; i.e.
˜ ˜
p ℓ′ Σ−1 ℓ 2
∀ ℓ ∈ R \ {0}, ˜′ −1˜ ∼ χn−p+1 .
˜ ˜ ℓA ℓ
˜ ˜
To prove this, choose a nonsingular matrix Q whose first column is ℓ; so that
˜
ℓ = Qe1 . Now define Λ := Q−1 ΣQ′ −1 and B := Q−1 AQ′ −1 ∼ Wp (n, Λ) by
˜ ˜
property 5; so
ℓ′ Σ−1 ℓ e′1 Q′ Σ−1 Qe1 e′1 Λ−1 e1 2
˜′ −1˜ = ˜′ ′ −1 ˜ = ˜′ −1˜ ∼ χn−p+1 .
ℓA ℓ e1 Q A Qe1 e1 B e1
˜ ˜ ˜ ˜ ˜ ˜
65
(5) This yields at one go the following consequence when the first component is
independent of the rest: then because σ1 = 0′p−1 , it follows that the vector
˜ ˜
β = β1.23...p = Σ−1 σ
22 1

of population regression coefficients becomes null, and
˜ ˜ ˜ −1
σ Σ σ′
the (squared) population multiple correlation coefficient ρ21.23...p = ˜1 σ1122 ˜1
becomes 0.

In the context of random sampling from a normal population with dispersion


matrix Σ, denoting as before the corrected sample SSSP matrix by A, let us
definethe sample dispersion matrix as S = ((sij )) = A
n
and partition it as
s11 s1
S =  ˜ . Now, in analogy with β1.23...p and ρ21.23...p , we define the
′ ˜
s1 S22
˜
sample versions:

s S−1 s′ s11 − s11.2 a11 − a11.2


b1.23...p = S−1
22 s1

and R2 := R1.23...p
2
= ˜ 1 22 ˜ 1 = = .
˜ ˜ s11 s11 a11
a11 a11.2 a11.2
Now, we know σ11
∼ χ2n and σ11.2
∼ χ2n−p+1 . But here σ11.2 = σ11 ; so σ11

a11 −a11.2
χ2n−p+1 . Further, ∵ a11.2 ≤ a11 , so σ11
≥ 0 and hence must have a χ2
a11.2
distribution independently of σ11
and hence of a11.2 .

In particular, its df is n − (n − p + 1) = p − 1 and thus, in this case


a11 −a11.2
R2 n − p + 1 σ11
/(p − 1)
F := = a11.2 ∼ Fp−1,n−p+1
1 − R2 p − 1 σ11
/(n − p + 1)

Since the original sample size is N = n+1, we often write F ∼ Fp−1,N −p instead.

(6) null distribution of sample correlation coefficient: Firstly, when A is the SSSP
matrix of a random sample from Np (µ, Σ) distribution of size N = n + 1,
˜
the sample correlation coefficients are expressible simply from A as follows: if
A sij aij
S= n
= ((sij )) then rij = √
sii sjj
= √
aii ajj
, 1 ≤ i ̸= j ̸= p; so their distribution
is determined by that of A ∼ Wp (n, Σ).

66
Clearly, if we fix i and j then the distribution of rij depends only on that of
the joint 
distribution
   of the ithand th
 j components of the distribution; which
µi σii σij
is an N2  , ; hence it is enough to we specialize to the
µj σij σjj
case p = 2. Suppose we do that and write ρ12 =: ρ, r12 =: r, and note that
2
σ12
ρ = 0 ⇔ σ12 = 0 ⇔ σ11.2 = σ11 − = σ11 (1 − ρ2 ) = σ11 .
σ22
 
U′
Now if we write Z = ((Zij ))1=1,2,1≤j≤n =  ˜ , say, then
V′
˜
A = ZZ′ =⇒ a11 = U ′ U , a12 = a21 = U ′ V , a22 = V ′ V
˜ ˜ ˜ ˜ ˜ ˜

and we know that conditionally given V , Z1j ∼ N (β12 Z2j , σ11.2 ), i.e. N (0, σ11 )
˜
if ρ = 0; and these are conditionally independent over j. Hence conditional
distribution of U given V is Nn (0n , σ11 In ).
˜ ˜ ˜
Now, let V denote a (random, but fixed given V ) orthogonal matrix whose first
˜
V
row is ∥˜V ∥ . Then the conditional distribution of VU =: W = (W1 , . . . , Wn )′ ,
˜ ˜ ˜
say, given V , is also the same Nn (0n , σ11 In ).
˜ ˜
Pn ′ ′
Now observe that a12 = j=1 Z1j Z2j = U V = W1 ∥V ∥; and a11.2 = U U −
˜ ˜ ˜ ˜ ˜
(U ′ V )2 ′ 2
˜V ˜V
′ = W W − W1 . But a 12 has the conditional distribution
˜˜ ˜ ˜
X n
2
N (0, Z2j σ11 ) or N (0, σ11 a22 )
j=1

so that conditionally, √a12 ∼ N (0, σ11 ) and √ a12 ∼ N (0, 1).


a22 σ11 a22

Further, A has the distribution given by


       
a11 a12 σ11 σ12 σ11 0
  ∼ W2 n,   or W2 n,  
a12 a22 σ12 σ22 0 σ22

67
in this case. In particular ajj /σjj ∼ χ2n while a11.2 /σ11 = a11.2 /σ11.2 ∼ χ2n−1
independently of V .
˜
We now show that √aa1222 and a11.2 are conditionally independent: we only need
to observe that
n−1
a12 X
√ = W1 and a11.2 = W ′ W − W12 = Wj2
a22 ˜ ˜ j=1

which are clearly independent N (0, σ11 ) and σ11 χ2n−1 variables conditionally,
given V . So,
˜
a12 a11 a22 − a212 a11 − a212 /a22 a11.2
r=√ =⇒ 1 − r2 = = =
a11 a22 a11 a22 a11 a11
√ √
√ r a12 / a11 a22 a12 / a22 σ11
=⇒ n − 1 √ =q =p
1 − r2 a11.2
/(n − 1) a11.2 /σ11 (n − 1)
a11

has conditionally a t distribution with n − 1 d.f., hence unconditionally too,


since the distribution obviously does not depend on V .
˜
(7) Sample partial correlation coefficients: Suppose 1 ≤ i ̸= j ≤ q < p. We
know that the population correlation coefficient between the ith and j th coordi-
σ
nates Xi and Xj of X = (X1 , . . . , Xp ) ∼ Np (µ, Σ) is ρij = √σiiijσjj ; while their
˜ ˜
population partial correlation coefficient given Xq+1 , . . . , Xp is ρij.(q+1),...,p =
σij.(q+1),...,p

σii.(q+1),...,p σjj.(q+1),...,p
, where ((σst.(q+1),...,p ))1≤s,t≤q is the q × q matrix Σ11.2 =
Σ11 − Σ12 Σ−1
22 Σ21 .

Since the sample correlation coefficient between Xi and Xj based on a random


aij
sample of size N equals rij = √
aii ajj
where A is the sample SSSP matrix, the
natural definition of the sample partial correlation coefficient between Xi and
Xj given Xq+1 , . . . , Xp would be
aij.(q+1),...,p
rij.(q+1),...,p = √
aii.(q+1),...,p ajj.(q+1),...,p

68
where ((ast.(q+1),...,p ))1≤s,t≤q = A11.2 .

Now the distribution of rij depends on A ∼ Wp (N − 1, Σ) in the same way as


that of rij.(q+1),...,p depends on A11.2 ∼ Wq (N − 1 − (p − q), Σ11.2 ), therefore it
follows that when ρij.(q+1),...,p = 0,
p rij.(q+1),...,p
N − 2 − (p − q) q ∼ tN −2−(p−q)
2
1 − rij.(q+1),...,p

ρij.(q+1),...,p −ρiq.(q+1),...,p ρjq.(q+1),...,p


Recall the recursion relation ρij.q,(q+1),...,p = q q be-
1−ρ2iq.(q+1),...,p 1−ρ2jq.(q+1),...,p
tween the population partial correlation coefficients of various orders: applying
exactly the same argument on the sample dispersion matrix, we obtain the
following recursive computational method for the partial correlations
rij.(q+1),...,p − riq.(q+1),...,p rjq.(q+1),...,p
rij.q,(q+1),...,p = q q
2 2
1 − riq.(q+1),...,p 1 − rjq.(q+1),...,p
that allows one to compute correlations between two components recursively,
removing one of the other variables at a time.

(8) Hotelling’s T 2 distribution: this is a generalization of the square of the t distri-


bution occurring frequently in inference regarding the mean of a normal popu-
lation, or equality of means of two such with unknown but equal variances. It
is defined as the distribution of
A
T 2 := nX ′ A−1 X = X ′ ( )−1 X
˜ ˜ ˜ n ˜
where X ∼ Np (0, Ip ) and A ∼ Wp (n, Ip ) independently of each other with n ≥ p.
˜ ˜
Then we write T 2 ∼ Tp,n2
.

Before proceeding further, we show that if X ∼ Np (µ, Σ) and A ∼ Wp (n, Σ)


˜ ˜
independently of each other with Σ > 0 and n ≥ p, then the variable

n(X − µ)′ A−1 (X − µ)


˜ ˜ ˜ ˜
69
2
also has the Tp,n distribution. Now if we get a nonsingular C with Σ = CC′
and define Y = C−1 (X − µ) and B = C−1 A(C′ )−1 then Y ∼ Np (0p , Ip ) and
˜ ˜ ˜ ˜ ˜
B ∼ Wp (n, Ip ) independently of each other; and (X − µ) = CY while A =
˜ ˜ ˜
CBC′ ; so

−1
n(X − µ)′ A−1 (X − µ) = n(CY )′ C′ B−1 C−1 (CY ) = nY ′ B−1 Y ∼ Tp,n
2
.
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
Thus the distribution is indeed free of parameters.

(9) The T 2 distribution is connected with the F distribution in the following way:
applying property 4 above to the case Σ = Ip ; recall that when A ∼ Wp (n, Ip ),
then for any ℓ ∈ Rp \ {0p },
˜ ˜
ℓ′ ℓ 2
˜ ˜ ∼ χn−p+1 .
ℓ′ A−1 ℓ
˜ ˜
Hence with T 2 = nZ ′ A−1 Z where Z ∼ Np (0, I) independently of A; given
˜ ˜ ˜ ′ ˜
zz
Z = z for any z ∈ Rp \ {0p }, we see z′ A ˜˜ z
−1 ∼ χ2n−p+1 and this distribution is
˜ ˜ ˜ ′ ˜ ˜ ˜
Z Z 2 ′
free of z. Thus Z ′ A ˜ Z ∼ χn−p+1 independently of Z and hence of Z Z; while
˜ −1
˜ ˜2 ˜ ˜ ˜ ˜

we know Z Z ∼ χp . Thus
˜ ˜
n − p + 1 T2 Z ′ Z/p
= Z′Z ˜ ˜ ∼ Fp,n−p+1 .
p n ′ ˜ ˜
−1 /(n − p + 1)
Z A Z
˜ ˜
(10) There is also a non-central version of this distribution: it is the distribution
of nX ′ A−1 X if X ∼ Np (µ, Σ) and A ∼ Wp (n, Σ) are still independent with
˜ ˜ ˜ ˜
Σ > 0 and n ≥ p; but µ is not necessarily 0. Note that with Y := C−1 X and
˜ ˜ ˜ ˜

B := C−1 AC−1 where CC′ = Σ as before, now Y ′ Y ∼ χ2p,λ with λ := µ′ Σ−1 µ
˜ ˜ ˜ ˜
Y ′Y 2
but is still independent of Y ′˜B−1
˜Y ∼ χ n−p+1 ; thus now
˜ ˜
n − p + 1 T2
∼ Fp,n−p+1,λ
p n

70
with λ as a non-centrality parameter. Again, the central version corresponds to
λ = 0.

Clearly the distribution has the properties of stochastic ordering and therefore
monotonicity of tail probabilities that the F distributions enjoy.

6 Inference on parameters

The sampling distributions we have discussed allow us to make various inferences on


the mean vector µ and dispersion matrix Σ based on a random sample X 1 , . . . , X N
˜ ˜ ˜
from a p-variate normal population. As before, we shall assume Σ > 0 and N > p;
¯ and A;
write n = N − 1; and denote the sample mean vector and SSSP matrix by X
˜
as earlier.

6.1 Minimum variance and ML estimation

Consider the joint density of the components of X (or of X): for x1 , x2 , . . . , xN ∈ Rp ,


˜ ˜ ˜ ˜
1 ′ Σ−1 (x −µ)
" #
Y e 2 ( ˜j
N − (x − µ)
˜j ˜
) 1X
N
˜ −N p/2 −N ′ −1

= (2π) |Σ| 2 exp − (x − µ) Σ (xj − µ)
j=1
(2π) p/2 |Σ|1/2 2 j=1 ˜ j ˜ ˜ ˜

(6.2)
 
−N 1
= K|Σ| exp − E , say;
2
2

71
1
PN
for constant K where E is (−2)× the exponent in the density. Let x¯ = N j=1 xj .
˜ ˜
We simplify E as follows:

XN
E= (xj − x¯ + x¯ − µ)′ Σ−1 (xj − x¯ + x¯ − µ)
j=1
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
N
X N
X
−1
= ′
(xj − x)
¯ Σ (xj − x)
¯ +2 x − µ)′ Σ−1 (xj − x)
(¯ x − µ)′ Σ−1 (¯
¯ + N (¯ x − µ)
j=1
˜ ˜ ˜ ˜ j=1
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
XN
= ¯ ′ Σ−1 (xj − x)
(xj − x) x − µ)′ Σ−1 (¯
¯ + N (¯ x − µ)
j=1
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
PN
since − x)
j=1 (xj¯ = 0. Therefore given observations X 1 , . . . , X n , the log-likelihood
˜ ˜ ˜ ˜ ˜
function equals

l(µ, Σ) = ln L(µ, Σ)
˜ " N #
N 1 X ¯ ′ Σ−1 (X j − X)
¯ + N (X
¯ − µ)′ Σ−1 (X
¯ − µ)
= ln K − ln|Σ|− (X − X)
2 2 j=1 ˜ j ˜ ˜ ˜ ˜ ˜ ˜ ˜

defined on the parametric space Rp × {space of p × p positive definite matrices}.

¯ − µ)′ Σ−1 (X
Since for every p.d. Σ, (X ¯ − µ) ≥ 0 with equality iff µ = X,
¯ therefore
˜ ˜ ˜ ˜ ˜ ˜
l(µ, Σ) is maximum when and only when µ = X. ¯ So µ ˆ = X ¯ is the MLE of µ for
˜ ˜ ˜ ˜ ˜
any Σ. Thus to obtain the MLE of Σ, we need to maximize l(ˆ ¯ Σ) or
µ, Σ) = l(X,
˜ ˜

72
equivalently,
N
!
N 1
¯ Σ) − ln K = − ln|Σ|− tr
X
¯ ′ Σ−1 (X j − X)
¯
l(X, (X j − X)
˜ 2 2 j=1
˜ ˜ ˜ ˜
N
N 1X ¯ ′ Σ−1 (X j − X)
¯

= − ln|Σ|− tr (X j − X)
2 2 j=1 ˜ ˜ ˜ ˜
N
N 1X ¯ X j − X)
¯ ′
tr Σ−1 (X j − X)(

= − ln|Σ|−
2 2 j=1 ˜ ˜ ˜ ˜
N
!
N 1 X
¯ X j − X)
¯ ′
= − ln|Σ|− tr Σ−1 (X j − X)(
2 2 j=1
˜ ˜ ˜ ˜
N 1
= − ln|Σ|− tr Σ−1 A

2 2

where A is the SSSP matrix N ¯ ¯ ′


P
j=1 (X j − X)(X j − X) . Suppose we express this as a
˜ ˜ ˜ ˜
function of Σ−1 = ((σ (ij) )). Then
p p
N 1  N 1 XX
µ, Σ) − ln K =
l(ˆ ln|Σ−1 |− tr Σ−1 A = ln|Σ−1 |− aij σ (ij)
˜ 2 2 2 2 i=1 j=1

and we set the partial derivatives to 0. Now for any square matrix M = ((mij )), since
the coefficient of mij in |M | is the cofactor (−1)i+j |M (ij) | where M (ij) is obtained
from M by removing its ith row and j th column. Also, if M is nonsingular then since
(−1)i+j |M (ij) | ∂|M |
|M |
is the (j, i)th entry of M −1 , so ∂mij
= (−1)i+j |M (ij) |= |M |·(M −1 )ji and
we conclude that

∂ ln|Σ−1 | 1 ∂|Σ−1 |
(ij)
= −1 (ij)
= [(Σ−1 )−1 ]ji = σij
∂σ |Σ | ∂σ

by symmetry, since (Σ−1 )−1 = Σ. On the other hand,

∂ pi=1 pj=1 aij σ (ij)


P P
= aij , 1 ≤ i, j ≤ p
∂σ (ij)

73
so that the likelihood equations reduce to
1 aij A
(N σij − aij ) = 0 ⇔ σ̂ij = , 1 ≤ i, j ≤ p ⇐⇒ Σ̂ =
2 N N
is the unique solution of the likelihood equation. That it is indeed MLE of Σ can be
−1
justified as follows: call the eigenvalues of Σ A as λ , . . . , λ . Then ∀ Σ,
N 1 p
  −1
¯ A ¯ Σ) = N ln A N N 1
l X, − l(X, − p − ln|Σ|−1 + tr(Σ−1 A)
˜ N ˜ 2 N 2 2 2
  −1  ( p p
)
Σ−1 A

N Σ A N X X
= tr − ln −p = λj − ln λj − p
2 N N 2 j=1 j=1
p
NX
= (λj − ln λj − 1) ≥ 0 ∵ ∀ x > 0, x ≥ 1 + ln x, with equality iff x = 1;
2 j=1

A
thus with equality iff λj = 1 ∀ n i.e. Σ = Σ̂ = N
.

Again, from the form of the log-likelihood,


N 1 ¯ − µ)′ Σ−1 (X
¯ − µ)
ln|Σ|− tr Σ−1 A + N (X
 
l(µ, Σ) = ln K −
2 2 ˜ ˜ ˜ ˜
it is clear that it belongs to the p + p(p+1)
-parameter ¯ A)
exponential family. So (X,
2
˜
is jointly complete and sufficient for µ and Σ. Therefore unbiased estimators based
˜
on these statistics would be UMVUE for µ and Σ respectively. This means
˜
¯ A
µ
ˆ = X, Σ̂ =
˜ ˜ N −1
being unbiased, are respective UMVUEs.

6.2 Testing for the mean vector

Suppose we wish to test for H0 : µ = µ0 , a given element of Rp , at some given level


˜ ˜
α ∈ (0, 1).

74
Case 1 (Σ known): here X ¯ is sufficient, so use the fact that (X
¯ − µ0 ) ∼ Np (µ −
˜ ˜ ˜ ˜
µ0 , Σ
N
) so that Y = N (X¯ − µ0 )′ Σ−1 (X
¯ − µ0 ) ∼ χ2 with λ = (µ − µ0 )′ Σ−1 (µ − µ0 )
p,λ
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
and λ = 0 iff H0 is true.

Thus the test with C.R. {Y > χ2p (α)} has size α and is unbiased.

Case 2 (Σ unknown): Now we have N (X ¯ − µ) ∼ Np (0p , Ip ); so define
˜ ˜ ˜
√ √
¯ − µ0 )]′ ( A )−1 [ N (X
T 2 := [ N (X ¯ − µ0 )] = N n(X¯ − µ0 )′ A−1 (X ¯ − µ0 ) ∼ T 2
n,p
˜ ˜ n ˜ ˜ ˜ ˜ ˜ ˜
under H0 ; while its non-null distribution is a non-central T 2 . Thus the test with C.R.
 
N − p 2 N (N − p) ¯ ′ −1 ¯
T = (X − µ0 ) A (X − µ0 ) ≥ Fp,N −p (α)
np p ˜ ˜ ˜ ˜
has size α, and is unbiased.
Z′Z
Large-sample test: Recall that we can write T 2 = n Z ′ Z/˜Z ′˜A−1 Z and the denomi-
˜˜ ˜ ˜
nator has a χ2n−p+1 distribution independently of the numerator.
Pn
Now a χ2n variable Yn can be written as 2
j=1 ξj where (ξn )n≥1 is a sequence of
iid N (0, 1) variables and E(ξ12 ) = 1, it follows by the WLLN, e.g., that as n → ∞ ,
Yn P
n
→ 1. Therefore

1 Z ′Z n−p+1 1 Z ′Z P
′ ˜ =
˜ −1 ′ ˜ → 1.
˜ −1
nZ A Z n n−p+1Z A Z
˜ ˜ ˜ ˜
A well-known theorem known as Slutsky’s Theorem now guarantees that

Z ′Z d
T2 =
˜ ˜
1 Z′Z
→ Z ′ Z ∼ χ2p
˜ −1
n Z′A ˜Z ˜ ˜
˜ ˜
as n → ∞ . Thus for large n, {T 2 > χ2p (α)} defines the C.R. for a test with
approximate size α for α ∈ (0, 1).

75
Two-sample problem: We have random samples X 1 , . . . , X N1 and Y 1 , . . . , Y N2
˜ ˜ ˜ ˜
from two populations, respectively Np (µ1 , Σ) and Np (µ2 , Σ) with Σ unknown. Note
˜ ˜
that we need to assume the dispersion matrices are equal.

Now since X ¯ ∼ Np (µ1 , Σ ) and Y¯ ∼ Np (µ2 , Σ ) independently, it follows that


N1
˜ ˜1  ˜  ˜ N2 
¯ ¯
X − Y ∼ Np µ1 − µ2 , ( N1 + N2 )Σ ; or Np µ1 − µ2 , ( NN11+N
1 2
)Σ .
N2
˜ ˜ ˜ ˜ ˜ ˜
Further, A1 := N
P 1 ¯ ¯ ′ PN2
j=1 (X j − X)(X j − X) ∼ Wp (N1 − 1, Σ) and A2 := j=1 (Y j −
˜ ˜ ˜ ˜ ˜
Y¯ )(Y j − Y¯ )′ ∼ Wp (N2 − 1, Σ) independently. Thus A := A1 + A2 ∼ Wp (n, Σ) where
˜ ˜ ˜
n := N1 + N2 − 2. Hence

N1 N2 n ¯ ¯ ′ −1 ¯ ¯
T2 = 2
(X − Y ) A (X − Y ) ∼ Tp,n,λ
N1 + N2 ˜ ˜ ˜ ˜

with λ = 0 iff H0 is true.

6.3 Confidence set estimation for the mean vector

The various sampling distributions we have derived, specifically the T 2 distribution,


are used to obtain confidence regions with given confidence coefficients (1 − α) for the
mean vector.

¯−
Confidence ellipsoid: Of course when Σ is known the χ2 distribution of N (X
˜
¯ − µ) suffices to yield confidence region for µ.
µ)′ Σ−1 (X
˜ ˜ ˜
When Σ is unknown we use the T 2 distribution. Observe that for any given
α ∈ (0, 1),
 
N (N − p) ¯ ′ −1 ¯
Probµ,Σ (X − µ) A (X − µ) ≤ Fp,N −p (α) = 1 − α.
˜ p ˜ ˜ ˜ ˜

76
So, given the data, the set
 
p ¯ A (µ − X)
′ −1 ¯ ≤ p
E := µ ∈ R : (µ − X) Fp,N −p (α)
˜ ˜ ˜ ˜ ˜ N (N − p)
µ ∈ Rp : N n(µ − X)¯ ′ A−1 (µ − X)
¯ ≤ Tp,N
2

= −p (α) ,
˜ ˜ ˜ ˜ ˜
np 2
if we write N −p
Fp,N −p (α) as Tp,N −p (α), contains the true value of µ with probability
˜
(1 − α).

What sort of a set is E? When p = 2, it represents the inside of the ellipse


2
2 (11) 2 (12) (22) 2
T2,N −2 (α)
{(µ1 , µ2 ) ∈ R : a (µ1 −X̄1 ) +2a (µ1 −X̄1 )(µ2 −X̄2 )+a (µ2 −X̄2 ) = }.
nN

with centre at (X̄1 , X̄2 ).

µ2

(X̄1 , X̄2 )

µ1

Analogously, in general p dimensions, the set E is called the confidence ellipsoid


with confidence coefficient (1 − α).

77
Simultaneous CIs for the components of the mean vector or their linear combi-
¯ − µ) ∼ N (0, ℓ′ Σℓ ) and ℓ′ Aℓ ∼ χ2
nations: Since for a given ℓ ∈ Rp \ {0}, ℓ′ (X ˜N ˜ ℓ˜′ Σ˜ℓ n
˜ ˜ ˜ ˜ ˜ ˜ ˜
independently with n = N − 1, a natural idea for working out a 100(1 − α)% CI for
¯ the pivot
ℓ′ µ would be to use, temporarily writing σ 2 (ℓ) for ℓ′ Σℓ = V (ℓ′ X),
˜˜ ˜ ˜ ˜ ˜˜
√ ′ ¯
√ ′ ¯
N ℓ (X−µ)
nN ℓ (X − µ) ˜σ(ℓ)˜ ˜
√˜ ′ ˜ ˜ = q ℓ˜′ Aℓ ∼ tn .
ℓ Aℓ
˜ ˜ ˜ ℓ)
nσ( ˜
˜
Thus the CI r r !
′ ¯ ℓ′ Aℓ α ′¯ ℓ′ Aℓ α
ℓ X − ˜ ˜ tn ( ), ℓ X + ˜ ˜ tn ( )
˜˜ nN 2 ˜˜ nN 2
is a choice. But if we are given several vectors ℓ1 , . . . , ℓk and wish to find intervals I1 ,
˜ ˜
I2 , . . . , Ik such that Probµ,Σ (ℓ′j µ ∈ Ij , 1 ≤ j ≤ k) ≥ 1 − α ∀ µ, Σ, then the choices
 q ′ ˜ q˜ ˜′  ˜
′ ¯ ℓj Aℓj α ′ ¯ ℓj Aℓj α
ℓj X − ˜nN˜ tn ( 2 ), ℓj X + ˜nN˜ tn ( 2 ) as above for Ij , 1 ≤ j ≤ n, may not
˜ ˜ ˜ ˜
work; for

\k
Probµ,Σ (ℓ′j µ ∈ Ij ) = 1 − α, 1 ≤ j ≤ k ̸⇒ Probµ,Σ ( {ℓ′j µ ∈ Ij }) ≥ 1 − α.
˜ ˜˜ ˜ j=1
˜˜

A procedure due to Bonferroni is to instead choose


r r !
′ ′
ℓ A ℓ α ℓ A ℓ α
¯ − ˜j ˜j tn ( ), ℓ′ X
Ij = ℓ′j X ¯ + ˜j ˜j tn ( )
j
˜ ˜ nN 2k ˜ ˜ nN 2k

so that for each j, Probµ,Σ (ℓ′j µ ∈


/ Ij ) = αk and this implies the requirement.
˜ ˜˜
However, in this case the intervals are unnecessarily large. One workaround uses
the following idea: recall that
q
∀ℓ ∈ Rp \ {0}, ¯ − µ)|≤
|ℓ′ (X ¯ − µ)′ A−1 (X
ℓ′ Aℓ · (X ¯ − µ)
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜

78
with equality when ℓ is proportional to A−1 (X ¯ − µ), whence
˜ ˜ ˜
( )
′ ¯
|ℓ (X − µ)|
Probµ,Σ ˜ √˜ ′ ˜ ≤ K ∀ ℓ
˜ ℓ Aℓ ˜
(˜ ˜ ) ( )
¯ − µ)|
|ℓ′ (X ¯ − µ))2
(ℓ′ (X
= Probµ,Σ max ˜ √˜ ′ ˜ ≤ K = Probµ,Σ max ˜ ˜ ′ ˜ ≤ K2
˜ ℓ̸ = 0 ℓ Aℓ ˜ ℓ̸ =0 ℓ Aℓ
 ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
= Probµ,Σ (X ¯ − µ)′ A−1 (X ¯ − µ) ≤ K 2
˜ ˜ ˜ ˜ ˜ 
N (N − p) ¯ ′ −1 ¯ N (N − p) 2
= Probµ,Σ (X − µ) A (X − µ) ≤ K
˜ p ˜ ˜ ˜ ˜ p
 
N (N − p) 2
= Probµ,Σ Fp,N −p ≤ K =1−α
˜ p
q
p
when we choose K = Kα := F
N (N −p) p,N −p
(α); i.e.
n  o
¯ ∈ ℓ′ µ ± Kα · ℓ′ Aℓ
p
Probµ,Σ ∀ ℓ, ℓ′ X =1−α
˜ ˜ ˜˜ ˜˜ ˜ ˜
√ √
In particular, the choices (X̄j − Kα ajj , X̄j + Kα ajj ) yield confidence intervals for
µj , 1 ≤ j ≤ p; with simultaneous confidence coefficient at least 100(1 − α).

This method also suggests another way of testing for H0 : µ = µ0 vs. H1 : µ ̸= µ0


˜ ˜ ˜ ˜
given µ0 = (µ01 , . . . , µ0p )′ as follows: we reject H0 at level α ∈ (0, 1) if for any j with
˜ √ √
1 ≤ j ≤ p, µ0j ∈
/ (X̄j − Kα ajj , X̄j + Kα ajj ). Then the test is level α.

This generalizes easily to simultaneous tests for any number k of hypotheses H0j :
ℓ′j µ = cj , 1 ≤ j ≤ k vs. H1 : at least one of H01 , . . . , H0k are false: we can reject the
˜˜
¯ − Kα ℓ′ Aℓj , ℓ′ X
/ (ℓ′j X ¯+
p
null hypothesis at level α if for any j ∈ {1, 2, . . . , k}, cj ∈ j j
˜ ˜ ˜ ˜ ˜ ˜
Kα ℓ′j Aℓj ).
p
˜ ˜

79
7 Canonical correlations

Given multivariate data, need often arises to analyze it only after replacing the set of
all variables by a smaller set, which could be either a subset or some set of transformed
variables. Among the situations where such a necessity arises naturally, is one where
we wish to capture the interrelationship between two (disjoint) sets of the original
variables by replacing them with sets of transformed variables.

We discuss the population version first. Let the given sets have respectively q and
r = p − q variables, and wlg assume that these are respectively the first q and last r
coordinates of X, and also that q ≤ r.
˜  
X 1 q×1
As before, write X =  ˜ . Since correlations do not depend on means,
˜ X 2 r×1
˜
we assume E(X) = 0. Partition the dispersion matrix Σ of X as earlier, as Σ =
 ˜  ˜ ˜
Σ11 q×q Σ12 q×r
 . We assume Σ > 0 ⇒ Σ11 , Σ22 > 0. Write Σ11 = B′1 B1 and
Σ21 r×q Σ22 r×r
Σ22 = B′2 B2 where B1 q×q and B2 r×r are nonsingular.

Then for any α ∈ Rq \ {0q } and γ ∈ Rr \ {0r }, the correlation between α′ X 1 and
˜
α′ Σ12 γ
˜ ˜ ˜ ˜ ˜
γ ′ X 2 is q ′ ˜ ˜ and its supremum value is
α Σ11 α γ ′ Σ22 γ
˜ ˜ ˜ ˜˜ ˜
α′ Σ12 γ (B1 α)′ (B′1 )−1 Σ12 B−1 −1
2 (B2 γ)
sup p ′ ˜ ˜ = sup q ˜ ˜
α̸=0q ,γ̸=0r α Σ11 α γ ′ Σ22 γ α,γ (B 1 α)′ (B α) (B γ)′ (B γ)
1 2 2
˜ ˜ ˜ ˜ ˜ ˜˜ ˜ ˜ ˜
˜ ˜
−1
˜ ˜
′ ′ −1 −1
(B1 α) (B1 ) Σ12 B1 ((B1 ) γ
= sup ˜ ˜ = sup α′ Σ12 γ
α,γ | B1 α̸=0q ,B2 γ̸=0r ∥B1 α∥ ∥B2 γ∥ α,γ | ∥B1 α∥=∥B2 γ∥=1 ˜ ˜
˜˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜˜ ˜ ˜

Thus, to maximize α′ Σ12 γ subject to α′ Σ11 α = ∥B1 α∥2 = 1 and γ ′ Σ22 γ = ∥B2 γ∥2 =
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
1, we adopt the method of Lagrangian multipliers. Define the function ψ(α, γ) =
˜ ˜
′ λ ′ µ ′
α Σ12 γ − 2 (α Σ11 α − 1) − 2 (γ Σ22 γ − 1). Setting respectively to 0q and 0r the vectors
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
80
of partial derivatives

∂ ∂
0q = ψ = Σ12 γ − λΣ11 α and 0r = ψ = Σ′12 α − µΣ22 γ (7.3)
˜ ∂α ˜ ˜ ˜ ∂ γ ˜ ˜
˜ ˜
we get, multiplying on the left respectively by α′ and γ ′ (and taking transpose),
˜ ˜
α′ Σ12 γ − λα′ Σ11 α = 0 = α′ Σ12 γ − µγ ′ Σ22 γ =⇒ λ = µ = α′ Σ12 γ
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
from the restrictions. Putting µ = λ in (7.3), we see that
  
−λΣ11 α + Σ12 γ = 0q −λΣ11 Σ12 α
˜ ˜ ˜ =⇒    ˜  = 0p . (7.4)
Σ21 α − λΣ22 γ = 0r Σ21 −λΣ22 γ ˜
˜ ˜ ˜ ˜
 
−λΣ11 Σ12
Clearly it is necessary that the matrix Lp×p = ((lij )) :=   be
Σ21 −λΣ22
singular. We first see that its determinant
X
|L|= sgn(π)l1π(1) l2π(2) · · · lpπ(p)
π∈Sn

where Sn is the set of permutations π of {1, 2, . . . , n}, is a polynomial in λ of degree


exactly p with all real roots, and that each root corresponds to a choice for α and γ
˜ ˜
meeting the requirement.

Split Sn into two sets, one containing those π that map the subset {1, 2, . . . , q}
onto itself (and hence its complement also), denoted say by J, and the other its
complement, say K.

Now if π ∈ J then for each j = 1(1)p, the (j, π(j))th entry of L contains −λ; so the term sgn(π) l1π(1) l2π(2) · · · lpπ(p) is a multiple of λp, while for π ∈ K that term is a multiple of a smaller power of λ. Further, each π ∈ J can be thought of as consisting of a permutation π1 of {1, 2, . . . , q}, i.e. an element of Sq, and one, say π2, of {q + 1, . . . , p}; the function π3 defined by π3(j) = π2(q + j) is then an element of Sr, and sgn(π) = sgn(π1) · sgn(π3). We therefore see that the term containing λp is
∑_{π∈J} sgn(π) ( l1π(1) · · · lqπ(q) ) · ( lq+1,π(q+1) · · · lpπ(p) )

= [ ∑_{π1∈Sq} sgn(π1) (−λ)^q (Σ11)1,π1(1) (Σ11)2,π1(2) · · · (Σ11)q,π1(q) ] · [ ∑_{π3∈Sr} sgn(π3) (−λ)^r (Σ22)1,π3(1) (Σ22)2,π3(2) · · · (Σ22)r,π3(r) ]

= (−λ)^p |Σ11| · |Σ22| ≠ 0

which shows the polynomial has degree exactly p. That all its roots are real is a point
that we skip.

Note from (7.4) that λγ = Σ22⁻¹Σ21α, so λ²Σ11α = λΣ12γ = Σ12(λγ) = Σ12Σ22⁻¹Σ21α; in particular,

(Σ11⁻¹Σ12Σ22⁻¹Σ21 − λ²Iq) α = 0q,

whence λ² is an eigenvalue of Σ11⁻¹Σ12Σ22⁻¹Σ21 with eigenvector α. This means in particular that only q among the possible p values of λ² can be non-zero; hence among the roots for λ, at least p − 2q are 0. Further, the non-zero roots of |L| = 0 occur in pairs, as ±|λ| where λ² is one of the eigenvalues as above.
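As a concrete illustration (not part of the original notes), the following Python sketch computes the population canonical correlations by taking the eigenvalues λj² of Σ11⁻¹Σ12Σ22⁻¹Σ21 and recovering γj = λj⁻¹Σ22⁻¹Σ21αj; the function name and numerical details are my own choices.

import numpy as np

def population_canonical_correlations(Sigma, q):
    # Sigma: p x p positive definite dispersion matrix; the first q coordinates form X_1.
    # Returns lambdas (descending), alphas and gammas with alpha_j' Sigma11 alpha_j = 1
    # and gamma_j' Sigma22 gamma_j = 1 whenever lambda_j > 0.
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    M = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S21))   # Sigma11^{-1} Sigma12 Sigma22^{-1} Sigma21
    eigval, eigvec = np.linalg.eig(M)                           # eigenvalues are the lambda_j^2
    order = np.argsort(eigval.real)[::-1]
    lam2, A = eigval.real[order], eigvec.real[:, order]
    lambdas = np.sqrt(np.clip(lam2, 0.0, None))
    alphas, gammas = [], []
    for j in range(q):
        a = A[:, j]
        a = a / np.sqrt(a @ S11 @ a)                # alpha_j' Sigma11 alpha_j = 1
        g = np.linalg.solve(S22, S21 @ a)           # Sigma22^{-1} Sigma21 alpha_j
        if lambdas[j] > 1e-12:
            g = g / lambdas[j]                      # gamma_j = lambda_j^{-1} Sigma22^{-1} Sigma21 alpha_j
        alphas.append(a)
        gammas.append(g)
    return lambdas, np.column_stack(alphas), np.column_stack(gammas)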

So denote the roots of |L| = 0 by λ1 ≥ λ2 ≥ . . . ≥ λq ≥ 0 = · · · = 0 ≥ −λq ≥ . . . ≥ −λ1. The maximum correlation is therefore λ1, between α′1X1 =: U1 and γ′1X2 =: V1, say, satisfying α′1Σ12γ1 = λ1. The remaining canonical variable pairs are obtained recursively. Recall that all means are 0. Now having obtained Uj := α′jX1 and Vj := γ′jX2 satisfying V(Uj) = E(U²j) = V(Vj) = E(V²j) = 1 for j = 1, . . . , m < q, and

E(UjVj) = λj , 1 ≤ j ≤ m,   E(UjVi) = E(UjUi) = E(VjVi) = 0, 1 ≤ i ≠ j ≤ m,

we show that the maximum correlation E(Um+1Vm+1) between Um+1 = α′m+1X1 and Vm+1 = γ′m+1X2, which are uncorrelated with U1, V1, . . . , Um, Vm and satisfy E(U²m+1) = E(V²m+1) = 1, is λm+1.

Return to the problem of maximizing α′Σ12γ subject to α′Σ11α = γ′Σ22γ = 1, writing U = α′X1 and V = γ′X2, but now also with the additional restrictions that for 1 ≤ j ≤ m,

0 = E(UUj) = α′Σ11αj ,   0 = E(V Vj) = γ′Σ22γj ,

which automatically make U and V uncorrelated respectively with {Vj , 1 ≤ j ≤ m} and {Uj , 1 ≤ j ≤ m}, because

E(UVj) = α′Σ12γj = λj α′Σ11αj = 0,   E(V Uj) = γ′Σ21αj = λj γ′Σ22γj = 0.
So construct the function

ψm+1 = α′Σ12γ − (λ/2)(α′Σ11α) − (µ/2)(γ′Σ22γ) − ∑_{j=1}^m νj α′Σ11αj − ∑_{j=1}^m θj γ′Σ22γj

and set to 0q and 0r the vectors of partial derivatives wrt α and γ respectively:

Σ12γ − λΣ11α − ∑_{j=1}^m νj Σ11αj = 0q   and   Σ21α − µΣ22γ − ∑_{j=1}^m θj Σ22γj = 0r .

Now for each j = 1(1)m, pre-multiplying the first by α′j and the second by γ′j (and using the restrictions α′jΣ11α = 0, γ′jΣ22γ = 0 together with, from (7.4), α′jΣ12γ = λj γ′jΣ22γ = 0 and γ′jΣ21α = λj α′jΣ11α = 0) yields

0 = νj α′jΣ11αj = νj   and   0 = θj γ′jΣ22γj = θj ;

and we are left just with the original conditions (7.3). It follows that λm+1 and the corresponding αm+1 and γm+1 are the next choice.
Define the matrices H (q × q) = [α1|α2| . . . |αq], Γ1 (r × q) = [γ1|γ2| . . . |γq] and Λ (q × q) = diag(λ1, λ2, . . . , λq). We have

H′Σ11H = Iq ,   H′Σ12Γ1 = Λ,   Γ′1Σ22Γ1 = Iq .

Recall Σ22 = B′2B2. Now if we take an orthonormal set {g1, g2, . . . , gr−q} in the ortho-complement of the column space of B2Γ1, then the matrix G (r × (r − q)) whose columns are g1, g2, . . . , gr−q satisfies G′G = Ir−q and G′B2Γ1 = 0(r−q)×q. Defining Γ2 (r × (r − q)) = B2⁻¹G, so that G′ = Γ′2B′2, we get Γ′2Σ22Γ1 = Γ′2B′2B2Γ1 = G′B2Γ1 = 0(r−q)×q. Further, Γ′2Σ22Γ2 = G′G = Ir−q.

Putting these together, we can write

[ H′  0q×r ; 0q×q  Γ′1 ; 0(r−q)×q  Γ′2 ] · [ −λΣ11  Σ12 ; Σ21  −λΣ22 ] · [ H  0q×q  0q×(r−q) ; 0r×q  Γ1  Γ2 ]

= [ −λH′Σ11  H′Σ12 ; Γ′1Σ21  −λΓ′1Σ22 ; Γ′2Σ21  −λΓ′2Σ22 ] · [ H  0q×q  0q×(r−q) ; 0r×q  Γ1  Γ2 ]

= [ −λIq  Λ  0q×(r−q) ; Λ  −λIq  0q×(r−q) ; 0(r−q)×q  0(r−q)×q  −λIr−q ].

This last matrix is connected with the dispersion matrix of the random vector

Y := [ H′  0q×r ; 0q×q  Γ′1 ; 0(r−q)×q  Γ′2 ] X = ( U′ , V′1 , V′2 )′,

where U = H′X1 = (U1, U2, . . . , Uq)′, V1 = Γ′1X2 = (V1, V2, . . . , Vq)′ and V2 = Γ′2X2 = (Vq+1, Vq+2, . . . , Vr)′, say. These random variables are called the canonical variables, and λ1, λ2, . . . , λq respectively the first, second, . . . , qth canonical correlations.

It is worth noting that the components of V2 = Γ′2X2 are not unique, because Γ2 is not: any other choice of the orthonormal set, tantamount to post-multiplication of Γ2 by an orthogonal matrix, serves the same purpose.

Sometimes we refer to the components of U and V = (V′1 , V′2)′ = Γ′X2, where Γ = [Γ1|Γ2], together as the canonical variates, with the understanding that the first q components of V achieve the canonical correlations with the corresponding components of U, and the remaining r − q components are uncorrelated with those as well as with the components of U.
Note: the special case q = 1 yields the normalized X1 and the multiple regression of X1 on X2, . . . , Xp as the pair of canonical variables, while the squared first canonical correlation λ1² simply equals the squared multiple correlation coefficient R²1·23...p .

Computation: although in theory the eigenvalues λ1², . . . , λq² and corresponding eigenvectors α1, . . . , αq of Σ11⁻¹Σ12Σ22⁻¹Σ21, along with the relations γj = λj⁻¹ Σ22⁻¹Σ21 αj when λj ≠ 0, j = 1(1)q, can be used at one go to obtain the q pairs of canonical variables and correlations, in practice when q is large this is not feasible. Instead, a recursive procedure is adopted to first compute λ1, then λ2, and so on.

That procedure is based on the equalities H′Σ11H = Iq and HΛ²H⁻¹ = Σ11⁻¹Σ12Σ22⁻¹Σ21. The first means that H′Σ11 = H⁻¹, so that the rows α̃′1, . . . , α̃′q of H⁻¹ are given by

α̃′j = α′jΣ11

once we obtain αj , j = 1, . . . , q.
˜
The second equality means Σ11⁻¹Σ12Σ22⁻¹Σ21 = ∑_{j=1}^q λj² αj α̃′j . Now, obtain the largest root λ1 and the corresponding α1 by starting with some initial approximation α1^(0) for α1, using the iterative relation Σ11⁻¹Σ12Σ22⁻¹Σ21 α1^(i) = λ1^(i+1) α1^(i+1), i = 0, 1, 2, . . . , and setting α1^(i+1)′ Σ11 α1^(i+1) = 1. After obtaining α1, . . . , αm for 1 ≤ m < q,
also compute α̃1, . . . , α̃m and consider the matrix

Σ11⁻¹Σ12Σ22⁻¹Σ21 − ∑_{j=1}^m λj² αj α̃′j = H [ 0m×m  0m×(q−m) ; 0(q−m)×m  diag(λ²m+1, . . . , λ²q) ] H⁻¹,

which has largest eigenvalue λ²m+1; applying the above iterative procedure to this matrix will yield λm+1 and the corresponding αm+1.
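The following Python sketch (my own illustration, not from the notes) implements this recursive scheme: power iteration on Σ11⁻¹Σ12Σ22⁻¹Σ21 for the largest eigenvalue λ1², followed by deflation with λj² αj α̃′j , α̃j = Σ11αj ; the tolerances and the starting vector are arbitrary choices.

import numpy as np

def canonical_by_power_iteration(Sigma, q, tol=1e-12, max_iter=1000):
    S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
    S21, S22 = Sigma[q:, :q], Sigma[q:, q:]
    M = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S21))
    M_work = M.copy()
    lambdas, alphas = [], []
    for _ in range(q):
        a = np.ones(q)
        a = a / np.sqrt(a @ S11 @ a)                    # initial approximation alpha^(0)
        for _ in range(max_iter):
            b = M_work @ a
            nrm = np.sqrt(max(b @ S11 @ b, 0.0))
            if nrm < tol:                               # remaining eigenvalues are (numerically) zero
                b = a
                break
            b = b / nrm                                 # keep alpha' Sigma11 alpha = 1
            if np.linalg.norm(b - a) < tol:
                a = b
                break
            a = b
        lam2 = max(a @ (M_work @ a) / (a @ a), 0.0)     # estimate of lambda_{m+1}^2
        lambdas.append(np.sqrt(lam2))
        alphas.append(a)
        M_work = M_work - lam2 * np.outer(a, S11 @ a)   # deflate by lambda^2 * alpha alpha_tilde'
    return np.array(lambdas), np.column_stack(alphas)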
Sample versions: The natural sample analogues of these would be the quantities obtained from Σ̂ in the same way the population ones were obtained from Σ. We use the MLE Σ̂ = A/N, because then the sample quantities also become MLEs of the corresponding population characteristics. For instance, the sample canonical correlation coefficients l1, . . . , lq are roots of the equation

det [ −lΣ̂11  Σ̂12 ; Σ̂21  −lΣ̂22 ] = 0

and the corresponding sample canonical variates α̂′jX1 and γ̂′jX2 are obtained from the following analogue of (7.4):

[ −ljΣ̂11  Σ̂12 ; Σ̂21  −ljΣ̂22 ] · (α̂′j , γ̂′j)′ = 0p ,

alongside α̂′jΣ̂11α̂j = γ̂′jΣ̂22γ̂j = 1.

These then are MLEs of the jth population canonical correlation and of the corresponding pair of canonical variates.
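A minimal sketch of the sample version in Python (again my own illustration): plug Σ̂ = A/N into the population routine sketched earlier (it assumes the function population_canonical_correlations defined above is available).

import numpy as np

def sample_canonical_correlations(X, q):
    # X: N x p data matrix; returns (l_1 >= ... >= l_q, alpha_hats, gamma_hats)
    N = X.shape[0]
    Xc = X - X.mean(axis=0)
    Sigma_hat = (Xc.T @ Xc) / N          # the MLE A / N
    return population_canonical_correlations(Sigma_hat, q)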

In practice, when the population canonical correlations are not known, as is typically the case, we may not be interested in computing those λj which are likely to be small, an indication of which is given by the sample quantity lj being small. In fact, it is of interest to test whether a λj is 0, which implies λj+1 = · · · = λq = 0 as well.

Note that the number of nonzero canonical correlation coefficients, or the number of non-zero eigenvalues of Σ11⁻¹Σ12Σ22⁻¹Σ21, is equal to

r(Σ11⁻¹Σ12Σ22⁻¹Σ21) = r(Σ12Σ22⁻¹Σ21) = r(Σ12B2⁻¹(B′2)⁻¹Σ21) = r[((B′2)⁻¹Σ21)′((B′2)⁻¹Σ21)] = r((B′2)⁻¹Σ21) = r(Σ12)

by the non-singularity of Σ11 and B2. So, acceptance of a test for λm+1 = 0 together with rejection of λm = 0 can be taken as acceptance of m = r(Σ12), because that means λ1 ≥ · · · ≥ λm > 0 = λm+1 = · · · = λq. We state without proof that the likelihood ratio criterion for H0 : λm+1 = 0 ⇔ r(Σ12) ≤ m is ∏_{j=m+1}^q (1 − lj²)^{N/2}, where N is the sample size. When N is large,

−(N − (p+3)/2) ∑_{j=m+1}^q ln(1 − lj²) ∼a χ²_{(q−m)(r−m)} .

Of course, m = 0 corresponds to Σ12 = 0q×r, i.e. independence of X1 and X2; the likelihood ratio criterion then becomes ∏_{j=1}^q (1 − lj²)^{N/2}, and the large-sample test is based on −(N − (p+3)/2) ∑_{j=1}^q ln(1 − lj²) ∼a χ²_{qr}, which is approximately equal to N ∑_{j=1}^q lj² = N tr[A11⁻¹A12A22⁻¹A21].
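The following Python sketch (illustrative, not from the notes) carries out this sequence of large-sample tests for m = 0, 1, . . . , q − 1, given the sample canonical correlations and the sample size.

import numpy as np
from scipy.stats import chi2

def canonical_rank_tests(l, N, q, r):
    # l: sample canonical correlations l_1 >= ... >= l_q; N: sample size.
    # For each m, tests H0: lambda_{m+1} = ... = lambda_q = 0 using
    # -(N - (p+3)/2) * sum_{j>m} ln(1 - l_j^2), referred to chi^2 with (q-m)(r-m) df.
    p = q + r
    l = np.asarray(l, dtype=float)
    results = []
    for m in range(q):
        stat = -(N - (p + 3) / 2.0) * np.sum(np.log(1.0 - l[m:] ** 2))
        df = (q - m) * (r - m)
        results.append((m, stat, df, chi2.sf(stat, df)))
    return results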

Computation from the correlation matrix: We describe the population version; the sample version is similar. Partition X as usual as X = (X′1 , X′2)′, with X1 q × 1 and X2 r × 1, with respective dispersion matrices D(X1) = Σ11 and D(X2) = Σ22, and cross-covariance matrix Cov(X1, X2) = Σ12. Then denote respectively by Σ1 (q × q) = diag(√σ11, √σ22, . . . , √σqq) and Σ2 (r × r) = diag(√σq+1,q+1, √σq+2,q+2, . . . , √σpp) the diagonal matrices containing the standard deviations of the components of X1 and X2. Finally, let us put

Σd (p × p) = [ Σ1  0q×r ; 0r×q  Σ2 ] = diag(√σ11, . . . , √σpp).
Then, partitioning the correlation matrix R := ((ρXi,Xj))1≤i,j≤p = Σd⁻¹ΣΣd⁻¹ in the obvious way,

[ R11  R12 ; R21  R22 ] = [ Σ1⁻¹Σ11Σ1⁻¹  Σ1⁻¹Σ12Σ2⁻¹ ; Σ2⁻¹Σ21Σ1⁻¹  Σ2⁻¹Σ22Σ2⁻¹ ],

we have R11Σ1 = Σ1⁻¹Σ11, R12Σ2 = Σ1⁻¹Σ12, R21Σ1 = Σ2⁻¹Σ21 and R22Σ2 = Σ2⁻¹Σ22. So, multiplying the equations Σ12γj = λjΣ11αj and Σ21αj = λjΣ22γj of (7.4) on the left respectively by Σ1⁻¹ and Σ2⁻¹, we get

R12(Σ2γj) = λj R11(Σ1αj)  and  R21(Σ1αj) = λj R22(Σ2γj),   or   [ −λjR11  R12 ; R21  −λjR22 ] · (a′j , c′j)′ = 0p ,

where aj = Σ1αj and cj = Σ2γj for 1 ≤ j ≤ q.
These equations have the same structure as (7.3) or (7.4) when we consider λj, aj, cj as the unknown scalar and vectors, j = 1(1)q. The solutions for λ1, . . . , λq give the canonical correlations. However, in the absence of knowledge of the variances σ11, . . . , σpp, we cannot compute the canonical variables α′jX1 = a′jΣ1⁻¹X1 and γ′jX2 = c′jΣ2⁻¹X2.

Nevertheless, given the sample correlation matrix R = ((rij))1≤i,j≤p, partitioned as usual as [ R11  R12 ; R21  R22 ], the sample canonical correlations lj = λ̂j , 1 ≤ j ≤ q, can be calculated as above and the tests etc. described earlier carried out.
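A quick numerical check of this fact, in Python with made-up numbers: the (squared) canonical correlations obtained from R coincide with those obtained from Σ, since passing from (αj, γj) to (aj, cj) is just a rescaling.

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
Sigma = B @ B.T                                   # an arbitrary positive definite Sigma
q = 2
sd = np.sqrt(np.diag(Sigma))
R = Sigma / np.outer(sd, sd)                      # the correlation matrix Sigma_d^{-1} Sigma Sigma_d^{-1}

def squared_canonical_correlations(S, q):
    S11, S12, S21, S22 = S[:q, :q], S[:q, q:], S[q:, :q], S[q:, q:]
    M = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S21))
    return np.sort(np.clip(np.linalg.eigvals(M).real, 0.0, None))[::-1]

print(squared_canonical_correlations(Sigma, q))
print(squared_canonical_correlations(R, q))       # the same values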

8 Elliptical distributions

The multivariate normal density (2π)^{−p/2} |Σ|^{−1/2} exp{ −(1/2)(x − µ)′Σ⁻¹(x − µ) } depends on the argument x through the function (x − µ)′Σ⁻¹(x − µ). Thus the level sets/contours of the density, i.e. inverse images of points in the range (0, ∞), are sets

(x − µ)′Σ⁻¹(x − µ) = c

in the form of the boundary of an ellipsoid, generalizing ellipses in R². Distributions on Rp of this type, with densities proportional to

|Λ|^{−1/2} g((x − ν)′Λ⁻¹(x − ν))

for some ν ∈ Rp, Λ (p × p) positive definite and some function g, are said to be elliptically contoured/elliptically symmetric, or simply elliptical.

A special case is when ν = 0 and Λ = I, when the density becomes proportional


˜ ˜
to g(x′ x) with contours that are concentric spheres with the origin 0 as centre. They
˜˜ ˜
are called spherically contoured, or just spherical for short. The terms can also be defined for distributions without densities. We have all encountered, for instance, the uniform distribution on the unit sphere

S p−1 = S(0, 1) = {x ∈ Rp : ∥x∥= 1},


˜ ˜ ˜

the simplest example of a spherical distribution without a density in Rp , but we


concentrate only on distributions with a density.

In both cases, any constant of proportionality can be absorbed in g, and in the


sequel we do that. If X has an elliptical distribution as above, then Y := C⁻¹(X − ν), where Λ = CC′, has a spherical distribution. Thus the two types are related.

In fact the density of Y then can be easily seen to be g(y ′ y) = g(∥y∥2 ). It can
˜ ˜˜ ˜
be shown that g is involved really only in the marginal distribution of ∥Y ∥ in the
˜
following sense: suppose we express y = (y1 , . . . , yp ) ∈ Rp \ {0} in polar coordinates
˜ ˜
(r, θ1 , . . . , θp−1 ) as follows:

y1 = r cos θ1 ,

y2 = r sin θ1 cos θ2 ,

y3 = r sin θ1 sin θ2 cos θ3 ,

...

yp−1 = r sin θ1 sin θ2 · · · sin θp−2 cos θp−1 ,

yp = r sin θ1 sin θ2 · · · sin θp−2 sin θp−1 .

Then r = ∥y∥. Let us write y = ψ(r, θ) so that (r, θ) = ψ⁻¹(y). Note that ψ is a bijection from (0, ∞) × ∏_{j=1}^{p−2} (−π/2, π/2) × (−π, π] → Rp \ {0p}.
The Jacobian matrix J of partial derivatives of ψ is given by
 
sin θ1 r cos θ1 0 ... 0 0
 
 cos θ1 sin θ2 −r sin θ1 sin θ2 r cos θ1 cos θ2 ... 0 0

 
 
 cos θ1 cos θ2 sin θ3 −r sin θ1 cos θ2 sin θ3 −r cos θ1 sin θ2 sin θ3 ... 0 0

 
 .. .. .. ... .. .. 
. . . . .
 
 
 
 p−2
Q p−2
Q p−2
Q p−3
Q p−1
Q 
 cos θj ·sin θp−1 −r sin θ1 cos θj ·sin θp−1 −r cos θ1 sin θ2 cos θj ·sin θp−1 ... −r cos θj ·sin θp−2 sin θp−1 r cos θj 
 j=1 j=2 j=3 j=1 j=1 
p−1
Q p−1
Q p−1
Q p−3
Q p−2
Q
cos θj −r sin θ1 cos θj −r cos θ1 sin θ2 cos θj ... −r cos θj ·sin θp−2 cos θp−1 −r cos θj ·sin θp−1
j=1 j=2 j=1 j=1 j=1

To compute its determinant, multiply on the right by
 
r sin θ1 r sin θ2 r sin θ3 . . . r sin θp−2 r sin θp−1 1
 
 cos θ1 0 0 ... 0 0 0
 

 
0 cos θ2 0 ... 0 0 0
 
 
 
K := 
 
0 0 cos θ3 . . . 0 0 0 
.. .. ..
 
 ... 
 ... ... . . . 
 
 
 0 0 0 . . . cos θp−2 0 0 
 
0 0 0 ... 0 cos θp−1 0

and note that JK is an upper triangular matrix with diagonal entries comprising the vector (r, r cos θ1, r cos θ1 cos θ2, . . . , r ∏_{j=1}^{p−2} cos θj , ∏_{j=1}^{p−1} cos θj), whose determinant is r^{p−1} ∏_{j=1}^{p−1} cos^{p−j} θj .

Now, because the determinant of K is (−1)^{p+1} × ∏_{j=1}^{p−1} cos θj , it follows that the absolute value of the Jacobian determinant |J| equals r^{p−1} ∏_{j=1}^{p−1} cos^{p−j−1} θj .

Thus the joint density of (R, Θ) = ψ⁻¹(Y) is

g(r²) · r^{p−1} ∏_{j=1}^{p−2} cos^{p−j−1} θj ,

which shows that R and Θ = (Θ1, Θ2, . . . , Θp−1) are independent; and the marginal densities are obtained by integration:

fR(r) = g(r²) r^{p−1} ∫_{−π}^{π} ∫_{−π/2}^{π/2} · · · ∫_{−π/2}^{π/2} ∏_{j=1}^{p−2} cos^{p−j−1} θj dθ1 · · · dθ_{p−2} dθ_{p−1}

= g(r²) r^{p−1} · 2π · ∏_{j=1}^{p−2} ∫_{−π/2}^{π/2} cos^{p−j−1} θj dθj = g(r²) r^{p−1} · 2π · ∏_{j=1}^{p−2} [ Γ((p−j)/2) Γ(1/2) / Γ((p−j−1)/2 + 1) ]

= g(r²) r^{p−1} · 2π · ∏_{j=2}^{p−1} [ √π Γ(j/2) / Γ((j−1)/2 + 1) ] = g(r²) r^{p−1} · 2π · π^{(p−2)/2} ∏_{j=2}^{p−1} Γ(j/2) / Γ((j+1)/2) = g(r²) r^{p−1} · 2π^{p/2} / Γ(p/2),

noting that

∫_{−π/2}^{π/2} cos^m θ dθ = 2 ∫_0^{π/2} cos^{m−1} θ cos θ dθ = 2 ∫_0^1 (1 − sin²θ)^{(m−1)/2} d(sin θ) = 2 ∫_0^1 (1 − u²)^{(m−1)/2} du

= 2 ∫_0^1 (1 − u²)^{(m−1)/2} (1/(2u)) d(u²) = ∫_0^1 (1 − v)^{(m+1)/2 − 1} v^{1/2 − 1} dv = B((m+1)/2, 1/2) = Γ((m+1)/2) Γ(1/2) / Γ(m/2 + 1).
The number Ap := 2π^{p/2} / Γ(p/2) is the surface area of the unit sphere S^{p−1} in Rp.
It follows that the random vector U := Y/∥Y∥ = Y/R, whose polar coordinates are (1, Θ1, Θ2, . . . , Θp−1), i.e. whose rectangular coordinates are given by

U = (sin Θ1, cos Θ1 sin Θ2, . . . , cos Θ1 · · · cos Θp−2 sin Θp−1, cos Θ1 · · · cos Θp−1),
has a uniform distribution on the sphere. As mentioned before, this is an example of a spherically symmetric distribution without a density. Clearly E(U) = 0, and therefore E(Y) = 0. Further, the dispersion matrix of U is (1/p)Ip, hence that of Y is (E(R²)/p) Ip, provided E(R²) < ∞.
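This decomposition suggests a simple way to simulate from a spherical distribution: draw U uniformly on S^{p−1} (e.g. by normalizing a standard normal vector, itself spherical) and multiply by an independent radius R. A Python sketch (my own illustration; the radial law is supplied by the user):

import numpy as np

rng = np.random.default_rng(1)

def sample_spherical(n, p, sample_radius):
    # Y = R * U with U uniform on the unit sphere S^{p-1} and R drawn from the
    # radial law returned by the user-supplied function sample_radius(n).
    Z = rng.standard_normal((n, p))
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)    # uniform on the sphere
    R = sample_radius(n)                                # independent radii
    return R[:, None] * U

# Monte Carlo check of D(Y) = (E(R^2)/p) I_p, with R^2 ~ chi^2_5 and p = 3:
Y = sample_spherical(200_000, 3, lambda n: np.sqrt(rng.chisquare(5, size=n)))
print(np.cov(Y.T))                                      # approximately (5/3) I_3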

Returning to the elliptically contoured X = CY + ν, we have

Theorem 8.1 Suppose E(R²) < ∞. Then

µ := E(X) = ν   and   Σ := D(X) = (E(R²)/p) Λ.
Example: If Z ∼ Np(0p, Ip) and mS² ∼ χ²m independently, then Y := Z/S is said to have a p-variate t-distribution with df m. The density of Y is

[ Γ((m + p)/2) / ( Γ(m/2) m^{p/2} π^{p/2} ) ] (1 + y′y/m)^{−(m+p)/2},

which is clearly spherically symmetric. In this case (1/p)∥Y∥² = (∥Z∥²/p)/S² ∼ Fp,m. An elliptically contoured distribution is enjoyed by X := µ + CY for any µ ∈ Rp and C (p × p) nonsingular.
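A Python sketch of this construction (my own illustration; the function name and defaults are arbitrary):

import numpy as np

rng = np.random.default_rng(2)

def multivariate_t(n, mu, C, m):
    # X = mu + C Y with Y = Z / S, Z ~ N_p(0, I_p) and m S^2 ~ chi^2_m independent of Z,
    # so Y is p-variate t with m df and X is elliptically contoured with Lambda = C C'.
    mu = np.asarray(mu, dtype=float)
    C = np.asarray(C, dtype=float)
    p = len(mu)
    Z = rng.standard_normal((n, p))
    S = np.sqrt(rng.chisquare(m, size=n) / m)
    Y = Z / S[:, None]
    return mu + Y @ C.T

# For m > 2, E(R^2)/p = m/(m - 2), so by Theorem 8.1, D(X) = (m/(m - 2)) C C'.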

9 Classification and Mahalanobis’ D2

We are sometimes required to compare two multivariate distributions of the same dimension in terms of how alike/close to each other or unlike/far apart they are; one of the chief motivations is being able to decide, given an observation, which of the two populations it belongs to.

Assuming they have the same (positive definite) dispersion matrix Σ but different
means µ1 and µ2 , their squared distance is defined as ∆2 := (µ1 − µ2 )′ Σ−1 (µ1 − µ2 ).
˜ ˜ ˜ ˜ ˜ ˜
While this quantity appears as a parameter in sampling distributions we encounter
later, what we really need is some measure of how far a given observation is from
both the distributions.

Thus, given x ∈ Rp , we define the squared distance between x and a population


˜ ˜
π with mean vector µ and dispersion matrix Σ as Dπ2 (x) = (x − µ)′ Σ−1 (x − µ).
˜ ˜ ˜ ˜ ˜ ˜
A natural classification rule therefore would be to classify x as falling in π2 as opposed to π1 if the difference D²π1(x) − D²π2(x) is large.
It turns out that for multivariate normal distributions, this criterion is equivalent to another reasonable criterion in terms of the densities: denoting the density of π by pπ, we consider the ratio

pπ1(x) / pπ2(x) = exp{ −(1/2)[ (x − µ1)′Σ⁻¹(x − µ1) − (x − µ2)′Σ⁻¹(x − µ2) ] },

and so the ratio is large enough, say ≥ k, iff

ln k ≤ −(1/2)[ (x − µ1)′Σ⁻¹(x − µ1) − (x − µ2)′Σ⁻¹(x − µ2) ]

= −(1/2)[ { x′Σ⁻¹x − µ′1Σ⁻¹x − x′Σ⁻¹µ1 + µ′1Σ⁻¹µ1 } − { x′Σ⁻¹x − µ′2Σ⁻¹x − x′Σ⁻¹µ2 + µ′2Σ⁻¹µ2 } ]

= −(1/2)[ −2x′Σ⁻¹µ1 + 2x′Σ⁻¹µ2 + µ′1Σ⁻¹µ1 − µ′2Σ⁻¹µ2 ]

= x′Σ⁻¹(µ1 − µ2) − (1/2)[ µ′1Σ⁻¹µ1 − µ′1Σ⁻¹µ2 + µ′1Σ⁻¹µ2 − µ′2Σ⁻¹µ2 ]

= (x − (1/2)(µ1 + µ2))′ Σ⁻¹ (µ1 − µ2),
and this last function is known as Fisher’s linear discriminant function (LDF) in x.
˜
The simplest rule is when we choose k = 1, i.e. ln k = 0: in this case we put x in π1 or π2 according as the LDF is positive or negative, which corresponds to the likelihood ratio being > or < 1, or equivalently, the distance of x from π1 being less or more than that from π2.
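A minimal Python sketch of this rule (my own illustration; the threshold argument anticipates the more general choices of k discussed further on):

import numpy as np

def fisher_ldf(x, mu1, mu2, Sigma):
    # (x - (mu1 + mu2)/2)' Sigma^{-1} (mu1 - mu2)
    w = np.linalg.solve(Sigma, mu1 - mu2)
    return (np.asarray(x) - 0.5 * (np.asarray(mu1) + np.asarray(mu2))) @ w

def classify(x, mu1, mu2, Sigma, log_k=0.0):
    # Assign x to pi_1 iff the LDF is at least ln k; log_k = 0 gives the simplest rule (k = 1).
    return 1 if fisher_ldf(x, mu1, mu2, Sigma) >= log_k else 2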

Fisher called this rule the 'best classifier' without being aware of the distance function. His idea was based on the following motivation: the LDF should maximize the following 't statistic' among linear functions. For each ℓ ∈ Rp, consider the quantity [Eπ1(ℓ′X) − Eπ2(ℓ′X)]² / V(ℓ′X), where the variance is computed under either π1 or π2. Fisher's idea was to maximize this over ℓ ∈ Rp \ {0p} when µ1 ≠ µ2, in order for the classification to achieve the maximum possible contrast.
The quantity equals (ℓ′µ1 − ℓ′µ2)² / (ℓ′Σℓ), so we recall that the maximizer is proportional to Σ⁻¹(µ1 − µ2); this leads to the LDF, except that the average value under the mixture giving weights 1/2 and 1/2 to π1 and π2 is subtracted: the LDF equals

x′Σ⁻¹(µ1 − µ2) − { (1/2) Eπ1[X′Σ⁻¹(µ1 − µ2)] + (1/2) Eπ2[X′Σ⁻¹(µ1 − µ2)] }.
The mixture could lead us to think of an equiprobable prior distribution on {π1, π2}. Indeed, among the several possible considerations from which other choices for k may arise, a prominent one is the Bayesian paradigm with the additional incorporation of a cost function for misclassification. Suppose we put prior probabilities q1 and q2 = 1 − q1 on π1 and π2 respectively, and costs c(1|2) for an observation actually from π2 being misclassified into π1 and c(2|1) for the other misclassification error.

The natural objective would be to minimize the posterior expected cost. If our rule is to classify x into π1 if x ∈ R1 and into π2 if x ∈ R2 = Rp − R1, then the posterior probabilities of misclassification are q1 ∫_{R2} p1(x) dx for an observation from π1 being classified into π2, and q2 ∫_{R1} p2(x) dx for the other case. Thus the posterior expected cost is

c(2|1) q1 ∫_{R2} p1(x) dx + c(1|2) q2 ∫_{R1} p2(x) dx = c(2|1) q1 + ∫_{R1} [ c(1|2) q2 p2(x) − c(2|1) q1 p1(x) ] dx.
Clearly the minimum value is attained by putting in R1 all x for which

c(1|2) q2 p2(x) < c(2|1) q1 p1(x)  ⇔  p1(x)/p2(x) > c(1|2)q2 / (c(2|1)q1),

and in R2 all x for which the opposite inequality holds; points of equality may be put in either R1 or R2. Thus the choice k = c(1|2)q2 / (c(2|1)q1) can arise from this idea. Note that the original classifier with k = 1 is the special case when the costs of misclassification are symmetric, i.e. c(1|2) = c(2|1), and the prior makes no preference between π1 and π2, i.e. q2 = q1 = 1/2.
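For concreteness, a short Python snippet (with made-up priors and costs) showing how this threshold would be used with the classify sketch given earlier:

import numpy as np

q1, q2 = 0.3, 0.7            # hypothetical prior probabilities on pi_1 and pi_2
c12, c21 = 5.0, 1.0          # hypothetical costs c(1|2) and c(2|1)
k = (c12 * q2) / (c21 * q1)  # threshold for the likelihood ratio p1(x)/p2(x)
log_k = np.log(k)
# classify(x, mu1, mu2, Sigma, log_k=log_k) then implements the minimum
# posterior-expected-cost rule; k = 1 recovers the original classifier.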

Often, the parameters of the two populations need to be estimated from data. Say, samples x1, . . . , xN1 and y1, . . . , yN2 of sizes N1 and N2 are given from π1 and π2 respectively. Then we estimate µ1 by x̄, µ2 by ȳ, and Σ by (A1 + A2)/(N1 + N2), with the obvious meanings for the statistics. It can be shown that the resulting statistic has a distribution with ∆² as a parameter, under both π1 and π2.
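A sketch of this plug-in rule in Python (my own illustration):

import numpy as np

def plug_in_ldf(x, sample1, sample2):
    # Estimate mu_1, mu_2 by the sample means and Sigma by (A_1 + A_2)/(N_1 + N_2),
    # then evaluate the LDF at x with these plug-in estimates.
    xbar, ybar = sample1.mean(axis=0), sample2.mean(axis=0)
    A1 = (sample1 - xbar).T @ (sample1 - xbar)
    A2 = (sample2 - ybar).T @ (sample2 - ybar)
    Sigma_hat = (A1 + A2) / (sample1.shape[0] + sample2.shape[0])
    w = np.linalg.solve(Sigma_hat, xbar - ybar)
    return (np.asarray(x) - 0.5 * (xbar + ybar)) @ w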
