Mult 2023 Final 1
General notation: When K is a finite set, |K| denotes its cardinality; when a
real number, its modulus; when a non-scalar square matrix, its determinant.
First, a fixed notation: for p ∈ N, we shall write the standard (orthonormal) basis of R^p as {e_1, ..., e_p}, where e_j = (0, 0, ..., 0, 1, 0, ..., 0)′ with the 1 in the j-th position, for 1 ≤ j ≤ p.
1. Nonsingular matrices with one row/column given: An often invoked fact we get out of the way first. Recall that any linearly independent set of vectors in R^p can be extended into a basis of R^p. So, if we start with one non-null vector x_1 ∈ R^p and extend the set {x_1} into a basis {x_1, x_2, ..., x_p}, then the matrix A = [x_1 | x_2 | ⋯ | x_p] (respectively, its transpose) is nonsingular with the given x_1 as its first column (respectively, row).
2. Gram-Schmidt orthogonalization: if x_1, x_2, ..., x_k are (ordered) linearly independent vectors in R^p (clearly p ≥ k), then the following (ordered) sequence of orthogonal vectors is its Gram-Schmidt orthogonalization: u_1 = x_1; and recursively for 2 ≤ j ≤ k,
$$u_j = x_j - \sum_{i=1}^{j-1} \frac{x_j' u_i}{\|u_i\|^2}\, u_i$$
with the norm ∥·∥ being the Euclidean one: ∥u∥ := √(u′u).
The set has the same span as the original set. In fact, sp({u_1, ..., u_j}) = sp({x_1, ..., x_j}) ∀ j = 1, ..., k. That is because, if we write sp({x_1, ..., x_j}) as S_j for 1 ≤ j ≤ k, then in fact for 2 ≤ j ≤ k,
$$\sum_{i=1}^{j-1} \frac{x_j' u_i}{\|u_i\|^2}\, u_i = P_{S_{j-1}}(x_j) \quad\text{and}\quad u_j = x_j - P_{S_{j-1}}(x_j) = P_{S_{j-1}^{\perp}}(x_j),$$
so that inductively, sp{u_i : 1 ≤ i ≤ j} = sp{S_{j-1} ∪ {u_j}} = sp{S_{j-1} ∪ {x_j}} = S_j.
In effect, {u_i/∥u_i∥ : 1 ≤ i ≤ j} is an orthonormal basis for S_j, 1 ≤ j ≤ k.
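As a quick computational illustration (added; not part of the original notes), the recursion above can be coded directly. The function name and test vectors below are our own choices; this is only a sketch of the orthogonalization step.

```python
import numpy as np

def gram_schmidt(X):
    """Columns of X are x_1,...,x_k (assumed linearly independent).
    Returns U whose columns u_1,...,u_k are orthogonal with the same nested spans."""
    X = np.asarray(X, dtype=float)
    U = np.zeros_like(X)
    for j in range(X.shape[1]):
        u = X[:, j].copy()
        for i in range(j):
            # subtract the projection of x_j onto u_i
            u -= (X[:, j] @ U[:, i]) / (U[:, i] @ U[:, i]) * U[:, i]
        U[:, j] = u
    return U

X = np.array([[1., 1., 0.], [1., 0., 1.], [0., 1., 1.]])
U = gram_schmidt(X)
print(np.round(U.T @ U, 10))   # off-diagonal entries are 0: the u_j are orthogonal
```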
3. Orthogonal matrices with one row/column given: Given one vector x_1 ∈ R^p with norm ∥x_1∥ = 1, extend {x_1} into a basis, run the Gram-Schmidt process on it and normalize, to yield an orthonormal basis {x_1, x_2, ..., x_p} of R^p. Then the matrix P with x_1, x_2, ..., x_p as rows (respectively, columns) is orthogonal with the given x_1 as its first row (respectively, column).
4. Spectral decomposition of symmetric matrices: If A_{p×p} is symmetric, then it is expressible as PΛP′ where P is orthogonal and Λ is diagonal. Then AP = PΛ = [λ_1P_1 | λ_2P_2 | ... | λ_pP_p] where P = [P_1 | P_2 | ... | P_p] and Λ = diag(λ_1, λ_2, ..., λ_p), which in particular means that for every j, AP_j = λ_jP_j; so that λ_j is an eigenvalue of A and P_j is an eigenvector for λ_j.
Another way to express the same fact is that every symmetric matrix is diagonalizable by an orthogonal matrix.
The only matrix that is nnd as well as npd is the null matrix 0.
(Recall the standard fact being invoked: B is nnd if and only if B = C′C for some matrix C.) The “if” part is trivial: if so, then for any x ∈ R^p, x′Bx = (Cx)′(Cx) = ∥Cx∥² ≥ 0. For the “only if” part, let B = P′ΛP be the spectral decomposition of B where P is orthogonal and Λ = diag(λ_1, λ_2, ..., λ_p) with λ_j ≥ 0 ∀ j (the eigenvalues of B); then defining Λ^{1/2} = diag(√λ_1, √λ_2, ..., √λ_p) and C = Λ^{1/2}P, we get C′C = P′(Λ^{1/2})²P = B.
More generally, we can choose C = QΛ^{1/2}P for any orthogonal Q; the choice Q = I_p gives the above example, but other useful choices exist. For instance, the choice Q = P′ actually makes C symmetric and in fact nnd itself: in this case C′ = C and B = C′C = C².
We shall call this C the square root of B and denote it by B^{1/2}. Note that if B is pd then so is C.
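A minimal numerical sketch of the symmetric square root (added, not from the notes); note that NumPy's eigh returns the decomposition in the B = PΛP′ convention with eigenvectors as columns.

```python
import numpy as np

def sym_sqrt(B):
    """Symmetric nnd square root B^(1/2) via the spectral decomposition B = P diag(lam) P'."""
    lam, P = np.linalg.eigh(B)        # eigenvalues (nonnegative for nnd B), orthonormal eigenvectors
    lam = np.clip(lam, 0.0, None)     # guard against tiny negative round-off
    return P @ np.diag(np.sqrt(lam)) @ P.T

B = np.array([[2., 1.], [1., 2.]])    # a pd matrix
C = sym_sqrt(B)
print(np.allclose(C @ C, B))          # True: C is symmetric and C^2 = B
```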
Of course, if B is the projection matrix onto S, then S = {Bx : x ∈ R^p} equals what is called the column space of B, denoted usually by C(B).
A general fact true for idempotent matrices of course holds for projection matrices B: since the rank of B is the number of non-zero eigenvalues (counted with multiplicity) and the eigenvalues are just 0 and 1, the rank is just the multiplicity of 1, or the sum of the eigenvalues, which equals its trace.
9. Given a general matrix A, the projection onto C(A) equals B = A(A′A)⁻A′ where the g-inverse is symmetric: clearly B is symmetric; it is easy to show B³ = B², whence B has only 0 and 1 as eigenvalues, whence B is idempotent.
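A quick sanity check of item 9 (added, not from the notes); here pinv supplies a symmetric g-inverse of A′A, and the random A is only an example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
B = A @ np.linalg.pinv(A.T @ A) @ A.T               # B = A (A'A)^- A'
print(np.allclose(B, B.T), np.allclose(B @ B, B))   # symmetric and idempotent: a projection
print(np.allclose(B @ A, A))                        # B acts as the identity on C(A)
```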
To see why this holds, first treat the case when S ∩ N(T) = {0_n} (in R^n): we easily conclude in this case that T|_S is one-to-one, so that S and W are isomorphic, hence have the same dimensions. So let S ∩ N(T) be non-trivial, say dim(S ∩ N(T)) = k ≥ 1, and let {x_1, x_2, ..., x_k} be a basis for it. Now, discarding the trivial case W = {0_m} when S = N(T), let dim(W) = r ≥ 1 and let {w_1, ..., w_r} be a basis of W. For 1 ≤ j ≤ r, let y_j ∈ S be such that T(y_j) = w_j. Then we claim that B := {x_1, ..., x_k, y_1, ..., y_r} forms a basis for S.
(b) Transposes: (A ⊗ B)′ = A′ ⊗ B′ .
Let us use P as a common notation for underlying probabilities on the space where
relevant random variables, vectors etc. are defined.
2.1 Joint, marginal and conditional densities
In the language of absolute continuity and singularity, distributions with density are
absolutely continuous with respect to Lebesgue measure.
is the (1-dimensional) marginal density of X_j. More generally, for a subset {X_{j_1}, X_{j_2}, ..., X_{j_m}} of {X_1, X_2, ..., X_p}, the function
$$f_{X_{j_1},\ldots,X_{j_m}}(x_{j_1}, \ldots, x_{j_m}) = \int\!\!\int\cdots\int f_{X}(x_1, x_2, \ldots, x_p)\, dx_{i_1}\, dx_{i_2}\cdots dx_{i_{p-m}}$$
is the corresponding (m-dimensional) marginal density of (X_{j_1}, ..., X_{j_m}).
When f_{X_{j_1},...,X_{j_m}}(x_{j_1}, ..., x_{j_m}) > 0 we call, with {i_1, i_2, ..., i_{p−m}} = {1, 2, ..., p} \ {j_1, j_2, ..., j_m}, the function defined on R^{p−m} by
$$\frac{f_{X}(x_1, x_2, \ldots, x_p)}{f_{X_{j_1},\ldots,X_{j_m}}(x_{j_1}, \ldots, x_{j_m})}$$
the conditional density of (X_{i_1}, X_{i_2}, ..., X_{i_{p−m}}) given that {(X_{j_1}, X_{j_2}, ..., X_{j_m}) = (x_{j_1}, x_{j_2}, ..., x_{j_m})};
which is indeed a density, and the distribution with this density we define as the conditional distribution of (X_{i_1}, X_{i_2}, ..., X_{i_{p−m}}) given that {(X_{j_1}, ..., X_{j_m}) = (x_{j_1}, ..., x_{j_m})}. What is meant by that is this: for subsets B of R^{p−m}, we take the quantity
$$\int_B f_{X_{i_1},\ldots,X_{i_{p-m}}\mid X_{j_1},\ldots,X_{j_m}}(x_{i_1}, \ldots, x_{i_{p-m}}\mid x_{j_1}, \ldots, x_{j_m})\, dx_{i_1}\, dx_{i_2}\cdots dx_{i_{p-m}}$$
as the definition of the conditional probability that {(Xi1 , Xi2 , . . . , Xip−m ) ∈ B}, given
that {(Xj1 , Xj2 , . . . , Xjm ) = (xj1 , xj2 , . . . , xjm )}.
2.2 Moments
Since ∀ i, j, σ_ij is the expected value of the (i, j)-th entry of the (random) matrix (X − E(X))(X − E(X))′, we also represent D(X) as E[(X − E(X))(X − E(X))′] with the obvious interpretation of entrywise expectations; i.e. Σ = E[(X − µ)(X − µ)′].
A couple of immediate observations we can make are that for a k × p matrix A = ((a_ij)), if we define Y_{k×1} = (Y_1, ..., Y_k)′ = AX then E(Y) = Aµ = AE(X), since E(Y_i) = E[Σ_{j=1}^p a_ij X_j] = Σ_{j=1}^p a_ij E(X_j) for 1 ≤ i ≤ k; and D(AX) = AΣA′ = AD(X)A′, since for 1 ≤ t, s ≤ k,
$$\mathrm{Cov}(Y_s, Y_t) = \mathrm{Cov}\Big(\sum_{i=1}^p a_{si} X_i,\ \sum_{j=1}^p a_{tj} X_j\Big) = \sum_{i,j=1}^p a_{si} a_{tj}\, \mathrm{Cov}(X_i, X_j).$$
These two properties obviously characterize the mean vector and dispersion matrix in terms of means and variances of (one-dimensional) random variables: taking ℓ = e_j yields µ_j = E(X_j) and σ_jj = V(X_j); while for 1 ≤ i ≠ j ≤ p, taking ℓ = e_i + e_j, observe 2σ_ij = (σ_ii + σ_jj + 2σ_ij) − (σ_ii + σ_jj) = ℓ′Σℓ − (e_i′Σe_i + e_j′Σe_j) = V(X_i + X_j) − (V(X_i) + V(X_j)) = 2Cov(X_i, X_j).
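A small Monte Carlo check of E(AX) = Aµ and D(AX) = AΣA′ (added; not part of the notes — the particular µ, Σ, A and tolerances are arbitrary illustration choices).

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1., 2., 3.])
Sigma = np.array([[2., .5, .3], [.5, 1., .2], [.3, .2, 1.5]])
A = np.array([[1., -1., 0.], [0., 2., 1.]])           # k x p with k = 2, p = 3

X = rng.multivariate_normal(mu, Sigma, size=200_000)  # rows are draws of X'
Y = X @ A.T                                           # rows are draws of (AX)'
print(np.allclose(Y.mean(axis=0), A @ mu, atol=2e-2))                     # E(AX) ~ A mu
print(np.allclose(np.cov(Y, rowvar=False), A @ Sigma @ A.T, atol=3e-2))   # D(AX) ~ A Sigma A'
```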
It follows that the dispersion matrix Σ is always non-negative definite. In fact, in cases where it is singular (nnd but not pd), X does not possess a density. That is because then ∃ ℓ ∈ R^p \ {0_p} such that Σℓ = 0_p ⇒ V(ℓ′X) = ℓ′Σℓ = 0; so that ℓ′X ≡ E(ℓ′X) = ℓ′µ with probability 1. In other words, the distribution of X is concentrated on the set
{x ∈ R^p : ℓ′x = ℓ′µ},
a hyperplane in R^p of dimension p − 1, having p-dimensional Lebesgue measure 0. In other words, singularity of the dispersion matrix implies that the multivariate distribution is itself singular with respect to Lebesgue measure.
2.3 Correlation and regression
When σ_i, σ_j > 0, the quantity ρ_ij = σ_ij/(σ_iσ_j) is called the correlation coefficient between X_i and X_j; intended to measure the degree of linear relationship between them. It lies between −1 and 1. If ρ_ij has a high absolute value, then using a linear (affine) function of one of X_i and X_j to predict the other is advisable. The best affine functions turn out to be
x_j = µ_j + β_ji(x_i − µ_i) and x_i = µ_i + β_ij(x_j − µ_j)
respectively for predicting X_j from X_i, and conversely, where β_ji = ρ_ij σ_j/σ_i and β_ij is defined analogously.
Note that ρij depends only upon σij , σii and σjj ; hence only upon the joint
marginal distribution of (Xi , Xj ) and the other components do not play any role.
from the partial derivative wrt α that would yield α = µ_1 − Σ_{j=2}^p β_jµ_j, so that
is called the multiple linear regression equation of X_1 on X_2, ..., X_p; and the variable
$$X_1 - \big(\mu_1 + \beta_{\mathrm{opt}}'(X_2 - \mu_2)\big) = X_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(X_2 - \mu_2)$$
is sometimes called the residual.
Note that the residual and the best linear predictor are actually uncorrelated:
$$\mathrm{Cov}\big(X_1 - \{\mu_1 + \beta_{\mathrm{opt}}'(X_2 - \mu_2)\},\ \mu_1 + \beta_{\mathrm{opt}}'(X_2 - \mu_2)\big) = \mathrm{Cov}(X_1,\ \Sigma_{12}\Sigma_{22}^{-1}X_2) - V(\Sigma_{12}\Sigma_{22}^{-1}X_2)$$
$$= \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22}\Sigma_{22}^{-1}\Sigma_{21} = 0$$
assuming E(X_1²) < ∞, where g varies over all measurable functions with E[(g(X_2, ..., X_p))²] < ∞.
Clearly, if we were allowed only constants for g, the minimum would occur at the constant E(X_1). Since, when X_2, ..., X_p are known, any function g(X_2, ..., X_p) of X_2, ..., X_p essentially acts like a constant so far as the conditional distribution of X_1 given X_2, ..., X_p is concerned, applying that principle we see that if we take as h the function
$$h(x_2, \ldots, x_p) = E(X_1\mid X_2 = x_2, \ldots, X_p = x_p) = \int x\, f_{X_1\mid X_2,\ldots,X_p}(x\mid x_2, \ldots, x_p)\, dx$$
In fact the LHS is called the conditional variance of X1 given X2 , . . . , Xp . Now using
the well-known fact that expectation of conditional expectations is the unconditional
expectation, it follows that h satisfies the requirement.
It is important to record that while for computing the best linear predictor, we
only need the mean vector and dispersion matrix of X, for computing the conditional
˜
mean we require typically the whole conditional distribution, which is really obtained
from the joint distribution. We shall see that in the extremely important multivariate
normal model, the two predictors actually coincide, so that the best overall predictor
of X1 actually turns out to be an affine function of X2 , . . . , Xp .
Clearly the variables other than X_i, X_j and X_{q+1}, ..., X_p have no role to play. Let us now call (X_{q+1}, ..., X_p)′ as X_2, its mean vector as µ_2, and construct the dispersion matrix of the relevant coordinates:
$$\begin{pmatrix}\sigma_{ii} & \sigma_{ij} & \sigma_{i,(q+1)} & \cdots & \sigma_{ip}\\ \sigma_{ij} & \sigma_{jj} & \sigma_{j,(q+1)} & \cdots & \sigma_{jp}\\ \sigma_{i,(q+1)} & \sigma_{j,(q+1)} & & &\\ \vdots & \vdots & & ((\sigma_{st}))_{q+1\le s,t\le p} &\\ \sigma_{ip} & \sigma_{jp} & & &\end{pmatrix} = \begin{pmatrix}\sigma_{ii} & \sigma_{ij} & \Sigma_{i2}\\ \sigma_{ij} & \sigma_{jj} & \Sigma_{j2}\\ \Sigma_{2i} & \Sigma_{2j} & \Sigma_{22}\end{pmatrix},\ \text{say},$$
noting that Σ_{22} = D(X_2). Then we know that the two residuals are T_i = X_i − (µ_i + Σ_{i2}Σ_{22}^{-1}(X_2 − µ_2)) and T_j = X_j − (µ_j + Σ_{j2}Σ_{22}^{-1}(X_2 − µ_2)), with respective variances σ_ii − Σ_{i2}Σ_{22}^{-1}Σ_{2i} and σ_jj − Σ_{j2}Σ_{22}^{-1}Σ_{2j}; and their covariance is σ_ij − Σ_{i2}Σ_{22}^{-1}Σ_{2j}, whence their correlation coefficient, called the partial correlation coefficient between X_i and X_j fixing or eliminating X_{q+1}, ..., X_p, is
$$\rho_{ij.(q+1),\ldots,p} = \frac{\sigma_{ij} - \Sigma_{i2}\Sigma_{22}^{-1}\Sigma_{2j}}{\sqrt{(\sigma_{ii} - \Sigma_{i2}\Sigma_{22}^{-1}\Sigma_{2i})(\sigma_{jj} - \Sigma_{j2}\Sigma_{22}^{-1}\Sigma_{2j})}}.$$
It is worthwhile to note that Cov(T_i, T_j) equals exactly the (i, j)th entry of the q × q matrix Σ_{11.2} := Σ_{11} − Σ_{12}Σ_{22}^{-1}Σ_{21}, where Σ has been partitioned as
$$\begin{pmatrix}\Sigma_{11\,(q\times q)} & \Sigma_{12\,(q\times(p-q))}\\ \Sigma_{21\,((p-q)\times q)} & \Sigma_{22\,((p-q)\times(p-q))}\end{pmatrix}.$$
The entries of Σ_{11.2} are written as {σ_{st.(q+1),(q+2),...,p} : 1 ≤ s, t ≤ q} and thus
$$\rho_{ij.(q+1),(q+2),\ldots,p} = \frac{\sigma_{ij.(q+1),(q+2),\ldots,p}}{\sqrt{\sigma_{ii.(q+1),(q+2),\ldots,p}\;\sigma_{jj.(q+1),(q+2),\ldots,p}}}.$$
In other words, ρ_{ij.(q+1),(q+2),...,p} is obtained in the same way from Σ_{11.2} as ρ_ij is from Σ.
It will be seen that Σ11.2 has a special significance in the context of multivariate
normal distributions: namely, it is the dispersion matrix of the conditional distribu-
tion of (X1 , . . . , Xq ) given Xq+1 , . . . , Xp , so that in that case the partial correlation
coefficients have a particular meaning: as the correlation coefficient of the conditional
distribution.
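A small NumPy sketch (added; not from the notes) computing a partial correlation directly from Σ_{11.2}; the example Σ and index choices are arbitrary.

```python
import numpy as np

def partial_corr(Sigma, i, j, given):
    """Partial correlation of X_i and X_j eliminating the variables indexed by `given`,
    computed from Sigma_{11.2} = Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21."""
    keep = [i, j]
    S11 = Sigma[np.ix_(keep, keep)]
    S12 = Sigma[np.ix_(keep, given)]
    S22 = Sigma[np.ix_(given, given)]
    S11_2 = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return S11_2[0, 1] / np.sqrt(S11_2[0, 0] * S11_2[1, 1])

Sigma = np.array([[1.0, 0.6, 0.5],
                  [0.6, 1.0, 0.5],
                  [0.5, 0.5, 1.0]])
print(partial_corr(Sigma, 0, 1, given=[2]))   # rho_{12.3}
```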
which relation also yields the following consequences of the corresponding 1-dimensional
results that we state without proof:
Theorem 2.1 (Uniqueness Theorem) If X and Y have the same chfs, then Y ∼ X.
While the definition and theory of convergence in distribution will not be covered
in this course, we nevertheless state the following analogue to the corresponding 1-
dimensional theorem:
The multivariate continuity theorem follows from its univariate version and the Cramer-
Wold device, and conversely.
Definition 1. X ∼ N_p(µ, Σ) if ∀ ℓ ∈ R^p, ℓ′X ∼ N(ℓ′µ, ℓ′Σℓ).
The N_p(0, I_p) distribution is often referred to as the standard p-variate normal distribution.
Some immediate consequences:
(1) Obviously, µ and Σ are the mean vector and dispersion matrix of X. In particular, Σ is non-negative definite.
(6) Of course, prior reference to µ and Σ is really unnecessary, and we can give the equivalent
Definition 2. X has a multivariate normal distribution if ∀ ℓ ∈ R^p, ℓ′X has a univariate normal distribution.
(7) So we can now say X has a multivariate normal distribution iff it follows N_p(E(X), D(X)).
(8) Even more generally, if C_{k×p} is a matrix and b a fixed vector in R^k, then CX + b ∼ N_k(Cµ + b, CΣC′).
(9) The chf of N_p(µ, Σ) is: φ_X(t) = φ_{t′X}(1) = exp(it′µ − ½ t′Σt).
In particular, we claim that
(10) For 1 ≤ k ≠ j ≤ p, X_k and X_j are independent iff σ_kj = 0. For then,
$$\phi_{X_k,X_j}(t_1, t_2) = \exp\Big(i\mu_k t_1 + i\mu_j t_2 - \tfrac12(\sigma_{kk}t_1^2 + \sigma_{jj}t_2^2)\Big) = \phi_{X_k}(t_1)\,\phi_{X_j}(t_2).$$
(11) Let us show that when Σ is positive definite, the distribution actually has
a strictly positive density on Rp (which would mean that the distribution is
actually mutually absolutely continuous with Lebesgue measure on Rp , B p )).
So the distribution is singular iff Σ is singular.
(12) Conditional distributions: When Σ > 0, any principal submatrix (meaning one keeping the same rows as columns) is also so. For if Σ_1 := ((σ_ij))_{i,j∈J}, where ∅ ≠ J = {j_1, j_2, ..., j_k} ⊆ {1, 2, ..., p} is one such with k < p, then for any z = (z_1, z_2, ..., z_k)′ ∈ R^k \ {0_k}, z′Σ_1z = x′Σx > 0 where x = (x_1, ..., x_p)′ ∈ R^p is obtained by putting z_i as x_{j_i} for 1 ≤ i ≤ k and taking 0 for x_i if i ∉ J. We are assuming tacitly that j_1 < ⋯ < j_k are in increasing order.
Now, since Σ_1 defined above is precisely the dispersion matrix of (X_{j_1}, X_{j_2}, ..., X_{j_k})′, whose components are a subset of the components of X, it follows that any such subset has a nonsingular normal distribution in the appropriate dimension.
Now, X_2 has (marginal) density f_{X_2}(x_2) = \frac{1}{|\Sigma_{22}|^{1/2}(2\pi)^{(p-q)/2}}\exp\{-\frac12(x_2-\mu_2)'\Sigma_{22}^{-1}(x_2-\mu_2)\}; therefore given X_2 = x_2 ∈ R^{p−q}, we have a conditional density for X_1 given by
$$f_{X_1|X_2}(x_1|x_2) = \frac{f_X(x)}{f_{X_2}(x_2)} = |\Sigma|^{-\frac12}(2\pi)^{-\frac{p}{2}}\exp\{-\tfrac12(x-\mu)'\Sigma^{-1}(x-\mu)\}\cdot|\Sigma_{22}|^{\frac12}(2\pi)^{\frac{p-q}{2}}\exp\{\tfrac12(x_2-\mu_2)'\Sigma_{22}^{-1}(x_2-\mu_2)\}$$
$$= \Big(\tfrac{|\Sigma|}{|\Sigma_{22}|}\Big)^{-\frac12}(2\pi)^{-\frac{q}{2}}\exp\Big\{-\tfrac12\Big[\begin{pmatrix}x_1-\mu_1\\ x_2-\mu_2\end{pmatrix}'\Sigma^{-1}\begin{pmatrix}x_1-\mu_1\\ x_2-\mu_2\end{pmatrix} - (x_2-\mu_2)'\Sigma_{22}^{-1}(x_2-\mu_2)\Big]\Big\}$$
for x_1 ∈ R^q, where x = \begin{pmatrix}x_1\\ x_2\end{pmatrix}.
By the expression for the determinant and inverse of a partitioned matrix in Section 1, we know |Σ| = |Σ_{11} − Σ_{12}Σ_{22}^{-1}Σ_{21}|·|Σ_{22}| = |Σ_{11.2}||Σ_{22}|, writing Σ_{11} − Σ_{12}Σ_{22}^{-1}Σ_{21} as Σ_{11.2}; and further writing Σ_{12}Σ_{22}^{-1} as B,
$$\Sigma^{-1} = \begin{pmatrix}\Sigma_{11.2}^{-1} & -\Sigma_{11.2}^{-1}B\\ -B'\Sigma_{11.2}^{-1} & \Sigma_{22}^{-1} + B'\Sigma_{11.2}^{-1}B\end{pmatrix},$$
so the exponent in the conditional density becomes, temporarily writing x_1 − µ_1 as y and x_2 − µ_2 as z,
$$-\frac12\Big[\begin{pmatrix}y\\ z\end{pmatrix}'\Sigma^{-1}\begin{pmatrix}y\\ z\end{pmatrix} - z'\Sigma_{22}^{-1}z\Big] = -\frac12\big[y'\Sigma_{11.2}^{-1}y - 2y'\Sigma_{11.2}^{-1}Bz + z'\Sigma_{22}^{-1}z + z'B'\Sigma_{11.2}^{-1}Bz - z'\Sigma_{22}^{-1}z\big]$$
$$= -\frac12\big(y - Bz\big)'\Sigma_{11.2}^{-1}\big(y - Bz\big) = -\frac12\big(x_1 - \mu_1 - B(x_2-\mu_2)\big)'\Sigma_{11.2}^{-1}\big(x_1 - \mu_1 - B(x_2-\mu_2)\big)$$
and hence the conditional density becomes
$$|\Sigma_{11.2}|^{-\frac12}(2\pi)^{-\frac{q}{2}}\exp\Big\{-\frac12\big(x_1 - \{\mu_1 + B(x_2-\mu_2)\}\big)'\Sigma_{11.2}^{-1}\big(x_1 - \{\mu_1 + B(x_2-\mu_2)\}\big)\Big\},$$
which is nothing but the density given for the N_q(µ_1 + B(x_2 − µ_2), Σ_{11.2}) distribution.
However, the final result suggests an alternative derivation that turns out to involve far fewer calculations with partitioned matrices and avoids inverting them altogether: we define
$$Y = \begin{pmatrix}Y_{1\,(q\times1)}\\ Y_{2\,((p-q)\times1)}\end{pmatrix} = \begin{pmatrix}X_1 - BX_2\\ X_2\end{pmatrix} = \begin{pmatrix}I_q & -B\\ 0_{(p-q)\times q} & I_{p-q}\end{pmatrix}\begin{pmatrix}X_1\\ X_2\end{pmatrix} = CX,\ \text{say},$$
when Y ∼ N_p(Cµ, CΣC′). Now, we write Cµ as \begin{pmatrix}\mu_1 - B\mu_2\\ \mu_2\end{pmatrix}; and simplify D(Y) = CΣC′ as
$$\begin{pmatrix}I_q & -B\\ 0 & I_{p-q}\end{pmatrix}\begin{pmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\begin{pmatrix}I_q & 0\\ -B' & I_{p-q}\end{pmatrix} = \begin{pmatrix}\Sigma_{11}-B\Sigma_{21} & \Sigma_{12}-B\Sigma_{22}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}\begin{pmatrix}I_q & 0\\ -B' & I_{p-q}\end{pmatrix} = \begin{pmatrix}\Sigma_{11}-B\Sigma_{21}-\Sigma_{12}B'+B\Sigma_{22}B' & \Sigma_{12}-B\Sigma_{22}\\ \Sigma_{21}-\Sigma_{22}B' & \Sigma_{22}\end{pmatrix}.$$
Here Σ_{11} − BΣ_{21} − Σ_{12}B′ + BΣ_{22}B′ = Σ_{11} − Σ_{12}Σ_{22}^{-1}Σ_{21} = Σ_{11.2}, since BΣ_{21} = Σ_{12}Σ_{22}^{-1}Σ_{21} and BΣ_{22}B′ = Σ_{12}Σ_{22}^{-1}Σ_{21} too, while Σ_{12} − BΣ_{22} = Σ_{12} − Σ_{12} = 0_{q×(p−q)} and Σ_{21} − Σ_{22}B′ = 0_{(p−q)×q}; so Y_1 and Y_2 are independent, with D(Y_1) = Σ_{11.2}.
(13) An important observation is that the conditional dispersion matrix Σ11.2 ; i.e.
that of the conditional distribution given Y 2 , does not involve Y 2 . It is some-
times called the residual dispersion matrix.
The form of the conditional dispersion matrix imparts the following meaning to
the partial correlation coefficients: e.g. since for 1 ≤ i ̸= j ≤ q, ρij.(q+1),(q+2),...,p
is the correlation coefficient obtained from Σ11.2 in the same way as correlation
coefficients are obtained from dispersion matrices in bivariate distributions, they
can be interpreted as conditional correlation coefficients between Xi and Xj
given Xq+1 , . . . , Xp .
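An added NumPy sketch (not part of the notes) of the conditional-distribution parameters derived above: the conditional mean µ_1 + B(x_2 − µ_2) with B = Σ_{12}Σ_{22}^{-1}, and the residual dispersion Σ_{11.2}. The example µ, Σ and index sets are arbitrary.

```python
import numpy as np

def mvn_conditional(mu, Sigma, idx1, idx2, x2):
    """Mean and dispersion of X_1 | X_2 = x2 when X ~ N_p(mu, Sigma)."""
    mu = np.asarray(mu)
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    B = S12 @ np.linalg.inv(S22)
    cond_mean = mu[idx1] + B @ (x2 - mu[idx2])
    cond_disp = S11 - B @ S12.T          # Sigma_{11.2}, free of x2
    return cond_mean, cond_disp

mu = np.array([0., 1., 2.])
Sigma = np.array([[2., .8, .4], [.8, 1., .3], [.4, .3, 1.5]])
print(mvn_conditional(mu, Sigma, idx1=[0], idx2=[1, 2], x2=np.array([1.5, 1.0])))
```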
(14) Clearly the regression matrix of X_2 on X_1 would be Σ_{21}Σ_{11}^{-1} and the conditional dispersion matrix would be Σ_{22.1} := Σ_{22} − Σ_{21}Σ_{11}^{-1}Σ_{12}. In fact, the conditional
tial correlations referred to earlier. Suppose partial correlations given a sub-
set Xq+1 , . . . , Xp of the components of a multivariate normal random vector
X = (X1 , X2 , . . . , Xp )′ are already computed where 3 ≤ q ≤ p. Then for
˜
1 ≤ i ̸= j < q, the partial correlation coefficient ρij.q,(q+1),...,p can be computed
from those already obtained by using the following procedure.
Z has been extracted as
$$\begin{pmatrix}\sigma_{xx} & \sigma_{xy}\\ \sigma_{xy} & \sigma_{yy}\end{pmatrix} - \frac{1}{\sigma_{zz}}\begin{pmatrix}\sigma_{xz}\\ \sigma_{yz}\end{pmatrix}(\sigma_{xz}\ \ \sigma_{yz}) = \begin{pmatrix}\sigma_{xx} - \frac{\sigma_{xz}^2}{\sigma_{zz}} & \sigma_{xy} - \frac{\sigma_{xz}\sigma_{yz}}{\sigma_{zz}}\\[2pt] \sigma_{xy} - \frac{\sigma_{xz}\sigma_{yz}}{\sigma_{zz}} & \sigma_{yy} - \frac{\sigma_{yz}^2}{\sigma_{zz}}\end{pmatrix},$$
which yields for the partial correlation coefficient ρ_{xy.z} of (X, Y) given Z the formula
$$\rho_{xy.z} = \frac{\sigma_{xy} - \frac{\sigma_{xz}\sigma_{yz}}{\sigma_{zz}}}{\sqrt{\sigma_{xx} - \frac{\sigma_{xz}^2}{\sigma_{zz}}}\,\sqrt{\sigma_{yy} - \frac{\sigma_{yz}^2}{\sigma_{zz}}}} = \frac{\rho_{xy} - \rho_{xz}\rho_{yz}}{\sqrt{1 - \rho_{xz}^2}\,\sqrt{1 - \rho_{yz}^2}}.$$
In fact these recursive relations still hold even if the distribution is not a mul-
tivariate normal.
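A one-line sketch of the recursion just displayed (added; not from the notes); on the example Σ used earlier it returns the same value as the Σ_{11.2}-based computation.

```python
import numpy as np

def rho_xy_given_z(r_xy, r_xz, r_yz):
    """One step of the partial-correlation recursion: correlation of (X, Y) eliminating Z."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(rho_xy_given_z(0.6, 0.5, 0.5))   # matches partial_corr(Sigma, 0, 1, given=[2]) above
```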
4.1 noncentral χ2
Let X ∼ N_p(µ, I_p). The distribution of Y = X′X = Σ_{j=1}^p X_j² is said to be a noncentral χ² with degrees of freedom p and noncentrality parameter/ncp λ := µ′µ = ∥µ∥². It is typically denoted by χ²_{p,λ}.
When µ = 0, the distribution is the central χ²_p. Thus the central one is a special case of the noncentral one: χ²_p = χ²_{p,0}.
it follows that a χ2p,λ r.v. can be expressed as the independent sum of a χ21,λ variable
and a χ2p−1 variable.
One of the most important properties of this distribution is that it can be seen as
a mixture (infinite) of central χ2 distributions. We first prove the same for Y1 := Z12 .
√
Since Z1 = ± Y 1 , we have
Now, we use the so-called duplication formula Γ(2n)Γ(½) = 2^{2n−1}Γ(n)Γ(n + ½) for gamma functions, and get
$$\sqrt{\pi}\,(2n)! = \Gamma(\tfrac12)\Gamma(2n+1) = 2n\,\Gamma(2n)\Gamma(\tfrac12) = 2^{2n}\,n\,\Gamma(n)\Gamma(n+\tfrac12) = 2^{2n}\,n!\,\Gamma(n+\tfrac12)$$
so that
$$f_{Y_1}(y) = e^{-\lambda/2}\sum_{n=0}^{\infty}\frac{\lambda^n\, e^{-y/2}\, y^{\,n+\frac12-1}}{\sqrt{2}\cdot 2^{2n}\, n!\,\Gamma(n+\frac12)} = e^{-\lambda/2}\sum_{n=0}^{\infty}\frac{(\frac{\lambda}{2})^n\, e^{-y/2}\, y^{\,n+\frac12-1}}{n!\; 2^{\,n+\frac12}\,\Gamma(n+\frac12)} = \sum_{n=0}^{\infty} e^{-\lambda/2}\frac{(\frac{\lambda}{2})^n}{n!}\; f_{\chi^2_{2n+1}}(y).$$
Correspondingly, a χ²_{p,λ} variable is the mixture Σ_{n=0}^∞ e^{−λ/2}(λ/2)^n/n! of central χ²_{p+2n} distributions, and its chf equals
$$\phi_Y(t) = \sum_{n=0}^{\infty} e^{-\lambda/2}\frac{(\frac{\lambda}{2})^n}{n!}\,\frac{1}{(1-2it)^{(p+2n)/2}} = \frac{e^{-\lambda/2}}{(1-2it)^{p/2}}\sum_{n=0}^{\infty}\frac{1}{n!}\Big(\frac{\lambda}{2(1-2it)}\Big)^{n} = \frac{e^{\frac{it\lambda}{1-2it}}}{(1-2it)^{p/2}}.$$
Clearly the function has a power series representation that converges for |2it| = 2|t| < 1, i.e. for t ∈ (−½, ½). So the above representation also yields that the moment generating function M_Y exists on (−½, ½), and is given by
$$M_Y(t) = \frac{e^{\frac{t\lambda}{1-2t}}}{(1-2t)^{p/2}}\,;\qquad |t| < \tfrac12\quad(\text{actually, at least for } t \in (-\infty, \tfrac12)).$$
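A quick numerical check of the Poisson-mixture representation (added; not from the notes), using SciPy's noncentral χ² density; the truncation at n = 60 and the particular (p, λ) are arbitrary.

```python
import numpy as np
from scipy import stats

p, lam = 3, 2.5
y = np.linspace(0.1, 20, 200)

# Poisson(lam/2) mixture of central chi-square densities with p + 2n d.f.
mix = sum(stats.poisson.pmf(n, lam / 2) * stats.chi2.pdf(y, p + 2 * n) for n in range(60))
print(np.allclose(mix, stats.ncx2.pdf(y, p, lam)))   # matches the noncentral chi2 density
```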
so the cumulant generating function (cgf) equals, at least on (−½, ½),
$$\gamma_Y(t) = \ln M_Y(t) = \frac{\lambda t}{1-2t} - \frac{p}{2}\ln(1-2t) = \lambda\sum_{n=1}^{\infty} 2^{n-1} t^n + \frac{p}{2}\sum_{n=1}^{\infty}\frac{(2t)^n}{n} = \sum_{n=1}^{\infty} 2^{n-1}\Big(\lambda + \frac{p}{n}\Big) t^n,$$
so that the n-th cumulant of Y is
$$2^{n-1}\Big(\lambda + \frac{p}{n}\Big)\, n! = 2^{n-1}(n-1)!\,(n\lambda + p).$$
all with variance 1, hence a χ²_{n_1+n_2,λ} where λ = Σ_{t=1,2}Σ_{j=1}^{n_t}{E(Z_{tj})}² = λ_1 + λ_2.
Our goal is to see that increasing either the degrees of freedom or the ncp for a
noncentral χ2 variable makes it stochastically larger. First let us state
Lemma 4.1 If Y ∼ X + Z where Z is a non-negative r.v. independent of X, then
Y ≥st X.
Proof. If n1 > n2 , get Y3 ∼ χ2n1 −n2 ,λ1 −λ2 independent of Y2 ; then Y3 ≥ 0 with
probability 1 and Y1 ∼ Y2 + Y3 by additivity.
let us define, for fixed s > 0, the function g : [0, ∞) → [0, 1] by g(c) = Φ(−s − c) + 1 − Φ(s − c) and show that it is an increasing function of c: g′(c) = −φ(−s − c) + φ(s − c) = φ(|s − c|) − φ(s + c) > 0 since 0 ≤ |s − c| < s + c. Applying this to s = √t, we obtain the needed result.
For the case when n > 1, choose Y_3 ∼ χ²_{n−1} independent of Z_1 and Z_2. Then Y_j ∼ Z_j² + Y_3. It is an easy exercise to conclude that Z_1² ≥_st Z_2² ⇒ Z_1² + Y_3 ≥_st Z_2² + Y_3, i.e. Y_1 ≥_st Y_2.
In fact, we often use the special case above. In that case, the stochastic inequality
is actually strict on (0, ∞); i.e. it is implied that if y > 0 then Prob(Y1 > y) >
Prob(Y2 > y).
A nice application concerns testing for equality of means of several normal pop-
ulations with a known common variance: suppose k ≥ 2 and for 1 ≤ j ≤ k,
(Xj1 , . . . , Xj,nj ) is a random sample of size nj > 1 from N (µj , σ 2 ) with µj unknown;
but σ is known. We wish to test for H0 : µ1 = µ2 = · · · = µk against H1 : not H0 at
a given level α ∈ (0, 1). Naturally, the samples are independent.
Let n := Σ_{j=1}^k n_j. Define µ̄ = (1/n)Σ_{j=1}^k n_jµ_j, and for 1 ≤ j ≤ k, the j-th sample mean
$$\bar{X}_j := \frac{1}{n_j}\sum_{i=1}^{n_j} X_{ji} \sim N\Big(\mu_j, \frac{\sigma^2}{n_j}\Big),$$
and note that X̄_1, X̄_2, ..., X̄_k are independent. So, the overall mean
$$\bar{X} = \frac{1}{n}\sum_{j=1}^{k} n_j\bar{X}_j = \frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{n_j} X_{ji} \sim N\Big(\bar{\mu}, \frac{\sigma^2}{n}\Big).$$
Our test statistic is Y = Σ_{j=1}^k n_j(X̄_j − X̄)²/σ², whose distribution we proceed to derive. Define
$$X := \frac{1}{\sigma}\big(\sqrt{n_1}\,\bar{X}_1, \sqrt{n_2}\,\bar{X}_2, \ldots, \sqrt{n_k}\,\bar{X}_k\big)' \sim N_k(\mu, I_k)$$
where µ := (1/σ)(√n_1µ_1, √n_2µ_2, ..., √n_kµ_k)′.
Now, (√(n_1/n), √(n_2/n), ..., √(n_k/n))′ is a vector of norm 1, so we can construct an orthogonal matrix P with this as the first row. Let Z := PX. Then Z ∼ N_k(Pµ, I_k) ⇒ X′X = Z′Z ∼ χ²_{k,λ′}, where
$$\lambda' = \sum_{j=1}^{k}\{E(Z_j)\}^2 = (P\mu)'(P\mu) = \mu'\mu,\quad\text{where } Z = (Z_1, Z_2, \ldots, Z_k)'.$$
In fact, we are actually more interested in
$$\sum_{j=2}^{k} Z_j^2 \sim \chi^2_{k-1,\lambda},\ \text{say},$$
where λ = Σ_{j=2}^k{E(Z_j)}² = λ′ − {E(Z_1)}²; both of which we now compute:
$$Z_1 = \Big(\sqrt{\tfrac{n_1}{n}}, \sqrt{\tfrac{n_2}{n}}, \ldots, \sqrt{\tfrac{n_k}{n}}\Big)X = \frac{1}{\sigma}\sum_{j=1}^{k}\frac{\sqrt{n_j}\sqrt{n_j}}{\sqrt{n}}\bar{X}_j = \frac{\sqrt{n}}{\sigma}\sum_{j=1}^{k}\frac{n_j}{n}\bar{X}_j = \frac{\sqrt{n}\,\bar{X}}{\sigma} \sim N\Big(\frac{\sqrt{n}\,\bar{\mu}}{\sigma}, 1\Big);$$
so that
$$\sum_{j=2}^{k} Z_j^2 = X'X - Z_1^2 = \frac{1}{\sigma^2}\Big\{\sum_{j=1}^{k} n_j\bar{X}_j^2 - n\bar{X}^2\Big\} = Y \sim \chi^2_{k-1,\lambda}$$
and
$$\lambda = \mu'\mu - \Big(\frac{\sqrt{n}\,\bar{\mu}}{\sigma}\Big)^2 = \frac{1}{\sigma^2}\Big\{\sum_{j=1}^{k} n_j\mu_j^2 - n\bar{\mu}^2\Big\} = \frac{\sum_{j=1}^{k} n_j(\mu_j - \bar{\mu})^2}{\sigma^2} \ge 0.$$
Then λ = 0 iff H0 is true. Thus under H0 , the distribution of Y becomes central, and
the CR {Y > y} with y := χ2k−1 (α), the upper 100α% cutoff point of the (central)
χ2k−1 distribution, then yields a size α test.
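A short sketch of this known-variance test (added; not from the notes); the function name, sample sizes and the simulated data are illustrative only.

```python
import numpy as np
from scipy import stats

def equal_means_test_known_sigma(samples, sigma, alpha=0.05):
    """samples: list of 1-d arrays, one per population; sigma: known common s.d."""
    nj = np.array([len(x) for x in samples])
    means = np.array([x.mean() for x in samples])
    grand = (nj * means).sum() / nj.sum()
    Y = (nj * (means - grand) ** 2).sum() / sigma ** 2     # ~ chi2_{k-1} under H0
    crit = stats.chi2.ppf(1 - alpha, len(samples) - 1)     # upper 100*alpha % cutoff
    return Y, crit, Y > crit

rng = np.random.default_rng(2)
data = [rng.normal(0.0, 1.0, size=n) for n in (10, 12, 15)]   # H0 true here
print(equal_means_test_known_sigma(data, sigma=1.0))
```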
4.2 noncentral t
We say that T has a noncentral t distribution with n degrees of freedom and ncp µ, and write T ∼ t_{n,µ}, if
$$T = \frac{Z}{\sqrt{Y/n}} = \sqrt{n}\,Z\,Y^{-1/2}$$
where Z ∼ N (µ, 1) and Y ∼ χ2n independently. Clearly, the central case can be
considered a particular case of the non-central one allowing µ = 0.
Various moments of T, e.g., can be worked out directly from the definition. Recall that if X has a Gamma distribution G(α, λ) then E(X^u) < ∞ iff α + u > 0, and in this case it equals Γ(α + u)/{λ^u Γ(α)}. In particular, if Y ∼ χ²_n, i.e. G(n/2, 1/2), then
$$EY^u = \frac{2^u\,\Gamma(\frac{n}{2}+u)}{\Gamma(\frac{n}{2})}\quad\text{assuming } u + \tfrac{n}{2} > 0.$$
This means E(T^m) = n^{m/2}E(Z^m)E(Y^{−m/2}) exists for all m such that −m/2 + n/2 > 0, or m < n. Then it equals n^{m/2}E(Z^m)\,\frac{\Gamma(\frac{n-m}{2})}{2^{m/2}\Gamma(\frac{n}{2})}. Assuming n > 2 and putting m = 1
and 2, we get
$$E(T) = \mu\,\sqrt{\frac{n}{2}}\;\frac{\Gamma(\frac{n-1}{2})}{\Gamma(\frac{n}{2})};$$
$$E(T^2) = (\mu^2+1)\,\frac{n}{2}\,\frac{\Gamma(\frac{n}{2}-1)}{\Gamma(\frac{n}{2})} = \frac{(\mu^2+1)\,n}{2(\frac{n}{2}-1)} = \frac{n(\mu^2+1)}{n-2}\quad\text{and so}$$
$$V(T) = E(T^2) - \{E(T)\}^2 = \frac{n(\mu^2+1)}{n-2} - \frac{n\mu^2}{2}\left(\frac{\Gamma(\frac{n-1}{2})}{\Gamma(\frac{n}{2})}\right)^2.$$
Lemma 4.2 If Tµ ∼ tn,µ with n fixed, then (i) ∀ t ∈ R, Prob(Tµ > t) is an increasing
function of µ; and (ii) ∀ t > 0, Prob(|Tµ |> t) is an increasing function of |µ|.
Proof. For (i), recall T_µ = Z_µ/√(Y/n) as defined (writing Z_µ for Z to indicate the role of
µ). Denoting by fY the density of Y , we see that
$$\mathrm{Prob}(T_\mu > t) = \mathrm{Prob}\Big(Z_\mu > t\sqrt{\tfrac{Y}{n}}\Big) = \int_0^\infty \mathrm{Prob}\Big(Z_\mu > t\sqrt{\tfrac{y}{n}}\ \Big|\ Y = y\Big) f_Y(y)\, dy$$
$$= \int_0^\infty \mathrm{Prob}\Big(Z > t\sqrt{\tfrac{y}{n}}\Big) f_Y(y)\, dy\quad\text{by independence of } Z\text{ and } Y$$
$$= \int_0^\infty \Big[1 - \Phi\Big(t\sqrt{\tfrac{y}{n}} - \mu\Big)\Big] f_Y(y)\, dy = \int_0^\infty \Phi\Big(\mu - t\sqrt{\tfrac{y}{n}}\Big) f_Y(y)\, dy$$
and the integrand is clearly an increasing function of µ for every y ∈ [0, ∞); hence so is the integral over y on [0, ∞). In fact, since Φ is strictly increasing on R, if µ_1 > µ_2 then ∀ x ∈ R, Φ(µ_1 − x) > Φ(µ_2 − x), so Prob(T_{µ_1} > t) > Prob(T_{µ_2} > t) ∀ t ∈ R.
For (ii), we have to first establish that the function in question is indeed a function
of |µ|. Again, for t > 0,
$$p(\mu) := \mathrm{Prob}(|T_\mu| > t) = \mathrm{Prob}\Big(|Z| > t\sqrt{\tfrac{Y}{n}}\Big) = \int_0^\infty \mathrm{Prob}\Big(|Z| > t\sqrt{\tfrac{y}{n}}\ \Big|\ Y = y\Big) f_Y(y)\, dy$$
$$= \int_0^\infty \mathrm{Prob}\Big(|Z| > t\sqrt{\tfrac{y}{n}}\Big) f_Y(y)\, dy\quad\text{(by independence)}$$
$$= \int_0^\infty \Big[1 - \Phi\Big(t\sqrt{\tfrac{y}{n}} - \mu\Big) + \Phi\Big(-t\sqrt{\tfrac{y}{n}} - \mu\Big)\Big] f_Y(y)\, dy = \int_0^\infty \Big[\Phi\Big(\mu - t\sqrt{\tfrac{y}{n}}\Big) + 1 - \Phi\Big(\mu + t\sqrt{\tfrac{y}{n}}\Big)\Big] f_Y(y)\, dy,$$
which is symmetric in µ, so p(µ) = p(−µ) is indeed a function of |µ|. Moreover p is differentiable on R, and
$$p'(\mu) = \int_0^\infty \Big[\phi\Big(\mu - t\sqrt{\tfrac{y}{n}}\Big) - \phi\Big(\mu + t\sqrt{\tfrac{y}{n}}\Big)\Big] f_Y(y)\, dy$$
and for µ ≥ 0, since t ≥ 0, for each y ≥ 0 we have
$$\Big|\mu - t\sqrt{\tfrac{y}{n}}\Big| \le \mu + t\sqrt{\tfrac{y}{n}} \implies \phi\Big(\mu - t\sqrt{\tfrac{y}{n}}\Big) \ge \phi\Big(\mu + t\sqrt{\tfrac{y}{n}}\Big),$$
meaning the integrand, hence the integral over y on [0, ∞), is also non-negative. So
p is increasing on [0, ∞).
In fact, here too, both the functions can be shown to be strictly increasing. This
we leave as an exercise.
4.3 Noncentral F distribution
$$F = \frac{Y_1/m}{Y_2/n}\quad\text{where } Y_1 \sim \chi^2_{m,\lambda}\text{ and } Y_2 \sim \chi^2_n\text{ independently.}$$
From similar calculations as for the noncentral t, E(F^u) < ∞ iff −m/2 < u < n/2, and it is given then by
$$E(F^u) = \frac{n^u}{m^u}\,E(Y_1^u)\,E(Y_2^{-u}) = \frac{n^u}{m^u}\,E(Y_1^u)\,\frac{\Gamma(\frac{n}{2}-u)}{2^u\,\Gamma(\frac{n}{2})}.$$
In particular,
$$E(F) = \frac{n}{m}(m+\lambda)\,\frac{\Gamma(\frac{n}{2}-1)}{2\,\Gamma(\frac{n}{2})} = \frac{(m+\lambda)\,n}{m}\cdot\frac{1}{2(\frac{n}{2}-1)} = \frac{(m+\lambda)\,n}{m(n-2)},\quad n > 2,$$
$$E(F^2) = \big(2m + 4\lambda + (m+\lambda)^2\big)\,\frac{n^2}{m^2}\,\frac{\Gamma(\frac{n}{2}-2)}{4\,\Gamma(\frac{n}{2})} = \frac{n^2\,(m^2 + 2m\lambda + \lambda^2 + 2m + 4\lambda)}{m^2}\cdot\frac{1}{4(\frac{n}{2}-1)(\frac{n}{2}-2)} = \frac{n^2\,(m^2 + 2m\lambda + \lambda^2 + 2m + 4\lambda)}{m^2(n-2)(n-4)},\quad n > 4,$$
$$V(F) = E(F^2) - \{E(F)\}^2,\quad n > 4.$$
Again, we can derive stochastic ordering and consequences for tests from that.
Suppose m, n are fixed, and Fλ ∼ Fm,n,λ for λ ≥ 0. Then, if λ1 > λ2 ≥ 0 we claim
Fλ1 ≥st Fλ2 . Essentially, this amounts to showing that ∀x > 0, g(λ) := Prob(Fλ > x)
is an increasing function of λ. Writing F_λ = (Y_λ/m)/(Y/n) with a slight modification of the notation in the definition, and arguing as in the case of the noncentral t distribution, we have
$$g(\lambda) = \mathrm{Prob}\Big(Y_\lambda > \frac{mxY}{n}\Big) = \int_0^\infty \mathrm{Prob}\Big(Y_\lambda > \frac{mxy}{n}\ \Big|\ Y = y\Big) f_Y(y)\, dy = \int_0^\infty \mathrm{Prob}\Big(Y_\lambda > \frac{mxy}{n}\Big) f_Y(y)\, dy$$
and the integrand being an increasing function of λ for every fixed y > 0, so is the
integral over y on [0, ∞) as earlier. Again, it is left as an exercise to show the function
g is actually strictly increasing on [0, ∞).
The consequence is in terms of the power function of the ANOVA F -test for equal-
ity of means in one-way classified data: let k ≥ 2 and for 1 ≤ j ≤ k, Xj1 , . . . , Xj,nj
a random sample of size nj > 1 from N (µj , σ 2 ) with µj unknown; as is σ. Our aim
is again to test for H0 : µ1 = µ2 = · · · = µk against H1 : not H0 at a given level
α ∈ (0, 1).
As in the case when σ was known, put n := Σ_{j=1}^k n_j, µ̄ = (1/n)Σ_{j=1}^k n_jµ_j, and for 1 ≤ j ≤ k, X̄_j := (1/n_j)Σ_{i=1}^{n_j}X_{ji} ∼ N(µ_j, σ²/n_j), X̄ = (1/n)Σ_{j=1}^k n_jX̄_j = (1/n)Σ_{j=1}^kΣ_{i=1}^{n_j}X_{ji}, and the 'between sum of squares'
$$Y_1 = \frac{\sum_{j=1}^{k} n_j(\bar{X}_j - \bar{X})^2}{\sigma^2},$$
whose distribution we know is χ²_{k−1,λ} with λ = Σ_{j=1}^k n_j(µ_j − µ̄)²/σ².
However, σ being unknown, we use the 'within sum of squares' to estimate it: for 1 ≤ j ≤ k, put S_j² := Σ_{i=1}^{n_j}(X_{ji} − X̄_j)², so that S_j²/σ² ∼ χ²_{n_j−1} independently of X̄_j and among themselves; and so Y_2 = (1/σ²)Σ_{j=1}^kS_j² satisfies
$$Y_2 \sim \chi^2_{\sum_{j=1}^{k}(n_j-1)} = \chi^2_{n-k}$$
Thus F := [Y_1/(k−1)]/[Y_2/(n−k)] ∼ F_{k−1,n−k,λ} and λ = 0 iff H_0 is true. So the C.R. {F > f}
where f = Fk−1,n−k (α), the upper 100α% cutoff point of the central F distribution
with (k − 1, n − k) d.f., is of size α. From what we have justified above, this test
is (in fact strictly) unbiased since any non-null distribution of the test statistic is
a noncentral F with the same d.f.; thus stochastically (strictly) larger than its null
central distribution.
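An added sketch of the one-way ANOVA F statistic just described (not from the notes); the between/within sums of squares follow the formulas above, and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

def one_way_anova(samples, alpha=0.05):
    k = len(samples)
    nj = np.array([len(x) for x in samples])
    n = nj.sum()
    means = np.array([x.mean() for x in samples])
    grand = np.concatenate(samples).mean()
    between = (nj * (means - grand) ** 2).sum()                 # sigma^2 * Y1
    within = sum(((x - x.mean()) ** 2).sum() for x in samples)  # sigma^2 * Y2
    F = (between / (k - 1)) / (within / (n - k))                # ~ F_{k-1, n-k} under H0
    return F, stats.f.ppf(1 - alpha, k - 1, n - k)

rng = np.random.default_rng(3)
data = [rng.normal(m, 2.0, size=12) for m in (0.0, 0.0, 1.5)]   # H0 false here
print(one_way_anova(data))
```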
Lemma 4.3 If X has a nonsingular p-variate normal distribution and A_{p×p} is symmetric with Prob(X′AX ≥ 0) = 1, then A is non-negative definite.
Next, we require a lemma that we may call the identifiability of linear combinations of independent χ² variables distributed as a χ²:
Lemma 4.4 If Y_1, ..., Y_m are independent with Y_j ∼ χ²_{k_j,λ_j}, δ_1, ..., δ_m ≠ 0 and Y := Σ_{j=1}^m δ_jY_j ∼ χ²_{k,λ}, then k = Σ_{j=1}^m k_j, λ = Σ_{j=1}^m λ_j, and δ_1 = ⋯ = δ_m = 1.
Proof. First, from the nonnegativity of Y it follows that δ_j > 0 for all j. (If there were a negative δ_l then the probability of Y < 0, which equals Prob(δ_lY_l < −Σ_{j≠l}δ_jY_j), would be positive, because the integrand is strictly positive for every choice of {y_j : j ≠ l}, since the support of Y_l is all of (0, ∞).)
Next, comparing cfs,
$$\frac{e^{\frac{it\lambda}{1-2it}}}{(1-2it)^{k/2}} = \prod_{j=1}^{m}\frac{e^{\frac{it\delta_j\lambda_j}{1-2it\delta_j}}}{(1-2it\delta_j)^{k_j/2}}.$$
Now since the LHS has a finite (in fact real) non-zero limit as |t| → ∞, so does the RHS which is a rational function; so the degrees of the polynomials in the numerator and denominator must match, whence k = Σ_{j=1}^m k_j. Further, actually evaluating the
Further, the cgf of δ_jY_j is defined, and given by a power series, at least on (−1/(2δ_j), 1/(2δ_j)). So at least for |t| < min_j(1/(2δ_j)) ∧ ½, the cgf γ_Y of Y = Σ_{j=1}^m δ_jY_j equals
$$\sum_{n=1}^{\infty} 2^{n-1}t^n\Big(\lambda + \frac{k}{n}\Big) = \gamma_Y(t) = \sum_{j=1}^{m}\gamma_{Y_j}(\delta_j t) = \sum_{n=1}^{\infty} 2^{n-1}t^n\sum_{j=1}^{m}\delta_j^n\Big(\lambda_j + \frac{k_j}{n}\Big),$$
and comparing the cumulants by matching coefficients on both sides, we see that for all n ≥ 1,
$$\lambda + \frac{k}{n} = \sum_{j=1}^{m}\delta_j^n\Big(\lambda_j + \frac{k_j}{n}\Big).$$
The LHS is obviously bounded (in n), so must therefore the RHS be; hence each δ_j must be at most 1. So let δ_1 = δ_2 = ⋯ = δ_r = 1 and δ_j ∈ (0, 1) for r + 1 ≤ j ≤ m; we shall show r = m.
We have, ∀ n,
$$\lambda + \frac{k}{n} = \sum_{j=1}^{r}\Big(\lambda_j + \frac{k_j}{n}\Big) + \sum_{j=r+1}^{m}\delta_j^n\Big(\lambda_j + \frac{k_j}{n}\Big),$$
and taking the limit as n → ∞, we get λ = Σ_{j=1}^r λ_j; plugging back, ∀ n,
$$k + n\lambda = \sum_{j=1}^{r}k_j + n\lambda + \sum_{j=r+1}^{m}\delta_j^n(k_j + n\lambda_j)$$
whence, recalling k = Σ_{j=1}^m k_j, we have ∀ n,
$$\sum_{j=r+1}^{m}k_j = \sum_{j=r+1}^{m}\delta_j^n(k_j + n\lambda_j) \to 0\ \text{as}\ n \to \infty.$$
It follows that r = m and λ = Σ_{j=1}^r λ_j = Σ_{j=1}^m λ_j.
Proof. We have from the spectral decomposition of A that A = P_{p×m}D_{m×m}P′_{m×p} where m = r(A), D = diag(d_1, ..., d_m) with the non-zero eigenvalues of A, and the columns of P := [P_1 | P_2 | ... | P_m] are orthonormal. Then with Z_{m×1} := P′X = (Z_1, ..., Z_m)′, say, we see that D(Z) = P′P = I_m. Thus Z_1, Z_2, ..., Z_m are independent normal, each with variance 1, and so Z_1², Z_2², ..., Z_m² are independent χ² variables, each with d.f. 1. Now
$$Y = Z'DZ = \sum_{j=1}^{m} d_jZ_j^2 \sim \chi^2_{k,\lambda},$$
whence it follows from Lemma 4.4 that k = m = r(A) and d_1 = d_2 = ⋯ = d_m = 1, so that A = PP′ is idempotent; hence a projection. Finally we identify λ = Σ_{j=1}^m{E(Z_j)}² = [E(Z)]′[E(Z)] = (P′µ)′(P′µ) = µ′Aµ.
˜ ˜ ˜ ˜ ˜ ˜ ˜ ˜
The next preparatory result concerns decompositions of the total sum of squares:
(3) Ai Aj = 0 whenever i ̸= j
For (1) ⇒ (3), get spectral decomposition Aj = Pj Dj P′j for each j; permuted
so that Dj = diag(dj1 , dj2 , . . . , djnj , 0, . . . , 0) whence Aj = Qj ∆j Q′j where Pj =
[Qj |Rj ] and ∆j = diag(dj1 , dj2 , . . . , djnj ) for each j. Note that the columns of Pj are
orthonormal vectors, so that ∀ j, Q′j Qj = Inj .
$$\Delta_{n\times n} := \mathrm{diag}(d_{11}, \ldots, d_{1n_1}, d_{21}, \ldots, d_{2n_2}, \ldots, d_{k1}, \ldots, d_{kn_k}) = \begin{pmatrix}\Delta_1 & 0 & \cdots & 0\\ 0 & \Delta_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \Delta_k\end{pmatrix},$$
so that
$$I_n = \sum_j A_j = \sum_j Q_j\Delta_jQ_j' = Q\Delta Q',\quad\text{where } Q := [Q_1 | Q_2 | \cdots | Q_k].$$
This shows Q is non-singular; so premultiply by Q^{-1} and postmultiply by Q′^{-1} to get ∆ = Q^{-1}(Q′)^{-1} = (Q′Q)^{-1}, i.e.
$$\begin{pmatrix}\Delta_1^{-1} & 0 & \cdots & 0\\ 0 & \Delta_2^{-1} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \Delta_k^{-1}\end{pmatrix} = \begin{pmatrix}Q_1'Q_1 & Q_1'Q_2 & \cdots & Q_1'Q_k\\ Q_2'Q_1 & Q_2'Q_2 & \cdots & Q_2'Q_k\\ \vdots & \vdots & \ddots & \vdots\\ Q_k'Q_1 & Q_k'Q_2 & \cdots & Q_k'Q_k\end{pmatrix} = \begin{pmatrix}I_{n_1} & Q_1'Q_2 & \cdots & Q_1'Q_k\\ Q_2'Q_1 & I_{n_2} & \cdots & Q_2'Q_k\\ \vdots & \vdots & \ddots & \vdots\\ Q_k'Q_1 & Q_k'Q_2 & \cdots & I_{n_k}\end{pmatrix}.$$
Comparing blocks, ∆_j^{-1} = I_{n_j}, i.e. ∆_j = I_{n_j} (so each A_j = Q_jQ_j′ is idempotent), and Q_i′Q_j = 0 for i ≠ j, whence A_iA_j = Q_i∆_iQ_i′Q_j∆_jQ_j′ = 0, proving (3).
To show (2)⇒(1), just note that since for each j, A_j is idempotent, r(A_j) = tr(A_j), whence Σ_{j=1}^k r(A_j) = Σ_{j=1}^k tr(A_j) = tr(Σ_{j=1}^k A_j) = tr(I_n) = n.
A significant sidelight that emerges from the proof is that under the hypotheses of the lemma, we have for each j = 1(1)k,
$$Q'A_jQ = \begin{pmatrix}Q_1'\\ \vdots\\ Q_k'\end{pmatrix}[A_jQ_1|\cdots|A_jQ_j|\cdots|A_jQ_k] = \begin{pmatrix}Q_1'\\ \vdots\\ Q_k'\end{pmatrix}[0|\cdots|A_jQ_j|\cdots|0] = \begin{pmatrix}0 & \cdots & Q_1'A_jQ_j & \cdots & 0\\ \vdots & \ddots & \vdots & \ddots & \vdots\\ 0 & \cdots & Q_j'A_jQ_j & \cdots & 0\\ \vdots & \ddots & \vdots & \ddots & \vdots\\ 0 & \cdots & Q_k'A_jQ_j & \cdots & 0\end{pmatrix} = \begin{pmatrix}0 & \cdots & 0 & \cdots & 0\\ \vdots & \ddots & \vdots & \ddots & \vdots\\ 0 & \cdots & I_{n_j} & \cdots & 0\\ \vdots & \ddots & \vdots & \ddots & \vdots\\ 0 & \cdots & 0 & \cdots & 0\end{pmatrix}.$$
Theorem 4.1 (Fisher-Cochran Theorem) Suppose X ∼ N_n(µ, I_n) and I_n = A_1 + ⋯ + A_k with each A_j symmetric. Then (1)–(3) are equivalent to
(4) Y_j := X′A_jX ∼ χ²_{n_j,λ_j} for some n_j and λ_j, for j = 1, 2, ..., k. In this case, Y_1, Y_2, ..., Y_k are independent, and n_j = r(A_j) while λ_j = µ′A_jµ for each j.
Proof. First we show (1)–(3) ⟹ (4). Writing r(A_j) = m_j and A_j = Q_jQ_j′ where (Q_j)_{n×m_j} has orthonormal columns, define as before Q_{n×n} := [Q_1 | Q_2 | ⋯ | Q_k], which we now know is orthogonal. Therefore Z := Q′X ∼ N_n(Q′µ, I_n). Writing
$$Z = \begin{pmatrix}(Q_1'X)_{m_1\times1}\\ (Q_2'X)_{m_2\times1}\\ \vdots\\ (Q_k'X)_{m_k\times1}\end{pmatrix} = \begin{pmatrix}(Z_1)_{m_1\times1}\\ (Z_2)_{m_2\times1}\\ \vdots\\ (Z_k)_{m_k\times1}\end{pmatrix},\ \text{say, we see for each } j,$$
$$Y_j = X'A_jX = (Q_j'X)'(Q_j'X) = Z_j'Z_j \sim \chi^2_{m_j,\lambda_j}$$
with λ_j = [E(Z_j)]′[E(Z_j)] = (Q_j′µ)′(Q_j′µ) = µ′A_jµ. Independence follows from the fact that for i ≠ j, Cov(Z_i, Z_j) = Q_i′Q_j = 0_{m_i×m_j} (∵ Q is orthogonal), so that Z_i and Z_j are independent, whence so are Y_i = Z_i′Z_i and Y_j = Z_j′Z_j.
˜ ˜ ˜ ˜ ˜
(4) =⇒ (2): the idempotence of Aj for each j follows from Corollary 4.2.
Actually, the idempotence of the quadratic form that was seen to be necessary in
Corollary 4.2 we can now prove to be sufficient too, so we have a converse and can
conclude
Proof. The ‘Only if’ part was Corollary 4.2. Now if A is idempotent, then so is
Ip − A and I = A + (Ip − A); so the theorem applies with k = 2.
An interesting fact is that the result is in a way symmetric with respect to the two summands Y_1 := X′AX and Y_2 := X′(I − A)X comprising the total sum of squares X′X, a χ² variable, except for the parameters: Y_1 has a χ² distribution iff so does Y_2.
The same fact also holds for any pair of non-negative random variables adding to
a χ2 :
which means by Lemma 4.3 that the above matrix Q = ((q_ij)) is nnd. Since any principal submatrix of an nnd matrix has to be nnd, −B is nnd; but then because P′A_2P was also a projection matrix and thereby nnd, so is B. So B = 0_{(p−m)×(p−m)}. An easy conclusion is that C = 0_{m×(p−m)}: for if not, say q_ij ≠ 0 for some 1 ≤ i ≤ m and m+1 ≤ j ≤ p, then because q_jj = 0, with z = (0, ..., 0, x, 0, ..., 0, y, 0, ..., 0)′ ∈ R^p (x in the i-th and y in the j-th position),
$$z'Qz = q_{ii}x^2 + 2q_{ij}xy$$
cannot be nonnegative for every choice of x and y; in fact for any choice of x ≠ 0, choosing y < −q_{ii}x²/(2q_{ij}x) will do the job.
So Y_2 = Z_1′AZ_1 ∼ χ²_{n,λ_2} ⇒ A is idempotent of rank n, because D(Z_1) = I_m. By assumption Y_2 is not identically equal to Y_1 = Z_1′Z_1, so n < m and I_m − A is idempotent of rank m − n; so Y_1 − Y_2 = Z_1′(I_m − A)Z_1 ∼ χ²_{m−n,λ} for some λ, independently of Y_2, and as λ_2 + λ = λ_1, λ = λ_1 − λ_2.
Essentially, the proof rests on establishing that Y2 as a quadratic form only involves
those variables whose sum of squares gives Y1 .
[Exercise. Can you think of a direct proof of this claim: if A, B are projections and A − B is nnd, then A − B is also a projection? Hint: first show that S := C(B) ⊆ C(A) =: T; for if not, then finding v ∈ S \ T we have Bv = v but u := Av ≠ v ⇒ u − v ≠ 0, so that v′(A − B)v = v′(u − v) = (u − (u − v))′(u − v) = −∥u − v∥² < 0, since u ⊥ (v − u).]
˜ ˜ ˜
Recall the simultaneous diagonalizability of matrices yielding independent χ² variables as quadratic forms. More generally, just two quadratic forms Y_1 = X′A_1X and Y_2 = X′A_2X that are independent χ² variables are given by simultaneously diagonalizable matrices A_1 and A_2. To see this, decompose the total sum of squares X′X as Y_1 + Y_2 + [X′X − (Y_1 + Y_2)] = X′[A_1 + A_2 + (I − A_1 − A_2)]X. Of course, independence simply manifests as A_1A_2 = 0.
4.5 Residual SS in Linear Models; restricted case and testing
linear hypotheses
Before beginning this section let us recall what projections mean geometrically: the projection y of a vector x on a subspace S represents the vector in S nearest to x; i.e.
$$\|y - x\|^2 = \min_{z\in S}\|z - x\|^2.$$
For any choice of z ∈ S, since (y − x) ⊥ S,
$$\|z - x\|^2 = \|(z - y) + (y - x)\|^2 = \|y - x\|^2 + \|z - y\|^2 + 2\langle z - y,\, y - x\rangle = \|y - x\|^2 + \|z - y\|^2 \quad(\because z - y \in S,\ (z - y)\perp(y - x))$$
$$\ge \|y - x\|^2.$$
˜ ˜
In the Gauss-Markov linear model Y_{n×1} = X_{n×k}β_{k×1} + ε_{n×1} with ε ∼ N_n(0, σ²I_n), the LSE of β is
$$\hat{\beta} := \arg\min_{\beta}\|Y - X\beta\|^2$$
and the minimum R_0² := min_β ∥Y − Xβ∥² = ∥Y − Xβ̂∥² is the residual sum of squares. Since {Xβ : β ∈ R^k} = C(X), the minimum is attained when Xβ̂ is the projection of Y on C(X). Thus Xβ̂ = P_{C(X)}(Y) = X(X′X)⁻X′Y; and
$$R_0^2 = \|(I_n - P_{C(X)})Y\|^2 = Y'AY,\ \text{say},$$
We sometimes need to restrict the choice of the parameter β to satisfy a linear constraint of the form Hβ = γ for some given matrix H_{r×k} and vector γ ∈ C(H).
Typically, this situation arises when we set this constraint up as a null hypothesis H_0 and wish to test it. Usually, we also assume R(H) ⊆ R(X), ensuring that the components of Hβ are estimable.
˜
The restricted residual SS is R_1² := min_{β: Hβ=γ}∥Y − Xβ∥². Fixing any β_0 with Hβ_0 = γ, the minimum is attained when Xβ̂ is the projection of Y − Xβ_0 on the following subspace of R^n:
$$W := \{X\beta : H\beta = 0\} = \{X\beta : \beta \in S_0\},\quad\text{with}\quad R_1^2 = \|(I_n - P_W)(Y - X\beta_0)\|^2.$$
Now Y − Xβ_0 can easily be seen to have dispersion matrix σ²I_n too, and mean vector ν = Xβ − Xβ_0; whence R_1²/σ² follows a χ² with df n − dim(W) and ncp λ := ν′(I_n − P_W)ν/σ².
So we need to identify the dimension of W, the range of the restriction of X to S_0 = N(H). Note that
$$S_0 \cap N(X) = \{\beta : X\beta = 0_n\}\cap\{\beta : H\beta = 0_r\} = N\begin{pmatrix}X\\ H\end{pmatrix}.$$
So
$$\dim(W) = \dim(S_0) - \dim(S_0\cap N(X)) = n(H) - n\begin{pmatrix}X\\ H\end{pmatrix} = \{k - r(H)\} - \Big\{k - r\begin{pmatrix}X\\ H\end{pmatrix}\Big\} = r\begin{pmatrix}X\\ H\end{pmatrix} - r(H),$$
which simplifies to r(X) − r(H) if R(H) ⊆ R(X), as is normally assumed. Thus the df is n − dim(W) = n − (r(X) − r(H)).
Also, since R_1² ≥ R_0², (R_1² − R_0²)/σ² must have a χ² distribution independently of R_0², by the result proved earlier. Its df is n − (r(X) − r(H)) − (n − r(X)) = r(H); and it follows that
$$F := \frac{(R_1^2 - R_0^2)/r(H)}{R_0^2/(n - r(X))}$$
has an F distribution, with ncp λ = ν′(I_n − P_W)ν/σ².
Note that in the context of testing, when the null hypothesis H_0 : Hβ = γ is true, β − β_0 ∈ N(H); so ν = X(β − β_0) ∈ W ⟹ λ = 0. Thus the F-test applies and becomes unbiased.
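An added sketch of the F test for a general linear hypothesis (not from the notes). It assumes X has full column rank and H full row rank, and computes the restricted LSE via the standard Lagrange-multiplier adjustment rather than the projection formulation above; the simulated design and hypothesis are arbitrary.

```python
import numpy as np
from scipy import stats

def general_linear_hypothesis_F(Y, X, H, gamma):
    """F statistic for H0: H beta = gamma in Y = X beta + eps (full-rank X and H)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    R0 = ((Y - X @ beta_hat) ** 2).sum()                        # unrestricted residual SS
    M = H @ XtX_inv @ H.T
    beta_r = beta_hat - XtX_inv @ H.T @ np.linalg.solve(M, H @ beta_hat - gamma)
    R1 = ((Y - X @ beta_r) ** 2).sum()                          # restricted residual SS
    n, k = X.shape
    r = H.shape[0]
    F = ((R1 - R0) / r) / (R0 / (n - k))                        # ~ F_{r, n-k} under H0
    return F, stats.f.sf(F, r, n - k)                           # statistic and p-value

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(40), rng.standard_normal((40, 2))])
Y = X @ np.array([1.0, 2.0, 2.0]) + rng.standard_normal(40)
print(general_linear_hypothesis_F(Y, X, H=np.array([[0., 1., -1.]]), gamma=np.array([0.])))
```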
B′AB is idempotent ⇔ B′AB · B′AB = B′AB, or equivalently, ΣA is idempotent. Clearly, the df equals r(B′AB) = r(A) and the ncp is given by [E(Y)]′(B′AB)[E(Y)] = µ′Aµ.
For independence of two such, say X′A_1X and X′A_2X, note that we need B′A_1BB′A_2B = 0 ⇔ A_1ΣA_2 = 0.
5 Sampling from a normal population
5.1 Distribution of sample mean vector & SSSP matrix: in-
dependence & the Wishart distribution
The sample dispersion matrix is defined as S = A/N or A/(N − 1) according to context.
Recall that in one dimension, X̄ ∼ N(µ, σ²/N) and A/σ² ∼ χ²_{N−1} independently. We establish analogous properties here too.
So Z_1, Z_2, ..., Z_{N−1} are iid N_p(0_p, Σ) and Z_N ∼ N_p(√N µ, Σ) independently of them.
Thus the corrected sample SSSP matrix A of a random sample of size N from a
p-dimensional normal population with mean vector µ and dispersion matrix Σ, has a
Wp (N − 1, Σ) distribution. We shall write n := N − 1 and note that N > p ⇔ n ≥ p.
Before proceeding further we make a crucial change in the notation. Since the
distribution of A depends only on the first N − 1 columns of Z; in fact A is a function
of those alone; we drop the last column and redefine Z as
Proof. Put L := [ℓ_1, ℓ_2, ..., ℓ_m] and Y_j = Zℓ_j for 1 ≤ j ≤ m, so that
$$\mathbb{Y}_{p\times m} := [Y_1 | \ldots | Y_m] = ZL.$$
If we write Y_{mp×1} = [Y_1′ | ... | Y_m′]′, then
$$Y = (L'\otimes I_p)Z \sim N_{mp}\big(0_{mp},\ (L'\otimes I_p)(I_n\otimes\Sigma)(L'\otimes I_p)'\big),$$
i.e. D(Y) = (L′ ⊗ I_p)(L ⊗ Σ) = (L′L) ⊗ Σ = I_m ⊗ Σ.
[A possibly more transparent alternative proof goes as follows: clearly Y has an mp-dimensional normal distribution with mean vector 0; to identify the covariances we write Z = ((Z_tj))_{1≤t≤p, 1≤j≤n} and Y = ((Y_tj))_{1≤t≤p, 1≤j≤m} with Y_tj = Σ_{u=1}^n Z_tu(ℓ_j)_u, so that
$$E(Y_{tj}Y_{sj}) = \sum_{u,v}E\big(Z_{tu}(\ell_j)_u(\ell_j)_vZ_{sv}\big) = \sum_{u}(\ell_j)_u^2\,E(Z_{tu}Z_{su})\ (\because u\ne v \Rightarrow E(Z_{tu}Z_{sv}) = 0) = \sigma_{st}\|\ell_j\|^2 = \sigma_{st}\ (\because \|\ell_j\| = 1)\ \forall j$$
for 1 ≤ s, t ≤ p; and independence follows since if i ≠ j then
$$E(Y_{ti}Y_{sj}) = \sum_{u,v}E\big(Z_{tu}(\ell_i)_u(\ell_j)_vZ_{sv}\big) = \sum_{u}(\ell_i)_u(\ell_j)_u\,E(Z_{tu}Z_{su}) = \sigma_{ts}\,(\ell_i'\ell_j) = 0\ (\because \ell_i\perp\ell_j)$$
for 1 ≤ s, t ≤ p.]
Recall that if A ∼ W_p(n, Σ) then A = ZZ′ where Z_{p×n} = [Z_1 | Z_2 | ... | Z_n] with Z_1, Z_2, ..., Z_n ∼ N_p(0_p, Σ) iid. Assume n ≥ p and Σ > 0.
(1) Let us first show that in this case the Wp (n, Σ) Wishart distributed matrix A
is nonsingular, i.e. positive definite, with probability 1 assuming Σ is p.d.
But if any non-null linear combination of the rows of Z were to yield null vec-
tor, the same linear combination of the rows of C would do so too, rendering
C singular which happens only with probability 0. So with probability 1, the
rows of Z are l.i. implying Z to have rank p.
(3) E(A) = Σ_{j=1}^n E(Z_jZ_j′) = nΣ: this means that in the context of estimation of Σ, A/n is unbiased for Σ.
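A quick simulation of property (3) (added; not from the notes), drawing A both by the definition A = ZZ′ and via SciPy's Wishart sampler; the particular Σ, n and replication count are arbitrary.

```python
import numpy as np
from scipy import stats

p, n, reps = 3, 10, 4000
Sigma = np.array([[2., .5, .3], [.5, 1., .2], [.3, .2, 1.5]])
rng = np.random.default_rng(5)

# A = Z Z' with the columns Z_1,...,Z_n iid N_p(0, Sigma): the definition of W_p(n, Sigma)
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=(reps, n))     # reps x n x p
A = np.einsum('rni,rnj->rij', Z, Z)                                 # reps Wishart draws
print(np.round(A.mean(axis=0) / n, 2))                              # ~ Sigma: A/n is unbiased
print(np.round(stats.wishart.rvs(df=n, scale=Sigma, size=reps, random_state=5).mean(axis=0) / n, 2))
```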
in the usual fashion, we can take Z and Y independent, put W := [Z | Y]_{p×(m+n)} with iid N_p(0, Σ) columns, and A + B = ZZ′ + YY′ = WW′ ∼ W_p(m + n, Σ).
(5) For any k and any k × p matrix H, HAH′ ∼ W_k(n, HΣH′): let Y_j = HZ_j ∼ N_k(0_k, HΣH′), iid for 1 ≤ j ≤ n, and Y = [Y_1 | ... | Y_n] = HZ; then HAH′ = YY′.
(6) In particular, assuming Σ > 0, for any ℓ ∈ R^p \ {0_p}, ℓ′Aℓ ∼ W_1(n, ℓ′Σℓ), i.e. ℓ′Aℓ/ℓ′Σℓ ∼ χ²_n.
(7) The identifiability lemma, i.e. Lemma 4.4, extends in a nice way to independent Wishart matrices. If A_1, A_2, ..., A_k are independent with A_j ∼ W_p(n_j, Σ) and δ_1, δ_2, ..., δ_k ≠ 0 with A := Σ_{j=1}^k δ_jA_j ∼ W_p(n, Σ), then δ_1 = ⋯ = δ_k = 1 and n = n_1 + ⋯ + n_k. That is because for any ℓ ∈ R^p \ {0_p}, we have Y_j := ℓ′A_jℓ/ℓ′Σℓ ∼ χ²_{n_j}, and Y := Σ_{j=1}^k δ_jY_j = ℓ′Aℓ/ℓ′Σℓ ∼ χ²_n.
[added 02.06.2023] Even if A = Σ_{j=1}^k δ_jA_j ∼ W_p(n, Φ), we can prove Φ = cΣ for some c > 0 and δ_j = c ∀ j. First, note cA ∼ W_p(n, cΣ) for any c > 0. Again fixing ℓ ∈ R^p \ {0_p} and defining Y_j = ℓ′A_jℓ/ℓ′Σℓ ∼ χ²_{n_j}, we have
$$Y := \frac{\ell'A\ell}{\ell'\Phi\ell} = \sum_{j=1}^{k}\delta_j\,\frac{\ell'\Sigma\ell}{\ell'\Phi\ell}\,Y_j \sim \chi^2_n,$$
and so n = Σ_{j=1}^k n_j and δ_j ℓ′Σℓ/ℓ′Φℓ = 1 ∀ j = 1, ..., k. But fixing any j, ℓ′Σℓ/ℓ′Φℓ can be a constant c_j := 1/δ_j free of ℓ only if Σ = c_jΦ; but obviously then c_j cannot vary with j; so let c_j ≡ 1/c. Thus δ_j = c for each j and Φ = (1/c_j)Σ = cΣ.
For the 'if' part, let r := r(C); write (1/c)C = QQ′ where Q_{n×r} = [Q_1 | ... | Q_r] has orthonormal columns. Take Y_j = ZQ_j, 1 ≤ j ≤ r; then Y_1, Y_2, ..., Y_r are iid N_p(0, Σ). So writing Y = [Y_1 | ⋯ | Y_r], we have ZCZ′ = cYY′ ∼ W_p(r, cΣ).
For the 'only if' part, again put r := r(C) and write C = PDP′ where P_{n×r} = [P_1 | P_2 | ... | P_r] has orthonormal columns and D = diag(δ_1, δ_2, ..., δ_r), so that C has non-zero eigenvalues δ_1, δ_2, ..., δ_r. Then ZCZ′ = Σ_{j=1}^r δ_j(ZP_j)(ZP_j)′ = Σ_{j=1}^r δ_jY_jY_j′ where Y_j = ZP_j, 1 ≤ j ≤ r.
Again we know that {Y_j : 1 ≤ j ≤ r} are iid N_p(0, Σ). Thus Y_1Y_1′, ..., Y_rY_r′ are iid W_p(1, Σ) while Σ_{j=1}^r δ_jY_jY_j′ ∼ W_p(k, Φ). Therefore by the extension of the identifiability lemma, k = r and ∃ c > 0 such that Φ = cΣ and δ_1 = δ_2 = ⋯ = δ_k = c. So (1/c)C = Σ_{j=1}^k P_jP_j′ is idempotent of rank k.
This derivation is due to Profs. Malay Ghosh & Bimal K. Sinha (The American Statistician, 2002). Let us first understand on what space we are looking to obtain a density: since Σ, and with probability 1 A, are p × p p.d. matrices, the appropriate space is the open subset
$$\Theta_p := \{(a_{11}, a_{21}, a_{22}, \ldots, a_{p1}, \ldots, a_{pp}) \in \mathbb{R}^{p(p+1)/2} : \text{the symmetric matrix } ((a_{ij}))_{1\le i,j\le p}\ \text{is p.d.}\}$$
of R^{p(p+1)/2}, which is also the parametric space, and we obtain a density of A also on that space.
We have A = ZZ′ where Z = [Z_1 | ... | Z_n] with Z_1, ..., Z_n iid N_p(0_p, Σ), n ≥ p and Σ > 0. Writing U_1, ..., U_p for the rows of Z, we have
$$Z = \begin{pmatrix}U_1'\\ \vdots\\ U_p'\end{pmatrix}.$$
Denote our target density by g_Σ. We first treat the case Σ = I_p. For this, we proceed recursively, denoting for each j = 1, 2, ..., p the j × j top left principal submatrix of A by A_{[j]}; i.e.
$$A_{[j]} = ((a_{il}))_{1\le i,l\le j} = Z_{[j]}Z_{[j]}'$$
where Z_{[j]} = (U_1, ..., U_j)′ takes the first j rows of Z as its rows. Note that A_{[p]} = A. Also denote A_{[0]} = 1.
The idea is to develop a formula for gj|Ip ; prove that it is actually free of Z[j−1]
except through A[j−1] ; and thus also represents the conditional joint density given
A[j−1] . Now use it recursively over j = 1, 2, . . . , p so that our final target could be
expressed as
gIp = g1|Ip (a11 ) g2|Ip (a21 , a22 |a11 ) · · · gp|Ip (ap1 , ap2 , . . . , app |A[p−1] ). (5.1)
So fix 2 ≤ j ≤ p.
Note in the special Σ = Ip case that the entries Zij of Z are iid, each with a
standard normal distribution. Therefore the random vector U j ∼ Nn (0n , In ) inde-
˜ ˜
pendently of Z[j−1] , and it follows that conditionally,
The next step is to get the conditional joint distribution including a_jj, which we actually obtain in a slightly roundabout way: we actually show that a_{jj.j−1} and Y_j are conditionally independent, and a_{jj.j−1} ∼ χ²_{n−j+1} conditionally. The latter is because
$$a_{jj.j-1} = \|(I_n - P_{C(Z_{[j-1]}')})U_j\|^2 \sim \chi^2_{n - r(Z_{[j-1]})},$$
i.e. χ²_{n−j+1}, while the conditional cross-covariance matrix between Y_j = Z_{[j−1]}U_j and V_j := (I_n − P_{C(Z_{[j-1]}')})U_j is
$$Z_{[j-1]}\big(I_n - Z_{[j-1]}'(Z_{[j-1]}Z_{[j-1]}')^{-1}Z_{[j-1]}\big) = Z_{[j-1]} - Z_{[j-1]} = 0_{(j-1)\times n}.$$
Now note that the Jacobian for the map (Y ′j , ajj ) 7→ (Y ′j , ajj.j−1 ) is 1 for every j;
˜ ˜
in fact the Jacobian matrix itself is Ij . Thus the conditional joint densities of (Y ′j , ajj )
˜
and (Y ′j , ajj.j−1 ) given Z[j−1] , or equivalently, given A[j−1] , are exactly the same gj|Ip .
˜
It follows that for (y_j, a_jj) in the appropriate domain (i.e. so that A_{[j]} is p.d.),
$$g_{j|I_p}(y_j, a_{jj}) = \frac{\exp\big[-\tfrac12 y_j'A_{[j-1]}^{-1}y_j\big]}{(2\pi)^{\frac{j-1}{2}}|A_{[j-1]}|^{\frac12}}\cdot\frac{\exp\big[-\tfrac12 a_{jj.j-1}\big]\,a_{jj.j-1}^{\frac{n-j+1}{2}-1}}{2^{\frac{n-j+1}{2}}\,\Gamma\big(\tfrac{n-j+1}{2}\big)}$$
$$= \frac{\exp\big[-\tfrac12\big(y_j'A_{[j-1]}^{-1}y_j + a_{jj.j-1}\big)\big]}{2^{\frac{n}{2}}\,\pi^{\frac{j-1}{2}}\,\Gamma\big(\tfrac{n-j+1}{2}\big)}\cdot\frac{|A_{[j]}|^{\frac{n-j-1}{2}}}{|A_{[j-1]}|^{\frac12}\,|A_{[j-1]}|^{\frac{n-j-1}{2}}} = \frac{\exp\big(-\tfrac{a_{jj}}{2}\big)\,|A_{[j]}|^{\frac{n-j-1}{2}}}{2^{\frac{n}{2}}\,\pi^{\frac{j-1}{2}}\,\Gamma\big(\tfrac{n-j+1}{2}\big)\,|A_{[j-1]}|^{\frac{n-j}{2}}}.$$
Σ general p.d. case: Now an ingenious trick is used exploiting the sufficiency of A for the distribution of Z (or equivalently Z). Consider the joint density of the columns of Z: for z_1, z_2, ..., z_n ∈ R^p,
$$\prod_{j=1}^{n}\frac{e^{-\frac12(z_j'\Sigma^{-1}z_j)}}{(2\pi)^{p/2}|\Sigma|^{1/2}} = K\,|\Sigma|^{-\frac{n}{2}}\exp\Big(-\frac{E}{2}\Big),\ \text{say};$$
and one could invoke either the factorization criterion or common properties of exponential families to conclude. As this means that the conditional distribution of Z given A is free of Σ, denoting by h its density, we can write the joint density as
$$|A|^{(n-p-1)/2}\,e^{-\frac12\mathrm{tr}(\Sigma^{-1}A)}$$
Note. The standard derivations, although typically more complicated, yield cer-
tain byproducts. E.g. the so-called LU-factorization, also called the Bartlett de-
composition for the Wishart matrix immediately yields the distributions of ajj.j−1 for
2 ≤ j ≤ p and thereby that of the sample generalized variance, etc.
Recall that A = ZZ′ where Z = (Z_1 | Z_2 | ... | Z_n) with Z_1, Z_2, ..., Z_n iid N_p(0_p, Σ).
Since Σ > 0 and n ≥ p ⇒ A > 0 with probability 1, for 1 ≤ q ≤ p − 1, partitioning A as
$$A = \begin{pmatrix}A_{11}^{\,q\times q} & A_{12}^{\,q\times(p-q)}\\ A_{21}^{\,(p-q)\times q} & A_{22}^{\,(p-q)\times(p-q)}\end{pmatrix},$$
we see that with probability 1, A_{11} and A_{22} are positive definite, and so, with probability 1, is the matrix A_{11.2} := A_{11} − A_{12}A_{22}^{-1}A_{21}. We partition Z correspondingly as
$$Z = \begin{pmatrix}Z^{(1)}_{q\times n}\\ Z^{(2)}_{(p-q)\times n}\end{pmatrix}\quad\text{with } A_{ij} = Z^{(i)}Z^{(j)\prime},\ 1\le i,j\le 2.$$
Now we know
$$\big(Z_j^{(1)}\mid Z_j^{(2)}\big) \sim N_q\big(BZ_j^{(2)},\ \Sigma_{11.2}\big)\quad\forall\, j = 1, 2, \ldots, n$$
with B := Σ_{12}Σ_{22}^{-1}. Write Y_j = Z_j^{(1)} − BZ_j^{(2)}, 1 ≤ j ≤ n. Then given Z^{(2)}, the Y_j are iid N_q(0, Σ_{11.2}) for j = 1, 2, ..., n. Write
$$\mathbb{Y}_{q\times n} = (Y_1 | Y_2 | \ldots | Y_n) = Z^{(1)} - BZ^{(2)}$$
so that conditionally given Z^{(2)}, YY′ ∼ W_q(n, Σ_{11.2}).
$$A_{11.2} = Z^{(1)}Z^{(1)\prime} - Z^{(1)}Z^{(2)\prime}\big(Z^{(2)}Z^{(2)\prime}\big)^{-1}Z^{(2)}Z^{(1)\prime} = Z^{(1)}TZ^{(1)\prime}$$
where temporarily we write T := I_n − Z^{(2)\prime}(Z^{(2)}Z^{(2)\prime})^{-1}Z^{(2)}. Observe that Z^{(2)}T = 0_{(p−q)×n}, or equivalently, TZ^{(2)\prime} = 0_{n×(p−q)}.
We now claim that YTY′ = A_{11.2} also, and prove this by noting that
$$\mathbb{Y}T\mathbb{Y}' = \big[Z^{(1)} - BZ^{(2)}\big]\,T\,\big[Z^{(1)\prime} - Z^{(2)\prime}B'\big] = Z^{(1)}TZ^{(1)\prime} - BZ^{(2)}TZ^{(1)\prime} - Z^{(1)}TZ^{(2)\prime}B' + BZ^{(2)}TZ^{(2)\prime}B' = A_{11.2}$$
since the last three terms are all null. It remains to observe that T is a symmet-
ric idempotent, hence projection, matrix. In fact it represents the projection
onto C(Z(2) ′ ); hence its rank equals (n − r(Z(2) )) = (n − (p − q)) with proba-
bility 1. Since conditionally given Z(2) , we know YY′ ∼ Wq (n, Σ11.2 ); therefore
A11.2 = YTY′ ∼ Wq (n − (p − q), Σ11.2 ) conditionally, by Property 8. But the
conditional distribution does not depend on Z(2) , so it must be the unconditional
distribution as well, independently of Z(2) .
(2) Specializing the above to the case q = 1 has several important ramifications. In this case, let us temporarily denote the 1×(p−1) vector A_{12} = (a_{12}, ..., a_{1p}) by a_1. Then a_{11.2} = a_{11} − a_1A_{22}^{-1}a_1′ has a W_1(n − (p − 1), σ_{11.2}) distribution, where σ_{11.2} = σ_{11} − σ_1Σ_{22}^{-1}σ_1′ with σ_1 = Σ_{12} = (σ_{12}, ..., σ_{1p})_{1×(p−1)}.
This also means that a_{11.2}/σ_{11.2} ∼ χ²_{n−p+1}, independently of A_{22}.
(3) Generalized variance: When A is the SSSP matrix of a random sample of size N = n + 1 from a normal population with dispersion matrix Σ, the determinant |A/n| = |A|/n^p is called the generalized variance of the sample. It estimates the population generalized variance |Σ|. We assume Σ > 0 and prove
Proposition 5.1 |A|/|Σ| has the distribution of the product of p independent χ² variables, with n, n − 1, ..., n − p + 1 d.f. respectively.
$$\sigma^{(11)} = \frac{|\Sigma_{22}|}{|\Sigma|} = \frac{|\Sigma_{22}|}{\sigma_{11.2}\cdot|\Sigma_{22}|} = \sigma_{11.2}^{-1}$$
and likewise, a^{(11)} = a_{11.2}^{-1}, where A^{-1} = ((a^{(ij)}))_{1≤i,j≤p}. Therefore we conclude σ^{(11)}/a^{(11)} ∼ χ²_{n−p+1}, independently of A_{22}. More generally, clearly for any j = 2, ..., p also, σ^{(jj)}/a^{(jj)} ∼ χ²_{n−p+1}. Can we generalize this further?
We can. Observe that a^{(jj)} = e_j′A^{-1}e_j and σ^{(jj)} = e_j′Σ^{-1}e_j for 1 ≤ j ≤ p. So for each j, (e_j′Σ^{-1}e_j)/(e_j′A^{-1}e_j) ∼ χ²_{n−p+1}. We claim that e_j in the above can be replaced by any ℓ ≠ 0; i.e.
$$\forall\,\ell\in\mathbb{R}^p\setminus\{0\},\qquad \frac{\ell'\Sigma^{-1}\ell}{\ell'A^{-1}\ell} \sim \chi^2_{n-p+1}.$$
To prove this, choose a nonsingular matrix Q whose first column is ℓ, so that ℓ = Qe_1. Now define Λ := Q^{-1}ΣQ′^{-1} and B := Q^{-1}AQ′^{-1} ∼ W_p(n, Λ) by property 5; so
$$\frac{\ell'\Sigma^{-1}\ell}{\ell'A^{-1}\ell} = \frac{e_1'Q'\Sigma^{-1}Qe_1}{e_1'Q'A^{-1}Qe_1} = \frac{e_1'\Lambda^{-1}e_1}{e_1'B^{-1}e_1} \sim \chi^2_{n-p+1}.$$
(5) This yields at one go the following consequence when the first component is independent of the rest: then because σ_1 = 0′_{p−1}, it follows that the vector β = β_{1.23...p} = Σ_{22}^{-1}σ_1′ of population regression coefficients becomes null, and the (squared) population multiple correlation coefficient ρ²_{1.23...p} = σ_1Σ_{22}^{-1}σ_1′/σ_{11} becomes 0.
Since the original sample size is N = n+1, we often write F ∼ Fp−1,N −p instead.
(6) Null distribution of the sample correlation coefficient: Firstly, when A is the SSSP matrix of a random sample from an N_p(µ, Σ) distribution of size N = n + 1, the sample correlation coefficients are expressible simply from A as follows: if S = A/n = ((s_ij)) then r_ij = s_ij/√(s_iis_jj) = a_ij/√(a_iia_jj), 1 ≤ i ≠ j ≤ p; so their distribution is determined by that of A ∼ W_p(n, Σ).
Clearly, if we fix i and j then the distribution of r_ij depends only on the joint distribution of the i-th and j-th components of the distribution, which is an
$$N_2\!\left(\begin{pmatrix}\mu_i\\ \mu_j\end{pmatrix},\ \begin{pmatrix}\sigma_{ii} & \sigma_{ij}\\ \sigma_{ij} & \sigma_{jj}\end{pmatrix}\right);$$
hence it is enough to specialize to the case p = 2. Suppose we do that and write ρ_{12} =: ρ, r_{12} =: r, and note that
$$\rho = 0 \iff \sigma_{12} = 0 \iff \sigma_{11.2} = \sigma_{11} - \frac{\sigma_{12}^2}{\sigma_{22}} = \sigma_{11}(1-\rho^2) = \sigma_{11}.$$
Now if we write Z = ((Z_ij))_{i=1,2;\,1≤j≤n} = \begin{pmatrix}U'\\ V'\end{pmatrix}, say, then
$$A = ZZ' \implies a_{11} = U'U,\quad a_{12} = a_{21} = U'V,\quad a_{22} = V'V,$$
and we know that conditionally given V, Z_{1j} ∼ N(β_{12}Z_{2j}, σ_{11.2}), i.e. N(0, σ_{11}) if ρ = 0; and these are conditionally independent over j. Hence the conditional distribution of U given V is N_n(0_n, σ_{11}I_n).
˜ ˜ ˜
Now, let Γ denote a (random, but fixed given V) orthogonal matrix whose first row is V′/∥V∥. Then the conditional distribution of ΓU =: W = (W_1, ..., W_n)′, say, given V, is also the same N_n(0_n, σ_{11}I_n).
Now observe that a_{12} = Σ_{j=1}^n Z_{1j}Z_{2j} = U′V = W_1∥V∥; and a_{11.2} = U′U − (U′V)²/(V′V) = W′W − W_1². But a_{12} has the conditional distribution
$$N\Big(0,\ \sum_{j=1}^{n}Z_{2j}^2\,\sigma_{11}\Big),\ \text{i.e.}\ N(0, \sigma_{11}a_{22}),$$
in this case. In particular a_{jj}/σ_{jj} ∼ χ²_n, while a_{11.2}/σ_{11} = a_{11.2}/σ_{11.2} ∼ χ²_{n−1} independently of V.
We now show that a_{12}/√a_{22} and a_{11.2} are conditionally independent: we only need to observe that
$$\frac{a_{12}}{\sqrt{a_{22}}} = W_1 \qquad\text{and}\qquad a_{11.2} = W'W - W_1^2 = \sum_{j=2}^{n}W_j^2,$$
which are clearly independent N(0, σ_{11}) and σ_{11}χ²_{n−1} variables conditionally, given V. So,
$$r = \frac{a_{12}}{\sqrt{a_{11}a_{22}}} \implies 1 - r^2 = \frac{a_{11}a_{22} - a_{12}^2}{a_{11}a_{22}} = \frac{a_{11} - a_{12}^2/a_{22}}{a_{11}} = \frac{a_{11.2}}{a_{11}}$$
$$\implies \sqrt{n-1}\;\frac{r}{\sqrt{1-r^2}} = \frac{a_{12}/\sqrt{a_{11}a_{22}}}{\sqrt{\dfrac{a_{11.2}}{(n-1)a_{11}}}} = \frac{a_{12}\big/(\sqrt{a_{22}}\,\sqrt{\sigma_{11}})}{\sqrt{\dfrac{a_{11.2}}{\sigma_{11}(n-1)}}},$$
a standard normal variable divided by the square root of an independent χ²_{n−1}/(n − 1); i.e., when ρ = 0, √(n−1)·r/√(1−r²) ∼ t_{n−1} conditionally given V, and hence also unconditionally.
where ((ast.(q+1),...,p ))1≤s,t≤q = A11.2 .
$$n(X - \mu)'A^{-1}(X - \mu) = n(CY)'\,C'^{-1}B^{-1}C^{-1}\,(CY) = nY'B^{-1}Y \sim T^2_{p,n}.$$
Thus the distribution is indeed free of parameters.
(9) The T² distribution is connected with the F distribution in the following way: applying property 4 above to the case Σ = I_p, recall that when A ∼ W_p(n, I_p), then for any ℓ ∈ R^p \ {0_p},
$$\frac{\ell'\ell}{\ell'A^{-1}\ell} \sim \chi^2_{n-p+1}.$$
Hence with T² = nZ′A^{-1}Z where Z ∼ N_p(0, I) independently of A, given Z = z for any z ∈ R^p \ {0_p}, we see z′z/(z′A^{-1}z) ∼ χ²_{n−p+1}, and this distribution is free of z. Thus Z′Z/(Z′A^{-1}Z) ∼ χ²_{n−p+1} independently of Z and hence of Z′Z; while we know Z′Z ∼ χ²_p. Thus
$$\frac{n-p+1}{p}\,\frac{T^2}{n} = \frac{Z'Z/p}{\dfrac{Z'Z}{Z'A^{-1}Z}\Big/(n-p+1)} \sim F_{p,\,n-p+1}.$$
˜ ˜
(10) There is also a non-central version of this distribution: it is the distribution of nX′A^{-1}X if X ∼ N_p(µ, Σ) and A ∼ W_p(n, Σ) are still independent with Σ > 0 and n ≥ p, but µ is not necessarily 0. Note that with Y := C^{-1}X and B := C^{-1}AC^{-1\prime} where CC′ = Σ as before, now Y′Y ∼ χ²_{p,λ} with λ := µ′Σ^{-1}µ, but it is still independent of Y′Y/(Y′B^{-1}Y) ∼ χ²_{n−p+1}; thus now
$$\frac{n-p+1}{p}\,\frac{T^2}{n} \sim F_{p,\,n-p+1,\,\lambda}$$
with λ as a non-centrality parameter. Again, the central version corresponds to
λ = 0.
Clearly the distribution has the properties of stochastic ordering and therefore
monotonicity of tail probabilities that the F distributions enjoy.
6 Inference on parameters
$$L(\mu, \Sigma) = K\,|\Sigma|^{-\frac{N}{2}}\exp\Big(-\frac{E}{2}\Big),\ \text{say},\qquad(6.2)$$
for a constant K, where E is (−2)× the exponent in the density. Let x̄ = (1/N)Σ_{j=1}^N x_j. We simplify E as follows:
$$E = \sum_{j=1}^{N}(x_j - \bar{x} + \bar{x} - \mu)'\Sigma^{-1}(x_j - \bar{x} + \bar{x} - \mu)$$
$$= \sum_{j=1}^{N}(x_j - \bar{x})'\Sigma^{-1}(x_j - \bar{x}) + 2\sum_{j=1}^{N}(\bar{x} - \mu)'\Sigma^{-1}(x_j - \bar{x}) + N(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu)$$
$$= \sum_{j=1}^{N}(x_j - \bar{x})'\Sigma^{-1}(x_j - \bar{x}) + N(\bar{x} - \mu)'\Sigma^{-1}(\bar{x} - \mu)$$
since Σ_{j=1}^N(x_j − x̄) = 0. Therefore given observations X_1, ..., X_N, the log-likelihood function equals
$$l(\mu, \Sigma) = \ln L(\mu, \Sigma) = \ln K - \frac{N}{2}\ln|\Sigma| - \frac12\Big[\sum_{j=1}^{N}(X_j - \bar{X})'\Sigma^{-1}(X_j - \bar{X}) + N(\bar{X} - \mu)'\Sigma^{-1}(\bar{X} - \mu)\Big].$$
Since for every p.d. Σ, (X̄ − µ)′Σ^{-1}(X̄ − µ) ≥ 0 with equality iff µ = X̄, therefore l(µ, Σ) is maximum when and only when µ = X̄. So µ̂ = X̄ is the MLE of µ for any Σ. Thus to obtain the MLE of Σ, we need to maximize l(µ̂, Σ) = l(X̄, Σ), or
equivalently,
$$l(\bar{X}, \Sigma) - \ln K = -\frac{N}{2}\ln|\Sigma| - \frac12\sum_{j=1}^{N}(X_j - \bar{X})'\Sigma^{-1}(X_j - \bar{X}) = -\frac{N}{2}\ln|\Sigma| - \frac12\sum_{j=1}^{N}\mathrm{tr}\big(\Sigma^{-1}(X_j - \bar{X})(X_j - \bar{X})'\big)$$
$$= -\frac{N}{2}\ln|\Sigma| - \frac12\,\mathrm{tr}\Big(\Sigma^{-1}\sum_{j=1}^{N}(X_j - \bar{X})(X_j - \bar{X})'\Big) = -\frac{N}{2}\ln|\Sigma| - \frac12\,\mathrm{tr}(\Sigma^{-1}A),$$
and we set the partial derivatives to 0. Now for any square matrix M = ((mij)), the coefficient of mij in |M| is the cofactor (−1)^{i+j}|M^{(ij)}|, where M^{(ij)} is obtained from M by removing its i-th row and j-th column. Also, if M is nonsingular, then since (−1)^{i+j}|M^{(ij)}|/|M| is the (j, i)-th entry of M⁻¹, we have ∂|M|/∂mij = (−1)^{i+j}|M^{(ij)}| = |M|·(M⁻¹)ji, and we conclude that (writing σ^{(ij)} for the (i, j)-th entry of Σ⁻¹)

∂ ln|Σ⁻¹| / ∂σ^{(ij)} = (1/|Σ⁻¹|) ∂|Σ⁻¹|/∂σ^{(ij)} = [(Σ⁻¹)⁻¹]ji = σij,
so that the likelihood equations reduce to

½(N σij − aij) = 0 ⇔ σ̂ij = aij/N, 1 ≤ i, j ≤ p ⟺ Σ̂ = A/N,

which is the unique solution of the likelihood equations. That it is indeed the MLE of Σ can be justified as follows: call the eigenvalues of Σ⁻¹A/N λ1, . . . , λp. Then for all Σ,

l(X̄, A/N) − l(X̄, Σ) = (N/2) ln|A/N|⁻¹ − (N/2)p − (N/2) ln|Σ|⁻¹ + ½ tr(Σ⁻¹A)
 = (N/2) { tr(Σ⁻¹A/N) − ln|Σ⁻¹A/N| − p } = (N/2) { Σ_{j=1}^p λj − Σ_{j=1}^p ln λj − p }
 = (N/2) Σ_{j=1}^p (λj − ln λj − 1) ≥ 0, ∵ for all x > 0, x ≥ 1 + ln x, with equality iff x = 1;

thus equality holds iff λj = 1 for all j, i.e. Σ = Σ̂ = A/N.
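In practice the MLEs are immediate to compute from the N×p data matrix; a small sketch (assuming NumPy; the function name is illustrative):

```python
import numpy as np

def mvn_mle(X):
    """MLEs of (mu, Sigma) from an N x p data matrix X: mu_hat = X_bar, Sigma_hat = A/N."""
    N = X.shape[0]
    xbar = X.mean(axis=0)
    D = X - xbar                   # rows X_j - X_bar
    A = D.T @ D                    # A = sum_j (X_j - X_bar)(X_j - X_bar)'
    return xbar, A / N

# illustration on simulated data
rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)
mu_hat, Sigma_hat = mvn_mle(X)
print(mu_hat, Sigma_hat, sep="\n")
```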
Case 1 (Σ known): here X̄ is sufficient, so for testing H0: µ = µ0 against H1: µ ≠ µ0 use the fact that (X̄ − µ0) ∼ Np(µ − µ0, Σ/N), so that Y = N(X̄ − µ0)′Σ⁻¹(X̄ − µ0) ∼ χ²_{p,λ} with λ = N(µ − µ0)′Σ⁻¹(µ − µ0), and λ = 0 iff H0 is true.

Thus the test with C.R. {Y > χ²_p(α)} has size α and is unbiased.
Case 2 (Σ unknown): Now we have √N(X̄ − µ) ∼ Np(0p, Σ), independently of A ∼ Wp(n, Σ) with n = N − 1; so define

T² := [√N(X̄ − µ0)]′ (A/n)⁻¹ [√N(X̄ − µ0)] = Nn(X̄ − µ0)′A⁻¹(X̄ − µ0) ∼ T²_{p,n}

under H0, while its non-null distribution is a non-central T². Thus the test with C.R.

{ ((N − p)/(np)) T² = (N(N − p)/p)(X̄ − µ0)′A⁻¹(X̄ − µ0) ≥ F_{p,N−p}(α) }

has size α and is unbiased.
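The test is easy to carry out directly; a hedged sketch (assuming NumPy/SciPy; F_{p,N−p}(α) is the upper-α point, i.e. SciPy's ppf at 1 − α, and the function name is illustrative):

```python
import numpy as np
from scipy import stats

def hotelling_one_sample(X, mu0, alpha=0.05):
    """Reject H0: mu = mu0 iff N(N-p)/p * (X_bar-mu0)'A^{-1}(X_bar-mu0) >= F_{p,N-p}(alpha)."""
    N, p = X.shape
    xbar = X.mean(axis=0)
    D = X - xbar
    A = D.T @ D                                   # n = N - 1 degrees of freedom
    d = xbar - mu0
    stat = N * (N - p) / p * d @ np.linalg.solve(A, d)
    crit = stats.f.ppf(1 - alpha, p, N - p)       # upper-alpha point F_{p,N-p}(alpha)
    return stat, crit, stat >= crit

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.3, 0.0, -0.2], np.eye(3), size=40)
print(hotelling_one_sample(X, np.zeros(3)))
```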
Large-sample test: Recall that we can write T² = n Z′Z / (Z′Z / Z′A⁻¹Z), and the denominator Z′Z / Z′A⁻¹Z has a χ²_{n−p+1} distribution independently of the numerator.

Now a χ²_n variable Yn can be written as Σ_{j=1}^n ξj², where (ξn)_{n≥1} is a sequence of iid N(0, 1) variables; since E(ξ1²) = 1, it follows by the WLLN, e.g., that Yn/n →^P 1 as n → ∞. Therefore

(1/n)(Z′Z / Z′A⁻¹Z) = ((n − p + 1)/n) · (1/(n − p + 1))(Z′Z / Z′A⁻¹Z) →^P 1.

A well-known result, Slutsky's Theorem, now guarantees that

T² = Z′Z / [(1/n)(Z′Z / Z′A⁻¹Z)] →^d Z′Z ∼ χ²_p

as n → ∞. Thus for large n, {T² > χ²_p(α)} defines the C.R. of a test with approximate size α, for α ∈ (0, 1).
Two-sample problem: We have random samples X1, . . . , X_{N1} and Y1, . . . , Y_{N2} from two populations, respectively Np(µ1, Σ) and Np(µ2, Σ), with Σ unknown. Note that we need to assume that the dispersion matrices are equal. With A = A1 + A2 denoting the pooled sum-of-products matrix and n = N1 + N2 − 2,

T² = (N1 N2/(N1 + N2)) n (X̄ − Ȳ)′A⁻¹(X̄ − Ȳ) ∼ T²_{p,n,λ},

with λ = (N1 N2/(N1 + N2))(µ1 − µ2)′Σ⁻¹(µ1 − µ2), so that λ = 0 iff H0: µ1 = µ2 holds.
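A sketch of the corresponding two-sample test (assumptions as above: pooled A = A1 + A2 and n = N1 + N2 − 2; NumPy/SciPy assumed, names illustrative):

```python
import numpy as np
from scipy import stats

def hotelling_two_sample(X, Y, alpha=0.05):
    """Two-sample T^2 = N1 N2/(N1+N2) * n * (X_bar-Y_bar)' A^{-1} (X_bar-Y_bar), A = A1 + A2."""
    N1, p = X.shape
    N2 = Y.shape[0]
    n = N1 + N2 - 2
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    A = (X - xbar).T @ (X - xbar) + (Y - ybar).T @ (Y - ybar)
    d = xbar - ybar
    T2 = (N1 * N2 / (N1 + N2)) * n * d @ np.linalg.solve(A, d)
    F = (n - p + 1) / (n * p) * T2                 # ~ F_{p, n-p+1} under H0: mu1 = mu2
    crit = stats.f.ppf(1 - alpha, p, n - p + 1)
    return T2, F, crit, F >= crit

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0], np.eye(2), size=30)
Y = rng.multivariate_normal([0.5, 0], np.eye(2), size=25)
print(hotelling_two_sample(X, Y))
```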
Confidence ellipsoid: Of course, when Σ is known, the χ² distribution of N(X̄ − µ)′Σ⁻¹(X̄ − µ) suffices to yield a confidence region for µ.

When Σ is unknown we use the T² distribution. Observe that for any given α ∈ (0, 1),

Prob_{µ,Σ} { (N(N − p)/p) (X̄ − µ)′A⁻¹(X̄ − µ) ≤ F_{p,N−p}(α) } = 1 − α.

So, given the data, the set

E := { µ ∈ R^p : (µ − X̄)′A⁻¹(µ − X̄) ≤ (p/(N(N − p))) F_{p,N−p}(α) }
   = { µ ∈ R^p : Nn(µ − X̄)′A⁻¹(µ − X̄) ≤ T²_{p,N−p}(α) },

if we write (np/(N − p)) F_{p,N−p}(α) as T²_{p,N−p}(α), contains the true value of µ with probability (1 − α).
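Checking whether a candidate µ lies in E is a one-line computation; a minimal sketch (NumPy/SciPy assumed, names illustrative):

```python
import numpy as np
from scipy import stats

def in_confidence_ellipsoid(mu, X, alpha=0.05):
    """Check whether mu lies in the 100(1-alpha)% T^2 confidence ellipsoid E."""
    N, p = X.shape
    xbar = X.mean(axis=0)
    D = X - xbar
    A = D.T @ D
    d = mu - xbar
    lhs = d @ np.linalg.solve(A, d)
    rhs = p / (N * (N - p)) * stats.f.ppf(1 - alpha, p, N - p)
    return lhs <= rhs

rng = np.random.default_rng(5)
X = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.3], [0.3, 2.0]], size=50)
print(in_confidence_ellipsoid(np.array([1.0, 2.0]), X))   # true mean: typically True
print(in_confidence_ellipsoid(np.array([3.0, 0.0]), X))   # far point: typically False
```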
[Figure: the confidence ellipse E for the case p = 2, centered at (X̄1, X̄2), drawn in the (µ1, µ2) plane.]
Simultaneous CIs for the components of the mean vector or their linear combinations: Since, for a given ℓ ∈ R^p \ {0}, ℓ′(X̄ − µ) ∼ N(0, ℓ′Σℓ/N) and ℓ′Aℓ/(ℓ′Σℓ) ∼ χ²_n independently, with n = N − 1, a natural idea for working out a 100(1 − α)% CI for ℓ′µ would be to use, temporarily writing σ²(ℓ) for ℓ′Σℓ (= N·V(ℓ′X̄)), the pivot

√(nN) ℓ′(X̄ − µ) / √(ℓ′Aℓ) = [√N ℓ′(X̄ − µ)/σ(ℓ)] / √(ℓ′Aℓ/(n σ²(ℓ))) ∼ t_n.

Thus the CI

( ℓ′X̄ − √(ℓ′Aℓ/(nN)) t_n(α/2),  ℓ′X̄ + √(ℓ′Aℓ/(nN)) t_n(α/2) )

is a choice. But if we are given several vectors ℓ1, . . . , ℓk and wish to find intervals I1, I2, . . . , Ik such that Prob_{µ,Σ}(ℓj′µ ∈ Ij, 1 ≤ j ≤ k) ≥ 1 − α for all µ, Σ, then the choices ( ℓj′X̄ − √(ℓj′Aℓj/(nN)) t_n(α/2), ℓj′X̄ + √(ℓj′Aℓj/(nN)) t_n(α/2) ) as above for Ij, 1 ≤ j ≤ k, may not work; for

Prob_{µ,Σ}(ℓj′µ ∈ Ij) = 1 − α, 1 ≤ j ≤ k  ⇏  Prob_{µ,Σ}( ∩_{j=1}^k {ℓj′µ ∈ Ij} ) ≥ 1 − α.
Instead, note that by the Cauchy–Schwarz inequality, for every ℓ ≠ 0,

(ℓ′(X̄ − µ))² / ℓ′Aℓ ≤ (X̄ − µ)′A⁻¹(X̄ − µ),

with equality when ℓ is proportional to A⁻¹(X̄ − µ); whence

Prob_{µ,Σ} { |ℓ′(X̄ − µ)| / √(ℓ′Aℓ) ≤ K ∀ ℓ }
 = Prob_{µ,Σ} { max_{ℓ≠0} |ℓ′(X̄ − µ)| / √(ℓ′Aℓ) ≤ K } = Prob_{µ,Σ} { max_{ℓ≠0} (ℓ′(X̄ − µ))² / ℓ′Aℓ ≤ K² }
 = Prob_{µ,Σ} { (X̄ − µ)′A⁻¹(X̄ − µ) ≤ K² }
 = Prob_{µ,Σ} { (N(N − p)/p)(X̄ − µ)′A⁻¹(X̄ − µ) ≤ (N(N − p)/p) K² }
 = Prob_{µ,Σ} { F_{p,N−p} ≤ (N(N − p)/p) K² } = 1 − α

when we choose K = K_α := √( (p/(N(N − p))) F_{p,N−p}(α) ); i.e.

Prob_{µ,Σ} { ∀ ℓ, ℓ′X̄ ∈ ℓ′µ ± K_α √(ℓ′Aℓ) } = 1 − α.

In particular, the choices (X̄j − K_α √ajj, X̄j + K_α √ajj) yield confidence intervals for µj, 1 ≤ j ≤ p, with simultaneous confidence coefficient at least 100(1 − α)%.
This generalizes easily to simultaneous tests of any number k of hypotheses H0j: ℓj′µ = cj, 1 ≤ j ≤ k vs. H1: at least one of H01, . . . , H0k is false: we reject the null hypothesis at level α if for some j ∈ {1, 2, . . . , k}, cj ∉ ( ℓj′X̄ − K_α √(ℓj′Aℓj), ℓj′X̄ + K_α √(ℓj′Aℓj) ).
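These simultaneous (Scheffé-type) intervals are straightforward to compute; a minimal sketch (NumPy/SciPy assumed; the rows of L below play the role of the ℓj, and the function name is illustrative):

```python
import numpy as np
from scipy import stats

def simultaneous_cis(X, L, alpha=0.05):
    """Simultaneous intervals l'X_bar +/- K_alpha * sqrt(l'Al) for the rows l of L."""
    N, p = X.shape
    xbar = X.mean(axis=0)
    A = (X - xbar).T @ (X - xbar)
    K = np.sqrt(p / (N * (N - p)) * stats.f.ppf(1 - alpha, p, N - p))
    centers = L @ xbar
    half = K * np.sqrt(np.einsum('ij,jk,ik->i', L, A, L))   # sqrt(l'Al), row-wise
    return np.column_stack([centers - half, centers + half])

rng = np.random.default_rng(6)
X = rng.multivariate_normal([1.0, -1.0, 0.5], np.eye(3), size=60)
print(simultaneous_cis(X, np.eye(3)))          # intervals for mu_1, mu_2, mu_3
```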
7 Canonical correlations
Given multivariate data, the need often arises to analyze it only after replacing the set of all variables by a smaller set, which could be either a subset or some set of transformed variables. Among the situations where such a necessity arises naturally is one where we wish to capture the interrelationship between two (disjoint) sets of the original variables by replacing them with sets of transformed variables.

We discuss the population version first. Let the given sets have respectively q and r = p − q variables, and without loss of generality assume that these are respectively the first q and the last r coordinates of X, and also that q ≤ r.
As before, write X = (X1′, X2′)′ with X1 of order q×1 and X2 of order r×1. Since correlations do not depend on means, we assume E(X) = 0. Partition the dispersion matrix Σ of X as earlier, as

Σ = [ Σ11 (q×q)  Σ12 (q×r) ; Σ21 (r×q)  Σ22 (r×r) ].

We assume Σ > 0, which implies Σ11, Σ22 > 0. Write Σ11 = B1′B1 and Σ22 = B2′B2, where B1 (q×q) and B2 (r×r) are nonsingular.
Then for any α ∈ R^q \ {0q} and γ ∈ R^r \ {0r}, the correlation between α′X1 and γ′X2 is α′Σ12γ / √(α′Σ11α · γ′Σ22γ), and its supremum value is

sup_{α≠0q, γ≠0r} α′Σ12γ / √(α′Σ11α · γ′Σ22γ)
 = sup_{α,γ} (B1α)′(B1′)⁻¹Σ12B2⁻¹(B2γ) / √((B1α)′(B1α) · (B2γ)′(B2γ))
 = sup_{α,γ : B1α≠0q, B2γ≠0r} (B1α)′(B1′)⁻¹Σ12B2⁻¹(B2γ) / (∥B1α∥ ∥B2γ∥)
 = sup_{α,γ : ∥B1α∥=∥B2γ∥=1} α′Σ12γ.

Thus, to maximize α′Σ12γ subject to α′Σ11α = ∥B1α∥² = 1 and γ′Σ22γ = ∥B2γ∥² = 1, we adopt the method of Lagrange multipliers. Define the function ψ(α, γ) = α′Σ12γ − (λ/2)(α′Σ11α − 1) − (µ/2)(γ′Σ22γ − 1). Setting respectively to 0q and 0r the vectors of partial derivatives,

0q = ∂ψ/∂α = Σ12γ − λΣ11α  and  0r = ∂ψ/∂γ = Σ12′α − µΣ22γ,   (7.3)
we get, multiplying on the left respectively by α′ and γ′ (and taking transposes),

α′Σ12γ − λα′Σ11α = 0 = α′Σ12γ − µγ′Σ22γ  =⇒  λ = µ = α′Σ12γ

from the restrictions. Putting µ = λ in (7.3), we see that

−λΣ11α + Σ12γ = 0q and Σ21α − λΣ22γ = 0r,  i.e.  [ −λΣ11  Σ12 ; Σ21  −λΣ22 ] (α′, γ′)′ = 0p.   (7.4)

Clearly it is necessary that the matrix L_{p×p} = ((lij)) := [ −λΣ11  Σ12 ; Σ21  −λΣ22 ] be singular. We first examine its determinant

|L| = Σ_{π∈Sp} sgn(π) l_{1π(1)} l_{2π(2)} · · · l_{pπ(p)}.
Split Sp into two sets: one, denoted say by J, containing those π that map the subset {1, 2, . . . , q} onto itself (and hence its complement onto itself also), and the other its complement, say K.

Now if π ∈ J, then for each j = 1(1)p, the (j, π(j))-th entry of L contains the factor −λ, so the term sgn(π) l_{1π(1)} l_{2π(2)} · · · l_{pπ(p)} is a multiple of λ^p; while for π ∈ K, that term is a multiple of a smaller power of λ. Further, since each π ∈ J can be thought of as consisting of a permutation π1 of {1, 2, . . . , q}, i.e. an element of Sq, and one, say π2, of {q + 1, . . . , p}, and the function π3 defined by π3(j) = π2(q + j) − q defines an element of Sr, noting that sgn(π) = sgn(π1)·sgn(π3), we see that the terms containing λ^p add up to

Σ_{π∈J} sgn(π) l_{1π(1)} · · · l_{qπ(q)} · l_{q+1,π(q+1)} · · · l_{pπ(p)}
 = [ Σ_{π1∈Sq} sgn(π1)(−λ)^q (Σ11)_{1,π1(1)} (Σ11)_{2,π1(2)} · · · (Σ11)_{q,π1(q)} ]
   · [ Σ_{π3∈Sr} sgn(π3)(−λ)^r (Σ22)_{1,π3(1)} (Σ22)_{2,π3(2)} · · · (Σ22)_{r,π3(r)} ]
 = (−λ)^q |Σ11| · (−λ)^r |Σ22| = (−λ)^p |Σ11||Σ22|,

which is nonzero for λ ≠ 0 since Σ11 and Σ22 are nonsingular; this shows the polynomial has degree exactly p. That all its roots are real is a point that we skip.
we show that the maximum correlation E(U_{m+1}V_{m+1}), between U_{m+1} = α_{m+1}′X1 and V_{m+1} = γ_{m+1}′X2 which are uncorrelated with U1, V1, . . . , Um, Vm and satisfy E(U²_{m+1}) = E(V²_{m+1}) = 1, is λ_{m+1}.

Now for each j = 1(1)m, pre-multiplying the first by αj′ and the second by γj′ yields

0 = νj αj′Σ11αj = νj  and  0 = θj γj′Σ22γj = θj;

and we are left just with the original conditions (7.3). It follows that λ_{m+1} and the corresponding α_{m+1} and γ_{m+1} are the next choice.

Define the matrices H_{q×q} = [α1|α2|. . .|αq], (Γ1)_{r×q} = [γ1|γ2|. . .|γq] and Λ_{q×q} = diag(λ1, λ2, . . . , λq). We have
Recall Σ22 = B2′B2. Now if we get an orthonormal set {g1, g2, . . . , g_{r−q}} in the ortho-complement of the column space of B2Γ1, then the matrix G_{r×(r−q)} whose columns are g1, g2, . . . , g_{r−q} satisfies G′G = I_{r−q} and G′B2Γ1 = 0_{(r−q)×q}. Defining (Γ2)_{r×(r−q)} = B2⁻¹G, so that G′ = Γ2′B2′, we get Γ2′Σ22Γ1 = Γ2′B2′B2Γ1 = G′B2Γ1 = 0_{(r−q)×q}. Further, Γ2′Σ22Γ2 = G′G = I_{r−q}.

This last matrix is connected with the dispersion matrix of the random vector

Y := [ H′  0_{q×r} ; 0_{q×q}  Γ1′ ; 0_{(r−q)×q}  Γ2′ ] X = (U′, V1′, V2′)′,

where U = H′X1 = (U1, U2, . . . , Uq)′, V1 = Γ1′X2 = (V1, V2, . . . , Vq)′ and V2 = Γ2′X2 = (V_{q+1}, V_{q+2}, . . . , Vr)′, say. These random variables are called the canonical variables, and λ1, λ2, . . . , λq respectively the first, second, . . . , q-th canonical correlations.
It is worth noting that the components of V2 = Γ2′X2 are not unique, because Γ2 is not: any other choice of the orthonormal set, tantamount to post-multiplication of Γ2 by an orthogonal matrix, serves the same purpose.
The iterative computation of the canonical correlations is based on the equalities H′Σ11H = Iq and HΛ²H⁻¹ = Σ11⁻¹Σ12Σ22⁻¹Σ21. The first means that H′Σ11 = H⁻¹, so that the rows α̃1′, . . . , α̃q′ of H⁻¹ are given by

α̃j′ = αj′Σ11

once we obtain αj, j = 1, . . . , q.

The second equality means Σ11⁻¹Σ12Σ22⁻¹Σ21 = Σ_{j=1}^q λj² αj α̃j′. Now, obtain the largest root λ1 and the corresponding α1 by starting with some initial approximation α1^{(0)} for α1, and using the iterative relation Σ11⁻¹Σ12Σ22⁻¹Σ21 α1^{(i)} = λ1^{(i+1)} α1^{(i+1)}, i = 0, 1, 2, . . . , setting α1^{(i+1)′}Σ11α1^{(i+1)} = 1.
After obtaining α1, . . . , αm for 1 ≤ m < q, also compute α̃1, . . . , α̃m and consider the matrix

Σ11⁻¹Σ12Σ22⁻¹Σ21 − Σ_{j=1}^m λj² αj α̃j′ = H [ 0_{m×m}  0_{m×(q−m)} ; 0_{(q−m)×m}  diag(λ²_{m+1}, . . . , λ²_q) ] H⁻¹,

which has largest eigenvalue λ²_{m+1}; applying the above iterative procedure to this matrix will yield λ_{m+1} and the corresponding α_{m+1}.
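A minimal sketch of this power-iteration idea for the first canonical pair (assuming NumPy; names are illustrative, and the routine works on the population matrices Σ11, Σ12, Σ22 or their sample analogues):

```python
import numpy as np

def first_canonical_pair(S11, S12, S22, iters=500, seed=0):
    """Power-iteration sketch for the largest canonical correlation lambda_1 and
    coefficient vectors alpha_1, gamma_1 (alpha'S11 alpha = gamma'S22 gamma = 1)."""
    rng = np.random.default_rng(seed)
    M = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S12.T))   # S11^{-1} S12 S22^{-1} S21
    a = rng.standard_normal(S11.shape[0])
    for _ in range(iters):
        a = M @ a
        a /= np.sqrt(a @ S11 @ a)                                 # normalize alpha'S11 alpha = 1
    lam1 = np.sqrt(a @ S12 @ np.linalg.solve(S22, S12.T @ a))     # lambda_1^2 = alpha'S12 S22^{-1}S21 alpha
    g = np.linalg.solve(S22, S12.T @ a) / lam1                    # gamma = S22^{-1} S21 alpha / lambda_1
    return lam1, a, g

# example with a sample covariance matrix, q = 2, r = 3
rng = np.random.default_rng(1)
S = np.cov(rng.standard_normal((200, 5)), rowvar=False)
print(first_canonical_pair(S[:2, :2], S[:2, 2:], S[2:, 2:])[0])
```

Deflating by λ1² α1 α̃1′ as in the text and re-running the same routine would give the subsequent pairs.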
Sample versions: The natural sample analogues would be the quantities obtained from Σ̂ in the same way the population ones were obtained from Σ. We use the MLE Σ̂ = A/N, because then the sample quantities also become MLEs of the corresponding population characteristics. For instance, the sample canonical correlation coefficients l1, . . . , lq are the roots of the equation

| −l Σ̂11  Σ̂12 ; Σ̂21  −l Σ̂22 | = 0,

and the corresponding sample canonical variates α̂j′X1 and γ̂j′X2 are obtained from the following analogue of (7.4):

[ −lj Σ̂11  Σ̂12 ; Σ̂21  −lj Σ̂22 ] (α̂j′, γ̂j′)′ = 0p,

alongside α̂j′Σ̂11α̂j = γ̂j′Σ̂22γ̂j = 1.

These then are MLEs of the j-th population canonical correlation and of the corresponding pair of canonical variates.
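Equivalently, all the lj can be read off as the square roots of the eigenvalues of Σ̂11⁻¹Σ̂12Σ̂22⁻¹Σ̂21; a hedged sketch (NumPy assumed; the data layout and names are illustrative):

```python
import numpy as np

def sample_canonical_correlations(X, q):
    """Sample canonical correlations l_1 >= ... >= l_q from an N x p data matrix,
    the first q columns forming X_1 and the rest X_2 (uses Sigma_hat = A/N)."""
    N = X.shape[0]
    D = X - X.mean(axis=0)
    S = D.T @ D / N                                   # Sigma_hat = A/N
    S11, S12, S22 = S[:q, :q], S[:q, q:], S[q:, q:]
    M = np.linalg.solve(S11, S12 @ np.linalg.solve(S22, S12.T))
    eig = np.sort(np.linalg.eigvals(M).real)[::-1]    # eigenvalues are l_j^2
    return np.sqrt(np.clip(eig, 0, None))

rng = np.random.default_rng(7)
Z = rng.standard_normal((500, 1))
X = np.column_stack([Z + 0.5 * rng.standard_normal((500, 2)),   # X_1 (q = 2)
                     Z + 0.5 * rng.standard_normal((500, 3))])  # X_2 (r = 3)
print(np.round(sample_canonical_correlations(X, q=2), 3))
```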
In practice, when the population canonical correlations are not known, as is typically the case, we may not be interested in computing those λj which are likely to be small, an indication of which is given by the sample quantity lj being small. In fact, it is of interest to test whether some λj is 0, which implies λ_{j+1} = · · · = λq = 0 as well.

Note that the number of nonzero canonical correlation coefficients, or the number of non-zero eigenvalues of Σ11⁻¹Σ12Σ22⁻¹Σ21, is equal to

r(Σ11⁻¹Σ12Σ22⁻¹Σ21) = r(Σ12Σ22⁻¹Σ21) = r(Σ12B2⁻¹(B2⁻¹)′Σ21) = r[((B2⁻¹)′Σ21)′((B2⁻¹)′Σ21)] = r((B2⁻¹)′Σ21) = r(Σ12),

by the non-singularity of Σ11 and B2. So, acceptance of a test for λ_{m+1} = 0 and rejection of λm = 0 can be taken as acceptance of m = r(Σ12), because that means λ1 ≥ · · · ≥ λm > 0 = λ_{m+1} = · · · = λq. We state without proof that the likelihood ratio criterion for H0: λ_{m+1} = 0 ⇔ r(Σ12) ≤ m is ∏_{j=m+1}^q (1 − lj²)^{N/2}, where N is the sample size.

N tr[A11⁻¹A12A22⁻¹A21].
put Σ_d (p×p) = [ Σ1  0_{q×r} ; 0_{r×q}  Σ2 ] = diag(√σ11, . . . , √σpp).

Then we note that, partitioning the correlation matrix R := ((ρ_{Xi,Xj}))_{1≤i,j≤p} = Σ_d⁻¹ΣΣ_d⁻¹ in the obvious way,

[ R11  R12 ; R21  R22 ] = [ Σ1⁻¹Σ11Σ1⁻¹  Σ1⁻¹Σ12Σ2⁻¹ ; Σ2⁻¹Σ21Σ1⁻¹  Σ2⁻¹Σ22Σ2⁻¹ ].

So, multiplying the equations Σ12γj = λΣ11αj and Σ21αj = λΣ22γj of (7.4) on the left respectively by Σ1⁻¹ and Σ2⁻¹, we get

R12(Σ2γj) = λ R11(Σ1αj)  and  R21(Σ1αj) = λ R22(Σ2γj);

that is, the canonical correlations may equally be obtained from the correlation matrix R, with coefficient vectors Σ1αj and Σ2γj in place of αj and γj.
8 Elliptical distributions
The multivariate normal density (2π)^{−p/2}|Σ|^{−1/2} exp{−½(x − µ)′Σ⁻¹(x − µ)} depends on the argument x only through the function (x − µ)′Σ⁻¹(x − µ). Thus the level sets/contours of the function, i.e. inverse images of points in the range (0, ∞), are the sets

{ x : (x − µ)′Σ⁻¹(x − µ) = c }.

Densities of the form

|Λ|^{−1/2} g((x − ν)′Λ⁻¹(x − ν)),

for some ν ∈ R^p, positive definite Λ_{p×p} and some function g, are said to be elliptically contoured/elliptically symmetric, or simply elliptical.
In fact, the density of Y then can easily be seen to be g(y′y) = g(∥y∥²). It can be shown that g is involved really only in the marginal distribution of ∥Y∥, in the following sense: suppose we express y = (y1, . . . , yp)′ ∈ R^p \ {0} in polar coordinates (r, θ1, . . . , θ_{p−1}) as follows:

y1 = r sin θ1,
y2 = r cos θ1 sin θ2,
. . .
y_{p−1} = r cos θ1 · · · cos θ_{p−2} sin θ_{p−1},
yp = r cos θ1 · · · cos θ_{p−1}.

Then r = ∥y∥. Let us write y = ψ(r, θ), so that (r, θ) = ψ⁻¹(y). Note that ψ is a bijection from (0, ∞) × ∏_{j=1}^{p−2} (−π/2, π/2) × (−π, π] → R^p \ {0p}.
The Jacobian matrix J of partial derivatives of ψ has (j, 1)-th entry ∂yj/∂r and (j, k+1)-th entry ∂yj/∂θk, 1 ≤ k ≤ p − 1. Its first two rows are

( sin θ1,  r cos θ1,  0,  . . . ,  0 )  and  ( cos θ1 sin θ2,  −r sin θ1 sin θ2,  r cos θ1 cos θ2,  0,  . . . ,  0 ),

and its last row is

( ∏_{j=1}^{p−1} cos θj,  −r sin θ1 ∏_{j=2}^{p−1} cos θj,  −r cos θ1 sin θ2 ∏_{j=3}^{p−1} cos θj,  . . . ,  −r ∏_{j=1}^{p−2} cos θj · sin θ_{p−1} ),

the remaining rows following the same pattern. To compute its determinant, multiply on the right by the p×p matrix K whose first row is (r sin θ1, r sin θ2, . . . , r sin θ_{p−1}, 1), whose (j + 1, j)-th entry is cos θj for 1 ≤ j ≤ p − 1, and whose remaining entries are 0; and note that JK is an upper triangular matrix with diagonal entries comprising the vector

( r,  r cos θ1,  r cos θ1 cos θ2,  . . . ,  r ∏_{j=1}^{p−2} cos θj,  ∏_{j=1}^{p−1} cos θj ),

whose determinant is r^{p−1} ∏_{j=1}^{p−1} cos^{p−j} θj. Since |det K| = ∏_{j=1}^{p−1} cos θj, the absolute value of the Jacobian determinant |J| equals r^{p−1} ∏_{j=1}^{p−1} cos^{p−j−1} θj.
Hence the joint density of (R, Θ) = ψ⁻¹(Y) is g(r²) r^{p−1} ∏_{j=1}^{p−2} cos^{p−j−1} θj, which factorizes into a function of r alone and a function of θ alone; this shows that R and Θ = (Θ1, Θ2, . . . , Θ_{p−1}) are independent, and the marginal densities are obtained by integration:

f_R(r) = g(r²) r^{p−1} ∫_{−π}^{π} ∫_{−π/2}^{π/2} · · · ∫_{−π/2}^{π/2} ∏_{j=1}^{p−2} cos^{p−j−1} θj dθ1 · · · dθ_{p−2} dθ_{p−1}
 = g(r²) r^{p−1} · 2π · ∏_{j=1}^{p−2} ∫_{−π/2}^{π/2} cos^{p−j−1} θj dθj
 = g(r²) r^{p−1} · 2π · ∏_{j=1}^{p−2} [ Γ((p−j)/2) Γ(1/2) / Γ((p−j−1)/2 + 1) ]
 = g(r²) r^{p−1} · 2π · ∏_{j=2}^{p−1} [ √π Γ(j/2) / Γ((j+1)/2) ]
 = g(r²) r^{p−1} · 2π · π^{(p−2)/2} ∏_{j=2}^{p−1} [ Γ(j/2) / Γ((j+1)/2) ]
 = g(r²) r^{p−1} · 2π^{p/2}/Γ(p/2),
noting that

∫_{−π/2}^{π/2} cos^m θ dθ = 2 ∫_0^{π/2} cos^{m−1} θ cos θ dθ = 2 ∫_0^{π/2} (1 − sin²θ)^{(m−1)/2} d(sin θ) = 2 ∫_0^1 (1 − u²)^{(m−1)/2} du
 = 2 ∫_0^1 (1 − u²)^{(m−1)/2} (1/(2u)) d(u²) = ∫_0^1 (1 − v)^{(m+1)/2 − 1} v^{1/2 − 1} dv = B((m+1)/2, 1/2) = Γ((m+1)/2) Γ(1/2) / Γ(m/2 + 1).

The number A_p := 2π^{p/2}/Γ(p/2) is the surface area of the unit sphere S^{p−1} in R^p.
It follows that the random vector U := Y/∥Y∥ = Y/R, whose polar coordinates are (1, Θ1, Θ2, . . . , Θ_{p−1}), i.e. whose rectangular coordinates are given by

U = (sin Θ1, cos Θ1 sin Θ2, . . . , cos Θ1 · · · cos Θ_{p−2} sin Θ_{p−1}, cos Θ1 · · · cos Θ_{p−1})′,

has a uniform distribution on the sphere. As mentioned before, this is an example of a spherically symmetric distribution without a density. Clearly E(U) = 0; therefore E(Y) = 0. Further, the dispersion matrix of U is (1/p)Ip, hence that of Y is (E(R²)/p) Ip, provided E(R²) < ∞.
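This representation Y = R·U, with U uniform on the sphere and R an independent radial variable, also gives a direct way to simulate spherically symmetric vectors; a minimal sketch (NumPy assumed, names illustrative):

```python
import numpy as np

def spherical_sample(radial_sampler, p, size, seed=0):
    """Draw Y = R * U with U uniform on S^{p-1} (U = G/||G||, G ~ N_p(0, I))
    and R drawn from the given radial distribution, independently of U."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((size, p))
    U = G / np.linalg.norm(G, axis=1, keepdims=True)   # uniform on the unit sphere
    R = radial_sampler(rng, size)
    return R[:, None] * U

# example radial law: R^2 ~ chi^2_p reproduces Y ~ N_p(0, I_p)
Y = spherical_sample(lambda rng, m: np.sqrt(rng.chisquare(df=5, size=m)), p=5, size=100000)
print(np.round(np.cov(Y, rowvar=False), 2))            # approx (E R^2 / p) I_p = I_p
```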
9 Classification and Mahalanobis' D²
Assuming the two populations, π1 and π2 say, have the same (positive definite) dispersion matrix Σ but different means µ1 and µ2, their squared (Mahalanobis) distance is defined as ∆² := (µ1 − µ2)′Σ⁻¹(µ1 − µ2). While this quantity appears as a parameter in sampling distributions we encounter later, what we really need is some measure of how far a given observation is from each of the two distributions.
and so the likelihood ratio p1(x)/p2(x) is large enough, say ≥ k, iff

ln k ≤ −½ [ (x − µ1)′Σ⁻¹(x − µ1) − (x − µ2)′Σ⁻¹(x − µ2) ]
 = −½ [ {x′Σ⁻¹x − µ1′Σ⁻¹x − x′Σ⁻¹µ1 + µ1′Σ⁻¹µ1} − {x′Σ⁻¹x − µ2′Σ⁻¹x − x′Σ⁻¹µ2 + µ2′Σ⁻¹µ2} ]
 = −½ [ −2x′Σ⁻¹µ1 + 2x′Σ⁻¹µ2 + µ1′Σ⁻¹µ1 − µ2′Σ⁻¹µ2 ]
 = x′Σ⁻¹(µ1 − µ2) − ½ [ µ1′Σ⁻¹µ1 − µ1′Σ⁻¹µ2 + µ1′Σ⁻¹µ2 − µ2′Σ⁻¹µ2 ]
 = (x − ½(µ1 + µ2))′Σ⁻¹(µ1 − µ2),

and this last function is known as Fisher's linear discriminant function (LDF) in x.
The simplest rule is obtained when we choose k = 1, or ln k = 0: in this case we put x in π1 or π2 according as the LDF is positive or negative, which corresponds to the likelihood ratio being > or < 1, or equivalently, the distance of x from π1 being less or more than that from π2.

Fisher called this rule the 'best classifier' without being aware of the distance function. His idea was based on the following motivation: the LDF should maximize the following 't statistic' among linear functions. For each ℓ ∈ R^p, consider the quantity [E_{π1}(ℓ′X) − E_{π2}(ℓ′X)]² / V(ℓ′X), where the variance is under either π1 or π2. Fisher's idea was to maximize this, when µ1 ≠ µ2, over ℓ ∈ R^p \ {0p}, in order for the classification to achieve the maximum possible contrast.

The quantity equals (ℓ′µ1 − ℓ′µ2)²/(ℓ′Σℓ), so we recall that the maximizer is proportional to Σ⁻¹(µ1 − µ2), leading to the LDF, except that the average value under the mixture giving weights ½ and ½ to π1 and π2 is subtracted: the LDF equals

x′Σ⁻¹(µ1 − µ2) − ½ { E_{π1}[X′Σ⁻¹(µ1 − µ2)] + E_{π2}[X′Σ⁻¹(µ1 − µ2)] }.
The mixture could lead us to think of an equiprobable prior distribution on {π1, π2}. Indeed, among the several possible considerations other choices for k may arise from, a prominent one is the Bayesian paradigm with the additional incorporation of a cost function for misclassification. Suppose we put prior probabilities q1 and q2 = 1 − q1 on π1 and π2 respectively, and costs c(1|2) for an observation actually from π2 being misclassified into π1 and c(2|1) for the other misclassification error.

The natural objective would be to minimize the posterior expected cost. If our rule is to classify x into π1 if x ∈ R1 and into π2 if x ∈ R2 = R^p − R1, then the probabilities of misclassification are q1 ∫_{R2} p1(x) dx for an observation from π1 being classified into π2, and q2 ∫_{R1} p2(x) dx for the other case. Thus the posterior expected cost is

c(2|1) q1 ∫_{R2} p1(x) dx + c(1|2) q2 ∫_{R1} p2(x) dx = c(2|1) q1 + ∫_{R1} [ c(1|2) q2 p2(x) − c(2|1) q1 p1(x) ] dx.

Clearly the minimum value is attained by putting in R1 all x for which

c(1|2) q2 p2(x) < c(2|1) q1 p1(x)  ⇔  p1(x)/p2(x) > c(1|2)q2 / (c(2|1)q1),

and in R2 all x for which the opposite strict inequality holds; points of equality may be put either in R1 or R2. Thus the choice k = c(1|2)q2/(c(2|1)q1) can arise from this idea. Note that the original classifier with k = 1 is the special case when the costs of misclassification are symmetric, i.e. c(1|2) = c(2|1), and the prior makes no preference between π1 and π2, i.e. q2 = q1 = ½.
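The resulting rule (classify into π1 iff the LDF exceeds ln k, with k = c(1|2)q2/(c(2|1)q1)) is a one-liner once the parameters are known; a minimal sketch (NumPy assumed, names illustrative):

```python
import numpy as np

def classify_ldf(x, mu1, mu2, Sigma, q1=0.5, c12=1.0, c21=1.0):
    """Classify x into pi_1 iff the LDF exceeds ln k, k = c(1|2)q2 / (c(2|1)q1)."""
    Sigma_inv_diff = np.linalg.solve(Sigma, mu1 - mu2)
    ldf = (x - 0.5 * (mu1 + mu2)) @ Sigma_inv_diff         # Fisher's LDF at x
    ln_k = np.log(c12 * (1 - q1) / (c21 * q1))
    return 1 if ldf >= ln_k else 2

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
print(classify_ldf(np.array([0.4, 0.1]), mu1, mu2, Sigma))          # equal costs and priors
print(classify_ldf(np.array([0.4, 0.1]), mu1, mu2, Sigma, q1=0.1))  # strong prior on pi_2
```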
Often, the parameters of the two populations need to be estimated from data. Say, samples x1, . . . , x_{N1} and y1, . . . , y_{N2} of sizes N1 and N2 are given from π1 and π2 respectively. Then we estimate µ1 by x̄, µ2 by ȳ, and Σ by (A1 + A2)/(N1 + N2), with the obvious meanings for the statistics. It can be shown that the resulting statistic has a distribution with ∆² as parameter, under both π1 and π2.