STAT 5205
Matrix Approach to Simple
Linear Regression
Geometry of Vectors
• A vector of order n is a point in n-dimensional space
• The line running through the origin and the point
represented by the vector defines a 1-dimensional
subspace of the n-dim space
• Any p linearly independent vectors of order n, p < n
define a p-dimensional subspace of the n-dim space
• Two vectors x and y are orthogonal if x′y = y′x = 0; they form a 90° angle at the origin
• Two vectors x and y are linearly dependent if they form a 0° or 180° angle at the origin
Geometry of Vectors
• The length $L_{\boldsymbol{x}}$ of a vector $\boldsymbol{x} = (x_1, \dots, x_n)'$ (aka norm) is defined as
$$\text{length of } \boldsymbol{x} = \|\boldsymbol{x}\| = L_{\boldsymbol{x}} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \sqrt{\boldsymbol{x}'\boldsymbol{x}}$$
• Cauchy–Schwarz inequality:
$$|\boldsymbol{a} \cdot \boldsymbol{b}| \le \|\boldsymbol{a}\|\,\|\boldsymbol{b}\|$$
• How does it relate to statistics? If $\boldsymbol{x} = (x_1, \dots, x_n)'$ is a random sample with sample mean $\bar{x}$, then the sample standard deviation of $\boldsymbol{x}$ is
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}} = \frac{\text{length of } (\boldsymbol{x} - \bar{x}\boldsymbol{1})}{\sqrt{n-1}}$$
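A minimal numerical check of these two identities (the norm as the square root of x′x, and the sample standard deviation as the length of the centered vector divided by √(n−1)); the data values below are made up for illustration:

```python
import numpy as np

x = np.array([2.0, 5.0, 7.0, 10.0])   # illustrative data, not from the slides
n = len(x)

# length (norm) of x: sqrt(x'x)
L_x = np.sqrt(x @ x)
print(L_x, np.linalg.norm(x))          # same value

# sample SD as the length of the centered vector over sqrt(n - 1)
centered = x - x.mean()
s = np.linalg.norm(centered) / np.sqrt(n - 1)
print(s, np.std(x, ddof=1))            # matches the usual formula
```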
Geometry of Vectors
• The angle $\theta$ between two vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ is defined to be such that
$$\cos\theta = \frac{x_1 y_1 + x_2 y_2 + \cdots + x_n y_n}{L_{\boldsymbol{x}} L_{\boldsymbol{y}}} = \frac{\boldsymbol{x}'\boldsymbol{y}}{\sqrt{\boldsymbol{x}'\boldsymbol{x}}\sqrt{\boldsymbol{y}'\boldsymbol{y}}}
\;\;\Rightarrow\;\; \theta = \arccos\!\left(\frac{\boldsymbol{x}'\boldsymbol{y}}{\sqrt{\boldsymbol{x}'\boldsymbol{x}}\sqrt{\boldsymbol{y}'\boldsymbol{y}}}\right)$$
• If 𝒂 ⋅ 𝒃 = 𝟎 then the vectors are called orthogonal.
• In statistics: If two vectors each have mean 0 among their
elements, then cos(𝜃) is the correlation between the two
vectors.
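A quick sketch (with made-up data) showing that once both vectors are centered to have mean 0, cos(θ) coincides with the sample correlation:

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 8.0])
y = np.array([3.0, 1.0, 5.0, 7.0])

xc, yc = x - x.mean(), y - y.mean()    # both centered vectors now have mean 0

cos_theta = (xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
corr = np.corrcoef(x, y)[0, 1]
print(cos_theta, corr)                 # identical
```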
Example: Geometry of Vectors
Let
$$\boldsymbol{x} = \begin{pmatrix} -3 \\ -1 \\ 5 \\ 1 \end{pmatrix}, \qquad \boldsymbol{y} = \begin{pmatrix} 1 \\ 5 \\ -3 \\ 3 \end{pmatrix}$$
Find $\cos(\theta)$.
$$\cos\theta = \frac{x_1 y_1 + x_2 y_2 + \cdots + x_n y_n}{L_{\boldsymbol{x}} L_{\boldsymbol{y}}} = \frac{\boldsymbol{x}'\boldsymbol{y}}{\sqrt{\boldsymbol{x}'\boldsymbol{x}}\sqrt{\boldsymbol{y}'\boldsymbol{y}}}, \qquad
L_{\boldsymbol{x}} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
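A quick numerical check of this example using the vectors above:

```python
import numpy as np

x = np.array([-3, -1, 5, 1])
y = np.array([1, 5, -3, 3])

print(x @ y)                               # x'y = -20
print(np.linalg.norm(x), np.linalg.norm(y))  # 6.0 and sqrt(44)

cos_theta = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_theta)                           # approximately -0.50
```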
Geometry of Vectors
• The projection of y on x is:
$$\text{projection of } \boldsymbol{y} \text{ on } \boldsymbol{x} = \frac{\boldsymbol{y}'\boldsymbol{x}}{L_{\boldsymbol{x}}^2}\,\boldsymbol{x} = \frac{\boldsymbol{y}'\boldsymbol{x}}{\boldsymbol{x}'\boldsymbol{x}}\,\boldsymbol{x} = \frac{\boldsymbol{y}'\boldsymbol{x}}{L_{\boldsymbol{x}}}\,\frac{1}{L_{\boldsymbol{x}}}\,\boldsymbol{x}$$
• and the length of the projection is:
$$\frac{|\boldsymbol{y}'\boldsymbol{x}|}{L_{\boldsymbol{x}}} = L_{\boldsymbol{y}}\,\frac{|\boldsymbol{y}'\boldsymbol{x}|}{L_{\boldsymbol{x}} L_{\boldsymbol{y}}} = L_{\boldsymbol{y}}\,|\cos\theta|$$
• Applications in statistics: many, for example, regression and principal components.
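A small sketch of the projection formula and its length, using made-up vectors:

```python
import numpy as np

x = np.array([2.0, 0.0, 1.0])
y = np.array([1.0, 3.0, 2.0])

proj = (y @ x) / (x @ x) * x                 # projection of y on x
length = abs(y @ x) / np.linalg.norm(x)      # length of the projection

cos_theta = (y @ x) / (np.linalg.norm(x) * np.linalg.norm(y))
print(proj)
print(length, np.linalg.norm(y) * abs(cos_theta))   # the two lengths agree
```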
Graphical Depiction of Matrix Multiplication
Consider
$$\boldsymbol{M} = \begin{pmatrix} 1 & 2 \\ 3 & -1 \end{pmatrix}, \quad \boldsymbol{a} = \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \quad \boldsymbol{b} = \begin{pmatrix} 1 \\ 7 \end{pmatrix}, \quad \boldsymbol{c} = \begin{pmatrix} -2 \\ 3 \end{pmatrix}$$
Find the image of the quadrilateral 0abc.
• Graphical depiction of matrix multiplication: matrix multiplication scales/rotates/skews a geometric plane.
[Figure: a point x is mapped to Ax; the unit square spanned by (1, 0)′ and (0, 1)′ is mapped by A to the parallelogram spanned by the columns (a11, a21)′ and (a12, a22)′.]
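The image of the quadrilateral can be read off by multiplying each vertex by M (the origin is fixed):

```python
import numpy as np

M = np.array([[1, 2],
              [3, -1]])
a, b, c = np.array([3, 4]), np.array([1, 7]), np.array([-2, 3])

for name, v in [("a", a), ("b", b), ("c", c)]:
    print(name, "->", M @ v)   # a -> [11  5], b -> [15 -4], c -> [ 4 -9]
```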
DETERMINANTS OF SQUARE MATRICES
$$\boldsymbol{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad \det(\boldsymbol{A}) = a_{11}a_{22} - a_{21}a_{12}$$
$$|\det(\boldsymbol{A})| = \text{Area of the image of the unit square under } \boldsymbol{A}$$
Example:
$$\boldsymbol{M} = \begin{pmatrix} 1 & 2 \\ 3 & -1 \end{pmatrix} \;\Rightarrow\; \det(\boldsymbol{M}) = 1(-1) - 2(3) = -7$$
Time to think: What does the negative sign represent?
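A quick numerical check of the example determinant and of the area interpretation:

```python
import numpy as np

M = np.array([[1, 2],
              [3, -1]])

d = np.linalg.det(M)
print(d)        # -7 (up to floating point)
print(abs(d))   # 7 = area of the image of the unit square under M
```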
DETERMINANTS OF SQUARE MATRICES
In general,
$$\det(\boldsymbol{A}) = a_{i1}A_{i1} + a_{i2}A_{i2} + \cdots + a_{in}A_{in}, \qquad \forall\, i = 1, \dots, n$$
where the $A_{ij}$ are the so-called cofactors: $A_{ij} = (-1)^{i+j}$ times the minor obtained by deleting row $i$ and column $j$.
Exercise:
$$\boldsymbol{A} = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 1 & 2 \\ -3 & 4 & 1 \end{pmatrix}$$
Find $\det(\boldsymbol{A})$.
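A numerical check for this exercise: the cofactor expansion along the first row compared against numpy's determinant (this also reveals the answer, so try it by hand first):

```python
import numpy as np

A = np.array([[ 1, 2, 1],
              [ 0, 1, 2],
              [-3, 4, 1]])

def minor(A, i, j):
    # delete row i and column j
    return np.delete(np.delete(A, i, axis=0), j, axis=1)

# expansion along row 1: a11*A11 + a12*A12 + a13*A13
det_row1 = sum((-1) ** (0 + j) * A[0, j] * np.linalg.det(minor(A, 0, j))
               for j in range(3))
print(det_row1, np.linalg.det(A))   # both -16 (up to floating point)
```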
Matrix Inverse
• Note: For scalars (except 0), when we multiply a number by its reciprocal, we get 1:
$$2(1/2) = 1, \qquad x(1/x) = x(x^{-1}) = 1$$
• In matrix form, if A is a square matrix of full rank (all rows and columns are linearly independent), then A has an inverse A⁻¹ such that A⁻¹A = AA⁻¹ = I
• Example: Let
$$\boldsymbol{A} = \begin{pmatrix} 2 & 8 \\ 4 & -2 \end{pmatrix} \;\Rightarrow\; \boldsymbol{A}^{-1} = \begin{pmatrix} \tfrac{2}{36} & \tfrac{8}{36} \\ \tfrac{4}{36} & -\tfrac{2}{36} \end{pmatrix}$$
Verify:
$$\boldsymbol{A}^{-1}\boldsymbol{A} = \begin{pmatrix} \tfrac{2}{36} & \tfrac{8}{36} \\ \tfrac{4}{36} & -\tfrac{2}{36} \end{pmatrix}\begin{pmatrix} 2 & 8 \\ 4 & -2 \end{pmatrix}
= \begin{pmatrix} \tfrac{4}{36} + \tfrac{32}{36} & \tfrac{16}{36} - \tfrac{16}{36} \\ \tfrac{8}{36} - \tfrac{8}{36} & \tfrac{32}{36} + \tfrac{4}{36} \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \boldsymbol{I}$$
Computing an Inverse of a 2x2 Matrix
For a full-rank 2×2 matrix,
$$\boldsymbol{A} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \;\Rightarrow\; \boldsymbol{A}^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
Use of Inverse Matrix – Solving Simultaneous Equations
AY = C, where A and C are matrices of constants and Y is a matrix of unknowns.
$$\boldsymbol{A}^{-1}\boldsymbol{A}\boldsymbol{Y} = \boldsymbol{A}^{-1}\boldsymbol{C} \;\Rightarrow\; \boldsymbol{Y} = \boldsymbol{A}^{-1}\boldsymbol{C} \quad \text{(assuming } \boldsymbol{A} \text{ is square and full rank)}$$
Equation 1: $12y_1 + 6y_2 = 48$    Equation 2: $10y_1 - 2y_2 = 12$
$$\boldsymbol{A} = \begin{pmatrix} 12 & 6 \\ 10 & -2 \end{pmatrix}, \quad \boldsymbol{Y} = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad \boldsymbol{C} = \begin{pmatrix} 48 \\ 12 \end{pmatrix}, \quad \boldsymbol{Y} = \boldsymbol{A}^{-1}\boldsymbol{C}$$
$$\boldsymbol{A}^{-1} = \frac{1}{12(-2) - 6(10)}\begin{pmatrix} -2 & -6 \\ -10 & 12 \end{pmatrix} = \frac{1}{84}\begin{pmatrix} 2 & 6 \\ 10 & -12 \end{pmatrix}$$
$$\boldsymbol{Y} = \boldsymbol{A}^{-1}\boldsymbol{C} = \frac{1}{84}\begin{pmatrix} 2 & 6 \\ 10 & -12 \end{pmatrix}\begin{pmatrix} 48 \\ 12 \end{pmatrix}
= \frac{1}{84}\begin{pmatrix} 96 + 72 \\ 480 - 144 \end{pmatrix} = \frac{1}{84}\begin{pmatrix} 168 \\ 336 \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}$$
Note the wisdom of waiting to divide by | A | at end of calculation!
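The same system solved numerically; np.linalg.solve avoids forming the inverse explicitly, but both routes recover y1 = 2, y2 = 4:

```python
import numpy as np

A = np.array([[12.0,  6.0],
              [10.0, -2.0]])
C = np.array([48.0, 12.0])

print(np.linalg.inv(A) @ C)    # [2. 4.]  (explicit inverse, as on the slide)
print(np.linalg.solve(A, C))   # [2. 4.]  (numerically preferable)
```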
Useful Matrix Results
All rules assume that the matrices are conformable to operations.
• Addition rules:
A+B=B+A
(A + B) + C = A + (B + C)
• Multiplication rules:
(A B)C = A(BC)
C(A + B) = CA + CB
k(A + B) = kA + kB, where k is a scalar
• Transpose rules:
(Aʹ)ʹ = A
(A + B)ʹ = A ʹ + Bʹ
(ABC)ʹ = C ʹBʹAʹ
• Inverse rules (assuming square matrices of full rank):
(A⁻¹)⁻¹ = A
(ABC)⁻¹ = C⁻¹B⁻¹A⁻¹
(Aʹ)⁻¹ = (A⁻¹)ʹ
Important Matrix Results
In general,
• 𝑨𝑩 ≠ 𝑩𝑨 (no commutative law)
• 𝑨𝑩 = 𝟎 does not imply 𝑨 = 0 or 𝑩 = 0
• 𝑨𝑩 = 𝑨𝑪 does not imply 𝑩 = 𝑪 even if 𝑨 ≠ 𝟎
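Two tiny counterexamples (made-up matrices) illustrating the first two points:

```python
import numpy as np

# AB != BA in general
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])
print(A @ B)   # [[2 1], [4 3]]
print(B @ A)   # [[3 4], [1 2]]  -- different

# AB = 0 with neither A nor B equal to the zero matrix
A = np.array([[ 1, -1],
              [-1,  1]])
B = np.array([[1, 1],
              [1, 1]])
print(A @ B)   # all zeros
```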
PROPERTIES OF DETERMINANTS
• Another notation for det(A) is |A|
• |A′| = |A|
• |AB| = |A||B| when A and B are square matrices of the same dimension
• |A⁻¹| = |A|⁻¹
• Partitioned matrices:
$$\begin{vmatrix} \boldsymbol{A}_{11} & \boldsymbol{A}_{12} \\ \boldsymbol{A}_{21} & \boldsymbol{A}_{22} \end{vmatrix}
= |\boldsymbol{A}_{11}|\,\bigl|\boldsymbol{A}_{22} - \boldsymbol{A}_{21}\boldsymbol{A}_{11}^{-1}\boldsymbol{A}_{12}\bigr|
= |\boldsymbol{A}_{22}|\,\bigl|\boldsymbol{A}_{11} - \boldsymbol{A}_{12}\boldsymbol{A}_{22}^{-1}\boldsymbol{A}_{21}\bigr|$$
Trace of a matrix
Definition: If A is a square n × n matrix, then the trace is the operator defined as
$$\operatorname{tr}(\boldsymbol{A}) = \sum_{i=1}^{n} a_{ii}$$
Properties:
• tr(𝛼A) = 𝛼 tr(A)
• tr(A + B) = tr(A) + tr(B)
• tr(AB) = tr(BA)
• $\operatorname{tr}(\boldsymbol{A}'\boldsymbol{A}) = \operatorname{tr}(\boldsymbol{A}\boldsymbol{A}') = \sum_{i=1}^{n}\sum_{j=1}^{p} a_{ij}^2$, where $\boldsymbol{A}$ is an $n \times p$ matrix
• Circular shift: tr(ABC) = tr(BCA) = tr(CAB)
• It is a useful tool when we need to find the MLE of the covariance matrix of the multivariate normal distribution.
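A quick numerical check of several of these trace properties using arbitrary matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # an n x p matrix
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))

# tr(A'A) = tr(AA') = sum of squared entries
print(np.trace(A.T @ A), np.trace(A @ A.T), (A ** 2).sum())

# cyclic (circular shift) property
print(np.trace(B @ C), np.trace(C @ B))
```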
Random Vectors and Matrices
Shown for the case of n = 3; this generalizes to any n:
Random variables: $Y_1, Y_2, Y_3$,
$$\boldsymbol{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix},
\qquad \text{Expectation: } E\{\boldsymbol{Y}\} = \begin{pmatrix} E\{Y_1\} \\ E\{Y_2\} \\ E\{Y_3\} \end{pmatrix}$$
In general: $E\{\boldsymbol{Y}\} = \bigl[E\{Y_{ij}\}\bigr]_{n \times p}, \quad i = 1, \dots, n;\; j = 1, \dots, p$
Variance-Covariance Matrix for a Random Vector:
$$\boldsymbol{\sigma}^2\{\boldsymbol{Y}\} = E\bigl\{(\boldsymbol{Y} - E\{\boldsymbol{Y}\})(\boldsymbol{Y} - E\{\boldsymbol{Y}\})'\bigr\}
= E\begin{pmatrix}
(Y_1 - E\{Y_1\})^2 & (Y_1 - E\{Y_1\})(Y_2 - E\{Y_2\}) & (Y_1 - E\{Y_1\})(Y_3 - E\{Y_3\}) \\
(Y_2 - E\{Y_2\})(Y_1 - E\{Y_1\}) & (Y_2 - E\{Y_2\})^2 & (Y_2 - E\{Y_2\})(Y_3 - E\{Y_3\}) \\
(Y_3 - E\{Y_3\})(Y_1 - E\{Y_1\}) & (Y_3 - E\{Y_3\})(Y_2 - E\{Y_2\}) & (Y_3 - E\{Y_3\})^2
\end{pmatrix}
= \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{pmatrix}$$
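A short simulation sketch (with an assumed mean vector and covariance matrix, chosen only for illustration) showing that the sample mean vector and sample covariance matrix estimate E{Y} and σ²{Y}:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 2.0, 3.0])             # assumed E{Y}
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])        # assumed sigma^2{Y}

Y = rng.multivariate_normal(mu, Sigma, size=50_000)   # each row is a draw of Y

print(Y.mean(axis=0))            # close to mu
print(np.cov(Y, rowvar=False))   # close to Sigma
```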
Linear Regression Example (n=3)
Error terms are assumed to be independent, with mean 0 and constant variance $\sigma^2$:
$$E\{\varepsilon_i\} = 0, \qquad \sigma^2\{\varepsilon_i\} = \sigma^2, \qquad \sigma\{\varepsilon_i, \varepsilon_j\} = 0, \;\; i \ne j$$
$$\boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{pmatrix}, \qquad
E\{\boldsymbol{\varepsilon}\} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad
\boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\} = \begin{pmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{pmatrix} = \sigma^2\boldsymbol{I}$$
$$\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad
E\{\boldsymbol{Y}\} = E\{\boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\} = \boldsymbol{X}\boldsymbol{\beta} + E\{\boldsymbol{\varepsilon}\} = \boldsymbol{X}\boldsymbol{\beta}$$
$$\boldsymbol{\sigma}^2\{\boldsymbol{Y}\} = \boldsymbol{\sigma}^2\{\boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\} = \boldsymbol{\sigma}^2\{\boldsymbol{\varepsilon}\} = \sigma^2\boldsymbol{I}$$
Mean and Variance of Linear Functions of Y
Frequently we encounter a random vector W that is obtained by multiplying a random vector Y by a constant matrix A:
$$\boldsymbol{W} = \boldsymbol{A}\boldsymbol{Y}$$
That is, if $\boldsymbol{A}$ is $k \times n$ and $\boldsymbol{Y}$ is $n \times 1$, then $\boldsymbol{W}$ is $k \times 1$ with:
$$\boldsymbol{W} = \begin{pmatrix} W_1 \\ \vdots \\ W_k \end{pmatrix} = \begin{pmatrix} a_{11}Y_1 + \cdots + a_{1n}Y_n \\ \vdots \\ a_{k1}Y_1 + \cdots + a_{kn}Y_n \end{pmatrix}$$
Some basic results:
$$E\{\boldsymbol{W}\} = \boldsymbol{A}\,E\{\boldsymbol{Y}\}, \qquad
\boldsymbol{\sigma}^2\{\boldsymbol{W}\} = \boldsymbol{\sigma}^2\{\boldsymbol{A}\boldsymbol{Y}\} = \boldsymbol{A}\,\boldsymbol{\sigma}^2\{\boldsymbol{Y}\}\,\boldsymbol{A}'
\quad \text{or} \quad \operatorname{cov}(\boldsymbol{W}) = \boldsymbol{A}\operatorname{cov}(\boldsymbol{Y})\boldsymbol{A}'$$
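A minimal sketch of these rules, with an assumed mean vector and covariance matrix for Y (values made up for illustration):

```python
import numpy as np

A = np.array([[1.0, 1.0,  0.0],
              [0.0, 1.0, -1.0]])            # k x n constant matrix
mu_Y = np.array([1.0, 2.0, 3.0])            # assumed E{Y}
Sigma_Y = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])       # assumed sigma^2{Y}

E_W = A @ mu_Y                              # E{W} = A E{Y}
Cov_W = A @ Sigma_Y @ A.T                   # sigma^2{W} = A sigma^2{Y} A'
print(E_W)
print(Cov_W)
```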
Exercise (5.18. on p. 211)
Consider the following functions of the random variables Y1, Y2, Y3, and Y4 :
$$W_1 = \tfrac{1}{4}(Y_1 + Y_2 + Y_3 + Y_4)$$
$$W_2 = \tfrac{1}{2}(Y_1 + Y_2) - \tfrac{1}{2}(Y_3 + Y_4)$$
a) State the above in matrix notation;
b) Find E(W);
c) Find the cov(W)
Multivariate Normal Distribution
The observations vector Y contains an observation from each of the p variables:
$$\boldsymbol{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_p \end{pmatrix}$$
The mean vector E{Y}, denoted by $\boldsymbol{\mu}$, contains the expected value of each of the p variables:
$$E\{\boldsymbol{Y}\} = \boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}$$
Finally, the covariance matrix $\boldsymbol{\sigma}^2\{\boldsymbol{Y}\}$, denoted $\boldsymbol{\Sigma}$, contains the variances and covariances:
$$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \dots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \dots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \dots & \sigma_p^2 \end{pmatrix}$$
Multivariate Normal Distribution
The density function of the multivariate normal distribution can be stated as:
$$f(\boldsymbol{Y}) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\boldsymbol{Y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\boldsymbol{Y} - \boldsymbol{\mu})\right\}$$
We abbreviate this as:
$$\boldsymbol{Y} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$
It can be shown that marginally each Yi is normally distributed:
$$Y_i \sim N(\mu_i, \sigma_i^2), \quad i = 1, \dots, p, \qquad \text{and} \qquad \sigma\{Y_i, Y_j\} = \sigma_{ij}, \;\; i \ne j$$
Theorem: If A is a matrix of fixed constants, then:
$$\boldsymbol{W} = \boldsymbol{A}\boldsymbol{Y} \sim N(\boldsymbol{A}\boldsymbol{\mu},\; \boldsymbol{A}\boldsymbol{\Sigma}\boldsymbol{A}')$$
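A simulation sketch of this theorem (all numbers assumed for illustration): draws of Y from N(μ, Σ), transformed by a fixed A, have sample mean near Aμ and sample covariance near AΣA':

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
A = np.array([[1.0,  1.0],
              [1.0, -1.0]])

Y = rng.multivariate_normal(mu, Sigma, size=100_000)
W = Y @ A.T                              # each row is A y

print(W.mean(axis=0), A @ mu)            # close
print(np.cov(W, rowvar=False))           # close to A Sigma A'
print(A @ Sigma @ A.T)
```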
Simple Linear Regression in Matrix Form
Estimating Parameters by Least Squares
Normal equations are obtained from $\partial Q / \partial \beta_0$ and $\partial Q / \partial \beta_1$, setting each equal to 0:
$$nb_0 + b_1 \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i$$
$$b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i$$
Note: In matrix form:
$$\boldsymbol{X}'\boldsymbol{X} = \begin{pmatrix} n & \sum_{i=1}^{n} X_i \\ \sum_{i=1}^{n} X_i & \sum_{i=1}^{n} X_i^2 \end{pmatrix}, \qquad
\boldsymbol{X}'\boldsymbol{Y} = \begin{pmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_i Y_i \end{pmatrix}, \qquad
\text{defining } \boldsymbol{b} = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}$$
$$\boldsymbol{X}'\boldsymbol{X}\boldsymbol{b} = \boldsymbol{X}'\boldsymbol{Y} \;\Rightarrow\; \boldsymbol{b} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Y}$$
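A minimal sketch (with made-up data) showing that solving the normal equations X'Xb = X'Y reproduces the familiar scalar slope and intercept formulas:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # illustrative predictor
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # illustrative response

X = np.column_stack([np.ones_like(x), x]) # design matrix with a column of 1s
b = np.linalg.solve(X.T @ X, X.T @ Y)     # b = (X'X)^{-1} X'Y

# scalar formulas: b1 = SS_XY / SS_XX, b0 = Ybar - b1 * Xbar
b1 = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = Y.mean() - b1 * x.mean()
print(b, (b0, b1))                        # same estimates
```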
Exercise:
Based on the matrix form:
$$Q = (\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta})'(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}) = \boldsymbol{Y}'\boldsymbol{Y} - \boldsymbol{Y}'\boldsymbol{X}\boldsymbol{\beta} - \boldsymbol{\beta}'\boldsymbol{X}'\boldsymbol{Y} + \boldsymbol{\beta}'\boldsymbol{X}'\boldsymbol{X}\boldsymbol{\beta}$$
$$= \boldsymbol{Y}'\boldsymbol{Y} - 2\left(\beta_0 \sum_{i=1}^{n} Y_i + \beta_1 \sum_{i=1}^{n} X_i Y_i\right) + n\beta_0^2 + 2\beta_0\beta_1 \sum_{i=1}^{n} X_i + \beta_1^2 \sum_{i=1}^{n} X_i^2$$
$$\frac{\partial Q}{\partial \boldsymbol{\beta}} = \begin{pmatrix} \dfrac{\partial Q}{\partial \beta_0} \\[2mm] \dfrac{\partial Q}{\partial \beta_1} \end{pmatrix}
= \begin{pmatrix} -2\sum Y_i + 2n\beta_0 + 2\beta_1 \sum X_i \\ -2\sum X_i Y_i + 2\beta_0 \sum X_i + 2\beta_1 \sum X_i^2 \end{pmatrix}
= -2\boldsymbol{X}'\boldsymbol{Y} + 2\boldsymbol{X}'\boldsymbol{X}\boldsymbol{\beta}$$
Setting this equal to zero, and replacing $\boldsymbol{\beta}$ with $\boldsymbol{b}$: $\boldsymbol{X}'\boldsymbol{X}\boldsymbol{b} = \boldsymbol{X}'\boldsymbol{Y}$
Verify that the formula $\boldsymbol{b} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Y}$ reduces to the previous version with $SS_{XX}$, $SS_{XY}$, …
Fitted Values and Residuals
$$\hat{Y}_i = b_0 + b_1 X_i, \qquad e_i = Y_i - \hat{Y}_i$$
In matrix form:
$$\hat{\boldsymbol{Y}} = \begin{pmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{pmatrix}
= \begin{pmatrix} b_0 + b_1 X_1 \\ b_0 + b_1 X_2 \\ \vdots \\ b_0 + b_1 X_n \end{pmatrix}
= \boldsymbol{X}\boldsymbol{b} = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Y} = \boldsymbol{H}\boldsymbol{Y},
\qquad \boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'$$
H is called the "hat" or "projection" matrix; note that H is idempotent (HH = H) and symmetric (H = H'):
$$\boldsymbol{H}\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}' = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}' = \boldsymbol{H}$$
$$\boldsymbol{H}' = \bigl(\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\bigr)' = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}' = \boldsymbol{H}$$
$$\boldsymbol{e} = \begin{pmatrix} Y_1 - \hat{Y}_1 \\ Y_2 - \hat{Y}_2 \\ \vdots \\ Y_n - \hat{Y}_n \end{pmatrix}
= \boldsymbol{Y} - \hat{\boldsymbol{Y}} = \boldsymbol{Y} - \boldsymbol{X}\boldsymbol{b} = \boldsymbol{Y} - \boldsymbol{H}\boldsymbol{Y} = (\boldsymbol{I} - \boldsymbol{H})\boldsymbol{Y}$$
Note:
$$E\{\hat{\boldsymbol{Y}}\} = E\{\boldsymbol{H}\boldsymbol{Y}\} = \boldsymbol{H}\,E\{\boldsymbol{Y}\} = \boldsymbol{H}\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X}\boldsymbol{\beta},
\qquad \boldsymbol{\sigma}^2\{\hat{\boldsymbol{Y}}\} = \boldsymbol{H}\,\sigma^2\boldsymbol{I}\,\boldsymbol{H}' = \sigma^2\boldsymbol{H}$$
$$E\{\boldsymbol{e}\} = E\{(\boldsymbol{I} - \boldsymbol{H})\boldsymbol{Y}\} = (\boldsymbol{I} - \boldsymbol{H})E\{\boldsymbol{Y}\} = (\boldsymbol{I} - \boldsymbol{H})\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X}\boldsymbol{\beta} - \boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{0},
\qquad \boldsymbol{\sigma}^2\{\boldsymbol{e}\} = (\boldsymbol{I} - \boldsymbol{H})\,\sigma^2\boldsymbol{I}\,(\boldsymbol{I} - \boldsymbol{H})' = \sigma^2(\boldsymbol{I} - \boldsymbol{H})$$
$$\boldsymbol{s}^2\{\hat{\boldsymbol{Y}}\} = MSE \cdot \boldsymbol{H}, \qquad \boldsymbol{s}^2\{\boldsymbol{e}\} = MSE \cdot (\boldsymbol{I} - \boldsymbol{H})$$
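Continuing the made-up example from before, a short sketch of the hat matrix and its properties, with fitted values and residuals:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
print(np.allclose(H, H.T))                 # symmetric
print(np.allclose(H @ H, H))               # idempotent

Y_hat = H @ Y                              # fitted values
e = (np.eye(len(Y)) - H) @ Y               # residuals
print(np.allclose(Y_hat + e, Y))           # Y = fitted + residual
```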
Analysis of Variance
How do we write sum of squares in a matrix form?
$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}$$
$$\sum_{i=1}^{n} Y_i^2 = \boldsymbol{Y}'\boldsymbol{Y}, \qquad
\frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} = \frac{1}{n}\boldsymbol{Y}'\boldsymbol{J}\boldsymbol{Y}, \qquad
\boldsymbol{J} = \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix}$$
$$\Rightarrow\; SST = \boldsymbol{Y}'\boldsymbol{Y} - \frac{1}{n}\boldsymbol{Y}'\boldsymbol{J}\boldsymbol{Y} = \boldsymbol{Y}'\left(\boldsymbol{I} - \frac{1}{n}\boldsymbol{J}\right)\boldsymbol{Y}$$
$$SSE = \boldsymbol{e}'\boldsymbol{e} = (\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{b})'(\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{b})
= \boldsymbol{Y}'\boldsymbol{Y} - \boldsymbol{Y}'\boldsymbol{X}\boldsymbol{b} - \boldsymbol{b}'\boldsymbol{X}'\boldsymbol{Y} + \boldsymbol{b}'\boldsymbol{X}'\boldsymbol{X}\boldsymbol{b}
= \boldsymbol{Y}'\boldsymbol{Y} - \boldsymbol{b}'\boldsymbol{X}'\boldsymbol{Y} = \boldsymbol{Y}'(\boldsymbol{I} - \boldsymbol{H})\boldsymbol{Y}$$
since $\boldsymbol{b}'\boldsymbol{X}'\boldsymbol{Y} = \boldsymbol{Y}'\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Y} = \boldsymbol{Y}'\boldsymbol{H}\boldsymbol{Y}$
Exercise:
Show that
$$SSR = \boldsymbol{Y}'\left(\boldsymbol{H} - \frac{1}{n}\boldsymbol{J}\right)\boldsymbol{Y}$$
Note: All of these sums of squares are of the form 𝑌 ′ 𝐴𝑌 where A is a symmetric
matrix.
Time to think: What is the name of expressions like 𝑌 ′ 𝐴𝑌 ?
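A check (same made-up data as before) that the matrix versions of SST, SSE, and SSR match the scalar definitions and decompose correctly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])
n = len(Y)

H = X @ np.linalg.inv(X.T @ X) @ X.T
J = np.ones((n, n))
I = np.eye(n)

SST = Y @ (I - J / n) @ Y
SSE = Y @ (I - H) @ Y
SSR = Y @ (H - J / n) @ Y

print(np.isclose(SST, SSE + SSR))                     # True
print(np.isclose(SST, np.sum((Y - Y.mean()) ** 2)))   # matches the scalar formula
```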
Eigenvalues and Eigenvectors
Def: If A is a square matrix and 𝜆 is a scalar and x is a nonzero vector such that
Ax = 𝜆x
then we say that 𝜆 is an eigenvalue of A and x is its corresponding eigenvector.
Note: To find the eigenvalues we solve |A - 𝜆I| = 0
Note: If A is n⨉n then A has n eigenvalues 𝜆1, … , 𝜆n.
The 𝜆’s are not necessarily all distinct, or nonzero or real numbers.
Properties:
• $\operatorname{tr}(\boldsymbol{A}) = \sum_{i=1}^{n} \lambda_i$
• $|\boldsymbol{A}| = \prod_{i=1}^{n} \lambda_i$
Note: A symmetric matrix is positive definite if all the eigenvalues are positive.
Exercise: Find the eigenvalues and eigenvectors of
$$\boldsymbol{A} = \begin{pmatrix} 10 & 3 \\ 3 & 8 \end{pmatrix}$$
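A numerical check for the exercise matrix; the trace and determinant properties above are easy to confirm alongside it:

```python
import numpy as np

A = np.array([[10.0, 3.0],
              [3.0,  8.0]])

lam, V = np.linalg.eig(A)            # eigenvalues; eigenvectors are the columns of V
print(lam)                           # both positive, so A is positive definite
print(lam.sum(), np.trace(A))        # equal: 18
print(lam.prod(), np.linalg.det(A))  # equal: 71

# each column v of V satisfies A v = lambda v
print(np.allclose(A @ V, V * lam))
```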
Inferences in Linear Regression
$$\boldsymbol{b} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{Y} \;\Rightarrow\;
E\{\boldsymbol{b}\} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'E\{\boldsymbol{Y}\} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{\beta}$$
$$\boldsymbol{\sigma}^2\{\boldsymbol{b}\} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\,\boldsymbol{\sigma}^2\{\boldsymbol{Y}\}\,\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}
= \sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{I}\boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}
= \sigma^2(\boldsymbol{X}'\boldsymbol{X})^{-1},
\qquad \boldsymbol{s}^2\{\boldsymbol{b}\} = MSE \cdot (\boldsymbol{X}'\boldsymbol{X})^{-1}$$
Recall:
$$(\boldsymbol{X}'\boldsymbol{X})^{-1} = \begin{pmatrix} \dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2} & \dfrac{-\bar{X}}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \\[3mm] \dfrac{-\bar{X}}{\sum_{i=1}^{n}(X_i - \bar{X})^2} & \dfrac{1}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \end{pmatrix}
\;\Rightarrow\;
\boldsymbol{s}^2\{\boldsymbol{b}\} = \begin{pmatrix} \dfrac{MSE}{n} + \dfrac{\bar{X}^2\,MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2} & \dfrac{-\bar{X}\,MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \\[3mm] \dfrac{-\bar{X}\,MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2} & \dfrac{MSE}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \end{pmatrix}$$
Estimated Mean Response at $X = X_h$:
$$\hat{Y}_h = b_0 + b_1 X_h = \boldsymbol{X}_h'\boldsymbol{b}, \qquad \boldsymbol{X}_h = \begin{pmatrix} 1 \\ X_h \end{pmatrix},
\qquad s^2\{\hat{Y}_h\} = \boldsymbol{X}_h'\,\boldsymbol{s}^2\{\boldsymbol{b}\}\,\boldsymbol{X}_h = MSE \cdot \bigl(\boldsymbol{X}_h'(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}_h\bigr)$$
Predicted New Response at $X = X_h$:
$$\hat{Y}_h = b_0 + b_1 X_h = \boldsymbol{X}_h'\boldsymbol{b},
\qquad s^2\{\text{pred}\} = MSE \cdot \bigl(1 + \boldsymbol{X}_h'(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}_h\bigr)$$
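Continuing the same made-up data, a sketch of s²{b}, the estimated variance of the mean response at Xh, and the prediction variance (the value Xh = 3.5 is chosen only for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
e = Y - X @ b
MSE = (e @ e) / (n - p)                    # here p = 2, so n - 2 degrees of freedom

s2_b = MSE * XtX_inv                       # s^2{b} = MSE (X'X)^{-1}
print(s2_b)

Xh = np.array([1.0, 3.5])                  # (1, X_h)' for X_h = 3.5
s2_mean = MSE * (Xh @ XtX_inv @ Xh)        # variance of estimated mean response
s2_pred = MSE * (1 + Xh @ XtX_inv @ Xh)    # variance for predicting a new response
print(Xh @ b, s2_mean, s2_pred)
```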