Lecture 2: Probability and linear algebra basics
Statistical Learning (BST 263)
Jeffrey W. Miller
Department of Biostatistics
Harvard T.H. Chan School of Public Health
Outline
Linear algebra basics
Probability basics
Random vectors
Linear algebra in this course
A little bit of linear algebra is essential for understanding
many machine learning methods.
- E.g., linear regression, logistic regression, LDA, QDA, PCA, GAMs, kernel ridge, SVMs, K-means.
Linear algebra is not a prerequisite for this course, so I made
the following slides to give you the basic concepts needed.
You will need to study this material carefully if you are not
already familiar with it.
Matrices and transposes
A is an m × n real matrix, written A ∈ R^{m×n}, if
\[ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \]
where a_{ij} ∈ R. The (i, j)th entry of A is A_{ij} = a_{ij}.
The transpose of A ∈ R^{m×n} is defined as
\[ A^T = \begin{pmatrix} A_{11} & A_{21} & \cdots & A_{m1} \\ A_{12} & A_{22} & \cdots & A_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ A_{1n} & A_{2n} & \cdots & A_{mn} \end{pmatrix} \in R^{n×m}. \]
In other words, (A^T)_{ij} = A_{ji}.
Note: x ∈ R^n is considered to be a column vector in R^{n×1}.
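As a concrete illustration (a Python/NumPy sketch, not part of the original slides), the indexing, transpose, and column-vector conventions look like this in code:

    import numpy as np

    # A 2 x 3 real matrix: 2 rows, 3 columns.
    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

    print(A.shape)      # (2, 3), i.e., A is in R^{2x3}
    print(A[0, 1])      # entry A_{12} = 2.0 (NumPy indexing is 0-based)
    print(A.T.shape)    # (3, 2): the transpose swaps rows and columns
    print(A.T[1, 0])    # (A^T)_{21} = A_{12} = 2.0

    # A vector x in R^3, treated as a 3 x 1 column vector when needed.
    x = np.array([1.0, 2.0, 3.0]).reshape(3, 1)
    print(x.shape)      # (3, 1)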
Sums and products of matrices
The sum of matrices A ∈ R^{m×n} and B ∈ R^{m×n} is the matrix A + B ∈ R^{m×n} such that
\[ (A + B)_{ij} = A_{ij} + B_{ij}. \]
The product of matrices A ∈ R^{m×n} and B ∈ R^{n×ℓ} is the matrix AB ∈ R^{m×ℓ} such that
\[ (AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}. \]
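For instance (an illustrative NumPy sketch, not part of the original slides), the product formula can be checked entrywise:

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
    B = np.array([[5.0, 6.0, 7.0],
                  [8.0, 9.0, 10.0]])

    S = A + np.ones((2, 2))    # entrywise sum (the shapes must match)
    P = A @ B                  # matrix product: (2 x 2)(2 x 3) -> 2 x 3

    # Check (AB)_{ij} = sum_k A_{ik} B_{kj} for one entry, say i = 0, j = 2.
    manual = sum(A[0, k] * B[k, 2] for k in range(A.shape[1]))
    print(np.isclose(P[0, 2], manual))    # True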
Basic matrix properties
In the following properties, it is assumed that the matrix
dimensions are compatible. (For example, if we write A + B then
it is assumed that A and B are the same size.)
(AB)C = A(BC)
- Consequently, we can write ABC without specifying the order in which the multiplications are performed.
A(B + C) = AB + AC
(B + C)A = BA + CA
Except in special circumstances, AB is not equal to BA.
(AB)^T = B^T A^T
(A + B)^T = A^T + B^T
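These identities are easy to verify numerically; here is a rough NumPy check (illustrative only, not part of the original slides) on random matrices:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 4))
    B = rng.standard_normal((4, 2))
    C = rng.standard_normal((2, 5))

    print(np.allclose((A @ B) @ C, A @ (B @ C)))    # associativity: (AB)C = A(BC)
    print(np.allclose((A @ B).T, B.T @ A.T))        # (AB)^T = B^T A^T

    # In general AB != BA, even when both products are defined.
    D = rng.standard_normal((3, 3))
    E = rng.standard_normal((3, 3))
    print(np.allclose(D @ E, E @ D))                # almost surely False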
Identity, inverse, and trace
The n × n identity matrix, denoted I_{n×n} or I for short, is
\[ I = I_{n×n} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \in R^{n×n}. \]
IA = A = AI
If it exists, the inverse of A, denoted A^{-1}, is a matrix such that A^{-1} A = I and A A^{-1} = I.
If A^{-1} exists, we say that A is invertible.
(A^{-1})^T = (A^T)^{-1}
(AB)^{-1} = B^{-1} A^{-1}
The trace of a square matrix A ∈ R^{n×n}, denoted tr(A), is defined as
\[ tr(A) = \sum_{i=1}^{n} A_{ii}. \]
tr(AB) = tr(BA) if AB is a square matrix.
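A brief numerical check of the inverse and trace facts (a NumPy sketch, not part of the original slides):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4))    # a random square matrix is almost surely invertible
    B = rng.standard_normal((4, 4))
    I = np.eye(4)

    Ainv = np.linalg.inv(A)
    print(np.allclose(Ainv @ A, I) and np.allclose(A @ Ainv, I))        # A^{-1} A = I = A A^{-1}
    print(np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ Ainv))   # (AB)^{-1} = B^{-1} A^{-1}
    print(np.allclose(np.trace(A @ B), np.trace(B @ A)))                # tr(AB) = tr(BA)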
Symmetric and definite matrices
A is symmetric if A = A^T.
A is symmetric positive semi-definite (SPSD) if and only if A = B^T B for some B ∈ R^{m×n} and some m.
A is symmetric positive definite (SPD) if and only if A is SPSD and A^{-1} exists.
There are many equivalent definitions of SPSD and SPD
(which is why I wrote “if and only if”). I believe the
definitions above are the easiest to understand and use.
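As an illustration (a NumPy sketch, not part of the original slides), any matrix of the form B^T B is SPSD, which can be checked via its eigenvalues:

    import numpy as np

    rng = np.random.default_rng(2)
    B = rng.standard_normal((5, 3))
    A = B.T @ B                          # 3 x 3, SPSD by construction

    print(np.allclose(A, A.T))           # symmetric
    eigvals = np.linalg.eigvalsh(A)      # eigenvalues of a symmetric matrix
    print(np.all(eigvals >= -1e-12))     # all nonnegative (up to rounding): positive semi-definite
    print(np.all(eigvals > 0))           # here B has full column rank (almost surely), so A is SPD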
Outline
Linear algebra basics
Probability basics
Random vectors
Discrete random variables
Informally, a random variable (r.v.) is a quantity that
probabilistically takes any one of a range of values.
Notation: Uppercase for r.v.s, lowercase for values taken.
A random variable X is discrete if it takes values in a countable set \mathcal{X} = {x_1, x_2, . . .}.
Examples: Bernoulli, Binomial, Poisson, Geometric.
The density of a discrete r.v. is the function
p(x) = P(X = x) = probability that X equals x.
- Sometimes, p(x) is called the probability mass function in the discrete case, but "density" is technically correct also.
Properties (discrete case):
\[ 0 ≤ p(x) ≤ 1, \qquad \sum_{x ∈ \mathcal{X}} p(x) = 1, \qquad P(X ∈ A) = \sum_{x ∈ A} p(x). \]
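For example (a SciPy sketch, not part of the original slides), a Binomial(10, 0.3) random variable has a density that satisfies all three properties:

    import numpy as np
    from scipy.stats import binom

    n, q = 10, 0.3
    x = np.arange(n + 1)                  # the possible values {0, 1, ..., 10}
    p = binom.pmf(x, n, q)                # p(x) = P(X = x)

    print(np.all((p >= 0) & (p <= 1)))    # 0 <= p(x) <= 1
    print(np.isclose(p.sum(), 1.0))       # the probabilities sum to 1
    print(p[x <= 2].sum())                # P(X in A) for A = {0, 1, 2}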
Continuous random variables
A random variable X ∈ R is continuous if there is a function p(x) ≥ 0 such that P(X ∈ A) = \int_A p(x) dx for all A ⊆ R.
- (We will ignore measure-theoretic technicalities in this course.)
Examples: Normal, Uniform, Beta, Gamma, Exponential.
p(x) is called the density of X.
Careful! p(x) is not the probability that X equals x.
Note that \int_R p(x) dx = 1, but p(x) can be > 1.
The same definitions apply to random vectors X ∈ R^n, with R^n in place of R.
The cumulative distribution function (c.d.f.) of X ∈ R is
\[ F(x) = P(X ≤ x) = \int_{-∞}^{x} p(x') dx'. \]
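A small numerical illustration (SciPy sketch, not part of the original slides): a Normal density with a small standard deviation exceeds 1 near its mean yet still integrates to 1, and the c.d.f. is the integral of the density:

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    X = norm(loc=0.0, scale=0.25)               # Normal with standard deviation 0.25

    print(X.pdf(0.0))                            # about 1.60: a density value can exceed 1
    total, _ = quad(X.pdf, -np.inf, np.inf)
    print(np.isclose(total, 1.0))                # but the density integrates to 1
    area, _ = quad(X.pdf, -np.inf, 0.1)
    print(np.isclose(area, X.cdf(0.1)))          # F(x) = integral of p up to x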
Joint distributions of multiple random variables/vectors
p(x, y) denotes the joint density of X ∈ \mathcal{X} and Y ∈ \mathcal{Y}.
- P(X = x, Y = y) = p(x, y) if X and Y are discrete.
- P(X ∈ A, Y ∈ B) = \int_{A×B} p(x, y) dx dy if X and Y are continuous.
- P(X = x, Y ∈ B) = \int_B p(x, y) dy if X is discrete and Y is continuous.
The density of X can be recovered from the joint density by marginalizing over Y:
- p(x) = \sum_{y ∈ \mathcal{Y}} p(x, y) if Y is discrete,
- p(x) = \int_{\mathcal{Y}} p(x, y) dy if Y is continuous.
Note: It is common to use “p” to denote all densities and
follow the convention that X is taking the value x, Y is
taking the value y, etc.
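As a concrete discrete example (a NumPy sketch, not part of the original slides), a joint density can be stored as a table and marginalized by summing over one index:

    import numpy as np

    # Joint density p(x, y) for X in {0, 1} (rows) and Y in {0, 1, 2} (columns).
    joint = np.array([[0.10, 0.20, 0.10],
                      [0.25, 0.15, 0.20]])
    print(np.isclose(joint.sum(), 1.0))   # a valid joint density sums to 1

    p_x = joint.sum(axis=1)               # marginal of X: sum over y
    p_y = joint.sum(axis=0)               # marginal of Y: sum over x
    print(p_x)                            # [0.4, 0.6]
    print(p_y)                            # [0.35, 0.35, 0.3]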
Conditional densities and independence
If p(y) > 0 then the conditional density of X given Y = y is
\[ p(x | y) = \frac{p(x, y)}{p(y)}. \]
X and Y are independent if p(x, y) = p(x)p(y) for all x, y.
X1 , . . . , Xn are independent if
p(x1 , . . . , xn ) = p(x1 ) · · · p(xn )
for all x1 , . . . , xn .
X1 , . . . , Xn are conditionally independent given Y if
p(x1 , . . . , xn | y) = p(x1 |y) · · · p(xn |y)
for all x1 , . . . , xn , y.
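Continuing the discrete table from the previous example (a NumPy sketch, not part of the original slides), conditioning divides the joint table by a marginal, and independence would mean the joint equals the product of its marginals:

    import numpy as np

    joint = np.array([[0.10, 0.20, 0.10],
                      [0.25, 0.15, 0.20]])
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)

    # Conditional density of X given Y = 1 (the middle column).
    p_x_given_y1 = joint[:, 1] / p_y[1]
    print(p_x_given_y1, p_x_given_y1.sum())          # a valid density: sums to 1

    # Independence check: is p(x, y) = p(x) p(y) for all x, y?
    print(np.allclose(joint, np.outer(p_x, p_y)))    # False for this table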
Expectations (a.k.a. expected values)
Suppose h(x) is a real-valued function of x.
The expectation of h(X), denoted E(h(X)), is
- E(h(X)) = \sum_{x ∈ \mathcal{X}} h(x) p(x) if X is discrete,
- E(h(X)) = \int_{\mathcal{X}} h(x) p(x) dx if X is continuous.
The conditional expectation of h(X) given Y = y is
- E(h(X) | Y = y) = \sum_{x ∈ \mathcal{X}} h(x) p(x | y) if X is discrete,
- E(h(X) | Y = y) = \int_{\mathcal{X}} h(x) p(x | y) dx if X is continuous.
E(h(X)|Y ) is defined as g(Y ) where g(y) = E(h(X)|Y = y).
Law of iterated expectations: E(E(h(X)|Y )) = E(h(X)).
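A quick Monte Carlo sanity check of the law of iterated expectations (a NumPy sketch with made-up distributions, not part of the original slides), taking Y ~ Uniform(0, 1), X | Y = y ~ Normal(y, 1), and h(x) = x^2:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1_000_000

    y = rng.uniform(0.0, 1.0, size=n)     # Y ~ Uniform(0, 1)
    x = rng.normal(loc=y, scale=1.0)      # X | Y = y ~ Normal(y, 1)
    h = x ** 2

    # Here E(h(X) | Y) = 1 + Y^2, since E(X^2 | Y) = Var(X | Y) + E(X | Y)^2.
    inner = 1.0 + y ** 2

    print(h.mean())        # Monte Carlo estimate of E(h(X)), roughly 4/3
    print(inner.mean())    # Monte Carlo estimate of E(E(h(X) | Y)); the two agree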
Outline
Linear algebra basics
Probability basics
Random vectors
Random vectors
If Z_1, . . . , Z_n ∈ R are random variables, then
\[ Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_n \end{pmatrix} = (Z_1, \ldots, Z_n)^T \]
is a random vector in R^n.
The expectation of a random vector Z ∈ R^n is
\[ E(Z) = \begin{pmatrix} E(Z_1) \\ \vdots \\ E(Z_n) \end{pmatrix}. \]
Random vectors
The covariance matrix of a random vector Z ∈ R^n is the matrix Cov(Z) ∈ R^{n×n} with (i, j)th entry
\[ Cov(Z)_{ij} = Cov(Z_i, Z_j) \]
where
\[ Cov(Z_i, Z_j) = E[(Z_i − E(Z_i))(Z_j − E(Z_j))] = E(Z_i Z_j) − E(Z_i) E(Z_j). \]
Equivalently,
\[ Cov(Z) = E[(Z − E(Z))(Z − E(Z))^T] = E(Z Z^T) − E(Z) E(Z)^T. \]
Recall that Z ∈ R^n is considered to be a column vector in R^{n×1}, so Z Z^T is a matrix in R^{n×n}.
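A Monte Carlo sketch (NumPy, not part of the original slides) estimating E(Z) and Cov(Z) from simulated draws and checking that the two covariance formulas agree:

    import numpy as np

    rng = np.random.default_rng(4)
    n_samples = 200_000

    # Draws of a random vector Z in R^2 with dependent coordinates.
    z1 = rng.normal(size=n_samples)
    z2 = 0.5 * z1 + rng.normal(size=n_samples)
    Z = np.column_stack([z1, z2])                        # each row is one draw of Z

    mean = Z.mean(axis=0)                                # estimate of E(Z)
    centered = Z - mean
    cov1 = (centered.T @ centered) / n_samples           # E[(Z - E(Z))(Z - E(Z))^T]
    cov2 = (Z.T @ Z) / n_samples - np.outer(mean, mean)  # E(Z Z^T) - E(Z) E(Z)^T
    print(np.allclose(cov1, cov2))                       # the two formulas agree
    print(np.round(cov1, 2))                             # roughly [[1, 0.5], [0.5, 1.25]]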
Random vectors
Cov(Z) is always SPSD.
If Z ∈ R^n is a random vector, then
\[ E(AZ + b) = A E(Z) + b \qquad \text{and} \qquad Cov(AZ + b) = A Cov(Z) A^T \]
for any fixed (i.e., nonrandom) A ∈ R^{m×n} and b ∈ R^m.
If Y, Z ∈ R^n are independent random vectors, then Cov(Y + Z) = Cov(Y) + Cov(Z).
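A quick simulation check of Cov(AZ + b) = A Cov(Z) A^T (a NumPy sketch, not part of the original slides):

    import numpy as np

    rng = np.random.default_rng(5)
    n_samples = 500_000

    C = np.array([[2.0, 0.3],
                  [0.3, 1.0]])               # Cov(Z)
    A = np.array([[1.0, 2.0],
                  [0.0, 1.0],
                  [3.0, -1.0]])              # fixed A in R^{3x2}
    b = np.array([1.0, -2.0, 0.5])           # fixed b in R^3

    Z = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=n_samples)
    W = Z @ A.T + b                          # each row is one draw of AZ + b

    print(np.round(np.cov(W, rowvar=False), 2))   # empirical Cov(AZ + b)
    print(np.round(A @ C @ A.T, 2))               # should match up to Monte Carlo error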
Multivariate normal distribution
If µ ∈ R^n and C ∈ R^{n×n} is SPSD, then Z ∼ N(µ, C) denotes that Z is multivariate normal with E(Z) = µ and Cov(Z) = C.
Standard multivariate normal: If Z_1, . . . , Z_n ∼ N(0, 1) independently and Z = (Z_1, . . . , Z_n)^T, then Z ∼ N(0, I).
Affine transformation property: If Z ∼ N(µ, C) then AZ + b ∼ N(Aµ + b, ACA^T) for any fixed A ∈ R^{m×n}, b ∈ R^m, µ ∈ R^n, and SPSD C ∈ R^{n×n}.
Any multivariate normal distribution can be obtained via an affine transformation (AZ + b) of Z ∼ N(0, I_{n×n}) for an appropriate choice of n, A, and b.
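For instance (a NumPy sketch, not part of the original slides), draws from N(µ, C) can be generated by affinely transforming standard normal draws, using a Cholesky factor A with AA^T = C:

    import numpy as np

    rng = np.random.default_rng(6)
    n_samples = 500_000

    mu = np.array([1.0, -2.0])
    C = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

    A = np.linalg.cholesky(C)                 # lower-triangular A with A A^T = C (needs C to be SPD)
    Z = rng.standard_normal((n_samples, 2))   # rows are draws from N(0, I)
    X = Z @ A.T + mu                          # rows are draws from N(mu, A A^T) = N(mu, C)

    print(np.round(X.mean(axis=0), 2))             # approximately mu
    print(np.round(np.cov(X, rowvar=False), 2))    # approximately C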
Multivariate normal distribution
Sum property: If Y ∼ N(µ_1, C_1) and Z ∼ N(µ_2, C_2) independently, then Y + Z ∼ N(µ_1 + µ_2, C_1 + C_2).
Density: If Z = (Z_1, . . . , Z_n)^T ∼ N(µ, C) and C^{-1} exists, then Z has density
\[ p(z) = \frac{1}{(2π)^{n/2} \, |\det(C)|^{1/2}} \exp\!\Big( -\tfrac{1}{2} (z − µ)^T C^{-1} (z − µ) \Big) \]
for all z ∈ R^n.
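As a final check (a SciPy/NumPy sketch, not part of the original slides), the density formula can be evaluated directly and compared with scipy.stats.multivariate_normal:

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([1.0, -2.0])
    C = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
    z = np.array([0.5, -1.0])

    n = len(mu)
    diff = z - mu
    quad_form = diff @ np.linalg.inv(C) @ diff
    density = np.exp(-0.5 * quad_form) / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(C)))

    print(density)
    print(multivariate_normal(mean=mu, cov=C).pdf(z))   # matches the formula above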