ENGG 5202: Pattern Recognition
Introduction and Mathematical Background
Prof. LI Hongsheng
Office: SHB 428
e-mail: [email protected]
web: https://blackboard.cuhk.edu.hk
Department of Electronic Engineering
The Chinese University of Hong Kong
Jan. 2024
Outline
1 Introduction to pattern recognition
2 Mathematical background
Linear Algebra
Probability theory review
Multivariate Gaussian distributions
The objective of pattern recognition
The terms pattern recognition and machine learning are generally used interchangeably; nowadays, machine learning is the more popular term
Learn a mapping function or probability distributions from training data
• Example of classification
• Example of regression
Pattern Recognition aims at learning a function or a distribution
The course mostly focuses on function learning or function fitting
Given an input value or vector, a function assigns it with a value or vector
“One-to-many” mapping is not a function. “Many-to-one” mapping is a
function.
[Figures: an example of a one-to-many mapping (not a function) and of a many-to-one mapping (a function)]
Note that a function can have a vector output or matrix output. For
instance, the following formula is still a function
$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = f\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \begin{bmatrix} x_1 + x_2 \\ x_1 x_2 \end{bmatrix}$
Function estimation
We are interested in predicting y from input x and assume there exists a
function that describes the relationship between y and x, e.g., y = f (x)
If the function f's parametric form is fixed, the prediction function f can be parametrized by a parameter vector θ
Estimate $\hat{f}$ from a training set $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(n)}, y^{(n)})\}$
With a better design of the parametric form of the function, the learner
could achieve better performance
This design process typically involves domain knowledge
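As a minimal illustration (the toy dataset and the linear parametric form below are assumptions for the example, not part of the slides), the following Python sketch estimates the parameter vector θ of a simple prediction function from a training set:

# Minimal sketch (assumed example): estimate the parameters theta of a linear
# prediction function f_theta(x) = theta[0] + theta[1] * x from training pairs.
import numpy as np

# assumed toy training set D = {(x_i, y_i)}
x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# np.polyfit returns [theta_1, theta_0] for a degree-1 polynomial fit
theta_1, theta_0 = np.polyfit(x_train, y_train, deg=1)

def f_hat(x):
    # learned prediction function f_hat(x) = theta_0 + theta_1 * x
    return theta_0 + theta_1 * x

print(f_hat(5.0))  # prediction for an unseen input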
Example of pattern recognition
Face recognition in smart surveillance for crossing at a red light in China
Example of pattern recognition
An interesting failure case
Example of pattern recognition
Face recognition helps capture criminal suspects in Jacky Cheung’s
concerts
Example of pattern recognition
Object detection for autonomous driving
Example of pattern recognition
Action recognition
Sample video frames from UCF-101 dataset
Example of pattern recognition
Email spam classification
Example of pattern recognition
Speech recognition
Example of pattern recognition
Computer-aided medical diagnosis
Example of pattern recognition
Classification of gene sequence function
Example of pattern recognition
Financial time series prediction
Pattern recognition is a sub-field of artificial intelligence
Artificial intelligence: general reasoning
Pattern recognition or machine learning: learning a function that produces the expected outputs
Deep learning: machine learning with deep neural networks
Pattern recognition systems
Classification of types of fishes
Statistical learning
Make the optimal decision from the statistical point of view (with statistical explanation)
Bayesian decision theory
What’s the statistical model (the probability distributions of samples
belonging to each class)?
Manually specified or learned from data
Maximum likelihood estimation, Bayesian parameter estimation,
non-parametric density estimation
Directly model the parametric form of the decision boundary
Support Vector Machine
Classification model
Each sample is represented by a d-dimensional feature vector.
The goal of classification is to establish decision boundaries in the feature space that separate samples belonging to different classes
Feature matters
Properly choosing features for different classification/regression problems is one of the key issues in pattern recognition applications
Computer vision applications
Histogram of Oriented Gradients (HOG) features
Scale-invariant Feature Transform (SIFT) features
Oriented FAST and rotated BRIEF (ORB) features
Speech recognition
Linear Predictive Codes (LPC) features
Perceptual Linear Prediction (PLP) features
Mel Frequency Cepstral Coefficients (MFCC) features
If discriminative (good) enough features exist, even a very simple linear
classifier can perform well
From the 1970s to the early 2010s, features were mostly designed manually by humans based on experience
Pattern recognition systems
Classification of types of fishes
Fish classification
Features
Length of the fish & average lightness of the fish
Models
Sea bass have some typical length (lightness) and it is greater than that for
salmon
f (x) = “sea bass” if x > x∗
Training set (with training samples)
Used to tune the parameters of an adaptive model
Minimize the classification errors on the training set
[Figures: histograms of the length feature and of the lightness feature for the two categories]
None of the features alone will serve to unambiguously
discriminate between the two categories!
Jointly use two features
The classification error on the training data becomes lower than when using only one feature
Use more features?
Linearly separable features
What makes good features: linearly separable features
There exists a linear classifier (decision boundary) that correctly classifies all training samples
However, this property does not hold in most scenarios
Training set and testing set
The data with annotations should be separated into training set, validation
set (optional), and testing set
Reaching 100% accuracy on the training set cannot guarantee good performance on general unseen samples
The model must have the capability to generalize to unseen (test) samples
Model (learner) capacity
The ability of the learner (also called the model) to discover a function from a family of functions. Examples:
Linear predictor
y = wx + b
Quadratic predictor
$y = w_2 x^2 + w_1 x + b$
Degree-10 polynomial predictor
$y = b + \sum_{i=1}^{10} w_i x^i$
The latter family is richer, allowing it to capture more complex functions
Capacity can be measured by the number of training examples $\{x^{(i)}, y^{(i)}\}$ that the learner can always fit, no matter how the values of $x^{(i)}$ and $y^{(i)}$ are chosen
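A rough Python sketch of this idea (the training data below are assumed for illustration): fitting the same points with predictors of increasing capacity lowers the training error.

# Sketch (assumed data): compare predictors of different capacity by fitting
# degree-1, degree-2, and degree-10 polynomials to the same training points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 15)
y = 2.0 * x**2 - x + 0.5 + 0.05 * rng.standard_normal(x.shape)  # quadratic + noise

for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, deg=degree)                 # fit the w_i and b
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_err)  # training error shrinks as capacity grows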
Underfitting
The learner cannot find a solution that fits training examples well
For example, using linear regression to fit training examples $\{x^{(i)}, y^{(i)}\}$ where $y^{(i)}$ is a quadratic function of $x^{(i)}$
Underfitting means the learner cannot capture some important aspects of
the data
Reasons why underfitting happens
Model is not rich enough
It is difficult to find the global optimum of the objective function on the training set, or easy to get stuck at a local minimum
Limitation on the computation resources (not enough training iterations of
an iterative optimization procedure)
Underfitting commonly happens in non-deep-learning approaches with large-scale training data and can be an even more serious problem than overfitting in some cases
Overfitting
The learner fits the training data well but loses the ability to generalize well, i.e., it has a small training error but a larger generalization error
A learner with large capacity tends to overfit
The family of functions is too large (compared with the size of the training
data) and it contains many functions which all fit the training data well.
Without sufficient data, the learner cannot distinguish which one is most
appropriate and would make an arbitrary choice among these apparently
good solutions
A separate validation set helps to choose a more appropriate one
In most cases, data is contaminated by noise. The learner with large
capacity tends to describe random errors or noise instead of the underlying
models of data (classes)
Model complexity (capacity)
The goal is to correctly classify novel examples not seen yet, not just the training examples!
Generalization. The ability to correctly classify new examples that differ
from those used for training
[Figures: left, an overly complex model leads to a complicated decision boundary that classifies the training examples perfectly but performs poorly on new patterns; right, a decision boundary that might represent the optimal tradeoff between performance on the training set and simplicity of the classifier, therefore giving the highest accuracy on new patterns]
Optimal capacity
Difference between training error and generalization error increases with
the capacity of the learner
Generalization error is a U-shaped function of capacity
Optimal capacity is associated with the transition from underfitting to overfitting
One can use a validation set to monitor generalization error empirically
Optimal capacity should increase with the number of training examples
Optimal capacity
Typical relationship between capacity and both training and generalization (or
test) error. As capacity increases, training error can be reduced, but the
optimism (difference between training and generalization error) increases. At
some point, the increase in optimism is larger than the decrease in training
error (typically when the training error is low and cannot go much lower), and
we enter the overfitting regime, where capacity is too large, above the optimal
capacity. Before reaching optimal capacity, we are in the underfitting regime.
(Bengio et al. Deep Learning 2014)
Curse of dimensionality
In general, the classification errors on the training set can decrease by
simply increasing the number (dimension) of features
But it is not the case on the testing set which includes unseen samples
[Figure: scatter plot of the training data of three classes; two features are used. The goal is to classify the new testing point denoted by 'x'. The feature space is uniformly divided into cells; a cell is labeled with a class if the majority of training examples in that cell are from that class. The testing point is classified according to the label of the cell into which it falls.]
Curse of dimensionality
The more training samples in each cell, the more robust the classifier
The number of cells grows exponentially with the dimensionality of the
feature space
If each dimension is divided into three intervals, the number of cells is $N = 3^D$
Some cells are empty when the number of cells is very large!
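A quick illustrative computation (the training-set size is an assumed number): as the dimension D grows, the average number of training samples per cell quickly falls below one.

# Sketch: the number of cells N = 3**D grows exponentially with the feature
# dimension D, so with a fixed training set most cells are eventually empty.
n_train = 10_000  # assumed training-set size
for D in (1, 2, 5, 10, 20):
    n_cells = 3 ** D
    print(D, n_cells, n_train / n_cells)  # average training samples per cell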
Machine learning with big data
Machine learning with small data: overfitting, reducing model complexity
(capacity)
Machine learning with big data: underfitting, need to increase model
complexity, difficult optimization, high computation resources
Curse of dimensionality
⇓
Blessing of dimensionality†
⇓
Learning hierarchical feature transforms
(with deep learning)
† D. Chen et al., "Blessing of Dimensionality: High-Dimensional Feature and Its Efficient Compression for Face Verification," CVPR 2013.
Pattern recognition systems
Classification of types of fishes
Pattern recognition systems
Feature extraction
Discriminative features
Invariant features with respect to certain transformation
A small number of features
Classifier/regressor
Tradeoff of classification errors on the training set and the model complexity
Decide the form of the classifier
Tune the parameters of the classifier by training
Post processing
Risk: the cost of misclassifying sea bass is different from that of misclassifying salmon
Context: it is more likely for a fish to be the same class as its previous one
Integrating multiple classifiers: classifiers are based on different sets of
features
Training cycle
Data collection
Collect training data, validation data, and test data
Label the ground truth annotations
Is the training set large enough?
Is the training set representative enough?
Are the training data and the testing data collected under the same
condition?
Initial examination of the data to get a feel for the data structure
Summary of statistics
Producing plots
The analysis of the evaluation results may require further data collection
Problem setup for supervised learning
Given pairs of inputs and outputs, learn a function to map inputs to
outputs
Function inputs: features $x^{(i)}$ ($x^{(i)} \in \mathbb{R}^d$ for general problems)
Function outputs: target outputs $y^{(i)} \in \mathbb{R}$
One training sample: $(x^{(i)}, y^{(i)})$
Training set of m samples: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$
Hypothesis $h: \mathbb{R}^d \to \mathbb{R}$: the function to be learned to map a general input x to the expected output y
Training and testing
The parametric form of h is fixed
Training: find the optimal parameters θ of function h based on the training set $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$, usually by minimizing some cost function
Testing: fix the found optimal parameters θ, given the input features of
one unseen example x, predict the output value y
[Diagram: in the training stage, the training inputs and training outputs are fed to the learning algorithm, which produces h; in the testing stage, h maps a testing input to a testing output]
If target variables y are continuous, the learning is a regression problem
If target variables y can only take a small number of discrete values
(classes), it is a classification problem
Evaluation
Apply the trained classifier to an independent validation set of labeled
samples
It is important both to measure the performance of the system and to identify the need for improvements in its components
Compare the error rates on the training set and the validation set to
decide if it is overfitting or underfitting
High error rates on both the training set and the validation set: underfitting
Low error rate on the training set and high error rate on the validation set:
overfitting
Design cycle
Binary Classification Evaluation Criteria
A binary classifier classifies each sample into two classes (positive or
negative)
In general, the binary classifier outputs a continuous value, usually (but not necessarily) in [0, 1]. One can choose a threshold value to determine whether a sample is considered positive or negative
Error rate on the test sample set $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots, (x^{(m)}, y^{(m)})\}$:
$E(h) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}\left(h(x^{(i)}) \neq y^{(i)}\right)$
Accuracy on the test sample set:
$\mathrm{acc}(h) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{1}\left(h(x^{(i)}) = y^{(i)}\right) = 1 - E(h)$
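A small Python sketch of these two definitions (the labels and predictions are assumed toy values):

# Sketch: error rate E(h) and accuracy acc(h) on a test set, following the
# definitions above. y_true and y_pred are assumed example label vectors.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])   # h(x_i) after thresholding

error_rate = np.mean(y_pred != y_true)  # E(h) = (1/m) sum 1(h(x_i) != y_i)
accuracy = np.mean(y_pred == y_true)    # acc(h) = 1 - E(h)
print(error_rate, accuracy)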
Classification Evaluation Criteria
Precision
$P = \frac{TP}{TP + FP}$
Recall
$R = \frac{TP}{TP + FN}$
Precision-Recall (P-R) curve
Classification Evaluation Criteria
The F1 measure tries to balance the contributions of precision and recall, and gives a single value
$F_1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times TP}{\#\text{All samples} + TP - TN}$
Receiver Operating Characteristic (ROC) curve. Different from the P-R
curve, the tested samples are ranked according to their positive
confidences.
The horizontal axis is the False Positive Rate (FPR) and the vertical axis is the True Positive Rate (TPR)
$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{TN + FP}$
Prof. LI Hongsheng ENGG 5202: Pattern Recognition
Introduction to pattern recognition
Mathematical background
Classification Evaluation Criteria
Similar to the P-R curve, if one curve encloses another entirely, the former represents better performance than the latter
It is still difficult to compare two classifiers if their curves intersect each other
It is more appropriate to use Area Under ROC Curve (AUC)
$AUC = \frac{1}{2}\sum_{i=1}^{m-1} (x_{i+1} - x_i)\cdot(y_i + y_{i+1})$
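The following Python sketch (with assumed toy scores and labels) computes precision, recall, F1, and the trapezoidal AUC defined above:

# Sketch: precision, recall, F1, and AUC via the trapezoidal sum above.
# Scores and labels are assumed toy values; in practice they come from the classifier.
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])  # positive confidences
y_pred = (scores >= 0.5).astype(int)                # threshold at 0.5

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# ROC curve: rank samples by positive confidence, then integrate (trapezoids)
order = np.argsort(-scores)
tpr = np.cumsum(y_true[order]) / np.sum(y_true)
fpr = np.cumsum(1 - y_true[order]) / np.sum(1 - y_true)
fpr = np.concatenate(([0.0], fpr))
tpr = np.concatenate(([0.0], tpr))
auc = 0.5 * np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]))
print(precision, recall, f1, auc)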
Learning schemes
Supervised learning
An “expert” provides a category label for each pattern in the training set
It may be cheap to collect patterns but expensive to obtain the labels
Unsupervised learning
The system automatically learns feature transformation from the training
samples without any annotation to best represent them
Weakly supervised learning (not covered in this course)
The supervision is inexact or rough
Example: learning image segmentation by providing image-level annotations
Semi-supervised learning (not covered in this course)
Some samples have labels, while some do not
Deep learning
Deep learning aims at learning better feature representations
Neural networks
Deep learning is based on neural networks
Neural networks originated back in the 1970s-1980s
Neural networks
A network of interconnecting artificial neurons
It simulates some properties of biological neural networks: learning, generalization, adaptivity, fault tolerance, and distributed computation
Low dependence on domain-specific knowledge
Neural networks
They provide a new suite of nonlinear algorithms for feature extraction
(using hidden layers) and classification (e.g., multi-layer perceptrons)
Existing feature extraction and classification algorithms can also be mapped onto neural network architectures for efficient hardware implementation
In spite of the seemingly different underlying principles, most of the well-known neural network models are implicitly equivalent or similar to classical statistical machine learning methods
Link between statistical learning and neural networks
What makes the difference?
Deep learning becomes popular again in 2010s
Large-scale training data
Super parallel computing power (e.g. GPU and TPU)
Deep learning in neural networks
Has been a hot topic since 2006
Hinton et al., “A Fast Learning Algorithm for Deep Belief Nets,” Neural
Computation, 2006
Other famous researchers in deep learning
Andrew Ng (Stanford), Yann LeCun (NYU), Yoshua Bengio (U of Montreal)
MIT Technology Review listed deep learning as one of the top-10 breakthrough technologies in 2013
Neural networks with more hidden layers
Many existing statistical models can be approximated as neural networks
with one or two hidden layers
Success of deep learning
Speech recognition
Success of deep learning
Object classification over 1 million images of 1000 classes
ImageNet Challenge 2012
Success of deep learning
ImageNet Challenge 2013
All teams used deep learning
MSRA, IBM, Adobe, NEC, Clarifai, Berkeley, U. Tokyo, UCLA, UIUC, Toronto, etc.
Different types of deep learning
Back in 2006, the name "deep learning" proposed by G. Hinton mostly described deep neural networks trained in the unsupervised learning setting
Restricted Boltzmann Machine
Deep Belief Network
Auto-encoder
Nowadays, deep learning research is dominated by supervised learning
approaches
Multi-layer perceptron (MLP)
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
In this course, we only focus on the basics of multi-layer perceptron
1 Introduction to pattern recognition
2 Mathematical background
Linear Algebra
Probability theory review
Multivariate Gaussian distributions
Vector-vector multiplication
Inner product for two vectors x, y ∈ Rn . Can be used to measure two
vectors’ similarity.
$x^T y \in \mathbb{R} = [x_1, x_2, \cdots, x_n]\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}$
Outer product for $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$
$xy^T \in \mathbb{R}^{m\times n} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix}[y_1, y_2, \cdots, y_n] = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \end{bmatrix}$
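For concreteness, a short numpy sketch of the inner and outer products (the vectors are assumed example values):

# Sketch: inner and outer products with numpy, matching the definitions above.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y            # x^T y, a scalar (also np.dot(x, y))
outer = np.outer(x, y)   # x y^T, a 3x3 matrix with entries x_i * y_j
print(inner)
print(outer)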
Matrix-vector multiplication
Given a matrix A ∈ Rm×n and a vector x ∈ Rn , their product is a vector
y = Ax ∈ Rm .
If we write A by rows, Ax can be expressed by
$y = Ax = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} x = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}$
The ith entry of y is equal to the inner product of the ith row of A and x
If we write A in column form,
$y = Ax = \begin{bmatrix} | & | & & | \\ a_1 & a_2 & \cdots & a_n \\ | & | & & | \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n$
y is a linear combination of columns of A
Matrix-vector multiplication
If we multiply A on the left by a row vector $x^T$, then $y^T = x^T A \in \mathbb{R}^{n}$ for $A \in \mathbb{R}^{m\times n}$, $x \in \mathbb{R}^m$
Express A in terms of its columns
the ith entry of y T is equal to the inner product of x and the ith column
of A.
Express A in terms of rows
y is a linear combination of rows of A
Matrix multiplication
The product of two matrices A ∈ Rm×n and B ∈ Rn×p
$C = AB \in \mathbb{R}^{m\times p}, \quad \text{where} \quad C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$
The (i, j)th entry of C is equal to the inner product of the ith row of A
and the jth column of B
Matrix multiplication
If we represent B by columns, we can view the columns of C as matrix-vector products between A and the columns of B
These matrix-vector products can in turn be interpreted using both
viewpoints given in the previous slides
If we represent A by rows, we can view the rows of C as matrix-vector products between the rows of A and B
Transpose, symmetric matrix, trace
The transpose of a matrix “flips” the rows and columns of a matrix.
Given A ∈ Rm×n , its transpose AT ∈ Rn×m
$(A^T)_{ij} = A_{ji}$
$(A^T)^T = A$
$(AB)^T = B^T A^T$
$(A + B)^T = A^T + B^T$
Symmetric matrix: a square matrix A ∈ Rn×n is symmetric if A = AT
Trace of a square matrix $A \in \mathbb{R}^{n\times n}$, denoted as $\mathrm{tr}(A)$:
$\mathrm{tr}A = \sum_{i=1}^{n} A_{ii}$
For $A \in \mathbb{R}^{n\times n}$, $\mathrm{tr}A = \mathrm{tr}A^T$
For $A, B \in \mathbb{R}^{n\times n}$, $\mathrm{tr}(A + B) = \mathrm{tr}A + \mathrm{tr}B$
For $A, B$ such that $AB$ is square, $\mathrm{tr}AB = \mathrm{tr}BA$
Norms
A norm of a vector, ∥x∥, is informally a measure of the "length" of the vector
Euclidean (or L2) norm:
$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$
A norm can be viewed as a distance measure; formally, it is any function $f: \mathbb{R}^n \to \mathbb{R}$ that satisfies:
For all x ∈ Rn , f (x) ≥ 0 (non-negativity)
f (x) = 0 if and only if x = 0 (definiteness)
For all x ∈ Rn , t ∈ R, f (tx) = |t|f (x) (homogeneity)
For all x, y ∈ Rn , f (x + y) ≤ f (x) + f (y) (triangle inequality)
L1 norm:
$\|x\|_1 = \sum_{i=1}^{n} |x_i|$
L∞ norm:
$\|x\|_\infty = \max_i |x_i|$
Norms
Lp norm:
$\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$
For matrices, we have the Frobenius norm:
$\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{tr}(A^T A)}$
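A short numpy sketch of these norms (the example vector and matrix are assumed):

# Sketch: the vector and matrix norms above, computed with numpy.
import numpy as np

x = np.array([3.0, -4.0, 1.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

l2 = np.linalg.norm(x, 2)                # Euclidean norm
l1 = np.linalg.norm(x, 1)                # sum of absolute values
linf = np.linalg.norm(x, np.inf)         # max absolute value
lp = np.sum(np.abs(x) ** 3) ** (1 / 3)   # Lp norm with p = 3
fro = np.linalg.norm(A, 'fro')           # Frobenius norm, sqrt(tr(A^T A))
print(l2, l1, linf, lp, fro)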
Linear Independence
A set of vectors {x1 , x2 , · · · , xn } ⊂ Rn is said to be (linearly) independent
if no vector can be represented as a linear combination of the remaining
vectors
Conversely, if one vector belonging to the set can be represented as a linear combination of the remaining vectors, the vectors are said to be (linearly) dependent, e.g.,
$x_n = \sum_{i=1}^{n-1} \alpha_i x_i$
for some scalar values $\alpha_1, \cdots, \alpha_{n-1} \in \mathbb{R}$
Inverse
The inverse of a matrix $A \in \mathbb{R}^{n\times n}$, denoted $A^{-1}$, is the unique matrix such that
$A^{-1}A = I = AA^{-1}$
A is invertible or non-singular if A−1 exists and non-invertible or
singular otherwise
Assume A, B ∈ Rn×n are non-singular
(A−1 )−1 = A
(AB)−1 = B −1 A−1
(A−1 )T = (AT )−1
Orthogonal matrices
Two vectors x, y ∈ Rn are orthogonal if xT y = 0. A vector x ∈ Rn is
normalized if ∥x∥2 = 1.
A square matrix U ∈ Rn×n is orthogonal if all columns are orthogonal to
each other and are normalized. Its columns are referred to be
orthonormal.
In other words, the inverse of an orthogonal matrix is its transpose:
$U^T U = I = U U^T$
Operating on a vector with an orthogonal matrix does not change its Euclidean norm:
$\|Ux\|_2 = \|x\|_2$
Quadratic forms
Given a square matrix A ∈ Rn×n and a vector x ∈ Rn , the scalar value
xT Ax is called a quadratic form
$x^T A x = \sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij} x_i x_j$
Note that
$x^T A x = x^T\left(\frac{1}{2}A + \frac{1}{2}A^T\right)x$
Only the symmetric part of A contributes to the quadratic form
A symmetric matrix A ∈ S n is positive definite if for all non-zero vectors
x ∈ Rn , xT Ax > 0, written as A ≻ 0
A symmetric matrix A ∈ S n is positive semidefinite if for all vectors
x ∈ Rn , xT Ax ≥ 0, written as A ⪰ 0
A symmetric matrix A ∈ S n is negative definite if for all non-zero vectors
x ∈ Rn , xT Ax < 0, written as A ≺ 0
A symmetric matrix A ∈ S n is negative semidefinite if for all vectors
x ∈ Rn , xT Ax ≤ 0, written as A ⪯ 0
Quadratic forms
Finally, a symmetric matrix A ∈ S n is indefinite, if it is neither positive
semidefinite nor negative semidefinite
Positive definite and negative definite matrices are always full rank
Gram matrix: given any matrix A ∈ Rm×n , the matrix G = AT A is always positive semidefinite. If m ≥ n and A is full rank, then G = AT A is positive definite
Eigenvalues and eigenvectors
Given a square matrix A ∈ Rn×n , λ ∈ C is an eigenvalue of A and x ∈ Cn is the corresponding eigenvector if
Ax = λx, x ̸= 0
Solve the following equation
(λI − A)x = 0, x ̸= 0
(λI − A)x = 0 has a non-zero solution for x if and only if (λI − A) has a non-trivial nullspace, i.e., (λI − A) is singular, i.e.,
|λI − A| = 0
Solving this equation leads to n (possibly complex) eigenvalues λ1 , λ2 , · · · , λn . Solving (λi I − A)x = 0 leads to the associated eigenvectors.
Properties of eigenvalues and eigenvectors
$\mathrm{tr}A = \sum_{i=1}^{n} \lambda_i, \qquad |A| = \prod_{i=1}^{n} \lambda_i$
The rank of A is equal to the number of non-zero eigenvalues of A
If A is non-singular then 1/λi is an eigenvalue of A−1 with associated
eigenvector xi , i.e., A−1 xi = (1/λi )xi
The eigenvalues of a diagonal matrix D = diag(d1 , · · · , dn ) are just the
diagonal entries d1 , · · · , dn
All the eigenvector equations can be formulated together as
$AX = X\Lambda$, where the columns of $X$ are the eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_n)$
If the eigenvectors of A are linearly independent, then the matrix X will
be invertible, so A = XΛX −1 . A is called diagonalizable.
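A small numpy sketch checking these properties on an assumed example matrix:

# Sketch: eigendecomposition with numpy, checking the properties above.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors

# A x = lambda x for each eigenpair
for lam, x in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ x, lam * x))

print(np.isclose(np.trace(A), eigvals.sum()))        # tr(A) = sum of eigenvalues
print(np.isclose(np.linalg.det(A), eigvals.prod()))  # |A| = product of eigenvalues

# Diagonalization A = X Lambda X^{-1}
X, Lam = eigvecs, np.diag(eigvals)
print(np.allclose(A, X @ Lam @ np.linalg.inv(X)))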
Eigenvalues and eigenvectors of symmetric matrices
All the eigenvalues of a symmetric matrix A are real
The eigenvectors of A are orthonormal, i.e., the matrix X is an
orthogonal matrix (re-written as U )
A = U ΛU T
The definiteness of a matrix depends entirely on the signs of its eigenvalues:
$x^T A x = x^T U \Lambda U^T x = y^T \Lambda y = \sum_{i=1}^{n} \lambda_i y_i^2, \quad \text{where } y = U^T x$
Because $y_i^2$ is always nonnegative, the sign of this expression depends entirely on the $\lambda_i$'s
Application of eigenvalues and eigenvectors: for a matrix $A \in \mathbb{S}^n$, the solutions of the problems
$\max_{x\in\mathbb{R}^n} x^T A x \ \text{subject to} \ \|x\|_2^2 = 1 \qquad \text{and} \qquad \min_{x\in\mathbb{R}^n} x^T A x \ \text{subject to} \ \|x\|_2^2 = 1$
are the eigenvectors corresponding to the maximal and minimal eigenvalues, respectively
Derivatives
Recall that the derivative of a function f(x) is defined as
$\frac{df(x)}{dx} = \lim_{\delta\to 0}\frac{f(x+\delta) - f(x)}{\delta}$
Common functions
$\frac{d}{dx}c = 0, \qquad \frac{d}{dx}ax = a, \qquad \frac{d}{dx}x^a = a x^{a-1}, \qquad \frac{d}{dx}\log x = \frac{1}{x}$
Rules
Product rule: $\frac{d}{dx}\left[f(x)g(x)\right] = f(x)g'(x) + f'(x)g(x)$
Quotient rule: $\frac{d}{dx}\frac{1}{f(x)} = \frac{-f'(x)}{f(x)^2}$
Chain rule: $\frac{d}{dx}f(g(x)) = f'(g(x))\cdot g'(x)$
Computational graph
A computational graph is a graphical representation of a function composition
Example
u = bc, v = a + u, J = 3v
Calculate derivatives backward sequentially
$\frac{\partial J}{\partial v}, \quad \frac{\partial J}{\partial u} = \frac{\partial J}{\partial v}\frac{\partial v}{\partial u}, \quad \frac{\partial J}{\partial a} = \frac{\partial J}{\partial v}\frac{\partial v}{\partial a}, \quad \frac{\partial J}{\partial b} = \frac{\partial J}{\partial u}\frac{\partial u}{\partial b}, \quad \frac{\partial J}{\partial c} = \frac{\partial J}{\partial u}\frac{\partial u}{\partial c}$
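A minimal Python sketch of this backward computation for the example u = bc, v = a + u, J = 3v (the input values are assumed):

# Sketch: backward (reverse-mode) derivative computation for u = b*c,
# v = a + u, J = 3*v, following the chain of partial derivatives above.
a, b, c = 5.0, 3.0, 2.0

# forward pass
u = b * c
v = a + u
J = 3 * v

# backward pass
dJ_dv = 3.0
dJ_du = dJ_dv * 1.0       # dv/du = 1
dJ_da = dJ_dv * 1.0       # dv/da = 1
dJ_db = dJ_du * c         # du/db = c
dJ_dc = dJ_du * b         # du/dc = b
print(dJ_da, dJ_db, dJ_dc)  # 3.0, 6.0, 9.0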
Matrix calculus
Suppose that f : Rm×n → R is a function that takes as input a matrix
A ∈ Rm×n and returns a scalar real value
The gradient of f with respect to A ∈ Rm×n is the matrix of partial
derivatives,
$\nabla_A f(A) \in \mathbb{R}^{m\times n} = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix}$
where $(\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}$
Matrix calculus
The gradient of f with respect to x ∈ Rn is the vector of partial
derivatives,
$\nabla_x f(x) \in \mathbb{R}^{n} = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}$
Properties
∇x (f (x) + g(x)) = ∇x f (x) + ∇x g(x)
For t ∈ R, ∇x (tf (x)) = t∇x f (x)
Gradients of linear and quadratic functions
For $x \in \mathbb{R}^n$, let $f(x) = b^T x$ for some known vector $b \in \mathbb{R}^n$. Then $f(x) = \sum_{i=1}^{n} b_i x_i$, so
$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k}\sum_{i=1}^{n} b_i x_i = b_k$
We have $\nabla_x b^T x = b$
For the quadratic function $x^T A x$ with $A \in \mathbb{S}^n$,
$\frac{\partial f(x)}{\partial x_k} = \frac{\partial}{\partial x_k}\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij} x_i x_j = \frac{\partial}{\partial x_k}\left[\sum_{i\neq k}\sum_{j\neq k} A_{ij} x_i x_j + \sum_{i\neq k} A_{ik} x_i x_k + \sum_{j\neq k} A_{kj} x_k x_j + A_{kk} x_k^2\right]$
$= \sum_{i\neq k} A_{ik} x_i + \sum_{j\neq k} A_{kj} x_j + 2A_{kk} x_k = 2\sum_{i=1}^{n} A_{ki} x_i$
We have $\nabla_x x^T A x = 2Ax$
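A quick numerical check of these two gradient identities in Python (the random vectors and the random symmetric matrix are assumed test data):

# Sketch: numerically verify grad(b^T x) = b and grad(x^T A x) = 2 A x for
# symmetric A, using central finite differences.
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x); e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

n = 4
rng = np.random.default_rng(0)
b = rng.standard_normal(n)
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # make A symmetric
x = rng.standard_normal(n)

print(np.allclose(numerical_grad(lambda z: b @ z, x), b))
print(np.allclose(numerical_grad(lambda z: z @ A @ z, x), 2 * A @ x))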
Least squares
Given $A \in \mathbb{R}^{m\times n}$ and a vector $b \in \mathbb{R}^m$, solve for $x \in \mathbb{R}^n$ in
$Ax = b$
When m > n and A is full rank, there might not be an exact solution. We minimize the following objective function instead:
$\min_x \|Ax - b\|_2^2$
We have
$\|Ax - b\|_2^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2b^T A x + b^T b$
Set the gradient to 0
$\nabla_x (x^T A^T A x - 2b^T Ax + b^T b) = \nabla_x x^T A^T A x - \nabla_x 2b^T Ax + \nabla_x b^T b = 2A^T A x - 2A^T b = 0$
Solving for x yields
$x = (A^T A)^{-1} A^T b$
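A short numpy sketch of the least-squares solution (the matrix A and vector b below are assumed random data):

# Sketch: solve the least-squares problem with the normal-equation solution
# x = (A^T A)^{-1} A^T b and with np.linalg.lstsq.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))   # m > n, full rank with probability 1
b = rng.standard_normal(10)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)     # (A^T A)^{-1} A^T b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # numerically preferred
print(np.allclose(x_normal, x_lstsq))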
Elements of probability
Sample space Ω: The set of all the outcomes of a random experiment.
Each ω ∈ Ω is a complete outcome at the end of the experiment.
Set of events or event space F: A set whose elements A ∈ F (called
events) are subsets of Ω.
Axioms of probability: A function P : F → R that satisfies the following
properties
P (A) ≥ 0, for all A ∈ F
P (Ω) = 1
If $A_1, A_2, \cdots$ are disjoint events, then $P(\cup_i A_i) = \sum_i P(A_i)$
Conditional probability and independence
Event B has non-zero probability. The conditional probability of any event A
given B is
$P(A|B) = \frac{P(A \cap B)}{P(B)}$
Two events are independent if and only if P (A ∩ B) = P (A)P (B) (or
equivalently P (A|B) = P (A)).
Random variables
Random variable X is a function X : Ω → R.
For a discrete random variable,
$P(X = k) \stackrel{\text{def}}{=} P(\{\omega : X(\omega) = k\})$
For a continuous random variable,
$P(a \leq X \leq b) \stackrel{\text{def}}{=} P(\{\omega : a \leq X(\omega) \leq b\})$
PMF and PDF
Probability mass functions
If X is a discrete random variable, a probability mass function (PMF) is
a function pX : Ω → R
$p_X(x) \stackrel{\text{def}}{=} P(X = x)$
$0 \leq p_X(x) \leq 1$
$\sum_{\text{all } x} p_X(x) = 1$
$\sum_{x \in A} p_X(x) = P(X \in A)$
Probability density functions
If X is a continuous random variable, a probability density function
(PDF) is a function fX : Ω → R
$f_X(x) \stackrel{\text{def}}{=} \frac{dF_X(x)}{dx}$
$f_X(x) \geq 0$
$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$
$\int_{x \in A} f_X(x)\,dx = P(X \in A)$
Two random variables
Joint PMF
If X and Y are discrete random variables, the joint probability mass
function pXY : R × R → [0, 1]
$p_{XY}(x, y) = P(X = x, Y = y)$
Marginal probability mass function $p_X(x)$:
$p_X(x) = \sum_y p_{XY}(x, y)$
Joint PDF
If X and Y are continuous random variables, the joint probability density
function fXY : R × R → [0, 1]
$f_{XY}(x, y) = \frac{\partial^2 F_{XY}(x, y)}{\partial x\,\partial y}$
Marginal probability density function $f_X(x)$:
$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x, y)\,dy$
Two random variables
Conditional distributions
Intuitive understanding: Given X = x, the probability mass function or
probability density function of Y
Discrete case:
$p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)}$
Continuous case:
$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}$
Bayes’ rule
Discrete case:
$p_{Y|X}(y|x) = \frac{p_{XY}(x, y)}{p_X(x)} = \frac{p_{X|Y}(x|y)\,p_Y(y)}{\sum_{\text{all } y'} p_{X|Y}(x|y')\,p_Y(y')}$
Continuous case:
$f_{Y|X}(y|x) = \frac{f_{XY}(x, y)}{f_X(x)} = \frac{f_{X|Y}(x|y)\,f_Y(y)}{\int_{-\infty}^{\infty} f_{X|Y}(x|y')\,f_Y(y')\,dy'}$
Two random variables
Independence
Discrete cases:
pXY (x, y) = pX (x)pY (y) for all possible x and y
pY |X (y|x) = pY (y) where pX ̸= 0 for all possible y
Continuous cases:
fXY (x, y) = fX (x)fY (y) for all possible x and y
fY |X (y|x) = fY (y) where fX ̸= 0 for all possible y
Maximum Likelihood Estimation
In statistics, maximum likelihood estimation (MLE) is a method of
estimating the parameters of a distribution by maximizing a likelihood
function, so that under the assumed statistical model the observed data is
most probable
Taking a continuous distribution as an example, the probability density function $f(y; \theta)$ can be parameterized by parameters $\theta = [\theta_1, \theta_2, \cdots, \theta_k]^T$
Given all the observed data samples from the distribution
y = (y1 , · · · , yn ), the joint density of the samples is
Ln (θ) = Ln (θ; y) = f (y; θ)
The goal of maximum likelihood estimation is to find the values of the
model parameters that maximize the likelihood function over the
parameter space
$L_n(\hat{\theta}; y) = \sup_{\theta\in\Theta} L_n(\theta; y)$
In practice, it is often convenient to work with the natural logarithm of the
likelihood function, called the log-likelihood function
ℓ(θ; y) = ln Ln (θ; y)
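A small Python sketch of MLE for a univariate Gaussian (the data are assumed; for this model the MLE has a closed form given by the sample mean and sample variance):

# Sketch: maximum likelihood estimation for a univariate Gaussian. The
# log-likelihood below confirms that the closed-form estimates score at least
# as well as nearby parameter values.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=1000)   # observed samples (assumed)

def log_likelihood(mu, sigma2, y):
    # l(theta; y) = sum_i ln f(y_i; mu, sigma^2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2))

mu_hat = y.mean()        # MLE of the mean
sigma2_hat = y.var()     # MLE of the variance (divides by n)
print(mu_hat, sigma2_hat)
print(log_likelihood(mu_hat, sigma2_hat, y) >= log_likelihood(2.1, 2.0, y))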
Multivariate Gaussian (normal) distribution
Multivariate Gaussian distribution
If a random vector $X = [X_1, \cdots, X_n]^T$ has a multivariate Gaussian distribution with mean $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in \mathbb{S}^n_{++}$ (positive definite $n\times n$ matrix), it is denoted by $X \sim \mathcal{N}(\mu, \Sigma)$ and its PDF is given by
$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right)$
The argument of the exponential function, $-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)$, is a quadratic form in the vector variable x. For any vector $x \neq \mu$,
$-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) < 0$
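A short sketch evaluating this PDF with numpy and cross-checking against scipy.stats.multivariate_normal (the values of µ, Σ, and x are assumed):

# Sketch: evaluate the multivariate Gaussian PDF above directly and compare
# against scipy's implementation.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([3.0, 2.0])
Sigma = np.array([[25.0, 0.0],
                  [0.0,  9.0]])
x = np.array([4.0, 1.0])

n = mu.size
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff
pdf_manual = np.exp(-0.5 * quad) / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(np.isclose(pdf_manual, pdf_scipy))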
Interpretation of the covariance matrix
A diagonal covariance matrix can be viewed as describing a collection of n independent Gaussian random variables with means $\mu_i$ and variances $\sigma_i^2$:
$p(x; \mu, \Sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma_i}\exp\left(-\frac{1}{2\sigma_i^2}(x_i - \mu_i)^2\right)$
Shape of isocontours:
Diagonal covariance matrix: axis-aligned ellipsoids in $\mathbb{R}^n$ centered at µ with axis lengths proportional to $\sigma_1, \sigma_2, \cdots, \sigma_n$
Non-diagonal covariance matrix: rotated ellipsoids in $\mathbb{R}^n$ centered at µ with axis lengths proportional to the square roots of Σ's eigenvalues
[Figures: isocontours for $\mu = [3, 2]^T$, $\Sigma = \begin{bmatrix} 25 & 0 \\ 0 & 9 \end{bmatrix}$ (axis-aligned) and for $\mu = [3, 2]^T$, $\Sigma = \begin{bmatrix} 10 & 5 \\ 5 & 5 \end{bmatrix}$ (rotated)]
Basic optimization
Pattern recognition systems usually involve optimizing some cost (or loss, energy) function
If a function J(θ) has a global minimum or maximum, the minimal or maximal point θ̂ must be a point where $\nabla_\theta J(\hat{\theta}) = 0$, so θ̂ can be found by setting the gradient to zero
For functions with multiple local minima and maxima, we cannot simply find the optimal θ̂ by setting the gradient to 0
Gradient descent
To find a local minimum, we can use the gradient descent algorithm with initial parameters $\theta^{(0)}$
Gradient descent algorithm
For $i = 1, 2, 3, \cdots$
$\theta^{(i+1)} = \theta^{(i)} - \gamma \nabla_\theta J(\theta^{(i)})$
Terminate the iterations if i is large enough or $\|\nabla_\theta J(\theta^{(i)})\|$ is small enough
$\gamma$ is called the step size (or learning rate), and $-\nabla J(\theta^{(i)})$ is the negative gradient direction.
[Figures: gradient descent trajectories on a cost function of 1 variable and of 2 variables]
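A minimal Python sketch of this procedure on an assumed quadratic cost J(θ) = ∥θ − t∥², whose gradient is 2(θ − t):

# Sketch: gradient descent following the update rule above on an assumed
# quadratic cost function.
import numpy as np

target = np.array([1.0, -2.0])       # assumed; the minimizer of J
def grad_J(theta):
    return 2.0 * (theta - target)

theta = np.zeros(2)                  # theta^(0)
gamma = 0.1                          # step size (learning rate)
for i in range(1000):
    g = grad_J(theta)
    if np.linalg.norm(g) < 1e-8:     # terminate when the gradient is small
        break
    theta = theta - gamma * g        # theta^(i+1) = theta^(i) - gamma * grad J
print(theta)                         # converges to [1.0, -2.0]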