Foundations of Machine Learning
DSA 5102 • Lecture 9
Li Qianxiao
Department of Mathematics
Last Time
Until now, we have focused on supervised learning
• Datasets come as input-label pairs
• Goal is to learn their relationship for prediction
For the rest of the course, we are going to look at a variety of
unsupervised learning methodologies.
As always, we start with the simplest linear cases and proceed
from there.
Unsupervised Learning Overview
Supervised Learning
Supervised learning is about learning to make predictions
[Figure: an oracle labels input images as "Cat" or "Dog"; a predictive model learns to produce the same labels]
Our goal: using data, learn a predictive model that approximates the oracle
Unsupervised Learning
Unsupervised learning is where we do not have label information
[Figure: the same input images, but the oracle's labels ("Cat", "Dog") are not observed]
Example goal: learn some task-agnostic patterns from the input data
Examples of Unsupervised Learning
Tasks: Dimensionality Reduction
https://media.geeksforgeeks.org/wp-content/uploads/Dimensionality_Reduction_1.jpg
Examples of Unsupervised Learning
Tasks: Clustering
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Cluster-2.svg/1200px-Cluster-2.svg.png
Examples of Unsupervised Learning
Tasks: Density Estimation
By طاها- Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24309466
Examples of Unsupervised Learning
Tasks: Generative Models
http://www.lherranz.org/wp-content/uploads/2018/07/blog_generativesampling.png
Why unsupervised learning?
• Labelled data is expensive to collect
• In some settings, labels are impossible to obtain
• Some application scenarios (e.g. clustering, density estimation, generation) are not prediction problems at all
Principal Component Analysis
Review: Eigenvalues and Eigenvectors
• For a square matrix $A$, an eigenvector $v \neq 0$ with associated eigenvalue $\lambda$ satisfies $A v = \lambda v$
• We say $A$ is diagonalizable if there exists a diagonal $\Lambda$ (matrix of eigenvalues) and an invertible $P$ (columns = eigenvectors) such that $A = P \Lambda P^{-1}$
• $A$ is symmetric if $A = A^T$. $U$ is orthogonal if $U^T U = U U^T = I$
• Well-known result: if $A$ is symmetric then it is diagonalizable by orthogonal
matrices, i.e. $A = U \Lambda U^T$
Columns of $U$ are orthonormal: $u_i^T u_j = \delta_{ij}$. In fact, $\{u_1, \dots, u_d\}$ is an orthonormal basis for $\mathbb{R}^d$. Moreover, the eigenvalues are real.
Watch this! https://www.youtube.com/watch?v=PFDu9oVAE-g&t=453s
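To make the review concrete, here is a minimal NumPy sketch (not from the slides) that verifies the spectral decomposition $A = U \Lambda U^T$ for a symmetric matrix:

```python
import numpy as np

# Build a random symmetric matrix A = B + B^T
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B + B.T

# eigh is specialized for symmetric matrices: real eigenvalues, orthonormal eigenvectors
eigvals, U = np.linalg.eigh(A)
Lam = np.diag(eigvals)

# Check A = U Λ U^T and U^T U = I
print(np.allclose(A, U @ Lam @ U.T))    # True
print(np.allclose(U.T @ U, np.eye(4)))  # True
```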
Review: Eigenvalues and Eigenvectors
• A symmetric matrix $A$ is
• Positive semi-definite if $x^T A x \geq 0$ for all $x$
• Positive definite if $x^T A x > 0$ for all $x \neq 0$
• Suppose $A$ is symmetric positive definite. Then, WLOG we will
order its eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d > 0$,
and $u_1, u_2, \dots, u_d$ are the corresponding orthonormal eigenvectors.
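A small sketch (again not from the slides) showing how to reorder the output of np.linalg.eigh into the descending convention $\lambda_1 \geq \dots \geq \lambda_d$ used above, applied to a positive semi-definite matrix such as a sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(Xc)            # sample covariance: symmetric positive semi-definite

eigvals, U = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]  # reorder to the descending convention of the slide
eigvals, U = eigvals[order], U[:, order]

print(eigvals)                     # all >= 0 since C is positive semi-definite
```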
Motivating PCA: Shoe Sizes
Capturing the Variation?
Although there are two dimensions to the data, there is really one
effective dimension! How do we uncover this dimension?
A Dynamic Visualization
Two formulations:
• Find the direction that captures the most variance
• Find the direction that minimizes projection error
Derivation of PCA (Maximize Variance)
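The slide's own steps are not reproduced in this text; the following is the standard variance-maximization argument, reconstructed with the notation of the review slides. For centered data with sample covariance $C$,

$$\max_{\|w\|=1} \ \frac{1}{N} \sum_{i=1}^{N} \big( w^T (x^{(i)} - \bar{x}) \big)^2 \;=\; \max_{\|w\|=1} \ w^T C w, \qquad C = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \bar{x})(x^{(i)} - \bar{x})^T.$$

Writing $w = \sum_j a_j u_j$ in the orthonormal eigenbasis of $C$ gives $w^T C w = \sum_j \lambda_j a_j^2 \leq \lambda_1$ (since $\sum_j a_j^2 = 1$), with equality at $w = u_1$. So the direction of maximum variance is the leading eigenvector of the sample covariance.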
Derivation of PCA (Minimize Error)
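Again reconstructing the standard argument (the slide's own steps are not in this text): projecting each centered point $\tilde{x}^{(i)} = x^{(i)} - \bar{x}$ onto a unit direction $w$ gives

$$\frac{1}{N}\sum_{i=1}^{N} \big\| \tilde{x}^{(i)} - (w^T \tilde{x}^{(i)})\, w \big\|^2 \;=\; \frac{1}{N}\sum_{i=1}^{N} \|\tilde{x}^{(i)}\|^2 \;-\; w^T C w.$$

The first term does not depend on $w$, so minimizing the projection error is equivalent to maximizing the variance $w^T C w$: the two formulations yield the same principal direction.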
The PCA Algorithm
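The algorithm box from the slide is not in this text; below is a minimal NumPy sketch of the usual steps (center, eigendecompose the sample covariance, project), with function and variable names chosen here for illustration:

```python
import numpy as np

def pca(X, m):
    """Return the top-m principal directions, the component scores, and the eigenvalues.
    X : (N, d) data matrix, rows are samples.  (Names and conventions are illustrative.)
    """
    Xc = X - X.mean(axis=0)                # 1. center the data
    C = Xc.T @ Xc / len(Xc)                # 2. sample covariance (d, d)
    eigvals, U = np.linalg.eigh(C)         # 3. eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]      # 4. sort eigenvalues in descending order
    eigvals, U = eigvals[order], U[:, order]
    U_m = U[:, :m]                         # 5. keep the top-m eigenvectors
    Z_m = Xc @ U_m                         # 6. principal component scores Z_m = X U_m
    return U_m, Z_m, eigvals

# Example usage
X = np.random.default_rng(2).standard_normal((200, 5))
U_m, Z_m, eigvals = pca(X, m=2)
print(Z_m.shape)   # (200, 2)
```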
Simple Example
Choosing The Embedding Dimension
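The slide's selection criterion is not in the text; a common choice (assumed here) is to pick the smallest $m$ whose cumulative explained variance ratio exceeds a threshold:

```python
import numpy as np

def choose_m(eigvals, threshold=0.95):
    """Smallest m such that the top-m eigenvalues explain >= threshold of total variance.
    (This rule and the 95% threshold are illustrative, not necessarily the slide's choice.)"""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)    # eigvals assumed sorted descending
    return int(np.searchsorted(ratios, threshold) + 1)
```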
PCA in Feature Space (Example)
PCA in Feature Space
We define a vector of feature maps $\phi(x) = (\phi_1(x), \dots, \phi_k(x))^T$
Form the design matrix $\Phi$ with rows $\phi(x^{(i)})^T$, $i = 1, \dots, N$
Perform PCA on the transformed dataset!
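A small sketch of this idea with a hypothetical quadratic feature map (the slide's own example follows on the next slides; the features below are just an illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

def feature_map(x):
    # Hypothetical feature map phi(x) = (x1, x2, x1^2, x2^2, x1*x2) for 2-D inputs
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

X = np.random.default_rng(3).standard_normal((100, 2))
Phi = np.stack([feature_map(x) for x in X])    # design matrix of the transformed data

Z_m = PCA(n_components=2).fit_transform(Phi)   # ordinary PCA applied in feature space
print(Z_m.shape)                               # (100, 2)
```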
PCA in Feature Space
PCA in Feature Space (Example)
Define Feature Maps
PCA as a Form of Whitening
Recall: the principal component scores are given by $Z_m = X U_m$ (with $X$ centered)
Define the transformation $\tilde{X} = X U \Lambda^{-1/2}$
Then, the sample covariance of $\tilde{X}$ is the identity!
In other words, $\tilde{X}$ has uncorrelated, unit-variance features. This is known as a
PCA whitening transform.
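A minimal NumPy sketch of the whitening transform above (the exact normalization, $1/N$ here, is my assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 0.2]])   # correlated features
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(Xc)

eigvals, U = np.linalg.eigh(C)
X_white = Xc @ U @ np.diag(1.0 / np.sqrt(eigvals))   # PCA whitening: X U Λ^{-1/2}

# The whitened sample covariance is (numerically) the identity
print(np.round(X_white.T @ X_white / len(X_white), 3))
```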
Example: Iris Dataset
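The slide's Iris figure is not in this text; a sketch of what such an example might look like with scikit-learn (the tooling and settings are my assumption):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)            # 150 samples, 4 features, 3 species
Z = PCA(n_components=2, whiten=True).fit_transform(X)

# Each row of Z is a whitened 2-D principal component score; plotting Z coloured
# by y typically shows setosa well separated from the other two species.
print(Z.shape)                               # (150, 2)
```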
Autoencoders
PCA as Compression Algorithm
Encoder: $Z_m = X U_m$   (latent code $Z_m$)
Decoder: $X' = Z_m U_m^T$
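A short sketch of this encode/decode view, reusing scikit-learn's PCA (names and sizes are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(5).standard_normal((200, 10))
pca = PCA(n_components=3).fit(X)

Z_m = pca.transform(X)               # encoder: project (centered) X onto the top-m directions
X_rec = pca.inverse_transform(Z_m)   # decoder: map the 3-D latent code back to 10-D input space

print(np.mean((X - X_rec) ** 2))     # reconstruction error of the compression
```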
Autoencoders
In this sense, the autoencoder is a nonlinear counterpart of PCA-based compression!
PCA:  Encoder $Z_m = X U_m$   →   Latent $Z_m$   →   Decoder $X' = Z_m U_m^T$
AE:   Encoder $Z_m = T_{\mathrm{enc}}(X; \theta)$   →   Latent $Z_m$   →   Decoder $X' = T_{\mathrm{dec}}(Z_m; \phi)$
Neural Network Autoencoders
How do we pick the encoding $T_{\mathrm{enc}}$ and decoding $T_{\mathrm{dec}}$?
One choice: use universal approximators, e.g. neural networks!
Here $T_{\mathrm{enc}}(\cdot\,; \theta)$ and $T_{\mathrm{dec}}(\cdot\,; \phi)$ are neural networks, where $\theta$ and $\phi$ are their trainable parameters.
Neural Network Autoencoders
Given a dataset $\{x^{(i)}\}_{i=1}^{N}$, we solve the empirical risk minimization problem
$$\min_{\theta, \phi} \ \frac{1}{N} \sum_{i=1}^{N} \big\| x^{(i)} - T_{\mathrm{dec}}\big(T_{\mathrm{enc}}(x^{(i)}; \theta); \phi\big) \big\|^2$$
to minimize the distance between $X$ and the reconstruction $X'$.
The empirical risk minimization uses the inputs themselves as labels!
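A minimal PyTorch sketch of this objective (the architecture, sizes, and optimizer settings here are my own choices, not the lecture's):

```python
import torch
import torch.nn as nn

d, m = 10, 2                                   # input and latent dimensions (illustrative)
encoder = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, m))  # T_enc(.; theta)
decoder = nn.Sequential(nn.Linear(m, 32), nn.ReLU(), nn.Linear(32, d))  # T_dec(.; phi)

X = torch.randn(500, d)                        # stand-in dataset
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for epoch in range(200):
    X_rec = decoder(encoder(X))                # X' = T_dec(T_enc(X))
    loss = ((X - X_rec) ** 2).mean()           # ERM with the inputs as their own labels
    opt.zero_grad()
    loss.backward()
    opt.step()
```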
Demo: PCA and Autoencoders
Summary
PCA fits an ellipsoid to data. Two interpretations:
• Maximize variance
• Minimize error
PCA is useful for:
• Dimensionality reduction
• Feature extraction / clustering
• Data whitening
Viewed as reconstruction algorithms, autoencoders are a nonlinear
analogue of PCA