UNSUPERVISED LEARNING AND
DIMENSIONALITY REDUCTION
DR. M M MANJURUL ISLAM
Contents
1. Unsupervised Learning (Clustering Problem)
2. K-means clustering algorithm
3. Centroid initialization and choosing the number of K
4. Dimensionality Reduction
5. Principal Component Analysis
1
Unsupervised Learning
Introduction
1 Unsupervised Learning (Clustering Problem)
❖ Introduction to unsupervised learning
• Supervised learning example: Classification Problem
[Figure: labeled training examples plotted in the (x1, x2) feature plane]
Given a training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), \ldots, (x^{(m)}, y^{(m)})\}$
1 Unsupervised Learning (Clustering Problem)
❖ Introduction to unsupervised learning
• Unsupervised learning example: Clustering Problem
[Figure: unlabeled training examples in the (x1, x2) plane, grouped into Cluster 1 and Cluster 2]
Given a training set: $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$
1 Unsupervised Learning (Clustering Problem)
❖ Introduction to unsupervised learning
• Applications of clustering
─ Market segmentation
─ Organizing computing clusters
─ Astronomical data analysis
2
K-means Clustering
Algorithm
2 K-means clustering algorithm
❖ K-means algorithm
• Inputs:
─ K (number of clusters)
─ Training set $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$, with $x^{(i)} \in \mathbb{R}^n$
• Algorithm flow:
Randomly initialize K cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$
Repeat {
    for i = 1 to m
        $c^{(i)}$ := index (from 1 to K) of the cluster centroid closest to $x^{(i)}$
    for k = 1 to K
        $\mu_k$ := average (mean) of the points assigned to cluster k
}
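To make the two update steps above concrete, here is a minimal MATLAB sketch of the loop (a sketch only; the function name simple_kmeans and the variables X, K, and maxIters are assumptions for illustration, and MATLAB also ships a built-in kmeans function):

function [c, mu] = simple_kmeans(X, K, maxIters)
% X: m-by-n data matrix, K: number of clusters, maxIters: iteration budget
    m = size(X, 1);
    mu = X(randperm(m, K), :);      % initialize centroids to K random examples
    c = zeros(m, 1);
    for iter = 1:maxIters
        % Cluster assignment step: index of the closest centroid for each x(i)
        for i = 1:m
            [~, c(i)] = min(sum((mu - X(i, :)).^2, 2));
        end
        % Move centroid step: mean of the points currently assigned to cluster k
        for k = 1:K
            if any(c == k)
                mu(k, :) = mean(X(c == k, :), 1);
            end
        end
    end
end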
2 K-means clustering algorithm
❖ K-means algorithm
• Execution of algorithm
[Figures: step-by-step execution of K-means, alternating the cluster assignment and centroid update steps until the centroids stop moving]
2 K-means clustering algorithm
❖ K-means algorithm objective function
$c^{(i)}$ = index of the cluster (1, 2, …, K) to which example $x^{(i)}$ is currently assigned
$\mu_k$ = cluster centroid k ($\mu_k \in \mathbb{R}^n$)
$\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
Cost (distortion) function:
$J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$
Optimization objective:
$\min_{c^{(1)}, \ldots, c^{(m)},\; \mu_1, \ldots, \mu_K} J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$
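The distortion J is easy to compute once the assignments are known. A minimal sketch (the function name kmeans_cost is hypothetical; X, c, and mu are as in the earlier sketch):

function J = kmeans_cost(X, c, mu)
% Mean squared distance of each example to the centroid of its assigned cluster
    J = mean(sum((X - mu(c, :)).^2, 2));
end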
Strengths of k-means
• Strengths:
– Simple: easy to understand and to implement
– Efficient: Time complexity: O(tkn),
where n is the number of data points,
k is the number of clusters, and
t is the number of iterations.
– Since both k and t are small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
• Note that it terminates at a local optimum when SSE is used as the objective. The global optimum is hard to find due to complexity.
Weaknesses of k-means
• The algorithm is only applicable if the mean
is defined.
– For categorical data, use k-modes, where the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers
– Outliers are data points that are very far away
from other data points.
– Outliers could be errors in the data recording or
some special data points with very different
values.
Weaknesses of k-means: Problems with outliers
Weaknesses of k-means: To deal with outliers
• One method is to remove some data points in the
clustering process that are much further away from
the centroids than other data points.
– To be safe, we may want to monitor these possible outliers
over a few iterations and then decide to remove them.
• Another method is to perform random sampling.
Since in sampling we only choose a small subset of
the data points, the chance of selecting an outlier is
very small.
– Assign the rest of the data points to the clusters by
distance or similarity comparison, or classification
Weaknesses of k-means (cont …)
• The algorithm is sensitive to initial seeds.
Weaknesses of k-means (cont …)
• If we use different seeds: good results. There are some methods to help choose good seeds.
Weaknesses of k-means (cont …)
• The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
Common ways to represent clusters
• Use the centroid of each cluster to
represent the cluster.
– compute the radius and standard deviation of the cluster to determine its spread in each dimension
– The centroid representation alone works well if the clusters are of hyper-spherical shape.
– If clusters are elongated or of other shapes, centroids are not sufficient.
3
Centroid initialization and
Choosing the number of K
3 Centroid initialization and choosing the number of K
❖ Random initialization of centroids for K-means algorithm
• Number of clusters K < m
• Randomly pick K training examples
• Set $\mu_1, \ldots, \mu_K$ equal to these K examples
3 Centroid initialization and choosing the number of K
❖ Random initialization of centroids for K-means algorithm
• Problem
[Figures: an ideal case of random initialization and bad examples of random initialization]
3 Centroid initialization and choosing the number of K
❖ Random initialization of centroids for K-means algorithm
• Solution
For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get $c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K$.
    Compute the cost function $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$.
}
Pick the clustering with the lowest cost $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$.
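A minimal MATLAB sketch of this restart strategy, reusing the hypothetical simple_kmeans and kmeans_cost helpers sketched earlier (names and iteration counts are assumptions):

bestJ = inf;
for trial = 1:100
    [c, mu] = simple_kmeans(X, K, 50);   % one run from a fresh random initialization
    J = kmeans_cost(X, c, mu);
    if J < bestJ                         % keep the clustering with the lowest cost
        bestJ = J; bestC = c; bestMu = mu;
    end
end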
3 Centroid initialization and choosing the number of K
❖ Choosing the number of clusters
• What is the right number of clusters? How to choose it?
3 Centroid initialization and choosing the number of K
❖ Choosing the number of clusters K
• Elbow method: plot the cost function J against the number of clusters K and pick the K at the "elbow" of the curve, where the decrease in J starts to level off.
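A small sketch of how such an elbow plot could be produced, again using the hypothetical helpers sketched earlier (the range 1:10 is an arbitrary choice for illustration):

costs = zeros(1, 10);
for K = 1:10
    [c, mu] = simple_kmeans(X, K, 50);
    costs(K) = kmeans_cost(X, c, mu);
end
plot(1:10, costs, '-o'); xlabel('Number of clusters K'); ylabel('Cost J');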
3 Centroid initialization and choosing the number of K
❖ Choosing the number of clusters K
• Evaluate K-means based on a metric for how well it performs for the particular
purpose
[Figure: Feature 1 and Feature 2 construct a Health Index for rubbing prognosis; as rubbing increases, the data form clusters for no rubbing, slight rubbing, and intensive rubbing]
4
Dimensionality Reduction
4 Dimensionality Reduction
❖ Motivation. Data Compression and Visualization.
• Reduce data from 2D to 1D
[Figure: 2D examples $x^{(1)}, x^{(2)}, \ldots$ in the (x1, x2) plane projected onto a line, giving 1D coordinates $z^{(1)}, z^{(2)}, \ldots$]
Mapping:
$x^{(1)} \in \mathbb{R}^2 \rightarrow z^{(1)} \in \mathbb{R}$
$x^{(2)} \in \mathbb{R}^2 \rightarrow z^{(2)} \in \mathbb{R}$
$\vdots$
$x^{(m)} \in \mathbb{R}^2 \rightarrow z^{(m)} \in \mathbb{R}$
4 Dimensionality Reduction
❖ Motivation. Data Compression and Visualization.
• Reduce data from 3D to 2D
[Figure: 3D data in (x1, x2, x3) projected onto a 2D plane with new coordinates (z1, z2)]
$x^{(i)} \in \mathbb{R}^3 \rightarrow z^{(i)} \in \mathbb{R}^2$
5
Principal Component Analysis
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) problem formulation
[Figure: 2D data with a projection direction and the projection error of each point]
Reduce from n dimensions to k dimensions: find k vectors $u^{(1)}, u^{(2)}, \ldots, u^{(k)}$ onto which to project the data, so as to minimize the projection error.
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• Data Preprocessing
Given a training data set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$
Preprocessing (feature scaling / mean normalization):
$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.
If different features are on different scales, scale the features to have a comparable range of values:
$x_j^{(i)} := \dfrac{x_j^{(i)} - \mu_j}{\max(x_j) - \min(x_j)}$
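A short MATLAB sketch of this preprocessing on an m-by-n data matrix (the variable names X, Xnorm, mu, and featRange are assumptions for illustration):

mu = mean(X, 1);                               % per-feature means
Xnorm = X - mu;                                % mean normalization: zero-mean features
featRange = max(X, [], 1) - min(X, [], 1);     % optional feature scaling by range
Xnorm = Xnorm ./ featRange;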
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
Reduce data from n-dimensions to k-dimensions
Compute the covariance matrix:
$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T$
Compute the "eigenvectors" of the covariance matrix:
[U,S,V] = svd(Sigma);
Here, we obtain:
$U = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \cdots & u^{(n)} \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{n \times n}$
Reference: J. Shlens, 'A Tutorial on Principal Component Analysis: Derivation, Discussion and Singular Value Decomposition', 2003.
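In MATLAB, with the mean-normalized examples stored row-wise in an m-by-n matrix (Xnorm, from the earlier preprocessing sketch, is an assumed name), the covariance matrix and its decomposition can be computed as:

m = size(Xnorm, 1);
Sigma = (1 / m) * (Xnorm' * Xnorm);   % n-by-n covariance matrix
[U, S, V] = svd(Sigma);               % columns of U are the principal directions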
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• Linear transformation of the original data
In the previous step, we obtained the eigenvectors from the singular value decomposition of the covariance matrix:
$U = \begin{bmatrix} | & | & & | \\ u^{(1)} & u^{(2)} & \cdots & u^{(n)} \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{n \times n}$
Taking the first k columns of U gives $U_{\text{reduce}} \in \mathbb{R}^{n \times k}$.
Now we can transform the original data from its own feature space into a new feature space based on the principal components ($x \in \mathbb{R}^n \rightarrow z \in \mathbb{R}^k$):
$z^{(i)} = U_{\text{reduce}}^{T}\, x^{(i)} = \begin{bmatrix} -\,(u^{(1)})^T\,- \\ -\,(u^{(2)})^T\,- \\ \vdots \\ -\,(u^{(k)})^T\,- \end{bmatrix} x^{(i)}, \qquad (k \times n)(n \times 1) = (k \times 1)$
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• MATLAB implementation
After mean normalization (ensure every feature has zero mean) and optionally
feature scaling:
$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})(x^{(i)})^T$
[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce' * x;    % x is a single n-by-1 example (column vector)
*Notice:
In MATLAB there are built-in functions that you can use to obtain principal components directly, without the steps mentioned above. Have a look at pca (if you apply PCA directly to the data) and pcacov (if you want to extract principal components from a covariance matrix). The results should be similar.
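For reference, a hedged sketch of the equivalent computation with the built-in pca function (Statistics and Machine Learning Toolbox); Xnorm is the mean-normalized m-by-n data matrix from the preprocessing sketch, and k is the chosen number of components:

[coeff, score] = pca(Xnorm);    % columns of coeff play the role of U's columns
Ureduce = coeff(:, 1:k);
Z = Xnorm * Ureduce;            % m-by-k projected data, equal to score(:, 1:k)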
5 Principal Component Analysis
❖ Principal Component Analysis (PCA) algorithm
• Reconstruction of data from its compressed representation
[Figures: original 2D points $x^{(1)}, x^{(2)}$ and their reconstructions $x^{(1)}_{\text{approx}}, x^{(2)}_{\text{approx}}$ lying on the principal-component line]
Compression: $z = U_{\text{reduce}}^{T}\, x$
Reconstruction ($z \in \mathbb{R} \rightarrow x \in \mathbb{R}^2$):
$x^{(i)}_{\text{approx}} = U_{\text{reduce}}\, z^{(i)}, \qquad (n \times k)(k \times 1) = (n \times 1)$
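A minimal MATLAB sketch of the reconstruction step (the variable names follow the earlier sketches and are assumptions):

x_approx = Ureduce * z;      % n-by-1 approximation of a single original example
Xapprox  = Z * Ureduce';     % m-by-n approximations for the whole data set at once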
5 Principal Component Analysis
❖ Choosing the number of principal components
Average squared projection error: $\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x^{(i)}_{\text{approx}} \right\|^2$
Total variation in the data: $\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2$
Typically, choose k to be the smallest value so that:
$\dfrac{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x^{(i)}_{\text{approx}} \right\|^2}{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2} \le 0.01$
"99% of variance is retained"
5 Principal Component Analysis
❖ Choosing the number of principal components
Two ways to check the retained variance of the data, using [U,S,V] = svd(Sigma):

Iterative algorithm:
Try PCA with k = 1.
Compute $U_{\text{reduce}}, z^{(1)}, \ldots, z^{(m)}, x^{(1)}_{\text{approx}}, \ldots, x^{(m)}_{\text{approx}}$.
Check whether
$\dfrac{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - x^{(i)}_{\text{approx}} \right\|^2}{\frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} \right\|^2} \le 0.01$
If 'yes', stop the algorithm; if 'no', set k := k + 1 and run it again.

Using the singular values: svd returns the diagonal matrix
$S = \begin{bmatrix} S_{11} & & & \\ & S_{22} & & \\ & & \ddots & \\ & & & S_{nn} \end{bmatrix}$
For a given k, check whether
$1 - \dfrac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \le 0.01$
Pick the smallest value of k for which "99% of variance is retained".
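The singular-value check is cheap to do in MATLAB; a small sketch (assuming Sigma has been computed as above):

[U, S, ~] = svd(Sigma);
s = diag(S);
retained = cumsum(s) / sum(s);      % variance retained for k = 1, 2, ..., n
k = find(retained >= 0.99, 1);      % smallest k retaining at least 99% of the variance
Ureduce = U(:, 1:k);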
5 Principal Component Analysis
❖ Advice for applying PCA
• Supervised learning speedup
$(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$
Extract the inputs as an unlabelled dataset: $x^{(1)}, x^{(2)}, \ldots, x^{(m)} \in \mathbb{R}^{10000}$
Apply PCA: $z^{(1)}, z^{(2)}, \ldots, z^{(m)} \in \mathbb{R}^{100}$
New training set: $(z^{(1)}, y^{(1)}), (z^{(2)}, y^{(2)}), \ldots, (z^{(m)}, y^{(m)})$
Note: the mapping $x^{(i)} \rightarrow z^{(i)}$ should be defined by running PCA only on the training set. This mapping can then be applied to the examples $x^{(i)}_{\text{cv}}$ and $x^{(i)}_{\text{test}}$ in the cross validation and test sets.
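A sketch of that workflow in MATLAB (Xtrain, Xcv, Xtest, and k are assumed variable names; the feature means and Ureduce are learned from the training set only and reused for the other sets):

muTrain = mean(Xtrain, 1);
Sigma = (1 / size(Xtrain, 1)) * (Xtrain - muTrain)' * (Xtrain - muTrain);
[U, ~, ~] = svd(Sigma);
Ureduce = U(:, 1:k);
Ztrain = (Xtrain - muTrain) * Ureduce;
Zcv    = (Xcv    - muTrain) * Ureduce;   % reuse the training-set mapping
Ztest  = (Xtest  - muTrain) * Ureduce;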
5 Principal Component Analysis
❖ Advice for applying PCA
• Compression
─ Reduce memory/disk needed to store data
─ Speed up learning algorithm
• Visualisation
• Before implementing PCA, first try running whatever you want to do with the original/raw data. Only if that does not do what you want should you implement PCA and consider a different feature-space dimension.
How to choose a clustering algorithm
• Clustering research has a long history. A vast collection of algorithms is available.
– We only introduced several main algorithms.
• Choosing the "best" algorithm is a challenge.
– Every algorithm has limitations and works well with certain data distributions.
– It is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any "ideal" structure or distribution required by the algorithms.
– One also needs to decide how to standardize the data, how to choose a suitable distance function, and how to select other parameter values.
Choose a clustering algorithm (cont …)
• Due to these complexities, the common practice is to
– run several algorithms using different distance functions and parameter settings, and
– then carefully analyze and compare the results.
• The interpretation of the results must be based on insight into the meaning of the original data together with knowledge of the algorithms used.
• Clustering is highly application dependent and to a certain extent subjective (personal preferences).
Cluster Evaluation: hard problem
• The quality of a clustering is very hard to
evaluate because
– We do not know the correct clusters
• Some methods are used:
– User inspection
• Study centroids and spreads
• Rules from a decision tree.
• For text documents, one can read some documents in clusters.
Cluster evaluation: ground truth
• We use some labeled data (for classification)
• Assumption: Each class is a cluster.
• After clustering, a confusion matrix is constructed. From the matrix, we compute various measurements: entropy, purity, precision, recall and F-score.
– Let the classes in the data D be C = (c1, c2, …, ck). The clustering method produces k clusters, which divide D into k disjoint subsets, D1, D2, …, Dk.
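As a rough illustration of one of these measures, here is a hedged MATLAB sketch of purity under a common definition (the fraction of points that share their cluster's majority class); the function name cluster_purity is hypothetical:

function p = cluster_purity(clusterIdx, classLabels)
% clusterIdx: cluster index per example; classLabels: ground-truth class per example
    clusters = unique(clusterIdx);
    correct = 0;
    for j = 1:numel(clusters)
        inCluster = classLabels(clusterIdx == clusters(j));
        correct = correct + sum(inCluster == mode(inCluster));  % majority-class count
    end
    p = correct / numel(classLabels);
end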
Evaluation measures: Entropy
Evaluation measures: purity
A remark about ground truth evaluation
• Commonly used to compare different clustering
algorithms.
• A real-life data set for clustering has no class
labels.
– Thus, although an algorithm may perform very well on some labeled data sets, there is no guarantee that it will perform well on the actual application data at hand.
• The fact that it performs well on some labeled data sets does give us some confidence in the quality of the algorithm.
• This evaluation method is said to be based on external data or information.
Evaluation based on internal information
• Intra-cluster cohesion (compactness):
– Cohesion measures how near the data points in a
cluster are to the cluster centroid.
– Sum of squared error (SSE) is a commonly used
measure.
• Inter-cluster separation (isolation):
– Separation means that different cluster centroids
should be far away from one another.
• In most applications, expert judgments are still the key.
Indirect evaluation
• In some applications, clustering is not the primary
task, but used to help perform another task.
• We can use the performance on the primary task
to compare clustering methods.
• For instance, in an application, the primary task is
to provide recommendations on book purchasing
to online shoppers.
– If we can cluster books according to their features, we
might be able to provide better recommendations.
– We can evaluate different clustering algorithms based on
how well they help with the recommendation task.
– Here, we assume that the recommendation can be
reliably evaluated.
Summary
• Clustering has a long history and is still active
– There are a huge number of clustering algorithms
– More are still coming every year.
• We only introduced several main algorithms. There
are many others, e.g.,
– density-based algorithms, subspace clustering, scale-up methods, neural-network-based methods, fuzzy clustering, co-clustering, etc.
• Clustering is hard to evaluate, but very useful in
practice. This partially explains why there are still a
large number of clustering algorithms being devised
every year.
• Clustering is highly application dependent and to some extent subjective.