ML09 Clustering

Data & Knowledge Engineering Group
Machine Learning

Cluster Analysis
(partly based on transparencies of Tom Mitchell)

Andreas Nürnberger
Office: G29-113
Phone : 0391 / 67-58487
Email: [email protected]
Cluster Analysis
▪ Prototype-Based Clustering
  ▪ Definition
  ▪ K-means
  ▪ Fuzzy c-means
  ▪ EM algorithm
▪ Hierarchical Clustering
▪ Online and Stream Clustering

Cluster Analysis (Clustering)
▪ So far: Training examples {(x,y)} given, looking for
hypothesis h(x)=y.

▪ Prototype-based clustering:
A set of instances x is given. We are looking for a mapping of the instances to clusters Ci such that the sum of the distances between the cluster centers ci and their assigned instances is as small as possible:
$$ (c_1, \dots, c_n) = \arg\min \sum_{i} \sum_{x \in C_i} d(x, c_i) $$
where each instance belongs to the cluster with the closest center:
$$ x \in C_i \;\Leftrightarrow\; c_i = \arg\min_{c_j} d(x, c_j) $$
▪ Clustering is sometimes also called "unsupervised classification".
k-means Algorithm
▪ Input: Set of instances X, number of clusters k
▪ Start with k randomly defined cluster centers
▪ Do until the mapping of instances to clusters doesn't change anymore:
▪ Mapping:
Assign each instance to the cluster with the
smallest distance to the cluster center.
▪ Recompute the cluster centers:
Compute cluster centers as the mean over all
assigned instances.
(It can be shown that this algorithm terminates!)
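A minimal sketch of this loop in Python (NumPy only; the function and variable names are illustrative, not part of the lecture material):

import numpy as np

def k_means(X, k, max_iter=100, rng=None):
    """Minimal k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(rng)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # k randomly chosen centers
    assignment = None
    for _ in range(max_iter):
        # Mapping: assign each instance to the cluster with the closest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dist.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # the mapping does not change anymore -> terminate
        assignment = new_assignment
        # Recompute the cluster centers as the mean over all assigned instances.
        for j in range(k):
            members = X[assignment == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assignment

# Example usage on random 2-D data:
X = np.random.default_rng(0).normal(size=(200, 2))
centers, labels = k_means(X, k=2, rng=0)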

k-means for k=2

Partitioning of the Data Space

▪ Dots represent cluster centers.

▪ Left: Delaunay triangulation – the circle through the corner points of each triangle (its circumcircle) does not contain another dot.
▪ Right: Voronoi diagram – perpendicular bisectors of the edges of the Delaunay triangulation; borders between the data points assigned to neighboring clusters (Voronoi cells).

k-means Clustering Example (1)

➢ Left: Data points to be clustered


➢ Right: Initial starting points (randomly selected)

k-means Clustering Example (2)

➢ Delaunay triangulation (left) and Voronoi cells (right)

k-means Clustering Example (3)

[Figure: six snapshots (1–6) of the k-means iterations until convergence]

k-means Clustering: Local Minima
▪ Clustering result strongly depends on initial starting
points
▪ With poorly assigned starting points the clustering
process might even fail
▪ How to select the number and position of the initial starting points?
▪ Only a few methods are available, and none works for all possible situations.
▪ Open research problem!

k-means Clustering: Local Minima
[Figure: three examples of initial starting points (top) and the resulting end states of the clustering (bottom)]

Fuzzy c-means
▪ Fuzzy c-means is based on the k-means algorithm.
▪ It drops the requirement to 'crisply' assign an instance to exactly one cluster.
▪ Instances can be assigned to more than one cluster, each to a certain degree.
▪ The method is more robust.

Fuzzy c-means
▪ Uses the Euclidean distance ($v_\gamma$: centre):
$$ d_{i\gamma}^2 = \| x_i - v_\gamma \|^2 = \sum_{j=1}^{k} (x_{ij} - v_{\gamma j})^2 $$
▪ The membership degrees of instances $x_i$ to the centres are defined by a matrix $U$, for which holds:
$$ u_{i\gamma} \in [0,1] \quad \forall \gamma \in \{1,\dots,c\},\; \forall i \in \{1,\dots,N\}, $$
$$ \sum_{\gamma=1}^{c} u_{i\gamma} = 1 \;\;\forall i \in \{1,\dots,N\} \qquad\text{and}\qquad 0 < \sum_{i=1}^{N} u_{i\gamma} < N \;\;\forall \gamma \in \{1,\dots,c\}, $$
where c is the number of clusters and N the number of instances.

Fuzzy c-means
▪ Minimizes iteratively the target function $J_m$:
$$ J_m(U,V) = \sum_{i=1}^{N} \sum_{\gamma=1}^{c} u_{i\gamma}^{\,m}\, d_{i\gamma}^2 = \sum_{i=1}^{N} \sum_{\gamma=1}^{c} u_{i\gamma}^{\,m}\, \| x_i - v_\gamma \|^2 $$
▪ After random initialization of U and definition of c centres, the centres and U are computed iteratively (m is the "fuzzifier"):
$$ v_\gamma = \frac{\sum_{i=1}^{N} u_{i\gamma}^{\,m}\, x_i}{\sum_{i=1}^{N} u_{i\gamma}^{\,m}} \qquad\text{and}\qquad u_{i\gamma} = \frac{1}{\sum_{\lambda=1}^{c} \left( \frac{d_{i\gamma}^2}{d_{i\lambda}^2} \right)^{\frac{1}{m-1}}} $$
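The alternating update of V and U can be written compactly in Python (a sketch only; NumPy, names illustrative, and a small eps avoids division by zero for instances that coincide with a centre):

import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, eps=1e-12, rng=None):
    """Alternate between the centre update for V and the membership update for U."""
    rng = np.random.default_rng(rng)
    N = len(X)
    U = rng.random((N, c))
    U /= U.sum(axis=1, keepdims=True)               # memberships of each instance sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]    # centre update v_gamma
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps  # squared distances
        # membership update: u = 1 / sum_lambda (d_gamma^2 / d_lambda^2)^(1/(m-1))
        ratio = (d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=2)
        if np.abs(U_new - U).max() < 1e-6:          # stop when the memberships stabilize
            U = U_new
            break
        U = U_new
    return V, U

V, U = fuzzy_c_means(np.random.default_rng(0).normal(size=(300, 2)), c=3, rng=0)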

Fuzzy c-means
▪ Effect of the fuzzifier m:
▪ Example: 2 cluster centers at 0 and 1
▪ Left m=2; right m=1.5

k-means and the EM Algorithm
▪ K-means is a special case of the more general
Expectation Maximization (EM) algorithm
▪ EM solves the following problem:
▪ Given: Set of instances with missing attribute
(here: cluster assignment).
▪ We are looking for the most probable model for
the data (here: cluster centers)
▪ and need to determine – based on the model – the
missing attribute (here: cluster assignment of the
instances).

Expectation Maximization (EM)
Application of the algorithm:
▪ Data (instances) are only partially observable
▪ Unsupervised learning (target value is not
observable)
▪ Supervised Learning (attributes of some instances
are not observable)

Areas of application:
▪ Unsupervised clustering
▪ Training of Bayesian-Belief networks
▪ Learning of Hidden Markov models

Mixture of k Gaussians

[Figure: probability density P(x) of a mixture of two Gaussians (example for k = 2)]

Assumption of the method:
Every instance x was generated in the following way:
1. Select one of the k Gaussian distributions (with equal probability).
2. Randomly generate one instance from the selected distribution.

Mixture of k Gaussians (special case: k=1)

▪ Assumption:
X was created by just one Gaussian distribution
▪ Then:
Since the ML hypothesis minimizes the sum of squared errors over the training examples, i.e.
$$ \mu_{ML} = \arg\min_{\mu} \sum_{i=1}^{m} (x_i - \mu)^2, $$
we can compute the mean $\mu$ by
$$ \mu_{ML} = \frac{1}{m} \sum_{i=1}^{m} x_i. $$

EM to estimate k-means
Given:
▪ Instances X created by a mixture of k Gaussian
distributions
▪ Unknown means μ1, ..., μk of the k Gaussian distributions
▪ It is unknown which instance xi was created by which Gaussian distribution
EM to estimate k-means
Sought:
▪ Maximum-likelihood estimation of μ1, ..., μk
Approach (for the example we assume two distributions):
▪ Every instance is completely described by the triple
$$ y_i = \langle x_i, z_{i1}, z_{i2} \rangle $$
where
▪ $z_{ij}$ is 1 if $x_i$ was created by the j-th Gaussian distribution,
▪ $x_i$ is observable (training examples),
▪ $z_{ij}$ is non-observable (hidden random variable).

EM to estimate k-means
▪ EM algorithm: Select randomly an initial hypothesis $h = \langle \mu_1, \mu_2 \rangle$ and repeat:
▪ Expectation step: Compute the expectation value $E[z_{ij}]$ of every hidden variable $z_{ij}$ under the assumption that the current hypothesis h is correct:
$$ E[z_{ij}] = \frac{p(x = x_i \mid \mu = \mu_j)}{\sum_{n=1}^{2} p(x = x_i \mid \mu = \mu_n)} = \frac{e^{-\frac{1}{2\sigma^2}(x_i - \mu_j)^2}}{\sum_{n=1}^{2} e^{-\frac{1}{2\sigma^2}(x_i - \mu_n)^2}} $$

EM to estimate k-means
▪ Maximization step: Compute a new maximum-likelihood hypothesis $h' = \langle \mu_1', \mu_2' \rangle$ under the assumption that each hidden variable $z_{ij}$ takes its computed expectation value $E[z_{ij}]$. Then replace $h = \langle \mu_1, \mu_2 \rangle$ with $h' = \langle \mu_1', \mu_2' \rangle$:
$$ \mu_j \leftarrow \frac{\sum_{i=1}^{m} E[z_{ij}]\, x_i}{\sum_{i=1}^{m} E[z_{ij}]} $$
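For the one-dimensional, two-component case these two steps can be written out directly. A small sketch (assuming, as above, a known and shared variance σ² and equal mixing probabilities; all names are illustrative):

import numpy as np

def em_two_gaussians(x, sigma=1.0, n_iter=50, rng=None):
    """EM for the means of a mixture of two 1-D Gaussians with known, shared sigma."""
    rng = np.random.default_rng(rng)
    mu = rng.choice(x, size=2, replace=False).astype(float)   # initial hypothesis h = <mu1, mu2>
    for _ in range(n_iter):
        # E-step: expectation E[z_ij] under the current hypothesis.
        lik = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))   # shape (m, 2)
        E_z = lik / lik.sum(axis=1, keepdims=True)
        # M-step: new means as E[z]-weighted averages of the data.
        mu = (E_z * x[:, None]).sum(axis=0) / E_z.sum(axis=0)
    return mu

# Example: data drawn from two Gaussians around -2 and +3.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
print(em_two_gaussians(x, sigma=1.0, rng=1))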

EM Algorithm
▪ Converges to a local maximum-likelihood hypothesis h and provides estimates for the hidden variables $z_{ij}$.
▪ The quantity that is locally maximized is $E[\ln P(Y \mid h)]$, where
▪ Y is the complete data (observable and non-observable variables),
▪ the expectation value is computed over the possible values of the non-observable variables in Y.

Generalized EM Problem
Given:
▪ Observable data $X = \{x_1, \dots, x_m\}$
▪ Non-observable data $Z = \{z_1, \dots, z_m\}$
▪ Parameterized probability distributions $P(Y \mid h)$, where
▪ $Y = \{y_1, \dots, y_m\}$ is the complete data, with $y_i = x_i \cup z_i$
▪ h are the parameters
To be determined:
▪ h, which (locally) maximizes $E[\ln P(Y \mid h)]$

Generalized EM Problem
Remarks:
▪ We maximize $E[\ln P(Y \mid h')]$ since:
▪ $P(Y \mid h')$ is the probability of Y given hypothesis h′ (this probability should be maximal),
▪ the logarithm is used to simplify the computation (it does not affect the location of the maximum),
▪ the expectation value is computed since the non-observable part of Y is itself only known as a distribution.

Generalized EM Method
▪ Define a likelihood function $Q(h' \mid h)$ that is computed from the observable data X and the current parameters h and is used to estimate (the distribution of) Z, and thus $Y = X \cup Z$:
$$ Q(h' \mid h) \equiv E[\ln P(Y \mid h') \mid h, X] $$

Generalized EM Method
EM Algorithm:
▪ Estimation (E)-step: Compute $Q(h' \mid h)$ based on the current hypothesis h and the observable data X in order to estimate the probability distribution over Y:
$$ Q(h' \mid h) \equiv E[\ln P(Y \mid h') \mid h, X] $$

▪ Maximization (M)-step: Replace hypothesis h by the hypothesis h′ that maximizes the function Q:
$$ h \leftarrow \arg\max_{h'} Q(h' \mid h) $$


Hierarchical Clustering

▪ So far: Flat cluster structure, i.e., every cluster is independent of the others and all clusters are of similar 'importance'.

▪ Now: Try to find a hierarchy of clusters in data

Hierarchical Agglomerative Clustering
▪ Algorithm (bottom-up):
▪ Every data point (instance) is initially considered as an individual cluster.
▪ Sequentially merge the "most similar" clusters, i.e.,
▪ search for the (two) most similar clusters and merge them,
▪ repeat this until only one cluster is left.

▪ Requirements:
▪ Distance or similarity measure for instances, i.e.,
d(xi,xj)
▪ Distance or similarity measure for clusters, i.e.,
d(Ck,Cl)
Hierarchical Agglomerative Clustering
Commonly used distance measures for clusters Ck, Cl:

Single linkage: $d(C_k, C_l) = \min_{x_i \in C_k,\, x_j \in C_l} d(x_i, x_j)$

Complete linkage: $d(C_k, C_l) = \max_{x_i \in C_k,\, x_j \in C_l} d(x_i, x_j)$

Centroid linkage: $d(C_k, C_l) = d(\bar{x}_k, \bar{x}_l)$, with $\bar{x}_p = \frac{1}{|C_p|} \sum_{x_i \in C_p} x_i$
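A small sketch of the agglomerative procedure with these cluster distances (NumPy; names are illustrative, and the naive O(n³) pair search is kept for clarity):

import numpy as np

def hac(X, k=1, linkage="single"):
    """Merge the two closest clusters until only k clusters remain; returns lists of indices."""
    clusters = [[i] for i in range(len(X))]      # every instance starts as its own cluster

    def cluster_dist(a, b):
        d = np.linalg.norm(X[a][:, None, :] - X[b][None, :, :], axis=2)  # pairwise point distances
        if linkage == "single":
            return d.min()
        if linkage == "complete":
            return d.max()
        return np.linalg.norm(X[a].mean(axis=0) - X[b].mean(axis=0))     # centroid linkage

    while len(clusters) > k:
        # search for the (two) most similar clusters ...
        pairs = [(cluster_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]   # ... and merge them
        del clusters[j]
    return clusters

# Example with the 1-dimensional data {2, 12, 16, 25, 29, 45} used a few slides below:
X = np.array([[2.0], [12.0], [16.0], [25.0], [29.0], [45.0]])
print(hac(X, k=2, linkage="single"))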

Hierarchical Agglomerative Clustering
▪ Single linkage can follow "chains" in the data.
▪ Complete linkage derives compact clusters.
▪ Average linkage usually also tends to find compact clusters.

Hierarchical Agglomerative Clustering
▪ A dendrogram of a hierarchical clustering

Hierarchical Agglomerative Clustering
▪ Example:
1-dimensional data {2,12,16,25,29,45}

Selection of Clusters with HAC
▪ Simple approach:
▪ Definition of a minimal distance
▪ Stop merging of clusters as soon as the closest
clusters are more distant than the minimal distance
▪ Visual approach:
▪ Merge all clusters until only one cluster is left
▪ Draw the dendrogram and find a cut
▪ Advantage: the cut need not be horizontal
▪ Refined approaches:
▪ Analyze distances during merging process
▪ Select a cut if the distance during merging is
significantly bigger than the distance in prior steps
▪ There are a lot more heuristics…

Learning Vector Quantization (LVQ)
▪ Iterative adaptation of prototype vectors
▪ Similar to “online” k-means clustering (Adaptation of
prototypes for each instance).
▪ For each training example the closest prototype vector is determined.
▪ Only this prototype vector is adapted ('winner-takes-all').
▪ Prototype vectors are adapted according to
$$ r^{(new)} = r^{(old)} + \eta\, (p - r^{(old)}) $$
where η is a learning rate, p the considered training instance and r the (winning) prototype vector.
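A minimal online ('winner-takes-all') LVQ sketch of this update (NumPy; names are illustrative):

import numpy as np

def lvq_online(X, k, eta=0.1, epochs=10, rng=None):
    """Only the prototype closest to the presented instance is adapted."""
    rng = np.random.default_rng(rng)
    r = X[rng.choice(len(X), size=k, replace=False)].astype(float)   # initial prototype vectors
    for _ in range(epochs):
        for p in X[rng.permutation(len(X))]:            # present the instances in random order
            w = np.linalg.norm(r - p, axis=1).argmin()  # winning prototype
            r[w] = r[w] + eta * (p - r[w])              # move it towards the instance
    return r

prototypes = lvq_online(np.random.default_rng(0).normal(size=(300, 2)), k=3, rng=0)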
Learning Vector Quantization (LVQ)
Modifications:
▪ Batch learning
▪ The prototypes are adapted only after all training examples have been processed:
$$ r^{(new)} = r^{(old)} + \eta \sum_{p:\; \mathrm{winner}(p) = r^{(old)}} \left( p - r^{(old)} \right) $$
▪ Learning with labeled data
▪ i.e., a class label is assigned to the training instances and to the prototype vectors.

▪ Attraction rule (instance p and prototype r belong to the same class):
$$ r^{(new)} = r^{(old)} + \eta\, (p - r^{(old)}) $$
▪ Repulsion rule (p and r belong to different classes):
$$ r^{(new)} = r^{(old)} - \eta\, (p - r^{(old)}) $$
Learning Vector Quantization (LVQ)
Adaptation of prototypes:

[Figure: adaptation of a prototype under the attraction rule (left) and the repulsion rule (right); p: training instance, ri: prototype vector, η: learning rate]

Learning Vector Quantization (LVQ)
Problem: Fixed learning rate might cause oscillation

Solution: Learning rate that changes over time

$$ \eta(t) = \eta_0\, \alpha^{t}, \;\; 0 < \alpha < 1 \qquad\text{or}\qquad \eta(t) = \eta_0\, t^{-\kappa}, \;\; \kappa > 0 $$
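Both schedules are easy to state as small helper functions (a sketch; η0, α and κ are free parameters chosen for illustration):

def eta_exponential(t, eta0=0.5, alpha=0.95):
    """eta(t) = eta0 * alpha^t with 0 < alpha < 1."""
    return eta0 * alpha ** t

def eta_power(t, eta0=0.5, kappa=0.5):
    """eta(t) = eta0 * t^(-kappa) with kappa > 0 (for t >= 1)."""
    return eta0 * t ** (-kappa)

# Both shrink the step size over time and thus damp the oscillation:
print([round(eta_exponential(t), 3) for t in range(5)])
print([round(eta_power(t), 3) for t in range(1, 6)])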

Learning Vector Quantization (LVQ)

Training examples:
Left: online learning with learning rate η = 0.1
Right: batch learning with learning rate η = 0.05

demo program: http://www.borgelt.net/lvqd.html


Self Organizing Map (SOM)

[Figure: SOM architecture – input layer x1 … xn fully connected to a (usually two-dimensional) map of neurons]

An artificial neural network used to project high-dimensional data into a low-dimensional data space (usually two-dimensional). The model tries to preserve neighborhood relations.

SOM – Typical 2D Grid Layouts

Self Organizing Map (SOM)

SOM learning:
▪ Competitive learning similar to LVQ.
▪ A neighborhood relation between the map neurons is defined.
▪ All prototype vectors in the neighborhood around the winner neuron w are adapted:
$$ \forall i:\;\; r_i^{(new)} = r_i^{(old)} + \eta \cdot v(w, i) \cdot \left( p - r_i^{(old)} \right) $$
where v(w, i) is a neighborhood function and η is a learning rate.

[Figure: neighborhood around the winner neuron w, with v(w, i) decreasing from 1 at the winner over 0.5 and 0.2 to 0 for distant neurons]
demo program: http://www.borgelt.net/somd.html
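A compact SOM training sketch on a rectangular grid with a Gaussian neighborhood function v(w, i) and decaying learning rate and radius (NumPy; all names and parameter choices are illustrative assumptions):

import numpy as np

def train_som(X, rows=5, cols=5, epochs=20, eta0=0.5, radius0=2.0, rng=None):
    rng = np.random.default_rng(rng)
    grid = np.array([(a, b) for a in range(rows) for b in range(cols)], dtype=float)  # map coordinates
    R = rng.normal(size=(rows * cols, X.shape[1]))          # prototype vectors r_i
    T = epochs * len(X)
    t = 0
    for _ in range(epochs):
        for p in X[rng.permutation(len(X))]:
            eta = eta0 * (0.01 / eta0) ** (t / T)           # decaying learning rate
            radius = radius0 * (0.5 / radius0) ** (t / T)   # shrinking neighborhood radius
            w = np.linalg.norm(R - p, axis=1).argmin()      # winner neuron w
            grid_dist2 = ((grid - grid[w]) ** 2).sum(axis=1)
            v = np.exp(-grid_dist2 / (2 * radius ** 2))     # neighborhood function v(w, i)
            R += eta * v[:, None] * (p - R)                 # adapt all prototypes around the winner
            t += 1
    return R, grid

R, grid = train_som(np.random.default_rng(0).normal(size=(500, 3)), rng=0)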
Remarks…
▪ Clustering can be used to project high-dimensional
document space to a number of small clusters of
‘similar’ documents.
▪ A projection of these clusters to two-dimensional
data space can be used to visualize the distribution
of documents (and, by labeling of the clusters,
topics).

Example
▪ Clustering of a document collection (Wise et al. 95)

Growing SOMs
▪ problem:
▪ fixed network topology (size of the map)
(parameters may be estimated by heuristics)

▪ What if the dataset changes significantly?


▪ train a new SOM
▪ may look completely different
▪ may take a long time
▪ adapt the topology
▪ introduce new cell(s)
▪ retrain the network (faster)

Growing SOMs
strategy:
  start with a small grid (e.g., 2x2 cells)
  repeat:
    train the SOM for a number of iterations
    compute the error for each cell
    if error < threshold:
      stop learning
    else:
      grow at the boundary:
        select the border cell pk with the highest error
        add a new neighbor cell pa
        extrapolate its prototype

Growing SOMs

Text Retrieval Prototype
Interactive tool to search and navigate in document
collections
• Clustering of document
collections
• Supports iterative
keyword search
• Visualization of
document collections,
queries and changes of
the collection
• Supports user feedback
to adapt clusters (semi-
supervised learning
methods)

Keyword Search

Image Retrieval Prototype

Main objectives:
• Evaluate the usability of the approach for image and video database analysis
• Evaluation and optimization of the clustering method for non-text data
• Development of a user feedback model

Image Retrieval Prototype (Sample dataset)

Online and Stream Clustering
▪ Motivation:
▪ Most clustering algorithms discussed so far require the (permanent) availability of the complete data set in order to iteratively improve the clustering.
▪ Therefore they cannot be applied to a continuous stream of data, or to data sets that cannot be stored (completely) because they are too large.

▪ Further problems:
▪ Clustering must be fast.
▪ Clustering must possibly adapt to changes in the cluster structure over time.

Definitions...
▪ Online learning algorithms
▪ assume a continuous (infinite) stream of data that has to be processed continuously.
▪ Streaming model of learning
▪ is inspired by massive data sets that are too large to fit in memory, so only a small portion of the input can be held in memory at any one time.

On-line vs. streaming:
▪ Endless stream of data vs. stream of (known) length n
▪ Fixed amount of memory vs. memory available is o(n)
▪ Tested at every time step vs. tested only at the very end
▪ Each point is seen only once vs. more than one pass may be possible

On-line k-clustering
▪ Input: endless stream of instances x
▪ Output: k cluster centers
▪ Requirements:
▪ Each instance is just processed once
▪ at any given moment in time, we want the algorithm’s
k-clustering to be close to the optimal clustering of all
the data seen so far

▪ Basic Algorithm:

repeat forever:
get a new data point x
update the current set of k centers

The on-line k-center problem
More formally…
▪ Given: Instances X and centers T
▪ Requirement:
▪ For all times t, we would like it to be the case that the
cost of T for the points seen so far is close to the
optimal cost (or lowest error) achievable for those
particular points.
▪ In the on-line setting, our algorithm must conform
to the following template:

repeat forever:
get x ∈ X
update centers T ⊂ X, |T| = k

A common scheme for online k-means
▪ The following is a commonly used scheme for online k-means
▪ Properties:
▪ It stabilizes over time.
▪ It does not adapt to (late) changes in the data!

initialize the k centers t1, . . . , tk in any way


create counters n1, . . . , nk, all initialized to zero
repeat forever:
get data point x
let ti be its closest center
set ti ← (niti + x)/(ni + 1) and ni ← ni + 1
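The same scheme as a short Python sketch (NumPy; names are illustrative):

import numpy as np

def online_kmeans(stream, k, init_points):
    """Common online k-means scheme: each center is the running mean of the points it has won."""
    t = np.array(init_points, dtype=float)      # k initial centers t_1, ..., t_k
    n = np.zeros(k)                             # counters n_1, ..., n_k
    for x in stream:                            # "repeat forever" over the data stream
        x = np.asarray(x, dtype=float)
        i = np.linalg.norm(t - x, axis=1).argmin()    # closest center
        t[i] = (n[i] * t[i] + x) / (n[i] + 1)         # move it to the running mean of its points
        n[i] += 1
    return t

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2)) + rng.choice([-3.0, 3.0], size=(1000, 1))
centers = online_kmeans(data, k=2, init_points=data[:2])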

Another online k-means Algorithm…
▪ Remarks
▪ The previous algorithm works for a pre-specified value of k.
▪ If the cluster centers are placed appropriately in the beginning, each center t converges to the mean of the instances xi, i = 0, …, n, assigned to it: with the update
$$ t_{i+1} = \frac{i \cdot t_i + x_i}{i+1} $$
we obtain $t_1 = x_0$, $t_2 = \frac{x_0 + x_1}{2}$, $t_3 = \frac{x_0 + x_1 + x_2}{3}$, … and finally $t = \frac{x_0 + x_1 + \dots + x_n}{n+1}$.

Since the initial xi of the stream define where the center is placed in the beginning, and therefore influence which of the following x are assigned to this center, the xi that end up assigned to a center depend on the order of processing!
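That a single center reproduces the mean of its points is easy to verify numerically (a tiny sketch of the incremental update above):

import numpy as np

x = np.random.default_rng(0).normal(size=100)
t = x[0]                            # t_1 = x_0
for i, xi in enumerate(x[1:], start=1):
    t = (i * t + xi) / (i + 1)      # t_{i+1} = (i * t_i + x_i) / (i + 1)
assert np.isclose(t, x.mean())      # the incremental update equals the batch mean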

▪ In the following:
▪ An algorithm that is able to handle arbitrary k’s
▪ Proposed by Beygelzimer, Kakade, and Langford (2003)
A Cover Tree (1)
▪ A cover tree on data points x1, . . . , xn is a rooted
infinite tree with the following properties:
▪ Each node of the tree is associated with one of the
data points xi.
▪ If a node is associated with xi, then one of its children
must also be associated with xi.
▪ All nodes at depth j are at distance at least 1/2^j from each other.
▪ Each node at depth j + 1 is within distance 1/2^j of its parent (at depth j).
▪ The tree is described as an infinite tree for simplicity
of analysis, but it would not be stored as such.
▪ In practice, there is no need to duplicate a node as
its own child, and so the tree would take up O(n)
space.
A Cover Tree (2)
▪ The figure below gives an example of a cover tree for a data set
of five points.
▪ This is just the top few levels of the tree, but the rest of it is
simply a duplication of the bottom row!
▪ From the structure of the tree we can conclude, for instance,
that x1, x2, x5 are all at distance ≥ 1/2 from each other (since
they are all at depth 1), and that the distance between x2 and
x3 is ≤ 1/4 (since x3 is at depth 3, and is a child of x2).

k-means with a Cover Tree
▪ What makes cover trees especially convenient is that
they can be built on-line, one point at a time.

▪ To insert a new point x:


▪ find the largest j such that x is within 1/2^j of some node p at depth j in the tree,
▪ and make x a child of p.

▪ Once the tree is built, it is easy to obtain k-clusters


from it.
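A deliberately naive sketch that follows this insertion rule literally: it scans all nodes instead of descending the tree, skips the duplicate self-children, and simply attaches a far-away point to the root (a real cover tree would instead raise the root level). All names are illustrative:

import math

class Node:
    def __init__(self, point, depth):
        self.point, self.depth, self.children = point, depth, []

def insert(nodes, x):
    """Attach x below the deepest node p with dist(x, p) <= 1/2**p.depth."""
    if not nodes:
        nodes.append(Node(x, depth=0))               # the first point becomes the root
        return
    candidates = [p for p in nodes if math.dist(x, p.point) <= 1.0 / 2 ** p.depth]
    if not candidates:                               # x is too far from every node
        candidates = [min(nodes, key=lambda q: q.depth)]
    p = max(candidates, key=lambda q: q.depth)       # largest j such that x is within 1/2^j of p
    child = Node(x, p.depth + 1)
    p.children.append(child)
    nodes.append(child)

def k_centers(nodes, k):
    """One simple way to read off k clusters: use the k shallowest nodes as centers."""
    return [n.point for n in sorted(nodes, key=lambda n: n.depth)[:k]]

nodes = []
for pt in [(0.0, 0.0), (0.9, 0.1), (0.05, 0.02), (0.5, 0.55), (0.85, 0.15)]:
    insert(nodes, pt)
print(k_centers(nodes, k=3))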

A streaming k-medoid algorithm (1)
k-medoid clustering:
▪ Input:
▪ Finite set of instances X
▪ number of clusters k.
▪ Output:
▪ Cluster centers T ⊂ X with |T| = k.
▪ Goal:
▪ Minimize $\mathrm{cost}(T) = \sum_{x \in X} C(x, T)$.
▪ (C defines the classification cost, e.g., the distance to the closest center, as in k-means described earlier.)

▪ Remark: Medoids are representative objects of a cluster whose


average dissimilarity to all the objects in the cluster is minimal.
Medoids are always members of the data set.

A streaming k-medoid algorithm (2)
▪ Idea:
▪ Read as much of data stream S as will fit into memory
(call this S1), solve this sub-instance, then read the
next batch S2, solve this sub-instance, and so on.
▪ At the end, the partial solutions need to be combined
somehow.

Divide S into groups S1, S2, . . . , Sl


for each i = 1, 2, . . . , l:
run an approximation alg on Si to get ≤ a·k medoids Ti = {ti1, ti2, …}
suppose Si1 ∪ Si2 ∪ … are the induced clusters of Si
Sw ← T1 ∪ T2 ∪ … ∪ Tl, with weights w(tij) ← |Sij|
run an approximation algorithm on weighted Sw to get ≤ a′·k centers T
return T
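A sketch of this divide-and-conquer scheme, using a simple k-means-style medoid heuristic as the "approximation algorithm". The heuristic, the batch size and all names are illustrative assumptions, not the algorithm analyzed in the literature:

import numpy as np

def weighted_k_medoids(points, weights, k, n_iter=10, rng=None):
    """Heuristic: k-means-style iterations, but every center snaps to a data point (medoid)."""
    rng = np.random.default_rng(rng)
    medoids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            idx = np.where(labels == j)[0]
            if len(idx) == 0:
                continue
            mean = np.average(points[idx], axis=0, weights=weights[idx])
            medoids[j] = points[idx][np.linalg.norm(points[idx] - mean, axis=1).argmin()]
    return medoids, labels

def streaming_k_medoids(stream, k, batch_size=1000):
    """Divide the stream into batches, cluster each batch, then cluster the weighted medoids."""
    reps, weights = [], []
    for start in range(0, len(stream), batch_size):          # groups S_1, S_2, ..., S_l
        Si = stream[start:start + batch_size]
        Ti, labels = weighted_k_medoids(Si, np.ones(len(Si)), k)
        for j, t in enumerate(Ti):                           # keep the medoids, weighted by cluster size
            w_j = (labels == j).sum()
            if w_j > 0:
                reps.append(t)
                weights.append(float(w_j))
    Sw, w = np.array(reps), np.array(weights)
    T, _ = weighted_k_medoids(Sw, w, k)                      # final merging step on the weighted S_w
    return T

rng = np.random.default_rng(0)
data = rng.permutation(np.concatenate([rng.normal(-3, 1, (3000, 2)), rng.normal(3, 1, (3000, 2))]))
print(streaming_k_medoids(data, k=2))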

A streaming k-medoid algorithm (3)
▪ The following diagram shows the data flow of the
streaming algorithm for the k-medoid problem
(based on “divide and conquer”)

A streaming k-medoid algorithm (4)
Remarks:
▪ The approach motivates the general idea of how to
find clusters in huge datasets by
▪ splitting it into (random) subsets (that can be handled independently and might also be processed on distributed servers/databases),
▪ finding clusters in the subsets and extracting cluster descriptions or prototypical samples of the found clusters, and
▪ aggregating the clustering information of the individual subsets in a final merging step.
