Cluster Analysis
What is Cluster Analysis?
Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
Intra-cluster distances are minimized; inter-cluster distances are maximized
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Clustering is used:
As a stand-alone tool to get insight into data distribution
Visualization of clusters may unveil important information
As a preprocessing step for other algorithms
Efficient indexing or compression often relies on clustering
Some Applications of
Clustering
Pattern Recognition
Image Processing
cluster images based on their visual content
Bio-informatics
WWW and IR
document classification
cluster Weblog data to discover groups of similar access patterns
What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
Requirements of Clustering in
Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Ability to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Usability
Outliers
Outliers are objects that do not belong to any cluster
or form clusters of very small cardinality
In some applications we are interested in discovering
outliers, not clusters (outlier analysis)
Data Structures
Data matrix (two modes): n tuples/objects (rows) described by p attributes/dimensions (columns)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity or distance matrix (one mode): objects by objects

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Assuming symmetric distance, d(i, j) = d(j, i), so only the lower triangle is stored.
Measuring Similarity in
Clustering
Dissimilarity/Similarity metric:
The dissimilarity d(i, j) between two objects i and j is expressed in
terms of a distance function, which is typically a metric:
d(i, j) ≥ 0 (non-negativity)
d(i, i)=0 (isolation)
d(i, j)= d(j, i) (symmetry)
d(i, j) ≤ d(i, h)+d(h, j) (triangular inequality)
The definitions of distance functions are usually different
for interval-scaled, boolean, categorical, ordinal and ratio-
scaled variables.
Weights may be associated with different variables based
on applications and data semantics.
Type of data in cluster
analysis
Interval-scaled variables
e.g., salary, height
Binary variables
e.g., gender (M/F), has_cancer(T/F)
Nominal (categorical) variables
e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
Ordinal variables
e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Ratio-scaled variables
population growth (1,10,100,1000,...)
Variables of mixed types
multiple attributes with various types
Similarity and Dissimilarity Between
Objects
Distance metrics are normally used to measure the
similarity or dissimilarity between two data objects
The most popular conform to Minkowski distance:
$$
L_p(i, j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}
$$

where i = (x_i1, x_i2, …, x_in) and j = (x_j1, x_j2, …, x_jn) are two n-dimensional data objects, and p is a positive integer

If p = 1, L1 is the Manhattan (or city block) distance:

$$
L_1(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|
$$
Similarity and Dissimilarity
Between Objects (Cont.)
If p = 2, L2 is the Euclidean distance:
$$
d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2 }
$$

Properties
d(i, j) ≥ 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) ≤ d(i, k) + d(k, j)
Also one can use a weighted distance:

$$
d(i, j) = \sqrt{ w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2 }
$$
Binary Variables
A binary variable has two states: 0 absent, 1 present
A contingency table for binary data (objects i and j, e.g., i = (0011101001), j = (1001100110)):

                    object j
                    1        0        sum
object i     1      a        b        a+b
             0      c        d        c+d
           sum      a+c      b+d      p

Simple matching coefficient distance:

$$
d(i, j) = \frac{b + c}{a + b + c + d}
$$

Jaccard coefficient distance:

$$
d(i, j) = \frac{b + c}{a + b + c}
$$
Binary Variables
Another approach is to define the similarity of two
objects and not their distance.
In that case we have the following:
Simple matching coefficient similarity:

$$
s(i, j) = \frac{a + d}{a + b + c + d}
$$

Jaccard coefficient similarity:

$$
s(i, j) = \frac{a}{a + b + c}
$$
Note that: s(i,j) = 1 – d(i,j)
Dissimilarity between Binary
Variables
Example (Jaccard coefficient)
all attributes are asymmetric binary
1 denotes presence or positive test
0 denotes absence or negative test
$$
d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$

$$
d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$

$$
d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$
A simpler definition
Each variable is mapped to a bitmap (binary vector)
Jack: 101000
Mary: 101010
Jim: 110000
Simple match distance:

$$
d(i, j) = \frac{\text{number of non-common bit positions}}{\text{total number of bits}}
$$

Jaccard coefficient:

$$
d(i, j) = 1 - \frac{\text{number of 1's in } i \wedge j}{\text{number of 1's in } i \vee j}
$$
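To tie the bitmap definitions above to the Jack/Mary/Jim example, here is a small Python sketch (illustrative helper names, not from the slides):

```python
# Bitmap-based binary distances from the definitions above.
def simple_match_distance(i, j):
    """Fraction of bit positions where the two bitmaps disagree."""
    return sum(a != b for a, b in zip(i, j)) / len(i)

def jaccard_distance(i, j):
    """1 - (# positions where both bits are 1) / (# positions where either bit is 1)."""
    both = sum(a == "1" and b == "1" for a, b in zip(i, j))
    either = sum(a == "1" or b == "1" for a, b in zip(i, j))
    return 1 - both / either

jack, mary, jim = "101000", "101010", "110000"
print(round(jaccard_distance(jack, mary), 2))  # 0.33
print(round(jaccard_distance(jack, jim), 2))   # 0.67
print(round(jaccard_distance(jim, mary), 2))   # 0.75
```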
Variables of Mixed Types
A database may contain all six types of variables:
symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio-scaled.
One may use a weighted formula to combine their effects, as sketched below.
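One possible weighted combination is sketched here; the per-attribute handling (range-normalized interval attributes, simple matching for binary/nominal) and the example values are assumptions for illustration, not the specific formula the slides prescribe.

```python
# Sketch of a weighted mixed-type dissimilarity: each attribute contributes a
# dissimilarity in [0, 1]; the overall value is their weighted average.
def mixed_dissimilarity(x, y, attrs):
    """attrs: list of (kind, weight, value_range) aligned with the attribute order in x and y."""
    num = den = 0.0
    for (kind, w, rng), a, b in zip(attrs, x, y):
        if kind == "interval":              # normalize by the attribute's range
            d = abs(a - b) / rng
        elif kind in ("binary", "nominal"): # simple matching
            d = 0.0 if a == b else 1.0
        else:
            raise ValueError("unsupported attribute kind: " + kind)
        num += w * d
        den += w
    return num / den

attrs = [("interval", 1.0, 50), ("nominal", 1.0, None), ("binary", 2.0, None)]
print(mixed_dissimilarity((30, "Christian", 1), (55, "Muslim", 1), attrs))  # 0.375
```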
Major Clustering Approaches
Partitioning algorithms: Construct random partitions and then
iteratively refine them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Partitioning Algorithms: Basic
Concept
Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest
centroid
Number of clusters, K, must be specified
K-means Clustering
Algorithm: k-Means Clustering
Input: a database D, of m records, r1, ..., rm and a desired number of
clusters k
Output: set of k clusters that minimizes the squared error criterion
Begin
randomly choose k records as the centroids for the k clusters;
repeat
assign each record, ri, to a cluster such that the distance between ri
and the cluster centroid (mean) is the smallest among the k clusters;
recalculate the centroid (mean) for each cluster based on the records
assigned to the cluster;
until no change;
End;
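The following is a minimal Python sketch of this algorithm (illustrative code, not an authoritative implementation); because the initial centroids are chosen randomly, different runs can give different clusters, as noted later.

```python
import random

def kmeans(records, k, max_iters=100):
    """Minimal k-means sketch; records is a list of numeric tuples."""
    centroids = random.sample(records, k)          # k random records as initial centroids
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each record joins the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])))
            for r in records
        ]
        if new_assignment == assignment:            # no change: stop
            break
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its assigned records.
        for c in range(k):
            members = [r for r, lab in zip(records, assignment) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Records from the worked example below: (Age, Years_of_service)
data = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
print(kmeans(data, 2))
```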
K-means Clustering Example
Sample 2-dimensional records for clustering example
RID   Age   Years_of_service
1     30    5
2     50    25
3     50    15     (initial centroid of C1)
4     25    5
5     30    10
6     55    25     (initial centroid of C2)
Assume that the number of desired clusters k is 2.
Let the algorithm choose records with RID 3 for cluster C1 and RID 6
for cluster C2 as the initial cluster centroids
The remaining records will be assigned to one of those clusters
during the first iteration of the repeat loop
K-means Clustering Example
The Euclidean distance between records r_j and r_k in n-dimensional space is calculated as:

$$
d(r_j, r_k) = \sqrt{ \sum_{f=1}^{n} (r_{jf} - r_{kf})^2 }
$$

Here r_j is the record being assigned and r_k is one of the current centroids (C1 or C2 in this example).

Record   distance from C1   distance from C2   it joins cluster
1        22.4               32.0               C1
2        10.0               5.0                C2
4        26.9               36.1               C1
5        20.6               29.2               C1
K-means Clustering Example
Now, the new means (centroids) for the two clusters are computed. The mean for a cluster C_i with n records of m dimensions is the vector:

$$
m_i = \left( \frac{1}{n}\sum_{r_j \in C_i} r_{j1}, \; \ldots, \; \frac{1}{n}\sum_{r_j \in C_i} r_{jm} \right)
$$

In our example, records 1, 3, 4, 5 belong to C1 and records 2, 6 belong to C2, so:

C1(new) = (1/4 (30+50+25+30), 1/4 (5+15+5+10)) = (33.75, 8.75)
C2(new) = (1/2 (50+55), 1/2 (25+25)) = (52.5, 25)
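As a quick check of the numbers above, this short Python sketch (illustrative, not from the slides) reproduces the first-iteration assignments and the recomputed centroids:

```python
from math import dist   # Euclidean distance, Python 3.8+

# (Age, Years_of_service) records from the example, RIDs 1..6 in order.
records = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
c1, c2 = records[2], records[5]          # initial centroids: RID 3 and RID 6

clusters = {"C1": [], "C2": []}
for r in records:
    clusters["C1" if dist(r, c1) <= dist(r, c2) else "C2"].append(r)

def mean(points):
    return tuple(round(sum(dim) / len(points), 2) for dim in zip(*points))

print(mean(clusters["C1"]))   # (33.75, 8.75)
print(mean(clusters["C2"]))   # (52.5, 25.0)
```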
K-means Clustering Example
A second iteration computes the distance of each record from the new centroids.
In the following table, calculate the distance of each record from the new C1 and C2, and assign each record to the appropriate cluster:

Record   distance from C1   distance from C2   it joins cluster
1
2
3
4
5
6

Then calculate the new C1 and C2.
Tip: C1 will be (28.3, 6.7) and C2 will be (51.7, 21.7)
K-means Clustering Example
Proceed to the next iteration as on the previous slide.
Stop when the assignments (and hence the centroids) no longer change.
K-means Clustering – Details
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the
cluster.
‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to ‘Until relatively few
points change clusters’
Complexity is O( n * K * I * d )
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Evaluating K-means Clusters
The terminating condition is usually based on the squared-error criterion. For clusters C1, ..., Ck with means m1, ..., mk, the error is defined as:

$$
E = \sum_{i=1}^{k} \sum_{r_j \in C_i} \lVert r_j - m_i \rVert^2
$$

where r_j is a data point in cluster C_i and m_i is the corresponding mean (centroid) of that cluster.
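A minimal Python sketch of this squared-error computation, using the clusters and (rounded) centroids from the worked example above; the helper name is illustrative.

```python
def squared_error(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(r, m))
        for points, m in zip(clusters, centroids)
        for r in points
    )

# Converged clusters and centroids from the worked example (rounded values).
clusters = [[(30, 5), (25, 5), (30, 10)], [(50, 25), (50, 15), (55, 25)]]
centroids = [(28.33, 6.67), (51.67, 21.67)]
print(round(squared_error(clusters, centroids), 1))   # ~116.7
```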
Solutions to Initial Centroids
Problem
Multiple runs
Helps, but probability is not on your side
Sample and use hierarchical clustering to
determine initial centroids
Select more than k initial centroids and then
select among these initial centroids
Select the most widely separated (see the sketch below)
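One common way to realize the last two ideas (choose candidates, then keep the most widely separated ones) is farthest-point selection; the sketch below is an illustration under that assumption, not the specific procedure the slides prescribe.

```python
import random

def farthest_point_init(records, k):
    """Pick k widely separated records: start from a random one, then repeatedly
    add the record farthest from all centroids chosen so far."""
    def sqdist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    centroids = [random.choice(records)]
    while len(centroids) < k:
        farthest = max(records, key=lambda r: min(sqdist(r, c) for c in centroids))
        centroids.append(farthest)
    return centroids

data = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
print(farthest_point_init(data, 2))
```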