Cluster Analysis
What is Cluster Analysis?
Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
Intra-cluster distances are minimized; inter-cluster distances are maximized
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Clustering is used:
As a stand-alone tool to get insight into data distribution
Visualization of clusters may unveil important information
As a preprocessing step for other algorithms
Efficient indexing or compression often relies on clustering
Some Applications of
Clustering
Pattern Recognition
Image Processing
cluster images based on their visual content
Bio-informatics
WWW and IR
document classification
cluster Weblog data to discover groups of similar access patterns
What Is Good Clustering?
A good clustering method will produce high quality
clusters with
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
The quality of a clustering method is also measured by
its ability to discover some or all of the hidden patterns.
Requirements of Clustering in
Data Mining
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Ability to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Usability
Outliers
Outliers are objects that do not belong to any cluster
or form clusters of very small cardinality
In some applications we are interested in discovering
outliers, not clusters (outlier analysis)
Data Structures
Data matrix (two modes): n tuples/objects (rows) described by p attributes/dimensions (columns)

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity or distance matrix (one mode): objects by objects

$$
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$

Assuming symmetric distance, d(i, j) = d(j, i), so only the lower triangle is stored.
Measuring Similarity in
Clustering
Dissimilarity/Similarity metric:
The dissimilarity d(i, j) between two objects i and j is expressed in
terms of a distance function, which is typically a metric:
d(i, j) ≥ 0 (non-negativity)
d(i, i)=0 (isolation)
d(i, j)= d(j, i) (symmetry)
d(i, j) ≤ d(i, h)+d(h, j) (triangular inequality)
The definitions of distance functions are usually different
for interval-scaled, boolean, categorical, ordinal and ratio-
scaled variables.
Weights may be associated with different variables based
on applications and data semantics.
Type of data in cluster
analysis
Interval-scaled variables
e.g., salary, height
Binary variables
e.g., gender (M/F), has_cancer(T/F)
Nominal (categorical) variables
e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
Ordinal variables
e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Ratio-scaled variables
population growth (1,10,100,1000,...)
Variables of mixed types
multiple attributes with various types
Similarity and Dissimilarity Between
Objects
Distance metrics are normally used to measure the
similarity or dissimilarity between two data objects
The most popular conform to Minkowski distance:
$$
L_p(i, j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}
$$

where i = (x_i1, x_i2, …, x_in) and j = (x_j1, x_j2, …, x_jn) are two n-dimensional data objects, and p is a positive integer

If p = 1, L1 is the Manhattan (or city block) distance:

$$
L_1(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|
$$
Similarity and Dissimilarity
Between Objects (Cont.)
If p = 2, L2 is the Euclidean distance:
$$
d(i, j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2 }
$$

Properties
d(i, j) ≥ 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) ≤ d(i, k) + d(k, j)
Also one can use a weighted distance:

$$
d(i, j) = \sqrt{ w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2 }
$$
Binary Variables
A binary variable has two states: 0 absent, 1 present
A contingency table for binary data (objects i and j, e.g., i = (0011101001), j = (1001100110)):

                    object j
                    1        0        sum
object i     1      a        b        a+b
             0      c        d        c+d
           sum      a+c      b+d      p

Simple matching coefficient distance:

$$
d(i, j) = \frac{b + c}{a + b + c + d}
$$

Jaccard coefficient distance:

$$
d(i, j) = \frac{b + c}{a + b + c}
$$
Binary Variables
Another approach is to define the similarity of two
objects and not their distance.
In that case we have the following:
Simple matching coefficient similarity:

$$
s(i, j) = \frac{a + d}{a + b + c + d}
$$

Jaccard coefficient similarity:

$$
s(i, j) = \frac{a}{a + b + c}
$$
Note that: s(i,j) = 1 – d(i,j)
Dissimilarity between Binary
Variables
Example (Jaccard coefficient)
all attributes are asymmetric binary
1 denotes presence or positive test
0 denotes absence or negative test
$$
d(\mathrm{jack}, \mathrm{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33
$$

$$
d(\mathrm{jack}, \mathrm{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67
$$

$$
d(\mathrm{jim}, \mathrm{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75
$$
A simpler definition
Each variable is mapped to a bitmap (binary vector)
Jack: 101000
Mary: 101010
Jim: 110000
Simple match distance:

$$
d(i, j) = \frac{\text{number of non-common bit positions}}{\text{total number of bits}}
$$

Jaccard coefficient:

$$
d(i, j) = 1 - \frac{\text{number of 1's in } i \wedge j}{\text{number of 1's in } i \vee j}
$$
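To tie the bitmap definitions above to the Jack/Mary/Jim example, here is a small Python sketch (illustrative helper names, not from the slides):

```python
# Bitmap-based binary distances from the definitions above.
def simple_match_distance(i, j):
    """Fraction of bit positions where the two bitmaps disagree."""
    return sum(a != b for a, b in zip(i, j)) / len(i)

def jaccard_distance(i, j):
    """1 - (# positions where both bits are 1) / (# positions where either bit is 1)."""
    both = sum(a == "1" and b == "1" for a, b in zip(i, j))
    either = sum(a == "1" or b == "1" for a, b in zip(i, j))
    return 1 - both / either

jack, mary, jim = "101000", "101010", "110000"
print(round(jaccard_distance(jack, mary), 2))  # 0.33
print(round(jaccard_distance(jack, jim), 2))   # 0.67
print(round(jaccard_distance(jim, mary), 2))   # 0.75
```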
Variables of Mixed Types
A database may contain all six types of variables:
symmetric binary, asymmetric binary, nominal, ordinal, interval
and ratio-scaled.
One may use a weighted formula to combine their effects, as sketched below.
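One possible weighted combination is sketched here; the per-attribute handling (range-normalized interval attributes, simple matching for binary/nominal) and the example values are assumptions for illustration, not the specific formula the slides prescribe.

```python
# Sketch of a weighted mixed-type dissimilarity: each attribute contributes a
# dissimilarity in [0, 1]; the overall value is their weighted average.
def mixed_dissimilarity(x, y, attrs):
    """attrs: list of (kind, weight, value_range) aligned with the attribute order in x and y."""
    num = den = 0.0
    for (kind, w, rng), a, b in zip(attrs, x, y):
        if kind == "interval":              # normalize by the attribute's range
            d = abs(a - b) / rng
        elif kind in ("binary", "nominal"): # simple matching
            d = 0.0 if a == b else 1.0
        else:
            raise ValueError("unsupported attribute kind: " + kind)
        num += w * d
        den += w
    return num / den

attrs = [("interval", 1.0, 50), ("nominal", 1.0, None), ("binary", 2.0, None)]
print(mixed_dissimilarity((30, "Christian", 1), (55, "Muslim", 1), attrs))  # 0.375
```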
Major Clustering Approaches
Partitioning algorithms: Construct random partitions and then
iteratively refine them by some criterion
Hierarchical algorithms: Create a hierarchical decomposition of the
set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Partitioning Algorithms: Basic
Concept
Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
K-means Clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest
centroid
Number of clusters, K, must be specified
K-means Clustering
Algorithm: k-Means Clustering
Input: a database D, of m records, r1, ..., rm and a desired number of
clusters k
Output: set of k clusters that minimizes the squared error criterion
Begin
randomly choose k records as the centroids for the k clusters;
repeat
assign each record, ri, to a cluster such that the distance between ri
and the cluster centroid (mean) is the smallest among the k clusters;
recalculate the centroid (mean) for each cluster based on the records
assigned to the cluster;
until no change;
End;
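The following is a minimal Python sketch of this algorithm (illustrative code, not an authoritative implementation); because the initial centroids are chosen randomly, different runs can give different clusters, as noted later.

```python
import random

def kmeans(records, k, max_iters=100):
    """Minimal k-means sketch; records is a list of numeric tuples."""
    centroids = random.sample(records, k)          # k random records as initial centroids
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each record joins the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])))
            for r in records
        ]
        if new_assignment == assignment:            # no change: stop
            break
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its assigned records.
        for c in range(k):
            members = [r for r, lab in zip(records, assignment) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Records from the worked example below: (Age, Years_of_service)
data = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
print(kmeans(data, 2))
```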
K-means Clustering Example
Sample 2-dimensional records for clustering example
RID   Age   Years_of_service
1     30    5
2     50    25
3     50    15     (initial centroid of C1)
4     25    5
5     30    10
6     55    25     (initial centroid of C2)
Assume that the number of desired clusters k is 2.
Let the algorithm choose records with RID 3 for cluster C1 and RID 6
for cluster C2 as the initial cluster centroids
The remaining records will be assigned to one of those clusters
during the first iteration of the repeat loop
K-means Clustering Example
The Euclidean distance between records r_j and r_k in n-dimensional space is calculated as:

$$
d(r_j, r_k) = \sqrt{ \sum_{f=1}^{n} (r_{jf} - r_{kf})^2 }
$$

Here r_j is the record being assigned and r_k is one of the current centroids (C1 or C2 in this example).

Record   distance from C1   distance from C2   it joins cluster
1        22.4               32.0               C1
2        10.0               5.0                C2
4        26.9               36.1               C1
5        20.6               29.2               C1
K-means Clustering Example
Now, the new means (centroids) for the two clusters are computed. The mean for a cluster C_i with n records of m dimensions is the vector:

$$
m_i = \left( \frac{1}{n}\sum_{r_j \in C_i} r_{j1}, \; \ldots, \; \frac{1}{n}\sum_{r_j \in C_i} r_{jm} \right)
$$

In our example, records 1, 3, 4, 5 belong to C1 and records 2, 6 belong to C2, so:

C1(new) = (1/4 (30+50+25+30), 1/4 (5+15+5+10)) = (33.75, 8.75)
C2(new) = (1/2 (50+55), 1/2 (25+25)) = (52.5, 25)
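As a quick check of the numbers above, this short Python sketch (illustrative, not from the slides) reproduces the first-iteration assignments and the recomputed centroids:

```python
from math import dist   # Euclidean distance, Python 3.8+

# (Age, Years_of_service) records from the example, RIDs 1..6 in order.
records = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
c1, c2 = records[2], records[5]          # initial centroids: RID 3 and RID 6

clusters = {"C1": [], "C2": []}
for r in records:
    clusters["C1" if dist(r, c1) <= dist(r, c2) else "C2"].append(r)

def mean(points):
    return tuple(round(sum(dim) / len(points), 2) for dim in zip(*points))

print(mean(clusters["C1"]))   # (33.75, 8.75)
print(mean(clusters["C2"]))   # (52.5, 25.0)
```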
K-means Clustering Example
A second iteration computes the distance of each record from the new centroids.
In the following table, calculate the distance of each record from the new C1 and C2, and assign each record to the appropriate cluster:

Record   distance from C1   distance from C2   it joins cluster
1
2
3
4
5
6

Then calculate the new C1 and C2.
Tip: C1 will be (28.3, 6.7) and C2 will be (51.7, 21.7)
K-means Clustering Example
Proceed to the next iteration as on the previous slide.
Stop when the assignments (and hence the centroids) no longer change.
K-means Clustering – Details
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the
cluster.
‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to ‘Until relatively few
points change clusters’
Complexity is O( n * K * I * d )
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
Evaluating K-means Clusters
The terminating condition is usually based on the squared-error criterion. For clusters C1, ..., Ck with means m1, ..., mk, the error is defined as:

$$
E = \sum_{i=1}^{k} \sum_{r_j \in C_i} \lVert r_j - m_i \rVert^2
$$

where r_j is a data point in cluster C_i and m_i is the corresponding mean (centroid) of that cluster.
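A minimal Python sketch of this squared-error computation, using the clusters and (rounded) centroids from the worked example above; the helper name is illustrative.

```python
def squared_error(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(r, m))
        for points, m in zip(clusters, centroids)
        for r in points
    )

# Converged clusters and centroids from the worked example (rounded values).
clusters = [[(30, 5), (25, 5), (30, 10)], [(50, 25), (50, 15), (55, 25)]]
centroids = [(28.33, 6.67), (51.67, 21.67)]
print(round(squared_error(clusters, centroids), 1))   # ~116.7
```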
Solutions to Initial Centroids
Problem
Multiple runs
Helps, but probability is not on your side
Sample and use hierarchical clustering to
determine initial centroids
Select more than k initial centroids and then
select among these initial centroids
Select the most widely separated (see the sketch below)
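One common way to realize the last two ideas (choose candidates, then keep the most widely separated ones) is farthest-point selection; the sketch below is an illustration under that assumption, not the specific procedure the slides prescribe.

```python
import random

def farthest_point_init(records, k):
    """Pick k widely separated records: start from a random one, then repeatedly
    add the record farthest from all centroids chosen so far."""
    def sqdist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    centroids = [random.choice(records)]
    while len(centroids) < k:
        farthest = max(records, key=lambda r: min(sqdist(r, c) for c in centroids))
        centroids.append(farthest)
    return centroids

data = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
print(farthest_point_init(data, 2))
```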