Lecture 2. Similarity Measures for Cluster Analysis
Basic Concept: Measuring Similarity between Objects
Distance on Numeric Data: Minkowski Distance
Proximity Measure for Symmetric vs. Asymmetric Binary Variables
Distance between Categorical Attributes, Ordinal Attributes, and
Mixed Types
Proximity Measure between Two Vectors: Cosine Similarity
Correlation Measures between Two Variables: Covariance and
Correlation Coefficient
Summary
Session 1: Basic Concepts: Measuring Similarity between Objects
What Is Good Clustering?
A good clustering method will produce high-quality clusters, which should have
High intra-class similarity: Cohesive within clusters
Low inter-class similarity: Distinctive between clusters
Quality function
There is usually a separate “quality” function that measures the “goodness” of
a cluster
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
There exist many similarity measures and/or functions for different applications
Similarity measure is critical for cluster analysis
Similarity, Dissimilarity, and Proximity
Similarity measure or similarity function
A real-valued function that quantifies the similarity between two objects
Measure how alike two data objects are: The higher the value, the more alike
Often falls in the range [0,1]: 0: no similarity; 1: completely similar
Dissimilarity (or distance) measure
Numerical measure of how different two data objects are
In some sense, the inverse of similarity: The lower, the more alike
Minimum dissimilarity is often 0 (i.e., completely similar)
Range [0, 1] or [0, ∞), depending on the definition
Proximity usually refers to either similarity or dissimilarity
Session 2: Distance on Numeric Data: Minkowski Distance
Data Matrix and Dissimilarity Matrix
Data matrix
A data matrix of n data points with l dimensions:

    D = | x11  x12  ...  x1l |
        | x21  x22  ...  x2l |
        | ...                |
        | xn1  xn2  ...  xnl |

Dissimilarity (distance) matrix
Registers only the distances d(i, j) between the n data points (typically metric):

    | 0                       |
    | d(2,1)  0               |
    | ...                     |
    | d(n,1)  d(n,2)  ...  0  |

Usually symmetric, thus a triangular matrix
Distance functions are usually different for real, boolean, categorical, ordinal,
ratio, and vector variables
Weights can be associated with different variables based on applications and
data semantics
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point attribute1 attribute2
x1 1 2
x2 3 5
x3 2 0
x4 4 5
Dissimilarity Matrix (by Euclidean Distance)
x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
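To make the example concrete, the dissimilarity matrix above can be reproduced with a few lines of plain Python (a sketch; the helper name `euclidean` is mine, not from the slides):

```python
from math import sqrt

# The four 2-D points from the data matrix above
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def euclidean(a, b):
    """L2 distance between two equal-length numeric tuples."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Print the lower-triangular dissimilarity matrix, zeros on the diagonal
names = list(points)
for i, ni in enumerate(names):
    row = [round(euclidean(points[ni], points[nj]), 2) for nj in names[: i + 1]]
    print(ni, row)
```

The printed rows match the matrix above, e.g., d(x1, x2) = 3.61 and d(x2, x4) = 1.0.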
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
    d(i, j) = (|x_i1 − x_j1|^p + |x_i2 − x_j2|^p + ... + |x_il − x_jl|^p)^(1/p)

where i = (x_i1, x_i2, ..., x_il) and j = (x_j1, x_j2, ..., x_jl) are two
l-dimensional data objects, and p is the order (the distance so defined is also
called the L_p norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positivity)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Note: There are nonmetric dissimilarities, e.g., set differences
Special Cases of Minkowski Distance
p = 1: (L1 norm) Manhattan (or city block) distance

    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_il − x_jl|

E.g., the Hamming distance: the number of bits that are different between two
binary vectors

p = 2: (L2 norm) Euclidean distance

    d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_il − x_jl|^2)

p → ∞: (Lmax norm, L∞ norm) "supremum" distance
The maximum difference between any component (attribute) of the vectors

    d(i, j) = lim_{p→∞} (|x_i1 − x_j1|^p + ... + |x_il − x_jl|^p)^(1/p)
            = max_{f=1..l} |x_if − x_jf|
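The three special cases can be collapsed into one function; the sketch below (my own helper, not from the slides) treats p = ∞ as the supremum distance:

```python
def minkowski(a, b, p):
    """L_p distance; p = float('inf') gives the supremum (max) distance."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))             # Manhattan: 5.0
print(round(minkowski(x1, x2, 2), 2))   # Euclidean: 3.61
print(minkowski(x1, x2, float("inf")))  # Supremum: 3
```

These values agree with the x1/x2 entries of the L1, L2, and L∞ matrices on the next slide.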
Example: Minkowski Distance at Special Cases
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Manhattan (L1)
L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0

Euclidean (L2)
L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.10  0
x4   4.24  1.00  5.39  0

Supremum (L∞)
L∞   x1    x2    x3    x4
x1   0
x2   3     0
x3   2     5     0
x4   3     1     5     0
Session 3: Proximity Measure for Symmetric vs. Asymmetric Binary Variables
Proximity Measure for Binary Attributes
A contingency table for binary data (counts of value combinations over the p
attributes of objects i and j):

              Object j
               1      0      sum
Object i  1    a      b      a+b
          0    c      d      c+d
         sum  a+c    b+d     p

Distance measure for symmetric binary variables:

    d(i, j) = (b + c) / (a + b + c + d)

Distance measure for asymmetric binary variables (negative matches d are not
counted):

    d(i, j) = (b + c) / (a + b + c)

Jaccard coefficient (similarity measure for asymmetric binary variables):

    sim_Jaccard(i, j) = a / (a + b + c)

Note: The Jaccard coefficient is the same as "coherence" (a concept discussed in Pattern Discovery)
Example: Dissimilarity between Asymmetric Binary Variables
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

Gender is a symmetric attribute (not counted in)
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0

Contingency tables (rows: first object; columns: second object):

Jack vs. Mary           Jack vs. Jim            Jim vs. Mary
      1   0  ∑row             1   0  ∑row             1   0  ∑row
 1    2   0   2          1    1   1   2          1    1   1   2
 0    1   3   4          0    1   3   4          0    2   2   4
∑col  3   3   6        ∑col   2   4   6        ∑col   3   3   6

Distances:

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
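The same distances can be computed directly from the 0/1 vectors; the sketch below (helper name mine) counts a, b, c from the definitions above and ignores negative matches:

```python
def asymmetric_binary_distance(i, j):
    """d(i, j) = (b + c) / (a + b + c); negative matches (both 0) are ignored."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

# Asymmetric attributes (Fever, Cough, Test-1..Test-4), with Y/P -> 1 and N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```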
Session 4: Distance between Categorical Attributes, Ordinal Attributes, and Mixed Types
Proximity Measure for Categorical Attributes
Categorical data, also called nominal attributes
Example: Color (red, yellow, blue, green), profession, etc.
Method 1: Simple matching
m: # of matches, p: total # of variables

    d(i, j) = (p − m) / p
Method 2: Use a large number of binary attributes
Creating a new binary attribute for each of the M nominal states
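A minimal sketch of simple matching on two hypothetical records (the attribute values are made up for illustration):

```python
def simple_matching_distance(i, j):
    """d(i, j) = (p - m) / p for nominal attributes."""
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)
    return (p - m) / p

# Hypothetical records over (color, profession, marital-status)
a = ("red", "engineer", "single")
b = ("red", "teacher", "single")
print(round(simple_matching_distance(a, b), 2))  # 1 of 3 attributes differ -> 0.33
```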
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank (e.g., freshman, sophomore, junior, senior)
Can be treated like interval-scaled
Replace an ordinal variable value by its rank: r_if ∈ {1, ..., M_f}
Map the range of each variable onto [0, 1] by replacing the i-th object in the
f-th variable by

    z_if = (r_if − 1) / (M_f − 1)

Example: freshman: 0; sophomore: 1/3; junior: 2/3; senior: 1
Then distance: d(freshman, senior) = 1, d(junior, senior) = 1/3
Compute the dissimilarity using methods for interval-scaled variables
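The rank-to-[0, 1] mapping for the class-year example can be sketched as:

```python
# Map ordinal ranks r in {1..M} onto [0, 1]: z = (r - 1) / (M - 1)
ranks = {"freshman": 1, "sophomore": 2, "junior": 3, "senior": 4}
M = len(ranks)
z = {level: (r - 1) / (M - 1) for level, r in ranks.items()}

print(z)  # freshman 0.0, sophomore 1/3, junior 2/3, senior 1.0
print(round(abs(z["junior"] - z["senior"]), 2))  # d(junior, senior) = 1/3 -> 0.33
```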
Attributes of Mixed Type
A dataset may contain all attribute types
Nominal, symmetric binary, asymmetric binary, numeric, and ordinal
One may use a weighted formula to combine their effects:

    d(i, j) = [ Σ_{f=1..p} w_ij^(f) d_ij^(f) ] / [ Σ_{f=1..p} w_ij^(f) ]

If f is numeric: Use the normalized distance
If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf; d_ij^(f) = 1 otherwise
If f is ordinal:
  Compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
  Treat z_if as interval-scaled
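As a sketch of the weighted combination, consider two hypothetical objects with one numeric, one nominal, and one ordinal attribute (all names, ranges, and weights below are assumptions for illustration, with every weight set to 1):

```python
def numeric_d(x, y, lo, hi):
    """Normalized numeric distance, scaled into [0, 1] by the attribute range."""
    return abs(x - y) / (hi - lo)

def nominal_d(x, y):
    """0 if values match, 1 otherwise."""
    return 0.0 if x == y else 1.0

def ordinal_d(rx, ry, M):
    """Ranks rx, ry in {1..M} mapped to [0, 1], then compared as numeric."""
    z = lambda r: (r - 1) / (M - 1)
    return abs(z(rx) - z(ry))

# Hypothetical objects i and j: (income, color, rank out of 4)
i, j = (30_000, "red", 2), (50_000, "blue", 3)
ds = [numeric_d(i[0], j[0], 0, 100_000), nominal_d(i[1], j[1]), ordinal_d(i[2], j[2], 4)]
w = [1.0, 1.0, 1.0]

# Weighted mixed-type distance: sum(w_f * d_f) / sum(w_f)
d = sum(wf * df for wf, df in zip(w, ds)) / sum(w)
print(round(d, 2))  # (0.2 + 1.0 + 1/3) / 3 -> 0.51
```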
Session 5: Proximity Measure between Two Vectors: Cosine Similarity
Cosine Similarity of Two Vectors
A document can be represented by a bag of terms or a long vector, with each
attribute recording the frequency of a particular term (such as word, keyword, or
phrase) in the document
Other vector objects: Gene features in micro-arrays
Applications: Information retrieval, biological taxonomy, gene feature mapping, etc.
Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
    cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d
Example: Calculating Cosine Similarity
Calculating cosine similarity:

    cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d

Ex: Find the similarity between documents 1 and 2.

    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

First, calculate the vector dot product:

    d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25

Then, calculate ||d1|| and ||d2||:

    ||d1|| = sqrt(5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²) = sqrt(42) ≈ 6.481
    ||d2|| = sqrt(3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²) = sqrt(17) ≈ 4.123

Finally, calculate the cosine similarity:

    cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
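The worked example can be checked with a short sketch (helper name mine):

```python
from math import sqrt

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94, matching the hand calculation
```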
Session 6: Correlation Measures between Two Variables: Covariance and Correlation Coefficient
Variance for Single Variable
The variance of a random variable X provides a measure of how much the value of X
deviates from the mean or expected value of X:
    var(X) = σ² = E[(X − µ)²] =
        Σ_x (x − µ)² f(x)       if X is discrete
        ∫ (x − µ)² f(x) dx      if X is continuous

where σ² is the variance of X, σ is called the standard deviation,
and µ = E[X] is the mean (expected value) of X

That is, variance is the expected value of the squared deviation from the mean

It can also be written as:

    σ² = var(X) = E[(X − µ)²] = E[X²] − µ² = E[X²] − [E(X)]²

Sample variance is the average squared deviation of the data values x_i from the
sample mean µ̂:

    σ̂² = (1/n) Σ_{i=1..n} (x_i − µ̂)²
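A minimal sketch of the sample variance formula on a small made-up sample (the values are mine, chosen to reuse the stock prices of a later slide):

```python
def sample_variance(xs):
    """(1/n) * sum((x_i - mean)^2): the 1/n estimator used on this slide."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

xs = [2, 3, 5, 4, 6]
print(sample_variance(xs))  # mean 4; squared deviations 4+1+1+0+4 = 10; 10/5 = 2.0
```

Note the 1/n convention here; many libraries default to the unbiased 1/(n−1) estimator instead.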
Covariance for Two Variables
Covariance between two variables X1 and X2
    σ12 = E[(X1 − µ1)(X2 − µ2)] = E[X1 X2] − µ1 µ2 = E[X1 X2] − E[X1] E[X2]

where µ1 = E[X1] is the respective mean or expected value of X1; similarly for µ2

Sample covariance between X1 and X2:

    σ̂12 = (1/n) Σ_{i=1..n} (x_i1 − µ̂1)(x_i2 − µ̂2)

Sample covariance is a generalization of the sample variance:

    σ̂11 = (1/n) Σ_{i=1..n} (x_i1 − µ̂1)(x_i1 − µ̂1) = (1/n) Σ_{i=1..n} (x_i1 − µ̂1)² = σ̂1²
Positive covariance: If σ12 > 0
Negative covariance: If σ12 < 0
Independence: If X1 and X2 are independent, σ12 = 0 but the reverse is not true
Some pairs of random variables may have a covariance 0 but are not independent
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Example: Calculation of Covariance
Suppose two stocks X1 and X2 have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
Question: If the stocks are affected by the same industry trends, will their prices rise
or fall together?
Covariance formula:

    σ12 = E[(X1 − µ1)(X2 − µ2)] = E[X1 X2] − µ1 µ2 = E[X1 X2] − E[X1] E[X2]

Its computation can be simplified using σ12 = E[X1 X2] − E[X1] E[X2]:

    E(X1) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
    E(X2) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
    σ12 = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4

Thus, X1 and X2 rise together since σ12 > 0
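The stock example can be verified with a short sketch of the sample covariance (helper name mine, using the same 1/n convention as the slides):

```python
def sample_covariance(xs, ys):
    """(1/n) * sum((x - mx)(y - my)); equals E[XY] - E[X]E[Y] on the sample."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

x1 = [2, 3, 5, 4, 6]     # stock X1 prices over the week
x2 = [5, 8, 10, 11, 14]  # stock X2 prices over the week
print(round(sample_covariance(x1, x2), 2))  # 4.0: positive, so the stocks move together
```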
Correlation between Two Numerical Variables
Correlation between two variables X1 and X2 is the standardized covariance,
obtained by normalizing the covariance with the standard deviation of each variable:

    ρ12 = σ12 / (σ1 σ2) = σ12 / sqrt(σ1² σ2²)

Sample correlation for two attributes X1 and X2:

    ρ̂12 = σ̂12 / (σ̂1 σ̂2)
         = Σ_{i=1..n} (x_i1 − µ̂1)(x_i2 − µ̂2)
           / sqrt( Σ_{i=1..n} (x_i1 − µ̂1)² × Σ_{i=1..n} (x_i2 − µ̂2)² )

where n is the number of tuples, µ1 and µ2 are the respective means of X1 and X2,
and σ1 and σ2 are the respective standard deviations of X1 and X2

If ρ12 > 0: X1 and X2 are positively correlated (X1's values increase as X2's do)
  The higher the value, the stronger the correlation
If ρ12 = 0: independent (under the same assumptions as discussed for covariance)
If ρ12 < 0: negatively correlated
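Applying the sample correlation formula to the stock data from the covariance example (a sketch; the helper name is mine):

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Pearson correlation: sample covariance normalized by both standard deviations.
    The 1/n factors cancel, so raw deviation sums are used directly."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

x1 = [2, 3, 5, 4, 6]
x2 = [5, 8, 10, 11, 14]
print(round(sample_correlation(x1, x2), 2))  # 0.94: strong positive correlation
```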
Visualizing Changes of Correlation Coefficient
Correlation coefficient value range:
[–1, 1]
A set of scatter plots shows sets of
points and their correlation
coefficients changing from –1 to 1
Covariance Matrix
The variance and covariance information for the two variables X1 and X2 can be
summarized as a 2 × 2 covariance matrix:

    Σ = E[(X − µ)(X − µ)^T]

      = | E[(X1 − µ1)(X1 − µ1)]   E[(X1 − µ1)(X2 − µ2)] |
        | E[(X2 − µ2)(X1 − µ1)]   E[(X2 − µ2)(X2 − µ2)] |

      = | σ1²   σ12 |
        | σ21   σ2² |

In general, considering d numerical attributes X1, X2, ..., Xd, we have the
n × d data matrix

    D = | x11  x12  ...  x1d |
        | x21  x22  ...  x2d |
        | ...                |
        | xn1  xn2  ...  xnd |

and the d × d covariance matrix

    Σ = E[(X − µ)(X − µ)^T] = | σ1²  σ12  ...  σ1d |
                              | σ21  σ2²  ...  σ2d |
                              | ...                |
                              | σd1  σd2  ...  σd² |
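The 2 × 2 case can be computed directly from the two-stock data; the sketch below (helper name mine) builds the d × d matrix with the 1/n convention, so the diagonal holds the sample variances and the off-diagonal the sample covariance:

```python
def covariance_matrix(data):
    """d x d sample covariance matrix (1/n convention) for n rows of d attributes."""
    n, d = len(data), len(data[0])
    mu = [sum(row[f] for row in data) / n for f in range(d)]
    return [[sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in data) / n
             for b in range(d)] for a in range(d)]

data = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]  # the two-stock example
S = covariance_matrix(data)
for row in S:
    print([round(v, 2) for v in row])
# Symmetric: variances 2.0 and 9.04 on the diagonal, covariance 4.0 off-diagonal
```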
Session 7: Summary
Summary: Similarity Measures for Cluster Analysis
Basic Concept: Measuring Similarity between Objects
Distance on Numeric Data: Minkowski Distance
Proximity Measure for Symmetric vs. Asymmetric Binary Variables
Distance between Categorical Attributes, Ordinal Attributes, and
Mixed Types
Proximity Measure between Two Vectors: Cosine Similarity
Correlation Measures between Two Variables: Covariance and
Correlation Coefficient
Summary
Recommended Readings
L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster
Analysis, John Wiley & Sons, 1990
Mohammed J. Zaki and Wagner Meira, Jr. Data Mining and Analysis: Fundamental
Concepts and Algorithms. Cambridge University Press, 2014
Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques.
Morgan Kaufmann, 3rd ed., 2011
Charu Aggarwal and Chandran K. Reddy (eds.). Data Clustering: Algorithms and
Applications. CRC Press, 2014