Lecture 2. Similarity Measures for Cluster Analysis
Basic Concept: Measuring Similarity between Objects
Distance on Numeric Data: Minkowski Distance
Proximity Measure for Symmetric vs. Asymmetric Binary Variables
Distance between Categorical Attributes, Ordinal Attributes, and
Mixed Types
Proximity Measure between Two Vectors: Cosine Similarity
Correlation Measures between Two Variables: Covariance and
Correlation Coefficient
Summary
Session 1: Basic Concepts: Measuring Similarity between Objects
What Is Good Clustering?
A good clustering method will produce high-quality clusters, which should have
High intra-class similarity: Cohesive within clusters
Low inter-class similarity: Distinctive between clusters
Quality function
There is usually a separate “quality” function that measures the “goodness” of
a cluster
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
There exist many similarity measures and/or functions for different applications
Similarity measure is critical for cluster analysis
Similarity, Dissimilarity, and Proximity
Similarity measure or similarity function
A real-valued function that quantifies the similarity between two objects
Measure how alike two data objects are: The higher the value, the more alike
Often falls in the range [0,1]: 0: no similarity; 1: completely similar
Dissimilarity (or distance) measure
Numerical measure of how different two data objects are
In some sense, the inverse of similarity: The lower, the more alike
Minimum dissimilarity is often 0 (i.e., completely similar)
Range [0, 1] or [0, ∞), depending on the definition
Proximity usually refers to either similarity or dissimilarity
Session 2: Distance on Numeric Data: Minkowski Distance
Data Matrix and Dissimilarity Matrix
Data matrix
A data matrix of n data points with l dimensions:

    D = | x11  x12  ...  x1l |
        | x21  x22  ...  x2l |
        | ...                |
        | xn1  xn2  ...  xnl |

Dissimilarity (distance) matrix
Registers only the distances d(i, j) between the n data points (typically metric):

    | 0                       |
    | d(2,1)  0               |
    | ...                     |
    | d(n,1)  d(n,2)  ...  0  |

Usually symmetric, thus a triangular matrix
Distance functions are usually different for real, boolean, categorical, ordinal,
ratio, and vector variables
Weights can be associated with different variables based on applications and
data semantics
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
point attribute1 attribute2
x1 1 2
x2 3 5
x3 2 0
x4 4 5
Dissimilarity Matrix (by Euclidean Distance)
x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
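To make the example concrete, the dissimilarity matrix above can be reproduced with a few lines of plain Python (a sketch; the helper name `euclidean` is mine, not from the slides):

```python
from math import sqrt

# The four 2-D points from the data matrix above
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def euclidean(a, b):
    """L2 distance between two equal-length numeric tuples."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Print the lower-triangular dissimilarity matrix, zeros on the diagonal
names = list(points)
for i, ni in enumerate(names):
    row = [round(euclidean(points[ni], points[nj]), 2) for nj in names[: i + 1]]
    print(ni, row)
```

The printed rows match the matrix above, e.g., d(x1, x2) = 3.61 and d(x2, x4) = 1.0.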
Distance on Numeric Data: Minkowski Distance
Minkowski distance: A popular distance measure
    d(i, j) = (|x_i1 − x_j1|^p + |x_i2 − x_j2|^p + ... + |x_il − x_jl|^p)^(1/p)

where i = (x_i1, x_i2, ..., x_il) and j = (x_j1, x_j2, ..., x_jl) are two
l-dimensional data objects, and p is the order (the distance so defined is also
called the L_p norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positivity)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Note: There are nonmetric dissimilarities, e.g., set differences
Special Cases of Minkowski Distance
p = 1: (L1 norm) Manhattan (or city block) distance

    d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_il − x_jl|

E.g., the Hamming distance: the number of bits that are different between two
binary vectors

p = 2: (L2 norm) Euclidean distance

    d(i, j) = sqrt(|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + ... + |x_il − x_jl|^2)

p → ∞: (Lmax norm, L∞ norm) "supremum" distance
The maximum difference between any component (attribute) of the vectors

    d(i, j) = lim_{p→∞} (|x_i1 − x_j1|^p + ... + |x_il − x_jl|^p)^(1/p)
            = max_{f=1..l} |x_if − x_jf|
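The three special cases can be collapsed into one function; the sketch below (my own helper, not from the slides) treats p = ∞ as the supremum distance:

```python
def minkowski(a, b, p):
    """L_p distance; p = float('inf') gives the supremum (max) distance."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))             # Manhattan: 5.0
print(round(minkowski(x1, x2, 2), 2))   # Euclidean: 3.61
print(minkowski(x1, x2, float("inf")))  # Supremum: 3
```

These values agree with the x1/x2 entries of the L1, L2, and L∞ matrices on the next slide.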
Example: Minkowski Distance at Special Cases
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Manhattan (L1)
L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0

Euclidean (L2)
L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.10  0
x4   4.24  1.00  5.39  0

Supremum (L∞)
L∞   x1    x2    x3    x4
x1   0
x2   3     0
x3   2     5     0
x4   3     1     5     0
Session 3: Proximity Measure for Symmetric vs. Asymmetric Binary Variables
Proximity Measure for Binary Attributes
A contingency table for binary data (counts of value combinations over the p
attributes of objects i and j):

              Object j
               1      0      sum
Object i  1    a      b      a+b
          0    c      d      c+d
         sum  a+c    b+d     p

Distance measure for symmetric binary variables:

    d(i, j) = (b + c) / (a + b + c + d)

Distance measure for asymmetric binary variables (negative matches d are not
counted):

    d(i, j) = (b + c) / (a + b + c)

Jaccard coefficient (similarity measure for asymmetric binary variables):

    sim_Jaccard(i, j) = a / (a + b + c)

Note: The Jaccard coefficient is the same as "coherence" (a concept discussed in Pattern Discovery)
Example: Dissimilarity between Asymmetric Binary Variables
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

Gender is a symmetric attribute (not counted in)
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0

Contingency tables (rows: first object; columns: second object):

Jack vs. Mary           Jack vs. Jim            Jim vs. Mary
      1   0  ∑row             1   0  ∑row             1   0  ∑row
 1    2   0   2          1    1   1   2          1    1   1   2
 0    1   3   4          0    1   3   4          0    2   2   4
∑col  3   3   6        ∑col   2   4   6        ∑col   3   3   6

Distances:

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
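The same distances can be computed directly from the 0/1 vectors; the sketch below (helper name mine) counts a, b, c from the definitions above and ignores negative matches:

```python
def asymmetric_binary_distance(i, j):
    """d(i, j) = (b + c) / (a + b + c); negative matches (both 0) are ignored."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

# Asymmetric attributes (Fever, Cough, Test-1..Test-4), with Y/P -> 1 and N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```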
Session 4: Distance between Categorical Attributes, Ordinal Attributes, and Mixed Types
Proximity Measure for Categorical Attributes
Categorical data, also called nominal attributes
Example: Color (red, yellow, blue, green), profession, etc.
Method 1: Simple matching
m: # of matches, p: total # of variables

    d(i, j) = (p − m) / p
Method 2: Use a large number of binary attributes
Creating a new binary attribute for each of the M nominal states
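A minimal sketch of simple matching on two hypothetical records (the attribute values are made up for illustration):

```python
def simple_matching_distance(i, j):
    """d(i, j) = (p - m) / p for nominal attributes."""
    p = len(i)
    m = sum(1 for x, y in zip(i, j) if x == y)
    return (p - m) / p

# Hypothetical records over (color, profession, marital-status)
a = ("red", "engineer", "single")
b = ("red", "teacher", "single")
print(round(simple_matching_distance(a, b), 2))  # 1 of 3 attributes differ -> 0.33
```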
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank (e.g., freshman, sophomore, junior, senior)
Can be treated like interval-scaled
Replace an ordinal variable value by its rank: r_if ∈ {1, ..., M_f}
Map the range of each variable onto [0, 1] by replacing the i-th object in the
f-th variable by

    z_if = (r_if − 1) / (M_f − 1)

Example: freshman: 0; sophomore: 1/3; junior: 2/3; senior: 1
Then distance: d(freshman, senior) = 1, d(junior, senior) = 1/3
Compute the dissimilarity using methods for interval-scaled variables
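The rank-to-[0, 1] mapping for the class-year example can be sketched as:

```python
# Map ordinal ranks r in {1..M} onto [0, 1]: z = (r - 1) / (M - 1)
ranks = {"freshman": 1, "sophomore": 2, "junior": 3, "senior": 4}
M = len(ranks)
z = {level: (r - 1) / (M - 1) for level, r in ranks.items()}

print(z)  # freshman 0.0, sophomore 1/3, junior 2/3, senior 1.0
print(round(abs(z["junior"] - z["senior"]), 2))  # d(junior, senior) = 1/3 -> 0.33
```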
Attributes of Mixed Type
A dataset may contain all attribute types
Nominal, symmetric binary, asymmetric binary, numeric, and ordinal
One may use a weighted formula to combine their effects:

    d(i, j) = [ Σ_{f=1..p} w_ij^(f) d_ij^(f) ] / [ Σ_{f=1..p} w_ij^(f) ]

If f is numeric: Use the normalized distance
If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf; d_ij^(f) = 1 otherwise
If f is ordinal:
  Compute ranks r_if and z_if = (r_if − 1) / (M_f − 1)
  Treat z_if as interval-scaled
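As a sketch of the weighted combination, consider two hypothetical objects with one numeric, one nominal, and one ordinal attribute (all names, ranges, and weights below are assumptions for illustration, with every weight set to 1):

```python
def numeric_d(x, y, lo, hi):
    """Normalized numeric distance, scaled into [0, 1] by the attribute range."""
    return abs(x - y) / (hi - lo)

def nominal_d(x, y):
    """0 if values match, 1 otherwise."""
    return 0.0 if x == y else 1.0

def ordinal_d(rx, ry, M):
    """Ranks rx, ry in {1..M} mapped to [0, 1], then compared as numeric."""
    z = lambda r: (r - 1) / (M - 1)
    return abs(z(rx) - z(ry))

# Hypothetical objects i and j: (income, color, rank out of 4)
i, j = (30_000, "red", 2), (50_000, "blue", 3)
ds = [numeric_d(i[0], j[0], 0, 100_000), nominal_d(i[1], j[1]), ordinal_d(i[2], j[2], 4)]
w = [1.0, 1.0, 1.0]

# Weighted mixed-type distance: sum(w_f * d_f) / sum(w_f)
d = sum(wf * df for wf, df in zip(w, ds)) / sum(w)
print(round(d, 2))  # (0.2 + 1.0 + 1/3) / 3 -> 0.51
```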
Session 5: Proximity Measure between Two Vectors: Cosine Similarity
Cosine Similarity of Two Vectors
A document can be represented by a bag of terms or a long vector, with each
attribute recording the frequency of a particular term (such as word, keyword, or
phrase) in the document
Other vector objects: Gene features in micro-arrays
Applications: Information retrieval, biological taxonomy, gene feature mapping, etc.
Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
    cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d
Example: Calculating Cosine Similarity
Calculating cosine similarity:

    cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d

Ex: Find the similarity between documents 1 and 2.

    d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
    d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

First, calculate the vector dot product:

    d1 · d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25

Then, calculate ||d1|| and ||d2||:

    ||d1|| = sqrt(5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²) = sqrt(42) ≈ 6.481
    ||d2|| = sqrt(3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²) = sqrt(17) ≈ 4.123

Finally, calculate the cosine similarity:

    cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
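The worked example can be checked with a short sketch (helper name mine):

```python
from math import sqrt

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for two term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94, matching the hand calculation
```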
Session 6: Correlation Measures between Two Variables: Covariance and Correlation Coefficient
Variance for Single Variable
The variance of a random variable X provides a measure of how much the value of X
deviates from the mean or expected value of X:
    var(X) = σ² = E[(X − µ)²] =
        Σ_x (x − µ)² f(x)       if X is discrete
        ∫ (x − µ)² f(x) dx      if X is continuous

where σ² is the variance of X, σ is called the standard deviation,
and µ = E[X] is the mean (expected value) of X

That is, variance is the expected value of the squared deviation from the mean

It can also be written as:

    σ² = var(X) = E[(X − µ)²] = E[X²] − µ² = E[X²] − [E(X)]²

Sample variance is the average squared deviation of the data values x_i from the
sample mean µ̂:

    σ̂² = (1/n) Σ_{i=1..n} (x_i − µ̂)²
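A minimal sketch of the sample variance formula on a small made-up sample (the values are mine, chosen to reuse the stock prices of a later slide):

```python
def sample_variance(xs):
    """(1/n) * sum((x_i - mean)^2): the 1/n estimator used on this slide."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

xs = [2, 3, 5, 4, 6]
print(sample_variance(xs))  # mean 4; squared deviations 4+1+1+0+4 = 10; 10/5 = 2.0
```

Note the 1/n convention here; many libraries default to the unbiased 1/(n−1) estimator instead.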
Covariance for Two Variables
Covariance between two variables X1 and X2
    σ12 = E[(X1 − µ1)(X2 − µ2)] = E[X1 X2] − µ1 µ2 = E[X1 X2] − E[X1] E[X2]

where µ1 = E[X1] is the respective mean or expected value of X1; similarly for µ2

Sample covariance between X1 and X2:

    σ̂12 = (1/n) Σ_{i=1..n} (x_i1 − µ̂1)(x_i2 − µ̂2)

Sample covariance is a generalization of the sample variance:

    σ̂11 = (1/n) Σ_{i=1..n} (x_i1 − µ̂1)(x_i1 − µ̂1) = (1/n) Σ_{i=1..n} (x_i1 − µ̂1)² = σ̂1²
Positive covariance: If σ12 > 0
Negative covariance: If σ12 < 0
Independence: If X1 and X2 are independent, σ12 = 0 but the reverse is not true
Some pairs of random variables may have a covariance 0 but are not independent
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Example: Calculation of Covariance
Suppose two stocks X1 and X2 have the following values in one week:
(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
Question: If the stocks are affected by the same industry trends, will their prices rise
or fall together?
Covariance formula:

    σ12 = E[(X1 − µ1)(X2 − µ2)] = E[X1 X2] − µ1 µ2 = E[X1 X2] − E[X1] E[X2]

Its computation can be simplified using σ12 = E[X1 X2] − E[X1] E[X2]:

    E(X1) = (2 + 3 + 5 + 4 + 6) / 5 = 20/5 = 4
    E(X2) = (5 + 8 + 10 + 11 + 14) / 5 = 48/5 = 9.6
    σ12 = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4

Thus, X1 and X2 rise together since σ12 > 0
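The stock example can be verified with a short sketch of the sample covariance (helper name mine, using the same 1/n convention as the slides):

```python
def sample_covariance(xs, ys):
    """(1/n) * sum((x - mx)(y - my)); equals E[XY] - E[X]E[Y] on the sample."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

x1 = [2, 3, 5, 4, 6]     # stock X1 prices over the week
x2 = [5, 8, 10, 11, 14]  # stock X2 prices over the week
print(round(sample_covariance(x1, x2), 2))  # 4.0: positive, so the stocks move together
```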
Correlation between Two Numerical Variables
Correlation between two variables X1 and X2 is the standardized covariance,
obtained by normalizing the covariance with the standard deviation of each variable:

    ρ12 = σ12 / (σ1 σ2) = σ12 / sqrt(σ1² σ2²)

Sample correlation for two attributes X1 and X2:

    ρ̂12 = σ̂12 / (σ̂1 σ̂2)
         = Σ_{i=1..n} (x_i1 − µ̂1)(x_i2 − µ̂2)
           / sqrt( Σ_{i=1..n} (x_i1 − µ̂1)² × Σ_{i=1..n} (x_i2 − µ̂2)² )

where n is the number of tuples, µ1 and µ2 are the respective means of X1 and X2,
and σ1 and σ2 are the respective standard deviations of X1 and X2

If ρ12 > 0: X1 and X2 are positively correlated (X1's values increase as X2's do)
  The higher the value, the stronger the correlation
If ρ12 = 0: independent (under the same assumptions as discussed for covariance)
If ρ12 < 0: negatively correlated
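Applying the sample correlation formula to the stock data from the covariance example (a sketch; the helper name is mine):

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Pearson correlation: sample covariance normalized by both standard deviations.
    The 1/n factors cancel, so raw deviation sums are used directly."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

x1 = [2, 3, 5, 4, 6]
x2 = [5, 8, 10, 11, 14]
print(round(sample_correlation(x1, x2), 2))  # 0.94: strong positive correlation
```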
Visualizing Changes of Correlation Coefficient
Correlation coefficient value range:
[–1, 1]
A set of scatter plots shows sets of
points and their correlation
coefficients changing from –1 to 1
Covariance Matrix
The variance and covariance information for the two variables X1 and X2 can be
summarized as a 2 × 2 covariance matrix:

    Σ = E[(X − µ)(X − µ)^T]

      = | E[(X1 − µ1)(X1 − µ1)]   E[(X1 − µ1)(X2 − µ2)] |
        | E[(X2 − µ2)(X1 − µ1)]   E[(X2 − µ2)(X2 − µ2)] |

      = | σ1²   σ12 |
        | σ21   σ2² |

In general, considering d numerical attributes X1, X2, ..., Xd, we have the
n × d data matrix

    D = | x11  x12  ...  x1d |
        | x21  x22  ...  x2d |
        | ...                |
        | xn1  xn2  ...  xnd |

and the d × d covariance matrix

    Σ = E[(X − µ)(X − µ)^T] = | σ1²  σ12  ...  σ1d |
                              | σ21  σ2²  ...  σ2d |
                              | ...                |
                              | σd1  σd2  ...  σd² |
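The 2 × 2 case can be computed directly from the two-stock data; the sketch below (helper name mine) builds the d × d matrix with the 1/n convention, so the diagonal holds the sample variances and the off-diagonal the sample covariance:

```python
def covariance_matrix(data):
    """d x d sample covariance matrix (1/n convention) for n rows of d attributes."""
    n, d = len(data), len(data[0])
    mu = [sum(row[f] for row in data) / n for f in range(d)]
    return [[sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in data) / n
             for b in range(d)] for a in range(d)]

data = [(2, 5), (3, 8), (5, 10), (4, 11), (6, 14)]  # the two-stock example
S = covariance_matrix(data)
for row in S:
    print([round(v, 2) for v in row])
# Symmetric: variances 2.0 and 9.04 on the diagonal, covariance 4.0 off-diagonal
```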
Session 7: Summary
Summary: Similarity Measures for Cluster Analysis
Basic Concept: Measuring Similarity between Objects
Distance on Numeric Data: Minkowski Distance
Proximity Measure for Symmetric vs. Asymmetric Binary Variables
Distance between Categorical Attributes, Ordinal Attributes, and
Mixed Types
Proximity Measure between Two Vectors: Cosine Similarity
Correlation Measures between Two Variables: Covariance and
Correlation Coefficient
Summary
Recommended Readings
L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster
Analysis, John Wiley & Sons, 1990
Mohammed J. Zaki and Wagner Meira, Jr. Data Mining and Analysis: Fundamental
Concepts and Algorithms. Cambridge University Press, 2014
Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques.
Morgan Kaufmann, 3rd ed., 2011
Charu Aggarwal and Chandran K. Reddy (eds.). Data Clustering: Algorithms and
Applications. CRC Press, 2014