Cluster analysis
Based on:
H.C. Romesburg: Cluster Analysis for Researchers, Lifetime Learning Publications, Belmont, CA, 1984
P.H.A. Sneath and R.R. Sokal: Numerical Taxonomy, Freeman, San Francisco, CA, 1973
Jens C. Frisvad BioCentrum-DTU
Biological data analysis and chemometrics
Two primary methods
Cluster analysis (no projection)
Hierarchical clustering
Divisive clustering
Fuzzy clustering
Ordination (projection)
Principal component analysis
Correspondence analysis
Multidimensional scaling
Advantages of cluster analysis
Good for a quick overview of data
Good if there are many groups in the data
Good if unusual similarity measures are needed
Can be added to ordination plots (often as a minimum spanning tree, however)
Good for the nearest neighbours; ordination is better for the deeper relationships
Different clustering methods
NCLAS: Agglomerative clustering by distance optimization
HMCL: Agglomerative clustering by homogeneity optimization
INFCL: Agglomerative clustering by information theory criteria
MINGFC: Agglomerative clustering by global optimization
ASSIN: Divisive monothetic clustering
PARREL: Partitioning by global optimization
FCM: Fuzzy c-means clustering
MINSPAN: Minimum spanning tree
REBLOCK: Block clustering (k-means clustering)
SAHN clustering
Sequential agglomerative hierarchic nonoverlapping clustering
Single linkage
Nearest neighbor, minimum method
Close to the minimum spanning tree
Contracting space; chaining possible
Parameters (Lance-Williams): $\alpha_J = 0.5$, $\alpha_K = 0.5$, $\beta = 0$, $\gamma = -0.5$
$U_{J,K} = \min U_{jk}$
$U_{(J,K),L} = \alpha_J U_{J,L} + \alpha_K U_{K,L} + \beta U_{J,K} + \gamma\,\lvert U_{J,L} - U_{K,L} \rvert$
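For illustration, a minimal Python sketch of the combinatorial update rule above with the single-linkage parameters; the function name and test values are illustrative, not taken from the references.

```python
def lance_williams(u_JL, u_KL, u_JK, alpha_J, alpha_K, beta, gamma):
    """Distance from the newly merged cluster (J,K) to an existing cluster L."""
    return (alpha_J * u_JL + alpha_K * u_KL
            + beta * u_JK + gamma * abs(u_JL - u_KL))

# With the single-linkage parameters the update reduces to min(U_JL, U_KL):
print(lance_williams(2.0, 5.0, 3.0, alpha_J=0.5, alpha_K=0.5, beta=0.0, gamma=-0.5))  # -> 2.0
```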
Complete linkage
Furthest neighbor, maximum method
Dilating space
Parameters: $\alpha_J = 0.5$, $\alpha_K = 0.5$, $\beta = 0$, $\gamma = 0.5$
$U_{J,K} = \max U_{jk}$
Average linkage
Arithmetic average
Unweighted: UPGMA (group average)
Weighted: WPGMA
Centroid
Unweighted centroid (Centroid)
Weighted centroid (Median)
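As a practical reference, a small sketch mapping the SAHN variants above to scipy's hierarchical clustering method names; the toy data are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(1).random((6, 3))   # toy data: 6 objects x 3 features
d = pdist(X, metric="euclidean")              # condensed distance matrix

sahn_methods = {
    "single linkage":      "single",    # nearest neighbor
    "complete linkage":    "complete",  # furthest neighbor
    "UPGMA":               "average",   # unweighted arithmetic average
    "WPGMA":               "weighted",  # weighted arithmetic average
    "unweighted centroid": "centroid",  # centroid (assumes Euclidean distances)
    "weighted centroid":   "median",    # median (assumes Euclidean distances)
}
trees = {name: linkage(d, method=m) for name, m in sahn_methods.items()}
```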
From Sneath and Sokal, 1973, Numerical taxonomy
Ordinary clustering
1. Obtain the data matrix
2. Transform or standardize the data matrix
3. Select the best resemblance or distance measure
4. Compute the resemblance matrix
5. Execute the clustering method (often UPGMA = average linkage)
6. Rearrange the data and resemblance matrices
7. Compute the cophenetic correlation coefficient
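As a rough translation of these seven steps into code, a numpy/scipy sketch; the data matrix here is a random placeholder and autoscaling is only one of several possible standardizations.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, cophenet, dendrogram

X = np.random.default_rng(0).random((8, 4))        # 1. data matrix (8 objects x 4 features)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)          # 2. autoscaling (standardization)
d = pdist(Xs, metric="euclidean")                  # 3.-4. resemblance (distance) matrix
tree = linkage(d, method="average")                # 5. UPGMA = average linkage
order = dendrogram(tree, no_plot=True)["leaves"]   # 6. object order for rearranging matrices
r, _ = cophenet(tree, d)                           # 7. cophenetic correlation coefficient
print(squareform(d)[np.ix_(order, order)].round(2))
print("cophenetic correlation:", round(float(r), 3))
```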
Binary similarity coefficients
(between two objects i and j)
                object j
                 1      0
object i   1     a      b
           0     c      d
Matches and mismatches
m = a + d (number of matches)
u = b + c (number of mismatches)
n = m + u = a + b + c + d (total number of features compared)
Similarity (often 0 to 1)
Dissimilarity (distance) (often 0 to 1)
Correlation (-1 to 1)
Simple matching coefficient
SM = (a + d) / (a + b + c + d) = m / n
Euclidean distance for binary data: D = 1 - SM = (b + c) / (a + b + c + d) = u / n
Avoiding zero-zero comparisons
Jaccard = J = a / (a + b + c)
Sørensen or Dice: DICE = 2a / (2a + b + c)
Correlation coefficients
Yule: $(ad - bc) / (ad + bc)$
PHI: $(ad - bc) \Big/ \sqrt{(a+b)(c+d)(a+c)(b+d)}$
Other binary coefficients
Hamann = H = (a + d - b - c) / (a + b + c + d)
Rogers and Tanimoto = RT = (a + d) / (a + 2b + 2c + d)
Russell and Rao = RR = a / (a + b + c + d)
Kulczynski 1 = K1 = a / (b + c)
UN1 = (2a + 2d) / (2a + b + c + 2d)
UN2 = a / (a + 2b + 2c)
UN3 = (a + d) / (b + c)
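A small sketch, assuming two 0/1 feature vectors, showing how the 2x2 counts and a few of the coefficients above can be computed; function and vector names are illustrative, not from NTSYS.

```python
import numpy as np

def binary_counts(x, y):
    """Counts a, b, c, d of the 2x2 table for two 0/1 vectors."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    a = np.sum(x & y)        # 1-1 matches
    b = np.sum(x & ~y)       # 1-0 mismatches
    c = np.sum(~x & y)       # 0-1 mismatches
    d = np.sum(~x & ~y)      # 0-0 matches
    return a, b, c, d

x = [1, 1, 0, 0, 1, 0, 1]
y = [1, 0, 0, 1, 1, 0, 0]
a, b, c, d = binary_counts(x, y)
n = a + b + c + d
SM   = (a + d) / n                  # simple matching
J    = a / (a + b + c)              # Jaccard
DICE = 2 * a / (2 * a + b + c)      # Sørensen / Dice
PHI  = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(SM, J, DICE, round(float(PHI), 3))
```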
Distances for quantitative (interval) data
Euclidean and taxonomic distance
EUCLID: $E_{ij} = \sqrt{\sum_k (x_{ki} - x_{kj})^2}$
DIST (average taxonomic distance): $d_{ij} = \sqrt{\frac{1}{n} \sum_k (x_{ki} - x_{kj})^2}$
Bray-Curtis and Canberra distance
BRAYCURT: $d_{ij} = \sum_k |x_{ki} - x_{kj}| \Big/ \sum_k (x_{ki} + x_{kj})$
CANBERRA: $d_{ij} = \frac{1}{n} \sum_k \frac{|x_{ki} - x_{kj}|}{x_{ki} + x_{kj}}$
Average Manhattan distance (city block)
MANHAT: $M_{ij} = \frac{1}{n} \sum_k |x_{ki} - x_{kj}|$
Chi-squared distance
CHISQ: $d_{ij} = \sqrt{\sum_k \frac{1}{x_{k\cdot}} \left( \frac{x_{ki}}{x_{\cdot i}} - \frac{x_{kj}}{x_{\cdot j}} \right)^2}$
(where $x_{k\cdot}$ is the total for feature $k$ and $x_{\cdot i}$, $x_{\cdot j}$ are the totals for objects $i$ and $j$)
Cosine coefficient
COSINE: $c_{ij} = \sum_k x_{ki} x_{kj} \Big/ \sqrt{\sum_k x_{ki}^2 \, \sum_k x_{kj}^2}$
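A sketch of how several of these measures can be computed for two feature vectors with scipy (the chi-squared distance is not included); the divisions by n give the averaged forms used above, and the example values are arbitrary.

```python
import numpy as np
from scipy.spatial import distance

xi = np.array([10.0, 5.0, 2.0])
xj = np.array([30.0, 10.0, 1.0])
n = len(xi)

euclid   = distance.euclidean(xi, xj)     # EUCLID
dist_avg = euclid / np.sqrt(n)            # DIST (average taxonomic distance)
manhat   = distance.cityblock(xi, xj) / n # MANHAT (average Manhattan)
braycurt = distance.braycurtis(xi, xj)    # BRAYCURT
canberra = distance.canberra(xi, xj) / n  # averaged CANBERRA
cosine_c = 1.0 - distance.cosine(xi, xj)  # COSINE (scipy returns 1 - cosine similarity)
print(euclid, dist_avg, manhat, braycurt, canberra, cosine_c)
```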
Step 1. Obtain the data matrix
                  Object
              1    2    3    4    5
Feature  1   10   20   30   30    5
         2    5   20   10   15   10
Objects and features
The five objects are plots of farm land. The features are:
1. Water-holding capacity (%)
2. Weight % soil organic matter
Objective: find the two most similar plots
Resemblance matrix (Euclidean distances between the five plots)

         1      2      3      4      5
   1     -
   2   18.0     -
   3   20.6   14.1     -
   4   22.4   11.2   5.00     -
   5   7.07   18.0   25.0   25.5     -
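A sketch reproducing the resemblance matrix above from the Step 1 data matrix with scipy; note that the plots are numbered 1-5 in the slides but indexed 0-4 in the code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[10, 5], [20, 20], [30, 10], [30, 15], [5, 10]], float)  # plots 1-5 x 2 features
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
# The smallest off-diagonal distance, 5.0 between plots 3 and 4, identifies the
# two most similar plots and the first pair to be merged.
```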
Revised resemblance matrix

          1      2      5    (34)
   1      -
   2    18.0     -
   5    7.07   18.0     -
 (34)   21.5   12.7   25.3     -
Revised resemblance matrix

          2    (34)   (15)
   2      -
 (34)   12.7     -
 (15)   18.0   23.4     -
Revised resemblance matrix

         (15)  (234)
  (15)     -
 (234)   21.6     -
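A sketch of the same UPGMA merge sequence with scipy; the merge distances in the linkage matrix should match the revised resemblance matrices above (5.0, 7.07, 12.7 and 21.6, up to rounding).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.array([[10, 5], [20, 20], [30, 10], [30, 15], [5, 10]], float)
Z = linkage(pdist(X), method="average")   # UPGMA
print(np.round(Z, 2))  # each row: cluster i, cluster j, merge distance, cluster size
```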
Cophenetic correlation coefficient (Pearson product-moment correlation coefficient)
A comparison of the similarities according to the similarity matrix and the similarities according to the dendrogram
$r_{X,Y} = \dfrac{\sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[\sum x^2 - \frac{1}{n}\left(\sum x\right)^2\right]\left[\sum y^2 - \frac{1}{n}\left(\sum y\right)^2\right]}}$
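A sketch computing the cophenetic correlation for the farm-plot example with scipy, i.e. the Pearson correlation between the original distances and the cophenetic distances read off the dendrogram.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

X = np.array([[10, 5], [20, 20], [30, 10], [30, 15], [5, 10]], float)
d = pdist(X)                        # distances from the resemblance matrix (condensed)
Z = linkage(d, method="average")    # UPGMA dendrogram
r, coph_d = cophenet(Z, d)          # r = correlation of d with the cophenetic distances
print(round(float(r), 3))
```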
NTSYS
Import matrix
Transpose the matrix if objects are rows (they are supposed to be columns in NTSYS) (transp in transformation / general)
Consider log1 or autoscaling (standardization)
Select similarity or distance measure (similarity)
Produce similarity matrix
NTSYS (continued)
Select clustering procedure (often UPGMA) (clustering)
Calculate cophenetic matrix (clustering)
Compare the similarity matrix with the cophenetic matrix (made from the dendrogram) and write down the cophenetic correlation (graphics, matrix comparison)
Write dendrogram (graphics, treeplot)