Clustering
BIOINFORMATICS
Types of clustering
Clustering algorithms
Input:
Output:
Simply comparing a new gene sequence to known DNA sequences often does not reveal
the function of the gene: for 40% of sequenced genes, function cannot be ascertained by
comparison to the sequences of other known genes alone
Genes that perform functions similar or complementary to those of known (reference) genes will be
expressed (transcribed) at high levels together with the known genes
Genes that perform antagonistic functions (e.g. down-regulation) may be expressed at high levels at an
earlier or later time point when compared to known genes
Expression level is estimated by measuring the amount of mRNA for that particular gene
Wash cDNA over the microarray containing thousands of high density probes
that hybridize to complementary strands in the sample and immobilize them on
the surface.
INTENSITY TABLE, PAIRWISE DISTANCES
[Figure: intensity table and pairwise distance matrix for genes 1 to 10]
REARRANGED DISTANCES
(rows and columns ordered so that the groups 1,6,7 / 2,4,9,10 / 3,5,8 are adjacent)
        1     6     7     2     4     9    10     3     5     8
 1    0.0   2.3   5.1   8.1   7.7   6.1   7.0   9.2   8.9  10.9
 6    2.3   0.0   5.6   9.5   9.2   7.7   8.5  11.1  10.8  12.7
 7    5.1   5.6   0.0  10.1   9.5   8.3   9.3   8.1   8.0   9.3
 2    8.1   9.5  10.1   0.0   0.9   2.0   1.0  12.0  11.8  13.3
 4    7.7   9.2   9.5   0.9   0.0   1.6   1.1  11.2  10.9  12.5
 9    6.1   7.7   8.3   2.0   1.6   0.0   1.1  10.5  10.3  12.0
10    7.0   8.5   9.3   1.0   1.1   1.1   0.0  11.5  11.3  12.9
 3    9.2  11.1   8.1  12.0  11.2  10.5  11.5   0.0   0.5   1.7
 5    8.9  10.8   8.0  11.8  10.9  10.3  11.3   0.5   0.0   2.1
 8   10.9  12.7   9.3  13.3  12.5  12.0  12.9   1.7   2.1   0.0
CSE/BIMM/BENG 181 MAY 24, 2011 SERGEI L KOSAKOVSKY POND [[email protected]]
CLUSTERING PRINCIPLES
Homogeneity: elements of the same cluster are maximally close to
each other
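As a rough illustration of homogeneity, the sketch below takes the rearranged distance table from these slides and reports the largest distance inside any cluster next to the smallest distance between clusters. The grouping {1,6,7}, {2,4,9,10}, {3,5,8} is read off the table's row order, the ninth row is assumed to be gene 5 (matching the column order), and all variable names are illustrative.

```python
from itertools import combinations

# Gene order and distances copied from the rearranged distance table.
# Assumption: the ninth row belongs to gene 5, matching the column order.
order = [1, 6, 7, 2, 4, 9, 10, 3, 5, 8]
rows = [
    [0.0, 2.3, 5.1, 8.1, 7.7, 6.1, 7.0, 9.2, 8.9, 10.9],
    [2.3, 0.0, 5.6, 9.5, 9.2, 7.7, 8.5, 11.1, 10.8, 12.7],
    [5.1, 5.6, 0.0, 10.1, 9.5, 8.3, 9.3, 8.1, 8.0, 9.3],
    [8.1, 9.5, 10.1, 0.0, 0.9, 2.0, 1.0, 12.0, 11.8, 13.3],
    [7.7, 9.2, 9.5, 0.9, 0.0, 1.6, 1.1, 11.2, 10.9, 12.5],
    [6.1, 7.7, 8.3, 2.0, 1.6, 0.0, 1.1, 10.5, 10.3, 12.0],
    [7.0, 8.5, 9.3, 1.0, 1.1, 1.1, 0.0, 11.5, 11.3, 12.9],
    [9.2, 11.1, 8.1, 12.0, 11.2, 10.5, 11.5, 0.0, 0.5, 1.7],
    [8.9, 10.8, 8.0, 11.8, 10.9, 10.3, 11.3, 0.5, 0.0, 2.1],
    [10.9, 12.7, 9.3, 13.3, 12.5, 12.0, 12.9, 1.7, 2.1, 0.0],
]
clusters = [{1, 6, 7}, {2, 4, 9, 10}, {3, 5, 8}]

index = {gene: i for i, gene in enumerate(order)}


def d(a, b):
    """Distance between genes a and b, looked up in the table."""
    return rows[index[a]][index[b]]


def cluster_of(gene):
    """Index of the cluster that contains the gene."""
    return next(i for i, c in enumerate(clusters) if gene in c)


pairs = list(combinations(order, 2))
within = [d(a, b) for a, b in pairs if cluster_of(a) == cluster_of(b)]
between = [d(a, b) for a, b in pairs if cluster_of(a) != cluster_of(b)]

max_within = max(within)    # worst-case distance inside a cluster
min_between = min(between)  # closest pair across two clusters
print(max_within, min_between)
```

For this grouping every within-cluster distance is smaller than every between-cluster distance, which is the pattern a homogeneous clustering should show.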
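RELATIVE IMPORTANCE
The weights \alpha and \beta set the relative importance of within-cluster and between-cluster distances in the objective:
\[ \min_{\text{clustering}} \left( \alpha \sum_{x,y \in \text{the same cluster}} d(x, y) - \beta \sum_{x,y \in \text{different clusters}} d(x, y) \right) \]
BECAUSE
\[ \sum_{x,y \in \text{the same cluster}} d(x, y) + \sum_{x,y \in \text{different clusters}} d(x, y) = \sum_{x,y} d(x, y) = D = \text{const} \]
WE CAN SIMPLIFY
\[ \min_{\text{clustering}} \left( (\alpha + \beta) \sum_{x,y \in \text{the same cluster}} d(x, y) - \beta D \right) \]
so for \alpha + \beta > 0 it is enough to minimize the within-cluster sum.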
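The constant-sum identity above can be checked numerically. The sketch below uses a small hypothetical 1-D data set and an illustrative helper, `split_sums`, to split the pairwise distances of any labeling into a within-cluster and a between-cluster part; their sum is the same for every labeling.

```python
from itertools import combinations


def split_sums(points, labels, dist):
    """Return (within-cluster sum, between-cluster sum) over all unordered pairs."""
    same = 0.0
    diff = 0.0
    for i, j in combinations(range(len(points)), 2):
        dij = dist(points[i], points[j])
        if labels[i] == labels[j]:
            same += dij
        else:
            diff += dij
    return same, diff


def dist(a, b):
    """Absolute difference, used here as the distance between 1-D points."""
    return abs(a - b)


points = [0.0, 1.0, 5.0, 6.0]  # hypothetical data
total = sum(dist(a, b) for a, b in combinations(points, 2))  # D

results = {}
for labels in ([0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]):
    same, diff = split_sums(points, labels, dist)
    results[tuple(labels)] = (same, diff)
    print(labels, same, diff, same + diff)  # last column always equals D
```

Because the total is fixed, a labeling with a small within-cluster sum automatically has a large between-cluster sum.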
Divisive: Start with one cluster and iteratively divide it into smaller
clusters
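Input: A set V of N points (v1, v2, ..., vN), the desired number of
clusters K, and a distance measure d(v, w) between any two points
Squared error distortion of V with respect to a set of K centers X = (x1, ..., xK):
\[ D(V, X) = \frac{1}{N} \sum_{i=1}^{N} \min_{k} d^2(v_i, x_k) \]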
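A minimal sketch of the distortion formula, assuming d is the Euclidean distance and points are given as coordinate tuples; the function names and the data are illustrative.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def distortion(points, centers):
    """Mean over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(v, x) for x in centers) for v in points) / len(points)


points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]  # hypothetical data
two_centers = distortion(points, [(0.0, 1.0), (4.0, 1.0)])
one_center = distortion(points, [(2.0, 1.0)])
print(two_centers, one_center)
```

With two centers each point is at squared distance 1 from its nearest center, and with the single center at squared distance 5, so the first value is smaller.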
Lloyd’s algorithm.
A. Compute the distance from each data point to the current cluster center Ci
(1 ≤ i ≤ K) and assign the point to the nearest cluster
B. After the assignment of all data points, compute new centers for each
cluster by taking the centroid of all the points in that cluster
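Steps A and B can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the initial centers are passed in explicitly, distances are Euclidean, iteration stops when the assignment no longer changes, and a cluster that receives no points keeps its previous center.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, initial_centers, max_iter=100):
    """Alternate assignment (step A) and centroid update (step B)."""
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Step A: assign every point to its nearest current center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        # Step B: move each center to the centroid of its assigned points.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 9.0), (9.0, 10.0)]
centers, assignment = lloyd(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
print(assignment)
print(centers)
```

The result depends on the initial centers; a poor choice can leave the procedure in a local optimum, so in practice it is often run from several starting points.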
[Figure: successive iterations of Lloyd's algorithm on a two-dimensional point set (x and y axes from 0.00 to 2.00). Points are labeled 1 or 2 by cluster, and Center 1 and Center 2 move between panels. A final row of three panels includes a third cluster and Center 3.]
The problem is that K = N always achieves a value of 0 (each point is its own
cluster), so minimizing this value alone would always favor increasing K.
[Plot: SED versus K for K = 1 to 20; the SED axis runs from 0 to 11.25]
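The effect can be reproduced on a toy example. The sketch below assumes a tiny hypothetical 1-D data set and finds, by exhaustive search over all labelings, the smallest squared error distortion achievable with at most K clusters; this brute-force search is only feasible for a handful of points and stands in for a real clustering run.

```python
from itertools import product


def centroid_cost(members):
    """Sum of squared distances from 1-D points to their mean."""
    mean = sum(members) / len(members)
    return sum((p - mean) ** 2 for p in members)


def best_sed(points, k):
    """Smallest distortion over all labelings that use at most k clusters."""
    best = float("inf")
    for labels in product(range(k), repeat=len(points)):
        cost = 0.0
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                cost += centroid_cost(members)
        best = min(best, cost / len(points))
    return best


points = [1.0, 1.5, 2.0, 8.0, 9.0]  # hypothetical data
sed_by_k = [best_sed(points, k) for k in range(1, len(points) + 1)]
for k, sed in enumerate(sed_by_k, start=1):
    print(k, round(sed, 4))
```

The values never increase with K and reach 0 at K = N, so a separate criterion (for example, the K after which the curve flattens) is needed to choose K.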
CONSERVATIVE K-MEANS ALGORITHM
The smaller the clustering cost of a partition of the data points, the better
the clustering
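The slide states only the cost criterion, so the following is a sketch of one common formulation and its details are assumptions: starting from a given partition, move a single point at a time to another cluster, always choosing the move that lowers the clustering cost the most, and stop when no move helps. Here the cost is taken to be the sum of squared distances to cluster centroids, and a move that would empty a cluster is not allowed.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def clustering_cost(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    cost = 0.0
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            centroid = tuple(sum(xs) / len(members) for xs in zip(*members))
            cost += sum(sq_dist(p, centroid) for p in members)
    return cost


def conservative_kmeans(points, labels, k):
    """Greedily apply the single-point move that reduces the cost the most."""
    labels = list(labels)
    current = clustering_cost(points, labels, k)
    while True:
        best = None  # (cost after move, point index, target cluster)
        for i in range(len(points)):
            if labels.count(labels[i]) == 1:
                continue  # moving this point would empty its cluster
            for c in range(k):
                if c == labels[i]:
                    continue
                trial = labels[:i] + [c] + labels[i + 1:]
                cost = clustering_cost(points, trial, k)
                if cost < current - 1e-12 and (best is None or cost < best[0]):
                    best = (cost, i, c)
        if best is None:
            return labels, current  # no single move improves the cost
        current, i, c = best
        labels[i] = c


points = [(0.0,), (1.0,), (10.0,), (11.0,)]  # hypothetical 1-D data
final_labels, final_cost = conservative_kmeans(points, [0, 1, 0, 1], k=2)
print(final_labels, final_cost)
```

Each pass re-evaluates the full cost for every candidate move, which is slow but keeps the sketch short; like Lloyd's algorithm, it can stop at a local optimum.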
HTTP://WWW.SCIENCEMAG.ORG/CGI/REPRINT/310/5750/979.PDF
We then replace the two just joined sequences with their ancestor
We need to compute the distances from the new ancestor to the remaining sequences
            Human  Chimpanzee  Gorilla   Orangutan  Gibbon
Human       -      0.0882682   0.102793  0.159598   0.179688
Chimpanzee  -      -           0.106145  0.170759   0.1875
Gorilla     -      -           -         0.166295   0.1875
Orangutan   -      -           -         -          0.188616
Gibbon      -      -           -         -          -

Orangutan   -      -           -         0.188616
Gibbon      -      -           -         -

                          Orangutan  Gibbon
Human-Chimpanzee-Gorilla  0.165551   0.184896
Orangutan                 -          0.188616
Gibbon                    -          -
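The values in the last table are consistent with taking the distance from a merged group to another sequence as the average of the original pairwise distances (average linkage). The sketch below assumes that rule, repeatedly joins the closest pair of groups, and stops at three groups; the function and variable names are illustrative.

```python
from itertools import combinations

names = ["Human", "Chimpanzee", "Gorilla", "Orangutan", "Gibbon"]
# Upper triangle of the distance table from the slides.
pairwise = {
    frozenset(("Human", "Chimpanzee")): 0.0882682,
    frozenset(("Human", "Gorilla")): 0.102793,
    frozenset(("Human", "Orangutan")): 0.159598,
    frozenset(("Human", "Gibbon")): 0.179688,
    frozenset(("Chimpanzee", "Gorilla")): 0.106145,
    frozenset(("Chimpanzee", "Orangutan")): 0.170759,
    frozenset(("Chimpanzee", "Gibbon")): 0.1875,
    frozenset(("Gorilla", "Orangutan")): 0.166295,
    frozenset(("Gorilla", "Gibbon")): 0.1875,
    frozenset(("Orangutan", "Gibbon")): 0.188616,
}


def cluster_distance(a, b):
    """Average of the original distances between members of groups a and b."""
    total = sum(pairwise[frozenset((x, y))] for x in a for y in b)
    return total / (len(a) * len(b))


clusters = [(name,) for name in names]
merges = []
while len(clusters) > 3:
    # Join the two closest groups and replace them with their union.
    a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
    merges.append((a, b))
    clusters = [c for c in clusters if c not in (a, b)] + [a + b]

merged = clusters[-1]  # the group containing Human, Chimpanzee and Gorilla
to_orangutan = round(cluster_distance(merged, ("Orangutan",)), 6)
to_gibbon = round(cluster_distance(merged, ("Gibbon",)), 6)
print(sorted(merged), to_orangutan, to_gibbon)
```

Human and Chimpanzee are joined first, Gorilla is added next, and the two averaged distances agree with the 0.165551 and 0.184896 entries in the table.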
HTTP://UPLOAD.WIKIMEDIA.ORG/WIKIPEDIA/COMMONS/4/48/HEATMAP.PNG
BUILD A MINIMUM SPANNING TREE AND DELETE THE LONGEST EDGES TO CREATE PARTITIONS
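A minimal sketch of this idea, assuming the tree is built with Kruskal's algorithm over all pairwise distances and that deleting the K - 1 longest tree edges yields K clusters; the function name `mst_clusters` and the data are illustrative.

```python
from itertools import combinations


def mst_clusters(points, k, dist):
    """Cluster by cutting the k - 1 longest edges of a minimum spanning tree."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Kruskal: scan edges by increasing length, keep those joining two components.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # The tree edges are already in increasing order; drop the k - 1 longest.
    kept = mst[: n - k]

    # The connected components of the remaining edges are the clusters.
    parent = list(range(n))
    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


points = [0.0, 1.0, 2.0, 10.0, 11.0, 20.0]  # hypothetical 1-D data
partitions = mst_clusters(points, k=3, dist=lambda a, b: abs(a - b))
print(partitions)  # lists of point indices, one list per cluster
```

On this data the two longest tree edges are the gaps between 2 and 10 and between 11 and 20, so cutting them separates the three visible groups.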