Hierarchical clustering
implementation
Introduction to Computer Science
Robert Sedgewick and Kevin Wayne
http://www.cs.Princeton.EDU/IntroCS
In this method the
distance between two clusters is determined by the
distance of the two closest objects (nearest
neighbors) in the different clusters.
Single linkage (nearest neighbor):
In this method, the
distances between clusters are determined by the
greatest distance between any two objects in the
different clusters (i.e., by the "furthest neighbors").
Complete linkage (furthest neighbor):
Group average linkage: In
this method, the distance
between two clusters is calculated as the average
distance between all pairs of objects in the two
different clusters.
Single-Link Hierarchical Clustering
Iteration.
Closest pair of clusters (i, j) is one with the smallest dist value.
Replace row i by min of row i and row j.
Infinity out row j and column j.
Update dmin[i] and change dmin[i'] to i if previously dmin[i'] = j.
Closest
pair
0
1
2
3
4
dmin
1
3
4
1
3
dist
5.5
2.14
5.6
2.14
5.5
0
1
2
3
4
dmin
1
0
4
1
dist
5.5
5.5
5.6
5.5
gene0
1
2
3
4
0
5.5
7.3
8.9
5.8
1
5.5
6.1
2.14
5.6
2
7.3
6.1
7.8
5.6
3
8.9
2.14
7.8
5.5
4
5.8
5.6
5.6
5.5
-
0
node1
2
3
4
0
5.5
7.3
5.8
1
5.5
6.1
5.5
2
7.3
6.1
5.6
3
-
4
5.8
5.5
5.6
-
Gene1 closest
to gene3,
dist=2.14
i=1, j=3
New min dist
Single-Link Clustering: Java Implementation
Single-link clustering.
Read in the data.
public static void main(String[] args) {
int M = StdIn.readInt();
int N = StdIn.readInt();
// read in N vectors of dimension M
Vector[] vectors = new Vector[N];
String[] names
= new String[N];
for (int i = 0; i < N; i++) {
names[i] = StdIn.readString();
double[] d = new double[M];
for (int j = 0; j < M; j++)
d[j] = StdIn.readDouble();
vectors[i] = new Vector(d);
}
Single-Link Clustering: Java Implementation
Single-link clustering.
Read in the data.
Precompute d[i][j] = distance between cluster i and j.
For each cluster i, maintain index dmin[i] of closest cluster.
double INFINITY = Double.POSITIVE_INFINITY;
double[][] d = new double[N][N];
int[] dmin = new int[N];
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
if (i == j) d[i][j] = INFINITY;
else
d[i][j] = vectors[i].distanceTo(vectors[j]);
if (d[i][j] < d[i][dmin[i]]) dmin[i] = j;
}
}
Single-Link Clustering: Main Loop
for (int s = 0; s < N-1; s++) {
// find closest pair of clusters (i1, i2)
int i1 = 0;
for (int i = 0; i < N; i++)
if (d[i][dmin[i]] < d[i1][dmin[i1]]) i1 = i;
int i2 = dmin[i1];
// overwrite row i1 with minimum of entries in row i1 and i2
for (int j = 0; j < N; j++)
if (d[i2][j] < d[i1][j]) d[i1][j] = d[j][i1] = d[i2][j];
d[i1][i1] = INFINITY;
// infinity-out old row i2 and column i2
for (int i = 0; i < N; i++)
d[i2][i] = d[i][i2] = INFINITY;
// update dmin and replace ones that previous pointed to
// i2 to point to i1
for (int j = 0; j < N; j++) {
if (dmin[j] == i2) dmin[j] = i1;
if (d[i1][j] < d[i1][dmin[i1]]) dmin[i1] = j;
}
}
6
Store Centroids in Each Internal Node
Cluster analysis.
Centroids distance / similarity.
Easy modification to TreeNode data
structure.
Store Vector in each node.
leaf nodes: directly corresponds to a gene
internal nodes: centroid = average of all leaf
nodes beneath it
Maintain count field in each TreeNode, which
equals the number of leaf nodes beneath it.
When setting z to be parent of x and y,
set z.count = x.count + y.count
set z.vector = p + (1-)q, where p = x.vector and
q = y.vector, and = x.count / z.count
Analysis and Micro-Optimizations
Running time. Proportional to MN2 (N genes, M arrays)
Memory. Proportional to N2.
Ex. [M = 50, N = 6,000] Takes 280MB, 48 sec on
fast PC.
input size proportional to MN
Some optimizations.
Use float instead of double
Store only lower triangular part of distance matrix
Use squares of distances instead of distances.
use float to decrease memory usage by a factor of 2x, but
How
much do you think would this help?
probably doesn't make it faster
storing only lower triangular part decreases memory usage by a
factor of 2x and makes things somewhat faster
only about 10% of time is spent precomputing distance matrix, so
avoiding square roots will help, but not that much
8
Sequence!
Some slides from Mona Singh, Serafim Batzoglou, Olga Troyanskaya
Introduction to Computer Science
Robert Sedgewick and Kevin Wayne
http://www.cs.Princeton.EDU/IntroCS
Bio-Sequences
Complete genomes of >1000 organisms
www.ncbi.nlm.nih.gov/Genomes/index.html
> 100 billion bases in Genbank (ncbi)
>509,000 proteins in SWISSPROT (hand
curated); >9,300,000 proteins in TREMBL
(computer annotated).
us.expasy.org/sprot
Next Gen Sequencers
>20 billion bases per run!
Illuminas Spring 2009
charge for sequencing your
genome:
$48,000 30 fold
coverage
Illumina/Solexa High Throughput
Sequencing Machine
Biomolecules as Strings
Macromolecules are the chemical
building blocks of cells
Proteins
20 amino acids
Nucleic acids
4 nucleotides {A, C, G, ,T}
Role of Evolution
Molecular structures and mechanisms are
reused and changed during evolution
Often mechanisms that are conserved can be
detected based on sequence similarity
Powerful tool for annotation
Ex: Protein Sequences
Horse vs Human Myoglobin (Global alignment of sequences)
GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASED
GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED
LKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHP
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHP
GDFGADAQGAMTKALELFRNDIAAKYKELGFQG
GDFGADAQGAMNKALELFRKDMASNYKELGFQG
Same protein in two different organisms, can ID based on sequence
similarity 88% identical
Myoglobin - intracellular storage of oxygen
Global alignment: Issues with transferring
annotations
Horse Myoglobin vs Human Hemoglobin Alpha
MGLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASEDL
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG---KKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHPG
--HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
DFGADAQGAMTKALELFRNDIAAKYKELGFQG
EFTPAVHASLDKFLASVSTVLTSKYR------
~25% identical; other similar amino acids
Myoglobin - intracellular storage of oxygen
Hemoglobin - transports oxygen
Basic Tool to Detect Sequence
Similarity: Alignments
Given:
a pair (or more) of sequences (DNA or
protein)
a method for scoring the similarity of a
pair of characters (=bases or amino acids)
Determine: correspondences between
characters in the sequences such that the
similarity score is maximized
Pairwise global aligment
Given two sequences, a scoring scheme with a
gap function, line up the sequences (with
insertion of gaps) to maximize the score
E.g., match = 1
mismatch = -1
gap = -2
E.g., say your two sequences are
AACAGTTACC, TAAGGTCA
AACAGTTACC
TA-AGGT-CA
Score = ?
Nave way to find optimal alignments
1.
Enumerate all possible alignments
2.
Score all possible alignments
3.
Take best scoring alignment
4.
5.
Problem: There are too many possible
alignments between 2 sequences !!
Solution: dynamic programming
RECALL: homework assignment from last term!
Pairwise Alignment
Needleman & Wunsch, Journal of Molecular
Biology, 1970
Dynamic programming (DP): general technique
to solve an instance of a problem by taking
advantage of computed solutions for
smaller subparts of the problem
Here, determine alignment of two sequences
by determining alignment of all suffixes of
the sequences
(suffixes are subparts well save solutions for )
Dynamic Programming Idea
Say aligning AAAC with AGC
Consider what happens in the first column
Three possible options; each corresponds to
different alignment of first column, choose each
one and add this to best alignment of suffixes
A AAC
- AAAC
A GC
A GC
A AAC
- AGC
Score of
aligning
these characters
Consider best
Alignment of
these suffixes
Dynamic Programming Idea
- AAAC
A GC
A AAC
A GC
A AAC
- AGC
If we knew answers to
these three subproblems,
then wed know the best
alignment score between
AAAC and AGC
Consider minimum of
these
three cases
Dynamic Programming Idea
Given an m-character sequence s, and an ncharacter sequence t construct an (m+1) x
(n+1) matrix sim where well store answers
to subproblems
sim[ i, j ] = score of the best alignment
of the suffix im of s with the suffix jn
of t.
Aligning AAAC with AGC
t
C
Best alignment
score of AC
with GC
A
Best alignment
score of AAAG
with C
A
A
C
Dynamic Programming Rule
(gap cost)
sim[i, j]
+g
(gap
cost)
sim[i+1, j]
+g
sim[i, j+1]
+ sc(s[i],t[j])
(similarity score
between
s[i] and t[j])
sim[i+1, j+1]
How long does DP take?
Query sequence of length n
Target sequence of length m
Dynamic programming matrix
26
How long does DP take?
Query sequence of length n
There are nm
entries in the
matrix.
Target sequence of length m
Each entry requires
a constant number c
of operations.
Dynamic programming matrix
The total number of required operations is approximate nmc.
We say that the algorithm is order nm or O(nm).
27
Local Alignment
Just described global alignment, where we
are looking for best match between
sequences from one end to the other.
Often (and more commonly), we will want a
local alignment, the best match between
subsequences of s and t.
Local Alignment DP Algorithm
Original formulation: Smith & Waterman,
Journal of Molecular Biology, 1981
Interpretation of array values is different
from global sequence alignment
sim [ i, j ] = score of the best alignment of
a prefix of the i..m suffix of s and a
prefix of the jn suffix of t
Algorithm is simple modification of DP just
described - whenever score goes below 0,
start from scratch !
I.e., consider four cases and take max
Database search
Given a sequence of interest, can you
find other similar sequences (to get a
hint about structure/function)?
E.g, NCBI BLAST site
Input sequence, gives back all significant
sequence matches
Performs local alignments
Heuristic Methods for Sequence Database
Searching
Quadratic algorithm too slow for large
databases with high query traffic heuristic
methods do fast approximation to dynamic
programming
FASTA [Pearson & Lipman (1988) PNAS 85,
p2444]
http://www2.ebi.ac.uk/fasta3
BLAST [Altschul et al. (1990) JMB 215,
p403]
http://www.ncbi.nlm.nih.gov/BLAST
Speeding up searches
Give up optimality, use heuristics
For a query sequence, require its
matches to share a k-mer exactly
(e.g., k=11)
Fundamental innovation: use hashing (or
other search data structures) to find
(quickly) places in database where
each k-mer in the query sequence
occurs
32
BLAST algorithm
Remove low-complexity regions.
Make a list of all words of length 3 amino acids or 11 nucleotides.
Augment the list to include similar words.
Scan the database for occurrences of the words
Connect nearby occurrences.
Extend the matches.
Prune the list of matches using a score threshold.
Evaluate the significance of each remaining match.
Very important !
Perform Smith-Waterman to get an alignment.
33
BLAST Notes
May fail to find all high-scoring segment pairs
-Heuristic approach
Empirically, more than an order of magnitude faster
than Smith-Waterman
Large impact:
NCBIs BLAST server handles thousands of
queries a day
most used (and cited) bioinformatics program