Hierarchical Clustering Implementation

This document discusses hierarchical clustering and single-link hierarchical clustering algorithms. It provides details on how single-link hierarchical clustering works, including determining the distance between clusters based on the closest objects in different clusters. It also describes how to implement single-link clustering in Java, including reading in data, precomputing distances, and the main clustering loop. Finally, it discusses some optimizations that can be done, such as using floats instead of doubles to reduce memory usage.

Hierarchical clustering implementation

Introduction to Computer Science

Robert Sedgewick and Kevin Wayne

http://www.cs.Princeton.EDU/IntroCS

Single linkage (nearest neighbor): In this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in the different clusters.

Complete linkage (furthest neighbor): In this method, the distance between two clusters is determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").

Group average linkage: In this method, the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
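The three definitions above differ only in how the pairwise distances are combined. A minimal sketch, assuming each cluster is a list of points stored as double[] and that the distance is Euclidean (the class name Linkage and its method names are illustrative, not taken from the slides):

```java
import java.util.List;

class Linkage {
    // Euclidean distance between two points of equal dimension (an assumed metric).
    static double dist(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) sum += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(sum);
    }

    // Single linkage: distance between the two closest objects.
    static double single(List<double[]> x, List<double[]> y) {
        double best = Double.POSITIVE_INFINITY;
        for (double[] p : x)
            for (double[] q : y) best = Math.min(best, dist(p, q));
        return best;
    }

    // Complete linkage: distance between the two furthest objects.
    static double complete(List<double[]> x, List<double[]> y) {
        double worst = 0.0;
        for (double[] p : x)
            for (double[] q : y) worst = Math.max(worst, dist(p, q));
        return worst;
    }

    // Group average linkage: mean distance over all cross-cluster pairs.
    static double average(List<double[]> x, List<double[]> y) {
        double sum = 0.0;
        for (double[] p : x)
            for (double[] q : y) sum += dist(p, q);
        return sum / (x.size() * y.size());
    }
}
```

For clusters {(0,0), (1,0)} and {(3,0), (5,0)} the cross-cluster distances are 3, 5, 2 and 4, so the three methods return 2, 5 and 3.5 respectively.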

Single-Link Hierarchical Clustering

Iteration.
Closest pair of clusters (i, j) is one with the smallest dist value.
Replace row i by min of row i and row j.
Infinity out row j and column j.
Update dmin[i] and change dmin[i'] to i if previously dmin[i'] = j.
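Before the merge. Closest pair: gene1 is closest to gene3, dist = 2.14, so i = 1, j = 3.

        gene0  gene1  gene2  gene3  gene4    dmin  dist
gene0     -     5.5    7.3    8.9    5.8      1    5.5
gene1    5.5     -     6.1    2.14   5.6      3    2.14
gene2    7.3    6.1     -     7.8    5.6      4    5.6
gene3    8.9    2.14   7.8     -     5.5      1    2.14
gene4    5.8    5.6    5.6    5.5     -       3    5.5

After the merge. Row 1 now represents node1; row 3 and column 3 are set to infinity (shown as -).

        gene0  node1  gene2  gene3  gene4    dmin  dist
gene0     -     5.5    7.3     -     5.8      1    5.5
node1    5.5     -     6.1     -     5.5      0    5.5
gene2    7.3    6.1     -      -     5.6      4    5.6
gene3     -      -      -      -      -       -     -
gene4    5.8    5.5    5.6     -      -       1    5.5

New min dist: the node1/gene4 entry becomes 5.5, the minimum of 5.6 (gene1 to gene4) and 5.5 (gene3 to gene4).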
Single-Link Clustering: Java Implementation

Single-link clustering.
Read in the data.
Precompute d[i][j] = distance between cluster i and j.
For each cluster i, maintain index dmin[i] of closest cluster.

    double INFINITY = Double.POSITIVE_INFINITY;
    double[][] d = new double[N][N];   // d[i][j] = distance between clusters i and j
    int[] dmin = new int[N];           // dmin[i] = index of the cluster closest to i
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            // a cluster is never its own nearest neighbor
            if (i == j) d[i][j] = INFINITY;
            else        d[i][j] = vectors[i].distanceTo(vectors[j]);
            // remember the closest cluster seen so far in row i
            if (d[i][j] < d[i][dmin[i]]) dmin[i] = j;
        }
    }

Single-Link Clustering: Main Loop

    for (int s = 0; s < N-1; s++) {

        // find closest pair of clusters (i1, i2)
        int i1 = 0;
        for (int i = 0; i < N; i++)
            if (d[i][dmin[i]] < d[i1][dmin[i1]]) i1 = i;
        int i2 = dmin[i1];

        // overwrite row i1 with minimum of entries in row i1 and i2
        for (int j = 0; j < N; j++)
            if (d[i2][j] < d[i1][j]) d[i1][j] = d[j][i1] = d[i2][j];
        d[i1][i1] = INFINITY;

        // infinity-out old row i2 and column i2
        for (int i = 0; i < N; i++)
            d[i2][i] = d[i][i2] = INFINITY;

        // update dmin and redirect entries that previously pointed to
        // i2 so that they point to i1
        for (int j = 0; j < N; j++) {
            if (dmin[j] == i2) dmin[j] = i1;
            if (d[i1][j] < d[i1][dmin[i1]]) dmin[i1] = j;
        }
    }

Store Centroids in Each Internal Node

Cluster analysis.
Centroids distance / similarity.
Easy modification to TreeNode data structure.
Store Vector in each node.
  leaf nodes: correspond directly to a gene
  internal nodes: centroid = average of all leaf nodes beneath it
Maintain count field in each TreeNode, which equals the number of leaf nodes beneath it.
When setting z to be parent of x and y,
  set z.count = x.count + y.count
  set z.vector = αp + (1-α)q, where p = x.vector, q = y.vector, and α = x.count / z.count
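The weighted update keeps each internal node's vector equal to the plain average of the leaves beneath it, without revisiting those leaves. A minimal sketch, assuming the vector is stored as a double[] rather than the course's Vector type (the class CentroidNode is hypothetical and stands in for the slides' TreeNode):

```java
class CentroidNode {
    final CentroidNode left, right;  // null for leaves
    final int count;                 // number of leaf nodes beneath this node
    final double[] vector;           // gene vector (leaf) or centroid (internal node)

    // Leaf node: corresponds directly to one gene.
    CentroidNode(double[] geneVector) {
        this.left = null;
        this.right = null;
        this.count = 1;
        this.vector = geneVector.clone();
    }

    // Internal node z with children x and y.
    CentroidNode(CentroidNode x, CentroidNode y) {
        this.left = x;
        this.right = y;
        this.count = x.count + y.count;
        double alpha = (double) x.count / this.count;  // weight of x's centroid
        this.vector = new double[x.vector.length];
        for (int k = 0; k < vector.length; k++)
            vector[k] = alpha * x.vector[k] + (1 - alpha) * y.vector[k];
    }
}
```

For example, merging leaves (0,0) and (2,0) gives centroid (1,0) with count 2; merging that node with leaf (4,6) uses weight 2/3 and gives (2,2), which is the average of all three leaves.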

Analysis and Micro-Optimizations

Running time. Proportional to MN^2 (N genes, M arrays); the input size is proportional to MN.
Memory. Proportional to N^2.
Ex. [M = 50, N = 6,000] Takes 280MB, 48 sec on fast PC.
Some optimizations. How much do you think each would help?
Use float instead of double: decreases memory usage by a factor of 2, but probably doesn't make it faster.
Store only lower triangular part of distance matrix: decreases memory usage by a factor of 2 and makes things somewhat faster.
Use squares of distances instead of distances: only about 10% of time is spent precomputing the distance matrix, so avoiding square roots will help, but not that much.
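The first two optimizations can be combined by packing the strict lower triangle into a one-dimensional float array. This is only a sketch of one possible layout; the class TriangularDistances and its index formula are assumptions, not code from the slides.

```java
class TriangularDistances {
    private final float[] a;  // entries (i, j) with i > j, packed row by row

    TriangularDistances(int n) {
        // n(n-1)/2 entries; the int arithmetic is fine for n up to about 46,000
        this.a = new float[n * (n - 1) / 2];
    }

    // Position of the unordered pair {i, j}, i != j, in the packed array.
    private static int index(int i, int j) {
        int hi = Math.max(i, j), lo = Math.min(i, j);
        return hi * (hi - 1) / 2 + lo;
    }

    // The diagonal is not stored; it always reads as infinity.
    float get(int i, int j) {
        return i == j ? Float.POSITIVE_INFINITY : a[index(i, j)];
    }

    // Writes to the diagonal are ignored.
    void set(int i, int j, float value) {
        if (i != j) a[index(i, j)] = value;
    }

    int storedEntries() {
        return a.length;
    }
}
```

With N = 6,000 this stores 17,997,000 floats, about 72 MB, roughly a quarter of the full N-by-N double matrix. A side effect of the layout is that setting (i, j) also sets (j, i), so the symmetric double assignment in the main loop would no longer be needed.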

Sequence!

Some slides from Mona Singh, Serafim Batzoglou, Olga Troyanskaya


Bio-Sequences
Complete genomes of >1000 organisms
www.ncbi.nlm.nih.gov/Genomes/index.html
>100 billion bases in GenBank (NCBI)
>509,000 proteins in SWISSPROT (hand curated); >9,300,000 proteins in TREMBL (computer annotated).
us.expasy.org/sprot

Next Gen Sequencers
>20 billion bases per run!
Illumina's Spring 2009 charge for sequencing your genome: $48,000 for 30-fold coverage
Illumina/Solexa High Throughput Sequencing Machine

Biomolecules as Strings
Macromolecules are the chemical
building blocks of cells

Proteins

20 amino acids

Nucleic acids

4 nucleotides {A, C, G, T}

Role of Evolution
Molecular structures and mechanisms are
reused and changed during evolution
Often mechanisms that are conserved can be
detected based on sequence similarity
Powerful tool for annotation

Ex: Protein Sequences

Horse vs Human Myoglobin (Global alignment of sequences)

GLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASED
GLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASED

LKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHP
LKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHP

GDFGADAQGAMTKALELFRNDIAAKYKELGFQG
GDFGADAQGAMNKALELFRKDMASNYKELGFQG

Same protein in two different organisms; it can be identified based on sequence similarity (88% identical).
Myoglobin: intracellular storage of oxygen

Global alignment: Issues with transferring annotations

Horse Myoglobin vs Human Hemoglobin Alpha

MGLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASEDL
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG---

KKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKHPG
--HGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA

DFGADAQGAMTKALELFRNDIAAKYKELGFQG
EFTPAVHASLDKFLASVSTVLTSKYR------

~25% identical; other similar amino acids

Myoglobin: intracellular storage of oxygen
Hemoglobin: transports oxygen

Basic Tool to Detect Sequence Similarity: Alignments

Given:
a pair (or more) of sequences (DNA or protein)
a method for scoring the similarity of a pair of characters (= bases or amino acids)
Determine: correspondences between characters in the sequences such that the similarity score is maximized

Pairwise global alignment

Given two sequences, a scoring scheme with a
gap function, line up the sequences (with
insertion of gaps) to maximize the score
E.g., match = 1
mismatch = -1
gap = -2
E.g., say your two sequences are
AACAGTTACC, TAAGGTCA

AACAGTTACC
TA-AGGT-CA
Score = ?
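To answer the question on the slide, the alignment can be scored one column at a time. A small sketch, assuming gaps are written as '-' and both rows have the same length (the class AlignmentScore is illustrative only):

```java
class AlignmentScore {
    // Scores an already-aligned pair of rows; '-' marks a gap.
    static int score(String rowA, String rowB, int match, int mismatch, int gap) {
        if (rowA.length() != rowB.length())
            throw new IllegalArgumentException("aligned rows must have equal length");
        int total = 0;
        for (int k = 0; k < rowA.length(); k++) {
            char x = rowA.charAt(k), y = rowB.charAt(k);
            if (x == '-' || y == '-') total += gap;       // gap column
            else if (x == y)          total += match;     // identical characters
            else                      total += mismatch;  // substitution
        }
        return total;
    }

    public static void main(String[] args) {
        // prints -2
        System.out.println(score("AACAGTTACC", "TA-AGGT-CA", 1, -1, -2));
    }
}
```

The alignment shown has 5 matches, 3 mismatches and 2 gaps, so its score is 5 - 3 - 4 = -2.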

Naive way to find optimal alignments

Enumerate all possible alignments
Score all possible alignments
Take best scoring alignment

Problem: There are too many possible alignments between 2 sequences!
Solution: dynamic programming

RECALL: homework assignment from last term!
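To get a feel for "too many", the alignments can be counted rather than enumerated. The sketch below assumes that two alignments are different whenever their sequences of columns differ; under that convention the count for lengths m and n satisfies a three-term recurrence, one term per choice of last column (the class AlignmentCount is illustrative, not from the slides):

```java
import java.math.BigInteger;

class AlignmentCount {
    // Number of global alignments of sequences with lengths m and n,
    // counting each distinct sequence of columns once.
    static BigInteger count(int m, int n) {
        BigInteger[][] c = new BigInteger[m + 1][n + 1];
        // aligning against an empty sequence leaves only gap columns
        for (int i = 0; i <= m; i++) c[i][0] = BigInteger.ONE;
        for (int j = 0; j <= n; j++) c[0][j] = BigInteger.ONE;
        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++)
                // last column: char/gap, gap/char, or char/char
                c[i][j] = c[i - 1][j].add(c[i][j - 1]).add(c[i - 1][j - 1]);
        return c[m][n];
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 20; n++)
            System.out.println(n + " x " + n + ": " + count(n, n));
    }
}
```

Two sequences of length 3 already have 63 alignments under this convention, and the count grows exponentially with length, which is why enumeration is hopeless for real sequences.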

Pairwise Alignment

Needleman & Wunsch, Journal of Molecular Biology, 1970
Dynamic programming (DP): general technique to solve an instance of a problem by taking advantage of computed solutions for smaller subparts of the problem
Here, determine alignment of two sequences by determining alignment of all suffixes of the sequences
(suffixes are the subparts we'll save solutions for)

Dynamic Programming Idea

Say aligning AAAC with AGC
Consider what happens in the first column
Three possible options; each corresponds to a different alignment of the first column. Choose each one and add its score to the best alignment of the remaining suffixes.

  first column | remaining suffixes

       A       |  AAC
       A       |  GC

       -       |  AAAC
       A       |  GC

       A       |  AAC
       -       |  AGC

First column: score of aligning these characters.
Remaining suffixes: consider the best alignment of these suffixes.

Dynamic Programming Idea

  - | AAAC        A | AAC        A | AAC
  A | GC          A | GC         - | AGC

If we knew the answers to these three subproblems, then we'd know the best alignment score between AAAC and AGC.
Consider the maximum of these three cases (the score is being maximized).

Dynamic Programming Idea

Given an m-character sequence s and an n-character sequence t, construct an (m+1) x (n+1) matrix sim where we'll store answers to subproblems.

sim[i, j] = score of the best alignment of the suffix i..m of s with the suffix j..n of t.

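Aligning AAAC with AGC

[Figure: DP matrix with one sequence labeling the rows and the other the columns. One highlighted cell holds the best alignment score of AC with GC; another holds the best alignment score of AAAC with C.]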

Dynamic Programming Rule

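$$
\mathrm{sim}[i, j] = \max
\begin{cases}
\mathrm{sim}[i+1, j] + g & \text{(gap cost)} \\
\mathrm{sim}[i, j+1] + g & \text{(gap cost)} \\
\mathrm{sim}[i+1, j+1] + \mathrm{sc}(s[i], t[j]) & \text{(similarity score between } s[i] \text{ and } t[j]\text{)}
\end{cases}
$$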
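The rule can be turned into a table-filling routine over suffixes. The following is a sketch rather than the course's reference solution: it assumes integer scores, a linear gap cost, and that the best of the three cases is the maximum; the class GlobalAlignment is a hypothetical name.

```java
class GlobalAlignment {
    // Best global alignment score of s and t; sim[i][j] covers the
    // suffix of s starting at i and the suffix of t starting at j.
    static int score(String s, String t, int match, int mismatch, int gap) {
        int m = s.length(), n = t.length();
        int[][] sim = new int[m + 1][n + 1];

        // base cases: a nonempty suffix against an empty one is all gaps
        for (int i = m - 1; i >= 0; i--) sim[i][n] = sim[i + 1][n] + gap;
        for (int j = n - 1; j >= 0; j--) sim[m][j] = sim[m][j + 1] + gap;

        // fill from the bottom-right corner toward sim[0][0]
        for (int i = m - 1; i >= 0; i--) {
            for (int j = n - 1; j >= 0; j--) {
                int sc = (s.charAt(i) == t.charAt(j)) ? match : mismatch;
                int both = sim[i + 1][j + 1] + sc;   // align s[i] with t[j]
                int gapInT = sim[i + 1][j] + gap;    // align s[i] with a gap
                int gapInS = sim[i][j + 1] + gap;    // align t[j] with a gap
                sim[i][j] = Math.max(both, Math.max(gapInT, gapInS));
            }
        }
        return sim[0][0];
    }

    public static void main(String[] args) {
        System.out.println(score("AAAC", "AGC", 1, -1, -2));
    }
}
```

With match = 1, mismatch = -1 and gap = -2, aligning AAAC with AGC scores -1: two matches, one mismatch and one gap. Recovering the alignment itself would additionally require tracing back through the table, which this sketch omits.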

How long does DP take?

Dynamic programming matrix: query sequence of length n, target sequence of length m.
There are nm entries in the matrix.
Each entry requires a constant number c of operations.
The total number of required operations is approximately nmc.
We say that the algorithm is order nm or O(nm).

Local Alignment

We just described global alignment, where we are looking for the best match between sequences from one end to the other.
Often (and more commonly), we will want a local alignment, the best match between subsequences of s and t.

Local Alignment DP Algorithm

Original formulation: Smith & Waterman, Journal of Molecular Biology, 1981
Interpretation of array values is different from global sequence alignment

sim[i, j] = score of the best alignment of a prefix of the i..m suffix of s and a prefix of the j..n suffix of t

Algorithm is a simple modification of the DP just described: whenever the score goes below 0, start from scratch!
I.e., consider four cases and take max
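The fourth case is the constant 0, which lets an alignment start fresh at any cell. A sketch under the same assumptions as the global version (integer scores, linear gap cost); the class LocalAlignment is a hypothetical name, and only the best score is returned, not the aligned substrings.

```java
class LocalAlignment {
    // Best local alignment score between any substring of s and any substring of t.
    static int score(String s, String t, int match, int mismatch, int gap) {
        int m = s.length(), n = t.length();
        int[][] sim = new int[m + 1][n + 1];  // boundary cells stay 0
        int best = 0;
        for (int i = m - 1; i >= 0; i--) {
            for (int j = n - 1; j >= 0; j--) {
                int sc = (s.charAt(i) == t.charAt(j)) ? match : mismatch;
                int v = Math.max(sim[i + 1][j + 1] + sc,
                        Math.max(sim[i + 1][j] + gap, sim[i][j + 1] + gap));
                sim[i][j] = Math.max(0, v);   // fourth case: start from scratch
                best = Math.max(best, sim[i][j]);
            }
        }
        return best;  // the best local alignment may begin at any cell
    }

    public static void main(String[] args) {
        System.out.println(score("TTACGTT", "GGACGGG", 1, -1, -2));
    }
}
```

For TTACGTT and GGACGGG with match = 1, mismatch = -1 and gap = -2, the best local alignment is the shared ACG, with score 3.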

Database search
Given a sequence of interest, can you find other similar sequences (to get a hint about structure/function)?

E.g., NCBI BLAST site
Input a sequence; it gives back all significant sequence matches
Performs local alignments

Heuristic Methods for Sequence Database Searching

The quadratic algorithm is too slow for large databases with high query traffic; heuristic methods do a fast approximation to dynamic programming.

FASTA [Pearson & Lipman (1988) PNAS 85, p2444]
http://www2.ebi.ac.uk/fasta3

BLAST [Altschul et al. (1990) JMB 215, p403]
http://www.ncbi.nlm.nih.gov/BLAST

Speeding up searches
Give up optimality, use heuristics

For a query sequence, require its matches to share a k-mer exactly (e.g., k=11)
Fundamental innovation: use hashing (or other search data structures) to find (quickly) places in database where each k-mer in the query sequence occurs
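The hashing idea can be illustrated with an ordinary hash map from each k-mer to the positions where it occurs. This is a toy sketch, not how BLAST or FASTA store their indexes; the class KmerIndex and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KmerIndex {
    // Maps every k-mer of the database string to its start positions.
    static Map<String, List<Integer>> build(String database, int k) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i + k <= database.length(); i++)
            index.computeIfAbsent(database.substring(i, i + k), key -> new ArrayList<>()).add(i);
        return index;
    }

    // Returns {queryPos, databasePos} for every k-mer shared exactly.
    static List<int[]> seeds(Map<String, List<Integer>> index, String query, int k) {
        List<int[]> hits = new ArrayList<>();
        for (int q = 0; q + k <= query.length(); q++) {
            List<Integer> positions =
                index.getOrDefault(query.substring(q, q + k), Collections.emptyList());
            for (int p : positions) hits.add(new int[] { q, p });
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = build("ACGTACGT", 4);
        for (int[] hit : seeds(index, "TTACGT", 4))
            System.out.println("query " + hit[0] + " matches database " + hit[1]);
    }
}
```

Building the index takes one pass over the database, after which each query k-mer is looked up in expected constant time. Only the positions returned as seeds would then be examined by a slower alignment step.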

BLAST algorithm

Remove low-complexity regions.

Make a list of all words of length 3 amino acids or 11 nucleotides.

Augment the list to include similar words.

Scan the database for occurrences of the words

Connect nearby occurrences.

Extend the matches.

Prune the list of matches using a score threshold.

Evaluate the significance of each remaining match. (Very important!)

Perform Smith-Waterman to get an alignment.

BLAST Notes
May fail to find all high-scoring segment pairs (it is a heuristic approach)
Empirically, more than an order of magnitude faster than Smith-Waterman
Large impact:
NCBI's BLAST server handles thousands of queries a day
Most used (and cited) bioinformatics program
