CS-434 BIOINFORMATICS
DR. UROOJ AINUDDIN
CHAPTER 4
ALGORITHMS FOR DATABASE SEARCHING
Requirements for database search
Sensitivity: the ability to find as many correct hits or true
positives as possible.
Selectivity, also called specificity: the ability to exclude
incorrect hits or false positives.
Speed: the time it takes to get results from database searches.
Balancing requirements
Ideally, one wants to have the greatest sensitivity, selectivity,
and speed in database searches.
An increase in sensitivity is associated with a decrease in
selectivity. A very inclusive search tends to include many false
positives.
Similarly, an improvement in speed often comes at the cost of
lowered sensitivity and selectivity.
A compromise between the three criteria often has to be made.
Heuristic algorithms
A heuristic algorithm is a computational strategy to find
an empirical or near optimal solution by using rules of
thumb.
Essentially, this type of algorithm takes shortcuts by
reducing the search space according to some criteria.
However, the shortcut strategy is not guaranteed to find
the best or most accurate solution.
It is often used because of the need for obtaining results
within a realistic time frame without significantly
sacrificing the accuracy of the computational output.
Why shift to heuristics from exhaustive
algorithms?
Searching a large database using the dynamic
programming methods, although accurate and reliable,
is too slow and impractical when computational
resources are limited.
Speed of searching is an important issue, for which
heuristic methods must be used.
The heuristic algorithms perform faster searches because
they examine only a fraction of the possible alignments
examined in dynamic programming.
Heuristic sequence search programs
Currently, there are two major heuristic algorithms for performing
database searches:
1. BLAST and
2. FASTA.
These methods are not guaranteed to find the optimal alignment or
true homologs but are 50–100 times faster than dynamic
programming.
The increased computational speed comes at a moderate expense
of sensitivity and specificity of the search.
Both programs can provide a reasonably good indication of
sequence similarity by identifying similar sequence segments.
Word method
Both BLAST and FASTA use a heuristic word method for fast pairwise
sequence alignment.
It finds short stretches of identical or nearly identical letters in two
sequences. These short strings of characters are called words.
The basic assumption is that two related sequences must have at
least one word in common.
By first identifying word matches, a longer alignment can be
obtained by extending similarity regions from the words.
Once regions of high sequence similarity are found, adjacent high-
scoring regions can be joined into a full alignment.
Basic Local Alignment Search Tool
(BLAST)
The BLAST program was developed by Stephen Altschul of NCBI in 1990
and has since become one of the most popular programs for sequence
analysis.
BLAST uses heuristics to align a query sequence with all sequences in a
database.
The objective is to find high-scoring ungapped segments among
related sequences.
The existence of such segments above a given threshold indicates
pairwise similarity beyond random chance, which helps to discriminate
related sequences from unrelated sequences in a database.
What is chance?
To assess if a given alignment constitutes evidence for
homology, it helps to know how strongly an alignment
can be expected from chance alone.
In this context, "chance" can mean the comparison of:
1. Real but non-homologous sequences;
2. Real sequences that are shuffled to preserve
compositional properties; or
3. Sequences that are generated randomly based upon
a DNA or protein sequence model.
Steps of BLAST
The first step is to create
a list of words from the
query sequence.
Each word is typically
three residues for protein
sequences and eleven
residues for DNA
sequences.
The list includes every
possible word extracted
from the query
sequence.
This step is also called
seeding.
Steps of BLAST
First step:
Words created from
MRDPYNKLIS are:
1. MRD
2. RDP
3. DPY
4. PYN
5. YNK
6. NKL
7. KLI
8. LIS
Steps of BLAST
First step:
If L is the length of the
query sequence and K is
the length of the word,
the number of words
formed is L - K + 1.
For MRDPYNKLIS, L = 10
and K = 3, giving 8 words.
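The seeding step can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea, not the BLAST implementation; the function name query_words is a hypothetical choice.

```python
def query_words(sequence, k):
    """Return every overlapping word of length k, in order of position."""
    # A sequence of length L yields L - k + 1 words.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


words = query_words("MRDPYNKLIS", 3)
print(words)
print(len(words))
```

The list length equals L - K + 1, so the ten-residue query with K = 3 gives eight words.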
Steps of BLAST
The second step is to
search a sequence
database for the
occurrence of words
derived from the query.
This step identifies
database sequences
containing matching
words.
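A minimal sketch of the second step, assuming exact word matches only (scored matching with a substitution matrix is not modeled here). The function name, the dictionary layout of the database, and the example sequences are hypothetical.

```python
def find_word_hits(query_words, database):
    """Report (sequence name, 0-based position, word) for each query word found."""
    word_set = set(query_words)
    k = len(query_words[0])  # assumes all words have the same length
    hits = []
    for name, sequence in database.items():
        for pos in range(len(sequence) - k + 1):
            word = sequence[pos:pos + k]
            if word in word_set:
                hits.append((name, pos, word))
    return hits


# Hypothetical two-entry database, for illustration only.
toy_database = {"seq1": "AAPYNKG", "seq2": "GGGGG"}
print(find_word_hits(["PYN", "YNK"], toy_database))
```

Using a set for the query words keeps each lookup cheap, which is the reason word methods scan a database quickly.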
Steps of BLAST
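Second step:
The matching of protein
words is scored by
BLOSUM62.
Figure (scores of four database words against the query word PYN):
PYN scores 7+7+6 = 20, PFN scores 7+3+6 = 16, and two further
words score 7+3+0 = 10 each.
For DNA words, match
score is 5 and mismatch
score is -4.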
Steps of BLAST
The third step is to classify
the database words
found in the second
step.
A database word is
considered a significant
match to a query word if
its score is above a
threshold T.
If T is taken to be 17, only
PYN is a match.
If T is 13, both PYN and
PFN are matches.
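The threshold test can be sketched as follows. Only the four BLOSUM62 entries needed for the PYN and PFN example are included (P/P = 7, Y/Y = 7, N/N = 6, Y/F = 3); a real search would use the full matrix. The names BLOSUM62_EXCERPT, word_score, and significant_words are hypothetical.

```python
# Excerpt of BLOSUM62: just the pairs needed for this example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7,
    ("Y", "Y"): 7,
    ("N", "N"): 6,
    ("Y", "F"): 3,
}


def pair_score(a, b):
    """Look up a residue pair; the matrix is symmetric."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def word_score(query_word, database_word):
    """Sum the pair scores position by position."""
    return sum(pair_score(a, b) for a, b in zip(query_word, database_word))


def significant_words(query_word, candidates, threshold):
    """Keep the database words whose score is above the threshold T."""
    return [w for w in candidates if word_score(query_word, w) > threshold]


print(significant_words("PYN", ["PYN", "PFN"], 17))
print(significant_words("PYN", ["PYN", "PFN"], 13))
```

With T = 17 only PYN (score 20) passes; lowering T to 13 also admits PFN (score 16), which shows how T trades selectivity for sensitivity.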
Steps of BLAST
The fourth step involves
pairwise alignment by
extending from the
words in both directions
while counting the
alignment score using
BLOSUM62.
The extension continues
until the score of the
alignment drops below a
threshold due to
mismatches.
Steps of BLAST
Fourth step:
The drop threshold is 22
for proteins and 20 for
DNA.
The resulting contiguous
aligned segment pair
without gaps is called
high-scoring segment
pair (HSP).
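A minimal sketch of ungapped extension to the right of a word hit. It assumes the DNA scores given earlier (match 5, mismatch -4) and interprets the drop threshold as the largest allowed fall below the best score seen so far; that interpretation, the function name, and the parameters are assumptions. Extension to the left would mirror this loop.

```python
def extend_right(query, subject, q_start, s_start, word_len,
                 match=5, mismatch=-4, x_drop=20):
    """Extend an exact word hit rightwards without gaps.

    Returns (length, score) of the best-scoring segment found.
    """
    score = word_len * match          # the seed is assumed to match exactly
    best_score, best_len = score, word_len
    i, j = q_start + word_len, s_start + word_len
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        if score > best_score:
            # New best: remember how far the segment reaches.
            best_score, best_len = score, i - q_start + 1
        elif best_score - score > x_drop:
            # Score fell too far below the best: stop extending.
            break
        i += 1
        j += 1
    return best_len, best_score


print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, 4))
```

The segment is trimmed back to the point of the best score, so trailing mismatches do not lower the reported HSP score.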
Output
In the original version of BLAST, the highest scored HSPs
are presented as the final report.
They are also called maximum scoring pairs.
A later improvement in the implementation of BLAST is
the ability to provide gapped alignment.
This improved version of BLAST presents HSPs that may
contain gaps.
Gapped BLAST
In gapped BLAST, the highest scored segment is chosen
to be extended in both directions using dynamic
programming where gaps may be introduced.
The extension continues while the alignment score stays above
a certain threshold; otherwise it is terminated.
However, the overall score may drop below the threshold
temporarily, provided that it rises again to
above-threshold values.
Variants of BLAST
BLAST is a family of programs that includes BLASTN,
BLASTP, BLASTX, TBLASTN, and TBLASTX.
BLASTN uses nucleotide sequences as queries to search
against a nucleotide sequence database.
BLASTP uses protein sequences as queries to search
against a protein sequence database.
BLASTX uses nucleotide sequences as queries and
translates them in all six reading frames to produce
translated protein sequences, which are used to query a
protein sequence database.
Reading frame
In molecular biology, a reading frame is a way of dividing the
sequence of nucleotides in a nucleic acid (DNA or RNA) molecule
into a set of consecutive, non-overlapping triplets.
Where these triplets equate to amino acids or stop signals during
translation, they are called codons.
A single strand of a nucleic acid molecule has a phosphoryl end,
called the 5′-end, and a hydroxyl or 3′-end.
There are three reading frames that can be read in this 5′→3′
direction, each beginning from a different nucleotide in a triplet.
In a double stranded nucleic acid, an additional three reading
frames may be read from the other, complementary strand in the
5′→3′ direction.
Reading frames
From 5′ to 3′ on this strand:
AGG-TGA-CAC-CGC-AAG-CCT-TAT-ATT-AGC
A GGT-GAC-ACC-GCA-AGC-CTT-ATA-TTA GC
AG GTG-ACA-CCG-CAA-GCC-TTA-TAT-TAG C
From 5′ to 3′ on the complementary strand:
GCT-AAT-ATA-AGG-CTT-GCG-GTG-TCA-CCT
G CTA-ATA-TAA-GGC-TTG-CGG-TGT-CAC CT
GC TAA-TAT-AAG-GCT-TGC-GGT-GTC-ACC T
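The six reading frames can be reproduced with a short Python sketch. It splits a DNA strand and its reverse complement into complete triplets only, dropping leftover nucleotides at either end; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def triplets(seq, offset):
    """Split seq into complete triplets starting at the given offset (0, 1, or 2)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Three frames on the given strand, then three on the complementary strand."""
    strands = (seq, reverse_complement(seq))
    return [triplets(strand, offset) for strand in strands for offset in range(3)]


for frame in six_frames("AGGTGACACCGCAAGCCTTATATTAGC"):
    print("-".join(frame))
```

BLASTX, TBLASTN, and TBLASTX translate each of these six triplet series into a protein sequence before comparison.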
Variants of BLAST
TBLASTN uses protein sequences as queries to search
against a nucleotide sequence database whose sequences
are translated in all six reading frames.
TBLASTX uses nucleotide sequences, which are
translated in all six frames, to search against a
nucleotide sequence database that has all the
sequences translated in six frames.
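The five variants can be summarized as a small lookup table. This is only a restatement of the descriptions above in Python; the table layout and the helper programs_for are hypothetical.

```python
# name: (query type, database type, query translated?, database translated?)
BLAST_PROGRAMS = {
    "BLASTN":  ("nucleotide", "nucleotide", False, False),
    "BLASTP":  ("protein",    "protein",    False, False),
    "BLASTX":  ("nucleotide", "protein",    True,  False),
    "TBLASTN": ("protein",    "nucleotide", False, True),
    "TBLASTX": ("nucleotide", "nucleotide", True,  True),
}


def programs_for(query_type, database_type):
    """List the variants that accept this query type and database type."""
    return [
        name
        for name, (query, database, _, _) in BLAST_PROGRAMS.items()
        if query == query_type and database == database_type
    ]


print(programs_for("nucleotide", "protein"))
print(programs_for("nucleotide", "nucleotide"))
```

A nucleotide query against a nucleotide database has two options: BLASTN compares the sequences directly, while TBLASTX compares their six-frame translations.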
BLAST web server
https://blast.ncbi.nlm.nih.gov/Blast.cgi
The BLAST web server has been designed in such
a way as to simplify the task of program
selection.
The programs are organized based on the type
of query sequence: protein, nucleotide, or
nucleotide to be translated.
In addition, programs for special purposes are
grouped separately.
BLAST output format
The BLAST output includes a graphic summary and the
alignments that were found.
The graphic summary contains colored horizontal bars
that allow quick identification of the number of
database hits and the degrees of similarity of the hits.
The length of the bars represents the spans of sequence
alignments relative to the query sequence.
Each hit includes the accession number, title of the
database record, score, and E-value.
BLAST output format
In the alignment section, the query sequence is
on the top of the pair and the database
sequence is at the bottom of the pair labeled as
Subject.
In between the two sequences, matching
identical residues are written out at their
corresponding positions.
Interpreting E-values
The BLAST output provides a list of pairwise sequence
matches ranked by statistical significance.
The significance scores help to distinguish evolutionarily
related sequences from unrelated ones.
The E-value (expectation value) indicates how likely it is that
a given sequence match arose purely by chance.
E = m × n × P, where m is the total number of residues in the
database, n is the number of residues in the query
sequence, and P is the probability that an HSP alignment is
a result of random chance.
Interpreting E-values
The lower the E-value, the more significant the match is.
If E < 10⁻⁵⁰, the database match IS a result of homologous
relationships!
If 10⁻⁵⁰ < E < 0.01, the match can be a result of homology.
If 0.01 < E < 10, the match is considered not significant, but may
hint at a tentative remote homology relationship. Additional
evidence is needed to confirm the tentative relationship.
If E > 10, the sequences under consideration are either unrelated
or related by extremely distant relationships that fall below the
limit of detection of the current method.
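These rules of thumb can be written as a small Python helper. The ranges above use strict inequalities, so assigning the boundary values (exactly 10⁻⁵⁰, 0.01, or 10) to the less significant category is an assumption made here; the function name and the wording of the labels are hypothetical.

```python
def interpret_e_value(e):
    """Map an E-value to the interpretation categories listed above."""
    if e < 1e-50:
        return "homologous"
    if e < 0.01:
        return "can be homologous"
    if e < 10:
        return "not significant, possible remote homology"
    return "unrelated or below the limit of detection"


for value in (1e-80, 1e-5, 0.5, 50):
    print(value, "->", interpret_e_value(value))
```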
Shortcoming of E-value
Because the E-value is proportionally affected by the
database size, as the database grows, the E-value for a
given sequence match also increases, and the match
loses significance.
Because the genuine evolutionary relationship between
the two sequences remains constant, the decrease in
credibility of the match as the database grows means
that one may “lose” previously detected homologs as
the database enlarges.
Thus, an alternative to E-value calculations is needed.
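The effect of database growth follows directly from E = m × n × P, as this sketch shows. The residue counts and the probability are made-up numbers chosen only to illustrate the proportionality; the function name is hypothetical.

```python
import math


def e_value(m, n, p):
    """E = m * n * P: database residues, query residues, chance probability."""
    return m * n * p


# Hypothetical values: the same match, before and after the database doubles.
before = e_value(m=1_000_000, n=100, p=1e-12)
after = e_value(m=2_000_000, n=100, p=1e-12)
print(before, after)
print(math.isclose(after, 2 * before))
```

Doubling m doubles E even though n and P, and therefore the match itself, are unchanged.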
Interpreting bit scores
The bit score measures sequence similarity
independent of query sequence length and
database size.
It is a normalized form of the raw alignment score
and increases linearly with it.
The higher the bit score, the more significant the
match is.
It provides a constant statistical indicator for
searching databases of different sizes or for
searching the same database at different times.
FAST-All (FASTA)
FASTA is a DNA and protein sequence alignment software
package first described by David J. Lipman and William R.
Pearson in 1985.
Its legacy is the FASTA format which is now ubiquitous in
bioinformatics.
FASTA takes a given sequence and searches a
corresponding sequence database by using local
sequence alignment to find similar database sequences.
Steps of FASTA
The first step is to
identify k-tups
between two
sequences by using
the hashing strategy.
We construct a
lookup table that
shows the position of
each k-tup for the
two sequences.
Steps of FASTA
First step:
The positional difference
is obtained by
subtracting the position
in the first sequence from
that in the second
sequence and is
expressed as the offset.
The k-tups that have the
same offset values are
then linked to reveal a
diagonal on a dot plot.
Steps of FASTA
First step (dot plot of AMPSDGL against GPSDNAT; a dot marks
identical residues, and the P, S, and D matches lie on one diagonal):
  A M P S D G L
G . . . . . • .
P . . • . . . .
S . . . • . . .
D . . . . • . .
N . . . . . . .
A • . . . . . .
T . . . . . . .
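The lookup table and offsets can be sketched with a Python dictionary as the hash. The sketch assumes AMPSDGL is the first sequence and GPSDNAT the second, uses 1-based positions, and takes a k-tup of 1 so that the small example produces hits; the function names are hypothetical.

```python
from collections import defaultdict


def ktup_positions(seq, k):
    """Hash each k-tup to the list of its 1-based positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table


def ktups_by_offset(seq1, seq2, k):
    """Group shared k-tups by offset = position in seq2 - position in seq1."""
    table1 = ktup_positions(seq1, k)
    diagonals = defaultdict(list)
    for j in range(len(seq2) - k + 1):
        word = seq2[j:j + k]
        for i in table1.get(word, []):
            diagonals[(j + 1) - i].append(word)
    return dict(diagonals)


print(ktups_by_offset("AMPSDGL", "GPSDNAT", 1))
```

The offset shared by the most k-tups (here -1, for P, S, and D) identifies the diagonal that is most likely to belong to a real alignment.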
Steps of FASTA
First step:
All possible ungapped
alignments are found
between two
sequences.
Frequently, users take the
k-tup to be at least 2
residues long for protein
sequences and at least 4
to 6 residues long for
nucleotide sequences.
Steps of FASTA
The second step is to
narrow down the high
similarity regions
between the two
sequences.
The top ten diagonals
are identified as high
similarity regions.
For amino acid
sequences, the
diagonals are rescored
using a substitution
matrix.
Steps of FASTA
In the third step,
neighboring high-scoring
segments are selected
and joined to form a
single alignment.
This step allows
introducing gaps
between the diagonals
while applying gap
penalties.
The score of the gapped
alignment is calculated
again.
Steps of FASTA
In the fourth step, the
gapped alignment is
refined further using the
Smith–Waterman
algorithm to produce a
final alignment.
The last step is to perform
a statistical evaluation of
the final alignment as in
BLAST, which produces
the E-value.
FASTA web server
https://www.ebi.ac.uk/Tools/sss/fasta/
The FASTA web server is maintained by
EMBL's European Bioinformatics Institute.
The database used for FASTA is UniProt.
Comparison of BLAST and FASTA
BLAST uses a substitution matrix to find matching words,
whereas FASTA identifies identical words using the dot
plot.
By default, FASTA scans smaller window sizes. Thus, it
gives more sensitive results than BLAST, with a better
coverage rate for homologs.
FASTA is usually slower than BLAST.
BLAST sometimes gives multiple best-scoring alignments
from the same sequence; FASTA returns only one final
alignment.