Database Searching

The document discusses algorithms for database searching, including heuristic algorithms like BLAST and FASTA. It describes how heuristic algorithms sacrifice some sensitivity and specificity for speed. The document then provides details on how BLAST works, including its word search method and use of E-values to evaluate matches.

Uploaded by

letsvansh

CS-434 BIOINFORMATICS

DR. UROOJ AINUDDIN

CHAPTER 4
ALGORITHMS FOR DATABASE SEARCHING
Requirements for database search

 Sensitivity: the ability to find as many correct hits or true positives as possible.
 Selectivity, also called specificity: the ability to exclude incorrect hits or false positives.
 Speed: the time it takes to get results from database searches.
Balancing requirements

 Ideally, one wants the greatest sensitivity, selectivity, and speed in database searches.
 An increase in sensitivity is usually associated with a decrease in selectivity: a very inclusive search tends to include many false positives.
 Similarly, an improvement in speed often comes at the cost of lowered sensitivity and selectivity.
 A compromise between the three criteria often has to be made.
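The trade-off can be made concrete with hit counts. The sketch below is only an illustration: the slides define the terms in words, so the formulas are an assumption (sensitivity as TP / (TP + FN), selectivity as TP / (TP + FP), one common reading in sequence analysis), and all counts are made up.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of the true homologs that the search reported."""
    return tp / (tp + fn)


def selectivity(tp: int, fp: int) -> float:
    """Fraction of the reported hits that are correct (assumed definition)."""
    return tp / (tp + fp)


# Hypothetical database containing 100 true homologs of the query.
TRUE_HOMOLOGS = 100
searches = {
    "strict cut-off": {"tp": 60, "fp": 2},
    "inclusive cut-off": {"tp": 95, "fp": 40},
}

for name, hits in searches.items():
    fn = TRUE_HOMOLOGS - hits["tp"]  # homologs the search missed
    print(name,
          round(sensitivity(hits["tp"], fn), 2),
          round(selectivity(hits["tp"], hits["fp"]), 2))
```

With these invented counts, the inclusive search finds more of the homologs but a smaller share of its hits is correct.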
Heuristic algorithms

 A heuristic algorithm is a computational strategy for finding an empirical or near-optimal solution by using rules of thumb.
 Essentially, this type of algorithm takes shortcuts by reducing the search space according to some criteria. However, the shortcut strategy is not guaranteed to find the best or most accurate solution.
 It is often used because results are needed within a realistic time frame without significantly sacrificing the accuracy of the computational output.
Why shift to heuristics from exhaustive
algorithms?
 Searching a large database using the dynamic
programming methods, although accurate and reliable,
is too slow and impractical when computational
resources are limited.
 Speed of searching is an important issue, for which
heuristic methods must be used.
 The heuristic algorithms perform faster searches because
they examine only a fraction of the possible alignments
examined in dynamic programming.
Heuristic sequence search programs

 Currently, there are two major heuristic algorithms for performing database searches:
1. BLAST and
2. FASTA.
 These methods are not guaranteed to find the optimal alignment or
true homologs but are 50–100 times faster than dynamic
programming.
 The increased computational speed comes at a moderate expense
of sensitivity and specificity of the search.
 Both programs can provide a reasonably good indication of
sequence similarity by identifying similar sequence segments.
Word method

 Both BLAST and FASTA use a heuristic word method for fast pairwise
sequence alignment.
 It finds short stretches of identical or nearly identical letters in two
sequences. These short strings of characters are called words.
 The basic assumption is that two related sequences must have at
least one word in common.
 By first identifying word matches, a longer alignment can be
obtained by extending similarity regions from the words.
 Once regions of high sequence similarity are found, adjacent high-
scoring regions can be joined into a full alignment.
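A minimal sketch of the word idea, assuming exact word matches only (BLAST also accepts nearly identical words). The second sequence is a made-up example and the function names are illustrative.

```python
def words(seq: str, k: int) -> set:
    """All overlapping words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_words(seq_a: str, seq_b: str, k: int) -> list:
    """Words of length k that occur in both sequences."""
    return sorted(words(seq_a, k) & words(seq_b, k))


# "AQDPYNRLVS" is a hypothetical database sequence.
print(shared_words("MRDPYNKLIS", "AQDPYNRLVS", 3))  # ['DPY', 'PYN']
```

Each shared word is a candidate starting point from which a longer alignment can be grown.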
Basic Local Alignment Search Tool
(BLAST)

 The BLAST program was developed by Stephen Altschul of NCBI in 1990 and has since become one of the most popular programs for sequence analysis.
 BLAST uses heuristics to align a query sequence with all sequences in a
database.
 The objective is to find high-scoring ungapped segments among
related sequences.
 The existence of such segments above a given threshold indicates
pairwise similarity beyond random chance, which helps to discriminate
related sequences from unrelated sequences in a database.
What is chance?

 To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone.
 In this context, "chance" can mean the comparison of:
1. Real but non-homologous sequences;
2. Real sequences that are shuffled to preserve
compositional properties; or
3. Sequences that are generated randomly based upon
a DNA or protein sequence model.
Steps of BLAST

 The first step is to create a list of words from the query sequence.
 Each word is typically
three residues for protein
sequences and eleven
residues for DNA
sequences.
 The list includes every
possible word extracted
from the query
sequence.
 This step is also called
seeding.
Steps of BLAST

 First step:
 Words created from
MRDPYNKLIS are:
1. MRD
2. RDP
3. DPY
4. PYN
5. YNK
6. NKL
7. KLI
8. LIS
Steps of BLAST

 First step:
 If L is the length of the query sequence and K is the length of the word, the number of words formed is L − K + 1. For MRDPYNKLIS, L = 10 and K = 3, giving 10 − 3 + 1 = 8 words.
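The seeding step can be sketched in a few lines. The function name is illustrative; the query is the one used on the slide.

```python
def query_words(query: str, k: int = 3) -> list:
    """Every overlapping word of length k, in order of position."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


seeds = query_words("MRDPYNKLIS", k=3)
print(seeds)       # ['MRD', 'RDP', 'DPY', 'PYN', 'YNK', 'NKL', 'KLI', 'LIS']
print(len(seeds))  # 8, i.e. L - K + 1 = 10 - 3 + 1
```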
Steps of BLAST

 The second step is to search a sequence database for the occurrence of words derived from the query.
 This step identifies
database sequences
containing matching
words.
Steps of BLAST

Word scores shown in the figure: 7+7+6 = 20 (PYN), 7+3+6 = 16 (PFN), 7+3+0 = 10, 7+3+0 = 10.

 Second step:
 The matching of protein
words is scored by
BLOSUM62.
 For DNA words, match
score is 5 and mismatch
score is -4.
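Word scoring can be sketched as a sum of per-position scores. The table below is a small excerpt of BLOSUM62 covering only the residue pairs used here, not the full matrix; the word PFQ and the DNA words are made-up examples.

```python
# Excerpt of BLOSUM62: only the pairs needed for this example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Y", "Y"): 7, ("N", "N"): 6,
    ("Y", "F"): 3, ("N", "Q"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a residue pair; the matrix is symmetric."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def protein_word_score(query_word: str, db_word: str) -> int:
    return sum(pair_score(a, b) for a, b in zip(query_word, db_word))


def dna_word_score(query_word: str, db_word: str,
                   match: int = 5, mismatch: int = -4) -> int:
    return sum(match if a == b else mismatch
               for a, b in zip(query_word, db_word))


print(protein_word_score("PYN", "PYN"))  # 7 + 7 + 6 = 20
print(protein_word_score("PYN", "PFN"))  # 7 + 3 + 6 = 16
print(protein_word_score("PYN", "PFQ"))  # 7 + 3 + 0 = 10
print(dna_word_score("ACGT", "ACGA"))    # 5 + 5 + 5 - 4 = 11
```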
Steps of BLAST

 The third step is to classify the database words found in the second step.
 A database word is
considered a significant
match to a query word if
its score is above a
threshold T.
 If T is taken to be 17, only
PYN is a match.
 If T is 13, both PYN and
PFN are matches.
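The classification step amounts to a filter on word scores. This sketch assumes "above the threshold" means strictly greater than T, and uses the BLOSUM62 scores of PYN (20) and PFN (16) against the query word PYN.

```python
def significant_words(word_scores: dict, threshold: int) -> list:
    """Database words whose score is strictly above the threshold T."""
    return [word for word, score in word_scores.items() if score > threshold]


# Scores of database words against the query word PYN.
word_scores = {"PYN": 20, "PFN": 16}

print(significant_words(word_scores, 17))  # ['PYN']
print(significant_words(word_scores, 13))  # ['PYN', 'PFN']
```

Lowering T admits more seeds, which raises sensitivity at the cost of more extensions to evaluate.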
Steps of BLAST

 The fourth step involves pairwise alignment by extending from the words in both directions while counting the alignment score using BLOSUM62.
 The extension continues until mismatches make the alignment score fall by more than a drop threshold below the best score reached so far.
Steps of BLAST

 Fourth step:
 The drop threshold is 22
for proteins and 20 for
DNA.
 The resulting contiguous aligned segment pair without gaps is called a high-scoring segment pair (HSP).
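A rough sketch of ungapped extension for DNA, assuming the drop threshold is measured from the best score seen so far (the usual X-drop rule) and using the +5/−4 scoring and the DNA threshold of 20 quoted above. The function names and example sequences are hypothetical, and this is not how BLAST itself is implemented.

```python
def extend(query, subject, qi, si, step, x_drop, match=5, mismatch=-4):
    """Extend one way from (qi, si); return (best score gain, residues kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # fell too far below the best score: stop extending
        qi += step
        si += step
    return best, best_len


def ungapped_hsp(query, subject, q_start, s_start, k, x_drop=20):
    """Grow a word hit of length k into an ungapped segment pair."""
    seed = sum(5 if query[q_start + i] == subject[s_start + i] else -4
               for i in range(k))
    right, r_len = extend(query, subject, q_start + k, s_start + k, 1, x_drop)
    left, l_len = extend(query, subject, q_start - 1, s_start - 1, -1, x_drop)
    q_seg = query[q_start - l_len:q_start + k + r_len]
    s_seg = subject[s_start - l_len:s_start + k + r_len]
    return seed + left + right, q_seg, s_seg


# The word ACGT matches at position 2 (0-based) in both sequences.
print(ungapped_hsp("GGACGTACCTGA", "TTACGTACCTCC", 2, 2, 4))
# (40, 'ACGTACCT', 'ACGTACCT')
```

The segment is trimmed back to the point where the score peaked, so trailing mismatches are not part of the reported HSP.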
Output

 In the original version of BLAST, the highest-scoring HSPs are presented as the final report.
 They are also called maximum scoring pairs.
 A recent improvement in the implementation of BLAST is
the ability to provide gapped alignment.
 This improved version of BLAST presents HSPs that may
contain gaps.
Gapped BLAST

 In gapped BLAST, the highest-scoring segment is chosen to be extended in both directions using dynamic programming, where gaps may be introduced.
 The extension continues if the alignment score is above a certain threshold; otherwise it is terminated.
 However, the overall score is allowed to drop below the threshold temporarily, provided it rises above the threshold again.
Variants of BLAST

 BLAST is a family of programs that includes BLASTN, BLASTP, BLASTX, TBLASTN, and TBLASTX.
 BLASTN uses nucleotide sequences as queries to search against a nucleotide sequence database.
 BLASTP uses protein sequences as queries to search
against a protein sequence database.
 BLASTX uses nucleotide sequences as queries and
translates them in all six reading frames to produce
translated protein sequences, which are used to query a
protein sequence database.
Reading frame

 In molecular biology, a reading frame is a way of dividing the sequence of nucleotides in a nucleic acid (DNA or RNA) molecule into a set of consecutive, non-overlapping triplets.
 Where these triplets equate to amino acids or stop signals during
translation, they are called codons.
 A single strand of a nucleic acid molecule has a phosphoryl end,
called the 5′-end, and a hydroxyl or 3′-end.
 There are three reading frames that can be read in this 5′→3′
direction, each beginning from a different nucleotide in a triplet.
 In a double stranded nucleic acid, an additional three reading
frames may be read from the other, complementary strand in the
5′→3′ direction.
Reading frames

 From 5′ to 3′ on this strand:
 AGG-TGA-CAC-CGC-AAG-CCT-TAT-ATT-AGC
 A GGT-GAC-ACC-GCA-AGC-CTT-ATA-TTA GC
 AG GTG-ACA-CCG-CAA-GCC-TTA-TAT-TAG C
 From 5′ to 3′ on the complementary strand:
 GCT-AAT-ATA-AGG-CTT-GCG-GTG-TCA-CCT
 G CTA-ATA-TAA-GGC-TTG-CGG-TGT-CAC CT
 GC TAA-TAT-AAG-GCT-TGC-GGT-GTC-ACC T
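The six frames listed above can be reproduced with a short sketch. It assumes an uppercase DNA string containing only A, C, G, and T; the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """The complementary strand, read in its own 5' to 3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def reading_frames(dna: str) -> list:
    """Three frames of one strand, each a list of complete triplets."""
    return [[dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]
            for offset in range(3)]


seq = "AGGTGACACCGCAAGCCTTATATTAGC"
for strand in (seq, reverse_complement(seq)):
    for codons in reading_frames(strand):
        print("-".join(codons))
```

Incomplete triplets at either end of a frame are dropped, so the leftover bases shown on the slide are not printed.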
Variants of BLAST

 TBLASTN uses protein sequences as queries to search against a nucleotide sequence database whose sequences are translated in all six reading frames.
 TBLASTX uses nucleotide sequences, translated in all six frames, as queries to search against a nucleotide sequence database whose sequences are also translated in all six frames.
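The six-frame translation used by BLASTX, TBLASTN, and TBLASTX can be sketched as follows. This assumes the standard genetic code and an uppercase A/C/G/T input; it illustrates the idea only and is not how the BLAST programs are implemented. Stop codons are shown as *.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons, starting at the first base."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> list:
    """Three frames of the given strand, then three of its reverse complement."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]


for protein in six_frame_translation("AGGTGACACCGCAAGCCTTATATTAGC"):
    print(protein)
```

Each of the six translated strings can then be compared with protein sequences in the same way a BLASTP query would be.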
BLAST web server
https://blast.ncbi.nlm.nih.gov/Blast.cgi
The BLAST web server has been designed in such
a way as to simplify the task of program
selection.
The programs are organized based on the type of query sequence: protein, nucleotide, or nucleotide to be translated.
In addition, programs for special purposes are
grouped separately.
BLAST output format

 The BLAST output includes a graphic summary and the uncovered alignments.
 The graphic summary contains colored horizontal bars
that allow quick identification of the number of
database hits and the degrees of similarity of the hits.
 The length of the bars represents the spans of sequence
alignments relative to the query sequence.
 Each hit includes the accession number, title of the
database record, score, and E-value.
BLAST output format

In the alignment section, the query sequence is on the top of the pair and the database sequence, labeled Subject, is at the bottom.
In between the two sequences, matching
identical residues are written out at their
corresponding positions.
Interpreting E-values

 The BLAST output provides a list of pairwise sequence matches ranked by statistical significance.
 The significance scores help to distinguish evolutionarily related sequences from unrelated ones.
 The E-value indicates how likely it is that a given sequence match arose purely by chance: it is the number of matches of at least that quality expected by chance in a search of this size.
 E = m × n × P, where m is the total number of residues in a database, n is the number of residues in the query sequence, and P is the probability that an HSP alignment is a result of random chance.
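The formula can be tried out directly. The database size, query length, and probability below are made-up numbers chosen only to show the arithmetic.

```python
def e_value(m: int, n: int, p: float) -> float:
    """E = m * n * P, as defined on the slide."""
    return m * n * p


# Hypothetical search: 1,000,000 database residues, 250-residue query.
e_small_db = e_value(m=1_000_000, n=250, p=1e-12)
e_large_db = e_value(m=2_000_000, n=250, p=1e-12)

print(e_small_db)               # about 2.5e-04
print(e_large_db / e_small_db)  # about 2: E scales with database size
```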
Interpreting E-values

 The lower the E-value, the more significant the match is.
 If E < 10⁻⁵⁰, the database match is almost certainly a result of homologous relationships.
 If 10⁻⁵⁰ < E < 0.01, the match can be a result of homology.
 If 0.01 < E < 10, the match is considered not significant, but may hint at a tentative remote homology relationship. Additional evidence is needed to confirm the tentative relationship.
 If E > 10, the sequences under consideration are either unrelated or related by extremely distant relationships that fall below the limit of detection of the current method.
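These rules of thumb can be written as a small helper. The slide does not say which category the exact boundary values belong to, so their handling here is an assumption, as are the function name and the wording of the labels.

```python
def interpret_e_value(e: float) -> str:
    """Map an E-value onto the four rule-of-thumb categories."""
    if e < 1e-50:
        return "homologous with very high confidence"
    if e < 0.01:
        return "possibly homologous"
    if e <= 10:
        return "not significant, possible remote homology"
    return "unrelated or beyond the limit of detection"


for e in (1e-80, 1e-5, 0.5, 50):
    print(e, "->", interpret_e_value(e))
```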
Shortcoming of E-value

 Because the E-value is proportional to the database size, as the database grows, the E-value for a given sequence match also increases, and the match loses significance.
 Because the genuine evolutionary relationship between
the two sequences remains constant, the decrease in
credibility of the match as the database grows means
that one may “lose” previously detected homologs as
the database enlarges.
 Thus, an alternative to E-value calculations is needed.
Interpreting bit scores

 The bit score measures sequence similarity independent of query sequence length and database size.
 It is a normalized form of the raw alignment score and increases linearly with it.
 The higher the bit score, the more significant the
match is.
 It provides a constant statistical indicator for
searching databases of different sizes or for
searching the same database at different times.
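The slides do not give a formula for the bit score. The sketch below assumes the standard Karlin-Altschul normalization, S' = (λS − ln K) / ln 2, with the matching E-value E = m × n × 2^(−S'). The values of λ and K depend on the scoring system; the ones used here are placeholders for illustration.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score S into bits (assumed formula)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """E-value implied by a bit score for a search space of m * n."""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041  # placeholder statistical parameters

bits = bit_score(100, LAMBDA, K)
print(round(bits, 1))
print(e_value_from_bits(bits, m=1_000_000, n=250))
print(e_value_from_bits(bits, m=2_000_000, n=250))
```

The bit score is the same for both database sizes; only the E-value derived from it changes.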
FAST-All (FASTA)

 FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985.
 Its legacy is the FASTA format which is now ubiquitous in
bioinformatics.
 FASTA takes a given sequence and searches a
corresponding sequence database by using local
sequence alignment to find similar database sequences.
Steps of FASTA

 The first step is to identify k-tups (words of k consecutive residues) shared between the two sequences by using the hashing strategy.
 We construct a
lookup table that
shows the position of
each k-tup for the
two sequences.
Steps of FASTA

 First step:
 The positional difference
is obtained by
subtracting the position
in the first sequence from
that in the second
sequence and is
expressed as the offset.
 The k-tups that have the
same offset values are
then linked to reveal a
diagonal on a dot plot.
Steps of FASTA

 First step: dot plot of AMPSDGL (across) against GPSDNAT (down). Each • marks an identical residue; the P, S, and D matches line up on one diagonal.

  A M P S D G L
G           •
P     •
S       •
D         •
N
A •
T
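The lookup table and offsets can be sketched for the two sequences in the dot plot, AMPSDGL and GPSDNAT. A k-tup of 1 is assumed so that single residues are compared, positions are counted from 1, and the function names are illustrative.

```python
from collections import defaultdict


def ktup_positions(seq: str, k: int) -> dict:
    """Lookup table: each k-tup mapped to its 1-based positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table


def offsets(seq1: str, seq2: str, k: int) -> dict:
    """Group shared k-tups by offset = position in seq2 - position in seq1."""
    table1 = ktup_positions(seq1, k)
    diagonals = defaultdict(list)
    for word, positions2 in ktup_positions(seq2, k).items():
        for pos2 in positions2:
            for pos1 in table1.get(word, []):
                diagonals[pos2 - pos1].append((word, pos1, pos2))
    return dict(diagonals)


hits = offsets("AMPSDGL", "GPSDNAT", k=1)
for offset, matches in sorted(hits.items()):
    print(offset, matches)

# The offset shared by the most k-tups marks the best diagonal.
print(max(hits, key=lambda d: len(hits[d])))  # -1
```

P, S, and D all have offset −1, so they fall on the same diagonal, while G and A are isolated matches.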
Steps of FASTA

 First step:
 All possible ungapped
alignments are found
between two
sequences.
 Frequently, users take the k-tup to be at least 2 residues long for protein sequences and at least 4 or 6 residues long for nucleotide sequences.
Steps of FASTA

 The second step is to narrow down the high similarity regions between the two sequences.
 The top ten diagonals
are identified as high
similarity regions.
 For amino acid
sequences, the
diagonals are rescored
using a substitution
matrix.
Steps of FASTA

 In the third step, neighboring high-scoring segments are selected and joined to form a single alignment.
 This step allows
introducing gaps
between the diagonals
while applying gap
penalties.
 The score of the gapped
alignment is calculated
again.
Steps of FASTA

 In the fourth step, the gapped alignment is refined further using the Smith–Waterman algorithm to produce a final alignment.
 The last step is to perform
a statistical evaluation of
the final alignment as in
BLAST, which produces
the E-value.
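For reference, a plain Smith–Waterman local alignment can be sketched as below. It assumes simple match/mismatch scores and a linear gap penalty in place of a substitution matrix, all chosen for illustration, and it is not FASTA's own implementation.

```python
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("AMPSDGL", "GPSDNAT"))  # (6, 'PSD', 'PSD')
```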
FASTA web server
https://www.ebi.ac.uk/Tools/sss/fasta/
The FASTA web server is maintained by
EMBL's European Bioinformatics Institute.
The database used for FASTA is UniProt.
Comparison of BLAST and FASTA

 BLAST uses a substitution matrix to find matching words, whereas FASTA identifies identical words using the dot plot.
 By default, FASTA scans smaller window sizes. Thus, it
gives more sensitive results than BLAST, with a better
coverage rate for homologs.
 FASTA is usually slower than BLAST.
 BLAST sometimes gives multiple best-scoring alignments
from the same sequence; FASTA returns only one final
alignment.
