Optimal Alignment and Heuristic Solutions

The document discusses the complexities of constructing substitution matrices for amino acid sequences, emphasizing the importance of identifying conserved blocks and calculating amino acid pair frequencies to create a BLOSUM matrix. It also highlights the computational challenges of optimal alignment algorithms and introduces heuristic solutions like BLAST for faster sequence alignment. The BLAST algorithm involves compiling subsequences, finding matches, extending hits, and evaluating statistical significance to report significant matches in a database.

Uploaded by

samuele hofner

Optimal alignment and heuristic solutions

BLOSUM matrices

When dealing with DNA sequences, simple scoring schemes are usually very effective, since there
are only four possible nucleotides (A, T, C, G) and each substitution can be treated similarly.
For amino acid sequences, however, scoring becomes more complex: there are twenty different amino
acids, each with unique chemical properties, so substituting one amino acid for another has
different effects depending on how similar their properties are.

To study protein sequences, scientists group related sequences into families and identify blocks, which
are regions where sequences are highly similar and well-aligned without gaps; these conserved blocks
help in understanding functional and structural similarities between proteins.
However, constructing a substitution matrix (a table showing how likely one amino acid is to be
replaced by another) is tricky because it is necessary to have an alignment to create the matrix, but the
alignment itself depends on a matrix, creating a circular problem.

To break this cycle, scientists start with a simple scoring system, assigning 1 when two amino acids
match, and 0 when they do not. Using this basic method, they identify blocks and then count how often
each amino acid pair appears together in the same column; these counts help build a more refined
substitution matrix, which can then improve future sequence alignments.

Let’s consider the following example.

The procedure consists of the following steps:

- identifying blocks. The six given sequences (on the rows) are aligned, and the focus is on the
  columns (positions in the block) where the amino acids (A, B, C in this case) appear together.
  These columns help define blocks, which are regions of conserved sequence.
- amino acid counting. The total number of amino acids observed in the block is 6 * 4 = 24, where
  A appears 14 times (14/24), B appears 4 times (4/24), and C appears 6 times (6/24).
- aligned pair counting. Given these four columns, each aligning six sequences, all possible pairs
  in each column are counted. The number of possible pairs per column is C(6, 2) = (6 * 5) / 2 =
  15, so with four columns the total number of aligned pairs is 4 * 15 = 60.
- pairwise frequency calculation. It is now possible to count how often each pair appears
  together:

  AA: 26/60    BB: 3/60
  AB:  8/60    BC: 6/60
  AC: 10/60    CC: 7/60
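
The counting steps above can be sketched in Python. The original example block is not reproduced in this text, so the six sequences below are a hypothetical block, chosen only because it yields the same totals as the worked example (14 A, 4 B, 6 C and the six pair counts listed above). The function names are illustrative as well.

```python
from collections import Counter
from itertools import combinations

# Hypothetical block: 6 sequences (rows) x 4 columns, no gaps.
# It is an assumption chosen to reproduce the totals quoted in the text,
# not the block shown in the original example.
BLOCK = ["AAAA", "AAAB", "AACB", "AACB", "AACC", "ABCC"]

def count_residues(block):
    """Count how many times each amino acid occurs in the whole block."""
    return Counter("".join(block))

def count_pairs(block):
    """Count every unordered pair of residues found in the same column."""
    pairs = Counter()
    for column in zip(*block):                 # iterate over columns
        for x, y in combinations(column, 2):   # C(6, 2) = 15 pairs per column
            pairs["".join(sorted((x, y)))] += 1
    return pairs

residues = count_residues(BLOCK)
pairs = count_pairs(BLOCK)
total_pairs = sum(pairs.values())
for pair in sorted(pairs):
    print(f"{pair}: {pairs[pair]}/{total_pairs}")
```

Sorting each pair before counting makes AB and BA fall into the same bin, which is why the resulting matrix is symmetric.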

Given these counts, it is now possible to determine how likely each amino acid pair is to appear together:

- if AA appears frequently, it means that A is often conserved.
- if AB appears less frequently, it means that substituting A with B is rare.
- if AC appears often, it means that A and C are interchanged more frequently.

Repeating this process for many sequences makes it possible to build a substitution matrix, which
assigns a score to each possible amino acid replacement; this helps future sequence alignments by
providing scores based on real biological data rather than treating all substitutions equally.

Once the frequency of occurrence of each character and of each pair has been calculated, the
observed proportions can be computed by dividing the count of each amino acid pair by the total
number of aligned pairs: for example, the AA pair appears 26 times out of 60 pairs, so its observed
proportion is 26/60. The expected proportions can then be calculated as well: the expected
frequency of a pair is based on how often each individual amino acid appears in the sequences, and
is computed by multiplying the individual frequencies of the two amino acids. For pairs of
different amino acids (like AB), the product is further multiplied by 2, because the pair can occur
in either order (AB or BA):

  e(AA) = p(A) * p(A) = (14/24) * (14/24)
  e(AB) = 2 * p(A) * p(B) = 2 * (14/24) * (4/24)

The ratio of observed proportion vs expected proportion indicates whether an amino acid pair appears
more or less frequently than expected by chance:

- if a pair appears more often than expected, it suggests the substitution is favorable (the amino
  acids of the pair are similar).
- if a pair appears less often than expected, it suggests the substitution is unfavorable (the amino
  acids of the pair are different).

To turn the ratio into a score, its base-2 logarithm is computed, where q is the observed proportion
and e the expected proportion of the pair:

  s(AB) = log2( q(AB) / e(AB) )

This final value is rounded to the nearest integer and used as a score in the substitution matrix.
In this way, the obtained scores can be organized in the BLOSUM matrix, which is a 20 x 20 matrix and
is symmetrical, meaning that the score for AB is the same as for BA. The scores are used in sequence
alignment to reflect how likely one amino acid is to be replaced by another in evolution:

- negative scores reflect less likely substitutions in real biological sequences.
- positive scores reflect more likely substitutions in real biological sequences.
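
As a worked check of the observed/expected/log-odds chain, the sketch below applies it to the counts of the three-letter example. The function name and the `scale` argument are assumptions added for illustration: `scale=1` follows the formula as stated in the text, while published BLOSUM matrices are usually expressed in half-bit units, which would correspond to `scale=2`.

```python
import math

# Counts quoted in the worked example of the text.
RESIDUE_COUNTS = {"A": 14, "B": 4, "C": 6}
PAIR_COUNTS = {"AA": 26, "AB": 8, "AC": 10, "BB": 3, "BC": 6, "CC": 7}

def log_odds_scores(residue_counts, pair_counts, scale=1.0):
    """Return rounded log2(observed / expected) for every pair.

    Assumes every pair count is greater than zero (log2(0) is undefined).
    """
    total_residues = sum(residue_counts.values())
    total_pairs = sum(pair_counts.values())
    p = {r: n / total_residues for r, n in residue_counts.items()}
    scores = {}
    for pair, count in pair_counts.items():
        x, y = pair
        observed = count / total_pairs
        # Different residues can pair in two orders (xy or yx), hence the factor 2.
        expected = p[x] * p[y] * (1 if x == y else 2)
        scores[pair] = round(scale * math.log2(observed / expected))
    return scores

scores = log_odds_scores(RESIDUE_COUNTS, PAIR_COUNTS)
for pair, score in scores.items():
    print(pair, score)
```

With these counts, AB and AC come out negative (observed less often than expected by chance) while BB and CC come out positive.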

When building a substitution matrix, it is necessary to avoid bias from overrepresented sequences,
meaning that if some sequences are too similar, they would unfairly dominate the counts. To balance this,
sequences that are too similar (above a certain threshold) are grouped into clusters.

Indeed, within each block it is preferable to have sequences separated by a comparable evolutionary
distance, so sequences whose similarity exceeds a certain threshold are grouped together. For
example, if the threshold is 85%, then sequences that show identity above 85% form a single
cluster; when occurrences are counted, each sequence belonging to a cluster of n sequences
contributes 1/n, which prevents redundant sequences from skewing the matrix.

If clusters with X% identity are used, the resulting matrix is called BLOSUMX, and this is why
various BLOSUM matrices exist, depending on the level of the threshold. Higher-numbered matrices are
built from sequences that are highly similar (recently diverged), and are useful for comparing
closely related proteins. Lower-numbered matrices, on the other hand, are built from sequences that
are more distantly related (older evolutionary divergence), and are better for detecting distant
homologies.
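
The 1/n weighting can be illustrated with a minimal sketch. It assumes ungapped sequences of equal length, treats "above the threshold" as greater than or equal to it, and uses single-linkage grouping (a sequence joins a cluster if it is similar enough to any member); all three choices and the function names are assumptions for illustration.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percentage of positions carrying the same residue (equal-length, ungapped)."""
    if len(a) != len(b):
        raise ValueError("sequences in a block must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(seqs, threshold):
    """Give each sequence the weight 1/n, where n is the size of its cluster."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the representative of i's cluster.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair that reaches the identity threshold.
    for i, j in combinations(range(len(seqs)), 2):
        if percent_identity(seqs[i], seqs[j]) >= threshold:
            parent[find(i)] = find(j)

    roots = [find(i) for i in range(len(seqs))]
    sizes = {r: roots.count(r) for r in set(roots)}
    return [1.0 / sizes[r] for r in roots]

block = ["AAAAAAAAAA", "AAAAAAAAAB", "ABCABCABCA"]
print(cluster_weights(block, 85))
print(cluster_weights(block, 95))
```

In this toy block the first two sequences are 90% identical, so they share one cluster at the 85% threshold and each counts for one half, while at 95% every sequence keeps full weight.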

In other words, the higher the threshold, the more sequences, spanning different levels of identity,
contribute with full weight to the computation. Conversely, with a lower threshold more sequences
are clustered together and down-weighted, so fewer sequences contribute independently to the
overall computation. The practical consequence is that a BLOSUM matrix with a low number is more
suitable for aligning and comparing sequences that are less related and evolutionarily distant from
each other.

Optimal alignment algorithms

The optimal alignment algorithms seen so far find the best possible alignment between two sequences,
and thus represent a very good solution. However, they are quite expensive computationally, in
terms of both time and space complexity: for two sequences of 1,000 nucleotides each, for instance,
they require on the order of 1,000 * 1,000 = 1,000,000 operations.
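
To make the cost concrete, here is a minimal sketch of a global alignment score computed by dynamic programming. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions for illustration, not necessarily the scheme used elsewhere in these notes; the point is only that one cell is filled for every pair of positions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best global alignment score, number of matrix cells filled)."""
    # Only two rows are kept in memory, which is enough to obtain the score;
    # recovering the alignment itself would need the full matrix.
    prev = [j * gap for j in range(len(b) + 1)]
    cells = 0
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
            cells += 1
        prev = curr
    return prev[-1], cells

score, cells = global_alignment_score("GATTACA", "GATTACA")
print(score, cells)
```

The number of cells is always the product of the two lengths, so two sequences of 1,000 residues each require one million cell updates.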

Optimal alignment algorithms are useful for matching a sequence against a database: for example,
given a protein-coding sequence, it is possible to search public repositories for all similar proteins, in
order to get a clue about its possible function. However, the query sequence should be aligned to each
entry in the repository, and even though it is possible to apply optimal algorithms, a lot of time and a
high computational power are required. In order to overcome this difficulty, heuristic solutions have
been developed, which are approximate solutions that work faster.

While in optimal algorithms the alignment obtained is guaranteed to have the highest score, heuristic
approaches do not offer such a guarantee, but they still work well in practice. This is the case of
BLAST (Basic Local Alignment Search Tool), a set of algorithms developed to align a query sequence
with the many other sequences contained in public databases. The family of algorithms includes five
types of implementation:

- BLASTP, which compares a protein query to a database of proteins.

- BLASTN, which compares both strands of a DNA query against a DNA database.

- BLASTX, which translates a DNA sequence into six protein sequences, using all six possible
  reading frames, and then compares each of these proteins to a protein database.

  Six translations are possible because, given a nucleotide sequence, translation can start not
  only from the first nucleotide, but also from the second or from the third one, thus leading to
  different codons and, consequently, to a different translated amino acid sequence. This means
  that a nucleotide sequence on a single DNA strand contains three possible reading frames. Since
  both DNA strands are considered, six translations are possible in total.

- TBLASTN, which translates every DNA sequence in a database into six potential proteins, and
  then compares the protein query against each of those translated proteins.

- TBLASTX, the most computationally intensive BLAST algorithm, which translates DNA from
  both a query and a database into six potential proteins, and then performs 36 protein-protein
  database searches.
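
The six reading frames used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. The snippet assumes the standard genetic code and an input made only of A, C, G and T (an ambiguous base such as N would raise a `KeyError`); the function names are illustrative and do not come from any BLAST implementation.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)

def six_frame_translations(dna):
    """Translate both strands starting at offsets 0, 1 and 2."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for name, protein in six_frame_translations("ATGGCC").items():
    print(name, protein)
```

Frames +1 to +3 come from the given strand and frames -1 to -3 from its reverse complement, which is where the 6 x 6 = 36 comparisons of TBLASTX originate.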

The BLAST algorithm consists of multiple steps:

- compilation of subsequences: a list of subsequences of length W ("words") is created from the
  query sequence, where each listed word scores at or above a threshold T when aligned with a word
  of the query. In this example, the query word is in blue.
- finding matching database sequences: all database sequences that contain any of the words
  from the compiled list are identified.
- extension of hits: each identified hit (match) is extended in both directions to determine
  whether it is part of a longer alignment. During this extension, the score of the alignment may
  drop below the highest score achieved so far; the drop is tolerated up to a limit specified by
  the value X, and the extension is stopped as soon as the score falls more than X below the
  highest value obtained during the extension process. The resulting extended alignment is
  referred to as a High-scoring Segment Pair (HSP), and identifies a significant region of
  similarity between the query sequence and the database sequence.
- statistical significance evaluation: the statistical significance (expect value) of each match
  found is assessed.
- reporting significant matches: only those matches that satisfy a user-defined significance
  threshold are reported.
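
The seeding and extension steps can be sketched in a simplified form. This is not the BLAST implementation: it assumes exact word matches only (the threshold T and the list of similar words are left out), ungapped extension, and a +1/-1 match/mismatch score; the function names and the default `x_drop` value are assumptions for illustration.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a word of length w matches exactly."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds

def simple_score(a, b):
    """Assumed scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

def extend_one_way(query, subject, qi, si, step, x_drop):
    """Extend without gaps in one direction; return (best gain, length kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += simple_score(query[qi], subject[si])
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell more than X below the best value seen so far
        qi += step
        si += step
    return best, best_len

def extend_seed(query, subject, qi, si, w, x_drop=2):
    """Return (score, query_start, subject_start, length) of the extended hit."""
    seed = sum(simple_score(query[qi + k], subject[si + k]) for k in range(w))
    right, right_len = extend_one_way(query, subject, qi + w, si + w, 1, x_drop)
    left, left_len = extend_one_way(query, subject, qi - 1, si - 1, -1, x_drop)
    return seed + left + right, qi - left_len, si - left_len, left_len + w + right_len

query, subject = "AAGACCTGATT", "CCGACCTGACC"
seeds = find_seeds(query, subject, 4)
hsp = extend_seed(query, subject, 2, 2, 4)
print(seeds)
print(hsp, query[hsp[1]:hsp[1] + hsp[3]])
```

The extension keeps only the part of the alignment up to the best score reached, so trailing mismatches explored before the X-drop limit is hit are not included in the reported segment.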

In the original (1990) implementation of BLAST, all hits were extended in either direction. In a
1997 refinement of BLAST, two independent hits are required instead: the hits must occur in close
proximity to each other. With this modification, only about one seventh as many extensions occur,
which greatly reduces the time required for a search.

In BLAST, raw scores (S) reflect the quality of sequence alignments based on a substitution matrix, with
higher scores indicating better matches between the query and database sequences. However, raw scores
can vary depending on factors like the scoring matrix and database size, so in order to make scores
comparable across different searches, bit scores (S') are used, which normalize the raw scores by
accounting for these factors. This allows bit scores to represent the significance of a match in a way that
can be consistently compared, even when different databases or scoring matrices are used, ensuring a more
reliable evaluation of High-scoring Segment Pairs (HSPs):

  S' = (λ * S - ln K) / ln 2

Here λ and K are statistical parameters related to the specific substitution matrix that was used
and to the width of the search space, respectively.
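
The normalization can be sketched numerically as follows. The bit-score formula is the standard one, and the expect value is computed with the usual relation E = m * n * 2^(-S'), where m and n are taken here to be the query length and the total database length; the function names and the numeric inputs in the usage lines are illustrative assumptions, not parameters taken from the text.

```python
import math

def bit_score(raw_score, lam, k):
    """Convert a raw score S into a bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expect_value(bits, m, n):
    """Expected number of chance hits with at least this bit score: m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)

# Illustrative numbers only: lam and k depend on the scoring system actually used.
bits = bit_score(100, lam=0.267, k=0.041)
print(round(bits, 1))
print(expect_value(bits, m=250, n=5_000_000))
```

Because the expect value grows with the size of the search space, the same bit score is less significant in a larger database.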

BLAST search output

The following is an example of a BLAST search output:

The Query line shows the sequence of interest under investigation, while the Sbjct (subject) line
contains the database sequence it is compared with; the intermediate line between the two reports
the aligned characters that match, and the numbers beside each line indicate the amino acid
positions covered in each sequence. The + sign signals that two amino acids are not the same in the
two sequences, but are similar according to the substitution matrix (the pair has a positive
score); hence, they do not represent a dramatic mismatch and instead possess a certain degree of
similarity. Conversely, when the score in the substitution matrix is not positive, an empty space
is left.
The number of perfect matches (known as identities) is also indicated, as a fraction and as a
percentage (137/147 in this case), and refers to the overall alignment; the count of positives
also includes the perfect matches.
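
The way the intermediate line is built can be sketched as below. The score table is a toy assumption holding a few positive-scoring pairs for illustration; a real search would look the scores up in a full substitution matrix, and the function names are hypothetical.

```python
# Toy table of positive-scoring substitutions (an assumption for illustration).
TOY_SCORES = {frozenset("KR"): 2, frozenset("IV"): 3, frozenset("DE"): 2}

def toy_score(a, b):
    """Return the toy score of a non-identical pair; unlisted pairs score -1."""
    return TOY_SCORES.get(frozenset((a, b)), -1)

def midline(query, subject, score):
    """Build the line shown between Query and Sbjct, with identity/positive counts."""
    chars = []
    identities = positives = 0
    for q, s in zip(query, subject):
        if "-" in (q, s):
            chars.append(" ")        # a gap is never an identity or a positive
        elif q == s:
            chars.append(q)          # identical residues are repeated
            identities += 1
            positives += 1           # positives also include the identities
        elif score(q, s) > 0:
            chars.append("+")        # different but similar residues
            positives += 1
        else:
            chars.append(" ")        # non-positive score: blank
    return "".join(chars), identities, positives

line, identities, positives = midline("MKVLDA", "MRILEG", toy_score)
print("Query  MKVLDA")
print("       " + line)
print("Sbjct  MRILEG")
print(f"Identities = {identities}/6, Positives = {positives}/6")
```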
