Optimal alignment and heuristic solutions
BLOSUM matrices
When dealing with DNA sequences, the simple scoring schemes are usually very effective, since there
are only four possible nucleotides (A, T, C, G), and each substitution can be treated similarly. However,
in the case of amino acid sequences, the scoring method becomes more complex, since there are twenty
different amino acids, each with unique chemical properties; hence, substituting one amino acid for
another may have different effects depending on how similar their properties are.
To study protein sequences, scientists group related sequences into families and identify blocks, which
are regions where sequences are highly similar and well-aligned without gaps; these conserved blocks
help in understanding functional and structural similarities between proteins.
However, constructing a substitution matrix (a table showing how likely one amino acid is to be
replaced by another) is tricky because it is necessary to have an alignment to create the matrix, but the
alignment itself depends on a matrix, creating a circular problem.
To break this cycle, scientists start with a simple scoring system, assigning 1 when two amino acids
match, and 0 when they do not. Using this basic method, they identify blocks and then count how often
each amino acid pair appears together in the same column; these counts help build a more refined
substitution matrix, which can then improve future sequence alignments.
Consider the following example, which proceeds in four steps:
identifying blocks. The six given sequences (one per row) are aligned, and in particular it is
necessary to focus on columns (positions in the block) where the amino acids (A, B, C in this case)
appear together. These columns help define blocks, which are regions of conserved sequence.
amino acid counting. The total number of amino acids observed in the block is 6 * 4 = 24, where
A appears 14 times (14/24), B appears 4 times (4/24), and C appears 6 times (6/24).
aligned pair counting. Given these four columns, each aligning six sequences, it is necessary to
count all possible pairs in each column. The number of possible pairs per column is C(6, 2) =
(6 * 5) / 2 = 15, so with four columns, the total number of aligned pairs is 4 * 15 = 60.
pairwise frequency calculation. Now, it is possible to count how often each pair appears
together.
AA: 26/60 BB: 3/60
AB: 8/60 BC: 6/60
AC: 10/60 CC: 7/60
Given these counts, it is now possible to determine how likely each amino acid pair is to appear together:
if AA appears frequently, it means that A is often conserved.
if AB appears less frequently, it means that substituting A with B is rare.
if AC appears often, it means that A and C are interchangeable more frequently.
Repeating this process for many sequences makes it possible to create a substitution matrix, which assigns scores
to each possible amino acid replacement, helping in future sequence alignments by providing scores
based on real biological data rather than just treating all substitutions equally.
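The counting steps above can be sketched in a few lines of Python. The block used here is hypothetical: the original figure is not reproduced, so six sequences of four columns were chosen only because they yield the residue and pair counts quoted in the text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical block (6 sequences x 4 columns), chosen only so that its
# counts agree with the ones quoted in the text.
block = ["AAAA", "AAAB", "AACB", "AACB", "AACC", "ABCC"]

# Residue counts over the whole block (6 * 4 = 24 residues).
residue_counts = Counter("".join(block))

# Pair counts: every unordered pair of residues within each column.
pair_counts = Counter()
for column in zip(*block):
    for x, y in combinations(column, 2):
        pair_counts["".join(sorted((x, y)))] += 1

total_pairs = sum(pair_counts.values())  # 4 columns * C(6, 2) pairs each

print(dict(residue_counts))
print(dict(pair_counts), total_pairs)
```

Sorting each pair before counting treats AB and BA as the same unordered pair, which is why the resulting matrix is symmetrical.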
Once the frequency of occurrence of each character and of each pair has been calculated, it is possible
to compute the observed proportions, by counting how often each amino acid pair appears in the aligned
sequences. These values are then converted into proportions by dividing the count by the total number
of aligned pairs: for example, the AA pair appears 26 times out of 60 pairs, so its observed proportion is
26/60. Afterwards, the expected proportions can be calculated as well: the expected frequency of a pair is
based on how often each individual amino acid appears in the sequences, and is computed by multiplying
the individual frequencies of the two amino acids. Moreover, for pairs of different amino acids (like AB),
the product is further multiplied by 2, since the pair can occur in either order (AB or BA):
expected(AA) = (14/24) * (14/24) ≈ 0.340
expected(AB) = 2 * (14/24) * (4/24) ≈ 0.194
The ratio of observed proportion vs expected proportion indicates whether an amino acid pair appears
more or less frequently than expected by chance:
if a pair appears more often than expected, it suggests the substitution is favorable (the amino
acids of the pair are similar).
if a pair appears less often than expected, it suggests the substitution is unfavorable (the amino
acids of the pair are different).
To turn the ratio into a score, its logarithm in base 2 is computed:
score = log2(observed proportion / expected proportion)
For the AA pair, for instance, log2((26/60) / (14/24)^2) ≈ 0.35.
This final value is rounded to the nearest integer and used as a score in the substitution matrix.
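As a minimal sketch, the observed/expected ratio and the rounded log-odds score can be computed directly from the counts of the example. The `scale` parameter is an addition for illustration: the text uses a plain base-2 logarithm (scale 1), whereas published BLOSUM matrices are expressed in half-bit units, which corresponds to scale 2.

```python
import math

# Counts taken from the worked example in the text.
residue_counts = {"A": 14, "B": 4, "C": 6}
pair_counts = {"AA": 26, "AB": 8, "AC": 10, "BB": 3, "BC": 6, "CC": 7}

n_residues = sum(residue_counts.values())  # 24
n_pairs = sum(pair_counts.values())        # 60


def log_odds(pair, scale=1):
    """Return scale * log2(observed / expected) for an unordered pair."""
    x, y = pair
    observed = pair_counts[pair] / n_pairs
    expected = (residue_counts[x] / n_residues) * (residue_counts[y] / n_residues)
    if x != y:
        expected *= 2  # the pair can occur as xy or yx
    return scale * math.log2(observed / expected)


# Rounded scores, as they would be entered in the substitution matrix.
scores = {pair: round(log_odds(pair)) for pair in pair_counts}
print(scores)
```

With these counts, AB and AC come out negative even though AC is observed more often than AB, because A is so abundant that both pairs occur less often than expected by chance.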
In this way, the obtained scores can be organized in the BLOSUM matrix, which is a 20 x 20 matrix and
is symmetrical, meaning that the score for AB is the same as for BA. The scores are used in sequence
alignment to determine how likely one amino acid is to be replaced by another in evolution:
negative scores reflect less likely substitutions in real biological sequences.
positive scores reflect more likely substitutions in real biological sequences.
When building a substitution matrix, it is necessary to avoid bias from overrepresented sequences,
meaning that if some sequences are too similar, they would unfairly dominate the counts. To balance this,
sequences that are too similar (above a certain threshold) are grouped into clusters.
Indeed, within each block it is preferable to have sequences separated by a comparable amount of
evolutionary distance, meaning that sequences whose similarity exceeds a certain threshold can be
grouped together. For example, if the threshold is 85%, then sequences that show identity above 85%
form a single cluster; it follows that, when occurrences are counted, each sequence that belongs to a
cluster of n sequences contributes as 1/n, preventing redundant sequences from skewing the matrix.
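The 1/n weighting can be illustrated with a small sketch. It assumes single-linkage clustering (a sequence joins a cluster if it reaches the threshold with any member) and uses `>=` for the comparison; both choices, the function names, and the toy sequences are assumptions made for the example.

```python
from collections import Counter


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(sequences, threshold):
    """Weight each sequence by 1/n, where n is the size of its cluster."""
    parent = list(range(len(sequences)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of sequences whose identity reaches the threshold.
    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            if identity(sequences[i], sequences[j]) >= threshold:
                parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(len(sequences)))
    return [1 / sizes[find(i)] for i in range(len(sequences))]


# The first two toy sequences are 90% identical, so they share one cluster.
weights = cluster_weights(["AAAAAAAAAA", "AAAAAAAAAB", "CCCCCCCCCC"], 0.85)
print(weights)
```

Each cluster contributes a total weight of 1 to the counts, regardless of how many near-identical sequences it contains.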
If clusters with X% identity are used, then the resulting matrix is called BLOSUMX (for example,
BLOSUM62 for a 62% threshold), and this is why various types of BLOSUM matrices exist, depending on
the level of the threshold. Larger-numbered matrices are built from sequences that are highly similar
(recently diverged), and are useful for comparing closely related proteins. On the other hand,
small-numbered matrices are built from sequences that are more distantly related (older evolutionary
divergence), and are better for the detection of distant homologies.
Hence, the higher the threshold, the fewer sequences are merged into clusters, and the more sequences,
including fairly similar ones, contribute with full weight to the computation. Conversely, if the
threshold is lower, more sequences will be clustered and down-weighted, so the counts are dominated by
comparisons between more divergent sequences. The consequence, from a practical standpoint, is that a
BLOSUM matrix with a low number is more suitable to align and compare sequences that are less related
and are far from each other from an evolutionary point of view.
Optimal alignment algorithms
The optimal alignment algorithms seen so far find the best possible alignment between two sequences,
and thus represent a very good solution. However, they are quite expensive from a computational
standpoint, in terms of both time and space complexity: the dynamic-programming matrix has one cell
for each pair of positions, so for two sequences of 1,000 nucleotides each, they require on the order
of 1,000 * 1,000 = 1,000,000 operations.
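To make the quadratic cost concrete, here is a minimal sketch of a score-only global alignment (Needleman-Wunsch style) that also counts how many matrix cells it fills. The scoring values (match 1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return (optimal global score, number of matrix cells filled)."""
    # Only two rows are kept in memory; time still grows as len(a) * len(b).
    previous = [j * gap for j in range(len(b) + 1)]
    cells = 0
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
            cells += 1
        previous = current
    return previous[-1], cells


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

The cell count is always the product of the two lengths, so aligning a query against every entry of a large database multiplies this cost by the number of entries.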
Optimal alignment algorithms are useful for matching a sequence against a database: for example,
given a protein-coding sequence, it is possible to search public repositories for all similar proteins, in
order to get a clue about its possible function. However, the query sequence should be aligned to each
entry in the repository, and even though it is possible to apply optimal algorithms, a lot of time and a
high computational power are required. In order to overcome this difficulty, heuristic solutions have
been developed, which are approximate solutions that work faster.
Whereas in optimal algorithms the alignment obtained is guaranteed to have the highest score, heuristic
approaches do not offer such a guarantee, but they still work well in practice. This is the case of BLAST
(Basic Local Alignment Search Tool), a set of algorithms that have been developed in order to obtain
the alignment of the query sequence with many other sequences contained in public databases. The family
of algorithms includes five main programs:
BLASTP, which compares a protein query to a database of proteins.
BLASTN, which compares both strands of a DNA query against a DNA database.
BLASTX, which translates a DNA sequence into six protein sequences, using all six possible
reading frames, and then compares each of these proteins to a protein database.
Six translations are possible because, given a nucleotide sequence, translation can start not
only from the first nucleotide, but also from the second nucleotide, or from the third one,
thus leading to different codons and, consequently, to a different amino acid sequence.
This means that a nucleotide sequence on a single DNA strand contains three possible
reading frames. In this case, six possible translations may occur, since both DNA
strands are considered.
TBLASTN, which translates every DNA sequence in a database into six potential proteins, and
then compares the protein query against each of those translated proteins.
TBLASTX, the most computationally intensive BLAST algorithm, which translates DNA from
both a query and a database into six potential proteins, and then performs 36 protein-protein
database searches.
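The six-frame translation used by BLASTX, TBLASTN and TBLASTX can be sketched as follows. The code assumes the standard genetic code and an uppercase DNA string containing only A, C, G and T; the function names and the frame labels (+1 to +3, -1 to -3) are chosen for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Translate three reading frames on each of the two strands."""
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCAAATAG").items():
    print(name, protein)
```

Each frame yields a different protein sequence from the same DNA, which is why a translated search has to compare all six against the database.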
The BLAST algorithm consists of multiple steps:
compilation of subsequences: the query sequence is split into subsequences of length W (“words”), and
for each of them a list is compiled of the words that score at or above a threshold T when aligned
with it. In this example, the query word is in blue.
finding matching database sequences: all database sequences that contain any of the words
from the compiled list are identified.
extension of hits: each identified hit (match) is extended in both directions to determine if it is
part of a longer alignment. During this extension, the score of the alignment might drop from the
highest score achieved so far; however, the drop is allowed up to a certain limit, specified by the
value X; the alignment continues to be considered valid as long as the score does not drop too
drastically. The extended alignment, after considering the score drop, is referred to as a
High-scoring Segment Pair (HSP), and identifies a significant region of similarity between the
query sequence and the database sequence.
The extension is stopped as soon as the score decreases by more than X when compared with the
highest value obtained during the extension process.
statistical significance evaluation: the statistical significance (expect value) of each match found
is assessed.
reporting significant matches: only those matches that satisfy a user-defined significance threshold
are reported.
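The seeding and extension steps can be illustrated with a deliberately simplified sketch. It assumes exact word matches on nucleotide sequences (no neighbourhood words or threshold T), ungapped extension, and a toy scoring of +1 per match and -1 per mismatch; the function names and parameter defaults are hypothetical, and the statistical evaluation is left out.

```python
def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) for every exact shared word of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend_one_way(query, subject, i, j, step, x_drop, match, mismatch):
    """Extend without gaps from (i, j); return (best gain, length at best)."""
    score = best = best_length = length = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > x_drop:
            break  # the score fell more than X below the best seen so far
        i += step
        j += step
    return best, best_length


def extend_hit(query, subject, i, j, w, x_drop=2, match=1, mismatch=-1):
    """Extend a seed in both directions; return (score, query part, subject part)."""
    right, right_len = extend_one_way(query, subject, i + w, j + w, 1,
                                      x_drop, match, mismatch)
    left, left_len = extend_one_way(query, subject, i - 1, j - 1, -1,
                                    x_drop, match, mismatch)
    score = w * match + left + right
    return (score,
            query[i - left_len:i + w + right_len],
            subject[j - left_len:j + w + right_len])


query, subject = "CCACGTTGCAA", "GGACGTAGCTT"
for i, j in find_seeds(query, subject, w=4):
    print((i, j), extend_hit(query, subject, i, j, w=4))
```

In this toy case the single seed ACGT is extended to the right across one mismatch, because the score recovers before dropping more than X below its maximum; the extension is trimmed back to the position where the best score was reached.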
In the original (1990) implementation of BLAST, all hits were extended in either direction. In a 1997
refinement of BLAST, two independent hits are required instead: an extension is triggered only when
two hits occur in close proximity to each other. With this modification, only about one seventh as
many extensions occur, greatly reducing the time required for a search.
In BLAST, raw scores (S) reflect the quality of sequence alignments based on a substitution matrix, with
higher scores indicating better matches between the query and database sequences. However, raw scores
can vary depending on factors like the scoring matrix and database size, so in order to make scores
comparable across different searches, bit scores (S') are used, which normalize the raw scores by
accounting for these factors. This allows bit scores to represent the significance of a match in a way that
can be consistently compared, even when different databases or scoring matrices are used, ensuring more
reliable evaluation of High-scoring Segment Pairs (HSPs):
S' = (λS - ln K) / ln 2
λ and K refer to the specific substitution matrix that was used and to the width of the search space
(number of sequences in the target database), respectively.
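The normalization can be sketched numerically using the standard relations S' = (λS - ln K) / ln 2 and E = m * n * 2^(-S'), where m and n are the query and database lengths. The λ and K values below are illustrative (close to those commonly reported for ungapped BLOSUM62 scoring) and should be treated as assumptions, as should the function names.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2 ** (-bits)


# Illustrative parameters only; real values depend on the scoring system.
LAMBDA, K = 0.3176, 0.134

bits = bit_score(100, LAMBDA, K)
print(round(bits, 2))
print(expect_value(bits, query_length=250, database_length=50_000_000))
```

Because the bit score already absorbs λ and K, the expect value depends only on the bit score and the size of the search space: doubling the database doubles E for the same alignment.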
BLAST search output
The following is an example of a BLAST search output:
The Query line represents the sequence of interest that is being investigated, while the Sbjct (subject)
line contains the database sequence that is used for comparison; the intermediate line
between these two reports the aligned characters that match, and the numbers at the two ends of each
line indicate the positions of the first and last residues shown for that sequence. The + sign signals
that two amino acids are not the same in the two sequences, but are similar according to the
substitution matrix (the pair has a positive score); hence, they do not represent a dramatic mismatch
and instead possess a certain degree of similarity. Conversely, when the score of the substitution
matrix is not positive, an empty space is left.
The number of perfect matches (known as identities) is also indicated (137/147 in this case, about
93%), and refers to the overall alignment; the count of positives also includes perfect matches.
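The middle line and the identity and positive counts can be reproduced with a short sketch. Instead of a full substitution matrix, it uses a small hypothetical set of non-identical pairs assumed to score positively (they do in BLOSUM62); the function names and the toy sequences are also assumptions.

```python
# Toy stand-in for a substitution matrix: non-identical pairs assumed to
# have a positive score. A real implementation would look up every pair.
POSITIVE_PAIRS = {frozenset("KR"), frozenset("DE"), frozenset("IV"),
                  frozenset("IL"), frozenset("ST")}


def midline(query, sbjct):
    """Build the line shown between Query and Sbjct in BLAST output."""
    chars = []
    for q, s in zip(query, sbjct):
        if q == s and q != "-":
            chars.append(q)    # identity: the residue itself is shown
        elif frozenset((q, s)) in POSITIVE_PAIRS:
            chars.append("+")  # different residues with a positive score
        else:
            chars.append(" ")  # non-positive score or gap
    return "".join(chars)


def summarize(query, sbjct):
    """Return (midline, identities, positives, alignment length)."""
    mid = midline(query, sbjct)
    identities = sum(c.isalpha() for c in mid)
    positives = identities + mid.count("+")  # positives include identities
    return mid, identities, positives, len(mid)


mid, identities, positives, length = summarize("KVLDA", "RVIEG")
print(repr(mid))
print(f"Identities = {identities}/{length}, Positives = {positives}/{length}")
```

Counting identities as part of the positives mirrors the way the two figures are reported in the output header.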