CHAPTER 5
MSA and methods
TYPES OF SEQUENCE ALIGNMENTS
• Pair-wise alignment
  ➢ Dot matrix method
  ➢ Dynamic programming
  ➢ Word methods
• Multiple sequence alignment
  ➢ Dynamic programming
  ➢ Progressive methods
  ➢ Iterative methods
MULTIPLE SEQUENCE ALIGNMENT
• A multiple sequence alignment is an alignment of N > 2 sequences obtained by inserting gaps ("-") into the sequences such that the resulting sequences all have length L and can be arranged in a matrix of N rows and L columns, where each column represents a homologous position.
• The principle is that multiple alignments are achieved by successive application of pairwise methods.
PURPOSE OF MSA
• To characterize protein families by identifying shared regions of homology in a multiple sequence alignment.
• Determination of the consensus sequence of several aligned sequences.
• Consensus sequences can help to develop a sequence "fingerprint" (motif), which allows the identification of members of a distantly related protein family.
• MSA can help reveal biological facts about proteins, such as their secondary/tertiary structure.
TYPES OF MSA
Dynamic programming approach
• Computes an optimal alignment for a given score function. Because its running time grows exponentially with the number of sequences, it is not typically used in practice.
Progressive method
• This approach repeatedly aligns two sequences, two alignments, or a sequence with an alignment.
Iterative method
• Works similarly to progressive methods but repeatedly realigns the initial sequences as well as adding new sequences to the growing MSA.
PROGRESSIVE ALIGNMENT
• The most widely used approach.
• Builds up a final MSA by combining pairwise alignments, beginning with the most similar pair and progressing to the most distantly related.
• Progressive alignment methods require two stages:
  ➢ A first stage in which the relationships between the sequences are represented as a tree, called a guide tree.
  ➢ A second stage in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree.
MSA USING CLUSTAL W
• Works by progressive alignment.
• ClustalW was introduced by Julie D. Thompson, Desmond G. Higgins, and Toby J. Gibson (EMBL and EBI).
• Most closely related sequences are aligned first, and then additional sequences and groups of sequences are added, guided by the initial alignments.
• Uses alignment scores to produce a phylogenetic tree (the guide tree).
• Aligns the sequences sequentially, guided by the phylogenetic relationships indicated by the tree.
• Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure.
• Clustal W is a general purpose multiple sequence alignment program for DNA or proteins.
• It produces biologically meaningful multiple sequence alignments of divergent sequences.
• It calculates the best match for the selected sequences and lines them up so that the identities, similarities, and differences can be seen.
• The algorithm ClustalW uses provides a close-to-optimal result almost every time.
• It does exceptionally well when the data set contains sequences with varied degrees of divergence.
• Neighbor-Joining method used to calculate guide tree
  ➢ Less sensitive to unequal evolutionary rates in different branches.
  ➢ Significance: branch lengths are used to derive sequence weights.
  ➢ Accuracy of distance calculations for guide tree:
    • Tree constructed from pairwise distance matrix
    • Fast approximate alignment
    • Full dynamic programming
    • User selectable
Algorithm
Basic method:
1. Distance matrix is calculated
  • Distances are derived from pairwise alignment scores
  • Gives divergence of each pair of sequences.
2. Guide tree built from distance matrix
3. Progressive alignment according to guide tree
  • Branching order of tree specifies alignment order
  • Alignment progresses from leaves to root.
Distance matrix/pairwise alignments phase:
• Two choices: fast approximation or DP
• Fast approximation:
  • Definition: a k-tuple match is a run of k identical residues, typically
    • 1 to 2 for proteins
    • 2 to 4 for nucleotide sequences
  • Scores are calculated as: (number of k-tuple matches) – (fixed penalty per gap)
  • Score is initially calculated as a percent identity score.
  • Distance = 1.0 – (score/100)
Distance matrix/pairwise alignments phase:
• Full DP alignment
  • Alignment uses:
    1. gap opening penalties
    2. gap extension penalties
    3. full amino acid weight matrix.
  • Scores are calculated as: (#identities)/(#residues), gaps not included
  • Score is initially calculated as a percent identity score.
  • Distance = 1.0 – (score/100)
WORKING OF CLUSTAL W
• First perform all possible pairwise alignments.
• Calculate the 'distance' between each pair of sequences based on these isolated pairwise alignments.
• Generate a distance matrix.
• Generate a Neighbor-Joining 'guide tree' from these pairwise distances.
• This guide tree gives the order in which the progressive alignment will be carried out.
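The conversion from a percent identity score to a distance, Distance = 1.0 – (score/100), can be sketched in a few lines. This is only an illustration of the formula, not ClustalW's own code: the function names and the toy sequences are assumptions, and the pairwise alignments are taken as already computed.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity of two aligned sequences; columns with a gap are not counted."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # gaps not included, as in the full DP scoring rule
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_distance(a: str, b: str) -> float:
    """Distance = 1.0 - (score / 100), with score a percent identity."""
    return 1.0 - percent_identity(a, b) / 100.0


# Hypothetical, already-aligned pairs (one isolated pairwise alignment per pair).
pairs = {
    ("s1", "s2"): ("ACGT-A", "ACGTTA"),
    ("s1", "s3"): ("ACGTA", "ACCTA"),
    ("s2", "s3"): ("ACGTTA", "ACCT-A"),
}
distances = {names: identity_distance(*aln) for names, aln in pairs.items()}
for names, d in distances.items():
    print(names, round(d, 3))
```

The resulting dictionary plays the role of the distance matrix from which the guide tree would be built.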
• As far as possible, try to align
sequences of similar length.
• PileUp can align sequences of up
to 5000 residues, with 2000 gaps
(total 7000 characters).
• PileUp is a good program only for
similar (close) sequences.
• PileUp does global multiple
alignment, and therefore is good
for a group of similar sequences.
• PileUp will fail to find the best
local region of similarity (such as
a shared motif) among distantly
related sequences.
• PileUp always aligns all of the
sequences you specified in the
input file, even if they are not
related.
• The alignment can be degraded if
some of the sequences are only
distantly related.
PileUp Algorithm
• Each sequence is compared with every other sequence. A similarity score is calculated for each pair based on the number of matching residues.
• A consensus sequence is created by selecting the most frequent residue at each position. If there's a tie, a gap or a wildcard character (e.g., X) is inserted.
• The consensus sequence is compared with each individual sequence. The alignment is adjusted based on the similarities and differences observed. This process is repeated until no further changes are made.
• Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment.
• PileUp parameters: 2 gap penalties (gap insert and gap extend) and an amino acid comparison matrix (e.g. BLOSUM62).
• PileUp will refuse to align
sequences that require too many
gaps or mismatches.
• PileUp will take quite a while to
align more than about 10
sequences.
• Clustal is a stand-alone (i.e. not
integrated into GCG*) multiple
alignment program that is superior in
some respects to PileUp.
• Works by progressive alignment: it
aligns a pair of sequences then aligns
the next one onto the first pair.
• Most closely related sequences are
aligned first, and then additional
sequences and groups of sequences
are added, guided by the initial
alignments.
• Uses alignment scores to produce a
phylogenetic tree.
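The consensus step from the PileUp algorithm description earlier (most frequent residue per column, wildcard on a tie) might look like the sketch below. The function name, the choice of "X" as the wildcard, and the toy rows are assumptions for illustration. Gap characters are counted like any other symbol here, which may differ from what a real implementation does.

```python
from collections import Counter


def consensus(alignment, tie_char="X"):
    """Most frequent character per column; tie_char where the top count is shared."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    result = []
    for column in zip(*alignment):
        counts = Counter(column).most_common()
        top_char, top_count = counts[0]
        # A tie exists when the second most common character has the same count.
        tied = len(counts) > 1 and counts[1][1] == top_count
        result.append(tie_char if tied else top_char)
    return "".join(result)


rows = ["ACGT", "ACGA", "TCGA", "TCCT"]  # hypothetical aligned rows
print(consensus(rows))  # prints XCGX
```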
Advantages and Disadvantages
• The algorithm is easy to understand
and implement.
• It's generally faster than more
complex methods like ClustalW.
• PileUp works well for alignments
involving a relatively small number of
sequences.
• It may not be as accurate as more
sophisticated methods, especially for
complex alignments with many gaps
or insertions.
• PileUp is a simple and efficient
method for multiple sequence
alignment, suitable for basic tasks
and smaller datasets. However, for
more complex alignments or
advanced features, more
sophisticated methods like ClustalW
may be preferable.
1) Define or compare the species tree, gene tree, and protein tree:
Species trees represent the evolutionary relationships between species, tracking speciation
events that create new species over time. Gene trees, on the other hand, trace the evolutionary
history of specific genes, showing how they diverge through duplication or speciation. The
topology of gene trees can differ from species trees due to gene duplication events that occur
independently of speciation. Protein trees follow a similar path to gene trees but focus on the
evolutionary history of proteins, often constructed from protein-coding sequences. These trees
may show evolutionary patterns like horizontal gene transfer or gene duplication that affect
protein evolution. In summary, while species trees map the macroevolution of species, gene
and protein trees illustrate microevolutionary changes at the genetic and molecular levels,
sometimes revealing different evolutionary paths due to gene-specific factors like duplication or
recombination.
2) How speciation events contribute to gene duplication:
Speciation creates new species through reproductive isolation, leading to divergence in genetic
makeup. Gene duplication is a separate process that produces additional copies of a gene within a
genome; the resulting copies are paralogs. The two processes occur independently, but they
interact: when a duplication precedes a speciation event, both daughter species inherit both
copies, so speciation propagates the duplicates across the newly formed lineages. The paralogs
may then evolve new functions or retain the original one, and their fates may differ between
species, which leads to gene trees that differ from the species tree. In this way, speciation does
not cause duplication directly but shapes its consequences by allowing gene copies to diverge in
different lineages. This divergence can result in new traits or functions, increasing the
organism's adaptive potential.
3) What is horizontal gene transfer, what challenges can it create during phylogenetic
analysis?
Horizontal gene transfer (HGT) is the movement of genetic material between organisms outside
the traditional parent-offspring inheritance. Unlike vertical transmission, where genes are
passed down through generations, HGT allows genes to move between different species,
sometimes across broad taxonomic boundaries. This process plays a significant role in
prokaryotes, contributing to rapid adaptation and evolution.
In phylogenetic analysis, HGT poses challenges because it disrupts the vertical inheritance
model that these analyses typically assume. When genes are horizontally transferred, they do
not follow the organism's lineage, leading to incongruities between gene trees and species
trees. This makes it difficult to infer evolutionary relationships accurately. Phylogenetic trees
built on the assumption of vertical descent may misrepresent the evolutionary history by
clustering unrelated species together, confusing true evolutionary signals with artifacts caused
by gene transfer.
4) Example of higher sequence conservation in multiple different species:
An excellent example of higher sequence conservation across species is glyceraldehyde-3-
phosphate dehydrogenase (GAPDH). This enzyme is crucial in glycolysis and is highly conserved
across a wide range of species, from humans and plants to bacteria and archaea. Multiple
sequence alignment shows that the amino acid sequences of GAPDH maintain remarkable
similarity across these species. Despite evolutionary divergence, essential proteins like GAPDH
are preserved because of their critical functional roles. Sequence conservation across diverse
organisms reflects the fundamental importance of these proteins in cellular processes, and
conserved sequences often point to functionally important regions of the protein.
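As a rough illustration of how conservation can be read off an alignment, the sketch below scores each column by the fraction of sequences sharing its most common residue. The three short rows are invented for the example and are not real GAPDH sequences; the function name and the treatment of gaps (a gap never counts as a match) are also assumptions.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of rows that share the most common residue in each column."""
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # gaps never count as a match
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores


toy = ["GKVKV", "GKIKV", "GRV-V"]  # hypothetical aligned fragments
scores = column_conservation(toy)
conserved = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved)  # prints [0, 4]
```

Fully conserved columns (score 1.0) are the natural first candidates when looking for functionally important positions.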
5) Provide an enzyme that is conserved across the kingdom. Explain information sites:
The enzyme glyceraldehyde 3-phosphate dehydrogenase (GAPDH) is highly conserved across
various kingdoms, including animals, plants, bacteria, and archaea. GAPDH is central to
glycolysis, a process critical for energy production. Despite evolutionary divergence, the
protein's essential role in metabolism has led to a high degree of sequence conservation, as
seen in multiple species alignments. Information sites refer to regions within DNA or protein
sequences that are particularly important for function or structure. In proteins, these are typically active
sites, binding sites, or motifs essential for catalytic activity. In the case of GAPDH, the active site
residues involved in its catalytic function tend to be highly conserved, as alterations would likely
lead to loss of function, providing insight into evolutionary pressures. These conserved regions
across species help infer functional and evolutionary relationships among organisms.
6) How minimum changes are equivalent to avoiding homoplasy:
Minimizing changes in phylogenetic tree construction is critical to avoiding homoplasy—similar
traits arising independently in unrelated lineages, often due to convergent or parallel evolution.
Homoplasy can create misleading signals, suggesting evolutionary relationships where none
exist. By seeking the tree that requires the fewest evolutionary changes (the principle of
parsimony), researchers aim to reduce the possibility of homoplasy. This approach ensures that
observed similarities between species are due to common ancestry rather than independent
evolutionary events. Therefore, the parsimony method strives for simplicity, assuming that fewer
evolutionary changes imply fewer opportunities for convergent evolution, thus minimizing
homoplasy.
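To make "the tree that requires the fewest evolutionary changes" concrete, the sketch below counts the minimum number of changes on a fixed tree using Fitch's algorithm. The text above does not name a specific counting method, so Fitch's algorithm is an assumption here, as are the nested-tuple tree encoding and the toy data.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree   -- a leaf name (str) or a pair (left, right) of subtrees
    states -- dict mapping each leaf name to its character at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # children can agree on a state: no extra change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the per-site minimum changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "ATC", "D": "TTC"}  # hypothetical data
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))  # prints 3
print(parsimony_score(tree_2, alignment))  # prints 5
```

On `tree_2` the second and third sites each need two changes instead of one, so the same state has to arise twice independently: that extra cost is homoplasy, and parsimony prefers `tree_1`.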
7) Difference between maximum parsimony and likelihood commonality:
Maximum parsimony (MP) and maximum likelihood (ML) are two common methods used in
phylogenetic tree construction. MP seeks the tree with the fewest evolutionary changes,
assuming the simplest explanation (Occam’s Razor). It’s based on minimizing the number of
mutations required to explain the observed data, focusing on character changes (e.g.,
nucleotide or amino acid substitutions). MP is effective when evolutionary changes are rare but
can be less accurate in cases with high mutation rates or complex evolution.
ML, on the other hand, is a statistical approach that assigns probabilities to trees based on how
likely the observed data would be under different models of evolution. It incorporates more
complex assumptions, such as varying mutation rates across different branches or characters.
ML tends to be more computationally intensive but is often preferred when evolutionary
changes are frequent, as it accounts for the likelihood of various scenarios rather than
minimizing changes. While MP focuses on simplicity, ML leverages probabilistic models to
optimize tree accuracy.
**8) What is homoplasy? How is it related to the maximum parsimony method?**
Homoplasy refers to similarities in traits between different species that are not due to common
ancestry but result from convergent evolution, parallel evolution, or evolutionary reversals.
These traits arise independently due to similar selective pressures in different lineages. In the
context of maximum parsimony (MP), homoplasy can be problematic because MP assumes
that the tree with the fewest evolutionary changes is the most accurate. However, homoplasy
can result in misleading signals, making unrelated species appear more closely related.
MP tries to minimize the number of character changes (like mutations), but it doesn’t account
for the possibility that similar traits might evolve more than once independently. This means
homoplasy can violate the MP method's assumption, leading to incorrect tree topologies.
Therefore, while MP is useful for simple evolutionary histories, it may struggle with complex
cases involving homoplasy.
**9) How homoplasy violates the assumption of maximum parsimony having fewer
changes in the tree?**
The principle of maximum parsimony assumes that the tree requiring the fewest evolutionary
changes is the correct one. This assumption is based on the belief that evolution generally
follows the simplest path. However, homoplasy violates this assumption because similar traits
can evolve independently in different lineages. This independent evolution can result in more
changes than initially expected, leading the maximum parsimony method to incorrectly infer
fewer changes than what actually occurred. Homoplasy can mislead the algorithm into
grouping species together based on convergent or parallel evolution rather than shared
ancestry, thus violating the core assumption of fewer changes in the parsimony model.
**10) What is homoplasy? How is it carcinogenic in nature?**
Homoplasy refers to the appearance of similar traits in species due to convergent evolution
rather than common ancestry. In a biological context, homoplasy itself isn't directly
carcinogenic, but the concept has parallels in cancer development. Carcinogenesis often
involves multiple, independent mutations leading to similar outcomes in cells, such as
uncontrolled growth. Just as homoplasy leads to similar traits through different evolutionary
paths, cancer cells can develop similar malignant characteristics through different genetic
mutations. These mutations, like homoplasy, arise independently but result in the same
phenotype (tumor formation), making it challenging to pinpoint a single origin for the disease.
**11) Basic assumptions of maximum parsimony and maximum likelihood:**
The maximum parsimony (MP) method assumes that the simplest evolutionary path, requiring
the fewest character changes, is the most likely one. It operates on the idea that evolution tends
to favor minimal changes and that fewer mutations or substitutions have occurred across
lineages. MP doesn’t incorporate any model of evolution beyond this simplicity principle.
Maximum likelihood (ML), in contrast, assumes that evolutionary changes follow specific
probabilistic models, such as varying rates of mutation across sites or lineages. ML estimates
the likelihood of a given tree based on how well it explains the observed data under these
models. The fundamental assumption in ML is that evolutionary processes are complex and can
be mathematically modeled to determine the most likely tree.
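The idea of scoring data under a probabilistic model can be shown on the smallest possible case: two aligned DNA sequences and a single branch length. The sketch assumes the Jukes-Cantor (JC69) substitution model, which the text above does not name; the sequences and function names are likewise invented for illustration.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned, gap-free DNA sequences at distance d (JC69).

    d is the expected number of substitutions per site and must be > 0.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e  # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e  # probability of one specific different base
    total = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the base in the first sequence.
        total += math.log(0.25 * (p_same if x == y else p_diff))
    return total


def jc69_distance(seq_a, seq_b):
    """Closed-form maximum likelihood distance under JC69."""
    p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


a = "ACGTACGTAC"  # hypothetical sequences differing at 2 of 10 sites
b = "ACGTACGTTT"
grid = [i / 1000 for i in range(1, 2001)]
best = max(grid, key=lambda d: jc69_log_likelihood(a, b, d))
print(best, jc69_distance(a, b))
```

The grid search and the closed-form estimate agree to within the grid spacing. A full ML tree search applies the same principle to every branch of every candidate tree, which is why it is so much more expensive than parsimony.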
**12) Purpose and application of multiple sequence alignment (MSA):**
Multiple sequence alignment (MSA) arranges sequences (DNA, RNA, or proteins) to identify
regions of similarity. The primary purpose is to infer functional, structural, or evolutionary
relationships. MSA highlights conserved regions across multiple sequences, which can indicate
important functional or structural roles, such as active sites in enzymes or binding domains in
proteins.
MSA is widely applied in phylogenetic analysis to construct evolutionary trees, in comparative
genomics to identify conserved motifs, and in protein structure prediction. By identifying
conserved sequences, researchers can predict the function of unknown sequences or trace
evolutionary relationships among species. MSA is essential for detecting orthologs and
paralogs, guiding studies on gene families, and understanding molecular evolution.
**13) What kind of changes are needed to make an unrooted tree into a rooted tree? What
information is required?**
To convert an unrooted tree into a rooted tree, one needs to introduce a root that indicates the
common ancestor of all taxa in the tree. This requires additional information, typically in the
form of an outgroup—a species or group known to be more distantly related to all the other
species in the tree. By including an outgroup, you can infer the direction of evolutionary changes
and establish the root of the tree, providing insight into the evolutionary lineage of the included
taxa.
Without an outgroup or a comparable rooting assumption, the tree remains unrooted, meaning it
cannot show ancestor-descendant relationships or the order of divergence among species.
**14) How branch length helps to obtain evolutionary relationships; taxa names and
explanations:**
Branch length in a phylogenetic tree represents the amount of evolutionary change or time that
has passed. Longer branches indicate more changes or a longer time since divergence, while
shorter branches imply fewer changes. By comparing branch lengths, one can infer how closely
or distantly related different taxa are.
For instance, in a tree comparing species A, B, and C, if species A and B have shorter branches
between them than either does with C, A and B are more closely related. The branch length
helps to visualize the rate of evolution or genetic distance among the taxa. This information is
crucial in understanding both the timing of divergence events and the amount of evolutionary
change.
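One way to compare taxa by branch length is to add up the branches on the path that connects them. The sketch below does this for a small rooted tree stored as a child-to-parent map; the tree, its branch lengths, and the function names are made up for the example.

```python
def distances_to_ancestors(node, parent):
    """Map node and each of its ancestors to the branch-length distance from node."""
    result = {node: 0.0}
    total = 0.0
    while node in parent:
        up, length = parent[node]
        total += length
        result[up] = total
        node = up
    return result


def path_distance(a, b, parent):
    """Sum of branch lengths on the path between a and b (via their common ancestor)."""
    up_a = distances_to_ancestors(a, parent)
    up_b = distances_to_ancestors(b, parent)
    # The most recent common ancestor gives the smallest combined distance.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


# Hypothetical tree: child -> (parent, branch length leading to the parent)
parent = {
    "A": ("AB", 0.1),
    "B": ("AB", 0.2),
    "AB": ("root", 0.5),
    "C": ("root", 0.9),
}
print(round(path_distance("A", "B", parent), 3))
print(round(path_distance("A", "C", parent), 3))
```

Here A and B are separated by far less branch length than A and C, matching the statement that A and B are more closely related.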
**16) Multiple questions related to orthologs, paralogs, homologs:**
Orthologs are genes in different species that originated from a common ancestral gene and
usually retain the same function. Paralogs are genes within the same species that arose through
gene duplication and may evolve new functions. Both orthologs and paralogs are types of
homologs, meaning they share a common evolutionary ancestor. Understanding the distinction
between these gene relationships is crucial in phylogenetic studies, as orthologs are used to
infer evolutionary histories between species, while paralogs help understand functional
diversification within genomes.
**17) Why is it important to have MSA to obtain a phylogenetic tree?**
Multiple sequence alignment (MSA) is crucial for constructing accurate phylogenetic trees
because it identifies homologous regions across sequences that are used to infer evolutionary
relationships. By aligning sequences, researchers can determine where mutations, insertions,
or deletions have occurred, helping to accurately depict the evolutionary paths of the species in
question. MSA ensures that corresponding nucleotides or amino acids are compared across all
sequences, allowing for meaningful analysis of similarities and differences that reflect
evolutionary events.
**15) How branch length helps to obtain evolutionary relationships; taxa names and
explanations:**
Branch length in a phylogenetic tree indicates the amount of evolutionary change or time that
has passed since two taxa diverged from their common ancestor. Longer branches suggest
more evolutionary change or a longer divergence time, while shorter branches imply less
change. Understanding branch lengths helps to infer the relatedness of species. For example, in
a tree where taxa A, B, and C are compared, if A and B have a shorter branch between them than
either has with C, this implies A and B share a more recent common ancestor than they do with
C, and thus are more closely related. By examining branch lengths, scientists can assess the
pace of evolution and estimate divergence times.
**18) What is the purpose of MSA?**
The purpose of Multiple Sequence Alignment (MSA) is to align sequences of DNA, RNA, or
proteins from different species to identify regions of similarity. These conserved regions provide
insights into functional, structural, or evolutionary relationships among the sequences. MSA
helps detect conserved motifs, active sites in proteins, and evolutionarily conserved genes
across species. Additionally, it is crucial for phylogenetic analyses, helping to reconstruct
evolutionary trees by providing aligned sequences that can be used to infer common ancestry
and evolutionary divergence.
**19) Distance-based methods (Neighbor-Joining and UPGMA):**
Distance-based methods like Neighbor-Joining (NJ) and UPGMA (Unweighted Pair Group
Method with Arithmetic Mean) are algorithms used to construct phylogenetic trees. NJ is a fast
and efficient method that focuses on minimizing the total branch length, producing an unrooted
tree based on pairwise distance data between sequences. It is useful for constructing trees
when computational speed is important and when branch lengths need to reflect evolutionary
distances.
UPGMA, on the other hand, assumes a constant rate of evolution (a molecular clock),
constructing rooted trees by clustering taxa based on average distances. UPGMA works best
when the molecular clock hypothesis holds, meaning it is ideal for species with relatively
uniform mutation rates over time.
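The clustering by average distance that UPGMA performs can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name, the nested-tuple output format `(left, right, height)`, and the toy distance matrix are assumptions.

```python
def upgma(names, matrix):
    """Build a rooted tree by repeatedly merging the two closest clusters.

    names  -- list of taxon labels
    matrix -- symmetric list of lists of pairwise distances
    Returns nested tuples (left, right, height), with leaves as plain labels.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        height = dist[(i, j)] / 2.0  # molecular clock: node sits halfway
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted mean equals the average over all leaf pairs.
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j, height), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


names = ["A", "B", "C", "D"]
matrix = [  # hypothetical clock-like distances
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree = upgma(names, matrix)
print(tree)  # prints ('D', ('C', ('A', 'B', 1.0), 3.0), 5.0)
```

Because every merged node is placed at half the distance between its clusters, all leaves end up equidistant from the root, which is exactly the molecular clock assumption. Neighbor-Joining drops that assumption and therefore yields an unrooted tree.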
**20) Character-based methods (Maximum Parsimony and Maximum Likelihood):**
Character-based methods such as Maximum Parsimony (MP) and Maximum Likelihood (ML)
construct phylogenetic trees by considering the individual evolutionary changes at each
position in the sequence. MP seeks the simplest tree, requiring the fewest changes, and is best
suited for cases where evolutionary changes are infrequent. It assumes that evolution favors the
least number of changes, which makes it a straightforward and quick approach, though it can
struggle in complex evolutionary scenarios.
ML, by contrast, evaluates each candidate tree under an explicit probabilistic model of evolution and
asks how probable the observed data are given that tree. ML accounts for variation in mutation rates and
evolutionary models, making it more accurate but computationally demanding compared to MP.