Thanks to visit codestin.com
Credit goes to www.scribd.com

0% found this document useful (0 votes)
44 views36 pages

Computational Biology Lecture

This document provides an introduction to phylogeny and the process of reconstructing evolutionary history through phylogenetic trees. It discusses how input data like morphological characters and biomolecular sequences are used with reconstruction algorithms like maximum parsimony, maximum likelihood, and distance methods to output a phylogenetic tree. Popular distance methods like neighbor joining and quartet-based methods are also introduced. The challenges of testing reconstruction methods empirically are noted given the difficulty of knowing the true evolutionary tree.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
44 views36 pages

Computational Biology Lecture

This document provides an introduction to phylogeny and the process of reconstructing evolutionary history through phylogenetic trees. It discusses how input data like morphological characters and biomolecular sequences are used with reconstruction algorithms like maximum parsimony, maximum likelihood, and distance methods to output a phylogenetic tree. Popular distance methods like neighbor joining and quartet-based methods are also introduced. The challenges of testing reconstruction methods empirically are noted given the difficulty of knowing the true evolutionary tree.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 36

Introduction to Phylogeny

Lecture Notes for 9 October 2003


CMP 464/788: Introduction to Computational Biology
Prof. St. John
Lehman College
City University of New York

CMP 464/788, 9 October 2003 1


Goal: Reconstruct the Evolutionary History

(www.amnh.org/education/teacherguides/dinosaurs)
Goal: Reconstruct the Evolutionary History

(www.amnh.org/education/teacherguides/dinosaurs)

The evolutionary process not only determines


relationships among taxa, but allows prediction of
structural, physiological, and biochemical properties.
CMP 464/788, 9 October 2003 2
Process for Reconstruction: Input Data

Start with information about the taxa. For example:


Morphological
Characters
Process for Reconstruction: Input Data

Start with information about the taxa. For example:


Morphological Biomolecular
Characters Sequences
A GTTAGAAGGCGGCCAGCGAC. . .
B CATTTGTCCTAACTTGACGG. . .
C CAAGAGGCCACTGCAGAATC. . .
D CCGACTTCCAACCTCATGCG. . .
E ATGGGGCACGATGGATATCG. . .
F TACAAATACGCGCAAGTTCG. . .

CMP 464/788, 9 October 2003 3


Process for Reconstruction
Process for Reconstruction

Input
Data
A GTTAGAAGGC. . .
B CATTTGTCCT. . .
C CAAGAGGCCA. . .
D CCGACTTCCA. . .
E ATGGGGCACG. . .
F TACAAATACG. . .
Process for Reconstruction

Input Reconstruction
Data Algorithms
A GTTAGAAGGC. . . → Maximum Parsimony
B CATTTGTCCT. . . Maximum Likelihood
C CAAGAGGCCA. . . Distance Methods: NJ,
D CCGACTTCCA. . . Quartet-Based,
E ATGGGGCACG. . . Fast Convering,
F TACAAATACG. . . ..
.
Process for Reconstruction

Input Reconstruction Output


Data Algorithms Tree
A GTTAGAAGGC. . . → Maximum Parsimony →
B CATTTGTCCT. . . Maximum Likelihood
C CAAGAGGCCA. . . Distance Methods: NJ,
D CCGACTTCCA. . . Quartet-Based,
E ATGGGGCACG. . . Fast Convering,
F TACAAATACG. . . ..
.

CMP 464/788, 9 October 2003 4


Applications

In addition to finding the evolutionary history of species,


phylogeny is also used for:
Applications

In addition to finding the evolutionary history of species,


phylogeny is also used for:

• drug discovery: used to determine structural and


biochemical properties of potential drugs
Applications

In addition to finding the evolutionary history of species,


phylogeny is also used for:

• drug discovery: used to determine structural and


biochemical properties of potential drugs

• determining origins of HIV infection


Applications

In addition to finding the evolutionary history of species,


phylogeny is also used for:

• drug discovery: used to determine structural and


biochemical properties of potential drugs

• determining origins of HIV infection

• origin of other virus and bacteria strains

CMP 464/788, 9 October 2003 5


Algorithms for Reconstruction

• Most optimization criteria are hard:


Algorithms for Reconstruction

• Most optimization criteria are hard:


– Maximum Parsimony: (NP-hard: Foulds & Graham ‘82)
find the tree that can explain the observed sequences with a
minimal number of substitutions.
Algorithms for Reconstruction

• Most optimization criteria are hard:


– Maximum Parsimony: (NP-hard: Foulds & Graham ‘82)
find the tree that can explain the observed sequences with a
minimal number of substitutions.
– Maximum Likelihood Estimation: find the tree with the
maximum likelihood: P(data|tree).
Algorithms for Reconstruction

• Most optimization criteria are hard:


– Maximum Parsimony: (NP-hard: Foulds & Graham ‘82)
find the tree that can explain the observed sequences with a
minimal number of substitutions.
– Maximum Likelihood Estimation: find the tree with the
maximum likelihood: P(data|tree).
• For both, the best known algorithms require an
exponential number of trees to be checked.
Algorithms for Reconstruction

• Most optimization criteria are hard:


– Maximum Parsimony: (NP-hard: Foulds & Graham ‘82)
find the tree that can explain the observed sequences with a
minimal number of substitutions.
– Maximum Likelihood Estimation: find the tree with the
maximum likelihood: P(data|tree).
• For both, the best known algorithms require an
exponential number of trees to be checked.
This is not feasible for more than 20 taxa.

CMP 464/788, 9 October 2003 6


Approximation Algorithms

• Since calculating the exact answer is hard, algorithms


that estimate the answer have been developed.
Approximation Algorithms

• Since calculating the exact answer is hard, algorithms


that estimate the answer have been developed.
– Heuristics for maximum parsimony and maximum
likelihood estimation
(use clever ways to limit the number of trees checked, while still
sampling much of “tree-space”)
Approximation Algorithms

• Since calculating the exact answer is hard, algorithms


that estimate the answer have been developed.
– Heuristics for maximum parsimony and maximum
likelihood estimation
(use clever ways to limit the number of trees checked, while still
sampling much of “tree-space”)
– Polynomial-time methods, based on the distance
between taxa

CMP 464/788, 9 October 2003 7


Distance-Based Methods

• These methods calculate the distance between taxa:


B D A C F E
B 0 0.496505 0.496505 0.444519 0.375798 0.268166
D 0.496505 0 0.496505 0.375798 0.275673 0.279728
A 0.496505 0.496505 0 0.362124 0.323812 0.496505
C 0.444519 0.375798 0.362124 0 0.496505 0.496505
F 0.375798 0.275673 0.323812 0.496505 0 0.496505
E 0.268166 0.279728 0.496505 0.496505 0.496505 0

and then determine the tree using the distance matrix.


Distance-Based Methods

• These methods calculate the distance between taxa:


B D A C F E
B 0 0.496505 0.496505 0.444519 0.375798 0.268166
D 0.496505 0 0.496505 0.375798 0.275673 0.279728
A 0.496505 0.496505 0 0.362124 0.323812 0.496505
C 0.444519 0.375798 0.362124 0 0.496505 0.496505
F 0.375798 0.275673 0.323812 0.496505 0 0.496505
E 0.268166 0.279728 0.496505 0.496505 0.496505 0

and then determine the tree using the distance matrix.


• One way to calculate distance is to take differences
divided by the length (the normalized Hamming distance).

CMP 464/788, 9 October 2003 8


Distance-Based Methods
• Popular distance based methods include
Distance-Based Methods
• Popular distance based methods include
– Neighbor Joining (Saitou & Nei ‘87) which repeatedly
joins the “nearest neighbors” to build a tree, and
Distance-Based Methods
• Popular distance based methods include
– Neighbor Joining (Saitou & Nei ‘87) which repeatedly
joins the “nearest neighbors” to build a tree, and
– Quartet-based methods that decide the topology for
every 4 taxa and then assemble them to form a tree
(Berry et al. 1999, 2000, 2001).
Distance-Based Methods
• Popular distance based methods include
– Neighbor Joining (Saitou & Nei ‘87) which repeatedly
joins the “nearest neighbors” to build a tree, and
– Quartet-based methods that decide the topology for
every 4 taxa and then assemble them to form a tree
(Berry et al. 1999, 2000, 2001).

• Many of these methods have good performance


empirically, and some can be proven to have nice
accuracy properties.
CMP 464/788, 9 October 2003 9
Testing Methods Empirically
• How accurate are the methods at reconstructing trees?
Testing Methods Empirically
• How accurate are the methods at reconstructing trees?

• In biological applications, the true, historical tree is


almost never known, which makes assessing the quality
of phylogenetic reconstruction methods problematic.
Testing Methods Empirically
• How accurate are the methods at reconstructing trees?

• In biological applications, the true, historical tree is


almost never known, which makes assessing the quality
of phylogenetic reconstruction methods problematic.
(an exception: Hillis ‘92 created an evolutionary tree in the
laboratory)
Testing Methods Empirically
• How accurate are the methods at reconstructing trees?

• In biological applications, the true, historical tree is


almost never known, which makes assessing the quality
of phylogenetic reconstruction methods problematic.
(an exception: Hillis ‘92 created an evolutionary tree in the
laboratory)

• Simulation is used instead to evaluate methods, given


a model of evolution.
CMP 464/788, 9 October 2003 10
Simulation Studies

1. Construct a
“model” tree.
Simulation Studies

1. Construct a 2. “Evolve”
“model” tree. sequences down
the tree.

A GTTAGAAGGCGGCCA. . .
B CATTTGTCCTAACTT. . .
C CAAGAGGCCACTGCA. . .
D CCGACTTCCAACCTC. . .
E ATGGGGCACGATGGA. . .
F TACAAATACGCGCAA. . .
Simulation Studies

1. Construct a 2. “Evolve” 3. Reconstruct


“model” tree. sequences down the tree using
the tree. method.

A GTTAGAAGGCGGCCA. . .
B CATTTGTCCTAACTT. . .
C CAAGAGGCCACTGCA. . .
D CCGACTTCCAACCTC. . .
E ATGGGGCACGATGGA. . .
F TACAAATACGCGCAA. . .
Simulation Studies

1. Construct a 2. “Evolve” 3. Reconstruct


“model” tree. sequences down the tree using
the tree. method.

A GTTAGAAGGCGGCCA. . .
B CATTTGTCCTAACTT. . .
C CAAGAGGCCACTGCA. . .
D CCGACTTCCAACCTC. . .
E ATGGGGCACGATGGA. . .
F TACAAATACGCGCAA. . .

4. Evaluate the accuracy of the constructed tree.

CMP 464/788, 9 October 2003 11


Simulation Studies

1. Construct a 2. “Evolve” 3. Reconstruct


“model” tree. sequences down the tree using
the tree. method.

A GTTAGAAGGCGGCCA. . .
B CATTTGTCCTAACTT. . .
C CAAGAGGCCACTGCA. . .
D CCGACTTCCAACCTC. . .
E ATGGGGCACGATGGA. . .
F TACAAATACGCGCAA. . .

4. Evaluate the accuracy of the constructed tree.

CMP 464/788, 9 October 2003 12

You might also like