Phylogenetics for Biology Students

This document discusses phylogenetic trees and methods for constructing them from molecular sequence data. It introduces phylogenetic trees as a way to illustrate evolutionary relationships among organisms or sequences. It describes commonly used software packages and data types, including morphological features and molecular sequences. It explains different tree construction methods like distance-based methods, maximum parsimony, and maximum likelihood. Specific algorithms covered include UPGMA, neighbor joining, and Fitch's algorithm for parsimony. It also discusses computing evolutionary distances between sequences and models for sequence evolution.


Phylogenetics

COS551, Fall 2003

Mona Singh
Phylogenetics
• Phylogenetic trees illustrate the
evolutionary relationships among groups of
organisms, or among a family of related
nucleic acid or protein sequences
• E.g., how might this family have been
derived during evolution?
Hypothetical Tree
Relating Organisms
Phylogenetic Relationships
Among Organisms
• Entrez: www.ncbi.nlm.nih.gov/Taxonomy
• Ribosomal database project:
rdp.cme.msu.edu/html/
• Tree of Life:
phylogeny.arizona.edu/tree/phylogeny.html
Globin Sequences
Phylogeny Applications
• Tree of life: Analyzing changes that have
occurred in evolution of different organisms
• Phylogenetic relationships among genes can
help predict which ones might have similar
functions (e.g., ortholog detection)
• Follow changes occurring in rapidly
changing species (e.g., HIV)
Phylogeny Packages
• PHYLIP, Phylogenetic inference package
– evolution.genetics.washington.edu/phylip.html
– Felsenstein
– Free!
• PAUP, phylogenetic analysis using
parsimony
– paup.csit.fsu.edu
– Swofford
What data is used to
build trees?
• Traditionally: morphological features (e.g.,
number of legs, beak shape, etc.)
• Today: Mostly molecular data (e.g., DNA
and protein sequences)
Data for Phylogeny
• Can be classified into two categories:
– Numerical data
• Distance between objects
e.g., distance(man, mouse)=500,
distance(man, chimp)=100
Usually derived from sequence data
– Discrete characters
• Each character has finite number of states
e.g., number of legs = 1, 2, 4
DNA = {A, C, T, G}
Rooted vs Unrooted
Trees
[Figure: a rooted tree and an unrooted tree, with the root, an internal node, and an external node labeled]

Note: Here, each internal node (other than the root) has three neighboring nodes

Terminology
• External nodes: things under comparison;
operational taxonomic units (OTUs)
• Internal nodes: ancestral units; hypothetical; goal is
to group current day units
• Root: common ancestor of all OTUs under study.
Path from root to node defines evolutionary path
• Unrooted: specify relationship but not evolutionary
path
– If have an outgroup (external reason to believe certain
OTU branched off first), then can root
• Topology: branching pattern of a tree
• Branch length: amount of difference that occurred
along a branch
How to reconstruct trees
• Distance methods: compute evolutionary distances for
all pairs of OTUs and build a tree in which the
distances between OTUs “match” these distances
• Maximum parsimony (MP): choose tree that
minimizes number of changes required to explain
data
• Maximum likelihood (ML): under a model of
sequence evolution, find the tree which gives the
highest likelihood of the observed data
Number of possible trees
Given n OTUs, there are $(2n-5)! / \left(2^{n-3}(n-3)!\right)$ unrooted trees

OTUs    Unrooted trees
3       1
4       3
5       15
10      2,027,025
Number of possible trees
Given n OTUs, there are $(2n-3)! / \left(2^{n-2}(n-2)!\right)$ rooted trees

OTUs    Rooted trees
3       3
4       15
5       105
10      34,459,425

Bottom line: an enumeration strategy over all possible trees to
find the best one under some criterion is not feasible!
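The counts in the two tables can be checked with a short sketch. It assumes strictly bifurcating trees with labeled leaves and uses the standard closed forms; the function names are illustrative only.

```python
from math import factorial

def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees on n labeled OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def num_rooted_trees(n):
    """Number of rooted bifurcating trees on n labeled OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (3, 4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The rooted count for n OTUs equals the unrooted count for n + 1, since a root can be placed on any branch of an unrooted tree.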
Parsimony
Find tree which minimizes number of changes needed to
explain data

Ex:
  123456
A GTCGTA
B GTCACT
C GCGGTA
D ACGACA
E ACGGAA
Parsimony
• For given example tree and alignment, can do this
for all sites, and get away with as few as 9 changes
• Changing the tree (either the topology or labeling
of leaves) changes the minimum number of
changes needed
• Two computational problems
– (Easy) Given a particular tree, how do you find the
minimum number of changes needed to explain the data?
(Fitch)
– (Hard) How do you search through all trees?
Parsimony: Fitch’s algorithm

Idea: construct set of possible nucleotides for internal nodes,
based on possible assignments of children
Parsimony: Fitch’s algorithm
• For each site:
– Each leaf is labeled with set containing observed
nucleotide at that position
– For each internal node i with children j and k with
labels $S_j$ and $S_k$:
$S_i = S_j \cap S_k$ if $S_j \cap S_k \neq \emptyset$, otherwise $S_i = S_j \cup S_k$
• Total # changes necessary for a site is # of union
operations
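A minimal sketch of this bottom-up pass, applied to the five-sequence example alignment. The nested-tuple tree encoding and the particular tree (((A,B),C),(D,E)) are assumptions made for illustration; the slides' own example tree is not recoverable from the text.

```python
def fitch_site(tree, states):
    """Return (label set, number of union operations) for one site.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping leaf name to the nucleotide observed at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    s_left, c_left = fitch_site(left, states)
    s_right, c_right = fitch_site(right, states)
    common = s_left & s_right
    if common:
        # Non-empty intersection: no change needed at this node.
        return common, c_left + c_right
    # Empty intersection: take the union and count one change.
    return s_left | s_right, c_left + c_right + 1

def parsimony_score(tree, alignment):
    """Sum the per-site Fitch counts over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

alignment = {"A": "GTCGTA", "B": "GTCACT", "C": "GCGGTA",
             "D": "ACGACA", "E": "ACGGAA"}
tree = ((("A", "B"), "C"), ("D", "E"))  # assumed topology
print(parsimony_score(tree, alignment))
```

This pass only counts changes; recovering an actual assignment of nucleotides to internal nodes needs a second, top-down pass.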
Parsimony
• How do you search through all trees?
– Enumerate all trees (too many…)
– Can use techniques to try to limit the search space (e.g., branch and bound)
– or use heuristics (many possibilities)
• E.g., nearest neighbor interchange. Start with a tree and consider neighboring
trees. If any neighboring tree has fewer changes, take it as current tree. Stop when
no improvements

[Figure: three neighboring trees obtained by rearranging the four subtrees a, b, c, d around an internal branch]
Parsimony weaknesses
Parsimony analysis implicitly assumes that rates of change
along branches are similar

[Figure: left, the real tree, with two long branches on which G has turned to A independently; right, the inferred tree]
Distance Methods
• Input: given an n x n matrix M where
Mij>=0 and Mij is the distance between
objects i and j
• Goal: Build an edge-weighted tree where
each leaf (external node) corresponds to one
object of M and so that distances measured
on the tree between leaves i and j
correspond to Mij
Distance Methods
     A    B    C    D    E
A    0
B   12    0
C   14   12    0
D   14   12    6    0
E   15   13    7    3    0

A tree exactly fitting the matrix does not always exist.

Distance Method Criteria
• Try to find the tree with distances dij which
“best fits” the distance data Mij
• Different possibilities for “best”
– Cavalli-Sforza criterion: minimize $\sum_{i<j} (d_{ij} - M_{ij})^2$
– Fitch-Margoliash criterion: minimize $\sum_{i<j} (d_{ij} - M_{ij})^2 / M_{ij}^2$
• Unfortunately, both lead to computationally
intractable problems (e.g., enumerating all trees)
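Scoring one candidate tree is cheap; the hard part is the search over trees. The sketch below evaluates both criteria for a given set of tree distances. It assumes their commonly stated forms (unweighted squared error for Cavalli-Sforza, squared error weighted by 1/Mij² for Fitch-Margoliash), and the three-object numbers are made up for illustration.

```python
def fit_criteria(tree_dist, M):
    """Return (Cavalli-Sforza score, Fitch-Margoliash score).

    tree_dist -- dict: pair of objects -> distance measured along the tree
    M         -- dict: same pairs -> given distance (must be non-zero)
    """
    cavalli_sforza = sum((tree_dist[p] - M[p]) ** 2 for p in M)
    fitch_margoliash = sum((tree_dist[p] - M[p]) ** 2 / M[p] ** 2 for p in M)
    return cavalli_sforza, fitch_margoliash

# Hypothetical data: given distances and the distances on some candidate tree.
M = {("A", "B"): 8.0, ("A", "C"): 7.0, ("B", "C"): 9.0}
tree_dist = {("A", "B"): 8.5, ("A", "C"): 7.0, ("B", "C"): 8.5}
print(fit_criteria(tree_dist, M))
```

The weighting in the second score penalizes a given error more when it occurs on a short distance than on a long one.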
Distance Method
Heuristic: UPGMA
• UPGMA (Unweighted group method with
arithmetic mean)
– Sequential clustering algorithm
– Start with things most similar
• Build a composite OTU
– Distances to this OTU are computed as
arithmetic means
– From new group of OTUs, pick pair with
highest similarity etc.
• Average-linkage clustering
UPGMA: Visually

[Figure: five points, 1 to 5, and the tree produced by clustering them]
UPGMA Example
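     A    B    C    D
A    0
B    8    0
C    7    9    0
D   12   14   11    0

The closest pair is A and C (distance 7), so merge them into a composite OTU (AC):
$M_{B(AC)} = (M_{BA} + M_{BC})/2 = (8+9)/2 = 8.5$
$M_{D(AC)} = (M_{DA} + M_{DC})/2 = (12+11)/2 = 11.5$
UPGMA Example
      AC     B     D
AC     0
B    8.5     0
D   11.5    14     0

The closest pair is now B and (AC) (distance 8.5), so merge them into (ABC):
$M_{(ABC)D} = (M_{AD} + M_{BD} + M_{CD})/3 = (12+14+11)/3 \approx 12.33$
UPGMA: Example
       ABC     D
ABC      0
D    12.33     0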
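The merges in this example can be reproduced with a small sketch. It assumes the distance between two clusters is the arithmetic mean over all pairs of their original members, and it records only the merge order and merge distances, not branch lengths. The `frozenset` keys are an illustrative encoding.

```python
from itertools import combinations

def upgma(names, M):
    """Return the list of merges as (cluster1, cluster2, distance).

    names -- list of OTU names
    M     -- dict: frozenset({a, b}) -> distance between OTUs a and b
    """
    clusters = [(name,) for name in names]

    def avg(c1, c2):
        # Mean distance over all pairs with one member in each cluster.
        total = sum(M[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

pairs = {("A", "B"): 8, ("A", "C"): 7, ("B", "C"): 9,
         ("A", "D"): 12, ("B", "D"): 14, ("C", "D"): 11}
M = {frozenset(p): d for p, d in pairs.items()}
for c1, c2, d in upgma(["A", "B", "C", "D"], M):
    print(c1, c2, round(d, 2))
```

In a rooted UPGMA tree, each merge is usually drawn at a height of half its merge distance.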
UPGMA weaknesses
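     A    B    C    D
A    0
B    8    0
C    7    9    0
D   12   14   11    0

UPGMA first joins A and C, the closest pair. In fact, an exactly fitting tree exists, and it groups A with B instead: leaf branches A = 3, B = 5, C = 3, D = 8, with an internal branch of length 1 separating (A, B) from (C, D).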

UPGMA weaknesses
• UPGMA assumes that the rates of evolution
are the same among different lineages
• In general, should not use this method for
phylogenetic tree reconstruction (unless
believe assumption)
• Produces a rooted tree
• As a general clustering method (as we
discussed in an earlier lecture), it is better…
Distance Method:
Neighbor Joining
• Most widely-used distance based method
for phylogenetic reconstruction
• UPGMA illustrated that it is not enough to
just pick closest neighbors
• Idea here: take into account averaged
distances to other leaves as well
• Produces an unrooted tree
Neighbor Joining (NJ)

Start off with star tree; pull out pairs at a time

NJ Algorithm
Step 1: Let $u_i = \sum_{k \neq i} M_{ik} / (n-2)$, where n is the current number of nodes
– (Almost) “average” distance to other nodes
Step 2: Choose i and j for which $M_{ij} - u_i - u_j$ is
smallest
– Look for nodes that are close to each other,
and far from everything else
– Turns out minimizing this is minimizing sum
of branch lengths
NJ algorithm
Step 3: Define a new cluster (i, j), with a
corresponding node in the tree
[Figure: leaves i and j joined to the new node (i,j)]

Distance from i and j to node (i,j):
$d_{i,(i,j)} = \tfrac{1}{2}(M_{ij} + u_i - u_j)$
$d_{j,(i,j)} = \tfrac{1}{2}(M_{ij} + u_j - u_i)$
Default: split the distance evenly, but if on average one is further away, make its branch longer
NJ Algorithm
Step 4: Compute distance between new cluster
and all other clusters:
$M_{(ij)k} = (M_{ik} + M_{jk} - M_{ij}) / 2$
[Figure: node k connected to the new node (i,j), which joins i and j]
Step 5: Delete i and j from matrix
and replace by (i, j)

Step 6: Continue until only 2 leaves remain
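Steps 1 to 6 can be sketched as below. The sketch assumes u_i is the sum of distances from i to the other current nodes divided by (n - 2), breaks ties by taking the first minimal pair, and returns the tree as a list of weighted edges. The node naming and `frozenset` keys are illustrative choices. It is run on the four-OTU matrix from the UPGMA example.

```python
from itertools import combinations

def neighbor_joining(names, M):
    """Return the NJ tree as a list of (node, node, branch length) edges.

    M -- dict: frozenset({a, b}) -> distance between a and b
    """
    nodes = list(names)
    dist = dict(M)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: (almost) average distance to the other nodes.
        u = {i: sum(dist[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        # Step 2: pair minimizing M_ij - u_i - u_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: dist[frozenset(p)] - u[p[0]] - u[p[1]])
        # Step 3: new internal node and the two branch lengths leading to it.
        new = f"({i},{j})"
        d_ij = dist[frozenset((i, j))]
        edges.append((i, new, 0.5 * (d_ij + u[i] - u[j])))
        edges.append((j, new, 0.5 * (d_ij + u[j] - u[i])))
        # Step 4: distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                dist[frozenset((new, k))] = 0.5 * (
                    dist[frozenset((i, k))] + dist[frozenset((j, k))] - d_ij)
        # Step 5: replace i and j by the new node.
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    # Step 6: connect the last two nodes.
    a, b = nodes
    edges.append((a, b, dist[frozenset((a, b))]))
    return edges

pairs = {("A", "B"): 8, ("A", "C"): 7, ("B", "C"): 9,
         ("A", "D"): 12, ("B", "D"): 14, ("C", "D"): 11}
M = {frozenset(p): d for p, d in pairs.items()}
for edge in neighbor_joining(["A", "B", "C", "D"], M):
    print(edge)
```

On this matrix the sketch pairs A with B and C with D, so path lengths on the resulting tree match every entry of the matrix.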

NJ Performance
• Works well in practice
• If there is a tree that fits the matrix, it will
find it
• Can sometimes get trees with negative
length edges (!)
Computing Distances
Between Sequences

Could compute fraction of mismatches between
two sequences; however, this is an underestimate
of actual distance
Computing Distances
Between Sequences

E.g., many underlying substitutions possible

Use models of substitution to correct these values
Computing Distances
Between Sequences
Jukes & Cantor model
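– Each position in the DNA
sequence is independent
– Each position can mutate
with the same probability to
any other base

Correction to observed
substitution rate (see notes): $d = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} p\right)$, where p is the observed fraction of sites that differ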
Ex: Computing Distances
Between Sequences
• Alignment of two DNA sequences
– Length of alignment (non gapped positions): 100
– Number of differences: 25
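• Naïve distance calculation = 25/100 = ¼
• Correction: $d = -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} \cdot \tfrac{1}{4}\right) = -\tfrac{3}{4} \ln \tfrac{2}{3} \approx 0.304$
• Other models for DNA, also protein (e.g., PAM)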
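The worked example can be checked with a few lines. This sketch assumes the standard Jukes & Cantor correction, d = -(3/4) ln(1 - 4p/3), with p the observed fraction of differing sites; the function name is illustrative.

```python
import math

def jukes_cantor_distance(differences, length):
    """Estimated substitutions per site under the Jukes & Cantor model."""
    p = differences / length
    if p >= 0.75:
        # The logarithm is undefined: the sequences look saturated.
        raise ValueError("observed fraction of differences must be below 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(round(jukes_cantor_distance(25, 100), 3))  # naive estimate would be 0.25
```

The corrected value is larger than the naive one, and the gap grows as p approaches 3/4.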
Maximum Likelihood
• Given a probabilistic model for nucleotide
(or protein) substitution (e.g., Jukes &
Cantor), pick the tree that has highest
probability of generating observed data
– I.e., Given data D and model M, find tree T
such that Pr(D|T, M) is maximized
• The model gives values pij(t), the probability of
going from nucleotide i to j in time t
Maximum Likelihood
• Makes 2 independence assumptions
– Different sites evolve independently
– Diverged sequences (or species) evolve
independently after diverging
• If Di is data for ith site, then $\Pr(D \mid T, M) = \prod_i \Pr(D_i \mid T, M)$
Maximum Likelihood
How to calculate Pr(Di|T,M) ?

pxy(t) ~ probability of going from x to y in time t
Maximum Likelihood
• Given tree topology and branch lengths, can
efficiently calculate Pr(D|T, M) using dynamic
programming
– I.e., don’t have to enumerate over all internal states
• Finding best maximum likelihood tree is expensive
– Must consider all topologies
– Find best edge lengths for each topology
• Idea: use some search procedure, e.g., EM, to optimize these
lengths
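The dynamic programming step for a single site can be sketched as follows. The sketch assumes the Jukes & Cantor model with branch lengths in expected substitutions per site, a uniform distribution over bases at the root, and a hypothetical tree encoding in which an internal node is a tuple of (child, branch length) pairs.

```python
import math
from itertools import product

BASES = "ACGT"

def p_jc(x, y, t):
    """Jukes & Cantor probability of going from base x to base y in time t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(tree, site):
    """Map each base b to Pr(leaves below this node | this node has base b)."""
    if isinstance(tree, str):  # leaf: only the observed base is possible
        return {b: 1.0 if b == site[tree] else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in tree:
        below = partial(child, site)
        for b in BASES:
            # Sum over the child's possible states instead of enumerating
            # every joint assignment of internal nodes.
            result[b] *= sum(p_jc(b, c, t) * below[c] for c in BASES)
    return result

def site_likelihood(tree, site):
    """Pr(D_i | T, M) for one site, with a uniform prior at the root."""
    return sum(0.25 * value for value in partial(tree, site).values())

# Hypothetical tree ((1:0.1, 2:0.2):0.05, 3:0.3) and one alignment column.
inner = (("1", 0.1), ("2", 0.2))
tree = ((inner, 0.05), ("3", 0.3))
print(site_likelihood(tree, {"1": "G", "2": "G", "3": "A"}))
```

For a whole alignment, the per-site values are multiplied together (or their logarithms summed), following the site-independence assumption.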
Assessing Reliability:
Bootstrap
Say we’ve inferred the following tree

[Figure: inferred tree on leaves 1, 2, 3, 4]

Would like to get confidence levels that 1 & 2 belong together,
and 3 & 4 belong together
Assessing Reliability:
Bootstrap
Say we’re given the following alignment:
  12345678
1 GCAGTACT
2 GTAGTACT
3 ACAATACC
4 ACAACACT
We’ll create a pseudosample by choosing sites randomly (with replacement)
until N sites are chosen (N is the length of the alignment)
Assessing Reliability:
Bootstrap
Say we chose the 6th, 1st, 6th, 8th, … sites
  12345678    6168 …
1 GCAGTACT    AGAT …
2 GTAGTACT    AGAT …
3 ACAATACC    AAAC …
4 ACAACACT    AAAT …
Assessing Reliability:
Bootstrap
• Use pseudosample to construct tree
• Repeat many times
• Confidence of (1) and (2) together is
fraction of times they appear together in
trees generated from pseudosamples
[Figure: the inferred tree on leaves 1, 2, 3, 4, with bootstrap values 95 and 90 marked on its groupings]
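The resampling step can be sketched as below, using the four-sequence alignment from the slides. Tree building is left to a caller-supplied function: `infer_groups` is a hypothetical callback that takes an alignment and returns the groups (as frozensets of sequence names) found in the tree it builds.

```python
import random

def resample_columns(alignment, indices):
    """Build a pseudosample from the given 0-based column indices."""
    return {name: "".join(seq[i] for i in indices)
            for name, seq in alignment.items()}

def bootstrap_pseudosample(alignment, rng=random):
    """Choose N columns at random, with replacement (N = alignment length)."""
    n = len(next(iter(alignment.values())))
    return resample_columns(alignment, [rng.randrange(n) for _ in range(n)])

def bootstrap_support(alignment, infer_groups, group, replicates=100, seed=0):
    """Fraction of pseudosample trees in which `group` appears."""
    rng = random.Random(seed)
    hits = sum(group in infer_groups(bootstrap_pseudosample(alignment, rng))
               for _ in range(replicates))
    return hits / replicates

alignment = {"1": "GCAGTACT", "2": "GTAGTACT",
             "3": "ACAATACC", "4": "ACAACACT"}
# Sites 6, 1, 6, 8 from the slide are indices 5, 0, 5, 7 here.
print(resample_columns(alignment, [5, 0, 5, 7]))
```

Any of the tree-building methods above could serve as `infer_groups`; the support value is only meaningful relative to the method used.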
Phylogeny Flowchart
Family of sequences → Build alignment
• Strong similarity? Y → MP methods
• If not, recognizable similarity? Y → Distance methods
• N → ML methods

(Mount, Bioinformatics)
Difference in Methods
• Maximum-likelihood and parsimony methods
have models of evolution
• Distance methods do not necessarily
– Useful aspect in some circumstances
• E.g., trees built based on whole genomes, presence or absence
of genes
• Religious wars over which methods to use
– Most people now believe ML based methods are best:
most sensitive at large evolutionary distances – but also
most time-consuming & depend on specific model of
evolution used
• Most commonly used packages contain software
for all three methods: may want to use more than 1
to have confidence in built tree
Phylip
• Parsimony
– DNApenny or Protpars
• Distance
– Compute distance measure using DNAdist or
Protdist
– Neighbor (can use NJ or UPGMA)
• ML
– DNAml
