Global Alignment
Ben Langmead
Department of Computer Science
Please sign guestbook (www.langmead-lab.org/teaching-materials)
to tell me briefly how you are using the slides. For original Keynote
files, email me (
[email protected]).
Generalizing edit distance
What if it doesn’t make sense for every edit to cost 1?
If you compare two human genomes, you see some kinds of
sequence differences more often than others
Generalizing edit distance
Transitions are A ↔ G
and C ↔ T changes
Transversions are A C,
A T, C G, G T
For random mutations,
transitions should be half as
frequent as transversions...
...but if you compare two humans, transition
to transversion ratio (ti/tv) is ~2.1
Generalizing edit distance
GGGTAGCGGGTTTAAC
||||| ||||||||||
GGGTAACGGGTTTAAC
Human substitution rate ≈ 1 in 1,000
GGGTAGCGGGTTTAAC
||||| |||||||||
GGGTA--GGGTTTAAC
Small-gap rate is ≈ 1 in 3,000
Wanted: keep basic edit distance idea and algorithm, but give
different weights to different events according to likelihood
Penalty function
A C G T -
A 0 4 2 4 8 2 Transitions (A G, C T)
C 4 0 4 2 8 4 Transversions
G 2 4 0 4 8
8 Gaps
T 4 2 4 0 8
- 8 8 8 8
Global alignment
Pj 1 Pi 1
Let D[0, j] = k=0 s( , y[k]), and let D[i, 0] = k=0 s(x[k], )
8
< D[i 1, j] + s(x[i 1], )
Otherwise, let D[i, j] = min D[i, j 1] + s( , y[j 1])
:
D[i 1, j 1] + s(x[i 1], y[j 1])
s(a, b) assigns a cost to a particular gap or substitution
A C G T -
A 0 4 2 4 8 2 Transitions (A G, C T)
C 4 0 4 2 8
s(a, b) : 4 Transversions (everything else)
G 2 4 0 4 8
T 4 2 4 0 8 8 Gaps
- 8 8 8 8
Global alignment: implementation
from numpy import zeros
def globalAlignment(x, y, s):
""" Calculate global alignment value of sequences x and y using
dynamic programming. Return global alignment value. """
D = zeros((len(x)+1, len(y)+1), dtype=int)
for j in range(1, len(y)+1): Use of new
D[0, j] = D[0, j-1] + s('-', y[j-1])
for i in range(1, len(x)+1): penalty function
D[i, 0] = D[i-1, 0] + s(x[i-1], '-')
for i in range(1, len(x)+1):
for j in range(1, len(y)+1):
D[i, j] = min(D[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
D[i-1, j ] + s(x[i-1], '-'), # vertical
D[i , j-1] + s('-', y[j-1])) # horizontal
return D, D[len(x), len(y)]
Similar to edit distance
http://bit.ly/CG_DP_Global
Global alignment: implementation
def exampleCost(xc, yc):
""" Cost function assigning 0 to match, 2 to transition, 4 to
transversion, and 8 to a gap """
if xc == yc: return 0 # match
if xc == '-' or yc == '-': return 8 # gap
minc, maxc = min(xc, yc), max(xc, yc)
if minc == 'A' and maxc == 'G': return 2 # transition
elif minc == 'C' and maxc == 'T': return 2 # transition
return 4 # transversion
A C G T -
A 0 4 2 4 8
C 4 0 4 2 8
G 2 4 0 4 8
T 4 2 4 0 8
- 8 8 8 8
http://bit.ly/CG_DP_Global
Global alignment: dynamic programming
D = zeros((len(x)+1, len(y)+1), dtype=int)
globalAlignment for j in range(1, len(y)+1):
D[0, j] = D[0, j-1] + s('-', y[j-1])
initialization: for i in range(1, len(x)+1):
D[i, 0] = D[i-1, 0] + s(x[i-1], '-')
ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80 s(a, b)
T 8 A C G T -
A 0 4 2 4 8
A 16 C 4 0 4 2 8
C 24 G 2 4 0 4 8
T 4 2 4 0 8
G 32 - 8 8 8 8
T 40
C 48
A 56
G 64
C 72
Global alignment: dynamic programming
for i in range(1, len(x)+1):
globalAlignment for j in range(1, len(y)+1):
D[i, j] = min(D[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
loop: D[i-1, j ] + s(x[i-1], '-'), # vertical
D[i , j-1] + s('-', y[j-1])) # horizontal
ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80 s(a, b)
T 8 0 8 16 24 32 40 48 56 64 72 A C G T -
A 0 4 2 4 8
A 16 8 0 8 16 24 32 40 48 56 64 C 4 0 4 2 8
C 24 16 8 ? G 2 4 0 4 8
T 4 2 4 0 8
G 32 - 8 8 8 8
T 40
C 48
A 56
G 64
C 72
Global alignment: dynamic programming
for i in range(1, len(x)+1):
globalAlignment for j in range(1, len(y)+1):
D[i, j] = min(D[i-1, j-1] + s(x[i-1], y[j-1]), # diagonal
loop: D[i-1, j ] + s(x[i-1], '-'), # vertical
D[i , j-1] + s('-', y[j-1])) # horizontal
ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80 s(a, b)
T 8 0 8 16 24 32 40 48 56 64 72 A C G T -
A 0 4 2 4 8
A 16 8 0 8 16 24 32 40 48 56 64 C 4 0 4 2 8
C 24 16 8 2 10 18 24 32 40 48 56 G 2 4 0 4 8
T 4 2 4 0 8
G 32 24 16 10 2 10 18 26 34 40 48 - 8 8 8 8
T 40 32 24 16 10 2 10 18 26 34 42
C 48 40 32 24 18 10 2 10 18 26 34
A 56 48 40 32 26 18 10 2 10 18 26
G 64 56 48 40 32 26 18 10 6 10 18
C 72 64 56 48 40 34 26 18 12 10 10 Optimal global
alignment value
Global alignment: getting the alignment
Traceback works just as it did for edit distance
ϵ T A T G T C A T G C
ϵ 0 8 16 24 32 40 48 56 64 72 80
T 8 0 8 16 24 32 40 48 56 64 72
A 16 8 0 8 16 24 32 40 48 56 64
C 24 16 8 2 10 18 24 32 40 48 56
G 32 24 16 10 2 10 18 26 34 40 48
T 40 32 24 16 10 2 10 18 26 34 42 TACGTCA-GC
C 48 40 32 24 18 10 2 10 18 26 34 || |||| ||
TATGTCATGC
A 56 48 40 32 26 18 10 2 10 18 26 +2 +8
(transition) (gap)
G 64 56 48 40 32 26 18 10 6 10 18
C 72 64 56 48 40 34 26 18 12 10 10
Global alignment: summary
Matrix-filling dynamic programming algorithm is O(mn) time and space
FillIng matrix is O(mn) space and time, yields global alignment value
Traceback is O(m + n) time, yields optimal alignment
Global alignment: scoring functions
Where do these penalty functions come from? A C G T -
A 0 4 2 4 8
C 4 0 4 2 8
G 2 4 0 4 8
T 4 2 4 0 8
- 8 8 8 8
They can be based on:
Expected frequency of the different mutational events
How interchangeable are the alternatives are from a biological perspective
Does the substitution change the shape or function of the molecule
Prevalence of simple (linear, constant, affine) gap penalties is mostly because
that’s what we can do efficiently, as discussed in HW4
One occasionally sees more general (e.g. convex) gap penalties
BLOSUM62 Some amino acid substitutions have a smaller impact on
structure & function than others. BLOSUM62 elements
are, roughly speaking, log-odds of observing these
substitutions between two highly related proteins
Rare; larger effect on Common; modest effect
structure/function on structure/function
negative positive
Matrix is symmetric
Amino acids