Introduction to Bioinformatics
Lecture 5 & 6
Sequence Alignment
What is sequence alignment?
Procedure of comparing sequences
Two sequences (Pair-wise Sequence Alignment)
More than two (Multiple Sequence Alignment)
What sequences are aligned?
Match
Mismatch
Global VS Local*****
Global Alignment
Attempts to align the entire length of both sequences
Suitable for similar sequences of roughly equal length
Global alignment:
  CTGTCG-CTGCACG
  -TGC-CG-TG----
Local alignment:
  CTGTCGCTGCACG
  -------TGC-CGTG
Local Alignment
Gathers islands of matches
Stretches of sequence with the highest density of matches are aligned
Suitable for partially similar sequences, sequences of different lengths, and sequences containing conserved regions
How can we tell if the two sequences are similar?
Similarity judgments should be based on:
The types of changes or mutations that occur within
sequences.
Characteristics of those different types of mutations.
Frequency of mutations:
Substitution > Insertion, Deletion >> Duplication > Inversion
Common mutations in DNA***
Substitution:
A C G T T G A C
A C G A T G A C
Deletion:
A C G T T G A C
A C G A C
Insertion:
A C G T T G A C
A C G C A A T T G A C
Common mutations***
Duplication:
A C G T T G A C
A C G T T G A T T G A C
Inversion (double stranded DNA shown):
A C G T T G
T G C A A C
(a segment of the duplex is excised, flipped, and re-inserted; the inverted product was shown as a figure)
Terminology *****
Homolog
A gene related to a second gene by descent
from a common ancestral DNA sequence
Ortholog
Orthologs are genes in different species that
evolved from a common ancestral gene by
speciation
Paralog
Paralogs are genes related by duplication
within a genome
Terminology
Analogous: different in structure or origin, but similar in feature (function)
Xenologous: related through horizontal transfer of genetic material between species
Global Alignment *****

(worked example: the slide stepped through filling a Needleman-Wunsch score
matrix for two short DNA sequences, with gap steps of -5 per space; the
matrix figure itself is not recoverable from this text dump)

Traceback can yield both optimum alignments
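The matrix fill and traceback can be sketched in code. A minimal Needleman-Wunsch implementation in Python; the sequences and the scoring values (match +10, mismatch -2, gap -5) are illustrative assumptions, not necessarily those of the slide's worked example:

```python
# Minimal Needleman-Wunsch global alignment (sketch).
# Scoring values are illustrative assumptions: match +10, mismatch -2, gap -5.

def needleman_wunsch(a, b, match=10, mismatch=-2, gap=-5):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # diagonal: (mis)match
                              score[i - 1][j] + gap,    # up: gap in b
                              score[i][j - 1] + gap)    # left: gap in a
    # traceback from the bottom-right corner
    align_a, align_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            align_a.append(a[i - 1]); align_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            align_a.append(a[i - 1]); align_b.append('-'); i -= 1
        else:
            align_a.append('-'); align_b.append(b[j - 1]); j -= 1
    return score[n][m], ''.join(reversed(align_a)), ''.join(reversed(align_b))

s, aa, bb = needleman_wunsch("CATTCA", "CTTCA")
print(s)   # 5 matches and 1 gap: 5*10 - 5 = 45
print(aa)
print(bb)
```

Ties in the traceback are resolved diagonal-first here, which is why a matrix with several equal-scoring paths can yield more than one optimum alignment.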
Local vs Global Alignment ***
Both use the dynamic programming method
Main differences:
  The rules for calculating the scoring matrix are slightly different
  The scoring system must include negative scores for mismatches
  Only non-negative values are kept in the scoring matrix
    This has the effect of terminating the alignment
Local Alignment***

(worked example: a Smith-Waterman score matrix was shown here; the figure is
not recoverable from this text dump)

Scoring: +1 for a match, -1 for a mismatch, -5 for a space
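A minimal Smith-Waterman sketch using the scoring above (+1 match, -1 mismatch, -5 space); the test sequences are made up for illustration:

```python
# Minimal Smith-Waterman local alignment (sketch).
# Scores from the slide: +1 match, -1 mismatch, -5 space.
# Only non-negative values are kept, which lets an alignment terminate.

def smith_waterman(a, b, match=1, mismatch=-1, gap=-5):
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                         # terminate the alignment
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # traceback from the best cell until a 0 is reached
    i, j = best_pos
    frag_a, frag_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            frag_a.append(a[i - 1]); frag_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            frag_a.append(a[i - 1]); frag_b.append('-'); i -= 1
        else:
            frag_a.append('-'); frag_b.append(b[j - 1]); j -= 1
    return best, ''.join(reversed(frag_a)), ''.join(reversed(frag_b))

print(smith_waterman("ACACACTA", "AGCACACA"))  # finds the shared island CACAC
```

Note the two structural differences from the global version: a 0 floor in the recurrence, and traceback starting from the highest-scoring cell rather than the corner.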
How can we know?
The alignment is global if
  Matched regions are long
  They cover most of the aligning sequences
  Many gaps are present
This is very subjective
The scoring matrix will give a global alignment (GA) if it
  Gives an average positive score to each aligned position
  Has a small gap penalty
The scoring matrix will give a local alignment (LA) if it
  Gives an average negative value to mismatched positions
  Has a large gap penalty
Introduction to Bioinformatics
Lecture 7
Why Multiple Sequence Alignment?
Up until now we have only tried to align two sequences.
A faint similarity between two sequences becomes significant if it is present in many sequences
Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal
Multiple Sequence Alignment: Approaches
Optimal Global Alignments - generalization of dynamic programming
  Find the alignment that maximizes a score function
  Computationally expensive: time grows as the product of the sequence lengths
Global Progressive Alignments - match closely related sequences first, using a guide tree
Global Iterative Alignments - multiple re-building attempts to find the best alignment
Local alignments
  Profile analysis
  Block analysis
  Pattern searching and/or statistical methods
Global msa: Challenges
Computationally Expensive
If the msa includes matches, mismatches and gaps, and also accounts for the degree of variation, then global msa can be applied to only a few sequences
Difficult to score
  Multiple comparisons are necessary in each column of the msa for a cumulative score
  Placement of gaps and scoring of substitutions is more difficult
Difficulty increases with diversity
  Relatively easy for a set of closely related sequences
  Identifying the correct ancestry relationships for a set of distantly related sequences is more challenging
  Even more difficult if some members are more alike than others
Multiple Alignment: Dynamic Programming*********
s(i,j,k) = max of:
  s(i-1, j-1, k-1) + d(vi, wj, uk)     cube diagonal:  no in/dels
  s(i-1, j-1, k  ) + d(vi, wj, _ )     face diagonals: one in/del
  s(i-1, j,   k-1) + d(vi, _,  uk)
  s(i,   j-1, k-1) + d(_,  wj, uk)
  s(i-1, j,   k  ) + d(vi, _,  _ )     edge diagonals: two in/dels
  s(i,   j-1, k  ) + d(_,  wj, _ )
  s(i,   j,   k-1) + d(_,  _,  uk)

d(x, y, z) is an entry in the 3-D scoring matrix
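The seven-way recurrence above can be coded directly for three short sequences. A toy Python sketch; the sum-of-pairs column score d (+1 match, 0 mismatch, -1 per pair involving a gap, gap-gap pairs also penalized) is a simplifying assumption:

```python
# Toy 3-sequence dynamic programming following the seven-way recurrence.
# Column score d() is sum-of-pairs with assumed values: +1 match pair,
# 0 mismatch pair, -1 for any pair involving a gap (including gap-gap).
import itertools

def d(x, y, z, match=1, mismatch=0, gap=-1):
    total = 0
    for p, q in ((x, y), (x, z), (y, z)):
        if p == '-' or q == '-':
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total

def msa3(v, w, u):
    n1, n2, n3 = len(v), len(w), len(u)
    s = {(0, 0, 0): 0}
    for i, j, k in itertools.product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        cands = []
        if i and j and k:   # cube diagonal: no in/dels
            cands.append(s[i-1, j-1, k-1] + d(v[i-1], w[j-1], u[k-1]))
        if i and j:         # face diagonals: one in/del
            cands.append(s[i-1, j-1, k] + d(v[i-1], w[j-1], '-'))
        if i and k:
            cands.append(s[i-1, j, k-1] + d(v[i-1], '-', u[k-1]))
        if j and k:
            cands.append(s[i, j-1, k-1] + d('-', w[j-1], u[k-1]))
        if i:               # edge diagonals: two in/dels
            cands.append(s[i-1, j, k] + d(v[i-1], '-', '-'))
        if j:
            cands.append(s[i, j-1, k] + d('-', w[j-1], '-'))
        if k:
            cands.append(s[i, j, k-1] + d('-', '-', u[k-1]))
        s[i, j, k] = max(cands)
    return s[n1, n2, n3]

print(msa3("ATG", "ATG", "ATG"))  # 3 columns x 3 matching pairs = 9
```

The cube has (n1+1)(n2+1)(n3+1) cells with 7 candidates each, which is exactly why time grows as the product of the sequence lengths.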
Introduction to Bioinformatics
Lecture 8
Sensitivity and Selectivity***
Sensitivity: the percentage of homologs that are
identified by the database search
(true positives) / (all positives)
Selectivity: the percentage of non-homologs that
are not identified as homologs
(true negatives) / (all negatives)
For sequence database similarity search methods,
there is usually a trade-off between sensitivity and
selectivity
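The two ratios translate directly into code; the counts below are made-up toy numbers, not results from any real search:

```python
# Sensitivity and selectivity of a database search (toy counts, assumed).

def sensitivity(tp, fn):
    # fraction of the true homologs that the search identifies
    return tp / (tp + fn)

def selectivity(tn, fp):
    # fraction of the non-homologs correctly not reported as homologs
    return tn / (tn + fp)

print(sensitivity(tp=40, fn=10))   # 0.8
print(selectivity(tn=900, fp=50))  # 900/950
```

Lowering a score threshold raises sensitivity (more true homologs pass) while lowering selectivity (more non-homologs pass), which is the trade-off described above.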
Database searching
Full dynamic-programming alignment against every database sequence is slow; instead, use faster heuristic approaches:
  FASTA [Pearson & Lipman, 1988]
  BLAST [Altschul et al., 1990]
Smith-Waterman is slower, but more sensitive
FASTA
W. R. Pearson and D. J. Lipman (1988)
FASTA was the first widely used program for sequence database similarity search
Goal: Perform fast, approximate local alignments to find
sequences in the database that are related to the query
sequence
Based on dot plot idea
Better than BLAST for nucleotide sequence search
Hashing Example
Query sequence:  WATSNANDCRICK   (word length k = 1)
Target sequence: BASEBALLANDCRICKET

Hash table of word positions in the query:
  A: 2, 6   C: 9, 12   D: 8   I: 11   K: 13   N: 5, 7   R: 10   S: 4   T: 3   W: 1

Target table of word positions:
  A: 2, 6, 9   B: 1, 5   C: 12, 15   D: 11   E: 4, 17   I: 14   K: 16   L: 7, 8   N: 10   R: 13   S: 3   T: 18

For every word shared by query and target, compute the offset
(query position - target position):
  A: 0, -4, -7, 4, 0, -3   C: -3, -6, 0, -3   D: -3   I: -3   K: -3   N: -5, -3   R: -3   S: 1   T: -15

The offset -3 occurs eight times: the words A, N, D, C, R, I, C, K all lie on
the same diagonal of the dot plot, revealing the shared region ANDCRICK.
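The whole hashing step fits in a few lines of Python; a sketch using k and the sequences from the example above:

```python
# Sketch of the FASTA word-hashing step (k = 1): build a position table for
# the query, then count query-minus-target offsets; offsets that occur many
# times mark a run of identities on one diagonal of the dot plot.
from collections import Counter, defaultdict

def common_offsets(query, target, k=1):
    table = defaultdict(list)                 # word -> positions in query (1-based)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i + 1)
    offsets = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], []):
            offsets[i - (j + 1)] += 1         # offset = query pos - target pos
    return offsets

offsets = common_offsets("WATSNANDCRICK", "BASEBALLANDCRICKET")
print(offsets.most_common(3))  # offset -3 dominates: the shared word ANDCRICK
```

In the real program the best few diagonals found this way are then rescored and joined before a final banded dynamic-programming alignment.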
Introduction to Bioinformatics
Lecture 9
A Markov Chain Model
Nucleotide frequencies in the human genome (%):
  A: 29.5   C: 20.4   G: 20.5   T: 29.6
Markov Chain Model: Definition
a Markov chain model is defined by
a set of states
some states emit symbols
other states are silent
(e.g. the begin and end states)
a set of transitions with associated
probabilities
the transitions emanating from a given state
define a distribution over the possible next
states
Markov Chain Model: Property
given some sequence x of length L, we can ask how
probable the sequence is given our model
for any probabilistic model of sequences, we can write
this probability as

  Pr(x) = Pr(x_L, x_{L-1}, ..., x_1)
        = Pr(x_L | x_{L-1}, ..., x_1) Pr(x_{L-1} | x_{L-2}, ..., x_1) ... Pr(x_1)

key property of a (1st order) Markov chain: the
probability of each x_i depends only on the value of x_{i-1}

  Pr(x) = Pr(x_L | x_{L-1}) Pr(x_{L-1} | x_{L-2}) ... Pr(x_2 | x_1) Pr(x_1)
        = Pr(x_1) * prod_{i=2..L} Pr(x_i | x_{i-1})
The Probability of a Sequence for a Given
Markov Chain Model
Pr(cggt) = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g) Pr(end|t)
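The chain rule above translates directly into code. A sketch; the uniform transition table is an illustrative assumption, and the Pr(end|t) factor is omitted for simplicity:

```python
# Probability of a sequence under a first-order Markov chain, as in Pr(cggt).
# The transition table here is a uniform placeholder (an assumption), the
# initial distribution uses the human genome base frequencies from the slide.

def seq_prob(x, init, trans):
    p = init[x[0]]                      # Pr(x1)
    for a, b in zip(x, x[1:]):
        p *= trans[a][b]                # Pr(x_i | x_{i-1})
    return p

init = {'a': 0.295, 'c': 0.204, 'g': 0.205, 't': 0.296}
trans = {b: {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25} for b in 'acgt'}

print(seq_prob("cggt", init, trans))   # 0.204 * 0.25 * 0.25 * 0.25
```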
Markov Chain Model: Notation
the transition parameters can be denoted by a(x_{i-1}, x_i), where

  a(x_{i-1}, x_i) = Pr(x_i | x_{i-1})

similarly we can denote the probability of a sequence x as

  Pr(x) = a(B, x_1) * prod_{i=2..L} a(x_{i-1}, x_i)
        = Pr(x_1) * prod_{i=2..L} Pr(x_i | x_{i-1})

where a(B, x_1) represents the transition from the begin state
HMM:
Goal: Find the most likely explanation (hidden state path) for the
observed variables
CpG Islands
Written CpG to
distinguish from
a CG base pair
CpG dinucleotides are rarer than would be expected
from the independent probabilities of C and G.
Reason: When CpG occurs, C is typically chemically
modified by methylation and there is a relatively high
chance of methyl-C mutating into T
A CpG island is a region where CpG dinucleotides
are much more abundant than elsewhere.
High CpG frequency may be biologically significant;
e.g., may signal promoter region (start of a gene).
Markov Chain for Discrimination
Parameters estimated for the '+' and '-' models from
human sequences containing 48 CpG islands
(about 60,000 nucleotides)
Transition probabilities calculated for both models
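A sketch of the discrimination idea: score a sequence by summing per-transition log-odds between the '+' and '-' models. Only two transition values are included here, and they are assumed for illustration rather than taken from the slide's trained tables:

```python
# Discrimination with two Markov chains: log-odds score of a sequence,
# sum over transitions of log2( a+(s,t) / a-(s,t) ).
# The two transition values below are assumed for illustration only.
from math import log2

a_plus  = {('c', 'g'): 0.274, ('g', 'c'): 0.339}   # '+' (CpG island) model
a_minus = {('c', 'g'): 0.078, ('g', 'c'): 0.246}   # '-' (background) model

def log_odds(x, plus, minus):
    # > 0 favours the '+' model, < 0 favours the '-' model
    return sum(log2(plus[p] / minus[p]) for p in zip(x, x[1:]))

score = log_odds("cgcg", a_plus, a_minus)
print(score)   # positive: looks CpG-island-like under these numbers
```

Because c->g transitions are much more probable in the '+' model, any CpG-rich stretch accumulates a large positive score.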
The occasionally dishonest casino
A casino uses a fair die most of the time, but
occasionally switches to a loaded one
Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10,
Prob(6) = 1/2
These are the emission probabilities
Transition probabilities
Prob(Fair → Loaded) = 0.01
Prob(Loaded → Fair) = 0.2
Transitions between states obey a Markov process
An HMM for the occasionally dishonest casino

  Fair state:    emission probabilities e_k(b): 1, 2, ..., 6 each 1/6
  Loaded state:  emission probabilities: 1, 2, ..., 5 each 1/10;  6: 1/2
  Transition probabilities a_kl:
    Fair → Fair 0.99,      Fair → Loaded 0.01
    Loaded → Loaded 0.80,  Loaded → Fair 0.2
Three Important Questions
How likely is a given sequence?
the Forward algorithm
What is the most probable path for
generating a given sequence?
the Viterbi algorithm
How can we learn the HMM parameters
given a set of sequences?
the Baum-Welch (Forward-Backward)
algorithm
The occasionally dishonest casino

x = x1, x2, x3 = 6, 2, 6

Consider three state paths: pi(1) = FFF, pi(2) = LLL, pi(3) = LFL

Pr(x, pi(1)) = a_0F eF(6) a_FF eF(2) a_FF eF(6)
             = 0.5 x (1/6) x 0.99 x (1/6) x 0.99 x (1/6) ≈ 0.00227

Pr(x, pi(2)) = a_0L eL(6) a_LL eL(2) a_LL eL(6)
             = 0.5 x 0.5 x 0.8 x 0.1 x 0.8 x 0.5 = 0.008

Pr(x, pi(3)) = a_0L eL(6) a_LF eF(2) a_FL eL(6)
             = 0.5 x 0.5 x 0.2 x (1/6) x 0.01 x 0.5 ≈ 0.0000417
The Viterbi Algorithm

Initialization (i = 0):
  v_0(0) = 1,   v_k(0) = 0 for k > 0

Recursion (i = 1, ..., L): for each state k
  v_k(i) = e_k(x_i) * max_r { v_r(i-1) a_rk }

Termination:
  Pr(x, pi*) = max_k { v_k(L) a_k0 }

To find pi*, use trace-back, as in dynamic programming
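The three steps above map directly onto code. A minimal sketch for the casino HMM; the end transition a_k0 is taken as 1 here, which is an assumption for simplicity:

```python
# Minimal Viterbi (sketch) for the occasionally dishonest casino HMM.
# Assumption: the end transition a_k0 is taken as 1.

def viterbi(obs, states, start, trans, emit):
    v = [{k: start[k] * emit[k][obs[0]] for k in states}]   # initialization, i = 1
    back = []
    for x in obs[1:]:                                       # recursion
        row, ptr = {}, {}
        for k in states:
            r = max(states, key=lambda r: v[-1][r] * trans[r][k])
            row[k] = emit[k][x] * v[-1][r] * trans[r][k]
            ptr[k] = r
        v.append(row)
        back.append(ptr)
    last = max(states, key=lambda k: v[-1][k])              # termination
    path = [last]
    for ptr in reversed(back):                              # trace-back
        path.append(ptr[path[-1]])
    return list(reversed(path)), v[-1][last]

states = ('F', 'L')
start = {'F': 0.5, 'L': 0.5}
trans = {'F': {'F': 0.99, 'L': 0.01}, 'L': {'F': 0.2, 'L': 0.8}}
emit = {'F': {x: 1 / 6 for x in range(1, 7)},
        'L': {**{x: 1 / 10 for x in range(1, 6)}, 6: 1 / 2}}

path, p = viterbi([6, 2, 6], states, start, trans, emit)
print(path, p)   # most probable path is LLL, Pr ≈ 0.008
```

In practice the products are computed as sums of logs to avoid underflow on long sequences; the plain products are kept here so the numbers match the worked example that follows.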
Viterbi: Example

x = 6, 2, 6

Using v_k(i) = e_k(x_i) * max_r { v_r(i-1) a_rk } with the fair/loaded
emission and transition probabilities given earlier:

  v_F(1) = (1/6)(1/2) = 1/12
  v_L(1) = (1/2)(1/2) = 1/4

  v_F(2) = (1/6) max{ (1/12)(0.99), (1/4)(0.2) } = 0.01375
  v_L(2) = (1/10) max{ (1/12)(0.01), (1/4)(0.8) } = 0.02

  v_F(3) = (1/6) max{ (0.01375)(0.99), (0.02)(0.2) } = 0.00226875
  v_L(3) = (1/2) max{ (0.01375)(0.01), (0.02)(0.8) } = 0.008

The largest final value is v_L(3) = 0.008, so the most probable path is LLL.
THANKS A LOT...