Introduction to Bioinformatics
Lecture 5 & 6
Sequence Alignment
What is sequence alignment?
Procedure of comparing sequences
Two sequences (Pair-wise Sequence Alignment)
More than two (Multiple Sequence Alignment)
What sequences are aligned?
Match
Mismatch
Global VS Local*****
Global Alignment
Attempts to align the entire length of both sequences
Suitable for similar sequences of roughly equal length
Global alignment:
  CTGTCG-CTGCACG
  -TGC-CG-TG----
Local alignment:
  CTGTCGCTGCACG
  -------TGC-CGTG
Local Alignment
Gathers islands of matches
Stretches of sequence with the highest density of matches are aligned
Suitable for partially similar sequences, sequences of different lengths, and sequences containing conserved regions
How can we tell if the two sequences are similar?
Similarity judgments should be based on:
The types of changes or mutations that occur within
sequences.
Characteristics of those different types of mutations.
Frequency of mutations:
Substitution > Insertion, Deletion >> Duplication > Inversion
Common mutations in DNA***
Substitution:
A C G T T G A C
A C G A T G A C
Deletion:
A C G T T G A C
A C G A C
Insertion:
A C G T T G A C
A C G C A A T T G A C
Common mutations***
Duplication:
A C G T T G A C
A C G T T G A T T G A C
Inversion (double stranded DNA shown):
A C G T T G
T G C A A C
(a segment of the duplex is excised, flipped, and re-inserted; the inverted product was shown as a figure)
Terminology *****
Homolog
A gene related to a second gene by descent
from a common ancestral DNA sequence
Ortholog
Orthologs are genes in different species that
evolved from a common ancestral gene by
speciation
Paralog
Paralogs are genes related by duplication
within a genome
Terminology
Analogous: different in structure or origin, but similar in feature (function)
Xenologous: related through horizontal transfer of genetic material between species
Global Alignment *****

(worked example: the slide stepped through filling a Needleman-Wunsch score
matrix for two short DNA sequences, with gap steps of -5 per space; the
matrix figure itself is not recoverable from this text dump)

Traceback can yield both optimum alignments
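The matrix fill and traceback can be sketched in code. A minimal Needleman-Wunsch implementation in Python; the sequences and the scoring values (match +10, mismatch -2, gap -5) are illustrative assumptions, not necessarily those of the slide's worked example:

```python
# Minimal Needleman-Wunsch global alignment (sketch).
# Scoring values are illustrative assumptions: match +10, mismatch -2, gap -5.

def needleman_wunsch(a, b, match=10, mismatch=-2, gap=-5):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # diagonal: (mis)match
                              score[i - 1][j] + gap,    # up: gap in b
                              score[i][j - 1] + gap)    # left: gap in a
    # traceback from the bottom-right corner
    align_a, align_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            align_a.append(a[i - 1]); align_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            align_a.append(a[i - 1]); align_b.append('-'); i -= 1
        else:
            align_a.append('-'); align_b.append(b[j - 1]); j -= 1
    return score[n][m], ''.join(reversed(align_a)), ''.join(reversed(align_b))

s, aa, bb = needleman_wunsch("CATTCA", "CTTCA")
print(s)   # 5 matches and 1 gap: 5*10 - 5 = 45
print(aa)
print(bb)
```

Ties in the traceback are resolved diagonal-first here, which is why a matrix with several equal-scoring paths can yield more than one optimum alignment.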
Local vs Global Alignment ***
Both use the dynamic programming method
Main differences:
  The rules for calculating the scoring matrix are slightly different
  The scoring system must include negative scores for mismatches
  Only non-negative values are kept in the scoring matrix
    This has the effect of terminating the alignment
Local Alignment***

(worked example: a Smith-Waterman score matrix was shown here; the figure is
not recoverable from this text dump)

Scoring: +1 for a match, -1 for a mismatch, -5 for a space
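A minimal Smith-Waterman sketch using the scoring above (+1 match, -1 mismatch, -5 space); the test sequences are made up for illustration:

```python
# Minimal Smith-Waterman local alignment (sketch).
# Scores from the slide: +1 match, -1 mismatch, -5 space.
# Only non-negative values are kept, which lets an alignment terminate.

def smith_waterman(a, b, match=1, mismatch=-1, gap=-5):
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                         # terminate the alignment
                              score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # traceback from the best cell until a 0 is reached
    i, j = best_pos
    frag_a, frag_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            frag_a.append(a[i - 1]); frag_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            frag_a.append(a[i - 1]); frag_b.append('-'); i -= 1
        else:
            frag_a.append('-'); frag_b.append(b[j - 1]); j -= 1
    return best, ''.join(reversed(frag_a)), ''.join(reversed(frag_b))

print(smith_waterman("ACACACTA", "AGCACACA"))  # finds the shared island CACAC
```

Note the two structural differences from the global version: a 0 floor in the recurrence, and traceback starting from the highest-scoring cell rather than the corner.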
How can we know?
The alignment is global if
  Matched regions are long
  They cover most of the aligning sequences
  Many gaps are present
This is very subjective
The scoring matrix will give a global alignment (GA) if it
  Gives an average positive score to each aligned position
  Has a small gap penalty
The scoring matrix will give a local alignment (LA) if it
  Gives an average negative value to mismatched positions
  Has a large gap penalty
Introduction to Bioinformatics
Lecture 7
Why Multiple Sequence Alignment?
Up until now we have only tried to align two sequences.
A faint similarity between two sequences becomes significant if it is present in many sequences
Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal
Multiple Sequence Alignment: Approaches
Optimal Global Alignments - generalization of dynamic programming
  Find the alignment that maximizes a score function
  Computationally expensive: time grows as the product of the sequence lengths
Global Progressive Alignments - match closely related sequences first, using a guide tree
Global Iterative Alignments - multiple re-building attempts to find the best alignment
Local alignments
  Profile analysis
  Block analysis
  Pattern searching and/or statistical methods
Global msa: Challenges
Computationally Expensive
If the msa includes matches, mismatches and gaps, and also accounts for the degree of variation, then global msa can be applied to only a few sequences
Difficult to score
  Multiple comparisons are necessary in each column of the msa for a cumulative score
  Placement of gaps and scoring of substitutions is more difficult
Difficulty increases with diversity
  Relatively easy for a set of closely related sequences
  Identifying the correct ancestry relationships for a set of distantly related sequences is more challenging
  Even more difficult if some members are more alike than others
Multiple Alignment: Dynamic Programming*********
s(i,j,k) = max of:
  s(i-1, j-1, k-1) + d(vi, wj, uk)     cube diagonal:  no in/dels
  s(i-1, j-1, k  ) + d(vi, wj, _ )     face diagonals: one in/del
  s(i-1, j,   k-1) + d(vi, _,  uk)
  s(i,   j-1, k-1) + d(_,  wj, uk)
  s(i-1, j,   k  ) + d(vi, _,  _ )     edge diagonals: two in/dels
  s(i,   j-1, k  ) + d(_,  wj, _ )
  s(i,   j,   k-1) + d(_,  _,  uk)

d(x, y, z) is an entry in the 3-D scoring matrix
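The seven-way recurrence above can be coded directly for three short sequences. A toy Python sketch; the sum-of-pairs column score d (+1 match, 0 mismatch, -1 per pair involving a gap, gap-gap pairs also penalized) is a simplifying assumption:

```python
# Toy 3-sequence dynamic programming following the seven-way recurrence.
# Column score d() is sum-of-pairs with assumed values: +1 match pair,
# 0 mismatch pair, -1 for any pair involving a gap (including gap-gap).
import itertools

def d(x, y, z, match=1, mismatch=0, gap=-1):
    total = 0
    for p, q in ((x, y), (x, z), (y, z)):
        if p == '-' or q == '-':
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total

def msa3(v, w, u):
    n1, n2, n3 = len(v), len(w), len(u)
    s = {(0, 0, 0): 0}
    for i, j, k in itertools.product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        cands = []
        if i and j and k:   # cube diagonal: no in/dels
            cands.append(s[i-1, j-1, k-1] + d(v[i-1], w[j-1], u[k-1]))
        if i and j:         # face diagonals: one in/del
            cands.append(s[i-1, j-1, k] + d(v[i-1], w[j-1], '-'))
        if i and k:
            cands.append(s[i-1, j, k-1] + d(v[i-1], '-', u[k-1]))
        if j and k:
            cands.append(s[i, j-1, k-1] + d('-', w[j-1], u[k-1]))
        if i:               # edge diagonals: two in/dels
            cands.append(s[i-1, j, k] + d(v[i-1], '-', '-'))
        if j:
            cands.append(s[i, j-1, k] + d('-', w[j-1], '-'))
        if k:
            cands.append(s[i, j, k-1] + d('-', '-', u[k-1]))
        s[i, j, k] = max(cands)
    return s[n1, n2, n3]

print(msa3("ATG", "ATG", "ATG"))  # 3 columns x 3 matching pairs = 9
```

The cube has (n1+1)(n2+1)(n3+1) cells with 7 candidates each, which is exactly why time grows as the product of the sequence lengths.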
Introduction to Bioinformatics
Lecture 8
Sensitivity and Selectivity***
Sensitivity: the percentage of homologs that are
identified by the database search
(true positives) / (all positives)
Selectivity: the percentage of non-homologs that
are not identified as homologs
(true negatives) / (all negatives)
For sequence database similarity search methods,
there is usually a trade-off between sensitivity and
selectivity
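The two ratios translate directly into code; the counts below are made-up toy numbers, not results from any real search:

```python
# Sensitivity and selectivity of a database search (toy counts, assumed).

def sensitivity(tp, fn):
    # fraction of the true homologs that the search identifies
    return tp / (tp + fn)

def selectivity(tn, fp):
    # fraction of the non-homologs correctly not reported as homologs
    return tn / (tn + fp)

print(sensitivity(tp=40, fn=10))   # 0.8
print(selectivity(tn=900, fp=50))  # 900/950
```

Lowering a score threshold raises sensitivity (more true homologs pass) while lowering selectivity (more non-homologs pass), which is the trade-off described above.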
Database searching
Full dynamic-programming alignment against every database sequence is slow; instead, use faster heuristic approaches:
  FASTA [Pearson & Lipman, 1988]
  BLAST [Altschul et al., 1990]
Smith-Waterman is slower, but more sensitive
FASTA
W. R. Pearson and D. J. Lipman (1988)
FASTA was the first widely used program for sequence database similarity search
Goal: Perform fast, approximate local alignments to find
sequences in the database that are related to the query
sequence
Based on dot plot idea
Better than BLAST for nucleotide sequence search
Hashing Example
Query sequence:  WATSNANDCRICK   (word length k = 1)
Target sequence: BASEBALLANDCRICKET

Hash table of word positions in the query:
  A: 2, 6   C: 9, 12   D: 8   I: 11   K: 13   N: 5, 7   R: 10   S: 4   T: 3   W: 1

Target table of word positions:
  A: 2, 6, 9   B: 1, 5   C: 12, 15   D: 11   E: 4, 17   I: 14   K: 16   L: 7, 8   N: 10   R: 13   S: 3   T: 18

For every word shared by query and target, compute the offset
(query position - target position):
  A: 0, -4, -7, 4, 0, -3   C: -3, -6, 0, -3   D: -3   I: -3   K: -3   N: -5, -3   R: -3   S: 1   T: -15

The offset -3 occurs eight times: the words A, N, D, C, R, I, C, K all lie on
the same diagonal of the dot plot, revealing the shared region ANDCRICK.
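The whole hashing step fits in a few lines of Python; a sketch using k and the sequences from the example above:

```python
# Sketch of the FASTA word-hashing step (k = 1): build a position table for
# the query, then count query-minus-target offsets; offsets that occur many
# times mark a run of identities on one diagonal of the dot plot.
from collections import Counter, defaultdict

def common_offsets(query, target, k=1):
    table = defaultdict(list)                 # word -> positions in query (1-based)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i + 1)
    offsets = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], []):
            offsets[i - (j + 1)] += 1         # offset = query pos - target pos
    return offsets

offsets = common_offsets("WATSNANDCRICK", "BASEBALLANDCRICKET")
print(offsets.most_common(3))  # offset -3 dominates: the shared word ANDCRICK
```

In the real program the best few diagonals found this way are then rescored and joined before a final banded dynamic-programming alignment.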
Introduction to Bioinformatics
Lecture 9
A Markov Chain Model
Nucleotide frequencies in the human genome (%):
  A: 29.5   C: 20.4   G: 20.5   T: 29.6
Markov Chain Model: Definition
a Markov chain model is defined by
a set of states
some states emit symbols
other states are silent
(e.g. the begin and end states)
a set of transitions with associated
probabilities
the transitions emanating from a given state
define a distribution over the possible next
states
Markov Chain Model: Property
given some sequence x of length L, we can ask how
probable the sequence is given our model
for any probabilistic model of sequences, we can write
this probability as

  Pr(x) = Pr(x_L, x_{L-1}, ..., x_1)
        = Pr(x_L | x_{L-1}, ..., x_1) Pr(x_{L-1} | x_{L-2}, ..., x_1) ... Pr(x_1)

key property of a (1st order) Markov chain: the
probability of each x_i depends only on the value of x_{i-1}

  Pr(x) = Pr(x_L | x_{L-1}) Pr(x_{L-1} | x_{L-2}) ... Pr(x_2 | x_1) Pr(x_1)
        = Pr(x_1) * prod_{i=2..L} Pr(x_i | x_{i-1})
The Probability of a Sequence for a Given
Markov Chain Model
Pr(cggt) = Pr(c) Pr(g|c) Pr(g|g) Pr(t|g) Pr(end|t)
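The chain rule above translates directly into code. A sketch; the uniform transition table is an illustrative assumption, and the Pr(end|t) factor is omitted for simplicity:

```python
# Probability of a sequence under a first-order Markov chain, as in Pr(cggt).
# The transition table here is a uniform placeholder (an assumption), the
# initial distribution uses the human genome base frequencies from the slide.

def seq_prob(x, init, trans):
    p = init[x[0]]                      # Pr(x1)
    for a, b in zip(x, x[1:]):
        p *= trans[a][b]                # Pr(x_i | x_{i-1})
    return p

init = {'a': 0.295, 'c': 0.204, 'g': 0.205, 't': 0.296}
trans = {b: {'a': 0.25, 'c': 0.25, 'g': 0.25, 't': 0.25} for b in 'acgt'}

print(seq_prob("cggt", init, trans))   # 0.204 * 0.25 * 0.25 * 0.25
```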
Markov Chain Model: Notation
the transition parameters can be denoted by a(x_{i-1}, x_i), where

  a(x_{i-1}, x_i) = Pr(x_i | x_{i-1})

similarly we can denote the probability of a sequence x as

  Pr(x) = a(B, x_1) * prod_{i=2..L} a(x_{i-1}, x_i)
        = Pr(x_1) * prod_{i=2..L} Pr(x_i | x_{i-1})

where a(B, x_1) represents the transition from the begin state
HMM:
Goal: Find the most likely explanation (hidden state path) for the
observed variables
CpG Islands
Written CpG to
distinguish from
a CG base pair
CpG dinucleotides are rarer than would be expected
from the independent probabilities of C and G.
Reason: When CpG occurs, C is typically chemically
modified by methylation and there is a relatively high
chance of methyl-C mutating into T
A CpG island is a region where CpG dinucleotides
are much more abundant than elsewhere.
High CpG frequency may be biologically significant;
e.g., may signal promoter region (start of a gene).
Markov Chain for Discrimination
Parameters estimated for the '+' and '-' models from
human sequences containing 48 CpG islands
(about 60,000 nucleotides)
Transition probabilities calculated for both models
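A sketch of the discrimination idea: score a sequence by summing per-transition log-odds between the '+' and '-' models. Only two transition values are included here, and they are assumed for illustration rather than taken from the slide's trained tables:

```python
# Discrimination with two Markov chains: log-odds score of a sequence,
# sum over transitions of log2( a+(s,t) / a-(s,t) ).
# The two transition values below are assumed for illustration only.
from math import log2

a_plus  = {('c', 'g'): 0.274, ('g', 'c'): 0.339}   # '+' (CpG island) model
a_minus = {('c', 'g'): 0.078, ('g', 'c'): 0.246}   # '-' (background) model

def log_odds(x, plus, minus):
    # > 0 favours the '+' model, < 0 favours the '-' model
    return sum(log2(plus[p] / minus[p]) for p in zip(x, x[1:]))

score = log_odds("cgcg", a_plus, a_minus)
print(score)   # positive: looks CpG-island-like under these numbers
```

Because c->g transitions are much more probable in the '+' model, any CpG-rich stretch accumulates a large positive score.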
The occasionally dishonest casino
A casino uses a fair die most of the time, but
occasionally switches to a loaded one
Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10,
Prob(6) = 1/2
These are the emission probabilities
Transition probabilities
Prob(Fair → Loaded) = 0.01
Prob(Loaded → Fair) = 0.2
Transitions between states obey a Markov process
An HMM for the occasionally dishonest casino

  Fair state:    emission probabilities e_k(b): 1, 2, ..., 6 each 1/6
  Loaded state:  emission probabilities: 1, 2, ..., 5 each 1/10;  6: 1/2
  Transition probabilities a_kl:
    Fair → Fair 0.99,      Fair → Loaded 0.01
    Loaded → Loaded 0.80,  Loaded → Fair 0.2
Three Important Questions
How likely is a given sequence?
the Forward algorithm
What is the most probable path for
generating a given sequence?
the Viterbi algorithm
How can we learn the HMM parameters
given a set of sequences?
the Baum-Welch (Forward-Backward)
algorithm
The occasionally dishonest casino

x = x1, x2, x3 = 6, 2, 6

Consider three state paths: pi(1) = FFF, pi(2) = LLL, pi(3) = LFL

Pr(x, pi(1)) = a_0F eF(6) a_FF eF(2) a_FF eF(6)
             = 0.5 x (1/6) x 0.99 x (1/6) x 0.99 x (1/6) ≈ 0.00227

Pr(x, pi(2)) = a_0L eL(6) a_LL eL(2) a_LL eL(6)
             = 0.5 x 0.5 x 0.8 x 0.1 x 0.8 x 0.5 = 0.008

Pr(x, pi(3)) = a_0L eL(6) a_LF eF(2) a_FL eL(6)
             = 0.5 x 0.5 x 0.2 x (1/6) x 0.01 x 0.5 ≈ 0.0000417
The Viterbi Algorithm

Initialization (i = 0):
  v_0(0) = 1,   v_k(0) = 0 for k > 0

Recursion (i = 1, ..., L): for each state k
  v_k(i) = e_k(x_i) * max_r { v_r(i-1) a_rk }

Termination:
  Pr(x, pi*) = max_k { v_k(L) a_k0 }

To find pi*, use trace-back, as in dynamic programming
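The three steps above map directly onto code. A minimal sketch for the casino HMM; the end transition a_k0 is taken as 1 here, which is an assumption for simplicity:

```python
# Minimal Viterbi (sketch) for the occasionally dishonest casino HMM.
# Assumption: the end transition a_k0 is taken as 1.

def viterbi(obs, states, start, trans, emit):
    v = [{k: start[k] * emit[k][obs[0]] for k in states}]   # initialization, i = 1
    back = []
    for x in obs[1:]:                                       # recursion
        row, ptr = {}, {}
        for k in states:
            r = max(states, key=lambda r: v[-1][r] * trans[r][k])
            row[k] = emit[k][x] * v[-1][r] * trans[r][k]
            ptr[k] = r
        v.append(row)
        back.append(ptr)
    last = max(states, key=lambda k: v[-1][k])              # termination
    path = [last]
    for ptr in reversed(back):                              # trace-back
        path.append(ptr[path[-1]])
    return list(reversed(path)), v[-1][last]

states = ('F', 'L')
start = {'F': 0.5, 'L': 0.5}
trans = {'F': {'F': 0.99, 'L': 0.01}, 'L': {'F': 0.2, 'L': 0.8}}
emit = {'F': {x: 1 / 6 for x in range(1, 7)},
        'L': {**{x: 1 / 10 for x in range(1, 6)}, 6: 1 / 2}}

path, p = viterbi([6, 2, 6], states, start, trans, emit)
print(path, p)   # most probable path is LLL, Pr ≈ 0.008
```

In practice the products are computed as sums of logs to avoid underflow on long sequences; the plain products are kept here so the numbers match the worked example that follows.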
Viterbi: Example

x = 6, 2, 6

Using v_k(i) = e_k(x_i) * max_r { v_r(i-1) a_rk } with the fair/loaded
emission and transition probabilities given earlier:

  v_F(1) = (1/6)(1/2) = 1/12
  v_L(1) = (1/2)(1/2) = 1/4

  v_F(2) = (1/6) max{ (1/12)(0.99), (1/4)(0.2) } = 0.01375
  v_L(2) = (1/10) max{ (1/12)(0.01), (1/4)(0.8) } = 0.02

  v_F(3) = (1/6) max{ (0.01375)(0.99), (0.02)(0.2) } = 0.00226875
  v_L(3) = (1/2) max{ (0.01375)(0.01), (0.02)(0.8) } = 0.008

The largest final value is v_L(3) = 0.008, so the most probable path is LLL.
THANKS A LOT...