
Introduction To Bioinformatics

Bioinformatics is a field that combines biology and computer science to analyze biological macromolecules, with significant milestones including the creation of protein databases and algorithms for sequence alignment. It has applications in drug design, personalized medicine, and genome analysis, relying on experimental data for interpretation. The document also discusses sequence alignment methods, phylogenetics, and the importance of identifying homologous sequences in understanding evolutionary relationships.

Uploaded by

vasumathimaan
Introduction to Bioinformatics

Bioinformatics
• Bioinformatics, which will be more clearly defined below,
is the discipline of quantitative analysis of information
relating to biological macromolecules with the aid of
computers.
• The first major bioinformatics project was undertaken by
Margaret Dayhoff in 1965, who developed the first protein
sequence database, called Atlas of Protein Sequence and
Structure.
• Subsequently, in the early 1970s, the Brookhaven
National Laboratory established the Protein Data Bank
for archiving three-dimensional protein structures.
Bioinformatics
• The first sequence alignment algorithm was developed by
Needleman and Wunsch in 1970.
• The first protein structure prediction algorithm was developed
by Chou and Fasman in 1974.
• The 1980s saw the establishment of GenBank and the
development of fast database searching algorithms such as
FASTA by William Pearson, followed in 1990 by BLAST by
Stephen Altschul and coworkers.
• The start of the human genome project in the late 1980s
provided a major boost for the development of bioinformatics.
• The development and the increasingly widespread use of the
Internet in the 1990s made instant access to, and exchange and
dissemination of, biological data possible.
Bioinformatics
• Luscombe et al. define bioinformatics as a
union of biology and informatics:
– bioinformatics involves the technology that uses
computers for storage, retrieval, manipulation,
and distribution of information related to
biological macromolecules such as DNA, RNA, and
proteins.
Applications and Limitations
• The main applications are knowledge-based drug
design, forensic DNA analysis, personalized
medicine, genome analysis, and agricultural
biotechnology.
• Bioinformatics depends on experimental
science to produce raw data for analysis.
• It, in turn, provides useful interpretation of
experimental data and important leads for
further experimental research.
What is sequence alignment?

Alignment: Comparing two (pairwise) or
more (multiple) sequences. Searching for
a series of identical or similar characters
in the sequences.

MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
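
The match line in the example above can be recomputed from the two rows. The following is a minimal sketch, assuming both rows are already aligned to equal length and that "-" marks a gap; the function names are illustrative and not part of the original slides.

```python
def match_line(top: str, bottom: str) -> str:
    """Return a string with '|' under identical residues and ' ' elsewhere."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if x == y and x != "-" else " " for x, y in zip(top, bottom)
    )


def percent_identity(top: str, bottom: str) -> float:
    """Percentage of gap-free columns whose two residues are identical."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    columns = [(x, y) for x, y in zip(top, bottom) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(x == y for x, y in columns)
    return 100.0 * identical / len(columns)


seq1 = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
seq2 = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
print(seq1)
print(match_line(seq1, seq2))
print(seq2)
print(round(percent_identity(seq1, seq2), 1))
```

Percent identity is only one possible summary; it ignores conservative replacements, which substitution matrices account for.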
Why perform a pairwise sequence
alignment?

Finding homology between two sequences,
e.g., predicting characteristics of a protein,
premised on:

similar sequence (or structure) → similar function
Local vs. Global
• Local alignment – finds regions of high
similarity in parts of the sequences (Smith
and Waterman, 1981)
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN CDRYYQ

• Global alignment – finds the best alignment
across the entire two sequences (Needleman
and Wunsch, 1970)
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
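
The difference between the two approaches can be sketched with dynamic programming. The code below is a simplified illustration, not the authors' original formulations: it assumes a linear gap penalty and the hypothetical scores match = +1, mismatch = -1, gap = -1, and it returns only the optimal score, not the alignment itself.

```python
def global_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: leading gaps in a
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)           # column 0: leading gaps in b
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,        # align a[i-1] with b[j-1]
                         prev[j] + gap,          # gap in b
                         cur[j - 1] + gap)       # gap in a
        prev = cur
    return prev[-1]


def local_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Smith-Waterman style score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


x, y = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", global_score(x, y))
print("local: ", local_score(x, y))
```

With the same scoring parameters, the local score is never lower than the global score, because a local alignment is free to drop poorly matching flanks.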
Evolutionary changes in sequences

Three types of nucleotide changes:
1. Substitution – a replacement of one (or more)
sequence characters by another: AAGA → AACA
2. Insertion – an insertion of one (or more)
sequence characters: AAGA → AAGTA
3. Deletion – a deletion of one (or more)
sequence characters from AAGA

Insertion + Deletion → Indel
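
The three change types can be expressed as simple string operations. This is an illustrative sketch only; the helper names and 0-based positions are assumptions made for the example.

```python
def substitute(seq: str, pos: int, new: str) -> str:
    """Replace the character at 0-based position pos with new."""
    return seq[:pos] + new + seq[pos + 1:]


def insert(seq: str, pos: int, extra: str) -> str:
    """Insert extra before 0-based position pos."""
    return seq[:pos] + extra + seq[pos:]


def delete(seq: str, pos: int, length: int = 1) -> str:
    """Remove length characters starting at 0-based position pos."""
    return seq[:pos] + seq[pos + length:]


ancestor = "AAGA"
print(substitute(ancestor, 2, "C"))  # G replaced by C
print(insert(ancestor, 3, "T"))      # T inserted before the last A
print(delete(ancestor, 2))           # G removed
```

When only the two present-day sequences are known, an insertion in one cannot be distinguished from a deletion in the other, which is why both are grouped as indels.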
Choosing an alignment:
• Many different alignments between two
sequences are possible:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

AAGCTGAATT-C-GAA      A-AGCTGAATTC--GAA
AGGCT-CATTTCTGA-      AG-GCTCA-TTTCTGA-
...

How do we determine which is the best alignment?


Toy exercise
Compute the scores of each of the
following alignments using this naïve
scoring scheme
Scoring scheme:
• Match: +1
• Mismatch: -2
• Indel: -1

Substitution matrix:
     A   C   G   T
A    1  -2  -2  -2
C   -2   1  -2  -2
G   -2  -2   1  -2
T   -2  -2  -2   1

Gap penalty (opening = extending)

AAGCTGAATT-C-GAA A-AGCTGAATTC--GAA
AGGCT-CATTTCTGA- AG-GCTCA-TTTCTGA-
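
The exercise can be checked with a few lines of code. This is a minimal sketch of the naïve scheme stated above (match +1, mismatch -2, indel -1, with gap opening equal to gap extension); the function name is illustrative.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 1, mismatch: int = -2, indel: int = -1) -> int:
    """Sum the column scores of a pairwise alignment ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += indel      # every gap position costs the same
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


alignments = [
    ("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"),
    ("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"),
]
for top, bottom in alignments:
    print(top, bottom, score_alignment(top, bottom))
```

Under this scheme the first alignment scores 2 and the second scores -1, so the first is preferred. A different scheme could reverse the ranking.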
Substitution matrices: accounting for
biological context

Which best reflects the biological reality
regarding nucleotide mismatch penalty?
Tr = Transition
Tv = Transversion

1. Tr > Tv > 0
2. Tv > Tr > 0
3. 0 > Tr > Tv
4. 0 > Tv > Tr
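
Transitions (purine to purine, or pyrimidine to pyrimidine) are generally observed more often than transversions (purine to pyrimidine, or the reverse), so a context-aware scheme usually penalizes them less. The sketch below classifies a nucleotide pair and applies hypothetical penalties; the numeric values are assumptions chosen only to illustrate the ordering 0 > Tr > Tv.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def classify_substitution(x: str, y: str) -> str:
    """Return 'match', 'transition', or 'transversion' for two DNA bases."""
    x, y = x.upper(), y.upper()
    if x not in BASES or y not in BASES:
        raise ValueError("expected single bases from A, C, G, T")
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def mismatch_score(x: str, y: str, match: int = 1,
                   transition: int = -1, transversion: int = -2) -> int:
    """Score a base pair; the default penalties are illustrative only."""
    kind = classify_substitution(x, y)
    return {"match": match,
            "transition": transition,
            "transversion": transversion}[kind]


for pair in ("AG", "CT", "AC", "GT", "AA"):
    print(pair, classify_substitution(*pair), mismatch_score(*pair))
```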
Scoring schemes: accounting for
biological context

Which best reflects the biological reality
regarding these mismatch penalties?

1. Arg->Lys > Ala->Phe
2. Arg->Lys > Thr->Asp
3. Asp->Val > Asp->Glu
PAM 250 and BLOSUM 62 Matrices

Algorithm used for scoring: Dynamic programming

BLAST and FASTA
• BLAST and FASTA are two similarity
searching programs that identify
homologous DNA and protein sequences
based on sequence similarity.
• Both are widely used software tools
in bioinformatics.
• Both BLAST and FASTA use a heuristic
word method for fast pairwise sequence
alignment.
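
The core idea of a word method can be sketched as an exact k-mer lookup. This is a strong simplification and not the actual BLAST or FASTA implementation: the real programs also extend and score the seeds, and BLAST additionally considers similar (neighborhood) words. The function names and the word length k = 3 are assumptions.

```python
from collections import defaultdict


def index_words(seq: str, k: int) -> dict:
    """Map every length-k word in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where an identical word occurs."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


print(find_seeds("ACGTT", "TTACGA", k=3))
```

Only regions around such seeds need to be examined further, which is what makes the search fast compared with running full dynamic programming against every database sequence.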
Multiple Sequence Alignment (MSA)
Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--
Similar to pairwise alignment BUT n sequences
are aligned instead of just 2
Each row represents an individual sequence
Each column represents the ‘same’ position
Why perform an MSA?
MSAs are at the heart of comparative
genomics studies which seek to study
evolutionary histories, functional and
structural aspects of sequences, and to
understand phenotypic differences
between species
Multiple sequence alignment
Seq1 VTISCTGSSSNIGAG-NHVKWYQQLPG
Seq2 VTISCTGTSSNIGS--ITVNWYQQLPG
Seq3 LRLSCSSSGFIFSS--YAMYWVRQAPG
Seq4 LSLTCTVSGTSFDD--YYSTWVRQPPG
Seq5 PEVTCVVVDVSHEDPQVKFNWYVDG--
Seq6 ATLVCLISDFYPGA--VTVAWKADS--
Seq7 AALGCLVKDYFPEP--VTVSWNSG---
Seq8 VSLTCLVKGFYPSD--IAVEWWSNG--

variable        conserved
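
Variable and conserved columns can be located by counting residues per column. The sketch below uses the eight aligned sequences from the slide; the conservation measure (frequency of the most common residue, with gaps counted against conservation) is one simple choice among many and is an assumption of this example.

```python
from collections import Counter

MSA = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]


def column_conservation(msa: list) -> list:
    """Fraction of sequences sharing the most common residue in each column."""
    fractions = []
    for column in zip(*msa):
        residues = [c for c in column if c != "-"]
        if not residues:
            fractions.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        fractions.append(top_count / len(column))
    return fractions


conservation = column_conservation(MSA)
fully_conserved = [i for i, f in enumerate(conservation) if f == 1.0]
# 0-based column indices where all eight sequences share the same residue
print(fully_conserved, [MSA[0][i] for i in fully_conserved])
```

In this alignment only the cysteine and tryptophan columns are identical in all eight sequences.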
Definition and scope
• Simultaneous alignment of several
sequences to identify common
motifs, regions, or patterns in protein or DNA
sequences
• Scope: detection of shared regions of
homology, i.e., homologous residues
among a set of sequences are aligned
together in a column
Types of MSA
• Global MSA

For sequences of uniform length with shared similarity throughout
• Local MSA

For large numbers of sequences, varied
lengths and sub-sequence similarity
Structural-Evolutionary perspective in MSA

Criterion 1: Homologous residues occupy
similar positions in 3D space, i.e., the more similar
the structure, the more similar the sequence.

Criterion 2: All sequences diverged from a common
ancestor, e.g., members of a gene family
Significance of sequence
relatedness
Based on how related the sequences are:
Case 1: An alignment from a set of similar
sequences will generally be unambiguous
(perfect) - ?
Case 2: In a family of proteins sharing
perhaps only 30% average pairwise
identity - ?
How does MSA remain relevant even for less similar
sequences?

Core structural elements tend to be
conserved and get meaningfully aligned

But other regions may not be perfectly
aligned because of structural evolution and
sequence divergence
Applications
Characterization of gene families
Identifying conserved motifs in promoters
Phylogenetics
Used for building 3D models of protein
structures
Selection of templates
Prediction of conserved secondary structures, which in turn
aids threading-based structure prediction
Alignment methods

• Progressive/hierarchical alignment
(ClustalX)
• Iterative alignment (MAFFT, MUSCLE)
Phylogenetic Trees
MOLECULAR EVOLUTION AND
MOLECULAR PHYLOGENETICS
• Evolution can be defined as the development of a biological
form from other preexisting forms, or from its origin to the current
existing form, through natural selection and modification.

• The underlying mechanism of evolution is genetic mutations that
occur spontaneously.

• Phylogenetics is the study of the evolutionary history of living
organisms using tree-like diagrams to represent pedigrees of
these organisms.

• The tree branching patterns representing the evolutionary
divergence are referred to as phylogeny.
• Molecular data that are in the form of DNA or protein sequences can also
provide very useful evolutionary perspectives of existing organisms because, as
organisms evolve, the genetic materials accumulate mutations over time
causing phenotypic changes.

• Because genes are the medium for recording the accumulated mutations,
they can serve as molecular fossils.
• The field of molecular phylogenetics can be
defined as the study of evolutionary relationships
of genes and other biological macromolecules by
analyzing mutations at various positions in their
sequences and developing hypotheses about the
evolutionary relatedness of the biomolecules.
• Based on the sequence similarity of the
molecules, evolutionary relationships between
the organisms can often be inferred.
• The lines in the tree are called branches.
• At the tips of the branches are present-day species
or sequences known as taxa (the singular form is
taxon) or operational taxonomic units.
• The connecting point where two adjacent branches
join is called a node, which represents an inferred
ancestor of extant taxa.
• The bifurcating point at the very bottom of the tree
is the root node, which represents the common
ancestor of all members of the tree.
• A group of taxa descended from a single common ancestor is
defined as a clade or monophyletic group.
• In a monophyletic group, two taxa share a unique common
ancestor not shared by any other taxa.
• They are also referred to as sister taxa to each other (e.g., taxa B
and C).
• The branch path depicting an ancestor–descendant relationship
on a tree is called a lineage, which is often synonymous with a
tree branch leading to a defined monophyletic group.
• When a number of taxa share more than one closest common
ancestor, they do not fit the definition of a clade. Such a group
is called paraphyletic (e.g., taxa B, C, and D).
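
Whether a group of taxa forms a clade can be tested mechanically. The sketch below represents a rooted tree as nested tuples and checks whether a set of taxa equals the full leaf set of some subtree. The example tree is hypothetical (the figure from the original slide is not reproduced here), as are the function names.

```python
def leaves(tree) -> frozenset:
    """Return the set of taxon names under a (sub)tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every subtree, including single taxa."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, taxa) -> bool:
    """True if taxa are exactly the descendants of one node in the tree."""
    return frozenset(taxa) in set(clades(tree))


# Hypothetical rooted tree: A is sister to the rest; (B, C) and (D, E) are pairs.
TREE = ("A", (("B", "C"), ("D", "E")))

print(is_monophyletic(TREE, {"B", "C"}))            # sister taxa
print(is_monophyletic(TREE, {"B", "C", "D"}))       # leaves out E
print(is_monophyletic(TREE, {"B", "C", "D", "E"}))  # complete subtree
```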
• An unrooted phylogenetic tree does not
assume knowledge of a common ancestor, but
only positions the taxa to show their relative
relationships.
• In a rooted tree, all the sequences under study
have a common ancestor or root node from
which a unique evolutionary path leads to all
other nodes. (more informative)
FORMS OF TREE REPRESENTATION
Phylogenetic trees drawn as cladograms
(top) and phylograms (bottom).

The branch lengths are unscaled in the
cladograms and scaled in the phylograms.
The trees can be drawn in angled
form (left) or squared form (right).
The Newick format of tree representation employs a linear form of nested
parentheses within which taxa are separated by commas.

If the tree is scaled, branch lengths are indicated immediately after the taxon
name.

The numbers are relative units that represent divergence times.
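
A Newick string can be read with a short recursive parser. The sketch below handles only the basic form described above (nested parentheses, comma-separated taxa, optional ":length" after a name or a closing parenthesis); quoted labels and comments are not supported. The example string and all names are assumptions for illustration.

```python
def parse_newick(text: str) -> dict:
    """Parse a basic Newick string into nested dicts (name, length, children)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node() -> dict:
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                       # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1                   # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                       # skip ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node: dict) -> list:
    """List taxon names in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
print(leaf_names(tree))
```

For real data, an established library parser is usually a safer choice than a hand-written one.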


• True tree: is the tree which represents the actual
evolutionary path by which the current array of organisms
was created.
• Inferred tree: is the tree inferred by the evolutionary
analysis program used.
• Species tree: is a phylogenetic tree representing the
evolution of a group of species.
• When a tree is constructed from one gene from each species,
the inferred tree is sometimes called a Gene tree.
Phylogeny
• Species Tree and Gene Tree:
[Figure: species tree showing the evolutionary relationship between seven eukaryotes; gene tree for Na+-K+ ion pump membrane protein family members]
Phylogenetic Tree
[Figure: tree with labeled parts: branch, internal node, leaf (terminal node)]
THE IMPORTANCE OF IDENTIFYING
PARALOGS AND ORTHOLOGS
• Studies of protein and gene evolution involve
the comparison of homologs—sequences that
have common origins but may or may not
have common activity
• Homologs are most commonly either
orthologs, paralogs, or xenologs.
Orthologs
• Orthologs are homologs produced by
speciation. They represent genes derived from
a common ancestor that diverged due to
divergence of the organisms they are
associated with. They tend to have similar
functions.
Paralogs
• Paralogs are homologs produced by gene
duplication. They represent genes derived
from a common ancestral gene that
duplicated within an organism and then
subsequently diverged. They tend to have
different functions.
Xenologs
• Xenologs are homologs resulting from
horizontal gene transfer between two
organisms. Determining whether a
gene of interest was recently transferred into
the current host by horizontal gene transfer is
often difficult. The function tends to be similar.
Phylogenetic Tree Construction Methods and Programs

Character-Based Methods
– Maximum Likelihood: Quartet Puzzling, NJML, Genetic Algorithm, Bayesian Analysis
– Maximum Parsimony: Weighted Parsimony, Tree-Searching Methods

Distance-Based Methods
– Clustering-based methods: UPGMA, Neighbor Joining, Generalized Neighbor Joining
– Optimality-based methods: Fitch–Margoliash, Minimum Evolution

Phylogenetic Tree Evaluation Methods
– Bootstrapping, Jackknifing, Bayesian Simulation

SOFTWARE: PHYLIP, PAUP, MEGA
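
As an illustration of a clustering-based distance method, the sketch below implements a bare-bones UPGMA: it repeatedly merges the two closest clusters and recomputes distances as size-weighted averages. It returns only the tree topology as nested tuples (no branch lengths), and the distance matrix is made-up example data, not taken from the slides.

```python
def upgma(names: list, matrix: list):
    """Cluster taxa by UPGMA; returns the topology as nested tuples."""
    clusters = [(name, 1) for name in names]   # (subtree, number of leaves)
    d = [row[:] for row in matrix]             # working copy of the distances
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of clusters (first pair wins on ties).
        i, j = min(((a, b) for a in range(n) for b in range(a + 1, n)),
                   key=lambda p: d[p[0]][p[1]])
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Size-weighted average distance from the merged cluster to the others.
        new_row = [(size_i * d[i][k] + size_j * d[j][k]) / (size_i + size_j)
                   for k in keep]
        # Drop rows/columns i and j, then append the merged cluster last.
        d = [[d[r][c] for c in keep] for r in keep]
        for row, value in zip(d, new_row):
            row.append(value)
        d.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep]
        clusters.append(((tree_i, tree_j), size_i + size_j))
    return clusters[0][0]


taxa = ["A", "B", "C", "D"]
distances = [
    [0, 2, 4, 6],
    [2, 0, 4, 6],
    [4, 4, 0, 6],
    [6, 6, 6, 0],
]
print(upgma(taxa, distances))
```

UPGMA assumes a constant rate of change along all lineages; neighbor joining relaxes that assumption.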
Levels of protein Structure

α-helix, coils, and β-sheet

Primary Structure of Proteins
The Basic Block: Amino Acid
[Figure: amino acid drawn as a zwitterion, with the α-carbon (Cα) bonded to a hydrogen, the side chain R, the amino group (8.9 < pKa < 10.8), and the carboxyl group (1.7 < pKa < 2.6)]
• The 20 standard amino acids used in
proteins are grouped based on the
properties of their side chains.

• Twenty kinds of side chains varying
in size, shape, charge, hydrogen-bonding
capacity, hydrophobic
character, and chemical reactivity are
commonly found in proteins.

• The simplest one is glycine, which
has just a hydrogen atom as its side
chain.

• With two hydrogen atoms bonded to
the α-carbon atom, glycine is unique
in being achiral. Alanine, the next
simplest amino acid, has a methyl
group (-CH3) as its side chain.
Given an unknown protein, make an informed
guess on its 3D structure based on its sequence:
• Search structure databases for homologous
sequences
• Transfer coordinates of known protein onto unknown

MQEQLTDFSKVETNLISW-QGSLETVEQMEPWAGSDANSQTEAY
| |..|. ||| ... |..||.|.| | |||..|
MHQQVSDYAKVEHQWLYRVAGTIETLDNMSPANHSDAQTQAA
| = Identity
. = Homology
Fold Prediction

• Used for predicting structure from sequences
with low sequence similarity
• Remote homology prediction compares a protein
with a collection of related proteins using methods such as
– Position-Specific Iterative BLAST (PSI-BLAST)
– protein family profiles
– hidden Markov models (HMMs)
– Sequence Alignment and Modeling System (SAM)
– Support Vector Machines
Applications
• Rational design of proteins with increased
stability or novel functions
• Analysis of protein function, interactions,
antigenic behavior
• Structure-based drug design
• Structural mutagenesis, etc.
Ab Initio
• First-principles-based approaches
• Start with an unfolded conformation, usually
surrounded by solvent, and allow simulated
physical forces to fold the protein as would
normally happen in vivo.
• Methods involve highly optimized
approximate energy and statistics-based
potential functions.
