Who needs to study Bioinformatics?
•What is Bioinformatics?
“Bioinformatics is about searching biological databases, comparing
sequences, looking at protein structures, and (more generally) asking
biological questions with a computer”
•Introduced by French scientist Jean-Michel Claverie in the late 1980s
(“bioinformatique”)
•Saves you months of work!
Before the era of Bioinformatics
• Only two ways to perform experiments:
1. In vivo
2. In vitro
• We are now in the age of in silico biology!
Bioinformatics is a must do!
Bioinformatics in context
[Figure: Bioinformatics at the centre, surrounded by related fields:
Genomics, Mathematics/computer science, Molecular biology, Biophysics,
Molecular evolution, and Ethical, legal, and social implications]
What does this mean?
• Think of Bioinformatics as a tool!
• Now you are equipped with computational tools to answer biological
questions
The biological foundations of Bioinformatics
• Proteins and Nucleic acids
• Proteins are made up of amino acids
while nucleic acids are made up of
nucleotides
• How best to represent proteins and
nucleic acids?
• Need a formula to describe their
composition
• The identity of the protein is determined
from the composition and the precise
order of amino acids it contains
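A minimal sketch of this idea, assuming a protein is written as a string of one-letter amino acid codes. The two peptides and the `composition` helper are made up for illustration. They share the same composition but differ in order, so they are different proteins.

```python
from collections import Counter


def composition(sequence: str) -> Counter:
    """Count how many times each residue letter occurs in a sequence."""
    return Counter(sequence.upper())


# Two hypothetical peptides: same amino acids, different order.
peptide_a = "MKVLAG"
peptide_b = "GALVKM"

same_composition = composition(peptide_a) == composition(peptide_b)
same_sequence = peptide_a == peptide_b

print(same_composition)  # True
print(same_sequence)     # False
```

Composition alone is not enough to identify a protein. The order of the letters matters too.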
The Birth of Bioinformatics
• Protein sequences started to accumulate in the 1960s
• People started manual comparisons (pre-computer era)
• With the advent of computers, people started to write algorithms from
scratch to analyze “sequence data”
• This was the genesis of bioinformatics
“The holy grail of Bioinformatics”
[Figure: raw DNA sequence
GCTCCTCACTGTCTGTGTTTATTCTTTTAGCTTCTTCAGA
TCTTTTAGTCTGAGGAAGCCTGGCATGTGCAAATGAAG
TTAACCTAA...]
• > 500,000 genes sequenced
• Expected number of unique protein structures: ~ 700-1,000
The core of Bioinformatics to date
•Relationships between sequence, 3D structure, and protein functions
[Figure: an example protein sequence, shown with its 3D structure and
protein functions]
TDQAAFDTNIVTLTRFVM
EQGRKARGTGEMTQLLNS
LCTAVKAISTAVRKAGIA
HLYGIAGSTNVTGDQVKK
LDVLSNDLVINVLKSSFA
TCVLVTEEDKNAIIVEPE
KRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAA
GYALYGSATMLV
•Properties and evolution of genes, genomes, proteins, metabolic
pathways in cells
•Use of this knowledge for prediction, modelling, and design
From sequence to structure
• Proteins adopt a three-dimensional (3D) structure, which is
functionally important
• Structure is determined by the composition and order of amino acids
in that protein
1. Hydrophobic amino acids (e.g., Valine, Leucine) do not want to be
on the surface
2. Hydrophilic amino acids (e.g., Serine) love to be on the surface to
interact with water
3. Structure is also affected by the electric charge on some residues
and by their size
In Short!
• Proteins have a unique order and composition of amino acids, simply
referred to as the ‘sequence’
• Sequence determines the 3D shape of the protein, simply referred to as
the ‘structure’
• Structure determines the molecular activities of proteins, simply referred
to as the ‘function’
• Sequence -> Structure -> Function (but not always!)
What about DNA & RNA?
• DNA & RNA are made up of
nucleotide chains
• Nucleotides consist of a sugar, a phosphate
group, and one of five nitrogenous bases
• Adenine, Guanine, Cytosine,
Thymine, and Uracil, or simply A, G,
C, T, and U (T occurs in DNA, U in RNA)
What should be cheaper and faster? DNA/RNA
or protein sequencing?
DNA/RNA sequencing is faster and cheaper, in part because there are
fewer characters to tell apart: four nucleotides vs. twenty amino acids
What do we mean by complementarity?
T is always facing A, while G is always facing C, in a
one-to-one reciprocal relationship
How can this knowledge help us?
If we know the sequence of one strand, we can get the
sequence of the other strand
Example
• 5’-ATGCTGA-3’
• What is the complementary sequence?
• 5’-ATGCTGA-3’
• 3’-TACGACT-5’
• How is this reported?
• 5’-ATGCTGA-3’ and 5’-TCAGCAT-3’
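The example above can be sketched in a few lines of Python. This is a minimal illustration that assumes the strand is a plain DNA string containing only A, C, G, and T. The function names are chosen here for illustration.

```python
# Translation table for the pairing rule: A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the facing base for every position, read in the same direction."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the other strand written 5' to 3', as it is usually reported."""
    return complement(strand)[::-1]


print(complement("ATGCTGA"))          # TACGACT (the 3'-to-5' reading)
print(reverse_complement("ATGCTGA"))  # TCAGCAT (the 5'-to-3' reading)
```

Reversing the complement gives the second strand in the 5’ to 3’ direction, which matches how the pair is reported above.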
What is a Database?
A database is an organized collection of related information
What are the advantages of using databases?
• Easy and quick retrieval of information
• Provide backup support
Biological Databases
•Need to collect and store biological data and its associated knowledge
in databases
•Fundamental to the survival of science
Two kinds of Biological Databases
1. Primary
• Contain primary sequence information (nucleotide or protein) and associated
annotations
2. Secondary
• Summarize the results from primary databases
Primary Databases
• Nucleotide sequence databases
• Protein sequence databases
Nucleotide Sequence Databases
• GenBank
• Perhaps the best known database
• Contains all publicly available annotated DNA sequences
• Exchanges data daily with the DNA Data Bank of Japan (DDBJ) and European
Molecular Biology Laboratory (EMBL)
• Contains roughly 179 million sequence entries (Dec 2014)
• Prior submission of a sequence to GenBank/DDBJ/EMBL is a prerequisite for
publishing a new sequence in any scientific journal
• Submission is easy and can be done electronically
• Each entry has a unique id known as the “Accession Number (AN)”
Accession number
• A unique identifier of each record in the database
• Usually alpha-numeric in nature
Why do we need accession numbers?
• Common names lead to non-specific results
• A search on “Cytochrome” will output many different types of cytochromes (a,
b, c, and others)
• Cannot distinguish among species
• Search on “Insulin” will return insulin sequences from many organisms
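A small sketch of why accession numbers help, using a toy in-memory table. Everything here is an assumption for illustration: the `Record` class, the helper functions, and the `DEMO…` accession strings are placeholders, not real GenBank identifiers or a real database API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    accession: str  # unique identifier of the record
    name: str       # common name, shared by many records
    organism: str


# Hypothetical records; the accession strings are placeholders.
RECORDS = [
    Record("DEMO0001", "insulin", "Homo sapiens"),
    Record("DEMO0002", "insulin", "Mus musculus"),
    Record("DEMO0003", "cytochrome c", "Homo sapiens"),
    Record("DEMO0004", "cytochrome b", "Homo sapiens"),
]

# Index keyed by accession: each key maps to exactly one record.
BY_ACCESSION = {record.accession: record for record in RECORDS}


def search_by_name(term: str) -> list:
    """Return every record whose common name contains the search term."""
    term = term.lower()
    return [record for record in RECORDS if term in record.name.lower()]


def fetch_by_accession(accession: str) -> Record:
    """Return the single record stored under this accession."""
    return BY_ACCESSION[accession]


print(len(search_by_name("insulin")))           # 2
print(len(search_by_name("cytochrome")))        # 2
print(fetch_by_accession("DEMO0002").organism)  # Mus musculus
```

A name search returns several candidates, while an accession lookup returns exactly one record.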
Example GenBank Entry
Secondary Databases
PROSITE
• Sometimes a newly sequenced protein gives no hits to sequence
databases
• How do we determine its function then?
“In some cases, the structure and function of an unknown protein which is
too distantly related to any protein of known structure to detect its affinity
by overall sequence alignment may be identified by its possession of a
particular cluster of residue types classified as a motif. The motifs, or
templates, or fingerprints, arise because of particular requirements of
binding sites that impose very tight constraint on the evolution of portions
of a protein sequence” - A. M. Lesk, 1988
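A motif search of this kind can be sketched with a regular expression. This is a minimal illustration, not how PROSITE itself is implemented. The converter below handles only a simplified subset of the PROSITE pattern syntax. The pattern shown is the one commonly given for N-glycosylation sites, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE pattern into a Python regex.

    Handled: 'x' (any residue), [ABC] (one of), {ABC} (none of),
    and repeats such as x(2) or x(2,4). Anchors '<' and '>' are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        repeat = ""
        if "(" in element:
            # Split "x(2,4)" into the element "x" and the count "2,4".
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"
        else:
            core = element  # a single letter or [ABC] is already valid regex
        parts.append(core + repeat)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> list:
    """Return (1-based position, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


N_GLYCOSYLATION = "N-{P}-[ST]-{P}"   # N, not P, S or T, not P
made_up_sequence = "MKNATGNPSANGSL"  # hypothetical sequence for illustration

print(prosite_to_regex(N_GLYCOSYLATION))              # N[^P][ST][^P]
print(find_motif(N_GLYCOSYLATION, made_up_sequence))  # [(3, 'NATG'), (11, 'NGSL')]
```

The N at position 7 is skipped because it is followed by P. A short pattern like this can flag a candidate site even when the whole sequence gives no database hits.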