Introduction to Bioinformatics
Stephen Taylor
Computational Biology Research Group
Background
Definition
Bioinformatics is the computational analysis and storage of biological data
Derivation
bio: biology
informatique: French for data processing
Goal
To discover new biological insights using computers and biology
Other related disciplines
Computational Biology
broader term, more an approach as generic as biology itself
Chemoinformatics
study and analysis of chemical information
Medical Informatics
study, invention and implementation of structures and algorithms to improve communication, understanding and management of medical information
Mathematical Biology
more theoretical. Things which are not necessarily algorithmic, not necessarily molecular in nature, and are not necessarily useful in analyzing collected data!
What is bioinformatics?
[Figure labels: Hypothesis, Experiment, Data, Analysis, Result; data types: Sequence, Structure, Function, Evolution, Pathway, Interaction, Mutation, Expression]
Why use bioinformatics
Find an answer quickly
Most in silico biology is faster than in vitro
Massive amounts of data to analyse
Need to make use of all information
Not possible to do analysis by hand
Can't organise and store information using only lab notebooks
Automation is key
However!
All results of computer analysis should be verified by biologists
Bioinformatics databases
Public databases are the most important entity in bioinformatics
Store knowledge about
  Sequence e.g. EMBL
  Structure e.g. PDB
  Pathways e.g. KEGG
  Interactions e.g. DIP
  Diseases e.g. OMIM
  And many others
Can be searched in a variety of ways e.g. keyword, pattern, sequence
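As a rough illustration of keyword and pattern searching, the sketch below runs both against a tiny in-memory stand-in for a sequence database. The record identifiers, descriptions, sequences and function names are all made up for this example; real databases such as EMBL are queried through their own interfaces.

```python
import re

# Toy "database": identifiers, descriptions and sequences are invented for illustration.
RECORDS = {
    "seq1": ("example kinase fragment", "ATGGCCTATAAAGGCTAG"),
    "seq2": ("example promoter region", "GGGTATAAATGCCC"),
    "seq3": ("example intergenic region", "CCCCGGGGCCCC"),
}


def keyword_search(records, word):
    """Return the ids whose description contains the word (case-insensitive)."""
    word = word.lower()
    return sorted(key for key, (desc, _) in records.items() if word in desc.lower())


def pattern_search(records, pattern):
    """Return {id: [0-based start positions]} for a regular-expression motif."""
    regex = re.compile(pattern)
    hits = {}
    for key, (_, seq) in records.items():
        positions = [match.start() for match in regex.finditer(seq)]
        if positions:
            hits[key] = positions
    return hits


keyword_hits = keyword_search(RECORDS, "promoter")
# A simplified TATA-box-like motif, used here only as an example pattern.
pattern_hits = pattern_search(RECORDS, "TATA[AT]A")
```

Searching by sequence similarity needs a scoring method rather than an exact match, which is why it is usually handled by dedicated tools.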
Bioinformatics Tools
Hundreds of computer programs
Many freely available
Generally available on UNIX or Linux
Often interact with bioinformatics databases
Many accessible via the WWW
Some require very powerful computers to run on
Computational Biology Research Group provides an environment to do this
The Human Genome Project
Could not have been achieved without bioinformatics
Goals
  Determine the complete sequence of the 3 billion DNA subunits
  Discover all the human genes and make them accessible for further biological study
Need to bring together and store vast amounts of information from
  Lab equipment and experiments
  Computer analysis
  Human analysis
Make visible to the world's scientists
Central Dogma of Molecular Biology
(See http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml)
Overview HGP bioinformatics
Assemble
Analyse
Annotate
Display
Assembly
Human genome is theoretically several long strings totalling 3 billion base pairs
Assembled via hundreds of thousands of overlapping units or contigs to make a single consensus sequence
Sequences collated using information stored on ABI sequencer
Sequence assembly bioinformatics tools used to
  Automatically assemble fragments
  Hand finish using computer tools
Requires constant reassembly and rebuilds as new data comes in
E.g. PHRED/PHRAP and Staden
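The idea of merging overlapping fragments into a consensus can be sketched with a toy greedy assembler. This is not how PHRED/PHRAP or Staden work internally (they use base quality scores and far more careful alignment). It is a minimal sketch assuming error-free reads from one strand, with function names and example fragments invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to merge; remaining pieces stay as separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


# Three made-up overlapping reads from a short hypothetical sequence.
contigs = greedy_assemble(["ATGCGTA", "CGTACCT", "ACCTGA"])
```

Greedy merging can mis-join repeated regions, which is one reason real assemblies need the hand finishing and repeated rebuilds mentioned above.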
Analyse
Take the assembled string of nucleotides
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACG
TCACGACGTAGATGCTAGCTGACTCGATGCAGACTGCTA
GCTGCCAGCGACTCAGCTACGACTAGCATCGGCGCTAG
CATCGGCAGC
Find genes
Train algorithm to look for features e.g.
  Splice sites
  Start / Stop codons
  Codon frequency
  Promoters
Use existing biological information e.g. ESTs, cDNA
Build a model of gene structure
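Of the features listed, start and stop codons are the simplest to look for directly. The sketch below scans the three forward reading frames for open reading frames that run from ATG to a stop codon. It is an illustration only: real gene finders are trained statistical models that also use splice sites, codon frequency and promoters. The function name, the `min_codons` parameter and the example sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Made-up sequence with one short ORF: ATG AAA TTT TAG
example_orfs = find_orfs("CCATGAAATTTTAGCC")
```

The reverse strand could be covered by running the same scan on the reverse complement.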
Find translated protein(s)
Translate DNA to theoretical protein (5' to 3')
>Unknown Sequence
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF
DLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLR
VDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
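Translation reads the DNA three bases at a time in the 5' to 3' direction and looks each codon up in the genetic code. A minimal sketch follows, assuming the standard genetic code, a coding-strand sequence and a known reading frame; the function name, its parameters and the short example DNA are illustrative only and are not taken from the slides.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0, to_stop=False):
    """Translate a coding-strand DNA string read 5' to 3'.

    '*' marks a stop codon and 'X' marks a codon containing an unknown base.
    With to_stop=True, translation ends at the first stop codon.
    """
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


# Short made-up coding sequence: ATG GTG CTG TCT CCT GCC TAA
example_protein = translate("ATGGTGCTGTCTCCTGCCTAA")
```

When the reading frame is unknown, a common approach is to translate all three forward frames and the three frames of the reverse complement, then inspect each.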
Find Function
Major challenge in bioinformatics
Search the protein sequence vs database of proteins of known function*
Protein domains are evolutionarily conserved
Proteins that are similar in sequence across several species are likely to have a similar function
BLAST
  A query sequence
  Sequence database (protein or nucleotide)
  Inspection of significant hits
There are many other methods used to imply function!
* Many of the databases contain errors
Example
Query: Human, Unknown
Sequence database: Uniprot
Search results: Chimp, Myoglobin; Pig, Myoglobin; Mouse, Myoglobin
Putative function = Human, Myoglobin
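The reasoning in this example (query an unknown sequence, inspect the best hits, transfer their function) can be sketched with a toy similarity search. This is not BLAST: it scores sequences by the fraction of shared 3-letter words and has no alignment or statistics. The sequences, names, functions and threshold below are invented placeholders, not real database entries.

```python
def kmers(seq, k=3):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def best_hits(query, database, k=3, threshold=0.3):
    """Return (score, name, function) for entries above the threshold, best first."""
    hits = []
    for name, (function, seq) in database.items():
        score = similarity(query, seq, k)
        if score >= threshold:
            hits.append((score, name, function))
    return sorted(hits, reverse=True)


# Toy database: name -> (known function, sequence). All values are made up.
TOY_DB = {
    "toy_A": ("oxygen storage", "MKTAYIAKQRQISFVR"),
    "toy_B": ("oxygen storage", "MKTAYLAKQRQISFVK"),
    "toy_C": ("flexible linker", "GGSGGSGGSGGSGGSG"),
}

QUERY = "MKTAYIAKQRQISFVK"
hits = best_hits(QUERY, TOY_DB)
# Transfer the function of the top hit as a putative annotation, if any hit exists.
putative_function = hits[0][2] if hits else None
```

Because database entries can themselves be wrong, a function transferred this way stays a hypothesis until it is checked.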
Annotate
Results of raw gene analysis are FEATURES
Integration of features, biological rules and knowledge make ANNOTATIONS
Write these back to the database
Automates what would have taken hundreds of scientists to do
Ensembl
Ensembl Genome Browser (www.ensembl.org)
UCSC Genome Browser http://genome.ucsc.edu/
And this is just the start
Bioinformatics is now in the post-genomics era
  Transcriptomics
  Epigenomics
  Proteomics
  Comparative genomics
  Metabolic pathways
  Regulatory networks
  Virtual Cells
  Modelling Systems and Organs
Microarrays
Used in large scale functional studies
Looking for patterns of gene expression e.g.
  Disease vs Normal
  Over time
Bioinformatics used to
  Normalise images
  Measure and adjust for variability
  Analyse differentially expressed genes
  Store data: much more complicated data than sequences
Transcriptome
http://www.ebi.ac.uk/microarray/biology_intro.html#Microarrays
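A very reduced sketch of the normalise-then-compare step for a disease vs normal experiment is shown below. It assumes one positive intensity per gene per sample, scales each sample by its median, and flags genes whose log2 ratio exceeds a cutoff. The gene names, intensities, cutoff and function names are invented. Real analyses use replicates and statistical tests rather than a bare fold-change threshold.

```python
import math
from statistics import median


def median_normalise(values):
    """Scale intensities so that the sample median becomes 1."""
    m = median(values)
    return [v / m for v in values]


def log2_fold_changes(disease, normal):
    """Per-gene log2(disease / normal) after median normalisation of each sample."""
    d = median_normalise(disease)
    n = median_normalise(normal)
    return [math.log2(x / y) for x, y in zip(d, n)]


def differentially_expressed(genes, disease, normal, cutoff=1.0):
    """Return (gene, log2 fold change) where the absolute change meets the cutoff."""
    changes = log2_fold_changes(disease, normal)
    return [(g, round(c, 2)) for g, c in zip(genes, changes) if abs(c) >= cutoff]


# Made-up intensities: the disease sample is twice as bright overall,
# and only g4 changes relative to the other genes.
GENES = ["g1", "g2", "g3", "g4", "g5"]
NORMAL = [100, 200, 400, 800, 1600]
DISEASE = [200, 400, 800, 6400, 3200]

changed = differentially_expressed(GENES, DISEASE, NORMAL)
```

Median scaling removes the overall brightness difference between the two samples, so only the gene that changed relative to the rest is reported.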
Proteomics
[Figure labels: Sample, Fractionation, Mass spectrometry, Protein Annotation / Bioinformatics, Proteome]
Systems Biology
How everything fits together by taking a holistic view of a biological system
  DNA
  RNA
  Proteins
  Protein Interactions
  Networks
  Cells
  Organs
Will require a huge amount of data and corresponding computational infrastructure
Systems biology examples
Dennis Bray (Cambridge)
  http://www.pdn.cam.ac.uk/groups/comp-cell/
  Modelling chemotaxis in bacteria
Denis Noble (Oxford) - Science, Vol 295, Issue 5560, 1678-1682
  http://www.physiome.org/
  Modelling the Heart--from Genes to Cells to the Whole Organ
List of useful websites
WWW has driven a lot of bioinformatics research
Low powered computers access very high powered computers
Important to use all resources available to do research
Hundreds of sites available
EBI - http://www.ebi.ac.uk/
NCBI - http://www.ncbi.nlm.nih.gov/
BioMart - http://www.biomart.org/
TIGR - http://www.tigr.org/
ExPASy - http://www.expasy.org/
PDB - http://www.rcsb.org/pdb/
Bioperl - http://bioperl.org/
CBRG - http://www.cbrg.ox.ac.uk