Genome Annotation
Dong Xu
Digital Biology Laboratory
Computer Science Department
Christopher S. Bond Life Sciences Center
University of Missouri, Columbia
http://digbio.missouri.edu
Lecture Outline
Introduction
Manual Curation
Automatic Annotation
Conclusions
What is Genome Annotation?
Annotation is the process of interpreting raw sequence data into useful biological information by integrating computational analyses, other biological data, and biological expertise.
It involves characterizing genomic features using computational and experimental methods.
Features could be repeats, genes, promoters, protein domains, etc.
What is Genome Annotation?
Questions:
What genes does this genome contain?
What proteins do they encode?
How are they regulated?
In what interactions or pathways do the proteins participate?
Types of Genome Annotation
Structural annotation
Location of protein-coding genes
Location of regions of homology with other genomes, cDNA sequences, and protein sequences
Location and type of transcription regulatory elements
Functional annotation
Molecular function of encoded proteins
Membership in metabolic and regulatory networks
Aim: to get from here to here (before and after figures not reproduced)
What are genes? - 1
Complete DNA segments responsible for making functional products
Products
Proteins
Functional RNA molecules
RNAi (interfering RNA)
rRNA (ribosomal RNA)
snRNA (small nuclear RNA)
snoRNA (small nucleolar RNA)
tRNA (transfer RNA)
Non-coding RNA
What are genes? - 2
Definition vs. dynamic concept
Consider
Prokaryotic vs. eukaryotic gene models
Introns/exons
Posttranscriptional modifications
Alternative splicing
Differential expression
Posttranslational modifications
Multi-subunit proteins
Prokaryotic Gene Structure
5' to 3'
Coding region of Open Reading Frame
Promoter region (maybe)
Ribosome binding site (maybe)
Termination sequence (maybe)
Start codon / Stop codon
Open reading frame (ORF): a segment of DNA with two in-frame stop codons at the two ends and no in-frame stop codon in the middle
Prokaryotic gene model: ORF-genes
Small genomes, high gene density
Haemophilus influenzae genome: 85% genic
Operons
One transcript, many genes
No introns
One gene, one protein
Open reading frames
One ORF per gene
ORFs begin with start, end with stop codon (def.)
TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
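The ORF-gene model above can be sketched as a small scanner. This is a minimal illustration, not how any particular annotation tool works: it uses the start-to-stop definition, assumes ATG as the only start codon, and the `min_codons` cutoff is a hypothetical parameter chosen for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, orf) for every ATG...stop ORF.

    start and end are 0-based, end-exclusive offsets on the strand
    that was scanned. min_codons counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i  # first in-frame start opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


for orf in find_orfs("ATGGCTTACGCTTGA"):
    print(orf)
```

Real ORF detectors add a much larger minimum length and further evidence, since short ORFs occur by chance.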
Eukaryotic gene model: spliced genes
Eukaryotic Gene Structure
Genetic Code
Reading Frame
Reading (or translation) frame: each DNA segment has six possible reading frames
Forward strand: ATGGCTTACGCTTGA
Reading frame #1 ATG GCT TAC GCT TGA
Reading frame #2 TGG CTT ACG CTT GA.
Reading frame #3 GGC TTA CGC TTG A..
Reverse strand: TCAAGCGTAAGCCAT
Reading frame #4 TCA AGC GTA AGC CAT
Reading frame #5 CAA GCG TAA GCC AT.
Reading frame #6 AAG CGT AAG CCA T..
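The six frames of the example sequence can be reproduced with a short sketch. The function name `six_frames` and the 1-6 numbering (1-3 forward, 4-6 on the reverse complement) are choices made for this illustration, following the numbering on the slide.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(seq):
    """Map frame number (1-6) to its list of codons.

    Frames 1-3 read the forward strand at offsets 0, 1, 2;
    frames 4-6 read the reverse complement at offsets 0, 1, 2.
    A trailing partial codon is kept as a shorter string.
    """
    seq = seq.upper()
    frames = {}
    for n, strand in enumerate((seq, reverse_complement(seq))):
        for offset in range(3):
            s = strand[offset:]
            frames[3 * n + offset + 1] = [s[i:i + 3] for i in range(0, len(s), 3)]
    return frames


for number, codons in six_frames("ATGGCTTACGCTTGA").items():
    print(number, " ".join(codons))
```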
Lecture Outline
Introduction
Manual Curation
Automatic Annotation
Challenges or Pitfalls
Conclusions
Manual Curation
Annotation for all genes in the genome has been manually reviewed by a curator and should be regarded as being as accurate as possible given the data available at the time of curation. Annotation data have been assigned based on all evidence available to the curator. In addition, gene models are curated to remove overlapping genes, resolve frameshifted genes, and determine the initiation codon of each gene.
Manual Curation
Task involves identifying
Genes
Known
Novel
Novel transcript
Putative
Pseudogene
Annotation nomenclature
Known Gene: predicted gene matches the entire length of a known gene.
Putative Gene: predicted gene contains a region conserved with a known gene. Also referred to as "like" or "similar to".
Unknown Gene: predicted gene matches a gene or EST whose function is not known.
Hypothetical Gene: predicted gene that does not show significant similarity to any known gene or EST.
Things curators are looking to annotate
CDS
mRNA
Alternative RNA splicing
Promoter and Poly-A signal
Pseudogenes
ncRNA
Pseudogenes
As many as 20-30% of all genomic sequence predictions could be pseudogenes
Non-functional copy of a gene
Processed pseudogene
Retrotransposon derived
No 5' promoters
No introns
Often includes polyA tail
Non-processed pseudogene
Gene duplication derived
Both include events that make the gene non-functional
Frameshift
Stop codons
We assume pseudogenes have no function, but we really don't know!
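The two disabling events named above, frameshifts and stop codons, can be screened for in a candidate coding sequence. The sketch below is a simple heuristic made up for illustration (the function name and messages are hypothetical); it assumes the sequence is given from its start codon in frame 1, and real pseudogene calls compare the copy against its functional parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def disabling_features(cds):
    """Return a list of reasons why a candidate CDS looks non-functional.

    Heuristic only: a length that is not a multiple of 3 hints at a
    frameshift, and a stop codon before the final codon is premature.
    """
    cds = cds.upper()
    flags = []
    if len(cds) % 3 != 0:
        flags.append("length not a multiple of 3 (possible frameshift)")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # The last complete codon is expected to be the normal stop, so skip it.
    premature = [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    if premature:
        flags.append(f"premature stop at codon index {premature}")
    return flags


print(disabling_features("ATGGCTTACGCTTGA"))   # intact example
print(disabling_features("ATGTAAGCTTGA"))      # in-frame stop after the start
print(disabling_features("ATGGCTACGCTTGA"))    # one base deleted
```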
Example Pseudogene
LOCUS       NG_005487               1850 bp    DNA     linear   ROD 14-FEB-2006
DEFINITION  Mus musculus ubiquitin-conjugating enzyme E2 variant 2 pseudogene
            (LOC625221) on chromosome 6.
ACCESSION   NG_005487
VERSION     NG_005487.1  GI:87239965
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 1850)
  AUTHORS   Wilson,R.
  TITLE     Mus musculus BAC clone RP24-201D17 from 6
  JOURNAL   Unpublished (2003)
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AC121925.2.
FEATURES             Location/Qualifiers
     source          1..1850
                     /organism="Mus musculus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:10090"
                     /chromosome="6"
                     /note="AC121925.2 32277..34126"
     gene            101..1750
                     /gene="LOC625221"
                     /pseudo
                     /db_xref="GeneID:625221"
     repeat_region   1792..1827
                     /rpt_family="ID"
ORIGIN
        1 tcttctgcct caattcctca agtgctagta tcatatgccc atgccattat ttttaactcc
       61 cctttttcat gctaagaatt gaacacacgg ccctgcgtgc ggtggtgcgt ctggtagcag
      121 gagaagatgg cggtctccac aggagttaaa gttcctcgta attttcgctt gttggaagaa
Noncoding RNA (ncRNA)
ncRNA represent 98% of all transcripts in a mammalian cell
ncRNA have not been taken into account in gene counts, which rely on
cDNA
ORF computational prediction
Comparative genomics looking at ORF
ncRNA can be:
Structural
Catalytic
Regulatory
Example ncRNA From NW_632744.1
     gene            complement(55100..55691)
                     /locus_tag="CR40465"
                     /note="synonym: CR_tc_AT13310"
                     /db_xref="GeneID:3354945"
     misc_RNA        complement(55100..55691)
                     /locus_tag="CR40465"
                     /note="This annotation is identical to the ncRNA
                     CR_tc_AT13310 annotation, also mapped identically to 2L
                     [20224138,20223553] last curated on Thu Jan 15 13:37:02
                     PST 2004"
                     /db_xref="FlyBase:FBgn0058465"
                     /db_xref="GeneID:3354945"
Noncoding RNA (ncRNA)
tRNA (transfer RNA): involved in translation
rRNA (ribosomal RNA): structural component of the ribosome, where translation takes place
snoRNA (small nucleolar RNA): functional/catalytic in RNA maturation
Antisense RNA: gene regulation / silencing?
Repetitive Sequence
Definition
DNA sequences that are made up of copies of the same or nearly the same nucleotide sequence
Present in many copies per chromosome set
Repeat Filtering
RepeatMasker
Uses precompiled representative sequence libraries to find homologous copies of known repeat families
Uses Blast
http://www.repeatmasker.org/
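Once repeat coordinates are known, filtering amounts to hiding those bases from downstream gene finders. The sketch below does not find repeats (that is the job of RepeatMasker and its libraries); it only applies intervals that are already known, and the function name and the 1-based inclusive coordinate convention are assumptions made for the example.

```python
def mask_repeats(seq, intervals, soft=False):
    """Mask known repeat intervals in a sequence.

    intervals: (start, end) pairs, 1-based and inclusive, as in
    feature tables. Hard masking replaces bases with N; soft
    masking lowercases them so the bases remain readable.
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


print(mask_repeats("ACGTACGTAC", [(3, 6)]))             # hard mask
print(mask_repeats("ACGTACGTAC", [(3, 6)], soft=True))  # soft mask
```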
Automatic Annotation
Automatic annotation is annotation that has been generated by a computational algorithm without being further curated.
Who does automatic annotation?
EnsEMBL
NCBI
UCSC
Gene-Finding Strategies
Genomic Sequence
Content-Based: bulk properties of sequence
Open reading frames
Codon usage
Repeat periodicity
Compositional complexity
Site-Based: absolute properties of sequence
Consensus sequences
Donor and acceptor splice sites
Transcription factor binding sites
Polyadenylation signals
Right ATG start
Stop codons out-of-context
Comparative: inferences based on sequence homology
Protein sequence with similarity to translated product of query
Modular structure of proteins usually precludes finding complete gene
Gene-Finding Strategies
Homology-based gene prediction
Similarity Searches (e.g. BLAST)
Genome Browsers
RNA evidence (ESTs)
Ab initio gene prediction
Gene prediction programs
Prokaryotes
ORF identification
Eukaryotes
Promoter prediction
PolyA-signal prediction
Splice site, start/stop-codon predictions
Homology based approaches
The idea: new species are not produced from scratch; they are evolutionarily related to extant species
Search by local alignment programs
EST/cDNA to genome: BlastN, FASTA
Protein to genome: TBlastN
Gene prediction through comparative genomics
Highly similar (conserved) regions between two genomes are useful, or else they would have diverged
If genomes are too closely related, all regions are similar, not just genes
If genomes are too far apart, analogous regions may be too dissimilar to be found
Genome Browsers
Generic Genome Browser (CSHL): www.wormbase.org/db/seq/gbrowse
NCBI Map Viewer: www.ncbi.nlm.nih.gov/mapview/
Ensembl Genome Browser: www.ensembl.org/
UCSC Genome Browser: genome.ucsc.edu/cgi-bin/hgGateway?org=human
Apollo Genome Browser: www.bdgp.org/annot/apollo/
Gene discovery using ESTs
Expressed Sequence Tags (ESTs) represent sequences from expressed genes.
If a region matches an EST with high stringency, then the region is probably a gene or pseudogene.
An EST overlapping an exon boundary gives an accurate prediction of that exon boundary.
Ab initio approach -1
Rely on identification of specific signals: start codon, stop codon, ribosomal binding site
The Shine-Dalgarno sequence (AGGAGG) is the signal for initiation of protein biosynthesis in bacterial mRNA. It is located 5' of the first coding AUG, and consists primarily, but not exclusively, of purines. It is the ribosomal binding site.
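A signal-based search of this kind can be sketched as a motif scan upstream of each candidate start codon. The spacing window (`min_gap`, `max_gap`) is an assumption chosen for illustration, and the sketch requires an exact AGGAGG match, whereas real ribosome binding sites vary and are usually scored rather than matched exactly.

```python
def find_sd_starts(seq, motif="AGGAGG", min_gap=4, max_gap=12):
    """Return (motif_start, atg_start) pairs, 0-based.

    A pair is reported when the motif ends between min_gap and
    max_gap bases upstream of an ATG. Both gap limits are
    illustrative assumptions, not measured values.
    """
    seq = seq.upper().replace("U", "T")  # accept mRNA or DNA input
    hits = []
    atg = seq.find("ATG")
    while atg != -1:
        lo = max(0, atg - max_gap - len(motif))
        hi = max(0, atg - min_gap)
        # str.find only reports matches lying entirely inside seq[lo:hi]
        m = seq.find(motif, lo, hi)
        if m != -1:
            hits.append((m, atg))
        atg = seq.find("ATG", atg + 1)
    return hits


print(find_sd_starts("TTAGGAGGTAACATATGGCT"))
```

Pairing a start codon with an upstream ribosomal binding site is one way to choose among several in-frame ATGs in the same ORF.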
Ab initio approach -2
Rely on differences in nucleotide-motif composition between
Protein-coding and non-coding sequences
The correct reading frame of a gene and the other reading frames
Coding Signal Detection
Frequency distribution of dimers in protein sequence (Shewanella)
The average frequency is 5%
Some amino acids prefer to be next to each other
Some other amino acids prefer not to be next to each other
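Neighbour preferences like these can be tabulated from a set of protein sequences. The sketch below assumes the 5% figure is the expected share of each of the 20 amino acids as the next residue (1/20) when neighbours are independent and equally likely; that reading, the function name, and the toy input are all assumptions, not taken from the Shewanella data.

```python
from collections import Counter, defaultdict


def neighbour_frequencies(proteins):
    """For each residue, return the fraction of times each residue follows it."""
    pair_counts = defaultdict(Counter)
    for protein in proteins:
        for current, following in zip(protein, protein[1:]):
            pair_counts[current][following] += 1
    return {
        current: {res: n / sum(counts.values()) for res, n in counts.items()}
        for current, counts in pair_counts.items()
    }


freqs = neighbour_frequencies(["MAAK", "MAK"])
for current, following in sorted(freqs.items()):
    print(current, following)
```

Values well above or below 1/20 for a given pair would indicate residues that tend to sit together or avoid each other, which is the signal used to tell a real coding frame from a wrong one.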
Ab initio approach -3
Compositional differences
Nucleotides in coding and non-coding regions evolve under different constraints
First and second codon positions are constrained by the encoded amino acid
Third codon position is subject to mutational and translation efficiency constraints
Nucleotides in non-coding regions can evolve independently
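One simple way to see position-specific constraints is to measure base composition separately at each codon position. The sketch below computes GC content per position for a sequence assumed to be in frame; the function name is hypothetical and GC content is only one of many compositional measures a gene finder might use.

```python
def gc_by_codon_position(cds):
    """Return GC fraction at codon positions 1, 2 and 3 as a list.

    The sequence is assumed to start in frame; an incomplete
    trailing codon is ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    fractions = []
    for position in range(3):
        bases = cds[position:usable:3]  # every third base from this offset
        gc = sum(base in "GC" for base in bases)
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


print(gc_by_codon_position("ATGGCTTACGCTTGA"))
```

In a non-coding region the three values should be roughly equal, while in a true coding frame the third position often stands out.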
Ab initio gene prediction
Prokaryotes
ORF-Detectors
Eukaryotes
Position, extent & direction: through promoter and polyA-signal predictors
Structure: through splice site predictors
Exact location of coding sequences: through determination of relationships between potential start codons, splice sites, ORFs, and stop codons
Gene prediction programs
Rule-based programs
Use explicit set of rules to make decisions. Example: GeneFinder
Neural Network-based programs
Use data set to build rules. Examples: Grail, GrailEXP
Hidden Markov Model-based programs
Use probabilities of states and transitions between these states to predict features. Examples: Genscan, GenomeScan
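The idea of states and transitions can be shown with a toy two-state model decoded by the Viterbi algorithm. Everything numeric here is invented for illustration (a "coding" state that prefers G and C, a "non-coding" state that prefers A and T, and sticky transitions); the models inside real gene finders have many more states and trained parameters.

```python
import math

# State labels: "n" = non-coding, "c" = coding. All probabilities are made up.
STATES = ("n", "c")
START = {"n": 0.5, "c": 0.5}
TRANS = {"n": {"n": 0.9, "c": 0.1}, "c": {"n": 0.1, "c": 0.9}}
EMIT = {
    "n": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "c": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq, one label per base."""
    seq = seq.upper()
    if not seq:
        return ""
    # Log-probabilities avoid underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAAAAAAAGGGGGGGGGG"))
```

The sticky transitions mean a single G inside an A-rich stretch does not flip the state; only a sustained change in composition does.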
Tools for Annotation
EnsEMBL (EBI)
Sequin (NCBI)
PseudoCAP (SFU)
GMOD (CSHL)
Pegasys (UBiC)
Apollo (EBI/Berkeley)
Tools for Annotation
ORF detectors
NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
Promoter predictors
CSHL: http://rulai.cshl.org/software/index1.htm
BDGP: fruitfly.org/seq_tools/promoter.html
ICG: TATA-Box predictor
PolyA signal predictors
CSHL: argon.cshl.org/tabaska/polyadq_form.html
Splice site predictors
BDGP: http://www.fruitfly.org/seq_tools/splice.html
Start-/stop-codon identifiers
DNALC: Translator/ORF-Finder
BCM: Searchlauncher
Example: Ensembl Automatic Annotation Process -1
Ensembl pipeline & gene build
Manual annotation
Ensembl core database
Data mining system (EnsMART)
Web site
Ensembl Automatic Annotation Process -2
Raw Compute
Sequence data arrives in contigs
Repeat masking
Ab initio predictions (Genscan)
Blast the predictions against: swall, vertebrate RNA, unigene
ePCR places markers on the sequence
Assembly information is used to position contigs on a golden path
EnsEMBL core
Ensembl Automatic Annotation Process -3
GeneBuild (flow diagram)
Inputs: human proteins, other proteins, cDNAs, ESTs, Genscan exons
Programs: Pmatch, Exonerate, GeneWise, Est2Genome
Steps: add UTRs, merge
Outputs: genes, EST-genes
Ensembl Automatic Annotation Process -4
Protein sequences
Aligned to the genome
Blast and MiniSeq
Genewise
Ensembl Automatic Annotation Process -5
Map cDNAs and ESTs using Exonerate
(determine coverage, % identity and location in genome)
Store hits and filter on percentage identity and length coverage
ESTs and cDNA
blast sequence and create a miniseq
Run est2genome on miniseq
(determine strand, splicing)
Map transcripts back into genome-assembly
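The filtering step above (keep hits by percentage identity and length coverage) can be sketched as follows. The `Hit` fields and both thresholds are hypothetical values picked for the example; they are not the cutoffs used by the Ensembl pipeline.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str
    query_length: int    # length of the EST or cDNA
    aligned_length: int  # query bases included in the alignment
    matches: int         # identical bases within the alignment

    @property
    def percent_identity(self):
        return 100.0 * self.matches / self.aligned_length

    @property
    def coverage(self):
        return 100.0 * self.aligned_length / self.query_length


def filter_hits(hits, min_identity=97.0, min_coverage=90.0):
    """Keep hits that pass both the identity and the coverage threshold."""
    return [
        hit for hit in hits
        if hit.percent_identity >= min_identity and hit.coverage >= min_coverage
    ]


hits = [
    Hit("est1", 500, 490, 485),  # long and nearly identical
    Hit("est2", 500, 200, 200),  # identical but covers too little
    Hit("est3", 500, 480, 400),  # long but too divergent
]
print([hit.query_id for hit in filter_hits(hits)])
```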
Ensembl Automatic Annotation Process -6
Miniseq - the need for speed
Blast proteins
Hits: ~100 kb
Convert to miniseq
Minigenomic: 1 kb on either side
Run Genewise
Map back to genomic
Spliced alignment
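The coordinate bookkeeping behind a miniseq can be sketched for the simplest case of a single hit region. The function names are hypothetical, and a real miniseq may stitch several hit regions together, which this sketch does not attempt.

```python
def make_miniseq(genome, hit_start, hit_end, flank=1000):
    """Cut out the hit region plus flanking sequence.

    Coordinates are 0-based, end-exclusive. Returns the miniseq and
    the genomic offset of its first base.
    """
    start = max(0, hit_start - flank)
    end = min(len(genome), hit_end + flank)
    return genome[start:end], start


def map_back(mini_position, offset):
    """Convert a position on the miniseq to a genomic position."""
    return mini_position + offset


genome = "A" * 5000 + "ATGGCT" + "C" * 5000
mini, offset = make_miniseq(genome, 5000, 5006)
position = mini.find("ATGGCT")       # stand-in for the slow alignment step
print(len(mini), offset, map_back(position, offset))
```

The expensive spliced alignment then runs on about 2 kb plus the hit instead of the whole contig, and its results are shifted by `offset` to return to genomic coordinates.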
NCBI GenBank Features
-10_signal -35_signal 3'clip 3'UTR 5'clip 5'UTR attenuator CAAT_signal CDS conflict C_region D-loop D_segment enhancer exon GC_signal gene iDNA intron J_segment LTR mat_peptide misc_binding misc_difference misc_feature misc_recomb misc_RNA misc_signal misc_structure modified_base mRNA N_region old_sequence polyA_signal polyA_site precursor_RNA primer_bind prim_transcript promoter protein_bind RBS repeat_region repeat_unit rep_origin rRNA satellite scRNA sig_peptide snoRNA snRNA S_region stem_loop STS TATA_signal terminator transit_peptide tRNA unsure variation V_region V_segment
Prokaryotic Projects List
Assigning function to ORF
In order to assign function, all predicted ORFs are translated to amino acid sequence and analysed by homology searches against sequence databases (usually GenBank)
For each ORF there are three possible results:
i) clear sequence homology indicating function
ii) blocks of homology to defined functional motifs (these should be confirmed experimentally)
iii) no significant homology, or homology to proteins of unknown function
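The translation step that precedes the homology search can be sketched with the standard genetic code. The compact table below uses the conventional T, C, A, G base ordering; the function name and the use of "X" for codons containing ambiguous bases are choices made for this example, and organisms with alternative genetic codes would need a different table.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated with T, C, A, G at each position.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate an in-frame DNA sequence; '*' marks a stop codon."""
    orf = orf.upper().replace("U", "T")
    protein = []
    for i in range(0, len(orf) - 2, 3):
        # Codons with ambiguous bases (e.g. N) become X.
        protein.append(CODON_TABLE.get(orf[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCTTACGCTTGA"))
```

The resulting protein string is what would be submitted to a similarity search such as BLAST.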
Challenges or Pitfalls
First and last exons are difficult to annotate because they contain UTRs.
Smaller genes are not statistically significant, so they are thrown out.
Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.
Conclusions
Trust but verify
Beware of gene prediction tools!
Always use more than one gene prediction tool and more than one genome when possible.
This is an active area of bioinformatics research, so be mindful of the new literature in this area.
Readings
http://www.genome.org/cgi/content/full/15/12/1777
Play with ORF Finder: http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Study Microbial Genomes Resources: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/microbial_taxtree.html
Seminar
Acknowledgments
This file is for educational purposes only. Some materials (including pictures and text) were taken from the public domain on the Internet.