Bioinformatics 1 -- lecture 22
Gene finding in eukaryotes
intron/exon boundaries, splicing, alternative splicing
Finding genes in prokaryotes is easy.
Just translate the DNA sequence in all 6 reading frames. The ORFs of real genes (regions starting with ATG and ending in an in-frame stop codon) are usually at least 300 bases long, while random reading frames are dotted with stop codons at a rate of about 3 stop codons every 64 codons (roughly one every 21 codons).
XXXXXXXXATG......(3N).....TGAXXXXX
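The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: the function names (`find_orfs`, `reverse_complement`) and the `min_len` parameter are made up for this example, and it reports only ORFs that begin at the first ATG after the previous stop in each frame.

```python
STOPS = {"TAA", "TAG", "TGA"}          # 3 of the 64 codons are stops
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_len=300):
    """Scan all 6 reading frames for ATG...stop ORFs of at least min_len bases.

    Returns (strand, frame, start, end) tuples; start/end are 0-based,
    end-exclusive coordinates on the strand that was scanned.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:   # keep only long ORFs
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # close it at the stop
    return orfs

# Toy example with a short threshold so the tiny ORF is reported.
toy = "CC" + "ATG" + "GCA" * 5 + "TGA" + "CC"
print(find_orfs(toy, min_len=9))
```

With the default `min_len=300` the toy sequence returns nothing, which is the point of the length cutoff: short ATG...stop stretches occur by chance.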
Finding genes in eukaryotes is harder.
Genes are composed of coding regions (exons) and internal non-coding regions (introns).
Genes are transcribed to pre-mRNA.
Introns are removed from pre-mRNA by the spliceosome (a ribozyme).
Proteins are translated from the mRNA after splicing.
Different tissues may splice pre-mRNA differently!
[Figure: DNA -> pre-mRNA (intron GU...AG) -> mRNA]
pre-mRNA structure
prokaryotic mRNA
+polyA tail
Introns-early? Introns-late?
Did the common ancestor have introns?
eubacteria, archaea: don't have introns
eukaryotes: have introns
A generic gene sequence model for pre-mRNA
Components: pre-gene region; AUG; exon; 5' splice site (GU...); intron (0|1|2); 3' splice site (...AG); stop (UAA|UAG|UGA); post-gene region.
Layout: exon intron exon intron ...
XXXXXXATG...XXXGUX...XXAGXX...XXGUX...XXAG XX...XXTAAXXX
Splicing mechanism, spliceosome
Spliceosome: a ribonucleoprotein complex, containing RNA and small nuclear ribonucleoproteins (snRNPs), that is assembled on the messenger RNA primary transcript during splicing to excise an intron.
exon1
GU
AG
exon2
5' splice site: the donor
branchpoint
3' splice site: the acceptor
Splicing mechanism
(1) pre-mRNA: exon1 GU...AG exon2. Spliceosome (not shown) forms.
(2) Lariat loop forms at the branchpoint A: exon1 GU...A...AG exon2.
(3) Exon 1 is cleaved from the lariat, leaving a free 3'-OH on exon 1.
(4) Exons 1 and 2 are positioned to ligate.
(5) Liberated lariat + ligated exons. The lariat gets degraded; the ligated exons (exon1 exon2) go to the ribosome. Spliceosome (not shown) dissociates.
Splicing mechanism
http://neuromuscular.wustl.edu/pathol/diagrams/splicefunct.html
google: wustl neuromuscular splicefunct
http://neuromuscular.wustl.edu/pathol/diagrams/splicemech.html
google: wustl neuromuscular splicemech
Much thanks to T. Wilson, UCSC!
RNA binding proteins may selectively block splicing in some tissues.
For example: an RNA binding protein is expressed in response to a stimulus. It binds near the branchpoint (or one of the splice points). It blocks, in this case, the cyclizing step.
[Figure: exon1 GU...A...AG exon2, with the protein bound near the branchpoint A]
GU..AG
Spliceosome cuts before GU and after AG. This is a constraint.
Frame of intron
Frame 0: intron starts at codon boundary
AGU CUU AUC UUU UCA GUU GGG ... CCG UAG AAC CAC UCG UAA
Frame 1: intron starts one after codon boundary
AGU CUU AUC UUU UCA UGU GGG ... CCG UAA GAC CAC UCG UAA
Frame 2: intron starts two after codon boundary
AGU CUU AUC UUU UCA GGG UGG ... CCG UAG AGC CAC UCG UAA
The intron length must be a multiple of 3 if the intron is alternatively spliced.
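The intron frame is just the number of coding bases that precede the intron, modulo 3. The sketch below applies that to the upstream exons of the three example sequences, assuming (as the slide layout suggests) that the reading frame starts at the first base shown. The function names are illustrative only.

```python
def intron_frame(coding_start, intron_start):
    """Frame (0, 1 or 2) of an intron: coding bases before it, modulo 3.

    Both arguments are 0-based positions in the pre-mRNA.
    """
    return (intron_start - coding_start) % 3

def keeps_reading_frame(segment_length):
    """True if including or skipping a segment of this length leaves the
    downstream reading frame unchanged."""
    return segment_length % 3 == 0

# Upstream exon of each example, spaces removed; the intron begins right after.
upstream_exon = {
    0: "AGUCUUAUCUUUUCA",    # intron GU follows a complete codon
    1: "AGUCUUAUCUUUUCAU",   # one base of the split codon precedes GU
    2: "AGUCUUAUCUUUUCAGG",  # two bases of the split codon precede GU
}
for expected, exon in upstream_exon.items():
    print(expected, intron_frame(0, len(exon)))
```

`keeps_reading_frame` expresses the multiple-of-3 constraint: a retained intron or skipped exon whose length is not divisible by 3 shifts every downstream codon.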
How to find splice points, using the protein sequence database.
(1) Translate the DNA in all 6 frames.
(2) Search the database of protein sequences using the translations.
(3) Using the complete protein sequence, align it to the translation and find the regions of (near) perfect identity. These will abruptly end at the intron start site.
(4) Find the 5'-GT or 3'-AG signal at the point where the identity matches abruptly end.
(5) If your translation has an insertion with nearly perfect matches on either side, you have an alternative splicing event.
In-class exercise: find the alternatively spliced variants
Go to NCBI, search nucleotides for AKAP9 (you should get the
sequence with accession number NM_005751.4, GI:197245395)
Select BLAST sequence.
Select blastx (not tblastx).
Select the nr (non-redundant protein) database.
Organism: Homo sapiens.
Submit. While waiting, do the exercise on the next page....
A sure sign of alternative splicing in blastx output:
Score = 160 bits (404), Expect = 8e-37
Identities = 85/116 (73%), Positives = 86/116 (74%)
Frame = +2

Query: 76820 RSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMAGAFSFIHSRVGSPWXXXXXXXXX 76999
             +SHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA
Sbjct: 778   KSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA----------------------- 814

Query: 77000 XXXXRHTGVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ 77167
                    GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ
Sbjct: 815   -------GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ 863
Identical up to the insertion. Identical after the insertion. These are the same gene.
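The pattern in this blastx hit (a long gap in the subject with near-perfect identity on both sides) can be detected mechanically. The sketch below works on the two aligned strings from the output above, joined across the line break. The function names and the thresholds `min_gap`, `flank` and `min_identity` are arbitrary choices for illustration, not BLAST parameters.

```python
import re

def identity(a, b):
    """Fraction of aligned positions at which two strings agree."""
    pairs = list(zip(a, b))
    return sum(x == y for x, y in pairs) / len(pairs) if pairs else 0.0

def find_insertions(query, sbjct, min_gap=5, flank=20, min_identity=0.9):
    """Return (start, end) alignment columns of subject gaps that are
    flanked on both sides by near-identical sequence."""
    hits = []
    for m in re.finditer(r"-+", sbjct):
        if m.end() - m.start() < min_gap:
            continue
        lo = max(0, m.start() - flank)
        left = identity(query[lo:m.start()], sbjct[lo:m.start()])
        right = identity(query[m.end():m.end() + flank],
                         sbjct[m.end():m.end() + flank])
        if left >= min_identity and right >= min_identity:
            hits.append((m.start(), m.end()))
    return hits

# Aligned rows from the blastx output, both lines concatenated.
query = ("RSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMAGAFSFIHSRVGSPWXXXXXXXXX"
         "XXXXRHTGVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ")
sbjct = ("KSHENGFMEDLDKTWVRYQECDSRSNAPATLTFENMA" + "-" * 23
         + "-" * 7 + "GVFMLVAGGIVAGIFLIFIEIAYKRHKDARRKQMQLAFAAVNVWRKNLQ")

print(find_insertions(query, sbjct))
```

The reported span covers 30 alignment columns, i.e. 90 nucleotides of the query, which is a multiple of 3 as an alternatively spliced segment must be.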
Which codons can come at the start/end of an alternative exon?
Frame   Un-spliced intron starts with GU   Un-spliced intron ends with AG
0       |GUi                               iAG|
1       e|GU                               iiA,G|ee
2       ee|G,Uii                           AG|e
e = a base within the exon. i = a base within the intron. | = intron/exon boundary.
Which amino acids can come at the start/end of an alternative exon?
Frame   Un-spliced intron starts with GU   Un-spliced intron ends with AG
0       |V                                 [QKE]|
1       [CRSG]                             {FMNHYWCD} [VADEG]
2       {FINHYCD} [FLSYCW]                 [SR]
[...] = any one of these amino acids. {...} = any amino acid except these. | = intron/exon boundary.
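The amino acid sets in the table above follow directly from the standard genetic code: fix the bases that the GU or AG signal forces, let the remaining positions vary, and collect the residues. The sketch below rebuilds each entry that way. The helper `residues` and the use of `N` as a wildcard are conventions chosen for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
ALL = set(AMINO) - {"*"}   # the 20 amino acids, stops excluded

def residues(pattern):
    """Amino acids encoded by codons matching pattern ('N' = any base)."""
    return {aa for codon, aa in CODE.items()
            if aa != "*" and all(p in ("N", b) for p, b in zip(pattern, codon))}

def show(aas):
    return "".join(sorted(aas))

# Intron starts with GU
print("frame 0 start:", show(residues("GUN")))               # |GUi
print("frame 1 start:", show(residues("NGU")))               # e|GU
print("frame 2 start: not", show(ALL - residues("NNG")),     # ee|G
      "then", show(residues("UNN")))                         # Uii
# Intron ends with AG
print("frame 0 end:", show(residues("NAG")))                 # iAG|
print("frame 1 end: not", show(ALL - residues("NNA")),       # iiA
      "then", show(residues("GNN")))                         # G|ee
print("frame 2 end:", show(residues("AGN")))                 # AG|e
```

Because the un-spliced intron is translated as if it were exon, these constraints apply to the residues seen at the edges of a retained segment in a translated alignment.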
What frame is the intron in the earlier slide?
Exon,GU..AG,Exon Is that all there is to it?
GU occurs on average every 16 nucleotides. AG, too. If this were the only information, there would be too many splice sites. GU..AG is necessary, not sufficient, for splicing.
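The "every 16 nucleotides" figure is the expectation for a dinucleotide when the four bases are equally likely and independent: 1/4 x 1/4 = 1/16. A quick simulation under that assumption (real genomes have skewed composition, so the true rate differs) shows how many candidate sites GU and AG alone would produce:

```python
import random

def count_dinucleotide(seq, di):
    """Count occurrences of a two-base motif in seq."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == di)

rng = random.Random(0)                      # fixed seed for repeatability
seq = "".join(rng.choice("ACGU") for _ in range(160_000))

gu = count_dinucleotide(seq, "GU")
ag = count_dinucleotide(seq, "AG")
print("average spacing of GU:", len(seq) / gu)
print("average spacing of AG:", len(seq) / ag)
print("candidate donor sites in 160 kb:", gu)
```

Roughly ten thousand candidate donors in a sequence of this size is far more than the number of real introns, which is why additional signals are needed.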
What else is needed?
Introns always start with GU and end with AG (GT..AG in DNA)
What information is used to predict intron/exon boundaries?
Introns can start in one of three frames (0|1|2) relative to the codon frame.
Alternatively spliced introns (which may be exons) must have a multiple of 3 nucleotides.
3' and 5' intron sequence motifs
Branchpoint sequence motif
Enhancer/silencer sequence motifs (ESEs, ESSs, ISEs, ISSs)
Base composition in exons/introns
Orthologs conserve intron/exon boundaries.
Sequence composition method for gene finding
Most exons code for protein. Most introns do not. Selective pressure on exons includes:
(1) species-specific codon preferences
(2) amino acid preferences
(3) selection for foldability and function.
[Figure: two-state HMM; one state emits with P(G)=w, P(C)=x, P(T)=y, P(A)=z, the other with P(G)=a, P(C)=b, P(T)=c, P(A)=d]
A simple HMM for intron/exon base composition. Not so specific.
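A two-state composition HMM of this kind can be decoded with the Viterbi algorithm. The sketch below is a toy version: the slide leaves the emission probabilities as symbols, so the numbers in `EMIT`, the self-transition probability `STAY` and the uniform start distribution are all invented for illustration and are not trained on any genome.

```python
import math

STATES = ("exon", "intron")
# Hypothetical emission probabilities (GC-rich exons, AT-rich introns).
EMIT = {
    "exon":   {"G": 0.30, "C": 0.30, "T": 0.20, "A": 0.20},
    "intron": {"G": 0.20, "C": 0.20, "T": 0.30, "A": 0.30},
}
STAY = 0.99  # assumed probability of remaining in the same state
TRANS = {s: {t: (STAY if s == t else 1 - STAY) for t in STATES} for s in STATES}
START = {s: 0.5 for s in STATES}

def viterbi(seq):
    """Most probable exon/intron labelling of seq, computed in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []                                   # back-pointers per position
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for t in STATES:
            best = max(STATES, key=lambda s: prev[s] + math.log(TRANS[s][t]))
            score[t] = (prev[best] + math.log(TRANS[best][t])
                        + math.log(EMIT[t][base]))
            pointers[t] = best
        back.append(pointers)
    state = max(STATES, key=score.get)          # best final state
    path = [state]
    for pointers in reversed(back):             # trace back to the start
        state = pointers[state]
        path.append(state)
    return path[::-1]

toy = "GC" * 50 + "AT" * 50
labels = viterbi(toy)
print(labels.count("exon"), labels.count("intron"))
```

With such coarse emissions the model only separates long stretches of different composition, which matches the slide's remark that this approach is not very specific.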
ESEs, ESSs, ISEs, ISSs
ESE = Exonic Splicing Enhancers: sequences in the exons that promote splicing.
ESS = Exonic Splicing Silencers: sequences in the exons that inhibit splicing.
ISE = Intronic Splicing Enhancers: sequences in the introns that promote splicing.
ISS = Intronic Splicing Silencers: sequences in the introns that inhibit splicing.
How were ESEs found?
(1) A training database was constructed of exonic mRNA (post-spliced) that was (a) constitutively spliced (not alternatively spliced), and (b) from an internal non-protein-coding exon.
(2) A database of control, non-ESE sequences was constructed.
(3) The relative abundance of all 8-mers was found.
(4) 8-mers with high relative abundance were tested by mutating the putative ESE 8-mers and determining the splicing efficiency by gel electrophoresis.
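Step (3) amounts to comparing k-mer frequencies between the exon set and the control set. The sketch below uses a plain frequency ratio with a pseudocount. That is a simplification chosen for illustration and is not the statistic used in the published study; the function names and the toy sequences are made up.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every overlapping k-mer in a collection of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def relative_abundance(exon_seqs, control_seqs, k=8, pseudocount=1.0):
    """Ratio of k-mer frequency in exons to frequency in controls.

    Values above 1 mark k-mers enriched in exons (candidate enhancers);
    values below 1 mark depleted k-mers (candidate silencers).
    """
    fg, bg = kmer_counts(exon_seqs, k), kmer_counts(control_seqs, k)
    vocab = set(fg) | set(bg)
    fg_total = sum(fg.values()) + pseudocount * len(vocab)
    bg_total = sum(bg.values()) + pseudocount * len(vocab)
    return {kmer: ((fg[kmer] + pseudocount) / fg_total)
                  / ((bg[kmer] + pseudocount) / bg_total)
            for kmer in vocab}

# Toy data with k=3 so the counts are easy to check by hand.
ratios = relative_abundance(["GAAGAAGAAGAA"], ["CCCCCCCCCCCC"], k=3)
for kmer, r in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(kmer, round(r, 2))
```

The pseudocount keeps the ratio finite for k-mers absent from one set, which matters at k=8 where many of the 65,536 possible 8-mers are rare.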
Zhang XHF, Chasin LA. Computational definition of sequence motifs governing constitutive exon splicing. Genes & Development 18: 1241-1250 (2004)
Relative abundance ESE and ESS motifs
putative ESSs
putative ESEs
Some of the motifs found by Zhang & Chasin using relative abundance analysis of 8-mers, after clustering.
The nice thing about HMMs: they are modular.
[Figure: several HMM modules, each with its own begin and end states]
HMMs can be connected by their begin and end states to make a super-HMM. Individual modules can be trained separately.
A modular HMM for introns
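[Figure: AUGUSTUS intron HMM. States: DSS (donor site = GU), Ishort, Ifixed, Igeo, ASS (acceptor site = AG). One path is the short variable-length intron model (Ishort); the other is the fixed-length + variable-length intron model (Ifixed, then Igeo). Transition labels: p, 1-p, q, 1-q, 1.]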
Stanke M, Steinkamp R, Waack S, Morgenstern B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 2004 Jul 1;32:W309-12.
Intron model for mammals
Branch site
Poly-pyrimidine region
Donor motif (contains GU)
Acceptor motif (contains AG)
from: Blencowe BJ. Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases. TIBS 25:106 (2000)
A gene-finding HMM: Genescan
internal exon model
intron models
initial exon model
terminal exon model
single exon model
Intergenic regions
Mirrored models for reverse complement strand
GENESCAN -- forward strand part
Splicing fact sheet
Exons average 145 nucleotides in length
  Contain regulatory elements:
    ESEs: Exonic splicing enhancers
    ESSs: Exonic splicing silencers
Introns average more than 10x longer than exons
  Contain regulatory elements (bind regulatory complexes)
    ISEs: Intronic splicing enhancers
    ISSs: Intronic splicing silencers
Splice sites
  5' splice site
    Sequence: AGguragu (r = purine)
    U1 snRNP: Binds to 5' splice site
  3' splice site
    Sequence: yyyyyyy nagG (y = pyrimidine)
  Branch site
    Sequence: ynyuray (r = purine)
    U2 snRNP: Binds to branch site via RNA:RNA interactions between snRNA and pre-mRNA
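The consensus strings in the fact sheet use degenerate base codes (r = purine, y = pyrimidine, n = any base), which map directly onto regular-expression character classes. The sketch below does that conversion. Exact consensus matching is a deliberate simplification: real splice sites deviate from the consensus, and gene finders usually score them with position weight matrices instead. The names used here are illustrative.

```python
import re

# Degenerate codes used in the fact sheet, written for RNA.
IUPAC = {"a": "A", "c": "C", "g": "G", "u": "U",
         "r": "[AG]", "y": "[CU]", "n": "[ACGU]"}

def consensus_to_regex(consensus):
    """Compile a consensus such as 'AGguragu' into a regular expression.

    Case (exon vs. intron positions on the slide) and spaces are ignored.
    """
    pattern = "".join(IUPAC[ch] for ch in consensus.lower() if not ch.isspace())
    return re.compile(pattern)

DONOR = consensus_to_regex("AGguragu")         # 5' splice site
ACCEPTOR = consensus_to_regex("yyyyyyy nagG")  # 3' splice site
BRANCH = consensus_to_regex("ynyuray")         # branch site

print(DONOR.pattern)
print(DONOR.search("CCAGGUAAGUCC").start())
print(bool(ACCEPTOR.fullmatch("UUUUUUUCAGG")), bool(BRANCH.fullmatch("CUCUAAC")))
```

For DNA input, replace U with T in the table or transcribe the sequence before searching.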
Alternative splicing fact sheet
Alternative splicing
  Definition: Joining of different 5' and 3' splice sites
  ~80% of alternative splicing results in changes in the encoded protein
  Up to 59% of human genes express more than one mRNA by alternative splicing
Functional effects: Generates several forms of mRNA from a single gene
  Allows functionally diverse protein isoforms to be expressed according to different regulatory programs
Structural effects:
  Insert or remove amino acids
  Shift reading frame
  Introduce termination codon
Gene expression effects
  Removes or inserts regulatory elements controlling translation, mRNA stability, or localization
Regulation
  Splicing pathways modulated according to:
    Cell type
    Developmental stage
    Gender
    External stimuli