toy

This is the AGB project by Álvaro Abella :person_with_blond_hair: and Clàudia Fontserè 🙅!

##Organization of the code

This project is split into several modules and scripts. Below is a short reference to what each file contains. An index of the functions in each module (written for personal use, but possibly useful to others) can be found in functions.pm.

#####HMM.pm

Core functions needed to run a Hidden Markov Model, including the Viterbi algorithm, a function to run tests and some abstractions to simplify the rest of the code.

#####MODELS.pm

The definition of each HMM used throughout this project. Each model consists of a set of states, a set of symbols, and transition and emission probabilities.

* toy: toy model (with only one donor state)
* U1: adding the most informative positions around the donor site
* U1_all: adding all positions around the donor site
* TIA1: taking U1_all positions but adding one splice regulator
* TIA2: taking U1_all positions but adding two splice regulators
* TIA3: taking U1_all positions but adding one splice regulator with a different approach

The structure of each model is shown in the directory ./graphs.

#####AUX.pm

Contains some helper functions, not related to Markov models.

#####countfreq.pl

A simple script that we used to calculate:

* The frequencies of nucleotides at each position of the real (real_human_donors) and false (false_human_donors) donors.
* The Kullback-Leibler divergence, at each position, between the nucleotide distributions of the real and false donors, computed from those frequencies.
* The Jensen-Shannon divergence, derived from the Kullback-Leibler divergence, whose graph is shown here
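The three quantities listed above can be illustrated with a minimal sketch. This is an independent Python illustration, not the code in countfreq.pl: the function names, the pseudocount and the assumption that all sequences are aligned, of equal length and made of A, C, G, T are ours.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def position_frequencies(sequences, pseudocount=1.0):
    """Return one {nucleotide: frequency} dict per column of the alignment.

    The pseudocount (an assumption of this sketch) avoids zero frequencies.
    """
    length = len(sequences[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i].upper() for seq in sequences)
        total = sum(counts[n] for n in ALPHABET) + pseudocount * len(ALPHABET)
        freqs.append({n: (counts[n] + pseudocount) / total for n in ALPHABET})
    return freqs

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q), in bits."""
    return sum(p[n] * math.log2(p[n] / q[n]) for n in ALPHABET if p[n] > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q to their midpoint."""
    m = {n: 0.5 * (p[n] + q[n]) for n in ALPHABET}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def per_position_js(real, false, pseudocount=1.0):
    """Jensen-Shannon divergence between real and false donors, column by column."""
    real_f = position_frequencies(real, pseudocount)
    false_f = position_frequencies(false, pseudocount)
    return [js_divergence(p, q) for p, q in zip(real_f, false_f)]
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon divergence is symmetric and bounded (by 1 bit with base-2 logarithms), which makes the per-position values easy to compare on a single plot.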

#####sampling.pl

The functions needed to perform a sampling of n nucleotides from a given HMM.

#####test.pl

A simple script to run a test, calling the corresponding function in HMM.pm.

###Objectives and solutions

1. Implement a program to sample sequences from an HMM.

We provide a simple script, sampling.pl, which produces a sequence of the desired length. For example, to produce a sequence of 100 nucleotides:

./sampling.pl 100

Note 1: the nucleotides are emitted according to the transition and emission probabilities of the toy HMM. However, the sampling process ignores the transition from intron to end, so that the sequence always reaches the requested length. If the requested length is large, most of the sequence will probably be intron.

Note 2: the default model to sample from is the toy model. This can be changed by modifying the end of the script, specifically the line HMM::select_model("toy"). For example, to sample from the HMM that includes the TIA-1 binding site:

HMM::select_model("TIA");
print sampling(shift @ARGV), "\n";
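The sampling procedure described in Note 1 can be sketched as follows. This is a Python illustration rather than the Perl code in sampling.pl; the state names and probabilities are hypothetical values in the style of a toy splice-site HMM, and the actual values in MODELS.pm may differ.

```python
import random

# Hypothetical toy parameters (assumption: the real ones are defined in MODELS.pm).
TRANSITIONS = {
    "exon":   {"exon": 0.9, "donor": 0.1},
    "donor":  {"intron": 1.0},
    "intron": {"intron": 0.9, "end": 0.1},
}
EMISSIONS = {
    "exon":   {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "donor":  {"A": 0.05, "G": 0.95},
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def draw(dist, rng):
    """Draw one key of dist with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def sample(length, seed=None, start="exon"):
    """Emit `length` symbols; return the sequence and the hidden state path."""
    rng = random.Random(seed)
    state = start
    symbols, states = [], []
    for _ in range(length):
        symbols.append(draw(EMISSIONS[state], rng))
        states.append(state)
        # Ignore the transition to "end" so that sampling never stops early.
        # random.choices treats weights as relative, so the remaining
        # transitions need no explicit renormalisation.
        allowed = {s: p for s, p in TRANSITIONS[state].items() if s != "end"}
        state = draw(allowed, rng)
    return "".join(symbols), states

if __name__ == "__main__":
    sequence, path = sample(100, seed=1)
    print(sequence)
```

Because the exon state is left with probability 0.1 at each step in this sketch, a long sample spends most of its length in the intron state, which matches the remark in Note 1.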

2. Implement the Viterbi algorithm and test it with the toy donor splice site model.

The Viterbi algorithm is implemented as part of the module HMM.pm. The function is HMM::viterbi, which takes a sequence, builds a matrix of probabilities and pointers and returns the most probable path of states (for this task, it relies on the function HMM::getPath).
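As an illustration of the matrix-and-pointers scheme described above, here is a minimal log-space sketch. It is written in Python and is not the code of HMM::viterbi or HMM::getPath; the function names, the explicit end-transition table and the toy parameters are assumptions of this sketch.

```python
import math

# Hypothetical toy parameters (assumption: the real ones are defined in MODELS.pm).
STATES = ["exon", "donor", "intron"]
START = {"exon": 1.0}
END = {"intron": 0.1}
TRANSITIONS = {
    "exon":   {"exon": 0.9, "donor": 0.1},
    "donor":  {"intron": 1.0},
    "intron": {"intron": 0.9},
}
EMISSIONS = {
    "exon":   {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "donor":  {"A": 0.05, "G": 0.95},
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

def log(p):
    """Natural log that maps probability 0 to -infinity instead of failing."""
    return math.log(p) if p > 0 else float("-inf")

def get_path(pointer, last):
    """Follow the back-pointers from the last state to the first."""
    path = [last]
    for i in range(len(pointer) - 1, 0, -1):
        path.append(pointer[i][path[-1]])
    path.reverse()
    return path

def viterbi(seq, states, start, trans, emit, end):
    """Return the most probable state path and its log-probability."""
    # score[i][s]: best log-probability of a path emitting seq[:i+1] and ending in s
    score = [{s: log(start.get(s, 0)) + log(emit[s].get(seq[0], 0)) for s in states}]
    pointer = [{}]
    for i in range(1, len(seq)):
        score.append({})
        pointer.append({})
        for s in states:
            prev = max(states, key=lambda r: score[i - 1][r] + log(trans[r].get(s, 0)))
            score[i][s] = (score[i - 1][prev] + log(trans[prev].get(s, 0))
                           + log(emit[s].get(seq[i], 0)))
            pointer[i][s] = prev
    last = max(states, key=lambda s: score[-1][s] + log(end.get(s, 0)))
    return get_path(pointer, last), score[-1][last] + log(end.get(last, 0))

seq = "CTTCATGTGAAAGCAGACGTAAGTCA"
path, logp = viterbi(seq, STATES, START, TRANSITIONS, EMISSIONS, END)
print(path.index("donor") + 1, round(logp, 2))  # prints: 19 -41.22
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long sequences, at no cost to the result, since the logarithm preserves the ordering of the candidate paths.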

Testing is provided by the function HMM::test, which can be called from the script test.pl. This script can easily be modified to select a given model and a file of test sequences. The model is selected as shown in the previous section. The test is run by calling HMM::test('filename', n), where filename is a file containing a set of sequences, one per line, and n is the position (1-based) in these sequences where the donor site is found.

3. Change the model into a donor site (5' splice site) model that considers the binding of the U1 snRNP, by extending the number of states that describe the exon-intron boundary. How many positions should you use for your model? Provide an argument for your answer.

To decide which positions to use, we calculated the Jensen-Shannon divergence between the sequences in false_human_donors and real_human_donors at each position.

From the previous graph it is clear that positions 3, 6 and 8 exhibit a large difference in nucleotide frequencies. Positions 1, 2, 7 and 9 show a much smaller divergence, and positions 4 and 5 show no divergence at all. These last two positions correspond to the dinucleotide GT, which is always present at the donor site. The lack of divergence here is because the false_human_donors sequences were selected to have this dinucleotide at positions 4-5. Nevertheless, these two positions must clearly be included in the model, as they are the most relevant.

Given these data we decided to take positions 3, 4, 5, 6, 7 and 8 (the 7th only so that the model can reach the 8th). This model is included in MODELS.pm under the name "U1".

We also decided to create another model taking into account all of the positions, which is called "U1_all".

To test, for example, the first model, edit test.pl as follows and run it:

HMM::select_model("U1");
HMM::test("testset_full.txt", 51);

These two models achieved an accuracy of 86.58% and 94.00%, respectively.

4. Incorporate into the previous model a state describing the presence of a TIA-1 binding site (a Uridine-rich sequence) immediately downstream of the donor site.

We have incorporated this model into MODELS.pm under the name "TIA". After several tests we found that a good approach is to allow a small transition probability from donor to TIA and a larger one from donor to intron. Intron can transition both to itself and back to TIA, and TIA can lead to itself or to intron. This gave a sensitivity of around 91%, worse than that of "U1_all".
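The transition structure just described can be written down as a small table. The sketch below is in Python and only illustrates the topology: every number in it is a hypothetical placeholder (the tuned values are those in MODELS.pm), the transition to the end state is left out, and the helper `check_stochastic` is our own.

```python
import math

# Hypothetical probabilities: only the structure follows the description above.
TIA_TRANSITIONS = {
    "donor":  {"TIA": 0.1, "intron": 0.9},   # small donor->TIA, larger donor->intron
    "intron": {"intron": 0.9, "TIA": 0.1},   # intron loops or goes back to TIA
    "TIA":    {"TIA": 0.8, "intron": 0.2},   # TIA loops or leads to intron
}

# A uridine-rich emission table for the TIA state (U is written as T in DNA).
TIA_EMISSIONS = {
    "TIA": {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
}

def check_stochastic(table, tol=1e-9):
    """Return the states whose outgoing probabilities do not sum to 1."""
    return [state for state, row in table.items()
            if not math.isclose(sum(row.values()), 1.0, abs_tol=tol)]

if __name__ == "__main__":
    print(check_stochastic(TIA_TRANSITIONS))  # an empty list means every row sums to 1
```

A check of this kind is useful when adding states by hand, since a row that does not sum to 1 silently biases the Viterbi scores.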

We also tried an analogous model with two "TIA" states (TIA2). This gives a slightly better sensitivity (around 2% better).

Finally, we tested another approach, swapping the order of the states "intron" and "TIA" (using a single "TIA" state). In this case the result is almost as accurate as with the model TIA2, while using one state fewer.

5. 📊 Make an assessment of the performance of the model using accuracy measures. Do you find any improvement between models?

We described how to run the tests in the sections above. To assess each model we count the true positives (TP), false positives (FP) and false negatives (FN). For each sequence the predicted donor position is either right or wrong, so every false positive (a position flagged as donor that is not one) also implies a false negative (the true donor position was not flagged). Hence FP = FN, and the sensitivity TP/(TP + FN) equals the specificity TP/(TP + FP), defined here as is usual in gene prediction (elsewhere this quantity is called precision). A chart comparing the sensitivities obtained with each model is shown below.
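The counting argument above can be made concrete with a short sketch. It is written in Python and is not the code of HMM::test; the function name and the input format (one predicted 1-based donor position per test sequence) are assumptions.

```python
def evaluate(predicted_positions, true_position):
    """Count TP, FP and FN when each sequence yields exactly one predicted donor.

    predicted_positions: one predicted donor position (1-based) per test sequence.
    true_position: the position of the real donor site, shared by all sequences.
    """
    tp = sum(1 for p in predicted_positions if p == true_position)
    wrong = len(predicted_positions) - tp
    # A wrong call flags a false site (FP) and misses the true one (FN).
    fp = fn = wrong
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0  # gene-prediction definition
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "specificity": specificity}

if __name__ == "__main__":
    print(evaluate([51, 51, 50, 51], 51))
```

With one predicted donor per sequence, both measures reduce to the fraction of sequences whose donor position is called correctly, which is why a single percentage per model is enough for the comparison.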
