- 1 Installation
- 2 Scripts to reproduce the results
- 3 Change to your own datasets
- 4 Transfer learning using our pre-trained models
- 5 Make prediction using the trained models
- 6 Publication
- 7 Other models
- 8 CPU version
- 9 Questions
git clone https://github.com/horsepurve/DeepRTplus
cd DeepRTplus
And then follow DeepRT_install.sh to install the prerequisites.
Let's see how to apply DeepRT on HeLa dataset (modifications included). Simply type:
python data_split.py data/mod.txt 9 1 2
python capsule_network_emb.py
The HeLa data is split with 9:1 ratio with random seed 2, 9 for training and 1 for testing, and then capsule network begins training. You may check out the prediction result (about 0.985 ACC) and log file in typically 3 minutes (on a laptop with GTX 1070, for example).
To reproduce the result in the paper, just run as:
cd work
sh ../pipeline_mod.sh
And then you may see the reports in work directory.
See data/README_data.md for a summary and run corresponding pipline. All the necessary parameters for those datasets are stored in config_backup.py.
Prepare your dataset as the following format:
sequence RT
4GSQEHPIGDK 2507.67
GDDLQAIK 2996.73
FA2FNAYENHLK 4681.428
AH3PLNTPDPSTK 2754.66
WDSE2NSERDVTK 2645.274
TEEGEIDY2AEEGENRR 3210.3959999999997
SQGD1QDLNGNNQSVTR 2468.946
Separate the peptide sequence and RT (in second) by tab (\t), encode the modified amino acides as digits (currently only four kinds of modification are included in the pre-trained models):
'M[16]' -> '1',
'S[80]' -> '2',
'T[80]' -> '3',
'Y[80]' -> '4'
There are only several parameters to specify, e.g. for HeLa data, which is self-explainable:
train_path = 'data/mod_train_2.txt'
test_path = 'data/mod_test_2.txt'
result_path = 'result/mod_test_2.pred.txt'
log_path = 'result/mod_test_2.log'
save_prefix = 'epochs'
pretrain_path = ''
dict_path = ''
conv1_kernel = 10
conv2_kernel = 10
min_rt = 0
max_rt = 110
time_scale = 60 # set at 60 if your retention time is in second
max_length = 50 # maximum length of the peptides
Then type as following:
python capsule_network_emb.py
Training deep neural network models are time-consuming, especially for large dataset such as the Misc dataset here. However, the prediction accuracy is far from satisfactory without training dataset that big enough. The transfer leaning strategy used here can overcome this issue. You can use your small datasets in hand to fine-tune our pre-trained model in RPLC.
For a demo using the transfer learning strategy, just type:
cd work
sh ../pipeline_mod_trans_emb.sh
Note that you have to use the GPU version to load the pre-trained models, or otherwise you have to train from scratch on CPU.
Predicting unknown RT for a new peptide using a current model is easy to do, see below as a demo, the four parameters of which are maximum RT, saved RT model, convolutional filter size and testing file, respectively:
python prediction_emb.py max_rt param/dia_all_trans_mod_epo20_dim24_conv10.pt 10 ${rt_file}
Before all trainings, we firstly have normalized RTs for all peptides (rt_norm=(rt-min_rt)/(max_rt-min_rt)), so here we use max_rt to change them back to their previous RT scale (supposing min_rt==0).
doi: 10.1021/acs.analchem.8b02386 (PubMed)
As ResNet and LSTM (already been optimized) were less accurate then capsule network, the codes for ResNet and LSTM were deprecated, and DeepRT(+) (based on CapsNet) is recommended.
Of course you can still use SVM for training, use data_adaption.py to change the data format, and then import it to Elude/GPTime.
Running DeepRT on CPU is not recommended, because it is way too slow. However, if you have to, use capsule_network_emb_cpu.py instead of capsule_network_emb.py. You can set BATCH_SIZE to be very large if you have large enough memory.