OglyPred-PLM

Human O-linked Glycosylation Site Prediction Using Pretrained Protein Language Model

To guarantee proper use of our tool, please follow all the steps in the presented order.

Note

All the programs provided in this repository are properly tested and accurate for its functionality!

How it Works

OglyPred-PLM is a multi-layer perceptron based approach that leverages contexualized embeddings generated from the ProT5-XL-UniRef50 Protein Language Model (Referred here as ProT5 PLM) to predict Human O-linked glycosylation sites ("S/T").

Environment Details

OglyPred-PLM was developed in the following environment:

Python version: 3.8.3.final.0
Python bits: 64
Operating System: Linux
- OS release: 5.8.0-38-generic
Machine architecture: x86_64
Processor architecture: x86_64

Python Packages and Dependencies

The model relies on the following Python packages and their specific versions:

Pandas: 1.0.5
NumPy: 1.18.5
Pip: 20.1.1
SciPy: 1.4.1
scikit-learn: 0.23.1
TensorFlow: 2.3.1

Please ensure that you have a compatible Python environment with these dependencies to use the model effectively.

Create a Virtual Enviornment

To install the required dependencies for this project, you can create a virtual environment and use the provided requirements.txt file. Make sure your Python environment matches the specified versions above to ensure compatibility.

python -m venv myenv  # Create a virtual environment
source myenv/bin/activate  # Activate the virtual environment
pip install -r requirements.txt  # Install the project dependencies

Download Protein Sequences From Uniprot.org

You can use our model with all human protein sequences provided by UniPort.

We have included a file named Q63HQ2.fasta as an example in this repository

Once you have downloaded the protein sequence you can send the sequence as input to the ProT5 PLM.

Generate Contexualized Embeddings Using ProT5 PLM

ProT5 PLM Installation

In order to run the ProT5 PLM, you need to install the following modules:

pip install torch
pip install transformers
pip install sentencepiece

For more information, refer to this "ProtTrans" Repository.

Once the model is installed you can write your own code to generate the contexualized embeddings OR to make this process easier for you, we have also provided a Generate_Contexualized_Embeddings_ProT5.ipynb file.

Here's how you can use the provided file for a single protein sequence:

You need to make changes to only two lines of the code:

basedir = "/project/pakhrin/salman/after_cd_hit_files"     # here you will paste the location where your .fasta file is located
name = "Q63HQ2.fasta"                                      # here you will add your protein name

Rest of the lines will stay the same.

Once the code is successfully run, It will generate a file named "Q63HQ2_Prot_Trans_.csv".

Here's how the output file will look:

Rows = Length of The Protein Sequence
Columns = 1025 (1 column to represent the protein residue + 1024 embeddings)

Note

Make sure to keep all of the files being used in the same directory

Extracting "S/T" Sites From Contexualized Embeddings

In order to extract only the "S/T" sites from the contexualized embeddings you can use the following code:

import pandas as pd

df = pd.read_csv("Q63HQ2_Prot_Trans_.csv", header = None)                     # replace with your ProtT5 embeddings file
Header = ["Residue"]+[int(i) for i in range(1,1025)]
df.columns = Header
df_S_T_Residue_Only = df[df["Residue"].isin(["S","T"])]
df_S_T_Residue_Only.to_csv("Q63HQ2_S_T_Sites.csv", index = False)            # saves the embeddings of only S and T residues

Once the process is complete, you will have a .csv file containing the embeddings of "S/T" sites.

Here's an example output:

Sending Sites Into OglyPred-PLM For Predection

Now send the Q63HQ2_S_T_Sites.csv from the previous step as an input to the OglyPred-PLM.

We have provided a file named OglyPred-PLM.ipynb and our modelProt_T5_my_model_O_linked_Glycosylation370381Prot_T5_Subash_Salman_Neha.h5which should be downloaded and kept in the same directory to avoid any issues.

Any Questions?

If you need any help don't hesitate to get in touch with Dr. Subash Chandra Pakhrin ([email protected])

This is for Peer Review Purposes

OglyPred-PLM: Human O-linked Glycosylation Site Prediction Using Pretrained Protein Language Model

Programs were executed using Anaconda version: 2020.07

The programs were developed in the following environment. python: 3.8.3.final.0, python-bits: 64, OS: Linux, OS-release: 5.8.0-38-generic, machine: x86_64, processor: x86_64, pandas: 1.0.5, numpy: 1.18.5, pip: 20.1.1, scipy: 1.4.1, sci-kit-learn: 0.23.1., Keras: 2.4.3, tensorflow: 2.3.1.

Please place all the following files in the same directory comparison with SPRINT-Gly.ipynb, Feature_Extraction_July_30_Sprint_Gly_Negative_Independent_Test_3376_.txt, Feature_Extraction_July_30_Sprint_Gly_Positive_Independent_Test_79_.txt (In the publicly shared google drive), Compare_with_Sprint_Gly486__103928619___.h5, and execute the Comparision with SPRINT-Gly.ipynb program to see the reported result.

Please place all the following files in the same directory Comparison with Captor Independent O-linked Glycosylation Test Dataset.ipynb, Feature_Extraction_August_3_Captor_Independent_Negative_Test_1308_or_less.txt, Feature_Extraction_August_3_Captor_Independent_Positive_Test_341_or_less.txt (In the publicly shared google drive), Compare_with_captor375__1438311___.h5, and execute the Comparison with Captor Independent O-linked Glycosylation Test Dataset.ipynb program to see the reported result.

Please place all the following files in the same directory O-linked Glycosylation ProtT5 Independent Testing with undersampling result.ipynb, Feature_Extraction_O_linked_Testing_Negative_11466_Sites_less.txt, Feature_Extraction_O_linked_Testing_Positive_375_Sites_less.txt (In the publicly shared google drive), Prot_T5_my_model_O_linked_Glycosylation370381Prot_T5_Subash_Salman_Neha.h5, and execute the O-linked Glycosylation ProtT5 Independent Testing with undersampling result.ipynb program to see the reported result.

Please run the ANN_ESM2_3B_O_linked_glycosylation_Independent_Testing.py by placing the feature file in the corresponding directory to get the reported result.

To check the comparison with Alkuhlani et al. please observe the "Comparing with Alkhulani et al. Independent Test Dataset.ipynb" program.

*** For your convenience we have uploaded the ProtT5 feature extraction program (analyze_Cell_Mem_ER_Extrac_Protein (1).py) for the protein sequence from ProtT5 as well as the corresponding 1024 feature vector extraction program (Sprint Gly Independent Negative Feature Extraction.ipynb) from the ProtT5 file. We have uploaded the Q63HQ2_Prot_Trans_.csv ProtT5 feature file of protein Q63HQ2.fasta for your convenience ***

All the training and independent test data are uploaded at the following Google Drive link: https://drive.google.com/drive/folders/1MzfuMltEGF7jxKjzwLZx0gNuSDsLT7tO?usp=sharing

If you need any help don't hesitate to get in touch with Dr. Subash Chandra Pakhrin ([email protected])

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OglyPred-PLM

How it Works

Environment Details

Python Packages and Dependencies

Create a Virtual Enviornment

Download Protein Sequences From Uniprot.org

Generate Contexualized Embeddings Using ProT5 PLM

ProT5 PLM Installation

Extracting "S/T" Sites From Contexualized Embeddings

Sending Sites Into OglyPred-PLM For Predection

Any Questions?

This is for Peer Review Purposes

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
images		images
ANN_ESM2_3B_O_linked_glycosylation_Independent_Testing.py		ANN_ESM2_3B_O_linked_glycosylation_Independent_Testing.py
ANN_ESM2_3B_O_linked_glycosylation_Independent_Testing.sh		ANN_ESM2_3B_O_linked_glycosylation_Independent_Testing.sh
Compare_with_Sprint_Gly486__103928619___.h5		Compare_with_Sprint_Gly486__103928619___.h5
Compare_with_captor375__1438311___.h5		Compare_with_captor375__1438311___.h5
Comparing with Alkhulani et al. Independent Test Dataset.ipynb		Comparing with Alkhulani et al. Independent Test Dataset.ipynb
Comparision with SPRINT-Gly.ipynb		Comparision with SPRINT-Gly.ipynb
Comparison with Captor Independent O-linked Glycosylation Test Dataset.ipynb		Comparison with Captor Independent O-linked Glycosylation Test Dataset.ipynb
ESM2_3B_10_fold_CV_O_linked_Glycosylation.ipynb		ESM2_3B_10_fold_CV_O_linked_Glycosylation.ipynb
Feature_Extraction_O_linked_Testing_Positive_375_Sites_less.txt		Feature_Extraction_O_linked_Testing_Positive_375_Sites_less.txt
Generate_Contexualized_Embeddings_ProT5.ipynb		Generate_Contexualized_Embeddings_ProT5.ipynb
O-linked Glycosylation ProtT5 Independent Testing with undersampling result.ipynb		O-linked Glycosylation ProtT5 Independent Testing with undersampling result.ipynb
O_linked_Glycosylation_ANN_1D_CNN_LR_RF_ROC.png		O_linked_Glycosylation_ANN_1D_CNN_LR_RF_ROC.png
OglyPred-PLM overall framework.png		OglyPred-PLM overall framework.png
OglyPred-PLM.ipynb		OglyPred-PLM.ipynb
PR_CD_HIT_O_linked_Glycosylation_PR_Curve.png		PR_CD_HIT_O_linked_Glycosylation_PR_Curve.png
Prot_T5_my_model_O_linked_Glycosylation370381Prot_T5_Subash_Salman_Neha.h5		Prot_T5_my_model_O_linked_Glycosylation370381Prot_T5_Subash_Salman_Neha.h5
Q63HQ2_Prot_Trans_.csv		Q63HQ2_Prot_Trans_.csv
README.md		README.md
ROC_ANN_1D_CNN_LR_RF_etc_O_linked_Glycosylation.ipynb		ROC_ANN_1D_CNN_LR_RF_etc_O_linked_Glycosylation.ipynb
Sprint Gly Independent Negative Feature Extraction.ipynb		Sprint Gly Independent Negative Feature Extraction.ipynb
T-SNE of O-linked Glycosylation.ipynb		T-SNE of O-linked Glycosylation.ipynb
TSNE_of_O_linked_Glycosylation_Training_Data_477785 (1).png		TSNE_of_O_linked_Glycosylation_Training_Data_477785 (1).png
TSNE_of_O_linked_Glycosylation_Training_Data_from_Penultimate_layer_of_MLP477785 (1).png		TSNE_of_O_linked_Glycosylation_Training_Data_from_Penultimate_layer_of_MLP477785 (1).png
analyze_Cell_Mem_ER_Extrac_Protein (1).py		analyze_Cell_Mem_ER_Extrac_Protein (1).py
analyze_Cell_Mem_ER_Extrac_Protein (1).sh		analyze_Cell_Mem_ER_Extrac_Protein (1).sh
requirements.txt		requirements.txt

mainFrameWire/OglyPred-PLM

Folders and files

Latest commit

History

Repository files navigation

OglyPred-PLM

How it Works

Environment Details

Python Packages and Dependencies

Create a Virtual Enviornment

Download Protein Sequences From Uniprot.org

Generate Contexualized Embeddings Using ProT5 PLM

ProT5 PLM Installation

Extracting "S/T" Sites From Contexualized Embeddings

Sending Sites Into OglyPred-PLM For Predection

Any Questions?

This is for Peer Review Purposes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages