BertADP is a protein sequence classification tool built by fine-tuning the ProtBert language model. It predicts whether a peptide sequence has anti-diabetic activity.
```
BertADP/
├── BertADP.py                # Main execution script
├── BertADP/                  # Directory containing the fine-tuned model
│   ├── adapter_config.json
│   └── adapter_model.safetensors
├── example/
│   └── example.csv           # Example input file with 10 sequences (5 positive, 5 negative)
└── requirements.txt          # List of required Python packages
```
You can clone this repository and run predictions in a few steps:
```bash
git clone https://github.com/xiexq007/BertADP.git
cd BertADP
pip install -r requirements.txt --ignore-installed
python BertADP.py example/example.csv
```
The prediction results will be saved to `prediction_result.csv`.
Python 3.11 or later is required. Install the dependencies, preferably inside a virtual environment:

```bash
pip install -r requirements.txt --ignore-installed
```
The input should be a CSV file with a single column named `Sequence` containing raw amino acid sequences, for example:

```
Sequence
GPPGPA
LLNQELLLNPTHQIYPV
SPTIPFFDPQIPK
...
```
- Sequence length should preferably be 41 residues or fewer.
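If you are building an input file programmatically, here is a minimal sketch using pandas; the output filename `my_peptides.csv` is just a placeholder:

```python
# Minimal sketch (assumes pandas is installed): build an input CSV with the
# single "Sequence" column BertADP expects. "my_peptides.csv" is a placeholder.
import pandas as pd

sequences = ["GPPGPA", "LLNQELLLNPTHQIYPV", "SPTIPFFDPQIPK"]

# Keep sequences at 41 residues or fewer, as recommended above.
sequences = [s for s in sequences if len(s) <= 41]

pd.DataFrame({"Sequence": sequences}).to_csv("my_peptides.csv", index=False)
```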
- Install dependencies (preferably in a virtual environment):

  ```bash
  pip install -r requirements.txt --ignore-installed
  ```

- Run the prediction script:

  ```bash
  python BertADP.py example/example.csv
  ```
- `example/example.csv` can be replaced with the path to your own file.
- Output: the script will generate a `prediction_result.csv` file with the following format:
```
Sequence,Positive_Probability,Prediction
GPPGPA,0.96674114,1
LLNQELLLNPTHQIYPV,0.96733195,1
SPTIPFFDPQIPK,0.9591547,1
...
```
- `Positive_Probability`: probability that the sequence is an anti-diabetic peptide.
- `Prediction`: classification result (1 = positive, 0 = negative).
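To work with the results downstream, a minimal pandas sketch for loading `prediction_result.csv` and extracting the predicted positives:

```python
# Minimal sketch: load prediction_result.csv with pandas and keep the
# sequences predicted to be anti-diabetic peptides, highest confidence first.
import pandas as pd

results = pd.read_csv("prediction_result.csv")
positives = results[results["Prediction"] == 1]
positives = positives.sort_values("Positive_Probability", ascending=False)
print(positives.to_string(index=False))
```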
The model used in this project is based on the pre-trained ProtBert model (`Rostlab/prot_bert`) from the Hugging Face Model Hub. It is fine-tuned for binary classification to distinguish anti-diabetic peptides (ADPs) from non-ADPs.
We use the transformers library together with the PEFT framework and DoRA (Weight-Decomposed Low-Rank Adaptation) for parameter-efficient fine-tuning. Only the final classification head and selected attention modules are updated during training.
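For readers unfamiliar with this setup, here is a minimal sketch of a PEFT + DoRA configuration on ProtBert. The rank, alpha, and target modules below are illustrative placeholders; the values actually used by BertADP are recorded in `BertADP/adapter_config.json`.

```python
# Minimal sketch of a PEFT + DoRA setup on ProtBert. Hyperparameters are
# illustrative placeholders, not the values used to train BertADP.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "Rostlab/prot_bert", num_labels=2
)

config = LoraConfig(
    task_type="SEQ_CLS",                # binary sequence classification
    r=8,                                # illustrative adapter rank
    lora_alpha=16,                      # illustrative scaling factor
    target_modules=["query", "value"],  # selected attention modules
    use_dora=True,                      # enable DoRA on top of LoRA
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only adapters + classification head train
```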
Key training details:
- Evaluation and checkpoint saving are done at the end of each epoch.
- The best model is automatically selected based on validation accuracy.
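As a rough illustration (not the exact training script), these details map onto Hugging Face `TrainingArguments` as follows; `"checkpoints"` is a placeholder path, and in older transformers releases `eval_strategy` is named `evaluation_strategy`:

```python
# Rough illustration of the training details above; the actual training
# script may set additional arguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints",         # placeholder path
    eval_strategy="epoch",            # evaluate at the end of each epoch
    save_strategy="epoch",            # checkpoint at the end of each epoch
    load_best_model_at_end=True,      # restore the best checkpoint afterwards
    metric_for_best_model="accuracy", # "best" = highest validation accuracy
)
```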
The fine-tuned model weights are saved in the `BertADP/` directory.
The tokenizer will be downloaded automatically from Hugging Face the first time you run the script, so please make sure you have an internet connection.
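For reference, a minimal sketch of how the adapter in `BertADP/` can be loaded onto the base model for a single prediction; the actual logic in `BertADP.py` may differ. Note that the ProtBert tokenizer expects amino acids separated by spaces.

```python
# Minimal sketch of loading the BertADP adapter for one prediction; the
# actual inference code in BertADP.py may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
base = AutoModelForSequenceClassification.from_pretrained(
    "Rostlab/prot_bert", num_labels=2
)
model = PeftModel.from_pretrained(base, "BertADP")  # adapter directory
model.eval()

seq = "GPPGPA"
inputs = tokenizer(" ".join(seq), return_tensors="pt")  # "G P P G P A"
with torch.no_grad():
    logits = model(**inputs).logits
prob_positive = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"{seq}: positive probability = {prob_positive:.4f}")
```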
This project is intended for academic and research use only. Please cite appropriately if used in publications.