OPUS-PLLM: Advancing Generative Large Language Models Toward Discriminative Performance in Protein Function Prediction

This is the official codebase for OPUS-PLLM: Advancing Generative Large Language Models Toward Discriminative Performance in Protein Function Prediction.

🧩 Dependencies

  1. First, create a new Python 3.10 virtual environment and activate it. We recommend deploying this project with CUDA 11.8.
conda create -n OpusPLLM python=3.10
conda activate OpusPLLM
  2. Install PyTorch with CUDA 11.8 support using pip, following the official PyTorch installation instructions. This project uses torch==2.4.0 and its corresponding dependencies, which you can install with:
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118

Then install xformers:

pip install xformers==0.0.27.post2 --index-url https://download.pytorch.org/whl/cu118
  3. Install the remaining required packages from requirements.txt:
pip install -r requirements.txt
  4. Check whether your conda environment supports CXXABI_1.3.9:
strings $CONDA_PREFIX/lib/libstdc++.so.6 | grep CXXABI

If CXXABI_1.3.9 is not returned, install libstdcxx-ng so that bitsandbytes works correctly:

conda install -c conda-forge libstdcxx-ng
  5. Finally, set up the project by adding it to your PYTHONPATH:
export PYTHONPATH=/path/to/OPUS_PLLM/
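
As an optional sanity check, the minimal sketch below (assuming the versions installed above) verifies that the environment sees the GPU and that bitsandbytes imports cleanly, which also confirms the CXXABI fix from step 4:

```python
# sanity_check.py -- optional sketch to verify the environment built above
import torch
import xformers
import bitsandbytes  # import fails if libstdc++ lacks CXXABI_1.3.9 (see step 4)

print("torch:", torch.__version__)        # expect 2.4.0+cu118
print("xformers:", xformers.__version__)  # expect 0.0.27.post2
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```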

📦 Datasets and Benchmarks

As described in the paper, we provide two versions of the dataset for training: OPUS-InstructionCorpus and OPUS-InstructionCorpus-Evol.

Upon acceptance of the paper, both training datasets will be released via our Hugging Face dataset repos OPUS-InstructionCorpus 👈🤗 and OPUS-InstructionCorpus-Evol 👈🤗.

In addition, all 18 test datasets of our benchmark are open-sourced here 👈. We gratefully acknowledge the teams that contributed parts of the original test sets (OPI team, Clean team, Deeploc team).

🔋 Model Weights and Structure

(Figure: overview of the OPUS-PLLM framework, training stages (a)-(d).)

To evaluate or use OPUS-PLLM, download the following components:

  1. Base Model Weights: Corresponding to the generative language model used in stages (c) and (d) of the framework. This component remains frozen throughout all training stages and is responsible for text generation and comprehension.

  2. Modality Encoding Adapter Weights: Corresponding to stage (a), where protein sequences and textual descriptions are aligned through a dedicated adapter to establish a shared representation space (modality encoder).

  3. Modality Refinement Projection Weights: Corresponding to the projection module in stage (c), which converts protein representations into token embeddings compatible with the language model (modality refinement projector).

  4. LoRA Fine-Tuning Weights: Corresponding to stage (d), where low-rank adaptation (LoRA) is used to efficiently fine-tune selected layers of the language model, while jointly optimizing the projection module to improve task-specific instruction following (LoRA adapter).

Except for the base model, the remaining three types of weights can be downloaded from our opus-pllm-weights Model Zoo 👈.

Results

(Figure: performance comparison of different models; full caption below.)

Performance comparison of different models across five protein function prediction tasks. a-e) OPUS-PLLM versus state-of-the-art generative LLMs on: a) two subcellular localization datasets, b) three UniProt keyword datasets, c) three GO term datasets, d) two EC number datasets, and e) three functional description generation datasets. f) Example outputs from the functional description generation tasks. g-j) OPUS-PLLM versus discriminative approaches based on different PLM representations (ESM2, ProtT5, Ankh) on: g) two subcellular localization datasets, h) three UniProt keyword datasets, i) three GO term datasets, and j) two EC number datasets.

🚀 Inference and Evaluation

First, ensure you have set up the environment as specified in the Dependencies section; if not, follow those instructions step by step before proceeding.

Download the corresponding test sets from our 🤗 Hugging Face Repo and keep their original filenames. The evaluation metrics are selected automatically based on keywords in the test dataset names. For example, test sets containing "GO" in their names are routed through a dedicated pipeline that computes precision, recall, and F1 score between the generated text and the ground truth; the other dataset types are routed to their corresponding metric pipelines in the same way.
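
For intuition, GO-style scoring can be read as set-level precision/recall/F1 between the predicted and ground-truth term lists. The sketch below is illustrative only, not the project's evaluation code; it assumes terms are semicolon-separated, as in the sample model output shown later in this README:

```python
# Illustrative set-level precision/recall/F1 between a generated term list
# and the ground truth. NOT the official evaluation pipeline; assumes
# semicolon-separated terms as in the sample output below.
def parse_terms(text: str) -> set[str]:
    return {t.strip().lower() for t in text.split(";") if t.strip()}

def prf1(prediction: str, reference: str) -> tuple[float, float, float]:
    pred, ref = parse_terms(prediction), parse_terms(reference)
    tp = len(pred & ref)                                  # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = prf1("cytosol; nucleus; iron ion binding",
               "cytosol; iron ion binding; response to hypoxia")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")     # 0.67 / 0.67 / 0.67
```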

🔄 OPUS-PLLM-Llama3-8B

If you use OPUS-PLLM-Llama3-8B, the base model (Llama3-8B) can be downloaded here 👈, while the corresponding opus-pllm-weights (modality encoding adapter weights, projection weights, and LoRA weights) can be found in our Model Zoo 👈.

Once all model weights have been prepared, navigate to the evaluation directory OPUS-PLLM/multi_modality_model/multi_modality_v1/eval/ and execute the provided scripts.

For Batch Annotation:

accelerate launch run_opus_ddp.py \
--model-base-path /path/Llama3-8B/ \
--opus-pllm-weights-path /path/opus-pllm-weights/ \
--input_path /path/to/file \
--save_path /path/to/save

  • model-base-path: path to the base model directory (e.g., Llama3-8B).

  • opus-pllm-weights-path: directory containing the OPUS-PLLM weights, including the modality encoding adapter weights, modality refinement projection weights, and LoRA fine-tuning weights.

  • input_path: path to the input test dataset in JSON format.

  • save_path: output directory where the inference results will be saved.
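The exact JSON schema is defined by the released test sets on Hugging Face; purely as a hypothetical illustration (the field names below are assumptions, not the official format), preparing a small instruction/sequence-style input file might look like:

```python
# Hypothetical example of writing a small JSON test file. The field names
# ("instruction", "sequence") are assumptions for illustration only; use
# the schema of the official test sets as the ground truth.
import json

examples = [
    {
        "instruction": "Given a protein sequence, predict the corresponding "
                       "Gene Ontology term that describes its molecular "
                       "function, biological process, and cellular component.",
        "sequence": "MPYFAQRLYNTCKASFSSDGPITEDALEKVRNVLEKIKPSDVGIEQDAQL...",
    },
]

with open("my_test_set.json", "w") as f:
    json.dump(examples, f, indent=2)
```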

For Online Inference:

python run_opus_online.py \
--model-base-path /path/Llama3-8B/ \
--opus-pllm-weights-path /path/opus-pllm-weights/

You can run OPUS-PLLM-Llama3-8B in an interactive, single-turn mode directly in the terminal.

The terminal will prompt:

  • Enter your instruction:
    e.g., "Given a protein sequence, predict the corresponding Gene Ontology term that describes its molecular function, biological process, and cellular component."

  • Enter the protein sequence (or leave empty to skip):
    e.g.,
    MPYFAQRLYNTCKASFSSDGPITEDALEKVRNVLEKIKPSDVGIEQDAQLARSRSGPLNERNGSNQSPPAIKYLHLHECDSFSIGIFCMPPSSMIPLHNHPGMTVLSKLVYGSMHVKSYDWLEPQLTEPEDPSQARPAKLVKDTEMTAQSPVTTLYPKSGGNIHCFKAITHCAILDILAPPYSSEHDRHCTYFRKSRREDLPGELEVDGEVVTDVTWLEEFQPPDDFVIRRIPYRGPVIRT

The model will return output like:
cytosol; nucleus; cysteine dioxygenase activity; iron ion binding; cellular response to hypoxia; detection of hypoxia; response to hypoxia

🔄 OPUS-PLLM-Llama3-8B-Evol

To experience OPUS-PLLM-Llama3-8B-Evol's protein-sequence-centered interactive capabilities in conversational mode, or to verify its performance on our provided MCQ benchmark, download the base model (Llama3-8B-Instruct) here 👈; the corresponding opus-pllm-weights (modality encoding adapter weights, projection weights, and LoRA weights) can be found in our Model Zoo 👈.

For Batch MCQ Inference:

accelerate launch eval_run_multichoice.py \
--model-base-path /path/Llama3-8B-Instruct/ \
--opus-pllm-weights-path /path/opus-pllm-weights/ \
--input_path /path/to/file \
--save_path /path/to/save

  • model-base-path: path to the Llama3-8B-Instruct base model.

  • opus-pllm-weights-path: directory containing the OPUS-PLLM weights, including the modality encoding adapter weights, modality refinement projection weights, and LoRA fine-tuning weights.

  • input_path: path to the input test dataset in JSON format (a hypothetical sketch of an MCQ entry is shown below).

  • save_path: output directory where the inference results will be saved.
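
As with batch annotation, the MCQ schema comes from the released benchmark files; purely as a hypothetical sketch (these field names are assumptions), an MCQ entry might pair a question and options with a sequence:

```python
# Hypothetical MCQ-style entry for illustration only; the official MCQ
# benchmark files on Hugging Face define the actual schema.
import json

mcq_examples = [
    {
        "question": "Which subcellular location is most likely for this protein?",
        "options": {"A": "Nucleus", "B": "Cytoplasm", "C": "Membrane", "D": "Secreted"},
        "sequence": "MPYFAQRLYNTCKASFSSDGPITEDALEKVRNVLEKIKPSDVGIEQDAQL...",
        "answer": "A",
    },
]

with open("my_mcq_set.json", "w") as f:
    json.dump(mcq_examples, f, indent=2)
```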

For Online Inference:

python eval_run_online.py \
--model-base-path /path/Llama3-8B-Instruct/ \
--opus-pllm-weights-path /path/opus-pllm-weights/

As with the online inference script for the annotation model, enter any instruction and protein sequence in the terminal, and the model will reply with a detailed, professional response.

🏛️ Model Zoo

We provide four models. OPUS-PLLM-Llama3-8B, OPUS-PLLM-Galactica-1.3B, and OPUS-PLLM-Galactica-6.7B are primarily designed for protein function annotation tasks, while OPUS-PLLM-Llama3-8B-Evol is specifically designed for diverse and complex daily interactions, with enhanced instruction-following capabilities.

| Model name | Model Type | Base Model | OPUS-PLLM-Weights |
| --- | --- | --- | --- |
| OPUS-PLLM-Llama3-8B | Base | Llama3-8B🐪 | Link🤗 |
| OPUS-PLLM-Galactica-1.3B | Base | Galactica-1.3B🌌 | Link🤗 |
| OPUS-PLLM-Galactica-6.7B | Base | Galactica-6.7B🌌 | Link🤗 |
| OPUS-PLLM-Llama3-8B-Evol | Evol | Llama3-8B-Instruct🐪 | Link🤗 |

Note on Inference Variability

Our model is an autoregressive language model, in which each token is generated by sampling from the probability distribution over the vocabulary predicted at each step. As a result, even with identical input, the inherent non-determinism introduced by token sampling—combined with variations in hardware resource scheduling (e.g., GPU memory usage)—may lead to slightly different outputs across multiple inference runs. Therefore, the results you obtain may not exactly match the numbers reported in the paper, but they should generally fall within a comparable and stable range.
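
If you want to reduce (though not fully eliminate) this run-to-run variability, a common approach is to fix random seeds and use greedy decoding. The sketch below is general PyTorch/Transformers advice, not a built-in option of the OPUS-PLLM scripts:

```python
# General PyTorch recipe for reducing sampling nondeterminism; not a
# built-in flag of the OPUS-PLLM scripts.
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)                 # Python's RNG
    np.random.seed(seed)              # NumPy's RNG
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPU RNGs

set_seed(42)
# With Hugging Face transformers, greedy decoding removes token sampling:
#   model.generate(..., do_sample=False)
```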

Acknowledgements

We gratefully acknowledge the base models used in this work: Llama3-8B🐪, Galactica-1.3B🌌, Galactica-6.7B🌌, and Llama3-8B-Instruct🐪.

Citing OPUS-PLLM

If you find this project helpful, please cite our paper:

@article{opuspllm2025,
  title={OPUS-PLLM: Advancing Generative Large Language Models Toward Discriminative Performance in Protein Function Prediction},
  author={Ying Lv and Yifan Xu and Gang Xu and Jianpeng Ma},
  journal={},
  year={2025}
}

Contact

For any questions or issues, please open a GitHub issue or contact Ying Lv ([email protected]) or Yifan Xu ([email protected]).
