PepTron is a sequence-to-ensemble generative model designed to accurately represent protein ensembles with any level of disorder content.
This makes it the ideal choice for multi-domain proteins, which are the most common target class in cutting-edge therapeutics.
# Clone the repository
git clone https://github.com/PeptoneLtd/peptron.git
cd peptron
# Build Docker container
docker build -t peptron:latest .
# Run container
docker run --gpus all -it --rm peptron:latest

Pre-trained PepTron checkpoints are available for download at https://zenodo.org/records/17306061:
- PepTron: best performance across the whole proteome
- PepTron-base: model pre-trained on the PDB, used for fine-tuning on disordered regions to obtain PepTron
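If you run inference from inside the container, you will typically want to make the checkpoint, the input CSV, and a results directory visible to it. A minimal sketch, where the host-side paths are placeholders you adapt to your setup:

# Mount checkpoint, data, and results directories into the container (host paths are placeholders)
docker run --gpus all -it --rm \
  -v /path/to/peptron-checkpoint:/workspace/peptron-checkpoint \
  -v /path/to/data:/workspace/data \
  -v /path/to/results:/workspace/results \
  peptron:latest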
Generate protein structure ensembles from sequences:
Modify the configuration in peptron/infer.py if needed (we suggest keeping the defaults):
EXEC_CONFIG = config_flags.DEFINE_config_file('config', 'peptron/model/config.py:peptron_o_inference_cueq')

Create a CSV file with your protein sequences:
name,seqres
protein1,MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQA
protein2,MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDID
Download PepTron.tar.gz from the Zenodo record above and unzip it.
The peptron-checkpoint directory is your checkpoint.
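For example (a sketch; the exact file URL is an assumption, so check the file listing on the Zenodo record page):

# Download and unpack the checkpoint archive (file URL assumed from the Zenodo record)
wget https://zenodo.org/records/17306061/files/PepTron.tar.gz
tar -xzf PepTron.tar.gz
# The extracted peptron-checkpoint directory is what CKPT_PATH should point to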
Using the convenience script:
# Edit run_peptron_infer.sh with your paths
export CKPT_PATH="/path/to/the/peptron-checkpoint"
export RESULTS_PATH="/path/to/results"
export CSV_FILE="/path/to/sequences.csv"
sh run_peptron_infer.sh

MSAs are needed here only for the data processing and input pipelines; they are not used during training.
To download and preprocess the PDB:
- Run `aws s3 sync --no-sign-request s3://pdbsnapshots/20230102/pub/pdb/data/structures/divided/mmCIF pdb_mmcif` from the desired directory.
- Run `find pdb_mmcif -name '*.gz' | xargs gunzip` to extract the mmCIF files.
- Prepare an MSA directory and place the alignments in .a3m format at the following paths: `{alignment_dir}/{name}/a3m/{name}.a3m`. If you don't have the MSAs, there are two ways to generate them:
  - Query the ColabFold server with `python -m dataprep.mmseqs_query --split [PATH] --outdir [DIR]`.
  - Download UniRef30 and ColabDB according to https://github.com/sokrypton/ColabFold/blob/main/setup_databases.sh and run `python -m scripts.mmseqs_search_helper --split [PATH] --db_dir [DIR] --outdir [DIR]`.
- From the repository root, run `python -m dataprep.unpack_mmcif --mmcif_dir [DIR] --outdir [DIR] --num_workers [N]`. This will preprocess all chains into NPZ files and create a `pdb_mmcif.csv` index.
- Download OpenProteinSet with `aws s3 sync --no-sign-request s3://openfold/openfold` from the desired directory.
- Run `python -m dataprep.add_msa_train_info --openfold_dir [DIR]` to produce a `pdb_mmcif_msa.csv` index with OpenProteinSet MSA lookup.
- Run `python -m dataprep.cluster_chains` to produce a `pdb_clusters` file at 40% sequence similarity (MMseqs2 installation required).
- Create MSAs for the PDB validation split (`splits/cameo2022.csv`) according to standard MSA generation procedures and `add_msa_val_info`.
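Putting the PDB pipeline together, roughly in order (a sketch: the bracketed placeholders are unchanged from the list above, the named output directories and worker count are illustrative choices, and the MSA step shown uses the ColabFold server option):

# Download and extract a PDB snapshot
aws s3 sync --no-sign-request s3://pdbsnapshots/20230102/pub/pdb/data/structures/divided/mmCIF pdb_mmcif
find pdb_mmcif -name '*.gz' | xargs gunzip
# Generate MSAs via the ColabFold server if you do not already have them
python -m dataprep.mmseqs_query --split [PATH] --outdir [DIR]
# Preprocess all chains into NPZ files and build the pdb_mmcif.csv index
python -m dataprep.unpack_mmcif --mmcif_dir pdb_mmcif --outdir pdb_npz --num_workers 16
# Fetch OpenProteinSet (destination directory name is illustrative) and add the MSA lookup index
aws s3 sync --no-sign-request s3://openfold/openfold openfold
python -m dataprep.add_msa_train_info --openfold_dir openfold
# Cluster chains at 40% sequence similarity (requires MMseqs2)
python -m dataprep.cluster_chains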
The IDRome-o dataset was created by predicting ensembles for the IDRome training and validation sequences with https://github.com/PeptoneLtd/IDP-o.
To download and preprocess the IDRome-o dataset:
- Download IDRome-o from https://zenodo.org/records/17306061.
- Place the MSA directory in your preferred location.
- From the repository root, run `python -m dataprep.prep_idrome --split [FILE] --ensemble_dir [DIR] --outdir [DIR] --num_workers [N]`. This will preprocess the IDRome trajectories into NPZ files. Do this for both the train and val splits.
- Run `python -m dataprep.add_msa_train_info` and `python -m dataprep.add_msa_val_info`.
- Create MSAs for all entries in `splits/IDRome_DB-train-msa.csv` and `splits/IDRome_DB-val-msa.csv` according to standard MSA generation procedures.
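Putting the IDRome-o pipeline together (a sketch: the bracketed placeholders are unchanged from the list above and the output directory names and worker count are illustrative):

# Preprocess the IDRome-o trajectories into NPZ files, once per split
python -m dataprep.prep_idrome --split [TRAIN_SPLIT_FILE] --ensemble_dir [DIR] --outdir idrome_train_npz --num_workers 16
python -m dataprep.prep_idrome --split [VAL_SPLIT_FILE] --ensemble_dir [DIR] --outdir idrome_val_npz --num_workers 16
# Add MSA lookup information for the train and val splits
python -m dataprep.add_msa_train_info
python -m dataprep.add_msa_val_info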
Modify the configuration in peptron/train.py based on your training strategy:
EXEC_CONFIG = config_flags.DEFINE_config_file('config', 'peptron/model/config.py:peptron_o_mixed')

Edit peptron/model/config.py in the training section:
"training": {
"experiment_dir": "/path/to/your/experiment/dir",
"wandb_project": "peptron-stable",
"experiment_name": "your-experiment-name",
"n_steps_train": 2500,
"warmup_steps_percentage": 0.10,
"train_epoch_len": 80000,
"val_epoch_len": 5,
"micro_batch_size": 8,
"num_nodes": 1,
"devices": 8,
"tensor_model_parallel_size": 1,
"pipeline_model_parallel_size": 1,
"accumulate_grad_batches": 1,
"steps_to_save_ckpt": 100,
"val_check_interval": 100,
"limit_val_batches": 3,
"precision": "bf16-mixed",
"initial_nemo_ckpt_path": "/path/to/initial/checkpoint",
# Data paths
"train_data_dir_pdb": "/path/to/pdb_mmcif_npz_dir",
"val_data_dir_pdb": "/path/to/pdb_mmcif_npz_dir",
"train_msa_dir_pdb": "/path/to/pdb_msa_dir",
"val_msa_dir_pdb": "/path/to/cameo2022_msa_dir",
# Chain files
"train_chains_pdb": "splits/pdb_chains_msa.csv",
"valid_chains_pdb": "splits/cameo2022_msa.csv",
"train_data_dir_idp": "/path/to/IDRome_train_dir",
"train_msa_dir_idp": "/path/to/IDRome_train_msa_dir",
"train_chains_idp": "splits/IDRome_DB-train-msa.csv",
"mmcif_dir": "/path/to/pdb_mmcif_dir",
"dataset_prob_pdb": 0.3,
"dataset_prob_idp": 0.7,
"train_clusters": "/path/to/pdb_clusters",
"train_cutoff": "2020-05-01",
"encoder_frozen": True,
"structure_frozen": False,
"pretrained_structure_head_path": "",
}

# Single node training
sh run_peptron_train.sh
# Multi-node distributed training
sh run_peptron_distributed_train.sh

Key parameters you can modify in the inference configuration:
- `samples`: Number of ensemble conformations to generate (default: 10)
- `steps`: Number of diffusion denoising steps (default: 10)
- `max_batch_size`: Number of structures generated in parallel for each predicted ensemble (default: 1)
- `num_gpus`: Number of GPUs PepTron will use during inference (default: 1)
NOTE1: The `num_gpus` parameter should not exceed the number of sequences in the CSV_FILE.
NOTE2: The longer the sequence, the smaller `max_batch_size` has to be to avoid out-of-memory errors. We set the default to 1
as a safe configuration, but we encourage you to increase it based on your GPU memory and maximum sequence length. The larger
the ensemble you want to generate, the more you benefit from increasing this parameter.
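As an illustration only (the entry point and flag layout below are assumptions, not the documented interface; in practice you may prefer to edit the defaults in peptron/model/config.py and use run_peptron_infer.sh), ml_collections config files can typically be overridden from the command line:

# Hypothetical invocation, assuming peptron/infer.py is runnable as a module and these keys
# sit at the top level of the inference config; adjust to the actual entry point
python -m peptron.infer \
  --config peptron/model/config.py:peptron_o_inference_cueq \
  --config.samples=25 \
  --config.steps=20 \
  --config.max_batch_size=4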
Key parameters for training configuration:
- `flow_matching.noise_prob`: Probability of adding noise during training
- `flow_matching.self_cond_prob`: Self-conditioning probability
- `crop_size`: Input sequence crop size for memory management
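The dotted names suggest these sit in a nested section of peptron/model/config.py; a sketch of the structure that naming implies, with placeholder values rather than recommended settings:

"flow_matching": {
    "noise_prob": 0.5,       # probability of adding noise during training (placeholder value)
    "self_cond_prob": 0.5,   # self-conditioning probability (placeholder value)
},
"crop_size": 256,            # input sequence crop size for memory management (placeholder value)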
PepTron's performance compared to other structural models can be evaluated using PeptoneBench.
To run PeptoneBench evaluation with PepTron:
- Install PeptoneBench following the instructions at https://github.com/PeptoneLtd/peptonebench
- Generate PepTron ensembles for your target proteins using the inference pipeline above
- Use PeptoneBench to evaluate the generated ensembles against experimental observables
For detailed evaluation procedures and benchmark comparisons, please refer to the PeptoneBench repository and the PepTron paper (see Citation below).
- CUDA Out of Memory: Keep `micro_batch_size=1` and tune `max_batch_size` based on your needs.
- cuEquivariance Import Error: Ensure cuequivariance-torch is properly installed. Discard the torchdynamo warnings.
- Checkpoint Loading Error: Verify checkpoint path and model configuration compatibility
- Training Convergence Issues: Check data paths and CSV file formats
- Check the Issues for common problems
- Review configuration examples in `peptron/model/config.py`
- Ensure all data paths are correctly set in the training configuration
PepTron delivers the predictive accuracy required to finally characterize multi-domain proteins and IDPs, unlocking new frontiers in:
- Drug Discovery: Accurate modeling of disordered therapeutic targets
- Protein Engineering: Design of flexible, functional protein domains
- Fundamental Biology: Understanding disorder's role in cellular processes
- Therapeutic Development: Targeting the most common protein class in modern medicine
If you use PepTron in your research, please cite:
@article{peptone2025,
title = {Advancing Protein Ensemble Predictions Across the Order-Disorder Continuum},
author = {Invernizzi, Michele and Bottaro, Sandro and Streit, Julian O and Trentini, Bruno and Venanzi, Niccolo AE and Reidenbach, Danny and Lee, Youhan and Dallago, Christian and Sirelkhatim, Hassan and Jing, Bowen and Airoldi, Fabio and Lindorff-Larsen, Kresten and Fisicaro, Carlo and Tamiola, Kamil},
year = 2025,
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
doi = {10.1101/2025.10.18.680935},
url = {https://www.biorxiv.org/content/early/2025/10/19/2025.10.18.680935}
}

Copyright 2025 Peptone Ltd
Licensed under the Apache License, Version 2.0. You may obtain a copy of the License at: http://www.apache.org/licenses/LICENSE-2.0
PepTron is developed through a collaboration between Peptone Ltd, NVIDIA, and MIT, leveraging the BioNeMo platform for optimized biological AI computing. Special thanks to the computational biology community for advancing our understanding of protein disorder and dynamics.