Biomolecular Emulator (BioEmu for short) is a model that samples from the approximated equilibrium distribution of structures for a protein monomer, given its amino acid sequence.
For more information, see our paper (citation below).
This repository contains inference code and model weights.
bioemu is provided as a Linux-only pip-installable package:
pip install bioemu
Note
The first time bioemu is used to sample structures, it will also set up ColabFold in a separate virtual environment for MSA and embedding generation. By default this setup uses the ~/.bioemu_colabfold directory. To use a different location, set the BIOEMU_COLABFOLD_DIR environment variable before sampling for the first time.
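When sampling through the Python API, the same override can be applied from within Python, provided the variable is set before the first sampling call. The following is a minimal sketch; the directory used here is a hypothetical placeholder, not a recommended location.

```python
import os
from pathlib import Path

# Hypothetical location for the ColabFold environment (assumption: any
# writable directory is acceptable; choose one that suits your system).
colabfold_dir = Path.home() / "envs" / "bioemu_colabfold"

# Set the variable before bioemu samples for the first time, because the
# ColabFold setup is triggered on first use.
os.environ["BIOEMU_COLABFOLD_DIR"] = str(colabfold_dir)

print(os.environ["BIOEMU_COLABFOLD_DIR"])
```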
You can sample structures for a given protein sequence using the sample module. To run a tiny test using the default model parameters and denoising settings:
python -m bioemu.sample --sequence GYDPETGTWG --num_samples 10 --output_dir ~/test-chignolin
Alternatively, you can use the Python API:
from bioemu.sample import main as sample
sample(sequence='GYDPETGTWG', num_samples=10, output_dir='~/test_chignolin')
The model parameters will be automatically downloaded from Hugging Face. A path to a single-sequence FASTA file can also be passed to the sequence argument.
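As an illustration of the FASTA route, the sketch below writes a single-sequence FASTA file with the standard library and passes its path to the sampler. The helper names `write_single_sequence_fasta` and `sample_from_fasta`, the record name `query`, and the file name in the usage note are hypothetical; only the `sample(sequence=..., num_samples=..., output_dir=...)` call comes from the example above.

```python
from pathlib import Path


def write_single_sequence_fasta(path, sequence, name="query"):
    """Write one sequence to a FASTA file and return the path as a string."""
    path = Path(path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    # A FASTA record is a header line starting with '>' followed by the sequence.
    path.write_text(f">{name}\n{sequence}\n")
    return str(path)


def sample_from_fasta(fasta_path, num_samples, output_dir):
    """Call the sampler with a FASTA path in place of a raw sequence."""
    # Imported lazily so the helper above works without bioemu installed.
    from bioemu.sample import main as sample

    sample(sequence=fasta_path, num_samples=num_samples, output_dir=output_dir)
```

For example, `sample_from_fasta(write_single_sequence_fasta("chignolin.fasta", "GYDPETGTWG"), 10, "~/test_chignolin")` would be expected to behave like the raw-sequence call shown earlier.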
Sampling times will depend on sequence length and available infrastructure. The following table gives times for collecting 1000 samples measured on an A100 GPU with 80 GB VRAM for sequences of different lengths (using a batch_size_100=20 setting in sample.py):
| sequence length | time / min | 
|---|---|
| 100 | 4 | 
| 300 | 40 | 
| 600 | 150 | 
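For planning purposes, a rough estimate for intermediate sequence lengths can be obtained by interpolating between the measurements above. The sketch below assumes power-law (log-log) interpolation between neighbouring table rows and linear scaling with the number of samples; both are assumptions made for illustration, not measured behaviour, and the numbers only apply to hardware and settings comparable to those in the table.

```python
import math

# (sequence length, minutes per 1000 samples), copied from the table above.
MEASURED = [(100, 4.0), (300, 40.0), (600, 150.0)]


def estimate_minutes(length, num_samples=1000):
    """Rough sampling-time estimate via log-log interpolation (assumption)."""
    if not MEASURED[0][0] <= length <= MEASURED[-1][0]:
        # No extrapolation: the table gives no information outside its range.
        raise ValueError("length is outside the measured range")
    for (l0, t0), (l1, t1) in zip(MEASURED, MEASURED[1:]):
        if l0 <= length <= l1:
            # Local power-law exponent between the two neighbouring rows.
            exponent = math.log(t1 / t0) / math.log(l1 / l0)
            minutes = t0 * (length / l0) ** exponent
            # Assumes time grows linearly with the number of samples.
            return minutes * num_samples / 1000
```

Calling `estimate_minutes(200)` gives a value between the 100- and 300-residue rows; treat it as an order-of-magnitude guide only.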
By default, unphysical structures (steric clashes or chain discontinuities) will be filtered out, so you will typically get fewer samples in the output than requested. The difference can be very large if your protein has large disordered regions which are very likely to produce clashes. If you want to get all generated samples in the output, irrespective of whether they are physically valid, use the --filter_samples=False argument.
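If a specific number of physically valid structures is needed, one option is to run a small pilot, see how many samples survive filtering, and scale the request accordingly. The helper below is a hypothetical sketch of that arithmetic; it assumes the surviving fraction observed in the pilot carries over to the full run, which will only hold approximately.

```python
def samples_to_request(target_valid, pilot_requested, pilot_kept):
    """Estimate how many samples to request to end up with `target_valid`.

    pilot_requested: number of samples requested in a small pilot run.
    pilot_kept: number of samples left in the pilot output after filtering.
    """
    if pilot_kept <= 0:
        raise ValueError("pilot run kept no samples; cannot estimate a ratio")
    # Integer ceiling of target_valid * pilot_requested / pilot_kept.
    numerator = target_valid * pilot_requested
    return -(-numerator // pilot_kept)
```

For instance, if a pilot of 100 requested samples yields 80 valid structures, `samples_to_request(1000, 100, 80)` returns 1250.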
Note
If you wish to use your own generated MSA instead of the ones retrieved via Colabfold, you can pass an A3M file containing the query sequence as the first row to the sequence argument. Additionally, the msa_host_url argument can be used to override the default Colabfold MSA query server. See sample.py for more options.
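Before passing a custom A3M file, it can be worth checking that the query sequence really is the first row. The following is a minimal sketch using only the standard library; the function names are hypothetical, and skipping lines that start with `#` is an assumption made to tolerate A3M files that begin with a comment line.

```python
from pathlib import Path


def first_a3m_sequence(a3m_path):
    """Return the first sequence record of an A3M file as a single string."""
    parts = []
    seen_header = False
    for raw in Path(a3m_path).read_text().splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines carry no sequence data.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if seen_header:
                break  # second record reached, so the first one is complete
            seen_header = True
            continue
        if seen_header:
            parts.append(line)
    return "".join(parts)


def query_is_first_row(a3m_path, query_sequence):
    """Check that the first A3M row equals the query sequence exactly."""
    return first_a3m_sequence(a3m_path) == query_sequence
```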
This code only supports sampling structures of monomers. You can try to sample multimers using the linker trick, but in our limited experiments, this has not worked well.
BioEmu is also available on Azure AI Foundry. See How to run BioEmu on Azure AI Foundry for more details.
You can use this code together with code from bioemu-benchmarks to approximately reproduce results from our preprint.
The bioemu-v1.0 checkpoint contains the model weights used to produce the results in the preprint. Due to simplifications made in the embedding computation and a more efficient sampler, the results obtained with this code are not identical but consistent with the statistics shown in the preprint, i.e., mode coverage and free energy errors averaged over the proteins in a test set. Results for individual proteins may differ. For more details, please check the BIOEMU_RESULTS.md document on the bioemu-benchmarks repository.
BioEmu outputs structures in backbone frame representation. To reconstruct the side-chains, several tools are available. As an example, we interface with HPacker to conduct side-chain reconstruction, and also provide basic tooling for running a short molecular dynamics (MD) equilibration.
Warning
This code is experimental and relies on a conda-based package manager because HPacker has conda as a dependency. Make sure that conda is on your PATH and that you have CUDA 12-compatible drivers before running the following code.
Install optional dependencies:
pip install bioemu[md]
You can compute side-chain reconstructions via the bioemu.sidechain_relax module:
python -m bioemu.sidechain_relax --pdb-path path/to/topology.pdb --xtc-path path/to/samples.xtc
Note
The first time this module is invoked, it will attempt to install hpacker and its dependencies into a separate hpacker conda environment. If you wish for it to be installed in a different location, please set the HPACKER_ENV_NAME environment variable before using this module for the first time.
By default, side-chain reconstruction and local energy minimization are performed (no full MD integration for efficiency reasons). Note that the runtime of this code scales with the size of the system. We suggest running this code on a selection of samples rather than the full set.
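One way to pick such a selection is to take evenly spaced frames from the sampled trajectory and write them to a smaller file. The sketch below is illustrative only: the function names are hypothetical, and the second function assumes that mdtraj is installed and can read the topology and trajectory files.

```python
def select_frame_indices(n_frames, n_select):
    """Return up to `n_select` evenly spaced frame indices in [0, n_frames)."""
    if n_frames <= 0 or n_select <= 0:
        raise ValueError("n_frames and n_select must be positive")
    if n_select >= n_frames:
        return list(range(n_frames))
    if n_select == 1:
        return [0]
    # Integer arithmetic keeps the first and last frame and avoids duplicates.
    return [(i * (n_frames - 1)) // (n_select - 1) for i in range(n_select)]


def subsample_trajectory(pdb_path, xtc_path, out_xtc_path, n_select):
    """Write an evenly spaced subset of frames to a new XTC file."""
    # Assumption: mdtraj is available; imported lazily so that the index
    # helper above can be used without it.
    import mdtraj

    traj = mdtraj.load(xtc_path, top=pdb_path)
    indices = select_frame_indices(traj.n_frames, n_select)
    traj.slice(indices).save_xtc(out_xtc_path)
```

The resulting file can then be passed to --xtc-path together with the original topology.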
There are two other options:
- To only run side-chain reconstruction without MD equilibration, add --no-md-equil.
- To run a short NVT equilibration (0.1 ns), add --md-protocol nvt_equil.
To see the full list of options, call python -m bioemu.sidechain_relax --help.
The script saves reconstructed all-heavy-atom structures in samples_sidechain_rec.{pdb,xtc} and MD-equilibrated structures in samples_md_equil.{pdb,xtc}. The base filename can be changed with --outname other_name.
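When the relaxation step is driven from a Python script, the command line and environment can be assembled programmatically. The following is a sketch under stated assumptions: the helper names are hypothetical, only the flags and the HPACKER_ENV_NAME variable described above are used, and nothing is executed until the result is handed to `subprocess.run`.

```python
import os
import sys


def build_relax_command(pdb_path, xtc_path, md_equil=True,
                        md_protocol=None, outname=None):
    """Assemble the `python -m bioemu.sidechain_relax` command as a list."""
    cmd = [
        sys.executable, "-m", "bioemu.sidechain_relax",
        "--pdb-path", str(pdb_path),
        "--xtc-path", str(xtc_path),
    ]
    if not md_equil:
        # Side-chain reconstruction only, without MD equilibration.
        cmd.append("--no-md-equil")
    if md_protocol is not None:
        # For example "nvt_equil" for the short NVT equilibration.
        cmd += ["--md-protocol", md_protocol]
    if outname is not None:
        cmd += ["--outname", outname]
    return cmd


def build_relax_env(hpacker_env_name=None):
    """Copy the current environment, optionally setting HPACKER_ENV_NAME."""
    env = dict(os.environ)
    if hpacker_env_name is not None:
        env["HPACKER_ENV_NAME"] = hpacker_env_name
    return env
```

A call such as `subprocess.run(build_relax_command("path/to/topology.pdb", "path/to/samples.xtc", md_protocol="nvt_equil"), env=build_relax_env(), check=True)` would then be expected to match the corresponding shell invocation.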
The code in the openfold subdirectory is copied from openfold with minor modifications. The modifications are described in the relevant source files.
If you have any questions not covered here, please create an issue or contact the BioEmu team by writing to the corresponding author on our preprint.
If you are using our code or model, please cite the following paper:
@article{bioemu2025,
  title={Scalable emulation of protein equilibrium ensembles with generative deep learning},
  author={Lewis, Sarah and Hempel, Tim and Jim{\'e}nez-Luna, Jos{\'e} and Gastegger, Michael and Xie, Yu and Foong, Andrew YK and Satorras, Victor Garc{\'\i}a and Abdin, Osama and Veeling, Bastiaan S and Zaporozhets, Iryna and Chen, Yaoyi and Yang, Soojung and Foster, Adam E. and Schneuing, Arne and Nigam, Jigyasa and Barbero, Federico and Stimper, Vincent and Campbell, Andrew and Yim, Jason and Lienen, Marten and Shi, Yu and Zheng, Shuxin and Schulz, Hannes and Munir, Usman and Sordillo, Roberto and Tomioka, Ryota and Clementi, Cecilia and No{\'e}, Frank},
  journal={Science},
  pages={eadv9817},
  year={2025},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.adv9817}
}