Repository implementing aSAM (atomistic Structural Autoencoder Model) in PyTorch. aSAM is a latent diffusion model for generating heavy atom conformations of proteins. The model was trained on datasets of molecular dynamics (MD) simulations from mdCATH or ATLAS. Here we provide code for aSAM and weights of two models trained on these two datasets.
This repository mainly implements the inference phase of aSAM (panel c in the figure below). Given an input 3D structure provided as a PDB file (and optionally a temperature T), you can generate a 3D structural ensemble that should represent an MD ensemble of the kind found in the training dataset of aSAM: 100 ns x 3 runs for ATLAS or 500 ns x 5 runs for mdCATH.
We provide a Linux-only package for running aSAM. Here is how to install it:
- Clone the repository:
  git clone https://github.com/giacomo-janson/sam2.git
  and navigate to the root directory of the repository.
- We recommend installing the aSAM package in a dedicated Python environment, for example a Conda environment or a Python virtual environment. If you use Conda:
  # Create the environment.
  conda create --name sam2 python=3.10
  # Activate the environment.
  conda activate sam2
- Optional (but recommended): if you want to use a GPU to speed up aSAM sampling, you may need to manually install PyTorch with CUDA. We have tested aSAM with PyTorch versions 1.13.1 to 2.6.0; higher versions should also work. Make sure to choose a PyTorch build that targets a CUDA version compatible with your system. Check the PyTorch website for installation instructions. Note: if you skip this step, a PyTorch version will be installed automatically in the next step, but it may not include CUDA support depending on your system.
- Install the sam Python library and its dependencies:
  pip install -e .
We tested the package on Python 3.10, but it should work with more recent versions too. The entire installation process should take a few minutes at most.
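After installation, it can help to check whether the installed PyTorch build can actually see a GPU before choosing a device for sampling. The snippet below is a minimal sketch of such a check; the helper name pick_device is hypothetical and is not part of the sam package.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch is installed and reports a usable GPU, else "cpu"."""
    try:
        import torch  # imported lazily so a missing install does not raise here
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    # Prints the device string that could be used for sampling.
    print(pick_device())
```

If this prints cpu on a machine that has a GPU, the installed PyTorch build most likely lacks CUDA support and the optional step above is worth revisiting.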
You can generate a structural ensemble of a protein via the scripts/generate_ensemble.py inference script. To run the mdCATH-based aSAMt model, the usage is:
python scripts/generate_ensemble.py -c config/mdcath_model.yaml -i examples/input/4qbuA03.320.pdb -o protein -n 24 -b 8 -T 320 -d cuda
Here is a description of the arguments:
- -c: configuration file for aSAMt. Use the default ones provided in the config directory of the repository.
- -i: initial PDB file (the input 3D structure).
- -o: output path. In this example, the command will save a series of output files named protein.*. The expected output is: (i) a DCD trajectory file storing the conformations you generated; (ii) a PDB file that you can use as topology for parsing the DCD file (this conformation is simply the first snapshot of the trajectory file).
- -n: number of conformations to generate. In this example we set it to 24, but depending on the application you may need more.
- -b: batch size used during sampling.
- -T: simulation temperature (in Kelvin) at which to generate an ensemble.
- -d: PyTorch device for the aSAMt models. If you want to generate ensembles with a large number of conformations, we recommend using a GPU via the cuda value here.
There are also other options that you can tweak. Use the --help flag to get the full list. Using a GPU, generating 250 conformations should not take more than a few minutes.
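The DCD trajectory and PDB topology written by the script can be loaded together for downstream analysis. The following is a minimal sketch, assuming mdtraj is installed and that the two files are named with the .dcd and .pdb extensions after the output prefix (the exact file names are an assumption, since the text only states that they are named protein.*). The helper names output_paths and load_ensemble are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def output_paths(prefix: str) -> Tuple[Path, Path]:
    """Return the assumed (trajectory, topology) paths for an output prefix."""
    # Assumption: outputs are written as <prefix>.dcd and <prefix>.pdb.
    return Path(f"{prefix}.dcd"), Path(f"{prefix}.pdb")


def load_ensemble(prefix: str):
    """Load the generated conformations as an mdtraj.Trajectory."""
    import mdtraj  # imported lazily; only needed when files are actually loaded

    traj_path, top_path = output_paths(prefix)
    # The PDB file provides the topology needed to parse the DCD file.
    return mdtraj.load(str(traj_path), top=str(top_path))
```

For the example command above, load_ensemble("protein").n_frames would then give the number of conformations stored in the trajectory.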
In case you want to use the ATLAS-based aSAMc model, the command is:
python scripts/generate_ensemble.py -c config/atlas_model.yaml -i examples/input/6h49_A.pdb -o protein -n 250 -b 8 -d cuda
Note that this model is not conditioned on temperature, therefore any temperature input will be ignored.
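When running the script for many proteins, it can be convenient to assemble the command line programmatically. The sketch below only uses the flags described above (-c, -i, -o, -n, -b, -T, -d); the function names build_command and run are hypothetical, and the default values are illustrative choices rather than defaults of the script.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(config: str, input_pdb: str, output: str, n_confs: int,
                  batch_size: int = 8, temperature: Optional[float] = None,
                  device: str = "cuda") -> List[str]:
    """Build the argument list for scripts/generate_ensemble.py."""
    cmd = [sys.executable, "scripts/generate_ensemble.py",
           "-c", config, "-i", input_pdb, "-o", output,
           "-n", str(n_confs), "-b", str(batch_size), "-d", device]
    # -T is only added when a temperature is given (the ATLAS-based
    # model is not conditioned on temperature).
    if temperature is not None:
        cmd += ["-T", str(temperature)]
    return cmd


def run(cmd: List[str]) -> None:
    """Run the command from the repository root, raising on failure."""
    subprocess.run(cmd, check=True)
```

For example, build_command("config/atlas_model.yaml", "examples/input/6h49_A.pdb", "protein", 250) reproduces the ATLAS command shown above, and passing temperature=320 with the mdCATH configuration adds the -T flag.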
The first time you use the scripts/generate_ensemble.py script, it will automatically download the aSAM weights from a GitHub release. The files are fairly large (~ 1 GB each for the ATLAS and mdCATH models), so make sure you have enough storage space. By default, the files are placed in the $HOME/.sam2/weights directory, which is created automatically. If you want to change the download location, set the SAM_WEIGHTS_PATH environment variable before running the script.
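If the script is launched from Python, the weights location can be set for the child process only, without changing the shell environment. This is a minimal sketch: SAM_WEIGHTS_PATH and the default directory come from the text above, while the helper names default_weights_dir and env_with_weights_dir are hypothetical.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def default_weights_dir() -> Path:
    """Default download location described above: $HOME/.sam2/weights."""
    return Path.home() / ".sam2" / "weights"


def env_with_weights_dir(weights_dir: str,
                         base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with SAM_WEIGHTS_PATH set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SAM_WEIGHTS_PATH"] = str(weights_dir)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when invoking scripts/generate_ensemble.py.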
📝 Input PDB structure: The only required input of this aSAM version is a PDB file. You can only provide PDB files with the following characteristics:
- Single chain PDB file.
- Single model PDB file (e.g.: no multi-model NMR input).
- No missing heavy atoms (e.g.: no missing residues, chain breaks).
- Only atoms from standard amino acids (e.g.: no nucleic acids, ligands or modified residues).
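Some of the requirements above can be checked before running aSAM. The sketch below is a hypothetical pre-check, not part of the sam package: it reads the fixed-column PDB records and flags multiple models, multiple chains, HETATM records and non-standard residue names. It does not detect missing heavy atoms or chain breaks, and it assumes the 20 standard three-letter residue names, so protonation-variant names used by some MD packages (for example HID or HSD) would be flagged even if aSAM were to accept them.

```python
from typing import List

STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def check_pdb_text(pdb_text: str) -> List[str]:
    """Return a list of detected problems; an empty list means none were found."""
    n_models = 0
    chains = set()
    bad_residues = set()
    has_hetatm = False
    for line in pdb_text.splitlines():
        record = line[:6].strip()  # record name: columns 1-6
        if record == "MODEL":
            n_models += 1
        elif record == "HETATM":
            has_hetatm = True
        elif record == "ATOM":
            chains.add(line[21:22])        # chain ID: column 22
            resname = line[17:20].strip()  # residue name: columns 18-20
            if resname not in STANDARD_RESIDUES:
                bad_residues.add(resname)

    problems = []
    if n_models > 1:
        problems.append(f"found {n_models} models, expected a single model")
    if len(chains) != 1:
        problems.append(f"found {len(chains)} chains, expected exactly one")
    if has_hetatm:
        problems.append("found HETATM records (ligands, waters or modified residues)")
    if bad_residues:
        problems.append("non-standard residue names: " + ", ".join(sorted(bad_residues)))
    return problems
```

A typical use would be check_pdb_text(open("examples/input/6h49_A.pdb").read()), printing any returned messages before launching the sampling script.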
📝 Temperature range: The training set temperatures of the mdCATH-based aSAMt range from 320 to 450 K. We tested the model at temperatures from 250 to 710 K. Temperatures outside the tested range might produce poor results.
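The two ranges above can be encoded as a simple guard that runs before sampling. This is an illustrative sketch only: the function name classify_temperature and the three labels are hypothetical, and the numeric bounds are the ones quoted in the note.

```python
TRAIN_RANGE_K = (320.0, 450.0)   # training-set temperatures of aSAMt
TESTED_RANGE_K = (250.0, 710.0)  # range over which the model was tested


def classify_temperature(temp_k: float) -> str:
    """Label a temperature as "training", "tested" or "untested"."""
    if TRAIN_RANGE_K[0] <= temp_k <= TRAIN_RANGE_K[1]:
        return "training"
    if TESTED_RANGE_K[0] <= temp_k <= TESTED_RANGE_K[1]:
        return "tested"
    return "untested"
```

A caller could, for instance, print a warning when the label is untested and proceed normally otherwise.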
The training/validation/test splits that we used for the ATLAS and mdCATH datasets are available in data/splits.
Links for the input PDB files of all systems analyzed in the aSAM article are available at: data/input/README.md.
This repository also contains the code for computing the scores used in the aSAM article to analyze and compare ensembles.
To use these scores, you need some input trajectory and PDB files. They will be parsed in the script via mdtraj and used to compute these scores. If you want to use some example files (with some expected output from aSAM), you can download them using these commands when you are in the root of this repository:
# Download some example files from our repository.
wget https://github.com/giacomo-janson/sam2/releases/download/data-1.0/ensemble_examples.zip
# Unzip.
unzip ensemble_examples.zip -d examples
# Cleanup.
rm ensemble_examples.zip
The commands below assume that you are analyzing the ensembles downloaded in this way, but of course you can use your own files. The example files might be useful if you want to replicate some of the benchmarks that we carried out in our article.
To analyze a single ensemble and compute its folded state fraction (FSF), secondary structure element preservation (SSEP) and average initRMSD, use:
python scripts/ensemble_analysis.py --native_pdb examples/ensembles/4qbuA03.init.pdb --ensemble_top examples/ensembles/4qbuA03.sam.pdb --ensemble_traj examples/ensembles/4qbuA03.320.sam.xtc
To compare two ensembles of the same protein via the comparison scores that we used in the aSAM paper, use:
python scripts/ensemble_comparison.py --ref_top examples/ensembles/6h49_A.md.pdb --ref_traj examples/ensembles/6h49_A.md.xtc --hat_top examples/ensembles/6h49_A.sam.pdb --hat_traj examples/ensembles/6h49_A.sam.xtc --init_pdb examples/input/6h49_A.pdb
Note: the script will not compute WASCO scores. If you want to compute them, you need to manually use the WASCO package. To compute the WASCO scores reported in the aSAM paper, we used code based on the comparison_tool.ipynb notebook in that package.
The decoder of aSAM relies on the Structure Module of AlphaFold2. We used the OpenFold implementation of this module with minor modifications. OpenFold is licensed under the Apache License 2.0. The sam/openfold directory is a direct copy from the OpenFold repository, with some small edits.
If you are looking for idpSAM, a predecessor of aSAM trained on Cα atoms of implicit solvent simulations of intrinsically disordered peptides, please refer to the idpSAM repository.
- 7/3/2025: initial release.
Janson G., Jussupow A. & Feig M. Deep generative modeling of temperature-dependent structural ensembles of proteins. Preprint at https://www.biorxiv.org/content/10.1101/2025.03.09.642148v1 (2025).
Feig Lab at Michigan State University, Department of Biochemistry and Molecular Biology
Michael Feig, [email protected]
Giacomo Janson, [email protected]