- MassiveFold: parallelize protein structure prediction
- Installation
- Uninstallation
- Usage
- Ligand screening with MassiveFold
- massivefold_plots: output representation
- Troubleshooting
- Citation
- Authors
MassiveFold is a tool that massively expands the sampling of structure predictions by improving the computation of AlphaFold-based predictions.
It optimizes the parallelization of the structure inference by splitting the computation: alignments are run on CPU, batches of structure predictions are run automatically on GPU, and the results are gathered in one global output directory, with a global ranking and a variety of plots.
MassiveFold uses AFmassive, ColabFold or AlphaFold3 as its inference engine; AFmassive is an updated version of Björn Wallner's AFsample that offers additional diversity parameters for massive sampling.
MassiveFold's design (see the schematic below) is optimized for GPU cluster usage. It allows fast computation for massive sampling by automatically splitting a large run of numerous predictions into several jobs. Each of these individual jobs is computed on a single GPU node and their results are then gathered into a single output, with each prediction ranked at a global level instead of at the level of each individual job.
This automatic splitting is also convenient for massive sampling on a single GPU server, to manage job priorities.
MassiveFold is only available with the SLURM workload manager (Simple Linux Utility for Resource Management) as it relies heavily on its features (job arrays, job dependencies, etc.).
A run is composed of three steps:
- alignment: on CPU, sequence alignment is the first step (it can be skipped if the alignments are already computed)
- structure prediction: on GPU, the structure predictions follow the massive sampling principle. The total number of predictions is divided into smaller batches and each batch is distributed to a single GPU. These jobs wait for the alignment job to be over if the alignments are not provided by the user.
- post_treatment: on CPU, this step finishes the job by gathering the outputs of all the batches and produces plots with the plots module to visualize the run's performance. It is executed only once all the structure predictions are over.
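As an illustration of how these three steps chain together, here is a simplified sketch of the SLURM dependency logic that MassiveFold sets up automatically (the jobfile names are those described in the installation section; you never submit these by hand, the sketch only mirrors the job array and job dependency features mentioned above):
# simplified sketch: alignment first, then the prediction job array, then the post treatment
ALIGN_ID=$(sbatch --parsable alignment.slurm)
ARRAY_ID=$(sbatch --parsable --dependency=afterok:${ALIGN_ID} jobarray.slurm)
sbatch --dependency=afterok:${ARRAY_ID} post_treatment.slurm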
MassiveFold was developed to run massive sampling with AFmassive, ColabFold and AlphaFold3, and relies on them for its installation.
- Retrieve MassiveFold
# clone MassiveFold's repository
git clone https://github.com/GBLille/MassiveFold.git
For AFmassive runs, two additional installation steps are required to use MassiveFold:
- Download sequence databases
- Retrieve the neural network (NN) models parameters
For ColabFold runs, two additional installation steps are required to use MassiveFold:
- Download sequence databases
- Retrieve the neural network (NN) models parameters and move them to a 'params' folder in the sequence databases folder (same parameters as AFmassive); see the sketch below
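For instance, fetching the AlphaFold2 model parameters into the 'params' folder of the ColabFold database directory could look like the following sketch, where <CF_DB_PATH> is a placeholder for your database path and the archive date is an assumption (check the AlphaFold repository for the current parameters archive):
mkdir -p <CF_DB_PATH>/params
wget -P <CF_DB_PATH>/params https://storage.googleapis.com/alphafold/alphafold_params_2022-12-06.tar
tar -xf <CF_DB_PATH>/params/alphafold_params_2022-12-06.tar -C <CF_DB_PATH>/params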
For AlphaFold3 runs, two additional installation steps are required to use MassiveFold:
- Download sequence databases
- Retrieve the neural network (NN) models parameters
- Install MassiveFold
We use an installation based on conda. The install.sh script we provide installs the conda environments using the
environment.yml, mf-afmassive.yml, mf-colabfold.yml and mf-alphafold3.yml files. The first one is created for MassiveFold,
the second one for AFmassive, the third for ColabFold and the last one for AlphaFold3. The script also creates the file
architecture and sets the paths according to this architecture in the AFmassive_params.json and/or ColabFold_params.json
and/or AlphaFold3_params.json parameters files.
Help with:
cd MassiveFold
./install.sh -h
Installation with:
./install.sh [--only-envs] || --alphafold-db str --alphafold3-db str --colabfold-db str [--no-env]
./install.sh -h for more details
Options:
--alphafold-db <str>: path to AlphaFold2 database
--alphafold3-db <str>: path to AlphaFold3 database
--colabfold-db <str>: path to ColabFold database
--no-env: do not install the environments, only sets up the files and parameters.
At least one of --alphafold-db or --colabfold-db is required with this option.
--only-envs: only install the environments (other arguments are not used)
This file tree displays the file architecture after running ./install.sh.
MassiveFold
├── install.sh
├── ...
├── examples
├── massivefold
└── massivefold_runs
    ├── AFmassive_params.json
    ├── ColabFold_params.json
    ├── AlphaFold3_params.json
    ├── headers/
    │   ├── example_header_alignment_jeanzay.slurm
    │   ├── example_header_jobarray_jeanzay.slurm
    │   └── example_header_post_treatment_jeanzay.slurm
    ├── input/
    ├── log/
    ├── output/
    └── run_massivefold.sh
The massivefold_runs directory is created and contains:
- AFmassive_params.json to set the run parameters for AFmassive,
- ColabFold_params.json to set the run parameters for ColabFold,
- AlphaFold3_params.json to set the run parameters for AlphaFold3,
- the headers directory, containing the headers that must be created to use MassiveFold; examples are given for the Jean Zay national French CNRS cluster (ready to use, see the installation on Jean Zay to run MassiveFold directly on Jean Zay),
- input, which contains the FASTA sequences,
- log, with the logs of the MassiveFold runs (for debugging purposes),
- output, which contains the predictions,
- run_massivefold.sh, the script to run MassiveFold.
On a GPU cluster:
- the administrator only needs to install the environments:
cd MassiveFold
./install.sh --only-envs
- the user only needs to install the remaining files:
cd MassiveFold
./install.sh --no-env --alphafold-db <AF_DB_PATH> --colabfold-db <CF_DB_PATH> --alphafold3-db <AF3_DB_PATH>
- Create header files
Refer to Jobfile's header for this installation step.
To run MassiveFold in parallel on your cluster/server, it is required to build custom jobfile headers for each step.
There are three of them and they should be named {step}.slurm (alignment.slurm, jobarray.slurm and
post_treatment.slurm). The headers contain the parameters to give to SLURM for running the jobs (#SBATCH parameters).
They have to be added to the MassiveFold/massivefold_runs/headers/ directory. Depending on your installation it can be
another path; this path has to be set in the AFmassive_params.json and/or ColabFold_params.json and/or
AlphaFold3_params.json as the jobfile_headers_dir parameter.
Headers for the Jean Zay cluster are provided as examples to follow (named example_header_<step>_jeanzay.slurm); to use
them, rename each one following the naming convention mentioned above.
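For instance, assuming the default headers location, the renaming can be done in one go:
cd MassiveFold/massivefold_runs/headers
for step in alignment jobarray post_treatment; do
  cp example_header_${step}_jeanzay.slurm ${step}.slurm
done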
- Set custom parameters
Each cluster has its own specifications for parameterizing job files. If you need more flexibility, you can add custom
parameters to your headers and declare them in the AFmassive_params.json and/or ColabFold_params.json and/or
AlphaFold3_params.json file, so that their values can be changed dynamically in the json file.
To illustrate these "special needs", here is an example of parameters that can be used on the French national Jean Zay cluster to specify GPU type, time limits or the project on which the hours are used:
Go to AFmassive_params.json/ColabFold_params.json/AlphaFold3_params.json location and modify it:
cd MassiveFold/massivefold_runs
"custom_params":
{
"jeanzay_gpu": "v100",
"jeanzay_project": "<project>",
"jeanzay_account": "<project>@v100",
"jeanzay_gpu_with_memory": "v100-32g",
"jeanzay_alignment_time": "05:00:00",
"jeanzay_jobarray_time": "15:00:00"
}
Then specify them in the jobfile headers (such as here for MassiveFold/massivefold_runs/headers/jobarray.slurm):
#SBATCH --account=$jeanzay_account
#SBATCH -C $jeanzay_gpu_with_memory
#SBATCH --time=$jeanzay_jobarray_time
The jobfiles for each step are built by combining the jobfile header that you have to create in MassiveFold/massivefold_runs/headers/ with the jobfile body in massivefold/parallelization/templates/.
Only the headers have to be adapted to your computation infrastructure. They contain the parameters to give to
SLURM for running the job (#SBATCH parameters).
Each of the three headers (alignment, jobarray and post treatment) must be located in the headers directory
(see File architecture section).
Their names should be identical to:
- alignment.slurm
- jobarray.slurm
- post_treatment.slurm
The templates work with the parameters provided in AFmassive_params.json and/or ColabFold_params.json and/or
AlphaFold3_params.json files, given as a parameter to the run_massivefold.sh script.
These parameters are substituted in the template job files thanks to the python library string.Template.
Refer to How to add a parameter for parameters substitution.
- Requirement: the jobarray jobfile header (massivefold_runs/headers/jobarray.slurm) must state that it is a job
array and give the number of tasks in it. The task number argument is substituted with the
$substitute_batch_number parameter.
It should be expressed as:
#SBATCH --array=0-$substitute_batch_number
For example, if there are 45 batches, with 1 batch per task of the job array, the substituted expression will be:
#SBATCH --array=0-44
- Also add the following lines to the headers; they are necessary to store MassiveFold's logs:
In alignment.slurm:
#SBATCH --error=${logs_dir}/${sequence_name}/${run_name}/alignment.log
#SBATCH --output=${logs_dir}/${sequence_name}/${run_name}/alignment.log
In jobarray.slurm:
#SBATCH --error=${logs_dir}/${sequence_name}/${run_name}/jobarray_%a.log
#SBATCH --output=${logs_dir}/${sequence_name}/${run_name}/jobarray_%a.log
In post_treatment.slurm:
#SBATCH --output=${logs_dir}/${sequence_name}/${run_name}/post_treatment.log
#SBATCH --error=${logs_dir}/${sequence_name}/${run_name}/post_treatment.log
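Putting the pieces above together, a minimal alignment.slurm header could look like the following sketch; the partition name and CPU count are placeholders for your own cluster, and $jeanzay_alignment_time stands for whichever time parameter you declared in your custom_params section:
#SBATCH --job-name=alignment
#SBATCH --error=${logs_dir}/${sequence_name}/${run_name}/alignment.log
#SBATCH --output=${logs_dir}/${sequence_name}/${run_name}/alignment.log
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=<cpu_partition>
#SBATCH --time=$jeanzay_alignment_time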
We provide headers for the Jean Zay French CNRS national GPU cluster (IDRIS) that can also be used as examples for your own infrastructure.
- Add $new_parameter or ${new_parameter} in the template's header where you want its value to be set, and in the
"custom_params" section of AFmassive_params.json and/or ColabFold_params.json and/or AlphaFold3_params.json, where its
value can be specified and modified conveniently for each run.
Example in the json parameters file for Jean Zay headers:
"custom_params":
{
"jeanzay_account": "project@v100",
"jeanzay_gpu_with_memory": "v100-32g",
"jeanzay_jobarray_time": "10:00:00"
}
where "project" is your 3-letter project with allocated hours on Jean Zay.
- These parameters will be substituted in the header where the parameter keys are located:
#SBATCH --account=$jeanzay_account
#SBATCH --error=${logs_dir}/${sequence_name}/${run_name}/jobarray_%a.log
#SBATCH --output=${logs_dir}/${sequence_name}/${run_name}/jobarray_%a.log
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --hint=nomultithread
#SBATCH --gpus-per-node=1
#SBATCH --array=0-$substitute_batch_number
#SBATCH --time=$jeanzay_jobarray_time
##SBATCH --qos=qos_gpu-t4 # Uncomment for job requiring more than 20h (max 16 GPUs)
#SBATCH -C $jeanzay_gpu_with_memory # GPU type+memory
- Never use a single $ symbol for anything other than parameter/value substitution from the json file.
To use $ inside the template files (bash variables or other uses), use $$ instead as an escape, following the string.Template documentation.
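For example, if a header or template needs both a substituted json parameter and a literal bash variable, the bash dollar has to be doubled so that string.Template leaves it untouched (a small illustration):
#SBATCH --account=$jeanzay_account   # substituted with the value from the json "custom_params" section
echo "Running on host $$HOSTNAME"    # written with $$ so that the generated jobfile contains $HOSTNAME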
To use it on Jean Zay, the only installation steps are:
git clone https://github.com/GBLille/MassiveFold.git
./install.sh
To use AlphaFold3, copy your weights to the ~/af3_datadir directory and run the following command:
ln -s $DSDIR/Alphafold3/* ~/af3_datadir/
The same file architecture is built; follow the usage section to use MassiveFold.
And specify the project you want to use in the AFmassive_params.json or ColabFold_params.json or AlphaFold3_params.json,
replacing the <project> value by the 3-letter project name.
N.B.: on Jean Zay, AlphaFold3 runs only on A100 and H100 GPUs.
We recommend 5 TB of fast storage to host the sequence databases for AFmassive, ColabFold and AlphaFold3. The RAM requirements depend on the length of the sequence(s), but 128 GB should work in most cases. A GPU with at least 16 GB of memory is also recommended, knowing that more memory allows larger systems to be modeled.
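To check the memory of the GPUs available on a node, you can for instance run (assuming NVIDIA GPUs with the nvidia-smi tool installed):
nvidia-smi --query-gpu=name,memory.total --format=csv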
To uninstall MassiveFold, remove the four conda environments (massivefold, mf-afmassive-1.1.5, mf-colabfold-1.5.5 and
mf-alphafold-3.0.1) and remove the MassiveFold folder you cloned. Make sure you first copy elsewhere all the files and
folders you want to keep from the output and log directories.
Edit the AFmassive_params.json and/or ColabFold_params.json and/or AlphaFold3_params.json parameters file
(see file architecture).
First set the parameters of your run in the
AFM_run section of the AFmassive_params.json file, for instance:
"AFM_run":
{
"model_preset": "multimer",
"max_recycles": "20",
"templates": "true",
"dropout": "false",
"dropout_structure_module": "false",
"dropout_rates_filename": "",
"stop_recycling_below": "0",
"max_template_date": "2024-08-31",
"min_score": "0",
"max_score": "1",
"db_preset": "full_dbs",
"early_stop_tolerance": "0.5",
"bfd_max_hits": "100000",
"mgnify_max_hits": "501",
"uniprot_max_hits": "50000",
"uniref_max_hits": "10000"
}
or in the ColabFold_params.json file, for instance:
"CF_run":
{
"model_preset": "multimer",
"pair_strategy": "greedy",
"use_dropout": "false",
"num_recycle": "20",
"recycle_early_stop_tolerance": "0.5",
"stop_at_score": "100",
"disable_cluster_profile": "false"
}
or in the AlphaFold3_params.json file, for instance:
"AF3_run":
{
"fasta_chains": ["protein","protein"],
"ligand": [
{"ccdCodes": [""], "smiles": ""}
],
"PTMs": [
[{"type": "", "sequence": "", "positions": []}]
],
"model_preset": "multimer",
"max_template_date": "2024-11-28",
"num_diffusion_samples": "5",
"unpairedMsa": "true",
"pairedMsa": "true",
"templates": "true"
}
N.B.: "fasta_chains" has to be filled in before starting a run because no default value is provided. It specifies the
type of each chain of the fasta file among "protein", "dna" and "rna". In this example, the two chains in the fasta
file are proteins.
Then, if necessary, you can set the parameters of the custom_params section as well as those of the plots section.
Activate the conda environment, then launch MassiveFold.
conda activate massivefold
./run_massivefold.sh -s <SEQUENCE_PATH> -r <RUN_NAME> -p <NUMBER_OF_PREDICTIONS_PER_MODEL> -f <JSON_PARAMETERS_FILE> -t <TOOL>
N.B.: on the Jean Zay cluster, simply load the massivefold module. To be able to run on H100 or A100, uncomment the
corresponding last lines of the jobarray.slurm header. Example for H100:
module purge
module load arch/h100
Example for AFmassive:
./run_massivefold.sh -s input/H1140.fasta -r afm_default_run -p 5 -f AFmassive_params.json -t AFmassive
Example for ColabFold:
./run_massivefold.sh -s input/H1140.fasta -r cf_default_run -p 5 -f ColabFold_params.json -t ColabFold
Example for AlphaFold3:
./run_massivefold.sh -s input/H1140.fasta -r af3_default_run -p 5 -f AlphaFold3_params.json -t AlphaFold3
For more help and the list of required and facultative parameters, run:
./run_massivefold.sh -h
Here is the help message given by this command:
Usage: ./run_massivefold.sh -s str -r str -p int -f str -t str [ -b int | [[-C str | -c] [-w int]] ] [-m str] [-n str] [-a] [-o]
./run_massivefold.sh -h for more details
Required arguments:
-s| --sequence: path of the sequence(s) to infer, should be a 'fasta' file
-r| --run: name chosen for the run to organize in outputs.
-p| --predictions_per_model: number of predictions computed for each neural network model.
If used with -t AlphaFold3, -p is the number of seeds used. Each seed will have 5 samples predicted.
In total, with -p n, you will have 5n predictions computed.
-f| --parameters: json file's path containing the parameters used for this run.
-t| --tool: (default: 'AFmassive') Use either AFmassive, AlphaFold3 or ColabFold in structure prediction for MassiveFold
Facultative arguments:
-b| --batch_size: (default: 25) number of predictions per batch. If -b > -p, batch size will be set to -p value.
-C| --calibration_from: path of a previous run to calibrate the batch size from (see --calibrate).
-w| --wall_time: (default: 20) total time available for calibration computations, unit is hours.
-m| --msas_precomputed: path to directory that contains computed msas.
-n| --top_n_models: uses the n neural network models with best ranking confidence from this run's path.
-j| --jobid: jobid of an alignment job to wait for inference, skips the alignments.
Facultative options:
-o| --only_msas: only compute alignments, the first step of MassiveFold
-c| --calibrate: calibrate --batch_size value. Searches from the previous runs for the same 'fasta' path given
in --sequence and uses the longest prediction time found to compute the maximal number of predictions per batch.
This maximal number depends on the total time given by --wall_time.
-a| --recompute_msas: purges previous alignment step and recomputes msas.
The script launches MassiveFold with the same parameters introduced above but, instead of running AFmassive, ColabFold or AlphaFold3 a single time, it divides the work into multiple batches.
For the following examples, we assume that --model_preset=multimer, as it covers the majority of cases where MassiveFold is run in parallel.
However, --model_preset=monomer_ptm works too and needs to be adapted accordingly, at least for the models to use (if the parameter is not left at its default).
You can decide how the run will be divided by assigning run_massivefold.sh parameters e.g.:
./run_massivefold.sh -s ./input/H1140.fasta -r 1005_preds -p 67 -b 25 -f AFmassive_params.json -t AFmassive
The predictions are computed individually for each neural network (NN) model; -p or --predictions_per_model
allows you to specify the number of predictions desired for each chosen model.
These --predictions_per_model are then divided into batches with a fixed -b or --batch_size to optimize the
run in parallel as each batch can be computed on a different GPU, if available.
The last batch of each NN model is generally smaller than the others to match the number of predictions fixed by
--predictions_per_model.
N.B.: one benefit of using run_massivefold.sh on a single server with a single GPU is the ability to run massive
sampling for a structure at low priority, allowing other jobs with higher priority to run in between the batches.
For example, with -b 25 and -p 67 the predictions are divided into the following batches (separated runs), which are repeated for each NN model:
- First batch: --start_prediction=0 and --end_prediction=24
- Second batch: --start_prediction=25 and --end_prediction=49
- Third batch: --start_prediction=50 and --end_prediction=66
By default (if --models_to_use is not assigned), all NN models are used: with --model_preset=multimer,
15 models in total (5 neural network models × 3 AlphaFold2 multimer versions); with --model_preset=monomer_ptm, 5 models.
The number of predictions per model can be adjusted: here, with 67 per model and 15 models, it amounts to 1005 predictions in total, divided into 45 batches that can be run in parallel on a GPU cluster infrastructure.
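The arithmetic behind this example can be checked with plain shell arithmetic (nothing MassiveFold-specific):
# 15 NN models x 67 predictions each; ceil(67/25) = 3 batches per model, 15 x 3 = 45 batches
echo "$((15 * 67)) predictions, $((15 * ((67 + 24) / 25))) batches"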
The batch size can also be auto-calibrated with the -c or -C parameters if at least one basic run has already been
performed. The -c parameter will automatically search the output folder that corresponds to the input sequence for
the longest prediction duration. These options have to be coupled with the -w walltime parameter (it is advised to
set this walltime value to that of the job). For instance:
./run_massivefold.sh -s ./input/H1140.fasta -r 1005_preds -p 67 -f AFmassive_params.json -c -w 10 -t AFmassive
In addition to the parameters displayed with the -h option, the json parameters file set with -f or --parameters
should be organized like the AFmassive_params.json or ColabFold_params.json or AlphaFold3_params.json file.
Each section of AFmassive_params.json or ColabFold_params.json or AlphaFold3_params.json is used for a different
purpose.
The massivefold section gathers the parameters of the whole run.
"massivefold":
{
"run_massivefold": "run_AFmassive.py",
"run_massivefold_plots": "../massivefold/massivefold_plots.py",
"data_dir": "$DSDIR/Alphafold-2024-04",
"uniref_database": "",
"jobfile_headers_dir": "./headers",
"jobfile_templates_dir": "../massivefold/parallelization/templates",
"output_dir": "./output",
"logs_dir": "./log",
"input_dir": "./input",
"scripts_dir": "../massivefold/parallelization",
"models_to_use": "",
"pkl_format": "full"
}
The paths in this section are filled in by install.sh but can be changed here if necessary.
Headers (jobfile_headers_dir) are specified to set up the run, providing the parameters required to
run the jobs on your cluster/server.
Build your own according to the Jobfile's header building section.
models_to_use is the list of NN models to use (for AFmassive and ColabFold). To select which NN models are used,
separate them with a comma, e.g.: "model_3_multimer_v1,model_3_multimer_v3"; by default, all of them are used.
pkl_format: how to manage pickle files (for AFmassive and ColabFold only)
- ‘full’ to keep the pickle files generated by the inference engine,
- ‘light’ to reduce its size by selecting main components, which are: number of recycles, PAE values, max PAE,
plddt scores, ptm scores, iptm scores and ranking confidence values (stored in ./light_pkl directory)
- ‘none’ to remove them
uniref_database is the parameter to fix the issue described in the Troubleshooting section (for AFmassive).
You can specifically change the uniref30 database path here. If not specified ("uniref_database": ""), the
default uniref30 database path is used (same as the AlphaFold2 configuration).
- The custom_params section contains the personalized parameters that you want to add for your own cluster. For instance, for the Jean Zay GPU cluster:
"custom_params":
{
"jeanzay_project": "<project>",
"jeanzay_account": "<project>@v100",
"jeanzay_gpu_with_memory": "v100-32g",
"jeanzay_alignment_time": "10:00:00",
"jeanzay_jobarray_time": "10:00:00"
}
As explained in How to add a parameter, these variables are substituted by their value when the jobfiles are created.
- For AFmassive, the AFM_run section gathers all the parameters used by MassiveFold for the run
(see AFmassive parameters
section). All parameters except --keep_pkl, --models_to_relax, --use_precomputed_msas, --alignment_only,
--start_prediction, --end_prediction, --fasta_path and --output_dir are exposed in this section.
You can adapt the parameter values according to your needs.
The non-exposed parameters mentioned above are set internally by MassiveFold's pipeline or in the massivefold section (models_to_use and pkl_format).
"AFM_run":
{
"model_preset": "multimer",
"max_recycles": "20",
"templates": "true",
"dropout": "false",
"dropout_structure_module": "false",
"dropout_rates_filename": "",
"stop_recycling_below": "0",
"max_template_date": "2024-08-31",
"min_score": "0",
"max_score": "1",
"db_preset": "full_dbs",
"early_stop_tolerance": "0.5",
"bfd_max_hits": "100000",
"mgnify_max_hits": "501",
"uniprot_max_hits": "50000",
"uniref_max_hits": "10000"
}
Lastly, the plots section is used for the MassiveFold plotting module. Only DM_plddt_PAE and CF_PAEs plots are
available for AlphaFold3. The recycles plot is not available for AFmassive monomers.
"plots":
{
"MF_plots_top_n_predictions": "10",
"MF_plots_chosen_plots": "coverage,DM_plddt_PAE,CF_PAEs,score_distribution,recycles"
}
In MassiveFold, ligands and post-translational modifications are configured in the AlphaFold3_params.json file.
For ligands, the "ligand" section has to be filled in with a CCD code or a SMILES string. In the case of several ligands, use several
entries in the JSON, as in the following example.
For post-translational modifications, the "modifications" section has to be filled in.
The modifications currently available are:
| Name | Chain type | Target residue | Target base |
|---|---|---|---|
| glycosylation | protein | N, S, T, K | null |
| phosphorylation | protein | S, T | null |
| methylation | protein & dna | R | C |
| hydroxylation | protein | P | null |
| acetylation | protein | K | null |
| cyclization | protein | E | null |
The "modifications" section contains as many entries (list) as the number of chains in the fasta file. The order of
these chains is the same as in the fasta file and in the "fasta_chains" section.
For each modification, these two keys are required:
- "type" is the name of the modification (e.g.: glycosylation)
- "positions" is a list of the positions on the (fasta) sequence where the modifications have to be linked
If the modification is a glycosylation, another key is needed:
- "sequence" is the sequence of the glycan in IUPAC code
The following example shows 2 protein chains (the example H1140) with 3 ligands, a total of three glycosylations and one phosphorylation. The first protein chain is glycosylated once (residue 36) and phosphorylated once (residue 20), the second is glycosylated twice (same glycan on residues 74 and 84).
"AF3_run": {
"fasta_chains": ["protein","protein"],
"ligand": [
{"ccdCodes": ["NAG"], "smiles": ""},
{"ccdCodes": ["KGM"], "smiles": ""},
{"ccdCodes": [""], "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"}
],
"modifications": [
[
{
"type": "glycosylation",
"sequence": "Gal(1-4)GlcNAc(1-2)Man(1-3)[Gal(1-4)GlcNAc(1-2)[Gal(1-4)GlcNAc(1-6)]Man(1-6)]Man(1-4)GlcNAc(1-4)[Fuc(1-6)]GlcNAc",
"positions": [36]
},
{
"type": "phosphorylation",
"sequence": "",
"positions": [20]
}
],
[
{
"type": "glycosylation",
"sequence": "Gal(1-4)GlcNAc(1-2)Man(1-3)[Gal(1-4)GlcNAc(1-2)[Gal(1-4)GlcNAc(1-6)]Man(1-6)]Man(1-4)GlcNAc(1-4)[Fuc(1-6)]GlcNAc",
"positions": [74,84]
}
]
],
"model_preset": "multimer",
"max_template_date": "2024-11-28",
"num_diffusion_samples": "5",
"unpairedMsa": "true",
"pairedMsa": "true",
"templates": "true"
}
Because the relaxation takes time and resources to compute, and because the MassiveFold process splits the predictions into
many batches, the "use_gpu_relax" and "models_to_relax" parameters are set to "false" and "none" respectively. Indeed,
if the relaxation were activated during the process, it would be run per batch, before the final ranking, resulting in
relaxed structures that wouldn't necessarily be the best predictions. Instead, we recommend using the colabfold_relax
program, provided in the mf-colabfold conda environment and developed by the ColabFold team, once all the predicted
structures are produced and ranked. It allows relaxing only selected PDB structures.
For help, type:
colabfold_relax -h
We provide a gather_runs.py script in the massivefold folder that collates the results of several runs. It
gathers all the results and ranks them together. Run python3 gather_runs.py -h for help.
We also provide an extract_scores.py script that extracts the scores from the pickle files and creates rankings
(notably useful for interrupted runs). Run python3 extract_scores.py -h for help.
To launch a screening round, run:
./run_massivefold_screening.sh -s <receptor_fasta_file> -l <ligand_list_csv> -f <AlphaFold3_params.json>
Run -h for help, as usual:
./run_massivefold_screening.sh -h
The ligand list csv has 3 columns: "id", "smiles", and "ccdCode".
Each row is a ligand to use for the screening round. "id" designates the name of the ligand; it can simply be the number of the ligand in the list.
For the ligand itself, use either "smiles" or "ccdCode" but not both, respectively in the SMILES format or in the Chemical Component Dictionary (CCD) code format found at https://www.ebi.ac.uk/pdbe-srv/pdbechem/.
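As an illustration, a minimal ligand list could look like this (the file name and ligand ids are arbitrary, and we assume a header row with the three column names; NAG and the aspirin SMILES are reused from the example above):
cat > ligands.csv << 'EOF'
id,smiles,ccdCode
lig1,,NAG
lig2,CC(=O)OC1=CC=CC=C1C(=O)O,
EOF
It would then be passed to the screening script with -l ligands.csv.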
The predictions for each ligand will be located in a dedicated folder, itself located in an output folder
named after the csv file. All the results, notably the scores, can be gathered with the gather_runs.py
script that can be found in the massivefold folder (see the section above).
The MassiveFold plotting module can also be used on a MassiveFold output to visually evaluate its predictions.
Here is an example of a basic command you can run:
conda activate massivefold
massivefold_plots.py --input_path=<path_to_MF_output> --chosen_plots=DM_plddt_PAE
- --input_path: designates the MassiveFold output directory, which is also where the plots are stored, unless you want them in a separate directory (use --output_path for this purpose)
- --chosen_plots: plots you want to get. You can give a list of plot names separated by a comma (e.g.: --chosen_plots=coverage,DM_plddt_PAE,CF_PAEs).
Here is the list of available plots:
- DM_plddt_PAE: Deepmind's plots for predicted lddt per residue and predicted aligned error matrix
- CF_plddts: ColabFold's plot for predicted lddt per residue
- CF_PAEs: ColabFold's plot for predicted aligned error of the n best predictions set with --top_n_predictions
- coverage: ColabFold's plot for sequence alignment coverage
- score_distribution: produces 3 plots that summarize the score distribution at three levels
- distribution_comparison: ranking confidence distribution comparison between various MassiveFold outputs, typically
useful for runs with different sets of parameters on the same input sequence(s).
- recycles: ranking confidence during the recycle process (only for multimers and ColabFold monomers)
--top_n_predictions: (default 10) number of best predictions to take into account for plotting
--runs_to_compare: names of the runs you want to compare on their distribution; this argument is coupled with --chosen_plots=distribution_comparison
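For instance, a comparison of the score distributions of two runs on the same sequence might be launched as follows (a sketch: we assume here that the run names are given comma-separated, like the plot names for --chosen_plots):
massivefold_plots.py --input_path=<path_to_MF_output> --chosen_plots=distribution_comparison --runs_to_compare=<run_name_1>,<run_name_2>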
More help with
conda activate massivefold
massivefold_plots.py --help
While using AlphaFold2 or another AlphaFold2-based software like AFmassive, you can encounter a bug similar to this one during msas generation:
WARNING: maximum number of residues 32763 exceeded in sequence UniRef100_A0A5E4B6R9_consensus
or
WARNING: maximum number of residues 32763 exceeded in sequence UniRef100_UPI000F443DA9 titin n=1 Tax=Lagenorhynchus obliquidens TaxID=90247 RepID=UPI000F443DA9
It makes the msas unusable and causes any inference on the sequence to crash. This was referenced in a
GitHub issue and a fix was provided.
According to this fix, download an updated version of the uniref30 database. To apply this modification to MassiveFold,
set the uniref_database parameter in the AFmassive_params.json file to the updated database similarly to this:
"massivefold":
{
"run_massivefold": "run_AFmassive.py",
"run_massivefold_plots": "../massivefold/massivefold_plots.py",
"data_dir": "/path/to/databases/AlphaFold/",
"uniref_database": "/path/to/databases/AlphaFold/UniRef30_2023_02_hhsuite/UniRef30_2023_02",
"jobfile_headers_dir": "./headers",
"jobfile_templates_dir": "../massivefold/parallelization/templates",
"output_dir": "./output",
"logs_dir": "./log",
"input_dir": "./input",
"scripts_dir": "../massivefold/parallelization",
"models_to_use": "",
"pkl_format": "full"
}
MassiveFold can't run without SLURM. However, the mf-afmassive-1.1.5, mf-colabfold-1.5.5 and mf-alphafold-3.0.1
conda environments created at installation allow you to use AFmassive, ColabFold and AlphaFold3, respectively, without parallelization.
Their usage is detailed on their respective GitHub webpages.
MassiveFold currently uses the ColabFold 1.5.5 conda environment. It runs colabfold_search for the alignments and
colabfold_batch for structure inference. Unfortunately, in this version, colabfold_batch doesn't accept a
templates file as input, which means it would run the template search for every inference job, which is not optimal.
To use templates with ColabFold, you can directly load the mf-colabfold-1.5.5 environment and run
colabfold_batch with the appropriate parameters, the --save-all parameter being mandatory (see colabfold_batch -h).
With default parameters for alignments and template search, it will query the MSA server.
In case you would like to format the ColabFold output files to the MassiveFold format, you need to put the ColabFold
output files in a batch_0 subfolder following the MassiveFold output file architecture:
output
└── <sequence name>
    └── <run name>
        └── batch_0
Then, run the following scripts from the massivefold_runs folder, replacing the <sequence> and <run> generic names
with yours.
python3 ../massivefold/parallelization/unifier.py \
--to_convert output/<sequence>/<run>/batch_0/ \
--conversion output_singular \
--tool ColabFold
python3 ../massivefold/parallelization/organize_outputs.py \
--batches_path output/<sequence>/<run>/
To generate plots, you can then use the massivefold_plots.py script as described here.
If you use MassiveFold in your work, please cite:
Raouraoua N. et al. MassiveFold: unveiling AlphaFold’s hidden potential with optimized and parallelized massive
sampling. 2024. Nature Computational Science, DOI: 10.1038/s43588-024-00714-4,
https://www.nature.com/articles/s43588-024-00714-4
Nessim Raouraoua (UGSF - UMR 8576, France)
Claudio Mirabello (NBIS, Sweden)
Thibaut Véry (IDRIS, France)
Christophe Blanchet (IFB, France)
Björn Wallner (Linköping University, Sweden)
Marc F Lensink (UGSF - UMR8576, France)
Guillaume Brysbaert (UGSF - UMR 8576, France)
This work was carried out as part of Work Package 4 of the MUDIS4LS project led by the French Bioinformatics Institute (IFB). It was initiated at the IDRIS Open Hackathon, part of the Open Hackathons program. The authors would like to acknowledge OpenACC-Standard.org for their support.