# FLOWR.root: A flow matching based foundation model for joint multi-purpose structure-aware 3D ligand generation and affinity prediction
This is the research repository for FLOWR.root.
## Requirements

- **GPU:** CUDA-compatible GPU with at least 40 GB VRAM recommended for inference
- **Installation time:** roughly 5 minutes on a normal computer
- **Package manager:** mamba

Install mamba via Miniforge:

```bash
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh
bash Miniforge3-$(uname)-$(uname -m).sh
```
## Installation

### Create the Environment

Install the required environment using mamba:

```bash
mamba env create -f environment.yml
```

If you are on a MacBook (tested on Apple M3 Max), install via:

```bash
mamba env create -f environment_mac.yml
```
### Activate the Environment

```bash
conda activate flowr_root
```
### Set PYTHONPATH

Ensure the repository directory is in your Python path:

```bash
export PYTHONPATH="$PWD"
```
A Jupyter notebook tutorial is provided at `examples/examples.ipynb`, alongside a few protein-ligand complexes to play around with! You can also run it on your MacBook: install the respective environment (see above) and you are good to go.
## Data and checkpoints

We provide all datasets in PDB and SDF format, as well as a fully trained FLOWR.root model checkpoint. Download the datasets and the FLOWR.root checkpoint here: Google Drive.

For training and generation, we provide basic bash and SLURM scripts in the `scripts/` directory. These scripts are intended to be modified and adjusted according to your computational resources and experimental needs.
## Structure-based generation

If you provide a protein PDB/CIF file, you also need to provide a ligand file (SDF/MOL/PDB) to cut out the pocket (default: 7 Å cutoff; modify if needed). We recommend using (Schrödinger-)prepared complexes, with both protein and ligand protonated, for best results.

Note: to run conditional generation, you need to provide a ligand file as reference. Crucially, there are two different modes, "global" and "local":

- **Global:** for scaffold hopping or elaboration (`--func_group_inpainting`, `--scaffold_inpainting`), interaction-conditional (`--interaction_inpainting`), or core-conditional (`--core_inpainting`) generation, simply specify the respective flags.
- **Local:** to replace a core or a fragment of your reference ligand, specify the `--substructure_inpainting` flag and provide the atom indices with the `--substructure` flag.
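To pick indices for `--substructure`, you can read them straight out of your reference SDF. A minimal stdlib sketch (a hypothetical helper, not part of the FLOWR.root API; it reports 0-based positions in file order, so verify against your own reference whether the CLI expects 0- or 1-based indices):

```python
# Hypothetical helper: list atom indices and element symbols from a V2000
# molblock (the first record of an SDF) to help choose --substructure indices.
# Indices here are 0-based file order; confirm the convention the CLI expects.

def atom_table(molblock: str) -> list:
    lines = molblock.splitlines()
    n_atoms = int(lines[3][0:3])           # V2000 counts line: first field = atom count
    return [(i, line.split()[3])           # atom line columns: x y z symbol ...
            for i, line in enumerate(lines[4:4 + n_atoms])]

BENZENE = """benzene
     sketch  2D

  6  6  0  0  0  0  0  0  0  0999 V2000
    0.0000    1.0000    0.0000 C   0  0
    0.8660    0.5000    0.0000 C   0  0
    0.8660   -0.5000    0.0000 C   0  0
    0.0000   -1.0000    0.0000 C   0  0
   -0.8660   -0.5000    0.0000 C   0  0
   -0.8660    0.5000    0.0000 C   0  0
  1  2  2  0
  2  3  1  0
  3  4  2  0
  4  5  1  0
  5  6  2  0
  6  1  1  0
M  END
"""

for idx, symbol in atom_table(BENZENE):
    print(idx, symbol)
```

For anything beyond a quick look, RDKit's `Chem.SDMolSupplier` is the more robust way to inspect atom ordering.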
Modify `scripts/generate_pdb.sl` according to your requirements, then submit the job via SLURM:

```bash
sbatch scripts/generate_pdb.sl
```

**Conditional generation options:**

- `--substructure_inpainting`: Enable substructure-constrained generation
- `--substructure`: Atom indices that you want to *change* (e.g., `21 23 30 31 32 33 34 35`)
- `--scaffold_inpainting`: Scaffold-constrained generation (using RDKit to extract the scaffold)
- `--func_group_inpainting`: Functional-group-constrained generation (using RDKit to extract all functional groups)
- `--core_inpainting`: Core-constrained generation (using RDKit to extract all cores)
- `--linker_inpainting`: Linker-constrained generation (using RDKit to extract all linkers)
- `--interaction_inpainting`: Interaction-constrained generation
- `--compute_interactions`: Required for `--interaction_inpainting` (using ProLIF to extract interactions)
**Post-processing options:**

- `--filter_valid_unique`: Filter for valid and unique molecules
- `--filter_diversity`: Apply diversity filtering
- `--diversity_threshold`: Tanimoto similarity threshold for diversity (default: 0.7)
- `--optimize_gen_ligs`: Optimize geometries in-pocket (using RDKit)
- `--optimize_gen_ligs_hs`: Optimize ligand hydrogens in-pocket (using RDKit)
- `--filter_cond_substructure`: Filter to ensure the inpainting constraint is satisfied
- `--filter_pb_valid`: Filter generated molecules by PoseBusters validity
- `--calculate_strain_energies`: Calculate strain energies for generated molecules (using RDKit)
- `--compute_interaction_recovery`: Calculate interaction recovery (using ProLIF)
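For reference, the diversity filter works on Tanimoto similarity between molecular fingerprints, which is just the intersection-over-union of "on" bits. An illustrative stdlib sketch of the metric only (not the repository's actual filtering code, whose exact semantics may differ):

```python
# Tanimoto similarity of two fingerprints given as sets of "on" bit positions:
# |A ∩ B| / |A ∪ B|. With --diversity_threshold 0.7, a molecule scoring above
# 0.7 against an already-kept one would presumably be discarded.

def tanimoto(a: set, b: set) -> float:
    if not a and not b:
        return 1.0                      # two empty fingerprints count as identical
    return len(a & b) / len(a | b)

fp_kept = {3, 17, 42, 101, 256}
fp_new = {3, 17, 42, 101, 300}
print(tanimoto(fp_kept, fp_new))        # 4 shared bits / 6 total bits
```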
- **Output:** Generated ligands are saved as an SDF file at the specified location (`save_dir`), alongside the extracted pockets. The SDF file also contains predicted affinity values (pIC50, pKi, pKd, pEC50).
- **Runtime:** Depends on system size, hardware specs, and batch size, but roughly 15 s for 100 ligands on an H100 GPU.
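Since the predicted affinities travel as SDF data items, they can be pulled out with RDKit or even plain text parsing. A minimal stdlib sketch (the tag names `pIC50`/`pKd` below are assumptions for illustration; check your output SDF for the exact names FLOWR.root writes):

```python
# Parse "> <TAG>\nvalue" data items from one SDF record using only the stdlib.
# Tag names are assumed; inspect the generated SDF (or use RDKit's
# Chem.SDMolSupplier) for the exact property names.
import re

def sdf_properties(record: str) -> dict:
    return {m.group(1): m.group(2).strip()
            for m in re.finditer(r">\s+<([^>]+)>\r?\n([^\r\n]*)", record)}

RECORD = """lig_0
  generated

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <pIC50>
7.21

> <pKd>
6.85

$$$$
"""
print(sdf_properties(RECORD))
```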
## Affinity prediction

Provide a protein PDB/CIF and a ligand file (SDF/MOL/PDB). Modify `scripts/predict_aff.sl` according to your requirements, then submit the job via SLURM:

```bash
sbatch scripts/predict_aff.sl
```

- **Output:** Ligands are saved as an SDF file at the specified location (`save_dir`). The SDF file contains predicted affinity values (pIC50, pKi, pKd, pEC50).
## Ligand-only generation

For ligand-only generation without a protein context, you can use the SDF-based generation script. All inpainting modes can be used here as well. Note: use the `flowr_root_v2_mol.ckpt` checkpoint for this!

Modify `scripts/generate_sdf.sl` according to your requirements.
**Conditional generation options:**

- `--substructure_inpainting`: Enable substructure-constrained generation
- `--substructure`: Atom indices that you want to *change* (e.g., `21 23 30 31 32 33 34 35`)
- `--scaffold_inpainting`: Scaffold-constrained generation (using RDKit to extract the scaffold)
- `--func_group_inpainting`: Functional-group-constrained generation (using RDKit to extract all functional groups)
- `--core_inpainting`: Core-constrained generation (using RDKit to extract all cores)
- `--linker_inpainting`: Linker-constrained generation (using RDKit to extract all linkers)
**Post-processing options:**

- `--filter_valid_unique`: Filter for valid and unique molecules
- `--filter_diversity`: Apply diversity filtering
- `--diversity_threshold`: Tanimoto similarity threshold for diversity (default: 0.9)
- `--add_hs_gen_mols`: Add hydrogens to generated molecules (using RDKit)
- `--optimize_gen_mols_rdkit`: Optimize geometries (using RDKit)
- `--optimize_gen_mols_xtb`: Optimize geometries (using xTB)
- `--calculate_strain_energies`: Calculate strain energies for generated molecules (using RDKit)
- `--filter_cond_substructure`: Filter to ensure the inpainting constraint is satisfied
Submit the job via SLURM:

```bash
sbatch scripts/generate_sdf.sl
```

- **Output:** Generated ligands are saved as an SDF file at the specified location (`save_dir`).
- **Runtime:** Depends on the number of molecules, hardware specs, and batch size.
## Training

To train FLOWR.root on preprocessed datasets downloaded from Google Drive, modify `scripts/train.sh` to your needs and run:

```bash
bash scripts/train.sh
```

- **Output:** Checkpoints will be saved at the specified location (`save_dir`).
## Data preprocessing

To train/finetune FLOWR.root on your own custom datasets, you'll need to preprocess your protein-ligand complexes into the required LMDB format. The `flowr/data/preprocess_data/` directory contains all necessary SLURM batch scripts to streamline this workflow.

Your input data should be organized in a folder named `data/` with the following structure:

- Ligand files: SDF format
- Protein files: PDB format
- Naming convention: files must share a consistent system identifier, e.g.:

```
data/
├── system_1.sdf
├── system_1.pdb
├── system_2.sdf
├── system_2.pdb
└── ...
```
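Before launching the preprocessing jobs, it is worth checking that every system identifier has both files. A small hypothetical helper (stdlib only, not part of the repo):

```python
# Report systems in data/ that are missing either the .sdf ligand
# or the .pdb protein file.
from pathlib import Path

def unpaired_systems(data_dir: str) -> list:
    root = Path(data_dir)
    sdf_stems = {p.stem for p in root.glob("*.sdf")}
    pdb_stems = {p.stem for p in root.glob("*.pdb")}
    # symmetric difference = identifiers present in one format but not the other
    return sorted(sdf_stems ^ pdb_stems)

missing = unpaired_systems("data")
print("incomplete systems:", missing if missing else "none")
```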
The preprocessing pipeline consists of three sequential steps.

### Step 1: Parallel preprocessing

This script parallelizes the preprocessing across multiple jobs, creating N LMDB databases.

Modify `flowr/data/preprocess_data/custom_data/preprocess.sl` according to:

- Your compute environment (partition, memory, time limits)
- Your folder structure (paths to the `data/` directory)
- The number of parallel jobs via the `num_jobs` parameter (e.g., `num_jobs=100` for larger, `num_jobs=10` for smaller datasets)
- The SLURM array size (`--array=1-N`, where N ≥ `num_jobs`)
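Conceptually, each SLURM array task processes its own contiguous chunk of the systems. A sketch of that sharding logic (the actual implementation lives in the preprocessing scripts and may differ):

```python
# Sketch: shard n_systems across num_jobs SLURM array tasks (job_id is 1-based,
# matching --array=1-N). Illustrative only; not the repository's actual code.

def shard(n_systems: int, num_jobs: int, job_id: int) -> range:
    per_job = -(-n_systems // num_jobs)                  # ceiling division
    start = (job_id - 1) * per_job
    return range(start, min(start + per_job, n_systems))

# e.g. 10 systems over 3 jobs -> chunks of sizes 4, 4, 2
for job_id in (1, 2, 3):
    print(job_id, list(shard(10, 3, job_id)))
```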
Submit the job:

```bash
sbatch flowr/data/preprocess_data/custom_data/preprocess.sl
```
### Step 2: Merge

Once all preprocessing jobs complete, merge the individual LMDB chunks into a single database. Modify `flowr/data/preprocess_data/custom_data/merge.sl` if needed, then submit the merge job:

```bash
sbatch flowr/data/preprocess_data/custom_data/merge.sl
```

- **Output:** Unified LMDB saved in the `final/` folder
### Step 3: Data statistics

This final step computes essential data distribution statistics required for training. Modify `flowr/data/preprocess_data/custom_data/data_statistics.sl` according to your split preference:

**Option A: Custom train/val/test split**

- Place your `splits.npz` file (with keys `idx_train`, `idx_val`, and `idx_test` containing the split indices) in the `final/` folder
- Comment out the `--val_size` and `--test_size` parameters in `data_statistics.sl`

**Option B: Random split**

- The script will automatically create train/val/test splits with the specified sizes
- Modify `--val_size` and `--test_size` as needed
- Adjust `--seed` for reproducibility

Then submit the statistics job:

```bash
sbatch flowr/data/preprocess_data/custom_data/data_statistics.sl
```
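For a custom split, `splits.npz` only needs the three integer index arrays with the key names given above. A hedged stdlib sketch of building disjoint splits (the file itself is then written with one `numpy.savez` call, numpy being presumably available in the environment):

```python
# Build disjoint train/val/test index lists for a custom splits.npz.
# Saving is one numpy call (numpy assumed available in the flowr_root env):
#     np.savez("final/splits.npz", idx_train=train, idx_val=val, idx_test=test)
import random

def make_split(n: int, val_size: int, test_size: int, seed: int = 42):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # deterministic for a fixed seed
    test = sorted(idx[:test_size])
    val = sorted(idx[test_size:test_size + val_size])
    train = sorted(idx[test_size + val_size:])
    return train, val, test

train, val, test = make_split(1000, val_size=50, test_size=50)
print(len(train), len(val), len(test))      # 900 50 50
```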
- **Output:** Statistics saved alongside the final LMDB database
## Fine-tuning

FLOWR.root can be fine-tuned on your custom datasets using full-model or LoRA fine-tuning. Before fine-tuning, ensure you have:

- Preprocessed your custom dataset following the data preprocessing workflow
- Downloaded the pre-trained FLOWR.root checkpoint from Google Drive

**Full fine-tuning:** modify `scripts/finetune.sl` according to your setup, then submit the job:

```bash
sbatch scripts/finetune.sl
```

**LoRA fine-tuning:** modify `scripts/finetune_lora.sl` according to your setup, then submit the job:

```bash
sbatch scripts/finetune_lora.sl
```
## Contributing

Contributions are welcome! If you have ideas, bug fixes, or improvements, please open an issue or submit a pull request.

## License

This project is licensed under the MIT License.

## Citation

If you use FLOWR.root in your research, please cite it as follows:
```bibtex
@misc{cremer2025flowrrootflowmatchingbased,
      title={FLOWR.root: A flow matching based foundation model for joint multi-purpose structure-aware 3D ligand generation and affinity prediction},
      author={Julian Cremer and Tuan Le and Mohammad M. Ghahremanpour and Emilia Sługocka and Filipe Menezes and Djork-Arné Clevert},
      year={2025},
      eprint={2510.02578},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM},
      url={https://arxiv.org/abs/2510.02578},
}
```