This is an official implementation of the paper Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking.
Matcha is a molecular docking pipeline that combines multi-stage flow matching with learned scoring and physical validity filtering. Our approach consists of three sequential stages
that progressively refine docking predictions, each implemented as a flow matching model operating on an appropriate geometric space (R^3, SO(3), and SO(2)).
We enhance the prediction quality through a dedicated scoring model and apply
unsupervised physical validity filters to eliminate unrealistic poses.
Compared to a variety of existing approaches, Matcha demonstrates superior performance on the Astex and PDBBind test sets in terms of docking success rate and physical plausibility. Moreover, our method runs approximately 25× faster than modern large-scale co-folding models.
- Installation
- Datasets
- Preparing the config file
- Running inference with one script
- Running inference step-by-step
- Benchmarking and pocket-aligned RMSD computation
- License
- Citation
To install the matcha package, do the following:
cd matcha
pip install -e .

If you want to make predictions for existing test datasets, you need to download them first. Astex and PoseBusters datasets can be downloaded here. PDBBind_processed can be found here. DockGen can be downloaded from here.
If you need to run predictions on new data, you need to construct a dataset folder with the following structure (a minimal helper sketch for assembling this layout follows the listing):
- dataset_path
  - uid1
    - f'{uid1}_protein.pdb' - protein structure file
    - f'{uid1}_ligand.sdf' - ligand sdf file (can be an arbitrary conformation or a true position: we compute random conformers anyway)
  - uid2
    - f'{uid2}_protein.pdb'
    - f'{uid2}_ligand.sdf'
  - ...
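For convenience, here is a minimal, hypothetical helper (not part of the repo) that copies one protein/ligand pair into this layout; the script name, function name, and example paths are placeholders.

```python
# assemble_dataset.py -- hypothetical helper (not shipped with Matcha) that copies
# a protein/ligand pair into the folder layout described above.
from pathlib import Path
import shutil

def add_complex(dataset_path: str, uid: str, protein_pdb: str, ligand_sdf: str) -> None:
    """Place one complex into dataset_path/uid using the expected file names."""
    target = Path(dataset_path) / uid
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy(protein_pdb, target / f"{uid}_protein.pdb")  # protein structure file
    shutil.copy(ligand_sdf, target / f"{uid}_ligand.sdf")    # any conformation is fine

if __name__ == "__main__":
    # Placeholder paths -- replace with your own files.
    add_complex("my_dataset", "complex1",
                "inputs/complex1_protein.pdb", "inputs/complex1_ligand.sdf")
```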
- You need to provide paths to the required datasets. Edit the configs/paths/paths.yaml file: add posebusters_data_dir, astex_data_dir, pdbbind_data_dir, dockgen_data_dir. For a new dataset, provide its path in any_data_dir.
If you do not need all of the datasets, comment out the unnecessary ones in the config's test_dataset_types, keeping only those you need.
- Provide paths for storing intermediate and final data:
  - cache_path: path where intermediate dataset cache files will be stored
  - data_folder: path where intermediate files will be stored (e.g., ESM embeddings)
  - inference_results_folder: <path_to_inference_results>, the path where inference results will be stored
- Download the checkpoints for Matcha from here. You need to download the pipeline folder. Provide the path to the folder where you store it (the folder that contains pipeline) in checkpoints_folder in paths.yaml (an illustrative sketch of the resulting config is given after this list).
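For reference, the sketch below shows one way to fill these keys in programmatically with placeholder paths. It assumes a flat key layout and that PyYAML is available; you can just as well edit configs/paths/paths.yaml by hand. Note that running it overwrites the file, so merge with any existing keys if needed.

```python
# write_paths_config.py -- optional, hypothetical helper (not part of the repo):
# fills configs/paths/paths.yaml with placeholder values. A flat key layout is
# assumed; editing the file by hand works just as well.
import yaml  # PyYAML

paths = {
    "posebusters_data_dir": "/data/posebusters",
    "astex_data_dir": "/data/astex",
    "pdbbind_data_dir": "/data/PDBBind_processed",
    "dockgen_data_dir": "/data/dockgen",
    "any_data_dir": "/data/my_new_dataset",             # custom dataset, if any
    "cache_path": "/data/matcha_cache",                 # intermediate dataset cache files
    "data_folder": "/data/matcha_data",                 # intermediate files, e.g. ESM embeddings
    "inference_results_folder": "/data/matcha_results", # where predictions and metrics are saved
    "checkpoints_folder": "/data/matcha_checkpoints",   # the folder that contains `pipeline`
}

with open("configs/paths/paths.yaml", "w") as f:
    yaml.safe_dump(paths, f, sort_keys=False)  # overwrites the file
```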
The following script runs all preprocessing and inference steps in one go.
Provide --compute_final_metrics if your dataset has true ligand positions in f'{uid_i}_ligand.sdf', so that we can compute RMSD metrics and run the PoseBusters filters.
The argument -n inference_folder_name sets the name of the folder where inference results for the dataset will be stored.
CUDA_VISIBLE_DEVICES=0 python scripts/full_inference.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n_samples 40 --compute_final_metrics

This script performs, step by step, the computation of protein ESM embeddings, docking predictions, physics-aware unsupervised post-filtration, scoring, and saving of the predictions to SDF files.
Run the following scripts to compute protein sequences and ESM embeddings for all test datasets.
python scripts/prepare_esm_sequences.py -p configs/paths/paths.yaml
CUDA_VISIBLE_DEVICES=0 python scripts/compute_esm_embeddings.py -p configs/paths/paths.yaml

To run inference:
CUDA_VISIBLE_DEVICES=0 python scripts/run_inference_pipeline.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n_samples 40

Then run posebusters_unsupervised.py for fast unsupervised post-filtration and save_preds_for_npbench.py to get the top-1 ranked pose. The top-ranked poses for each complex are stored in inference_results_folder/inference_folder_name/sdf_predictions.
You need to provide the same --n_samples value as for run_inference_pipeline.py.
CUDA_VISIBLE_DEVICES=0 python scripts/posebusters_unsupervised.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n_samples 40
python scripts/save_preds_for_npbench.py -p configs/paths/paths.yaml -n inference_folder_name

Optionally, once you have your predictions, you can compute sample-level and dataset-level metrics (such as symmetry-corrected RMSD and PoseBusters filter pass rates).
Individual metrics for each sample are stored in the inference results folder in the file f'{dataset_name}_final_preds_all_metrics.npy'.
Dataset-level statistics are stored in the inference results folder in the file f'{dataset_name}_final_metrics.csv'.
This can be done with the command:
python scripts/compute_metrics.py -p configs/paths/paths.yaml -n inference_folder_name

If you need to compute metrics for the predictions of other docking tools, you first need to construct a folder of method predictions with a certain structure. Docking methods can be divided into those that modify the protein structure (e.g., co-folding methods, FlowDock, NeuralPlexer) and those that use the reference protein structure (e.g., Matcha, DiffDock, SMINA). We use a separate flag, has_pred_proteins, to differentiate these cases. When has_pred_proteins=False, the folder needs to contain only ligands; when has_pred_proteins=True, you need to store both proteins and ligands.
The structure of the folder with predictions:
- method_name
  - dataset_name
    - uid1
      - conf_0
        - 'prot.pdb' - protein structure file (optional, needed when has_pred_proteins=True)
        - 'lig_0.sdf' - ligand sdf file
    - uid2
      - conf_0
        - 'prot.pdb' - protein structure file (optional, needed when has_pred_proteins=True)
        - 'lig_0.sdf' - ligand sdf file
    - ...
Then, run the compute_aligned_rmsd.py script to align protein pockets and compute RMSD and PoseBusters metrics.
It supports two types of alignment ('base' and 'pocket'); we prefer the 'base' option because it does not produce overoptimistic results (see Appendix D of the original paper for a discussion of this topic).
You need to set the methods_data dict inside the script, providing method names and their has_pred_proteins flags (a hypothetical sketch is given below).
You also need to provide the path where you store the initial method predictions (--init-preds-path) as well as the dataset names you want to evaluate (dataset_names).
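For illustration only, methods_data might look roughly like the sketch below; the exact structure expected by compute_aligned_rmsd.py may differ, so treat the layout and method names as placeholders and check the script itself.

```python
# Hypothetical sketch of the methods_data dict inside scripts/compute_aligned_rmsd.py.
# The real structure may differ -- check the script. Keys are method names
# (matching the prediction folder names); values carry the has_pred_proteins flag.
methods_data = {
    "matcha":   {"has_pred_proteins": False},  # uses the reference protein structure
    "diffdock": {"has_pred_proteins": False},  # uses the reference protein structure
    "flowdock": {"has_pred_proteins": True},   # predicts/modifies the protein structure
}
```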
Then, run the script:
python scripts/compute_aligned_rmsd.py -p configs/paths/paths.yaml -a base --init-preds-path <path_to_initial_preds>

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
If you use Matcha in your work, please cite our paper:
@misc{frolova2025matchamultistageriemannianflow,
title={Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking},
author={Daria Frolova and Talgat Daulbaev and Egor Sevryugov and Sergei A. Nikolenko and Dmitry N. Ivankov and Ivan Oseledets and Marina A. Pak},
year={2025},
eprint={2510.14586},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.14586},
}