This is an official implementation of the paper Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking.
Matcha is a molecular docking pipeline that combines multi-stage flow matching with learned scoring and physical validity filtering. Our approach consists of three sequential stages
that progressively refine docking predictions, each implemented as a flow matching model operating on an appropriate geometric space (R^3, SO(3), and SO(2)).
We enhance the prediction quality through a dedicated scoring model and apply
unsupervised physical validity filters to eliminate unrealistic poses.
Compared to a variety of existing approaches, Matcha demonstrates superior performance on the Astex and PDBBind test sets in terms of docking success rate and physical plausibility. Moreover, our method runs approximately 25× faster than modern large-scale co-folding models.
- Installation
- Datasets
- Preparing the config file
- Running inference with one script
- Running inference step-by-step
- Benchmarking and pocket-aligned RMSD computation
- License
- Citation
To install the matcha package, do the following:
cd matcha
pip install -e .

If you want to make predictions for existing test datasets, you need to download them first. Astex and PoseBusters datasets can be downloaded here. PDBBind_processed can be found here. DockGen can be downloaded from here.
If you need to run predictions on new data, you need to construct a dataset folder with the following structure (a minimal helper sketch for assembling this layout follows the listing):
- dataset_path
  - uid1
    - f'{uid1}_protein.pdb' - protein structure file
    - f'{uid1}_ligand.sdf' - ligand sdf file (can be an arbitrary conformation or a true position: we compute random conformers anyway)
  - uid2
    - f'{uid2}_protein.pdb'
    - f'{uid2}_ligand.sdf'
  - ...
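For convenience, here is a minimal, hypothetical helper (not part of the repo) that copies one protein/ligand pair into this layout; the script name, function name, and example paths are placeholders.

```python
# assemble_dataset.py -- hypothetical helper (not shipped with Matcha) that copies
# a protein/ligand pair into the folder layout described above.
from pathlib import Path
import shutil

def add_complex(dataset_path: str, uid: str, protein_pdb: str, ligand_sdf: str) -> None:
    """Place one complex into dataset_path/uid using the expected file names."""
    target = Path(dataset_path) / uid
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy(protein_pdb, target / f"{uid}_protein.pdb")  # protein structure file
    shutil.copy(ligand_sdf, target / f"{uid}_ligand.sdf")    # any conformation is fine

if __name__ == "__main__":
    # Placeholder paths -- replace with your own files.
    add_complex("my_dataset", "complex1",
                "inputs/complex1_protein.pdb", "inputs/complex1_ligand.sdf")
```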
- You need to provide paths to the required datasets. Edit the configs/paths/paths.yaml file: add posebusters_data_dir, astex_data_dir, pdbbind_data_dir, dockgen_data_dir. For a new dataset, provide its path in any_data_dir.
If you do not need all of the datasets, comment out the unnecessary ones in the config's test_dataset_types, keeping only those you need.
- Provide paths for storing intermediate and final data:
  - cache_path: path where intermediate dataset cache files will be stored
  - data_folder: path where intermediate files will be stored (e.g., ESM embeddings)
  - inference_results_folder: <path_to_inference_results>, the path where inference results will be stored
- Download the checkpoints for Matcha from here. You need to download the pipeline folder. Provide the path to the folder where you store it (the folder that contains pipeline) in checkpoints_folder in paths.yaml (an illustrative sketch of the resulting config is given after this list).
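For reference, the sketch below shows one way to fill these keys in programmatically with placeholder paths. It assumes a flat key layout and that PyYAML is available; you can just as well edit configs/paths/paths.yaml by hand. Note that running it overwrites the file, so merge with any existing keys if needed.

```python
# write_paths_config.py -- optional, hypothetical helper (not part of the repo):
# fills configs/paths/paths.yaml with placeholder values. A flat key layout is
# assumed; editing the file by hand works just as well.
import yaml  # PyYAML

paths = {
    "posebusters_data_dir": "/data/posebusters",
    "astex_data_dir": "/data/astex",
    "pdbbind_data_dir": "/data/PDBBind_processed",
    "dockgen_data_dir": "/data/dockgen",
    "any_data_dir": "/data/my_new_dataset",             # custom dataset, if any
    "cache_path": "/data/matcha_cache",                 # intermediate dataset cache files
    "data_folder": "/data/matcha_data",                 # intermediate files, e.g. ESM embeddings
    "inference_results_folder": "/data/matcha_results", # where predictions and metrics are saved
    "checkpoints_folder": "/data/matcha_checkpoints",   # the folder that contains `pipeline`
}

with open("configs/paths/paths.yaml", "w") as f:
    yaml.safe_dump(paths, f, sort_keys=False)  # overwrites the file
```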
The following script runs all preprocessing and inference steps in one go.
Provide --compute_final_metrics if your dataset has true ligand positions in f'{uid_i}_ligand.sdf', so that we can compute RMSD metrics and run the PoseBusters filters.
The argument -n inference_folder_name sets the name of the folder where inference results for the dataset will be stored.
CUDA_VISIBLE_DEVICES=0 python scripts/full_inference.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n_samples 40 --compute_final_metrics

This script performs, step by step, the computation of protein ESM embeddings, docking predictions, physics-aware unsupervised post-filtration, scoring, and saving of the predictions to SDF files.
Run the following scripts to compute protein sequences and ESM embeddings for all test datasets.
python scripts/prepare_esm_sequences.py -p configs/paths/paths.yaml
CUDA_VISIBLE_DEVICES=0 python scripts/compute_esm_embeddings.py -p configs/paths/paths.yaml

To run inference:
CUDA_VISIBLE_DEVICES=0 python scripts/run_inference_pipeline.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n_samples 40

Then run posebusters_unsupervised.py for fast unsupervised post-filtration and save_preds_for_npbench.py to get the top-1 ranked pose. The top-ranked poses for each complex are stored in inference_results_folder/inference_folder_name/sdf_predictions.
You need to provide the same --n_samples value as for run_inference_pipeline.py.
CUDA_VISIBLE_DEVICES=0 python scripts/posebusters_unsupervised.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n_samples 40
python scripts/save_preds_for_npbench.py -p configs/paths/paths.yaml -n inference_folder_name

Optionally, once you have your predictions, you can compute sample-level and dataset-level metrics (such as symmetry-corrected RMSD and PoseBusters filter pass rates).
Individual metrics for each sample are stored in the inference results folder in the file f'{dataset_name}_final_preds_all_metrics.npy'.
Dataset-level statistics are stored in the inference results folder in the file f'{dataset_name}_final_metrics.csv'.
This can be done with the command:
python scripts/compute_metrics.py -p configs/paths/paths.yaml -n inference_folder_name

If you need to compute metrics for the predictions of other docking tools, you first need to construct a folder of method predictions with a certain structure. Docking methods can be divided into those that modify the protein structure (e.g., co-folding methods, FlowDock, NeuralPlexer) and those that use the reference protein structure (e.g., Matcha, DiffDock, SMINA). We use a separate flag, has_pred_proteins, to differentiate these cases. When has_pred_proteins=False, the folder needs to contain only ligands; when has_pred_proteins=True, you need to store both proteins and ligands.
The structure of the folder with predictions:
- method_name
  - dataset_name
    - uid1
      - conf_0
        - 'prot.pdb' - protein structure file (optional, needed when has_pred_proteins=True)
        - 'lig_0.sdf' - ligand sdf file
    - uid2
      - conf_0
        - 'prot.pdb' - protein structure file (optional, needed when has_pred_proteins=True)
        - 'lig_0.sdf' - ligand sdf file
    - ...
Then, run the compute_aligned_rmsd.py script to align protein pockets and compute RMSD and PoseBusters metrics.
It supports two types of alignment ('base' and 'pocket'); we prefer the 'base' option because it does not produce overoptimistic results (see Appendix D of the original paper for a discussion of this topic).
You need to set the methods_data dict inside the script, providing method names and their has_pred_proteins flags (a hypothetical sketch is given below).
You also need to provide the path where you store the initial method predictions (--init-preds-path) as well as the dataset names you want to evaluate (dataset_names).
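For illustration only, methods_data might look roughly like the sketch below; the exact structure expected by compute_aligned_rmsd.py may differ, so treat the layout and method names as placeholders and check the script itself.

```python
# Hypothetical sketch of the methods_data dict inside scripts/compute_aligned_rmsd.py.
# The real structure may differ -- check the script. Keys are method names
# (matching the prediction folder names); values carry the has_pred_proteins flag.
methods_data = {
    "matcha":   {"has_pred_proteins": False},  # uses the reference protein structure
    "diffdock": {"has_pred_proteins": False},  # uses the reference protein structure
    "flowdock": {"has_pred_proteins": True},   # predicts/modifies the protein structure
}
```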
Then, run the script:
python scripts/compute_aligned_rmsd.py -p configs/paths/paths.yaml -a base --init-preds-path <path_to_initial_preds>

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
If you use Matcha in your work, please cite our paper:
@misc{frolova2025matchamultistageriemannianflow,
title={Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking},
author={Daria Frolova and Talgat Daulbaev and Egor Sevryugov and Sergei A. Nikolenko and Dmitry N. Ivankov and Ivan Oseledets and Marina A. Pak},
year={2025},
eprint={2510.14586},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.14586},
}