Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking
This is the official implementation of the paper Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking.
News
We have updated the repository and the Hugging Face checkpoints: the current code and weights correspond to the improved pipeline, which gives much better results. The previous version remains available as tag v1 (git checkout v1).
Overview
Matcha is a molecular docking pipeline that combines multi-stage flow matching with physical validity filtering. It consists of three sequential stages that progressively refine docking predictions, each implemented as a flow matching model operating on the appropriate geometric space (R³, SO(3), and SO(2)). Physical validity filters eliminate unrealistic poses, and GNINA minimization and scoring rank the final predictions.
Compared to other approaches, Matcha demonstrates superior performance on the Astex and PDBBind test sets in terms of docking success rate and physical plausibility. Moreover, our method runs approximately 31× faster than modern large-scale co-folding models.
Content
- Installation
- CLI usage
- Datasets
- Preparing the config file
- Protein preprocessing (for GNINA)
- Running inference step-by-step
- Benchmarking and pocket-aligned RMSD computation
- License
- Citation
Installation
# Install with uv
uv sync
Or with pip:
pip install -e .
CLI usage
The matcha CLI wraps the full inference pipeline (ESM embeddings, 3-stage docking, PoseBusters filtering) with GNINA minimization and scoring into a single command.
Single ligand
uv run matcha -r protein.pdb -l ligand.sdf -o results/
Batch mode (multi-ligand file or directory)
uv run matcha -r protein.pdb --ligand-dir ligands.sdf -o results/
All molecules are processed in a single pipeline pass (native batching).
Key options
| Flag | Description |
|---|---|
| -r, --receptor | Protein structure (.pdb) |
| -l, --ligand | Single ligand (.sdf/.mol/.mol2/.pdb) |
| --ligand-dir | Multi-ligand .sdf file or directory |
| -o, --out | Output directory |
| -g, --device | auto, cpu, cuda, cuda:N, or mps (Apple Metal) |
| --gpus | Multi-GPU batch sharding, e.g. --gpus 2,3 (batch dir mode only) |
| --n-samples | Poses per ligand (default: 40) |
| --scorer | gnina (default), custom, or none |
| --scorer-minimize / --no-scorer-minimize | GNINA minimization (default: on) |
| --autobox-ligand | Box center from reference ligand |
| --center-x/y/z | Manual box center (Å) |
| --overwrite | Overwrite existing run |
Run matcha --help for the full list.
Multi-GPU batch mode (2/3 GPUs)
For large ligand directories, Matcha can shard ligands across multiple GPUs by launching one process per GPU.
# 2 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 2,3 --box-json target_box.json -o out_2gpu
# 3 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 1,2,3 --box-json target_box.json -o out_3gpu
Outputs are merged into:
- out_<...>/<run-name>/merged/
- out_<...>/<run-name>/benchmark_summary.json
- out_<...>/<run-name>/benchmark_summary.md
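How Matcha assigns ligands to GPUs is not described here. As a rough illustration of the idea of one shard per GPU, the sketch below splits a ligand list round-robin over the ids given to --gpus. The function names shard_round_robin and parse_gpus are hypothetical and are not part of the matcha package.

```python
from typing import Dict, List


def parse_gpus(value: str) -> List[int]:
    """Parse a comma-separated value such as "2,3" into [2, 3]."""
    return [int(part) for part in value.split(",") if part.strip()]


def shard_round_robin(ligands: List[str], gpu_ids: List[int]) -> Dict[int, List[str]]:
    """Assign ligands to GPU ids in round-robin order.

    Illustrative only: the real sharding strategy used by Matcha is an
    assumption here and may differ (e.g. contiguous chunks).
    """
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    shards: Dict[int, List[str]] = {gpu: [] for gpu in gpu_ids}
    for index, ligand in enumerate(ligands):
        shards[gpu_ids[index % len(gpu_ids)]].append(ligand)
    return shards


if __name__ == "__main__":
    ligand_files = ["a.sdf", "b.sdf", "c.sdf", "d.sdf", "e.sdf"]
    for gpu, shard in shard_round_robin(ligand_files, parse_gpus("2,3")).items():
        print(gpu, shard)
```

Round-robin keeps shard sizes within one ligand of each other, which is a reasonable default when per-ligand cost is roughly uniform.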
Search space
- Blind docking (default): searches the entire protein surface.
- Autobox: --autobox-ligand ref.sdf centers the search on a reference ligand.
- Manual box: --center-x X --center-y Y --center-z Z sets explicit coordinates.
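To relate the autobox and manual modes, the sketch below derives a box center from a reference ligand and prints it as --center-x/y/z flags. It assumes the center is the plain centroid of the atom coordinates, which may not be exactly what --autobox-ligand computes, and it only reads the first V2000 record of an SDF file. The helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def read_v2000_coords(sdf_text: str) -> List[Coord]:
    """Read atom coordinates from the first V2000 mol block in an SDF string."""
    lines = sdf_text.splitlines()
    # Line 4 of a mol block is the counts line; columns 1-3 hold the atom count.
    n_atoms = int(lines[3][:3])
    coords: List[Coord] = []
    for line in lines[4:4 + n_atoms]:
        # x, y, z occupy three fixed-width fields of 10 characters each.
        x, y, z = (float(line[start:start + 10]) for start in (0, 10, 20))
        coords.append((x, y, z))
    return coords


def centroid(coords: Sequence[Coord]) -> Coord:
    """Arithmetic mean of the coordinates (assumed box center)."""
    if not coords:
        raise ValueError("no atoms found")
    n = len(coords)
    return (
        sum(c[0] for c in coords) / n,
        sum(c[1] for c in coords) / n,
        sum(c[2] for c in coords) / n,
    )


def format_center_flags(center: Coord) -> str:
    """Format a center as the manual-box flags from the options table."""
    x, y, z = center
    return f"--center-x {x:.3f} --center-y {y:.3f} --center-z {z:.3f}"
```

For example, format_center_flags(centroid(read_v2000_coords(open("ref.sdf").read()))) yields a string that can be appended to a matcha command in place of --autobox-ligand ref.sdf.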
Output
Single mode produces <run-name>_best.sdf (top pose) and <run-name>_poses.sdf (all ranked poses). Batch mode creates best_poses/ and all_poses/ directories with per-ligand SDFs. A detailed log file is written alongside the results.
Datasets
Existing datasets
Astex and PoseBusters datasets can be downloaded here. PDBBind_processed can be found here. DockGen can be downloaded from here.
Adding a new dataset
Use a dataset folder with the following structure:
dataset_path/
    uid1/
        uid1_protein.pdb
        uid1_ligand.sdf
    uid2/
        uid2_protein.pdb
        uid2_ligand.sdf
    ...
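Before running preprocessing on a new dataset, it can help to check that every entry follows the layout above. The following is a minimal sketch using only the standard library; find_incomplete_entries is a hypothetical helper, not a function shipped with Matcha.

```python
from pathlib import Path
from typing import Dict, List


def find_incomplete_entries(dataset_path: str) -> Dict[str, List[str]]:
    """Return {uid: [missing file names]} for entries lacking a protein or ligand file."""
    problems: Dict[str, List[str]] = {}
    for entry in sorted(Path(dataset_path).iterdir()):
        if not entry.is_dir():
            continue  # stray files at the top level are ignored
        uid = entry.name
        expected = [f"{uid}_protein.pdb", f"{uid}_ligand.sdf"]
        missing = [name for name in expected if not (entry / name).is_file()]
        if missing:
            problems[uid] = missing
    return problems


if __name__ == "__main__":
    import sys

    for uid, missing in find_incomplete_entries(sys.argv[1]).items():
        print(f"{uid}: missing {', '.join(missing)}")
```

An empty result means every uid folder contains both uid_protein.pdb and uid_ligand.sdf.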
Preparing the config file
- Edit configs/paths/paths.yaml: set posebusters_data_dir, astex_data_dir, pdbbind_data_dir, dockgen_data_dir (or any_data_dir for a new dataset). Comment out unneeded entries in test_dataset_types.
- Set paths for intermediate and final data:
  - cache_path, data_folder, inference_results_folder
  - preprocessed_receptors_base: root directory for preprocessed protein structures used by the GNINA affinity scripts (see Protein preprocessing). Required when using GNINA steps; layout: {preprocessed_receptors_base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb.
- Download checkpoints from Hugging Face (LigandPro/Matcha) (the matcha_pipeline folder). Set checkpoints_folder in paths.yaml to the folder that contains it.
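The preprocessed_receptors_base layout can be expressed as a small path helper, which is useful when placing preprocessed structures by hand. This is a sketch: preprocessed_receptor_path is a hypothetical name, and the dataset and uid strings in the example are placeholders, since the exact dataset identifiers used by the scripts are not listed here.

```python
from pathlib import Path


def preprocessed_receptor_path(base: str, dataset: str, uid: str) -> Path:
    """Build {base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb."""
    name = f"{dataset}_{uid}"
    return Path(base) / name / f"{name}_protein.pdb"


if __name__ == "__main__":
    # "astex" and "1abc" are placeholder values for illustration only.
    print(preprocessed_receptor_path("/data/receptors", "astex", "1abc"))
```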
Protein preprocessing (for GNINA)
Protein structures used by the GNINA affinity scripts must be preprocessed (hydrogenation, PDBQT, etc.). We use the dockprep-pipeline for receptor and ligand preparation; see that repository for a minimal pipeline (Reduce/OpenMM hydrogenation, Meeko PDBQT). Further details are in the paper.
Running inference step-by-step
Preprocessing
uv run python scripts/prepare_esm_sequences.py -p configs/paths/paths.yaml
CUDA_VISIBLE_DEVICES=0 uv run python scripts/compute_esm_embeddings.py -p configs/paths/paths.yaml
Matcha inference
CUDA_VISIBLE_DEVICES=<gpu_device_id> uv run python scripts/run_inference_pipeline.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n-samples 20
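When the preprocessing and inference commands above are run repeatedly, they can be scripted. The sketch below only assembles the same three command lines and, optionally, runs them in order; the function names are hypothetical and no flags beyond those shown above are used.

```python
import os
import subprocess
from typing import Dict, List, Tuple

# Each command is a pair: (extra environment variables, argv list).
Command = Tuple[Dict[str, str], List[str]]


def build_inference_commands(
    paths_yaml: str, config_yaml: str, run_name: str, n_samples: int, gpu_id: int
) -> List[Command]:
    """Assemble the ESM preprocessing and Matcha inference commands."""
    gpu_env = {"CUDA_VISIBLE_DEVICES": str(gpu_id)}
    base = ["uv", "run", "python"]
    return [
        ({}, base + ["scripts/prepare_esm_sequences.py", "-p", paths_yaml]),
        (gpu_env, base + ["scripts/compute_esm_embeddings.py", "-p", paths_yaml]),
        (
            gpu_env,
            base
            + ["scripts/run_inference_pipeline.py", "-c", config_yaml, "-p", paths_yaml]
            + ["-n", run_name, "--n-samples", str(n_samples)],
        ),
    ]


def run_all(commands: List[Command]) -> None:
    """Run the commands sequentially, stopping at the first failure."""
    for extra_env, argv in commands:
        subprocess.run(argv, env={**os.environ, **extra_env}, check=True)
```

Calling run_all(build_inference_commands("configs/paths/paths.yaml", "configs/base.yaml", "inference_folder_name", 20, 0)) from the repository root mirrors the shell commands above.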
Pose selection and filtration
To run the full pipeline including GNINA affinity, minimization, top-pose selection, and metrics:
uv run bash scripts/final_inference_pipeline.sh -n inference_folder_name -c configs/base.yaml -p configs/paths/paths.yaml -d <gpu_device_id> -s 20 -g </path/to/gnina_executable> [--compute_final_metrics]
You must set preprocessed_receptors_base in paths.yaml (or provide preprocessed structures as required by the GNINA scripts) and pass -g with the path to your GNINA runner script.
If you pass --compute_final_metrics, the script computes dataset-level metrics for the top-1 pose of each complex.
Metrics include symmetry-corrected RMSD and PoseBusters filters.
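For readers unfamiliar with symmetry-corrected RMSD: it is the minimum RMSD over atom relabelings that map the molecule onto itself, so swapping chemically equivalent atoms is not penalized. The sketch below illustrates the idea in plain Python. It is not Matcha's implementation, it assumes the symmetry permutations are supplied by the caller (real tools derive them from molecular graph automorphisms), and it performs no structural alignment.

```python
import math
from typing import Iterable, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(pred: Sequence[Coord], ref: Sequence[Coord]) -> float:
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum((p[k] - r[k]) ** 2 for p, r in zip(pred, ref) for k in range(3))
    return math.sqrt(total / len(pred))


def symmetry_corrected_rmsd(
    pred: Sequence[Coord], ref: Sequence[Coord], symmetries: Iterable[Sequence[int]]
) -> float:
    """Minimum RMSD over the given atom permutations.

    Each permutation lists, for every reference atom, the index of the
    predicted atom matched to it. The identity permutation should be included.
    """
    return min(rmsd([pred[i] for i in perm], ref) for perm in symmetries)


if __name__ == "__main__":
    # Three collinear atoms whose two end atoms are equivalent.
    ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    pred = [(2.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]  # end atoms swapped
    print(rmsd(pred, ref))
    print(symmetry_corrected_rmsd(pred, ref, [(0, 1, 2), (2, 1, 0)]))
```

In this example the plain RMSD is nonzero even though the pose is identical up to relabeling, while the symmetry-corrected value is zero.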
Benchmarking and pocket-aligned RMSD computation
For other docking methods, prepare a folder of predictions with the structure described in the script. Then:
uv run python scripts/compute_aligned_rmsd.py -p configs/paths/paths.yaml -a base --init-preds-path <path_to_initial_preds>
Set methods_data and dataset_names inside the script as needed.
For each method in methods_data, set the has_predicted_proteins flag, which indicates whether the method's protein PDB has coordinates that differ from the original holo structure.
Choose between base and pocket alignment (see Appendix G in the paper).
By default, we use pocket alignment for methods that have predicted protein structures (e.g. AlphaFold3) and base for rigid docking methods (e.g. DiffDock). For rigid docking, no alignment is performed; the results are only rearranged for the subsequent metrics computation.
The resulting structures will appear in inference_results_folder/<baseline_method_name>_<pocket_alignment_type>.
After aligning the predicted structures to the original holo protein structure, metrics from the best SDF predictions can be computed with:
uv run python scripts/compute_metrics_from_sdf.py -p configs/paths/paths.yaml -n <baseline_method_name>_<pocket_alignment_type> --prediction-type best_base_predictions
License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Citation
If you use Matcha in your work, please cite:
@misc{frolova2025matchamultistageriemannianflow,
title={Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking},
author={Daria Frolova and Talgat Daulbaev and Egor Sevriugov and Sergei A. Nikolenko and Dmitry N. Ivankov and Ivan Oseledets and Marina A. Pak},
year={2025},
eprint={2510.14586},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.14586},
}