
Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking


This is an official implementation of the paper Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking.

News

We have updated the repository and the Hugging Face checkpoints: the current code and weights correspond to the improved pipeline and deliver substantially better results! The previous version remains available under the tag v1 (git checkout v1).

Overview

Matcha is a molecular docking pipeline that combines multi-stage flow matching with physical validity filtering. It consists of three sequential stages that progressively refine docking predictions, each implemented as a flow matching model operating on an appropriate geometric space (R³, SO(3), and SO(2)). Physical validity filters eliminate unrealistic poses, and GNINA minimization and scoring rank the final predictions.
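The refine → filter → rank composition described above can be sketched schematically. This is a toy illustration only, not the actual Matcha API: real stages operate on ligand poses in R³/SO(3)/SO(2), which are replaced here by plain numbers, and the stage, filter, and scorer callables are placeholders.

```python
def dock(initial_poses, stages, is_valid, score):
    """Illustrative composition only: refine -> filter -> rank."""
    poses = initial_poses
    for refine in stages:                        # three sequential flow-matching stages
        poses = [refine(p) for p in poses]
    poses = [p for p in poses if is_valid(p)]    # physical validity filters
    return sorted(poses, key=score)              # GNINA-style minimization and ranking

# Toy numeric stand-ins for geometric pose objects.
ranked = dock(
    [3.0, 1.0, 7.0],
    stages=[lambda p: p * 0.5, lambda p: p - 0.1, lambda p: round(p, 2)],
    is_valid=lambda p: p > 0,
    score=lambda p: p,
)
```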

[Figure: pipeline architecture]

Compared to a broad set of baseline approaches, Matcha demonstrates superior performance on the Astex and PDBBind test sets in terms of docking success rate and physical plausibility. Moreover, our method runs approximately 31× faster than modern large-scale co-folding models.

[Figure: results]

Installation

# Install with uv
uv sync

Or with pip:

pip install -e .

CLI usage

The matcha CLI wraps the full inference pipeline (ESM embeddings, 3-stage docking, PoseBusters filtering) with GNINA minimization and scoring into a single command.

Single ligand

uv run matcha -r protein.pdb -l ligand.sdf -o results/

Batch mode (multi-ligand file or directory)

uv run matcha -r protein.pdb --ligand-dir ligands.sdf -o results/

All molecules are processed in a single pipeline pass (native batching).

Key options

  • -r, --receptor: Protein structure (.pdb)
  • -l, --ligand: Single ligand (.sdf/.mol/.mol2/.pdb)
  • --ligand-dir: Multi-ligand .sdf file or directory
  • -o, --out: Output directory
  • -g, --device: auto, cpu, cuda, cuda:N, or mps (Apple Metal)
  • --gpus: Multi-GPU batch sharding, e.g. --gpus 2,3 (batch dir mode only)
  • --n-samples: Poses per ligand (default: 40)
  • --scorer: gnina (default), custom, or none
  • --scorer-minimize / --no-scorer-minimize: GNINA minimization (default: on)
  • --autobox-ligand: Box center from reference ligand
  • --center-x/y/z: Manual box center (Å)
  • --overwrite: Overwrite existing run

Run matcha --help for the full list.

Multi-GPU batch mode (2/3 GPUs)

For large ligand directories, Matcha can shard ligands across multiple GPUs by launching one process per GPU.

# 2 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 2,3 --box-json target_box.json -o out_2gpu

# 3 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 1,2,3 --box-json target_box.json -o out_3gpu

Outputs are merged into:

  • out_<...>/<run-name>/merged/
  • out_<...>/<run-name>/benchmark_summary.json
  • out_<...>/<run-name>/benchmark_summary.md

Search space

  • Blind docking (default): searches the entire protein surface.
  • Autobox: --autobox-ligand ref.sdf — centers the search on a reference ligand.
  • Manual box: --center-x X --center-y Y --center-z Z — explicit coordinates.
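If you only have a reference ligand but want explicit --center-x/--center-y/--center-z values, the box center can be approximated by the ligand centroid. A minimal stdlib-only sketch for a single-molecule V2000 .sdf (field widths per the CTfile format; RDKit is the more robust choice in practice):

```python
def sdf_centroid(sdf_text):
    """Mean atom position of the first molecule in a V2000 SDF string."""
    lines = sdf_text.splitlines()
    n_atoms = int(lines[3][:3])               # counts line: atom count in columns 1-3
    coords = []
    for line in lines[4:4 + n_atoms]:         # atom block: x, y, z in 10-char fields
        coords.append((float(line[0:10]), float(line[10:20]), float(line[20:30])))
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))
```

The three returned values can then be passed as the manual box center.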

Output

Single mode produces <run-name>_best.sdf (top pose) and <run-name>_poses.sdf (all ranked poses). Batch mode creates best_poses/ and all_poses/ directories with per-ligand SDFs. A detailed log file is written alongside the results.
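Collecting the per-ligand best poses from a batch run is a one-liner with pathlib. A sketch under the assumption (from the description above) that best_poses/ holds one SDF per ligand:

```python
from pathlib import Path

def best_pose_files(out_dir):
    """List per-ligand best-pose SDFs produced by a batch run."""
    return sorted(Path(out_dir).glob("best_poses/*.sdf"))
```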

Datasets

Existing datasets

Astex and PoseBusters datasets can be downloaded here. PDBBind_processed can be found here. DockGen can be downloaded from here.

Adding a new dataset

Use a dataset folder with the following structure:

dataset_path/
    uid1/
        uid1_protein.pdb
        uid1_ligand.sdf
    uid2/
        uid2_protein.pdb
        uid2_ligand.sdf
    ...
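The expected layout can be generated (or sanity-checked) with a few lines of stdlib Python; make_entry is an illustrative helper and the uid values are placeholders:

```python
from pathlib import Path

def make_entry(dataset_path, uid, protein_pdb, ligand_sdf):
    """Create one <uid>/ entry in the layout Matcha expects."""
    entry = Path(dataset_path) / uid
    entry.mkdir(parents=True, exist_ok=True)
    (entry / f"{uid}_protein.pdb").write_text(protein_pdb)
    (entry / f"{uid}_ligand.sdf").write_text(ligand_sdf)
    return entry
```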

Preparing the config file

  1. Edit configs/paths/paths.yaml: set posebusters_data_dir, astex_data_dir, pdbbind_data_dir, dockgen_data_dir (or any_data_dir for a new dataset). Comment out unneeded entries in test_dataset_types.

  2. Set paths for intermediate and final data:

    • cache_path, data_folder, inference_results_folder
    • preprocessed_receptors_base: root directory for preprocessed protein structures used by the GNINA affinity scripts (see Protein preprocessing). Required when using GNINA steps; layout: {preprocessed_receptors_base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb.
  3. Download the matcha_pipeline checkpoints folder from Hugging Face (LigandPro/Matcha). Set checkpoints_folder in paths.yaml to the directory that contains it.
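The preprocessed-receptor layout from step 2 can be expressed as a small helper (illustrative only; the actual GNINA scripts build these paths internally):

```python
from pathlib import Path

def preprocessed_receptor_path(base, dataset, uid):
    """{preprocessed_receptors_base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb"""
    name = f"{dataset}_{uid}"
    return Path(base) / name / f"{name}_protein.pdb"
```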

Protein preprocessing (for GNINA)

Protein structures used by the GNINA affinity scripts must be preprocessed (hydrogenation, PDBQT, etc.). We use the dockprep-pipeline for receptor and ligand preparation; see that repository for a minimal pipeline (Reduce/OpenMM hydrogenation, Meeko PDBQT). Further details are in the paper.

Running inference step-by-step

Preprocessing

uv run python scripts/prepare_esm_sequences.py -p configs/paths/paths.yaml
CUDA_VISIBLE_DEVICES=0 uv run python scripts/compute_esm_embeddings.py -p configs/paths/paths.yaml

Matcha inference

CUDA_VISIBLE_DEVICES=<gpu_device_id> uv run python scripts/run_inference_pipeline.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n-samples 20

Pose selection and filtration

To run the full pipeline including GNINA affinity, minimization, top-pose selection, and metrics:

uv run bash scripts/final_inference_pipeline.sh -n inference_folder_name -c configs/base.yaml -p configs/paths/paths.yaml -d <gpu_device_id> -s 20 -g </path/to/gnina_executable> [--compute_final_metrics]

You must set preprocessed_receptors_base in paths.yaml (or provide preprocessed structures as required by the GNINA scripts) and pass -g with the path to your GNINA executable. With --compute_final_metrics, the script computes dataset-level metrics for the top-1 pose of each complex, including symmetry-corrected RMSD and PoseBusters validity checks.

Benchmarking and pocket-aligned RMSD computation

For other docking methods, prepare a folder of predictions with the structure described in the script. Then:

uv run python scripts/compute_aligned_rmsd.py -p configs/paths/paths.yaml -a base --init-preds-path <path_to_initial_preds>

Set methods_data and dataset_names inside the script as needed. For each method in methods_data, set the has_predicted_proteins flag, which indicates that the method's protein PDB has coordinates differing from the original holo structure. Choose between base and pocket alignment (see Appendix G in the paper). By default, we use pocket alignment for methods that predict protein structures (e.g., AlphaFold3) and base alignment for rigid docking methods (e.g., DiffDock). In the rigid-docking case no alignment is actually performed; the results are only rearranged for the subsequent metrics computation. The resulting structures will appear in inference_results_folder/<baseline_method_name>_<pocket_alignment_type>.
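The per-method configuration inside the script might look like the following. This is a hypothetical shape for illustration only; the actual structure of methods_data in compute_aligned_rmsd.py may differ, so check the script itself:

```python
# Hypothetical illustration of the per-method flag described above.
methods_data = {
    "alphafold3": {"has_predicted_proteins": True},   # predicted protein: pocket alignment
    "diffdock": {"has_predicted_proteins": False},    # rigid docking: base
}
dataset_names = ["astex", "posebusters"]

# Methods whose proteins differ from the holo structure get pocket alignment.
pocket_aligned = [m for m, cfg in methods_data.items() if cfg["has_predicted_proteins"]]
```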

After aligning the predicted structures to the original holo protein structure, metrics from the best SDF predictions can be computed with:

uv run python scripts/compute_metrics_from_sdf.py -p configs/paths/paths.yaml -n <baseline_method_name>_<pocket_alignment_type> --prediction-type best_base_predictions

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

CC BY-NC 4.0

Citation

If you use Matcha in your work, please cite:

@misc{frolova2025matchamultistageriemannianflow,
      title={Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking}, 
      author={Daria Frolova and Talgat Daulbaev and Egor Sevriugov and Sergei A. Nikolenko and Dmitry N. Ivankov and Ivan Oseledets and Marina A. Pak},
      year={2025},
      eprint={2510.14586},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.14586}, 
}
