
Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking


This is an official implementation of the paper Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking.

News

We have updated the repository and the Hugging Face checkpoints: the current code and weights correspond to the improved pipeline and deliver substantially better results! The previous version remains available under the tag v1 (git checkout v1).

Overview

Matcha is a molecular docking pipeline that combines multi-stage flow matching with physical validity filtering. It consists of three sequential stages that progressively refine docking predictions, each implemented as a flow matching model operating on an appropriate geometric space (R³, SO(3), and SO(2)). Physical validity filters eliminate unrealistic poses, and GNINA minimization and scoring rank the final predictions.
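The refine → filter → rank composition described above can be sketched schematically. This is a toy illustration only, not the actual Matcha API: real stages operate on ligand poses in R³/SO(3)/SO(2), which are replaced here by plain numbers, and the stage, filter, and scorer callables are placeholders.

```python
def dock(initial_poses, stages, is_valid, score):
    """Illustrative composition only: refine -> filter -> rank."""
    poses = initial_poses
    for refine in stages:                        # three sequential flow-matching stages
        poses = [refine(p) for p in poses]
    poses = [p for p in poses if is_valid(p)]    # physical validity filters
    return sorted(poses, key=score)              # GNINA-style minimization and ranking

# Toy numeric stand-ins for geometric pose objects.
ranked = dock(
    [3.0, 1.0, 7.0],
    stages=[lambda p: p * 0.5, lambda p: p - 0.1, lambda p: round(p, 2)],
    is_valid=lambda p: p > 0,
    score=lambda p: p,
)
```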

[Figure: pipeline architecture]

Compared to a broad set of baseline approaches, Matcha demonstrates superior performance on the Astex and PDBBind test sets in terms of docking success rate and physical plausibility. Moreover, our method runs approximately 31× faster than modern large-scale co-folding models.

[Figure: results]

Installation

# Install with uv
uv sync

Or with pip:

pip install -e .

CLI usage

The matcha CLI wraps the full inference pipeline (ESM embeddings, 3-stage docking, PoseBusters filtering) with GNINA minimization and scoring into a single command.

Single ligand

uv run matcha -r protein.pdb -l ligand.sdf -o results/

Batch mode (multi-ligand file or directory)

uv run matcha -r protein.pdb --ligand-dir ligands.sdf -o results/

All molecules are processed in a single pipeline pass (native batching).

Key options

  • -r, --receptor: Protein structure (.pdb)
  • -l, --ligand: Single ligand (.sdf/.mol/.mol2/.pdb)
  • --ligand-dir: Multi-ligand .sdf file or directory
  • -o, --out: Output directory
  • -g, --device: auto, cpu, cuda, cuda:N, or mps (Apple Metal)
  • --gpus: Multi-GPU batch sharding, e.g. --gpus 2,3 (batch dir mode only)
  • --n-samples: Poses per ligand (default: 40)
  • --scorer: gnina (default), custom, or none
  • --scorer-minimize / --no-scorer-minimize: GNINA minimization (default: on)
  • --autobox-ligand: Box center from reference ligand
  • --center-x/y/z: Manual box center (Å)
  • --overwrite: Overwrite existing run

Run matcha --help for the full list.

Multi-GPU batch mode (2/3 GPUs)

For large ligand directories, Matcha can shard ligands across multiple GPUs by launching one process per GPU.

# 2 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 2,3 --box-json target_box.json -o out_2gpu

# 3 GPUs
uv run matcha -r protein.pdb --ligand-dir ligands/ --gpus 1,2,3 --box-json target_box.json -o out_3gpu

Outputs are merged into:

  • out_<...>/<run-name>/merged/
  • out_<...>/<run-name>/benchmark_summary.json
  • out_<...>/<run-name>/benchmark_summary.md

Search space

  • Blind docking (default): searches the entire protein surface.
  • Autobox: --autobox-ligand ref.sdf — centers the search on a reference ligand.
  • Manual box: --center-x X --center-y Y --center-z Z — explicit coordinates.
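If you only have a reference ligand but want explicit --center-x/--center-y/--center-z values, the box center can be approximated by the ligand centroid. A minimal stdlib-only sketch for a single-molecule V2000 .sdf (field widths per the CTfile format; RDKit is the more robust choice in practice):

```python
def sdf_centroid(sdf_text):
    """Mean atom position of the first molecule in a V2000 SDF string."""
    lines = sdf_text.splitlines()
    n_atoms = int(lines[3][:3])               # counts line: atom count in columns 1-3
    coords = []
    for line in lines[4:4 + n_atoms]:         # atom block: x, y, z in 10-char fields
        coords.append((float(line[0:10]), float(line[10:20]), float(line[20:30])))
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))
```

The three returned values can then be passed as the manual box center.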

Output

Single mode produces <run-name>_best.sdf (top pose) and <run-name>_poses.sdf (all ranked poses). Batch mode creates best_poses/ and all_poses/ directories with per-ligand SDFs. A detailed log file is written alongside the results.
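Collecting the per-ligand best poses from a batch run is a one-liner with pathlib. A sketch under the assumption (from the description above) that best_poses/ holds one SDF per ligand:

```python
from pathlib import Path

def best_pose_files(out_dir):
    """List per-ligand best-pose SDFs produced by a batch run."""
    return sorted(Path(out_dir).glob("best_poses/*.sdf"))
```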

Datasets

Existing datasets

Astex and PoseBusters datasets can be downloaded here. PDBBind_processed can be found here. DockGen can be downloaded from here.

Adding a new dataset

Use a dataset folder with the following structure:

dataset_path/
    uid1/
        uid1_protein.pdb
        uid1_ligand.sdf
    uid2/
        uid2_protein.pdb
        uid2_ligand.sdf
    ...
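The expected layout can be generated (or sanity-checked) with a few lines of stdlib Python; make_entry is an illustrative helper and the uid values are placeholders:

```python
from pathlib import Path

def make_entry(dataset_path, uid, protein_pdb, ligand_sdf):
    """Create one <uid>/ entry in the layout Matcha expects."""
    entry = Path(dataset_path) / uid
    entry.mkdir(parents=True, exist_ok=True)
    (entry / f"{uid}_protein.pdb").write_text(protein_pdb)
    (entry / f"{uid}_ligand.sdf").write_text(ligand_sdf)
    return entry
```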

Preparing the config file

  1. Edit configs/paths/paths.yaml: set posebusters_data_dir, astex_data_dir, pdbbind_data_dir, dockgen_data_dir (or any_data_dir for a new dataset). Comment out unneeded entries in test_dataset_types.

  2. Set paths for intermediate and final data:

    • cache_path, data_folder, inference_results_folder
    • preprocessed_receptors_base: root directory for preprocessed protein structures used by the GNINA affinity scripts (see Protein preprocessing). Required when using GNINA steps; layout: {preprocessed_receptors_base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb.
  3. Download the matcha_pipeline checkpoints folder from Hugging Face (LigandPro/Matcha). Set checkpoints_folder in paths.yaml to the directory that contains it.
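The preprocessed-receptor layout from step 2 can be expressed as a small helper (illustrative only; the actual GNINA scripts build these paths internally):

```python
from pathlib import Path

def preprocessed_receptor_path(base, dataset, uid):
    """{preprocessed_receptors_base}/{dataset}_{uid}/{dataset}_{uid}_protein.pdb"""
    name = f"{dataset}_{uid}"
    return Path(base) / name / f"{name}_protein.pdb"
```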

Protein preprocessing (for GNINA)

Protein structures used by the GNINA affinity scripts must be preprocessed (hydrogenation, PDBQT, etc.). We use the dockprep-pipeline for receptor and ligand preparation; see that repository for a minimal pipeline (Reduce/OpenMM hydrogenation, Meeko PDBQT). Further details are in the paper.

Running inference step-by-step

Preprocessing

uv run python scripts/prepare_esm_sequences.py -p configs/paths/paths.yaml
CUDA_VISIBLE_DEVICES=0 uv run python scripts/compute_esm_embeddings.py -p configs/paths/paths.yaml

Matcha inference

CUDA_VISIBLE_DEVICES=<gpu_device_id> uv run python scripts/run_inference_pipeline.py -c configs/base.yaml -p configs/paths/paths.yaml -n inference_folder_name --n-samples 20

Pose selection and filtration

To run the full pipeline including GNINA affinity, minimization, top-pose selection, and metrics:

uv run bash scripts/final_inference_pipeline.sh -n inference_folder_name -c configs/base.yaml -p configs/paths/paths.yaml -d <gpu_device_id> -s 20 -g </path/to/gnina_executable> [--compute_final_metrics]

You must set preprocessed_receptors_base in paths.yaml (or provide preprocessed structures as required by the GNINA scripts) and pass -g with the path to your GNINA executable. With --compute_final_metrics, the script computes dataset-level metrics for the top-1 pose of each complex, including symmetry-corrected RMSD and PoseBusters validity checks.

Benchmarking and pocket-aligned RMSD computation

For other docking methods, prepare a folder of predictions with the structure described in the script. Then:

uv run python scripts/compute_aligned_rmsd.py -p configs/paths/paths.yaml -a base --init-preds-path <path_to_initial_preds>

Set methods_data and dataset_names inside the script as needed. For each method in methods_data, set the has_predicted_proteins flag, which indicates that the method's protein PDB has coordinates differing from the original holo structure. Choose between base and pocket alignment (see Appendix G in the paper). By default, we use pocket alignment for methods that predict protein structures (e.g., AlphaFold3) and base alignment for rigid docking methods (e.g., DiffDock). In the rigid-docking case no alignment is actually performed; the results are only rearranged for the subsequent metrics computation. The resulting structures will appear in inference_results_folder/<baseline_method_name>_<pocket_alignment_type>.
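The per-method configuration inside the script might look like the following. This is a hypothetical shape for illustration only; the actual structure of methods_data in compute_aligned_rmsd.py may differ, so check the script itself:

```python
# Hypothetical illustration of the per-method flag described above.
methods_data = {
    "alphafold3": {"has_predicted_proteins": True},   # predicted protein: pocket alignment
    "diffdock": {"has_predicted_proteins": False},    # rigid docking: base
}
dataset_names = ["astex", "posebusters"]

# Methods whose proteins differ from the holo structure get pocket alignment.
pocket_aligned = [m for m, cfg in methods_data.items() if cfg["has_predicted_proteins"]]
```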

After aligning the predicted structures to the original holo protein structure, metrics from the best SDF predictions can be computed with:

uv run python scripts/compute_metrics_from_sdf.py -p configs/paths/paths.yaml -n <baseline_method_name>_<pocket_alignment_type> --prediction-type best_base_predictions

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

CC BY-NC 4.0

Citation

If you use Matcha in your work, please cite:

@misc{frolova2025matchamultistageriemannianflow,
      title={Matcha: Multi-Stage Riemannian Flow Matching for Accurate and Physically Valid Molecular Docking}, 
      author={Daria Frolova and Talgat Daulbaev and Egor Sevriugov and Sergei A. Nikolenko and Dmitry N. Ivankov and Ivan Oseledets and Marina A. Pak},
      year={2025},
      eprint={2510.14586},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.14586}, 
}
