A framework for ingesting, validating, canonicalizing, and adapting retrosynthesis model outputs to a unified benchmark standard.
RetroCast: A Unified Format for Multistep Retrosynthesis
The Problem
Every multistep retrosynthesis model returns routes in a different format. AiZynthFinder uses bipartite molecule-reaction graphs. Retro* outputs precursor maps. DirectMultiStep produces recursive dictionaries. SynPlanner has its own schema. This fragmentation makes working with routes unnecessarily difficult.
The Solution
RetroCast provides:
- A canonical data model for retrosynthesis routes (schemas.py) - a simple, recursive Molecule/ReactionStep/Route structure that any model output can be cast into.
- Tested adapters for every major model - AiZynthFinder, Retro*, DirectMultiStep, SynPlanner, Syntheseus, ASKCOS, RetroChimera, DreamRetro, MultiStepTTL, SynLlama, PaRoutes (14 adapters and counting).
- Reproducible infrastructure - UV-managed dependencies with conflict resolution, locked versions, and deterministic processing with cryptographic hashing.
- Curated evaluation sets - Subsets of the PaRoutes n=1 and n=5 test sets (100, 200, 500, 1k, 2k targets) designed to preserve statistical properties while enabling faster benchmarking.
Quick Start
Install
git clone https://github.com/ischemist/project-procrustes
cd project-procrustes
No need to manage virtual environments - UV handles everything.
Run Any Model in Three Commands
Example: AiZynthFinder with MCTS
# 1. Download model assets (once)
uv run scripts/aizynthfinder/1-download-assets.py data/models/aizynthfinder
# 2. Prepare stock file (once)
uv run --extra aizyn scripts/aizynthfinder/2-prepare-stock.py \
    --files data/models/assets/retrocast-bb-stock-v3-canon.csv \
    --source plain \
    --output data/models/assets/retrocast-bb-stock-v3.hdf5 \
    --target hdf5
# 3. Run predictions
uv run --extra aizyn scripts/aizynthfinder/3-run-aizyn-mcts.py --target-name "uspto-190"
Example: DirectMultiStep
# 1. Download model checkpoint
bash scripts/directmultistep/1-download-assets.sh
# 2. Run predictions
uv run --extra dms --extra torch-gpu scripts/directmultistep/2-run-dms.py \
    --model-name "explorer-xl" \
    --use-fp16 \
    --target-name "uspto-190"
Each model follows the same pattern: numbered scripts in scripts/<model-name>/. UV automatically handles conflicting dependencies (PyTorch versions, NumPy pinning, etc.) via optional dependency groups.
Convert to Unified Format
Once you have raw model outputs, convert them to the canonical RetroCast format:
# Process a single model run
uv run scripts/process-predictions.py process --model aizynthfinder-mcts --dataset uspto-190
# List available models
uv run scripts/process-predictions.py list
# Show configuration for a specific model
uv run scripts/process-predictions.py info --model directmultistep
This will:
- Validate the raw output using model-specific schemas
- Transform it via the appropriate adapter to Route objects
- Deduplicate routes
- Save canonical output with a deterministic hash
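The deduplication and hashing steps can be illustrated with a minimal, standard-library sketch. It is not RetroCast's implementation: the helper names (`route_signature`, `deduplicate`, `content_hash`) are hypothetical, routes are represented as plain nested dictionaries, and the SMILES strings are assumed to be canonical already.

```python
import hashlib
import json


def route_signature(node: dict) -> str:
    """Order-independent signature of a nested {"smiles", "children"} route."""
    children = sorted(route_signature(child) for child in node.get("children", []))
    return f"{node['smiles']}({','.join(children)})"


def deduplicate(routes: list) -> list:
    """Keep the first occurrence of each distinct route, preserving input order."""
    seen = set()
    unique = []
    for route in routes:
        sig = route_signature(route)
        if sig not in seen:
            seen.add(sig)
            unique.append(route)
    return unique


def content_hash(routes: list) -> str:
    """SHA-256 over the sorted route signatures, so input order does not matter."""
    payload = json.dumps(sorted(route_signature(r) for r in routes))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = {
    "smiles": "CC(=O)Oc1ccccc1C(=O)O",
    "children": [
        {"smiles": "Oc1ccccc1C(=O)O", "children": []},
        {"smiles": "CC(=O)Cl", "children": []},
    ],
}
# The same route with its reactants listed in the opposite order.
b = {"smiles": a["smiles"], "children": list(reversed(a["children"]))}

unique = deduplicate([a, b])
print(len(unique), content_hash(unique)[:12])
```

Sorting the child signatures makes two routes that differ only in reactant order collapse to one entry, and hashing the sorted signatures gives the same digest regardless of the order in which routes were produced.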
Use as a Python Library
You can also use RetroCast programmatically to adapt individual routes from any supported model:
from retrocast import adapt_single_route, TargetIdentity
# Define your target
target = TargetIdentity(id="aspirin", smiles="CC(=O)Oc1ccccc1C(=O)O")
# Your model's raw prediction (e.g., DMS format)
raw_route = {
    "smiles": "CC(=O)Oc1ccccc1C(=O)O",
    "children": [
        {"smiles": "Oc1ccccc1C(=O)O", "children": []},
        {"smiles": "CC(=O)Cl", "children": []}
    ]
}
# Adapt to unified format - works with both route-centric (DMS, AiZynth)
# and target-centric (RetroChimera, ASKCOS) adapter formats
route = adapt_single_route(raw_route, target, adapter_name="dms")
if route:
    print(f"Route depth: {route.length}")
    print(f"Starting materials: {len(route.leaves)}")
See docs/api_usage.md for complete API documentation and examples.
Available Models
Adapters are implemented and tested for:
- AiZynthFinder (MCTS, Retro*)
- Retro* (original implementation)
- DirectMultiStep (Flash, Explorer variants)
- SynPlanner
- Syntheseus (BFS, Retro-0)
- ASKCOS
- RetroChimera
- DreamRetro
- MultiStepTTL
- SynLlama
- PaRoutes
See retrocast-config.yaml for full configuration details.
Evaluation Sets
We provide the following evaluation sets, including curated subsets of the PaRoutes benchmark:
- uspto-190: Full USPTO test set (190 targets)
- paroutes-n1-{100,200,500,1k,2k}: Stratified subsets of the n=1 test set
- paroutes-n5-{100,200,500,1k,2k}: Stratified subsets of the n=5 test set
Each subset is:
- Hashed for reproducibility
- Balanced across route lengths and complexities
- Small enough for rapid iteration (100 targets ~10min vs 10k targets ~10hrs)
Subsets are selected such that top-k accuracy on the subset is within 0.05-1% of the full set, depending on size.
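How the subsets are actually selected is not described here. As a rough illustration of the idea only, the sketch below draws a seeded sample that keeps the share of each route length the same as in the full set. The function name `stratified_subset`, the choice of route length as the sole stratification variable, and the toy data are all assumptions.

```python
import random
from collections import defaultdict


def stratified_subset(lengths: dict, size: int, seed: int = 0) -> list:
    """Pick `size` target ids, sampling each route-length bucket proportionally."""
    if size > len(lengths):
        raise ValueError("size exceeds the number of available targets")

    # Group target ids by route length; sorting keeps the result reproducible.
    buckets = defaultdict(list)
    for target_id in sorted(lengths):
        buckets[lengths[target_id]].append(target_id)

    # Largest-remainder allocation so the quotas sum to exactly `size`.
    total = len(lengths)
    exact = {length: size * len(ids) / total for length, ids in buckets.items()}
    quota = {length: int(share) for length, share in exact.items()}
    leftover = size - sum(quota.values())
    by_remainder = sorted(exact, key=lambda length: (quota[length] - exact[length], length))
    for length in by_remainder[:leftover]:
        quota[length] += 1

    rng = random.Random(seed)  # fixed seed -> same subset on every run
    chosen = []
    for length in sorted(buckets):
        chosen.extend(rng.sample(buckets[length], quota[length]))
    return sorted(chosen)


# Toy data: six targets with 2-step routes and four with 4-step routes.
lengths = {f"t{i}": (2 if i < 6 else 4) for i in range(10)}
subset = stratified_subset(lengths, size=5, seed=42)
print(subset)
```

With these toy inputs the 60/40 split between route lengths carries over to the subset (three 2-step targets, two 4-step targets), and rerunning with the same seed returns the same ids.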
The Canonical Format
At the core of RetroCast is a clean recursive schema (src/retrocast/schemas.py):
class Molecule(BaseModel):
    smiles: SmilesStr
    inchikey: InchiKeyStr
    synthesis_step: ReactionStep | None  # None = leaf (starting material)
    metadata: dict[str, Any]

class ReactionStep(BaseModel):
    reactants: list[Molecule]
    mapped_smiles: ReactionSmilesStr | None
    template: str | None
    reagents: list[SmilesStr] | None
    solvents: list[SmilesStr] | None
    metadata: dict[str, Any]

class Route(BaseModel):
    target: Molecule
    rank: int
    solvability: dict[str, bool]  # per building block set
    metadata: dict[str, Any]
Every route from every model gets cast into this structure. No ambiguity, no special cases.
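To show how the recursion works in practice, here is a stripped-down stand-in built on standard-library dataclasses rather than the Pydantic models in schemas.py. Only the `smiles`, `synthesis_step`, and `reactants` fields are kept, and the `depth` and `leaves` helpers are hypothetical illustrations, not part of the RetroCast API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Molecule:
    smiles: str
    synthesis_step: ReactionStep | None = None  # None marks a starting material


@dataclass
class ReactionStep:
    reactants: list[Molecule] = field(default_factory=list)


def depth(mol: Molecule) -> int:
    """Number of reaction steps on the longest path from `mol` down to a leaf."""
    if mol.synthesis_step is None:
        return 0
    return 1 + max((depth(r) for r in mol.synthesis_step.reactants), default=0)


def leaves(mol: Molecule) -> list[Molecule]:
    """Starting materials reachable from `mol`, in depth-first order."""
    if mol.synthesis_step is None:
        return [mol]
    return [leaf for r in mol.synthesis_step.reactants for leaf in leaves(r)]


# One-step route: aspirin from salicylic acid and acetyl chloride.
target = Molecule(
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    synthesis_step=ReactionStep(
        reactants=[Molecule("Oc1ccccc1C(=O)O"), Molecule("CC(=O)Cl")]
    ),
)

print(depth(target))                        # 1
print([m.smiles for m in leaves(target)])   # ['Oc1ccccc1C(=O)O', 'CC(=O)Cl']
```

Because a molecule either carries a reaction step or is a leaf, any traversal (depth, leaf collection, stock checks) reduces to the same two-branch recursion.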
Architecture
RetroCast is built on three principles:
- Adapters are the air gap - All model-specific logic is isolated in pluggable adapters. The core pipeline never touches raw formats directly.
- Contracts, not handshakes - Pydantic schemas enforce validation at every boundary. Invalid data is rejected immediately.
- Deterministic & auditable - Every output is identified by a cryptographic hash of its inputs. Results are reproducible and traceable.
The pipeline:
load raw data → adapter → Route → deduplicate → save + manifest
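The adapter stage of this pipeline can be pictured as a registry of functions keyed by model name. The sketch below is an assumption-laden illustration, not RetroCast's adapter interface: the `register` decorator, the `ADAPTERS` table, and both raw formats (a nested dictionary and a flat precursor map) are invented for the example.

```python
from typing import Any, Callable, Optional

# Hypothetical adapter signature: raw model output in, nested route dict out,
# or None when the raw data cannot be interpreted.
Adapter = Callable[[Any], Optional[dict]]
ADAPTERS: dict = {}


def register(name: str) -> Callable[[Adapter], Adapter]:
    """Decorator that files an adapter under `name` in the registry."""
    def decorator(fn: Adapter) -> Adapter:
        ADAPTERS[name] = fn
        return fn
    return decorator


@register("nested")
def adapt_nested(raw: Any) -> Optional[dict]:
    """Recursive {"smiles", "children"} dictionaries, checked node by node."""
    if not isinstance(raw, dict) or "smiles" not in raw:
        return None
    children = [adapt_nested(child) for child in raw.get("children", [])]
    if any(child is None for child in children):
        return None
    return {"smiles": raw["smiles"], "children": children}


@register("precursor-map")
def adapt_precursor_map(raw: Any) -> Optional[dict]:
    """Flat {"target": ..., "precursors": {product: [reactants]}} mapping (acyclic)."""
    if not isinstance(raw, dict) or "target" not in raw:
        return None
    precursors = raw.get("precursors", {})

    def build(smiles: str) -> dict:
        return {"smiles": smiles, "children": [build(s) for s in precursors.get(smiles, [])]}

    return build(raw["target"])


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
from_nested = ADAPTERS["nested"]({
    "smiles": aspirin,
    "children": [{"smiles": "Oc1ccccc1C(=O)O"}, {"smiles": "CC(=O)Cl"}],
})
from_map = ADAPTERS["precursor-map"]({
    "target": aspirin,
    "precursors": {aspirin: ["Oc1ccccc1C(=O)O", "CC(=O)Cl"]},
})
print(from_nested == from_map)  # True
```

Everything downstream of the registry lookup sees a single shape, which is the point of isolating format-specific logic in the adapters.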
See docs/adapters.md for details on adding new adapters.
Citation
If you use RetroCast in your research, please cite:
# ArXiv citation - TODO: add link
License
MIT License - see LICENSE for details.