A lightweight toolkit for cleaning and standardizing organic reaction datasets.
Reason this release was yanked:
This is a premature version.
chemrxn-cleaner
A lightweight toolkit for loading, cleaning, and standardizing organic reaction datasets before machine-learning or analytics workflows.
Prerequisites
- Python 3.9+
Package dependencies are pulled in automatically when installing chemrxn-cleaner, with the exception of platform-specific RDKit wheels.
Installation
pip install chemrxn-cleaner
If you are working from a clone of this repository, install in editable mode to develop locally:
pip install -e .
How to Use the Package
ChemRxn-Cleaner aims to make cleaning reproducible. A typical workflow has five steps:
- Load reaction SMILES and metadata from USPTO .rsmi files or ORD protocol bundles.
- Parse the SMILES strings into structured ReactionRecord objects.
- Apply built-in or custom filters (length, element, metadata predicates, etc.).
- Canonicalize the surviving reactions to obtain consistent representations.
- Summarize the cleaning results or export the cleaned reactions downstream.
1. Loading Reaction Data
from chemrxn_cleaner.loader import load_uspto_rsmi, load_ord_pb_reaction_smiles
# USPTO .rsmi loader (metadata fields stored in meta["fields"])
uspto_rxns = load_uspto_rsmi("data/uspto_sample.rsmi", keep_meta=True)
# ORD dataset loader with optional metadata extraction
from chemrxn_cleaner.extractor import ord_procedure_yields_meta
ord_rxns = load_ord_pb_reaction_smiles(
    "data/ord_dataset.pb.gz",
    meta_extractor=ord_procedure_yields_meta,
)
Both loaders return lists of (reaction_smiles, metadata_dict) tuples which feed directly into the cleaning utilities.
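To make the tuple shape concrete, the sketch below builds one (reaction_smiles, metadata_dict) pair by hand and splits the reaction SMILES into its three ">"-separated roles (reactants, reagents, products). It is illustrative only: split_reaction_smiles is a hypothetical helper and not part of chemrxn-cleaner, the sample reaction and the "source" metadata key are invented, and the function does not handle trailing fields or other extensions that real dataset lines may carry.

```python
from typing import Dict, List, Tuple


def split_reaction_smiles(rxn_smiles: str) -> Tuple[List[str], List[str], List[str]]:
    """Split 'reactants>reagents>products' into three lists of molecule SMILES."""
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got {rxn_smiles!r}")
    # Molecules within a role are separated by '.'; an empty role yields [].
    return tuple([mol for mol in part.split(".") if mol] for part in parts)


# Hand-written stand-in for one loader result (assumed shape, invented content).
sample: Tuple[str, Dict] = (
    "CC(=O)O.OCC>[H+]>CC(=O)OCC.O",
    {"source": "hand-written example"},
)

rxn_smiles, meta = sample
reactants, reagents, products = split_reaction_smiles(rxn_smiles)
print(reactants, reagents, products)
```

The package's own parser presumably does more than this (for example, validating each molecule), so treat the helper as a way to inspect raw loader output rather than a replacement for the pipeline.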
2. Running the Built-in Cleaning Pipeline
from chemrxn_cleaner import basic_cleaning_pipeline
cleaned_ord = basic_cleaning_pipeline(ord_rxns)
basic_cleaning_pipeline parses, filters, and canonicalizes every reaction using the default filter stack (has_product, all_molecules_valid). The result is a list of immutable ReactionRecord instances with reactants, reagents, products, and meta attributes.
3. Custom Filters and Canonicalization Options
The cleaning helpers are composable; you can control the filter order and canonicalization behavior explicitly:
from chemrxn_cleaner.cleaning import clean_and_canonicalize
from chemrxn_cleaner.filters import (
    default_filters,
    max_smiles_length,
    element_filter,
    meta_filter,
)
from chemrxn_cleaner.types import ElementFilterRule
filters = default_filters() + [
    max_smiles_length(250),
    element_filter(
        forbidList=ElementFilterRule(
            reactantElements=[],
            reagentElements=[],
            productElements=["Cl", "Br"],
        )
    ),
    meta_filter(lambda meta: meta.get("procedure", {}).get("setup.atmosphere") == "N2"),
]
cleaned_custom = clean_and_canonicalize(
    ord_rxns,
    filters=filters,
    isomeric=False,  # drop stereochemistry if desired
    drop_failed_parse=True,
)
Filters are simple callables accepting a ReactionRecord and returning True/False, so you can author domain-specific predicates without touching the core pipeline.
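As a sketch of such a predicate, the example below defines a filter factory that rejects reactions with no products or with too many. The real ReactionRecord class is not reproduced here, so a frozen dataclass named FakeRecord stands in for it; its attribute names (reactants, reagents, products, meta) are an assumption based on the description of ReactionRecord earlier in this README. max_products is a hypothetical filter, not one shipped with the package.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FakeRecord:
    """Stand-in for ReactionRecord (attribute names are assumed)."""
    reactants: List[str]
    reagents: List[str]
    products: List[str]
    meta: Dict = field(default_factory=dict)


def max_products(limit: int) -> Callable[[FakeRecord], bool]:
    """Return a filter that keeps records with between 1 and `limit` products."""
    def _filter(record: FakeRecord) -> bool:
        return 0 < len(record.products) <= limit
    return _filter


records = [
    FakeRecord(["CC(=O)O", "OCC"], ["[H+]"], ["CC(=O)OCC", "O"]),  # two products
    FakeRecord(["CCO"], [], ["CC=O"]),                             # one product
    FakeRecord(["CCO"], [], []),                                   # no product
]

keep = max_products(1)
kept = [rec for rec in records if keep(rec)]
print(len(kept))
```

If the real record type exposes the same attributes, a predicate built this way could be appended to the filters list passed to clean_and_canonicalize.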
4. Working with Metadata
Metadata travels with each ReactionRecord through the pipeline. This makes it easy to slice cleaned reactions based on the original dataset context or to serialize extra descriptors:
subset = [
    rec for rec in cleaned_custom
    if rec.meta.get("yields") and rec.meta["yields"][0]["yield_percent"] > 80
]
Custom metadata extractors (see chemrxn_cleaner/extractor.py) can capture procedure notes, yields, or any other ORD fields you care about.
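The yield comparison above assumes that the first entry under "yields" always carries a numeric yield_percent. A more defensive variant is sketched below. best_yield is a hypothetical helper, and the metadata layout (a list of dicts with a yield_percent key) is inferred from the snippet above rather than from the package documentation. It also takes the highest reported yield instead of the first entry, which is a choice of this sketch and may not suit every dataset.

```python
from typing import Any, Dict, Optional


def best_yield(meta: Dict[str, Any]) -> Optional[float]:
    """Return the highest numeric yield_percent in meta["yields"], or None."""
    values = []
    for entry in meta.get("yields") or []:
        value = entry.get("yield_percent") if isinstance(entry, dict) else None
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            values.append(float(value))
    return max(values) if values else None


# Invented metadata dicts covering the awkward cases.
metas = [
    {"yields": [{"yield_percent": 92.5}]},
    {"yields": [{"yield_percent": None}, {"yield_percent": 40}]},
    {"yields": []},
    {},
]

high = [m for m in metas if (best_yield(m) or 0.0) > 80]
print(len(high))
```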
5. Reporting and Exporting
from chemrxn_cleaner import reporting
report = reporting.summarize_cleaning(ord_rxns, cleaned_custom)
report.pretty_print()
# Export canonical reaction SMILES + metadata for downstream use
import json
with open("cleaned_ord.jsonl", "w", encoding="utf-8") as f:
    for rec in cleaned_custom:
        payload = {
            "reactants": rec.reactants,
            "reagents": rec.reagents,
            "products": rec.products,
            "meta": rec.meta,
        }
        f.write(json.dumps(payload) + "\n")
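Reading the export back is symmetric. The sketch below round-trips two hand-written payloads through a JSON Lines file using only the standard library. The payload contents and the temporary file are illustrative, and load_jsonl is a hypothetical helper rather than part of chemrxn-cleaner.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Invented payloads with the same keys as the export loop above.
payloads = [
    {"reactants": ["CCO"], "reagents": [], "products": ["CC=O"], "meta": {"id": 1}},
    {"reactants": ["CC(=O)O", "OCC"], "reagents": ["[H+]"],
     "products": ["CC(=O)OCC", "O"], "meta": {"id": 2}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for payload in payloads:
            f.write(json.dumps(payload) + "\n")
    restored = load_jsonl(path)

print(restored == payloads)
```

Note that json.dumps only accepts JSON-serializable values, so metadata holding custom objects would need to be converted to plain dicts, lists, strings, and numbers before export.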
Quick Start
from chemrxn_cleaner.loader import load_uspto_rsmi
from chemrxn_cleaner import basic_cleaning_pipeline, reporting
rxns = load_uspto_rsmi("/path/to/file.rsmi", keep_meta=True)
cleaned = basic_cleaning_pipeline(rxns)
report = reporting.summarize_cleaning(rxns, cleaned)
report.pretty_print()
End-to-End Example Script
The repository ships with examples/clean_data_example.py, a runnable walkthrough that ties everything together:
- Load USPTO .rsmi files and ORD .pb.gz datasets.
- Apply default + custom filters (length, element whitelist/blacklist, metadata predicates).
- Generate cleaning reports.
- Persist cleaned reactions to disk.
Run it with:
python examples/clean_data_example.py
Use this script as a template: swap in your file paths, tweak the filter stack, and tailor the export code to match the format your downstream tools expect.