A lightweight toolkit for cleaning and standardizing organic reaction datasets.

Reason this release was yanked:

This is a pre-mature version.

Project description

chemrxn-cleaner

A lightweight toolkit for loading, cleaning, and standardizing organic reaction datasets before machine-learning or analytics workflows.

Prerequisites

Python 3.9–3.10 (Python 3.12+ currently has RDKit compatibility issues)
RDKit (installable from rdkit-pypi on PyPI)
Open Reaction Database schema utilities: pip install ord-schema

These dependencies are pulled in automatically when installing chemrxn-cleaner, with the exception of platform-specific RDKit wheels.

Installation

pip install chemrxn-cleaner

If you are working from a clone of this repository, install in editable mode to develop locally:

pip install -e .

How to Use the Package

ChemRxn-Cleaner aims to make cleaning reproducible. A typical workflow has five steps:

Load reaction SMILES and metadata from USPTO .rsmi files or ORD protocol buffer (.pb.gz) datasets.
Parse the SMILES strings into structured ReactionRecord objects.
Apply built-in or custom filters (length, element, metadata predicates, etc.).
Canonicalize the surviving reactions to obtain consistent representations.
Summarize the cleaning results or export the cleaned reactions downstream.
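
To make the parse step concrete, here is a minimal standard-library sketch of splitting a reaction SMILES string of the form reactants>reagents>products into per-role molecule lists. SketchRecord and split_reaction_smiles are hypothetical names used only for illustration; they are not part of chemrxn-cleaner, and the library's own parser may validate molecules or handle extensions that this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class SketchRecord:
    """Hypothetical stand-in for ReactionRecord (not the library's class)."""
    reactants: List[str]
    reagents: List[str]
    products: List[str]
    meta: Dict[str, Any] = field(default_factory=dict)


def split_reaction_smiles(rxn_smiles: str, meta: Dict[str, Any]) -> SketchRecord:
    """Split 'reactants>reagents>products' into per-role molecule lists."""
    parts = rxn_smiles.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators in {rxn_smiles!r}")
    # Molecules within one role are separated by '.'; an empty role yields [].
    roles = [[mol for mol in part.split(".") if mol] for part in parts]
    return SketchRecord(roles[0], roles[1], roles[2], dict(meta))


record = split_reaction_smiles("CC(=O)O.OCC>[H+]>CC(=O)OCC.O", {"source": "demo"})
print(record.reactants)  # ['CC(=O)O', 'OCC']
print(record.reagents)   # ['[H+]']
print(record.products)   # ['CC(=O)OCC', 'O']
```

The sketch performs no chemical validation; it only shows the structural split that the later filtering and canonicalization steps build on.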

1. Loading Reaction Data

from chemrxn_cleaner.extractor import ord_procedure_yields_meta
from chemrxn_cleaner.loader import load_uspto_rsmi, load_ord_pb_reaction_smiles

# USPTO .rsmi loader (metadata fields stored in meta["fields"])
uspto_rxns = load_uspto_rsmi("data/uspto_sample.rsmi", keep_meta=True)

# ORD dataset loader with optional metadata extraction
ord_rxns = load_ord_pb_reaction_smiles(
    "data/ord_dataset.pb.gz",
    meta_extractor=ord_procedure_yields_meta,
)

Both loaders return lists of (reaction_smiles, metadata_dict) tuples, which feed directly into the cleaning utilities.
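
The (reaction_smiles, metadata_dict) shape can be illustrated without the library. The sketch below reads a small tab-separated sample and produces tuples of that shape. It assumes the reaction SMILES sits in the first column and that the remaining columns are stored as a dict under meta["fields"]; read_rsmi_sketch, the column names, and the sample rows are invented for illustration and may not match how load_uspto_rsmi actually lays out its metadata.

```python
import csv
import io
from typing import Dict, List, Tuple

# Invented sample data; assumed layout: tab-separated, reaction SMILES first.
SAMPLE_RSMI = (
    "ReactionSmiles\tPatentNumber\tYear\n"
    "CC(=O)O.OCC>>CC(=O)OCC\tUS0000001\t1990\n"
    "BrCCBr>>C=C\tUS0000002\t1991\n"
)


def read_rsmi_sketch(handle) -> List[Tuple[str, Dict[str, object]]]:
    """Hypothetical reader that mimics only the documented return shape."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    pairs = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        smiles, extra = row[0], row[1:]
        # Assumption: extra columns are keyed by their header names.
        meta = {"fields": dict(zip(header[1:], extra))}
        pairs.append((smiles, meta))
    return pairs


pairs = read_rsmi_sketch(io.StringIO(SAMPLE_RSMI))
for smiles, meta in pairs:
    print(smiles, meta["fields"]["Year"])
```

Because each element is a plain tuple, the data can be inspected or pre-filtered with ordinary list comprehensions before it reaches the cleaning utilities.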

2. Running the Built-in Cleaning Pipeline

from chemrxn_cleaner import basic_cleaning_pipeline

cleaned_ord = basic_cleaning_pipeline(ord_rxns)

basic_cleaning_pipeline parses, filters, and canonicalizes every reaction using the default filter stack (has_product, all_molecules_valid). The result is a list of immutable ReactionRecord instances with reactants, reagents, products, and meta attributes.
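
What "immutable" means in practice can be shown with a frozen dataclass. This is only a sketch: FrozenRecord is a hypothetical stand-in, and whether ReactionRecord is implemented as a frozen dataclass is an assumption, not something the package documentation states.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class FrozenRecord:
    """Hypothetical stand-in; the real ReactionRecord may differ."""
    reactants: Tuple[str, ...]
    products: Tuple[str, ...]


rec = FrozenRecord(reactants=("CCO",), products=("CC=O",))

try:
    rec.products = ("CC(=O)O",)  # attribute assignment is rejected
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# To "change" a record, build a new one and leave the original untouched.
updated = replace(rec, products=("CC(=O)O",))
print(mutated, rec.products, updated.products)
```

Treating records as read-only keeps the pipeline input and output comparable, which is what a before/after cleaning summary relies on.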

3. Custom Filters and Canonicalization Options

The cleaning helpers are composable; you can control the filter order and canonicalization behavior explicitly:

from chemrxn_cleaner.cleaning import clean_and_canonicalize
from chemrxn_cleaner.filters import (
    default_filters,
    max_smiles_length,
    element_filter,
    meta_filter,
)
from chemrxn_cleaner.types import ElementFilterRule

filters = default_filters() + [
    max_smiles_length(250),
    element_filter(
        forbidList=ElementFilterRule(
            reactantElements=[],
            reagentElements=[],
            productElements=["Cl", "Br"],
        )
    ),
    meta_filter(lambda meta: meta.get("procedure", {}).get("setup.atmosphere") == "N2"),
]

cleaned_custom = clean_and_canonicalize(
    ord_rxns,
    filters=filters,
    isomeric=False,      # drop stereochemistry if desired
    drop_failed_parse=True,
)

Filters are simple callables accepting a ReactionRecord and returning True/False, so you can author domain-specific predicates without touching the core pipeline.
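
As a sketch of that contract, the example below defines two predicates and composes them. DemoRecord, max_product_count, no_listed_metal_reagents, and apply_filters are hypothetical names that only mirror the "callable taking a record, returning a bool" shape described above; in real use the predicates would receive the library's ReactionRecord and be passed via the filters argument.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class DemoRecord:
    """Hypothetical stand-in exposing the attributes a filter would read."""
    reactants: List[str]
    reagents: List[str]
    products: List[str]


Filter = Callable[[DemoRecord], bool]


def max_product_count(limit: int) -> Filter:
    """Factory that returns a predicate, in the style of max_smiles_length(250)."""
    def _predicate(record: DemoRecord) -> bool:
        return len(record.products) <= limit
    return _predicate


def no_listed_metal_reagents(record: DemoRecord) -> bool:
    """Naive substring check; a real filter would inspect parsed atoms."""
    return not any(tag in smi for smi in record.reagents for tag in ("[Pd]", "[Pt]"))


def apply_filters(records: Iterable[DemoRecord], filters: List[Filter]) -> List[DemoRecord]:
    """Keep a record only if every filter returns True."""
    return [r for r in records if all(f(r) for f in filters)]


records = [
    DemoRecord(["CCO"], ["[Pd]"], ["CC=O"]),
    DemoRecord(["CCO"], [], ["CC=O"]),
    DemoRecord(["CCO"], [], ["CC=O", "O", "[H][H]"]),
]
kept = apply_filters(records, [max_product_count(2), no_listed_metal_reagents])
print(len(kept))  # 1
```

Using a factory for parameterized filters keeps each predicate a one-argument callable, so parameterized and fixed filters can sit in the same list.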

4. Working with Metadata

Metadata travels with each ReactionRecord through the pipeline. This makes it easy to slice cleaned reactions based on the original dataset context or to serialize extra descriptors:

subset = [
    rec for rec in cleaned_custom
    if rec.meta.get("yields") and rec.meta["yields"][0]["yield_percent"] > 80
]

Custom metadata extractors (see chemrxn_cleaner/extractor.py) can capture procedure notes, yields, or any other ORD fields you care about.
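
Real datasets often have reactions with missing or empty yield entries, so a small helper can make metadata-based slicing more robust. The sketch below assumes the same meta["yields"][0]["yield_percent"] layout used in the comprehension above; first_yield_percent is a hypothetical helper, not part of the package.

```python
from typing import Any, Dict, Optional


def first_yield_percent(meta: Dict[str, Any]) -> Optional[float]:
    """Return the first recorded yield as a float, or None if unavailable."""
    yields = meta.get("yields") or []
    if not yields:
        return None
    value = yields[0].get("yield_percent")
    return float(value) if value is not None else None


# Invented metadata dicts covering the common missing-value cases.
metas = [
    {"yields": [{"yield_percent": 92.0}]},
    {"yields": [{"yield_percent": None}]},
    {"yields": []},
    {},
    {"yields": [{"yield_percent": 41.5}]},
]

high = [m for m in metas if (y := first_yield_percent(m)) is not None and y > 80]
print(len(high))  # 1
```

Returning None instead of raising lets the caller decide whether reactions without a yield should be dropped or kept.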

5. Reporting and Exporting

import json

from chemrxn_cleaner import reporting

report = reporting.summarize_cleaning(ord_rxns, cleaned_custom)
report.pretty_print()

# Export canonical reaction SMILES + metadata for downstream use
with open("cleaned_ord.jsonl", "w", encoding="utf-8") as f:
    for rec in cleaned_custom:
        payload = {
            "reactants": rec.reactants,
            "reagents": rec.reagents,
            "products": rec.products,
            "meta": rec.meta,
        }
        f.write(json.dumps(payload) + "\n")
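
Downstream tools can read the exported file back one JSON object per line. The following standard-library sketch writes two invented rows in the same payload layout to a temporary file and reloads them; the rows and the file name are illustrative only.

```python
import json
import tempfile
from pathlib import Path

# Invented rows using the same keys as the export payload above.
rows = [
    {"reactants": ["CCO"], "reagents": [], "products": ["CC=O"], "meta": {"source": "demo"}},
    {"reactants": ["CC(=O)O", "OCC"], "reagents": ["[H+]"], "products": ["CC(=O)OCC", "O"], "meta": {}},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cleaned_demo.jsonl"

    # Write one JSON document per line (JSON Lines).
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Read it back, skipping any blank lines.
    with path.open(encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == rows)  # True
```

Note that json.dumps only handles JSON-compatible values; if a metadata extractor stores other objects in meta, they would need converting before export.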

Quick Start

from chemrxn_cleaner.loader import load_uspto_rsmi
from chemrxn_cleaner import basic_cleaning_pipeline, reporting

rxns = load_uspto_rsmi("/path/to/file.rsmi", keep_meta=True)
cleaned = basic_cleaning_pipeline(rxns)

report = reporting.summarize_cleaning(rxns, cleaned)
report.pretty_print()

End-to-End Example Script

The repository ships with examples/clean_data_example.py, a runnable walkthrough that ties everything together:

Load USPTO .rsmi files and ORD .pb.gz datasets.
Apply default + custom filters (length, element whitelist/blacklist, metadata predicates).
Generate cleaning reports.
Persist cleaned reactions to disk.

Run it with:

python examples/clean_data_example.py

Use this script as a template: swap in your file paths, tweak the filter stack, and tailor the export code to match the format your downstream tools expect.

Project details

Release history

0.1.0 (Dec 11, 2025)
0.0.4 (Nov 16, 2025), yanked. Reason: This is a pre-mature version.
0.0.3 (Nov 16, 2025), yanked; this version. Reason: This is a pre-mature version.

Download files

Source Distribution

chemrxn_cleaner-0.0.3.tar.gz (14.1 kB)

Uploaded Nov 16, 2025 Source

Built Distribution

chemrxn_cleaner-0.0.3-py3-none-any.whl (13.7 kB)

Uploaded Nov 16, 2025 Python 3

File details

Details for the file chemrxn_cleaner-0.0.3.tar.gz.

File metadata

Download URL: chemrxn_cleaner-0.0.3.tar.gz
Upload date: Nov 16, 2025
Size: 14.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for chemrxn_cleaner-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`fadb49621b229c105e88835fbe86e8adfecccd7089c85b89c5cc5aa0c7a1361b`
MD5	`b97791f4665236587692e0f2bc6d5d4d`
BLAKE2b-256	`56ea62a230902743088b4ff390edf0230d3796bcf2348e1cefff879c9310fb4d`

File details

Details for the file chemrxn_cleaner-0.0.3-py3-none-any.whl.

File metadata

Download URL: chemrxn_cleaner-0.0.3-py3-none-any.whl
Upload date: Nov 16, 2025
Size: 13.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.19

File hashes

Hashes for chemrxn_cleaner-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a1650f7e60c6b1b2d076ebebf44ac70f1e53b0af8bd7c8e714875e6678511edc`
MD5	`f360f32999fce63222a16d29727cb3dd`
BLAKE2b-256	`066ecd93e4497badd726e5b35f6433466b10703522c621661e2f7c0c475ecc17`
