
A lightweight toolkit for cleaning and standardizing organic reaction datasets.

Reason this release was yanked:

This is a premature version.


chemrxn-cleaner

A lightweight toolkit for loading, cleaning, and standardizing organic reaction datasets before machine-learning or analytics workflows.

Prerequisites

  • Python 3.9–3.10 (Python 3.12+ currently has RDKit compatibility issues)
  • RDKit (installable from rdkit-pypi on PyPI)
  • Open Reaction Database schema utilities: pip install ord-schema

These dependencies are pulled in automatically when installing chemrxn-cleaner, with the exception of platform-specific RDKit wheels.
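
Before installing, it can help to confirm that the interpreter falls in the supported range and that RDKit is importable. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical and not part of chemrxn-cleaner, and the version bounds simply mirror the list above.

```python
import importlib.util
import sys

# Hypothetical helper, not part of chemrxn-cleaner.
# Bounds mirror the Python range listed in the prerequisites.
SUPPORTED_MIN = (3, 9)
SUPPORTED_MAX = (3, 10)


def check_environment(version=None):
    """Return (python_ok, rdkit_found) for the given or running interpreter."""
    major_minor = tuple(version or sys.version_info[:2])
    python_ok = SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX
    # find_spec returns None when the top-level package is not installed.
    rdkit_found = importlib.util.find_spec("rdkit") is not None
    return python_ok, rdkit_found


python_ok, rdkit_found = check_environment()
print(f"Python version supported: {python_ok}")
print(f"RDKit importable: {rdkit_found}")
```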

Installation

pip install chemrxn-cleaner

If you are working from a clone of this repository, install in editable mode to develop locally:

pip install -e .

How to Use the Package

ChemRxn-Cleaner aims to make cleaning reproducible. A typical workflow has five steps:

  1. Load reaction SMILES and metadata from USPTO .rsmi files or ORD protocol bundles.
  2. Parse the SMILES strings into structured ReactionRecord objects.
  3. Apply built-in or custom filters (length, element, metadata predicates, etc.).
  4. Canonicalize the surviving reactions to obtain consistent representations.
  5. Summarize the cleaning results or export the cleaned reactions downstream.
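
Step 2 relies on the standard reaction SMILES layout `reactants>reagents>products`, with the molecules inside each role separated by `.`. The sketch below shows that split in plain Python. `SimpleRecord` and `parse_reaction_smiles` are hypothetical stand-ins for illustration; they are not the library's ReactionRecord or parser, and no chemical validation is performed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class SimpleRecord:
    """Hypothetical stand-in for ReactionRecord; not the library's class."""
    reactants: Tuple[str, ...]
    reagents: Tuple[str, ...]
    products: Tuple[str, ...]
    meta: Dict[str, Any] = field(default_factory=dict)


def parse_reaction_smiles(rxn_smiles, meta=None):
    """Split 'reactants>reagents>products' into a SimpleRecord."""
    # Keep only the first whitespace-delimited token in case extra
    # annotations follow the SMILES on the same line.
    tokens = rxn_smiles.split()
    if not tokens:
        raise ValueError("empty reaction SMILES")
    roles = tokens[0].split(">")
    if len(roles) != 3:
        raise ValueError(f"expected 3 '>'-separated roles, got {len(roles)}")
    # Molecules within a role are dot-separated; drop empty fragments.
    reactants, reagents, products = (
        tuple(s for s in role.split(".") if s) for role in roles
    )
    return SimpleRecord(reactants, reagents, products, dict(meta or {}))


rec = parse_reaction_smiles("CCO.CC(=O)O>[H+]>CC(=O)OCC")
print(rec.reactants, rec.reagents, rec.products)
```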

1. Loading Reaction Data

from chemrxn_cleaner.extractor import ord_procedure_yields_meta
from chemrxn_cleaner.loader import load_uspto_rsmi, load_ord_pb_reaction_smiles

# USPTO .rsmi loader (metadata fields stored in meta["fields"])
uspto_rxns = load_uspto_rsmi("data/uspto_sample.rsmi", keep_meta=True)

# ORD dataset loader with optional metadata extraction
ord_rxns = load_ord_pb_reaction_smiles(
    "data/ord_dataset.pb.gz",
    meta_extractor=ord_procedure_yields_meta,
)

Both loaders return lists of (reaction_smiles, metadata_dict) tuples, which feed directly into the cleaning utilities.
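
To illustrate the shape of that output, here is a standard-library sketch that reads tab-separated reaction rows into (reaction_smiles, metadata_dict) tuples. It is not the library's loader: the function name `read_rsmi_rows` is hypothetical, the column names (`ReactionSmiles`, `PatentNumber`, `Year`) are assumed from commonly distributed USPTO .rsmi files and may differ in your data, and the sample row is made up.

```python
import csv
import io


def read_rsmi_rows(handle, smiles_column="ReactionSmiles"):
    """Yield (reaction_smiles, metadata_dict) tuples from a TSV file handle."""
    # QUOTE_NONE keeps any quote characters in the data untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        smiles = (row.pop(smiles_column, "") or "").strip()
        if not smiles:
            continue  # skip blank or malformed rows
        # Remaining columns are stored under meta["fields"], as described above.
        yield smiles, {"fields": dict(row)}


# Made-up sample data standing in for a real .rsmi file.
sample = io.StringIO(
    "ReactionSmiles\tPatentNumber\tYear\n"
    "CCO>>CC=O\tUS0000000\t2001\n"
)
rows = list(read_rsmi_rows(sample))
print(rows)
```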

2. Running the Built-in Cleaning Pipeline

from chemrxn_cleaner import basic_cleaning_pipeline

cleaned_ord = basic_cleaning_pipeline(ord_rxns)

basic_cleaning_pipeline parses, filters, and canonicalizes every reaction using the default filter stack (has_product, all_molecules_valid). The result is a list of immutable ReactionRecord instances with reactants, reagents, products, and meta attributes.
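
Conceptually, a filter stack keeps a reaction only if every filter accepts it. The sketch below shows that idea with plain dictionaries standing in for ReactionRecord objects. `has_product_sketch` and `apply_filters` are hypothetical names, not the library's functions, and the molecule-validity check (which needs RDKit) is left out.

```python
def has_product_sketch(record):
    """Accept records that list at least one non-empty product SMILES."""
    return any(smiles.strip() for smiles in record["products"])


def apply_filters(records, filters):
    """Keep records for which every filter returns True, preserving order."""
    return [rec for rec in records if all(f(rec) for f in filters)]


# Stand-in records; the second has no product and should be dropped.
records = [
    {"reactants": ["CCO"], "reagents": [], "products": ["CC=O"]},
    {"reactants": ["CCO"], "reagents": [], "products": []},
]
kept = apply_filters(records, [has_product_sketch])
print(len(kept))
```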

3. Custom Filters and Canonicalization Options

The cleaning helpers are composable; you can control the filter order and canonicalization behavior explicitly:

from chemrxn_cleaner.cleaning import clean_and_canonicalize
from chemrxn_cleaner.filters import (
    default_filters,
    max_smiles_length,
    element_filter,
    meta_filter,
)
from chemrxn_cleaner.types import ElementFilterRule

filters = default_filters() + [
    max_smiles_length(250),
    element_filter(
        forbidList=ElementFilterRule(
            reactantElements=[],
            reagentElements=[],
            productElements=["Cl", "Br"],
        )
    ),
    meta_filter(lambda meta: meta.get("procedure", {}).get("setup.atmosphere") == "N2"),
]

cleaned_custom = clean_and_canonicalize(
    ord_rxns,
    filters=filters,
    isomeric=False,      # drop stereochemistry if desired
    drop_failed_parse=True,
)

Filters are simple callables accepting a ReactionRecord and returning True/False, so you can author domain-specific predicates without touching the core pipeline.
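
Because a filter is just a callable returning True or False, parameterized filters can be written as small factory functions. The sketch below is illustrative only: `max_reactant_count` is a hypothetical filter, and `types.SimpleNamespace` stands in for a ReactionRecord, assuming only that records expose a `reactants` sequence as described above.

```python
from types import SimpleNamespace


def max_reactant_count(limit):
    """Build a filter that rejects reactions with more than `limit` reactants."""
    def _filter(record):
        return len(record.reactants) <= limit
    return _filter


# Stand-in records; real ones would come from the cleaning pipeline.
small = SimpleNamespace(reactants=["CCO", "CC(=O)O"])
large = SimpleNamespace(reactants=["C", "CC", "CCC", "CCCC"])

at_most_three = max_reactant_count(3)
results = [at_most_three(small), at_most_three(large)]
print(results)
```

A filter built this way could be appended to a filter list like the one shown earlier, assuming the library accepts any callable with this shape.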

4. Working with Metadata

Metadata travels with each ReactionRecord through the pipeline. This makes it easy to slice cleaned reactions based on the original dataset context or to serialize extra descriptors:

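# Keep reactions whose first reported yield exceeds 80%.
# A missing (None) yield_percent is treated as 0 so the comparison cannot fail.
subset = [
    rec for rec in cleaned_custom
    if rec.meta.get("yields")
    and (rec.meta["yields"][0].get("yield_percent") or 0) > 80
]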

Custom metadata extractors (see chemrxn_cleaner/extractor.py) can capture procedure notes, yields, or any other ORD fields you care about.
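
One way to sanity-check metadata after cleaning is to tally reactions by yield range. This sketch assumes the `yields` / `yield_percent` layout used in the snippet above and treats missing values as unknown. `yield_bucket` is a hypothetical helper, and the sample metadata is made up.

```python
from collections import Counter


def yield_bucket(meta):
    """Classify a metadata dict by its first reported yield."""
    yields = meta.get("yields") or []
    value = yields[0].get("yield_percent") if yields else None
    if value is None:
        return "unknown"
    return "high" if value > 80 else "low"


# Made-up metadata dicts covering the present, empty, and missing cases.
metas = [
    {"yields": [{"yield_percent": 92.0}]},
    {"yields": [{"yield_percent": 40.0}]},
    {"yields": []},
    {},
]
counts = Counter(yield_bucket(m) for m in metas)
print(dict(counts))
```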

5. Reporting and Exporting

import json

from chemrxn_cleaner import reporting

report = reporting.summarize_cleaning(ord_rxns, cleaned_custom)
report.pretty_print()

# Export canonical reaction SMILES + metadata for downstream use
with open("cleaned_ord.jsonl", "w", encoding="utf-8") as f:
    for rec in cleaned_custom:
        payload = {
            "reactants": rec.reactants,
            "reagents": rec.reagents,
            "products": rec.products,
            "meta": rec.meta,
        }
        f.write(json.dumps(payload) + "\n")
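
Reading the export back is a quick way to confirm that it round-trips. The sketch below writes one illustrative payload to a temporary file and loads it again with the standard library. `read_jsonl` is a hypothetical helper and the payload values are made up; it also assumes every metadata value is JSON-serializable.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up payload with the same keys as the export above.
payload = {
    "reactants": ["CCO"],
    "reagents": [],
    "products": ["CC=O"],
    "meta": {"source": "example"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(payload) + "\n")
    loaded = read_jsonl(path)

print(loaded == [payload])
```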

Quick Start

from chemrxn_cleaner.loader import load_uspto_rsmi
from chemrxn_cleaner import basic_cleaning_pipeline, reporting

rxns = load_uspto_rsmi("/path/to/file.rsmi", keep_meta=True)
cleaned = basic_cleaning_pipeline(rxns)

report = reporting.summarize_cleaning(rxns, cleaned)
report.pretty_print()

End-to-End Example Script

The repository ships with examples/clean_data_example.py, a runnable walkthrough that ties everything together:

  1. Load USPTO .rsmi files and ORD .pb.gz datasets.
  2. Apply default + custom filters (length, element whitelist/blacklist, metadata predicates).
  3. Generate cleaning reports.
  4. Persist cleaned reactions to disk.

Run it with:

python examples/clean_data_example.py

Use this script as a template: swap in your file paths, tweak the filter stack, and tailor the export code to match the format your downstream tools expect.
