
A lightweight toolkit for cleaning and standardizing organic reaction datasets.

Reason this release was yanked:

This is a premature version.

chemrxn-cleaner

A lightweight toolkit for loading, cleaning, and standardizing organic reaction datasets before machine-learning or analytics workflows.

Prerequisites

  • Python 3.9+

The package's dependencies are pulled in automatically when installing chemrxn-cleaner, with the exception of platform-specific RDKit wheels.

Installation

pip install chemrxn-cleaner

If you are working from a clone of this repository, install in editable mode to develop locally:

pip install -e .

How to Use the Package

ChemRxn-Cleaner aims to make cleaning reproducible. A typical workflow has five steps:

  1. Load reaction SMILES and metadata from USPTO .rsmi files or ORD protocol bundles.
  2. Parse the SMILES strings into structured ReactionRecord objects.
  3. Apply built-in or custom filters (length, element, metadata predicates, etc.).
  4. Canonicalize the surviving reactions to obtain consistent representations.
  5. Summarize the cleaning results or export the cleaned reactions downstream.
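To make step 2 concrete, here is a minimal sketch of what parsing a reaction SMILES involves. It relies only on the standard reaction SMILES layout (`reactants>reagents>products`, with molecules in each role separated by `.`). The function name `split_reaction_smiles` is hypothetical and is not part of chemrxn-cleaner; the package's own parser may do more, such as validating molecules with RDKit.

```python
from typing import List, Tuple


def split_reaction_smiles(rxn: str) -> Tuple[List[str], List[str], List[str]]:
    """Split 'reactants>reagents>products' into three lists of molecule SMILES.

    Hypothetical helper for illustration only; not the chemrxn-cleaner API.
    """
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got {len(parts) - 1}")
    # Molecules within a role are separated by '.'; an empty role yields [].
    reactants, reagents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return reactants, reagents, products


reactants, reagents, products = split_reaction_smiles("CCO.CC(=O)O>>CCOC(C)=O")
print(reactants, reagents, products)
```

An empty middle section, as in the example, simply means no reagents were recorded.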

1. Loading Reaction Data

from chemrxn_cleaner.extractor import ord_procedure_yields_meta
from chemrxn_cleaner.loader import load_uspto_rsmi, load_ord_pb_reaction_smiles

# USPTO .rsmi loader (metadata fields stored in meta["fields"])
uspto_rxns = load_uspto_rsmi("data/uspto_sample.rsmi", keep_meta=True)

# ORD dataset loader with optional metadata extraction
ord_rxns = load_ord_pb_reaction_smiles(
    "data/ord_dataset.pb.gz",
    meta_extractor=ord_procedure_yields_meta,
)

Both loaders return lists of (reaction_smiles, metadata_dict) tuples, which feed directly into the cleaning utilities.
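As a rough illustration of the (reaction_smiles, metadata_dict) shape, the sketch below reads tab-separated text into such tuples using only the standard library. It assumes a header row whose first column holds the reaction SMILES and whose remaining columns are metadata; the column names, the sample values, and the function `read_rsmi_text` are all made up for this example, and the real `load_uspto_rsmi` may lay out `meta["fields"]` differently.

```python
import csv
import io
from typing import Dict, List, Tuple


def read_rsmi_text(text: str) -> List[Tuple[str, Dict[str, Dict[str, str]]]]:
    """Turn tab-separated text into (reaction_smiles, metadata_dict) tuples.

    Hypothetical helper; assumes row 1 is a header and column 1 is the SMILES.
    """
    # QUOTE_NONE keeps any quote characters in the data untouched.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return []
    rows = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        # Pair the remaining header names with the remaining cell values.
        fields = dict(zip(header[1:], row[1:]))
        rows.append((row[0].strip(), {"fields": fields}))
    return rows


sample = "ReactionSmiles\tSource\tYear\nCCO>>CC=O\tEXAMPLE-1\t1990\n"
for smiles, meta in read_rsmi_text(sample):
    print(smiles, meta)
```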

2. Running the Built-in Cleaning Pipeline

from chemrxn_cleaner import basic_cleaning_pipeline

cleaned_ord = basic_cleaning_pipeline(ord_rxns)

basic_cleaning_pipeline parses, filters, and canonicalizes every reaction using the default filter stack (has_product, all_molecules_valid). The result is a list of immutable ReactionRecord instances with reactants, reagents, products, and meta attributes.

3. Custom Filters and Canonicalization Options

The cleaning helpers are composable; you can control the filter order and canonicalization behavior explicitly:

from chemrxn_cleaner.cleaning import clean_and_canonicalize
from chemrxn_cleaner.filters import (
    default_filters,
    max_smiles_length,
    element_filter,
    meta_filter,
)
from chemrxn_cleaner.types import ElementFilterRule

filters = default_filters() + [
    max_smiles_length(250),
    element_filter(
        forbidList=ElementFilterRule(
            reactantElements=[],
            reagentElements=[],
            productElements=["Cl", "Br"],
        )
    ),
    meta_filter(lambda meta: meta.get("procedure", {}).get("setup.atmosphere") == "N2"),
]

cleaned_custom = clean_and_canonicalize(
    ord_rxns,
    filters=filters,
    isomeric=False,      # drop stereochemistry if desired
    drop_failed_parse=True,
)

Filters are simple callables accepting a ReactionRecord and returning True/False, so you can author domain-specific predicates without touching the core pipeline.
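A custom filter written in this style might look like the sketch below. To stay runnable without the package installed, it uses a stand-in `Record` dataclass that only mimics the reactants, reagents, products, and meta attributes described above; `Record`, `max_reactant_count`, and `apply_filters` are hypothetical names, not part of chemrxn-cleaner.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    """Stand-in for ReactionRecord (assumption: same four attributes)."""
    reactants: List[str]
    reagents: List[str]
    products: List[str]
    meta: Dict[str, Any] = field(default_factory=dict)


def max_reactant_count(limit: int) -> Callable[[Record], bool]:
    """Build a filter that keeps reactions with at most `limit` reactants."""
    def _filter(rec: Record) -> bool:
        return len(rec.reactants) <= limit
    return _filter


def apply_filters(records: Iterable[Record],
                  filters: List[Callable[[Record], bool]]) -> List[Record]:
    """Keep a record only if every filter returns True for it."""
    return [rec for rec in records if all(f(rec) for f in filters)]


records = [
    Record(["CCO", "CC(=O)O"], [], ["CCOC(C)=O"]),
    Record(["C", "CC", "CCC"], [], ["CCCCCC"]),
]
kept = apply_filters(records, [max_reactant_count(2)])
print(len(kept))
```

Writing the filter as a factory (a function that returns the predicate) mirrors how `max_smiles_length(250)` is called in the example above: the threshold is fixed once and the resulting callable can be dropped into any filter list.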

4. Working with Metadata

Metadata travels with each ReactionRecord through the pipeline. This makes it easy to slice cleaned reactions based on the original dataset context or to serialize extra descriptors:

subset = [
    rec for rec in cleaned_custom
    if rec.meta.get("yields") and rec.meta["yields"][0]["yield_percent"] > 80
]

Custom metadata extractors (see chemrxn_cleaner/extractor.py) can capture procedure notes, yields, or any other ORD fields you care about.
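When slicing on metadata, it can help to guard against records where the yield entry is missing or empty. The sketch below assumes the same layout as the snippet above (`meta["yields"]` is a list of dicts with a `yield_percent` key); the helper name `first_yield_percent` is hypothetical.

```python
from typing import Any, Dict, Optional


def first_yield_percent(meta: Dict[str, Any]) -> Optional[float]:
    """Return the first recorded yield as a float, or None if unavailable.

    Assumes meta["yields"] is a list of dicts, as in the snippet above.
    """
    yields = meta.get("yields") or []
    if not yields:
        return None
    value = yields[0].get("yield_percent")
    return float(value) if value is not None else None


metas = [
    {"yields": [{"yield_percent": 85}]},
    {"yields": [{}]},   # yield entry present but empty
    {},                 # no yield information at all
]
high_yield = [m for m in metas if (first_yield_percent(m) or 0.0) > 80]
print(len(high_yield))
```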

5. Reporting and Exporting

from chemrxn_cleaner import reporting

report = reporting.summarize_cleaning(ord_rxns, cleaned_custom)
report.pretty_print()

# Export canonical reaction SMILES + metadata for downstream use
import json
with open("cleaned_ord.jsonl", "w", encoding="utf-8") as f:
    for rec in cleaned_custom:
        payload = {
            "reactants": rec.reactants,
            "reagents": rec.reagents,
            "products": rec.products,
            "meta": rec.meta,
        }
        f.write(json.dumps(payload) + "\n")
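A file written this way can be read back with the standard library alone. The sketch below assumes one JSON object per line, as produced by the export loop above; `iter_jsonl` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate trailing blank lines
                yield json.loads(line)
```

For example, `records = list(iter_jsonl("cleaned_ord.jsonl"))` would load every exported reaction into memory, while iterating directly keeps memory use flat for large files.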

Quick Start

from chemrxn_cleaner.loader import load_uspto_rsmi
from chemrxn_cleaner import basic_cleaning_pipeline, reporting

rxns = load_uspto_rsmi("/path/to/file.rsmi", keep_meta=True)
cleaned = basic_cleaning_pipeline(rxns)

report = reporting.summarize_cleaning(rxns, cleaned)
report.pretty_print()

End-to-End Example Script

The repository ships with examples/clean_data_example.py, a runnable walkthrough that ties everything together:

  1. Load USPTO .rsmi files and ORD .pb.gz datasets.
  2. Apply default + custom filters (length, element whitelist/blacklist, metadata predicates).
  3. Generate cleaning reports.
  4. Persist cleaned reactions to disk.

Run it with:

python examples/clean_data_example.py

Use this script as a template: swap in your file paths, tweak the filter stack, and tailor the export code to match the format your downstream tools expect.
