A lightweight toolkit for cleaning and standardizing organic reaction datasets.


chemrxn-cleaner

Lightweight helpers for parsing, cleaning, filtering, reporting, and exporting organic reaction datasets before ML or analytics workflows.

Design

---
config:
  layout: elk
---
flowchart TB
  subgraph Sources["Sources"]
    uspto["USPTO .rsmi<br/>load_uspto"]:::sourcesNode
    ord["ORD .pb/.pb.gz<br/>load_ord"]:::sourcesNode
    csv["CSV<br/>load_csv"]:::sourcesNode
    json["JSON<br/>load_json"]:::sourcesNode
    custom["Custom format<br/>register_input_format"]:::sourcesNode
  end
  subgraph IO["Loader Registry"]
    load["load_reactions(fmt=...) →<br/>ReactionRecord"]:::ioNode
  end
  subgraph Cleaning["Parser + Filter + Cleaner"]
    parse["parse_reaction_smiles"]:::cleanNode
    filterStack["ReactionFilter<br/>has_product<br/>all_molecules_valid<br/>meta_filter<br/>element_filter<br/>max_smiles_length<br/>similarity_filter<br/>......"]:::filterStack
    clean["clean_reactions<br/>clean_and_canonicalize"]:::cleanNode
  end
  subgraph Output["Output"]
    export["export_reaction_records"]:::outputNode
    df["records_to_dataframe"]:::outputNode
    dataset["ForwardReactionDataset"]:::outputNode
  end
  uspto --> load
  ord --> load
  csv --> load
  json --> load
  custom --> load
  load --> records[/"List[ReactionRecord]"/]:::ioNode
  records --> parse
  parse --> clean
  filterStack -.-> clean
  clean --> cleaned[/"Cleaned ReactionRecords"/]:::cleanNode & stats["CleaningStats + FilterStats"]:::cleanNode
  cleaned --> export & df
  df --> dataset

  %% Highlight subgraphs/blocks
  style Sources fill:#E3F2FD,stroke:#0277BD,stroke-width:2px
  style IO fill:#F1F8E9,stroke:#689F38,stroke-width:2px
  style Cleaning fill:#FFF8E1,stroke:#FFC107,stroke-width:2px
  style Output fill:#FFEBEE,stroke:#D32F2F,stroke-width:2px

  %% Individual node color
  classDef sourcesNode fill:#90CAF9,stroke:#1565C0,color:#08306b;
  classDef ioNode fill:#AED581,stroke:#33691E,color:#1B5E20;
  classDef cleanNode fill:#FFE082,stroke:#FFB300,color:#795548;
  classDef filterStack fill:#C8E6C9,stroke:#388E3C,color:#1B5E20;
  classDef outputNode fill:#FFCDD2,stroke:#C62828,color:#B71C1C;

Installation

Requires Python 3.9+ with RDKit installed separately (platform-specific RDKit wheels are not bundled).
The package depends on ord-schema, pandas, tqdm, and torch (for the ML helpers).

pip install chemrxn-cleaner

Developing locally? Install in editable mode:

pip install -e .

Quick start

from chemrxn_cleaner import (
    clean_and_canonicalize,
    clean_reactions_with_report,
    default_filters,
    export_reaction_records,
    load_reactions,
)

raw = load_reactions("data/sample.rsmi", fmt="uspto", keep_meta=True)
filters = default_filters()

cleaned = clean_and_canonicalize(raw, filters=filters)
print(f"Kept {len(cleaned)}/{len(raw)} reactions after cleaning")

# Need per-filter stats? Call the reporting variant instead:
cleaned_with_report, stats = clean_reactions_with_report(raw, filters=filters)
print(f"Failed parses: {stats.n_failed_parse}, dropped: {stats.n_input - stats.n_output}")

export_reaction_records(cleaned, "cleaned.json", fmt="json")
export_reaction_records(cleaned, "cleaned.csv", fmt="csv")

Loading reaction data

Use the registry-driven load_reactions(..., fmt=...) helper or call the individual loaders directly. Built-in formats are auto-registered when the package is imported.

USPTO .rsmi (optionally keep tab-separated metadata in extra_metadata["fields"]):

from chemrxn_cleaner import load_reactions

uspto_rxns = load_reactions("data/uspto_sample.rsmi", fmt="uspto", keep_meta=True)
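
For orientation, a reaction SMILES has the form reactants>reagents>products, with the molecules in each role separated by dots. The sketch below splits one tab-separated line the way a USPTO-style loader might. split_rsmi_line is a hypothetical helper for illustration, not part of chemrxn-cleaner, and the sample line and its metadata fields are made up. It assumes the reaction SMILES sits in the first column and does not handle atom-map extensions or salts written as dot-joined fragments.

```python
def split_rsmi_line(line: str):
    """Split one tab-separated .rsmi line into roles and trailing metadata fields.

    Hypothetical helper: assumes the reaction SMILES is the first column.
    """
    rxn_smiles, *fields = line.rstrip("\n").split("\t")
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected reactants>reagents>products, got {rxn_smiles!r}")
    # Each role is a dot-separated list of molecules; an empty role yields [].
    reactants, reagents, products = (p.split(".") if p else [] for p in parts)
    return {
        "reactants": reactants,
        "reagents": reagents,
        "products": products,
        "fields": fields,
    }


# Made-up sample line: an esterification with an acid reagent and two metadata fields.
sample = "CC(=O)O.CCO>OS(=O)(=O)O>CCOC(C)=O\tEXAMPLE-ID\t2001"
parsed = split_rsmi_line(sample)
print(parsed["reactants"], parsed["reagents"], parsed["products"], parsed["fields"])
```

With keep_meta=True, the trailing columns are what the loader is described as storing under extra_metadata["fields"].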

ORD .pb/.pb.gz (populates reaction_id, basic conditions, yields, and extra_metadata["reaction_index"]):

ord_rxns = load_reactions(
    "data/ord_dataset.pb.gz",
    fmt="ord",
    generate_if_missing=True,
    allow_incomplete=True,
    canonical=True,
)

CSV (either assemble reaction SMILES from columns or read a pre-built column). The optional mapper(record, row) can enrich each record, or skip a row by returning None.

from chemrxn_cleaner import load_reactions
from chemrxn_cleaner.types import ReactionRecord

csv_rxns = load_reactions(
    "data/reactions.csv",
    fmt="csv",
    reactant_columns=["reactant_a", "reactant_b"],
    reagent_columns=["catalyst"],
    product_columns=["product"],
    # dict.update() returns None, so "or record" hands the enriched record back.
    mapper=lambda record, row: (
        record.extra_metadata.update({"temperature": row.get("temp_c")}) or record
    ),
)

csv_rxns_prebuilt = load_reactions(
    "data/reactions.csv",
    fmt="csv",
    reaction_smiles_column="rxn_smiles",
)
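
Assembling a reaction SMILES from role columns amounts to joining the reactant, reagent, and product cells into reactants>reagents>products. A minimal stdlib sketch of that idea follows. build_reaction_smiles is a hypothetical helper, not the loader's code, and the loader's actual handling of empty or multi-molecule cells may differ.

```python
import csv
import io


def build_reaction_smiles(row, reactant_columns, reagent_columns, product_columns):
    """Join the non-empty cells of each role with '.', then the roles with '>'."""

    def join(columns):
        cells = ((row.get(c) or "").strip() for c in columns)
        return ".".join(cell for cell in cells if cell)

    roles = (reactant_columns, reagent_columns, product_columns)
    return ">".join(join(columns) for columns in roles)


# In-memory CSV using the same column names as the example above.
text = (
    "reactant_a,reactant_b,catalyst,product\n"
    "CC(=O)O,CCO,OS(=O)(=O)O,CCOC(C)=O\n"
    "CCO,,,CC=O\n"
)
for row in csv.DictReader(io.StringIO(text)):
    print(build_reaction_smiles(row, ["reactant_a", "reactant_b"], ["catalyst"], ["product"]))
```

Empty cells are skipped rather than emitted as empty molecules, so a row without a catalyst still yields a well-formed reactants>>products string.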

JSON (supply a mapper per entry):

from chemrxn_cleaner import load_reactions
from chemrxn_cleaner.parser import parse_reaction_smiles

def map_json_entry(item):
    rec = parse_reaction_smiles(f"{item['reactants']}>>{item['products']}", strict=False)
    rec.source = "json"
    rec.extra_metadata.update(item.get("meta", {}))
    return rec

json_rxns = load_reactions("data/reactions.json", fmt="json", mapper=map_json_entry)

Custom formats: Register your own loader and call it through the registry.

from chemrxn_cleaner import load_reactions, register_input_format
from chemrxn_cleaner.types import ReactionRecord

def load_my_format(path: str):
    # Placeholder loader: a real one would read `path`; "A>B>C" is not valid SMILES.
    rec = ReactionRecord(reaction_smiles="A>B>C", source="myfmt")
    return [rec]

register_input_format("myfmt", load_my_format)
rxns = load_reactions("my_file.txt", fmt="myfmt")
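
The registry idea itself is small: a mapping from format name to loader callable, with a single entry point dispatching on fmt. The sketch below illustrates the pattern with stand-in names (_LOADERS, register, load, toy_loader). It is not the library's implementation and makes no claim about how chemrxn-cleaner treats duplicate names or unknown formats.

```python
from typing import Callable, Dict, List

# Stand-in registry: format name -> loader callable.
_LOADERS: Dict[str, Callable[..., List[str]]] = {}


def register(fmt: str, loader: Callable[..., List[str]]) -> None:
    """Store a loader under a format name (this sketch overwrites silently)."""
    _LOADERS[fmt] = loader


def load(path: str, fmt: str, **kwargs) -> List[str]:
    """Look up the loader for fmt and forward the path plus loader-specific options."""
    try:
        loader = _LOADERS[fmt]
    except KeyError:
        raise ValueError(f"unknown format {fmt!r}; registered: {sorted(_LOADERS)}") from None
    return loader(path, **kwargs)


def toy_loader(path: str, prefix: str = "") -> List[str]:
    """Toy loader that ignores the file and returns one placeholder reaction SMILES."""
    return [prefix + "CCO>>CC=O"]


register("toy", toy_loader)
print(load("ignored.txt", fmt="toy"))
```

Forwarding **kwargs is what lets each format take its own options (keep_meta for USPTO, reactant_columns for CSV, and so on) behind one call signature.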

Cleaning and filters

clean_reactions parses missing reactant/reagent/product lists, applies filters, and optionally drops failed parses. clean_and_canonicalize also canonicalizes every SMILES; call it with filters=default_filters() for the default stack (has_product, all_molecules_valid, strict parsing, isomeric SMILES).

from chemrxn_cleaner import (
    clean_and_canonicalize,
    clean_reactions_with_report,
    default_filters,
    max_smiles_length,
)
from chemrxn_cleaner.filters import ElementFilterRule, element_filter, meta_filter
from chemrxn_cleaner.utils import similarity_filter

filters = default_filters() + [
    max_smiles_length(250),
    element_filter(
        forbidList=ElementFilterRule([], ["Cl"], []),
    ),
    meta_filter(lambda meta: meta.get("source") == "trusted"),
    similarity_filter("c1ccccc1", role="reactant", threshold=0.6),
]

cleaned = clean_and_canonicalize(
    rxn_smiles_list=uspto_rxns,
    filters=filters,
    isomeric=True,
)

# Get a per-filter report alongside the cleaned list
cleaned, stats = clean_reactions_with_report(uspto_rxns, filters=filters)

Filters are simple callables returning True/False. Compose meta_filter, element_filter, max_smiles_length, similarity_filter, or author your own to encode domain rules.
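
A sketch of authoring and composing filters, assuming each filter receives one record object and returns a bool. Record is a stand-in dataclass, and max_reactants, has_any_product, and apply_filters are hypothetical names for illustration; the argument the library's filters actually receive should be checked against its source.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Record:
    """Stand-in for a parsed reaction; only the fields this sketch needs."""
    reactants: List[str] = field(default_factory=list)
    products: List[str] = field(default_factory=list)


Filter = Callable[[Record], bool]


def max_reactants(limit: int) -> Filter:
    """Hypothetical filter factory: keep reactions with at most `limit` reactants."""
    def _filter(record: Record) -> bool:
        return len(record.reactants) <= limit
    return _filter


def has_any_product(record: Record) -> bool:
    """Keep reactions that list at least one product."""
    return bool(record.products)


def apply_filters(records: List[Record], filters: List[Filter]) -> List[Record]:
    """Keep a record only if every filter returns True."""
    return [r for r in records if all(f(r) for f in filters)]


records = [
    Record(reactants=["CCO"], products=["CC=O"]),
    Record(reactants=["CCO"], products=[]),
    Record(reactants=["C", "CC", "CCC"], products=["CCCCCC"]),
]
kept = apply_filters(records, [has_any_product, max_reactants(2)])
print(len(kept))
```

A factory such as max_reactants(2) mirrors the max_smiles_length(250) call style above: the outer call fixes the parameter and returns the actual predicate.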

Reporting and exporting

Use clean_reactions_with_report to capture filter-level counters, failed parse counts, and the final number of reactions kept.

from chemrxn_cleaner import clean_reactions_with_report, export_reaction_records

cleaned, stats = clean_reactions_with_report(raw, filters=filters)
print(f"Input: {stats.n_input}, output: {stats.n_output}, failed_parse: {stats.n_failed_parse}")
for name, fstats in stats.per_filter.items():
    print(f"{name}: applied={fstats.applied}, passed={fstats.passed}, failed={fstats.failed}")

export_reaction_records(cleaned, "cleaned.json", fmt="json")
export_reaction_records(cleaned, "cleaned.csv", fmt="csv")

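# Combine stats from parallel cleaning runs.
# stats_worker_1..3 stand for the CleaningStats objects returned by separate
# clean_reactions_with_report calls, one per worker.
from chemrxn_cleaner.reporter import CleaningStats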

combined_stats = CleaningStats.combine([stats_worker_1, stats_worker_2, stats_worker_3])
print(combined_stats.n_input, combined_stats.n_output)
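
Conceptually, combining per-worker stats is a matter of summing counters. The sketch below shows that idea on a stand-in Stats dataclass carrying the three top-level counters named above. It assumes the counters are additive across workers and is not the implementation of CleaningStats.combine, which also has per-filter counters to merge.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Stats:
    """Stand-in with the three top-level counters named in the text."""
    n_input: int = 0
    n_output: int = 0
    n_failed_parse: int = 0


def combine(parts: Iterable[Stats]) -> Stats:
    """Sum each counter across workers."""
    total = Stats()
    for s in parts:
        total.n_input += s.n_input
        total.n_output += s.n_output
        total.n_failed_parse += s.n_failed_parse
    return total


# Two workers' results (made-up numbers).
merged = combine([Stats(100, 90, 4), Stats(50, 45, 1)])
print(merged.n_input, merged.n_output, merged.n_failed_parse)
```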

Working with `ReactionRecord`

ReactionRecord stores the parsed reaction (reaction_smiles, reactants, reagents, products) plus identifiers (reaction_id, source, source_ref, source_file_path), optional conditions (temperature_c, time_hours, pressure_bar, ph, solvents, catalysts, bases, additives, atmosphere, scale_mmol), yields (yield_value, yield_type), success/selectivity flags, and arbitrary extra_metadata. Use to_dict()/from_dict() for serialization, and show() to render a reaction image when RDKit visualization is available.
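
The to_dict()/from_dict() pair suggests a plain-dict round trip that survives JSON. The sketch below demonstrates that pattern on a stand-in MiniRecord with a handful of the fields listed above. It is an illustration under that assumption, not the ReactionRecord class itself, whose serialization details may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class MiniRecord:
    """Stand-in carrying a small subset of the fields described above."""
    reaction_smiles: str
    reactants: List[str] = field(default_factory=list)
    products: List[str] = field(default_factory=list)
    yield_value: Optional[float] = None
    extra_metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        """Return a plain dict of builtin types, suitable for json.dumps."""
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MiniRecord":
        """Rebuild a record from the dict produced by to_dict()."""
        return cls(**data)


rec = MiniRecord("CCO>>CC=O", ["CCO"], ["CC=O"], 82.5, {"source": "trusted"})
restored = MiniRecord.from_dict(json.loads(json.dumps(rec.to_dict())))
print(restored == rec)
```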

ML utilities

records_to_dataframe converts ReactionRecords to a pandas DataFrame for quick EDA/export.
train_valid_test_split produces a seeded, reproducible random split.
ForwardReactionDataset is a minimal PyTorch Dataset for forward prediction; each record should expose reactant_smiles, reagent_smiles, and product_smiles attributes if you plan to use it with a model.

from chemrxn_cleaner import ForwardReactionDataset, records_to_dataframe, train_valid_test_split

df = records_to_dataframe(cleaned)

# ForwardReactionDataset reads *_smiles attributes, so alias them before splitting.
for r in cleaned:
    r.reactant_smiles = r.reactants
    r.reagent_smiles = r.reagents
    r.product_smiles = r.products

train, valid, test = train_valid_test_split(cleaned, seed=123)

dataset = ForwardReactionDataset(train, use_agents=True)
example = dataset[0]
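
What makes a random split reproducible is shuffling with a dedicated, seeded generator. A minimal sketch of that technique follows. seeded_split and its 10%/10% validation and test fractions are hypothetical, chosen for illustration; the library's default ratios are not stated here.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def seeded_split(
    items: Sequence[T],
    seed: int,
    valid_frac: float = 0.1,
    test_frac: float = 0.1,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy with a private RNG, then slice into train/valid/test."""
    shuffled = list(items)
    # A local Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_valid = int(len(shuffled) * valid_frac)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


a = seeded_split(range(20), seed=123)
b = seeded_split(range(20), seed=123)
print(a == b, [len(part) for part in a])
```

The same seed yields the same partition on every call, and the three parts together contain each input item exactly once.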

Examples

An interactive walkthrough lives at examples/example.ipynb. It demonstrates loading ORD, USPTO, JSON, and CSV datasets, applying filter stacks (including similarity filtering), and exporting cleaned reactions. Open it in Jupyter and swap in your own file paths to mirror the workflows.


Release history

0.1.0 (Dec 11, 2025): this version
0.0.4 (Nov 16, 2025): yanked. Reason: "This is a pre-mature version."
0.0.3 (Nov 16, 2025): yanked. Reason: "This is a pre-mature version."

Download files

Source Distribution

chemrxn_cleaner-0.1.0.tar.gz (29.9 kB view details)

Uploaded Dec 11, 2025 Source

Built Distribution

chemrxn_cleaner-0.1.0-py3-none-any.whl (31.6 kB view details)

Uploaded Dec 11, 2025 Python 3

File details

Details for the file chemrxn_cleaner-0.1.0.tar.gz.

File metadata

Download URL: chemrxn_cleaner-0.1.0.tar.gz
Upload date: Dec 11, 2025
Size: 29.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for chemrxn_cleaner-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`f95b37c549c6fd118dda7733dcd760c600b2c78f82c7fd06ffb8d739425d1be7`
MD5	`c982e62d01e873c47cf6f18d53afa293`
BLAKE2b-256	`6b98cc438c4681d243298a82b91349536a6a4ae198c0889173fa56fdde0d2b4f`

File details

Details for the file chemrxn_cleaner-0.1.0-py3-none-any.whl.

File metadata

Download URL: chemrxn_cleaner-0.1.0-py3-none-any.whl
Upload date: Dec 11, 2025
Size: 31.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for chemrxn_cleaner-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`168d02a7fbc5c40c9ac17a39faf4a312d5ce25853b1fb1c9c6b8269253aa9c32`
MD5	`2e900cdc2754792bacaac42e0c0bd02e`
BLAKE2b-256	`5ba8872de0d4d1cf1fcce1a942f399bc8bdb7a18a8297a50056cf8f5e81767d9`
