
MEHC Curation

A comprehensive Python toolkit for molecular data curation, including validation, cleaning, normalization, and refinement pipelines.

Features

  • Validation: Validate SMILES strings and remove unwanted molecular types (mixtures, inorganics, organometallics)
  • Cleaning: Remove salts and neutralize charged molecules
  • Normalization: Normalize tautomers and stereoisomers
  • Refinement: Complete pipeline orchestrating all stages
  • Parallel Processing: Runs on all available CPUs by default for efficient batch processing
  • Comprehensive Reporting: Generate detailed reports for each processing stage

Installation

Prerequisites

Before installing mehc-curation, you need to install RDKit, which is best installed via conda:

conda install -c conda-forge rdkit

Install from PyPI

pip install mehc-curation

Install from source

git clone https://github.com/biochem-data-sci/mehc-curation.git
cd mehc-curation
pip install -e .

Quick Start

Python API

import pandas as pd
from mehc_curation.validation import ValidationStage
from mehc_curation.cleaning import CleaningStage
from mehc_curation.normalization import NormalizationStage
from mehc_curation.refinement import RefinementStage

# Load your SMILES data
df = pd.read_csv("your_data.csv")

# Validation
validator = ValidationStage(df)
validated_df = validator.complete_validation()

# Cleaning
cleaner = CleaningStage(validated_df)
cleaned_df = cleaner.complete_cleaning()

# Normalization
normalizer = NormalizationStage(cleaned_df)
normalized_df = normalizer.complete_normalization()

# Complete refinement pipeline
refiner = RefinementStage(df)
refined_df = refiner.complete_refinement(
    output_dir="./output",
    get_report=True
)

Command Line Interface

# Validation
python -m mehc_curation.validation -i input.csv -o output/ -c 5

# Cleaning
python -m mehc_curation.cleaning -i input.csv -o output/ -c 3

# Normalization
python -m mehc_curation.normalization -i input.csv -o output/ -c 3

# Complete refinement
python -m mehc_curation.refinement -i input.csv -o output/ --get_report

Modules

Validation Module

Validates SMILES strings and removes unwanted molecular types:

  • validate_smi(): Validate SMILES strings
  • rm_mixture(): Remove mixture compounds
  • rm_inorganic(): Remove inorganic compounds
  • rm_organometallic(): Remove organometallic compounds
  • complete_validation(): Run all validation steps
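To illustrate the idea behind mixture removal (not the library's actual implementation, which presumably uses RDKit): in SMILES notation, a "." separates disconnected fragments, so a multi-fragment string encodes a mixture or a salt. A minimal sketch:

```python
# Illustrative only: in SMILES, "." is the disconnection symbol, so any
# string containing it describes more than one fragment (mixture/salt).
def is_mixture(smiles: str) -> bool:
    """Return True if the SMILES encodes more than one fragment."""
    return "." in smiles

records = ["CCO", "CC(=O)O.[Na]", "c1ccccc1"]
singles = [s for s in records if not is_mixture(s)]
print(singles)  # ['CCO', 'c1ccccc1']
```

A real validator would also parse each string with RDKit to confirm it is chemically valid before applying fragment-based filters.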

Cleaning Module

Cleans SMILES strings:

  • cl_salt(): Remove salts from SMILES
  • neutralize(): Neutralize charged molecules
  • complete_cleaning(): Run all cleaning steps
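A common desalting heuristic (shown for intuition; not necessarily what cl_salt() does internally) is to split a multi-fragment SMILES on "." and keep the largest fragment, on the assumption that the parent molecule is bigger than its counter-ion:

```python
# Crude desalting sketch: keep the longest fragment of a dot-separated
# SMILES. String length is a rough proxy for atom count here; a real
# implementation would compare RDKit atom counts instead.
def strip_salt(smiles: str) -> str:
    fragments = smiles.split(".")
    return max(fragments, key=len)

print(strip_salt("CC(=O)O.[Na]"))  # CC(=O)O
print(strip_salt("c1ccccc1"))      # c1ccccc1 (single fragment, unchanged)
```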

Normalization Module

Normalizes SMILES strings:

  • detautomerize(): Normalize tautomers
  • destereoisomerize(): Strip stereochemistry, collapsing stereoisomers into a single record
  • complete_normalization(): Run all normalization steps
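For intuition only (destereoisomerize() almost certainly works on RDKit molecule objects rather than raw strings): removing stereochemistry amounts to deleting the tetrahedral ("@") and double-bond ("/", "\") stereo markers from a SMILES string, so that stereoisomers map to the same flat structure:

```python
import re

# Crude stereochemistry-stripping sketch: delete SMILES stereo markers.
# This string-level approach is illustrative; RDKit's RemoveStereochemistry
# is the robust way to do this on a parsed molecule.
def strip_stereo(smiles: str) -> str:
    return re.sub(r"[@/\\]", "", smiles)

print(strip_stereo("C[C@H](N)C(=O)O"))  # C[CH](N)C(=O)O
print(strip_stereo("F/C=C/F"))          # FC=CF
```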

Refinement Module

Complete refinement pipeline:

  • complete_refinement(): Orchestrates validation, cleaning, and normalization stages

Configuration

CPU Usage

By default, the library uses all available CPUs (n_cpu=-1). You can specify the number of CPUs:

# Use all CPUs (default)
refiner.complete_refinement(n_cpu=-1)

# Use specific number of CPUs
refiner.complete_refinement(n_cpu=4)

# Use single CPU
refiner.complete_refinement(n_cpu=1)
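The n_cpu=-1 convention is typically resolved as shown below (an illustration of the convention, not the library's actual code):

```python
import os

# Sketch of how an n_cpu=-1 sentinel is usually interpreted:
# -1 means "use every available CPU"; positive values pass through.
def resolve_n_cpu(n_cpu: int) -> int:
    if n_cpu == -1:
        return os.cpu_count() or 1
    if n_cpu < 1:
        raise ValueError("n_cpu must be -1 or a positive integer")
    return n_cpu

print(resolve_n_cpu(4))   # 4
print(resolve_n_cpu(-1))  # number of CPUs on this machine
```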

Output Directories

  • output_dir is optional for every stage. If you omit it, data stays in memory and any generated reports are written to the current working directory.
  • When you do provide an output_dir, the folder will be created automatically if it does not exist, and both CSV outputs and reports are saved beneath it.

Duplicate Handling

  • param_deduplicate defaults to True for all validation, cleaning, and normalization entry points, so duplicate rows are removed automatically unless you set it to False.
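In pandas terms, deduplication amounts to the following (illustrative; the column name "smiles" is an assumption about your input file):

```python
import pandas as pd

# Drop rows whose SMILES string repeats, keeping the first occurrence,
# which is what a deduplication pass over curated data boils down to.
df = pd.DataFrame({"smiles": ["CCO", "CCO", "c1ccccc1"]})
deduped = df.drop_duplicates(subset="smiles", keep="first").reset_index(drop=True)
print(len(deduped))  # 2
```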

Requirements

  • Python >= 3.7
  • pandas >= 1.3.0
  • parallel-pandas >= 0.2.8
  • RDKit (install via conda: conda install -c conda-forge rdkit)

License

MIT License - see LICENSE file for details

Citation

If you use this library in your research, please cite:

@software{mehc_curation,
  title={MEHC-curation: An Automated Python Framework for High-Quality Molecular Dataset Preparation},
  author={Chinh Pham and Nhat-Anh Nguyen-Dang and Thanh-Hoang Nguyen-Vo and Binh P. Nguyen},
  month={dec},
  year={2025},
  version={1.0.4},
  url={https://github.com/biochem-data-sci/mehc-curation},
  license={MIT},
  doi={10.5281/zenodo.17567530}, 
  publisher={Zenodo}
}

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues and questions, please open an issue on GitHub.
