A comprehensive toolkit for molecular data curation, validation, cleaning, and normalization

MEHC Curation

A comprehensive Python toolkit for molecular data curation, including validation, cleaning, normalization, and refinement pipelines.

Features

  • Validation: Validate SMILES strings and remove unwanted molecular types (mixtures, inorganics, organometallics)
  • Cleaning: Remove salts and neutralize charged molecules
  • Normalization: Normalize tautomers and stereoisomers
  • Refinement: Complete pipeline orchestrating all stages
  • Parallel Processing: Efficient parallel processing using all available CPUs by default
  • Comprehensive Reporting: Generate detailed reports for each processing stage

Installation

Prerequisites

Install RDKit before installing mehc-curation. The recommended route is conda:

conda install -c conda-forge rdkit

Install from PyPI

pip install mehc-curation

Install from source

git clone https://github.com/biochem-data-sci/mehc-curation.git
cd mehc-curation
pip install -e .
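
After installing, it can help to confirm that the required packages are importable from the active environment. The sketch below uses only the standard library; the helper name missing_packages is illustrative and not part of mehc-curation.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Import names: RDKit is imported as "rdkit", mehc-curation as "mehc_curation".
missing = missing_packages(["rdkit", "pandas", "mehc_curation"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If rdkit is reported as missing, install it first with the conda command above and re-run the check.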

Quick Start

Python API

import pandas as pd
from mehc_curation.validation import ValidationStage
from mehc_curation.cleaning import CleaningStage
from mehc_curation.normalization import NormalizationStage
from mehc_curation.refinement import RefinementStage

# Load your SMILES data
df = pd.read_csv("your_data.csv")

# Validation
validator = ValidationStage(df)
validated_df = validator.complete_validation()

# Cleaning
cleaner = CleaningStage(validated_df)
cleaned_df = cleaner.complete_cleaning()

# Normalization
normalizer = NormalizationStage(cleaned_df)
normalized_df = normalizer.complete_normalization()

# Or run the complete refinement pipeline (validation, cleaning, normalization) in one call
refiner = RefinementStage(df)
refined_df = refiner.complete_refinement(
    output_dir="./output",
    get_report=True
)
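
When the stages are run one by one, comparing row counts before and after each stage shows where records are being dropped. The following is a small sketch, assuming each stage returns only the rows it keeps; stage_drop_summary is a hypothetical helper, not part of the library, and it accepts any object that supports len(), including pandas DataFrames.

```python
def stage_drop_summary(stage_outputs):
    """Summarize how many rows each stage removed.

    stage_outputs: ordered list of (label, table) pairs, starting with the
    raw input. Each table only needs to support len().
    Returns a list of (label, rows_kept, rows_dropped) tuples.
    """
    summary = []
    previous = None
    for label, table in stage_outputs:
        kept = len(table)
        dropped = 0 if previous is None else previous - kept
        summary.append((label, kept, dropped))
        previous = kept
    return summary


# Stand-in data; in practice pass df, validated_df, cleaned_df, normalized_df.
example = [
    ("input", ["CCO", "C1CC", "CC(=O)[O-].[Na+]", "CCO"]),
    ("validated", ["CCO", "CC(=O)[O-].[Na+]", "CCO"]),
]
for label, kept, dropped in stage_drop_summary(example):
    print(f"{label}: kept {kept}, dropped {dropped}")
```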

Command Line Interface

# Validation
python -m mehc_curation.validation -i input.csv -o output/ -c 5

# Cleaning
python -m mehc_curation.cleaning -i input.csv -o output/ -c 3

# Normalization
python -m mehc_curation.normalization -i input.csv -o output/ -c 3

# Complete refinement
python -m mehc_curation.refinement -i input.csv -o output/ --get_report
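
The same commands can be driven from a script, for example to refine every CSV file in a folder. The sketch below uses only the -i, -o, and --get_report flags shown above; the function names and the one-output-folder-per-input layout are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_refinement_command(input_csv, output_dir, get_report=True):
    """Build the argument list for the refinement CLI shown above."""
    cmd = [sys.executable, "-m", "mehc_curation.refinement",
           "-i", str(input_csv), "-o", str(output_dir)]
    if get_report:
        cmd.append("--get_report")
    return cmd


def refine_all(input_dir, output_root):
    """Run the refinement CLI once per CSV file found in input_dir."""
    for csv_path in sorted(Path(input_dir).glob("*.csv")):
        out_dir = Path(output_root) / csv_path.stem
        out_dir.mkdir(parents=True, exist_ok=True)
        # Mirror the trailing slash used in the commands above.
        cmd = build_refinement_command(csv_path, f"{out_dir.as_posix()}/")
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(cmd, check=True)


print(" ".join(build_refinement_command("input.csv", "output/")))
```

Calling refine_all("data", "output") would then process data/*.csv one file at a time.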

Modules

Validation Module

Validates SMILES strings and removes unwanted molecular types:

  • validate_smi(): Validate SMILES strings
  • rm_mixture(): Remove mixture compounds
  • rm_inorganic(): Remove inorganic compounds
  • rm_organometallic(): Remove organometallic compounds
  • complete_validation(): Run all validation steps
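
In SMILES notation a period (.) separates disconnected components, so multi-component records can be spotted before any chemistry toolkit is involved. The sketch below illustrates that idea only. It is not how rm_mixture() is implemented, and a real pipeline still has to tell salts and counter-ions apart from true mixtures.

```python
def split_components(smiles):
    """Split a SMILES string into its dot-separated components."""
    return [part for part in smiles.split(".") if part]


def is_multi_component(smiles):
    """Return True if the SMILES has more than one disconnected component."""
    return len(split_components(smiles)) > 1


for smi in ["CCO", "CC(=O)[O-].[Na+]", "CCO.CCN"]:
    print(smi, "->", split_components(smi), is_multi_component(smi))
```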

Cleaning Module

Cleans SMILES strings:

  • cl_salt(): Remove salts from SMILES
  • neutralize(): Neutralize charged molecules
  • complete_cleaning(): Run all cleaning steps
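
To make the two cleaning steps concrete, here is a minimal sketch written directly against RDKit's MolStandardize module. It is an illustration of the general technique, not the code behind cl_salt() or neutralize(), and the function name strip_and_neutralize is hypothetical. It assumes that keeping the largest fragment is an acceptable way to drop counter-ions.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def strip_and_neutralize(smiles):
    """Keep the largest fragment of a SMILES and neutralize its charges.

    Returns a canonical SMILES string, or None if the input cannot be parsed.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Keep only the largest fragment, discarding counter-ions.
    parent = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Add or remove protons to neutralize charges where possible.
    neutral = rdMolStandardize.Uncharger().uncharge(parent)
    return Chem.MolToSmiles(neutral)


print(strip_and_neutralize("CC(=O)[O-].[Na+]"))  # sodium acetate as input
```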

Normalization Module

Normalizes SMILES strings:

  • detautomerize(): Normalize tautomers
  • destereoisomerize(): Normalize stereoisomers
  • complete_normalization(): Run all normalization steps
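
For readers unfamiliar with these operations, the sketch below shows one common way to do each with plain RDKit: picking a canonical tautomer and stripping stereochemistry so that stereoisomers collapse to one SMILES. The function names are hypothetical, and the library's detautomerize() and destereoisomerize() may work differently.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def canonical_tautomer(smiles):
    """Return the SMILES of RDKit's canonical tautomer, or None if unparsable."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    canon = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToSmiles(canon)


def strip_stereo(smiles):
    """Return the SMILES with all stereo annotations removed, or None if unparsable."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    Chem.RemoveStereochemistry(mol)  # modifies mol in place
    return Chem.MolToSmiles(mol)


# Both enantiomers of alanine reduce to the same stereo-free SMILES.
print(strip_stereo("C[C@H](N)C(=O)O"), strip_stereo("C[C@@H](N)C(=O)O"))
print(canonical_tautomer("Oc1ccccn1"))
```

After either step, records that differed only in tautomeric form or stereochemistry share one SMILES string, which is what makes later de-duplication possible.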

Refinement Module

Complete refinement pipeline:

  • complete_refinement(): Orchestrates validation, cleaning, and normalization stages

Configuration

CPU Usage

By default, the library uses all available CPUs (n_cpu=-1). To limit parallelism, pass the number of CPUs explicitly:

# Use all CPUs (default)
refiner.complete_refinement(n_cpu=-1)

# Use specific number of CPUs
refiner.complete_refinement(n_cpu=4)

# Use single CPU
refiner.complete_refinement(n_cpu=1)
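
As an illustration of the -1 convention, the sketch below shows how such a setting could be turned into a concrete worker count. This is not the library's code: resolve_n_cpu is a hypothetical helper, and capping the request at the number of available CPUs is a design choice made here, not documented behavior.

```python
import os


def resolve_n_cpu(n_cpu=-1):
    """Translate an n_cpu setting into a concrete worker count."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    if n_cpu == -1:
        return available
    if n_cpu < 1:
        raise ValueError("n_cpu must be -1 or a positive integer")
    # Never ask for more workers than the machine has CPUs.
    return min(n_cpu, available)


print("Workers for n_cpu=-1:", resolve_n_cpu(-1))
print("Workers for n_cpu=1:", resolve_n_cpu(1))
```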

Requirements

  • Python >= 3.7
  • pandas >= 1.3.0
  • parallel-pandas >= 0.2.8
  • RDKit (install via conda: conda install -c conda-forge rdkit)

License

MIT License - see LICENSE file for details

Citation

If you use this library in your research, please cite:

@software{mehc_curation,
  title={MEHC-curation: An Automated Python Framework for High-Quality Molecular Dataset Preparation},
  author={Chinh Pham and Nhat-Anh Nguyen-Dang and Thanh-Hoang Nguyen-Vo and Binh P. Nguyen},
  month={dec},
  year={2025},
  version={1.0.2},
  url={https://github.com/biochem-data-sci/mehc-curation},
  license={MIT},
  doi={10.5281/zenodo.17562247},
  publisher={Zenodo}
}

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues and questions, please open an issue on GitHub.
