A comprehensive toolkit for molecular data curation, validation, cleaning, and normalization
MEHC Curation
A comprehensive Python toolkit for molecular data curation, including validation, cleaning, normalization, and refinement pipelines.
Features
- Validation: Validate SMILES strings and remove unwanted molecular types (mixtures, inorganics, organometallics)
- Cleaning: Remove salts and neutralize charged molecules
- Normalization: Normalize tautomers and stereoisomers
- Refinement: Complete pipeline orchestrating all stages
- Parallel Processing: Runs on all available CPUs by default (n_cpu=-1)
- Comprehensive Reporting: Generate detailed reports for each processing stage
Installation
Prerequisites
RDKit must be installed before mehc-curation; conda is the recommended way to install it:
conda install -c conda-forge rdkit
Install from PyPI
pip install mehc-curation
Install from source
git clone https://github.com/biochem-data-sci/mehc-curation.git
cd mehc-curation
pip install -e .
Quick Start
Python API
import pandas as pd
from mehc_curation.validation import ValidationStage
from mehc_curation.cleaning import CleaningStage
from mehc_curation.normalization import NormalizationStage
from mehc_curation.refinement import RefinementStage
# Load your SMILES data
df = pd.read_csv("your_data.csv")
# Validation
validator = ValidationStage(df)
validated_df = validator.complete_validation()
# Cleaning
cleaner = CleaningStage(validated_df)
cleaned_df = cleaner.complete_cleaning()
# Normalization
normalizer = NormalizationStage(cleaned_df)
normalized_df = normalizer.complete_normalization()
# Complete refinement pipeline
refiner = RefinementStage(df)
refined_df = refiner.complete_refinement(
    output_dir="./output",
    get_report=True
)
Command Line Interface
# Validation
python -m mehc_curation.validation -i input.csv -o output/ -c 5
# Cleaning
python -m mehc_curation.cleaning -i input.csv -o output/ -c 3
# Normalization
python -m mehc_curation.normalization -i input.csv -o output/ -c 3
# Complete refinement
python -m mehc_curation.refinement -i input.csv -o output/ --get_report
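When the refinement step has to be launched from another Python script, the command above can be assembled with the standard library. The following is a minimal sketch that assumes only the flags shown in this README (-i, -o, --get_report); the helper names build_refinement_command and run_refinement are hypothetical and not part of mehc-curation.

```python
import subprocess
import sys


def build_refinement_command(input_csv, output_dir, get_report=True):
    """Assemble the argument list for the refinement CLI shown above.

    Hypothetical helper, not part of mehc-curation. It only uses the
    flags that appear in this README: -i, -o and --get_report.
    """
    cmd = [
        sys.executable, "-m", "mehc_curation.refinement",
        "-i", str(input_csv),
        "-o", str(output_dir),
    ]
    if get_report:
        cmd.append("--get_report")
    return cmd


def run_refinement(input_csv, output_dir, get_report=True):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    cmd = build_refinement_command(input_csv, output_dir, get_report)
    return subprocess.run(cmd, check=True)
```

Using sys.executable keeps the call inside the interpreter (and conda environment) that is already running, which is usually the one where RDKit is installed.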
Modules
Validation Module
Validates SMILES strings and removes unwanted molecular types:
- validate_smi(): Validate SMILES strings
- rm_mixture(): Remove mixture compounds
- rm_inorganic(): Remove inorganic compounds
- rm_organometallic(): Remove organometallic compounds
- complete_validation(): Run all validation steps
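To make the idea of a mixture concrete: in SMILES notation a dot separates disconnected components, so a multi-component record can be spotted at the string level. The sketch below is an illustration only and is not how rm_mixture() is implemented (the library relies on RDKit, and it may treat salts and mixtures differently); the function names are hypothetical.

```python
def split_components(smiles):
    """Split a SMILES string on '.', the separator for disconnected parts."""
    return [part for part in smiles.strip().split(".") if part]


def looks_like_mixture(smiles):
    """Return True when the string holds more than one dot-separated part.

    Simplification: ring-closure digits can bond atoms across a dot
    (e.g. "C1.C1"), which this string-level check does not detect.
    """
    return len(split_components(smiles)) > 1


records = ["CCO", "CC(=O)O.CCN", "c1ccccc1"]
single_component = [smi for smi in records if not looks_like_mixture(smi)]
print(single_component)  # ['CCO', 'c1ccccc1']
```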
Cleaning Module
Cleans SMILES strings:
- cl_salt(): Remove salts from SMILES
- neutralize(): Neutralize charged molecules
- complete_cleaning(): Run all cleaning steps
Normalization Module
Normalizes SMILES strings:
- detautomerize(): Normalize tautomers
- destereoisomerize(): Remove stereoisomers
- complete_normalization(): Run all normalization steps
Refinement Module
Complete refinement pipeline:
complete_refinement(): Orchestrates validation, cleaning, and normalization stages
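The orchestration pattern itself, running stages in sequence and recording how many records survive each one, can be sketched in plain Python. Everything below is an assumption made for illustration: run_pipeline and the three stand-in stages are hypothetical, operate on raw strings, and do not reproduce the RDKit-based logic or the report format of complete_refinement().

```python
def run_pipeline(smiles_list, stages):
    """Apply (name, function) stages in order and record surviving counts."""
    report = [("input", len(smiles_list))]
    current = list(smiles_list)
    for name, stage in stages:
        current = stage(current)
        report.append((name, len(current)))
    return current, report


# Stand-in stages for illustration only; the real stages use RDKit.
def drop_empty(items):
    return [s for s in items if s.strip()]


def drop_multi_component(items):
    return [s for s in items if "." not in s]


def drop_duplicates(items):
    return list(dict.fromkeys(items))  # keeps first occurrence, in order


data = ["CCO", "", "CCO", "CCN.Cl", "c1ccccc1"]
result, report = run_pipeline(
    data,
    [
        ("drop_empty", drop_empty),
        ("drop_multi_component", drop_multi_component),
        ("drop_duplicates", drop_duplicates),
    ],
)
print(result)  # ['CCO', 'c1ccccc1']
print(report)
# [('input', 5), ('drop_empty', 4), ('drop_multi_component', 3), ('drop_duplicates', 2)]
```

Keeping a per-stage count makes it easy to see which step removes the most records, which is the kind of information a stage report is meant to surface.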
Configuration
CPU Usage
By default, the library uses all available CPUs (n_cpu=-1). You can specify the number of CPUs:
# Use all CPUs (default)
refiner.complete_refinement(n_cpu=-1)
# Use specific number of CPUs
refiner.complete_refinement(n_cpu=4)
# Use single CPU
refiner.complete_refinement(n_cpu=1)
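A value of -1 is a common convention for "use every core". The sketch below shows one way such a setting can be resolved to a concrete worker count. It is an assumption about the convention, not code taken from mehc-curation, and resolve_n_cpu is a hypothetical name.

```python
import os


def resolve_n_cpu(n_cpu=-1):
    """Translate an n_cpu setting into a concrete worker count.

    Assumed convention (not taken from the mehc-curation source):
    -1 means "all CPUs reported by the OS"; positive values are capped
    at that number.
    """
    available = os.cpu_count() or 1  # cpu_count() may return None
    if n_cpu == -1:
        return available
    if n_cpu < 1:
        raise ValueError("n_cpu must be -1 or a positive integer")
    return min(n_cpu, available)
```

On a shared machine, passing an explicit value such as n_cpu=4 leaves cores free for other users, while n_cpu=1 is convenient for debugging because it avoids parallel execution.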
Requirements
- Python >= 3.7
- pandas >= 1.3.0
- parallel-pandas >= 0.2.8
- RDKit (install via conda: conda install -c conda-forge rdkit)
License
MIT License - see LICENSE file for details
Citation
If you use this library in your research, please cite:
@software{mehc_curation,
  title={MEHC-curation: An Automated Python Framework for High-Quality Molecular Dataset Preparation},
  author={Chinh Pham and Nhat-Anh Nguyen-Dang and Thanh-Hoang Nguyen-Vo and Binh P. Nguyen},
  month={dec},
  year={2025},
  version={1.0.2},
  url={https://github.com/biochem-data-sci/mehc-curation},
  license={MIT},
  doi={10.5281/zenodo.17562247},
  publisher={Zenodo}
}
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Support
For issues and questions, please open an issue on GitHub.