
Toolkit for harmonizing SMILES strings to canonical + isomeric + Kekulized convention (RDKit / COCONUT 2.0)

Project description

HARMONSMILE: Harmonize SMILES Strings for Cheminformatics and Machine Learning



Description

HARMONSMILE solves a common problem in cheminformatics: SMILES strings for the same molecule look different depending on the source (PubChem, ChEMBL, COCONUT, in-house databases). This inconsistency breaks comparisons, deduplication, and machine learning pipelines that expect a uniform molecular representation.


Purpose

The primary objective of HARMONSMILE is to automate the preparation of molecular datasets for cheminformatics workflows and phase 1 machine learning applications within the computational drug discovery pipeline.

The platform enables:

  • Data Harmonization: Standardizes SMILES strings to a consistent format (canonical + isomeric + Kekulized), ensuring that the same molecule is represented identically across different datasets and sources. It follows the RDKit convention for canonicalization, which is widely adopted in the cheminformatics community.
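
As a rough illustration of what "canonical + isomeric + Kekulized" means in practice, the sketch below calls RDKit directly. The helper name harmonize is hypothetical and is not part of HARMONSMILE's API; the package's own implementation may differ in details such as sanitization or error handling.

```python
from typing import Optional

from rdkit import Chem


def harmonize(smiles: str) -> Optional[str]:
    """Hypothetical helper: canonical + isomeric + Kekulized SMILES, or None."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the input cannot be parsed
        return None
    # Replace aromatic bonds with explicit alternating single/double bonds.
    Chem.Kekulize(mol, clearAromaticFlags=True)
    return Chem.MolToSmiles(
        mol, canonical=True, isomericSmiles=True, kekuleSmiles=True
    )


# Aromatic and Kekulé spellings of benzene collapse to the same string.
print(harmonize("c1ccccc1") == harmonize("C1=CC=CC=C1"))
# Atom order in the input does not matter either.
print(harmonize("OCC") == harmonize("CCO"))
```

Returning None for unparsable input is a design choice of this sketch; it lets a batch job record the failure and continue instead of stopping at the first bad row.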

Installation

pip install harmonsmile

RDKit is required and installed automatically (rdkit>=2022.09).


Quick Start

Python API

Standardize a single SMILES string:

from harmonsmile import RDKitStandardizer

std = RDKitStandardizer()
print(std.to_iso_kek("c1ccccc1"))    # canonical + isomeric + Kekulized
print(std.to_conn_kek("c1ccccc1"))   # canonical + connectivity-only + Kekulized
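
To make the difference between the isomeric and connectivity-only forms concrete, here is a minimal sketch written against RDKit rather than HARMONSMILE. It assumes that "isomeric" corresponds to RDKit's isomericSmiles=True and "connectivity-only" to isomericSmiles=False; the function name kekulized_smiles is hypothetical.

```python
from rdkit import Chem


def kekulized_smiles(smiles: str, keep_stereo: bool) -> str:
    """Hypothetical helper: Kekulized canonical SMILES with or without stereo."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    Chem.Kekulize(mol, clearAromaticFlags=True)
    # isomericSmiles=False drops stereochemistry markers from the output.
    return Chem.MolToSmiles(mol, isomericSmiles=keep_stereo, kekuleSmiles=True)


# The two enantiomers of alanine differ only in the chirality tag.
enantiomer_a = "C[C@H](N)C(=O)O"
enantiomer_b = "C[C@@H](N)C(=O)O"

iso_a = kekulized_smiles(enantiomer_a, keep_stereo=True)
iso_b = kekulized_smiles(enantiomer_b, keep_stereo=True)
conn_a = kekulized_smiles(enantiomer_a, keep_stereo=False)
conn_b = kekulized_smiles(enantiomer_b, keep_stereo=False)

print(iso_a != iso_b)    # True: the isomeric form keeps the enantiomers apart
print(conn_a == conn_b)  # True: the connectivity-only form merges them
```

The isomeric form is the safer default when stereochemistry matters for activity; the connectivity-only form is useful when sources disagree on whether stereo was annotated at all.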

Fetch properties from PubChem and harmonize:

from harmonsmile import PubChemIngest, Config

cfg = Config(
    input_path="examples/example_pubchem.csv",   # columns: id (optional), PubChem CID
    output_path="results/example_pubchem_harmonized.csv",
)
PubChemIngest(cfg).run()

Fetch properties from ChEMBL and harmonize:

from harmonsmile import ChEMBLIngest

ChEMBLIngest(
    input_path="examples/example_chembl.csv",    # columns: id (optional), ChEMBL ID
    output_path="results/example_chembl_harmonized.csv",
).run()
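
Both ingest pipelines look compounds up by identifier over public REST services. The sketch below only builds request URLs for the publicly documented PubChem PUG REST and ChEMBL web-service endpoints; it makes no network calls. Whether HARMONSMILE uses these exact endpoints or property names is not stated here, so treat the function names, the default property list, and the validation rules as assumptions.

```python
import re

PUBCHEM_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
CHEMBL_BASE = "https://www.ebi.ac.uk/chembl/api/data"


def pubchem_property_url(cid, properties=("IsomericSMILES",)) -> str:
    """Hypothetical helper: PUG REST URL for compound properties by CID."""
    cid = int(cid)  # accepts 2723949 or "2723949"
    if cid <= 0:
        raise ValueError("PubChem CIDs are positive integers")
    # Property names are an assumption; check the current PUG REST property list.
    props = ",".join(properties)
    return f"{PUBCHEM_BASE}/compound/cid/{cid}/property/{props}/JSON"


def chembl_molecule_url(chembl_id: str) -> str:
    """Hypothetical helper: ChEMBL web-service URL for one molecule record."""
    normalized = chembl_id.strip().upper()
    if not re.fullmatch(r"CHEMBL\d+", normalized):
        raise ValueError(f"not a ChEMBL ID: {chembl_id!r}")
    return f"{CHEMBL_BASE}/molecule/{normalized}.json"


print(pubchem_property_url(2723949))
print(chembl_molecule_url("CHEMBL294199"))
```

Validating identifiers before building the URL keeps malformed rows out of the request loop, which matters when a CSV contains thousands of entries.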

Harmonize any file with a SMILES column (COCONUT, in-house, etc.):

from harmonsmile import SMILESPrep

SMILESPrep(
    input_path="examples/example_smiles.txt",
    smiles_col="SMILES",                      # any column name
    output_path="results/example_smiles_harmonized.csv",
).run()

Command-Line Interface

# PubChem pipeline
harmonsmile --pubchem-in examples/database1.csv --pubchem-out results/database1_harmonized.csv

# SMILES pipeline (COCONUT, independent, etc.)
harmonsmile --smiles-in examples/database2.csv --smiles-col canonical_smiles \
            --smiles-out results/database2_harmonized.csv

# Both pipelines in one run
harmonsmile \
  --pubchem-in examples/database1.csv --pubchem-out results/database1_harmonized.csv \
  --smiles-in  examples/database2.csv --smiles-col  canonical_smiles \
  --smiles-out results/database2_harmonized.csv

# Single Entry — fetch one compound by ID
harmonsmile --pubchem-cid 2723949
harmonsmile --chembl-id CHEMBL294199

# Check version
harmonsmile --version

Also available as a Python module:

python -m harmonsmile --pubchem-in examples/database1.csv --pubchem-out results/out.csv

Pipelines

Pipeline        Source    Input                               API
PubChemIngest   PubChem   CSV with PubChem CID column         REST (public)
ChEMBLIngest    ChEMBL    CSV with ChEMBL ID column           REST (public)
SMILESPrep      Any       CSV/Excel with any SMILES column    None (local file)

All pipelines append a SMILES_RDKit column with the harmonized SMILES.
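
The "append a column" behavior can be pictured with the standard library alone. This is a sketch under stated assumptions, not the package's pipeline code: append_harmonized is a hypothetical name, the standardizer is passed in as a plain callable, and the demo uses a lookup table as a stand-in for a real RDKit-based standardizer.

```python
import csv
import io
from typing import Callable, Optional


def append_harmonized(src, dst, smiles_col: str,
                      standardize: Callable[[str], Optional[str]],
                      out_col: str = "SMILES_RDKit") -> int:
    """Copy a CSV from src to dst, adding out_col; return the row count."""
    reader = csv.DictReader(src)
    fields = list(reader.fieldnames or [])
    if smiles_col not in fields:
        raise KeyError(f"column {smiles_col!r} not found in {fields}")
    writer = csv.DictWriter(dst, fieldnames=fields + [out_col],
                            lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in reader:
        # Rows that fail to standardize get an empty cell instead of aborting.
        row[out_col] = standardize(row[smiles_col]) or ""
        writer.writerow(row)
        count += 1
    return count


# Stand-in standardizer for the demo only; a real one would call RDKit.
lookup = {"c1ccccc1": "C1=CC=CC=C1"}
source = io.StringIO("id,SMILES\n1,c1ccccc1\n2,???\n")
target = io.StringIO()
append_harmonized(source, target, "SMILES", lookup.get)
print(target.getvalue())
```

Keeping the original SMILES column untouched and writing the result to a new column makes it easy to audit which rows changed and which failed.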


Input Format

Pipeline        Required columns
PubChemIngest   id (optional), PubChem CID
ChEMBLIngest    id (optional), ChEMBL ID
SMILESPrep      id (optional), <smiles_col> (any name)

Supported file formats: CSV, TSV, XLSX, XLS.
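
One simple way to route these formats to the right reader is to dispatch on the file extension. The sketch below is an assumption about how such routing could look, not a description of harmonsmile/io.py; table_kind is a hypothetical name, and rejecting other extensions is a choice made for this sketch only.

```python
from pathlib import Path
from typing import Optional, Tuple

DELIMITERS = {".csv": ",", ".tsv": "\t"}
EXCEL_SUFFIXES = {".xlsx", ".xls"}


def table_kind(path: str) -> Tuple[str, Optional[str]]:
    """Return ("delimited", delimiter) or ("excel", None) for a table path."""
    suffix = Path(path).suffix.lower()  # case-insensitive: .CSV behaves like .csv
    if suffix in DELIMITERS:
        return "delimited", DELIMITERS[suffix]
    if suffix in EXCEL_SUFFIXES:
        return "excel", None
    raise ValueError(f"unsupported table format: {suffix or path!r}")


print(table_kind("examples/database1.csv"))  # ('delimited', ',')
print(table_kind("data/compounds.XLSX"))     # ('excel', None)
```

Delimited files can then go to the csv module with the returned delimiter, while Excel files need a spreadsheet reader such as pandas or openpyxl.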


Roadmap

  • v0.2.0: CoconutIngest, a pipeline that recognizes the COCONUT 2.0 schema automatically (canonical_smiles, identifier, molecular properties).
  • v0.3.0: ML-ready features, including ECFP fingerprints (with/without chirality) and InChI/InChIKey for deduplication and robust cross-database matching.

Development

Project Structure

HARMONSMILE/
├── harmonsmile/
│   ├── __init__.py        # Public API
│   ├── __main__.py        # python -m harmonsmile entry point
│   ├── _cli.py            # CLI implementation
│   ├── chembl.py          # ChEMBL REST client
│   ├── config.py          # Config dataclass
│   ├── io.py              # Table I/O utilities
│   ├── pipelines.py       # PubChemIngest, ChEMBLIngest, SMILESPrep
│   ├── pubchem.py         # PubChem REST client
│   ├── standardize.py     # RDKitStandardizer
│   └── version.py         # Package version metadata
├── tests/                 # Unit test suite (pytest) — 119 tests
├── examples/              # Example scripts and datasets
├── results/               # Output data (not installed)
├── logs/                  # Error logs (not installed)
├── pyproject.toml
├── environment.yml
├── requirements-dev.txt
├── CHANGELOG.md
├── CITATION.cff
├── COPYING
├── COPYING.LESSER
├── LICENSE
└── README.md

Running Tests

python -m pytest tests -p no:cacheprovider --basetemp .pytest_tmp

Contributing

Contributions are welcome. Please open an issue before submitting a pull request. Follow the existing code style: NumPy-style docstrings, type hints, and SPDX license headers in all source files.


Citation

If you use HARMONSMILE in your research, please cite it using the metadata in CITATION.cff or the format below:

Contreras-Torres, F. F. (2026). HARMONSMILE: Harmonize SMILES Strings for
Cheminformatics and Machine Learning (v0.1.3). Tecnologico de Monterrey.
https://github.com/NanoBiostructuresRG/harmonsmile

Author

Developed by Flavio F. Contreras-Torres (Tecnológico de Monterrey), Monterrey, Mexico, May 2026.


License

This project is licensed under the terms of the GNU Lesser General Public License v3.0 or later. SPDX identifier: LGPL-3.0-or-later.
