Toolkit for harmonizing SMILES strings to canonical + isomeric + Kekulized convention (RDKit / COCONUT 2.0)
HARMONSMILE: Harmonize SMILES Strings for Cheminformatics and Machine Learning
Description
HARMONSMILE solves a common problem in cheminformatics: SMILES strings for the same molecule look different depending on the source (PubChem, ChEMBL, COCONUT, in-house databases). This inconsistency breaks comparisons, deduplication, and machine learning pipelines that expect a uniform molecular representation.
Purpose
The primary objective of HARMONSMILE is to automate the preparation of molecular datasets for cheminformatics workflows and phase 1 machine learning applications within the computational drug discovery pipeline.
The platform enables:
- Data Harmonization: Standardizes SMILES strings to a consistent format (canonical + isomeric + Kekulized), ensuring that the same molecule is represented identically across different datasets and sources. It follows the RDKit convention for canonicalization, which is widely adopted in the cheminformatics community.
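To make the target convention concrete, here is a minimal sketch of how a canonical + isomeric + Kekulized string can be produced with RDKit directly. It is not HARMONSMILE's implementation: the `harmonize` helper and its `isomeric` flag are hypothetical names used only for illustration, and the package may apply additional cleanup steps.

```python
from rdkit import Chem


def harmonize(smiles, isomeric=True):
    """Return a canonical, Kekulized SMILES, or None if parsing fails.

    Hypothetical helper for illustration; not part of the harmonsmile API.
    """
    mol = Chem.MolFromSmiles(smiles)  # returns None for unparsable input
    if mol is None:
        return None
    # Convert aromatic bonds to explicit single/double bonds, in place.
    Chem.Kekulize(mol, clearAromaticFlags=True)
    return Chem.MolToSmiles(
        mol,
        canonical=True,
        isomericSmiles=isomeric,  # keep or drop stereo labels
        kekuleSmiles=True,
    )


# Two spellings of benzene collapse to the same string.
print(harmonize("c1ccccc1") == harmonize("C1=CC=CC=C1"))
# The stereo label is kept in isomeric mode and dropped otherwise.
print(harmonize("N[C@@H](C)C(=O)O"))
print(harmonize("N[C@@H](C)C(=O)O", isomeric=False))
```

The two modes roughly mirror the isomeric and connectivity-only variants mentioned in the Quick Start, under the assumption that "connectivity-only" means stereochemistry is omitted.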
Installation
pip install harmonsmile
RDKit is required and installed automatically (rdkit>=2022.09).
Quick Start
Python API
Standardize a single SMILES string:
from harmonsmile import RDKitStandardizer
std = RDKitStandardizer()
print(std.to_iso_kek("c1ccccc1")) # canonical + isomeric + Kekulized
print(std.to_conn_kek("c1ccccc1")) # canonical + connectivity-only + Kekulized
Fetch properties from PubChem and harmonize:
from harmonsmile import PubChemIngest, Config
cfg = Config(
    input_path="examples/example_pubchem.csv",  # requires: id, PubChem CID
    output_path="results/example_pubchem_harmonized.csv",
)
PubChemIngest(cfg).run()
Fetch properties from ChEMBL and harmonize:
from harmonsmile import ChEMBLIngest
ChEMBLIngest(
    input_path="examples/example_chembl.csv",  # requires: id, ChEMBL ID
    output_path="results/example_chembl_harmonized.csv",
).run()
Harmonize any file with a SMILES column (COCONUT, in-house, etc.):
from harmonsmile import SMILESPrep
SMILESPrep(
    input_path="examples/example_smiles.txt",
    smiles_col="SMILES",  # any column name
    output_path="results/example_smiles_harmonized.csv",
).run()
Command-Line Interface
# PubChem pipeline
harmonsmile --pubchem-in examples/database1.csv --pubchem-out results/database1_harmonized.csv
# SMILES pipeline (COCONUT, independent, etc.)
harmonsmile --smiles-in examples/database2.csv --smiles-col canonical_smiles \
--smiles-out results/database2_harmonized.csv
# Both pipelines in one run
harmonsmile \
--pubchem-in examples/database1.csv --pubchem-out results/database1_harmonized.csv \
--smiles-in examples/database2.csv --smiles-col canonical_smiles \
--smiles-out results/database2_harmonized.csv
# Single Entry — fetch one compound by ID
harmonsmile --pubchem-cid 2723949
harmonsmile --chembl-id CHEMBL294199
# Check version
harmonsmile --version
Also available as a Python module:
python -m harmonsmile --pubchem-in examples/database1.csv --pubchem-out results/out.csv
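For batch jobs, the SMILES pipeline can also be driven from a Python script. The sketch below only assembles the command line from the flags shown above. `smiles_pipeline_cmd`, `run_batch`, and the `dry_run` switch are hypothetical names, and actually executing the command assumes `harmonsmile` is installed in the active environment and that the module entry point accepts the same flags as the `harmonsmile` command.

```python
import shlex
import subprocess
import sys


def smiles_pipeline_cmd(in_path, smiles_col, out_path):
    """Build the argv list for one SMILES-pipeline run (hypothetical helper)."""
    return [
        sys.executable, "-m", "harmonsmile",
        "--smiles-in", in_path,
        "--smiles-col", smiles_col,
        "--smiles-out", out_path,
    ]


def run_batch(jobs, dry_run=True):
    """Run (or just print) one command per (input, column, output) tuple."""
    commands = [smiles_pipeline_cmd(*job) for job in jobs]
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))  # show the command without executing it
        else:
            subprocess.run(cmd, check=True)  # raises if the CLI exits non-zero
    return commands


# File names follow the CLI examples above.
commands = run_batch([
    ("examples/database2.csv", "canonical_smiles", "results/database2_harmonized.csv"),
])
```

Keeping `dry_run=True` as the default lets you inspect the generated commands before anything is written to disk.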
Pipelines
| Pipeline | Source | Input | API |
|---|---|---|---|
| PubChemIngest | PubChem | CSV with PubChem CID column | REST (public) |
| ChEMBLIngest | ChEMBL | CSV with ChEMBL ID column | REST (public) |
| SMILESPrep | Any | CSV/Excel with any SMILES column | None (local file) |
All pipelines append a SMILES_RDKit column with the harmonized SMILES.
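Because every pipeline writes the same SMILES_RDKit column, outputs from different sources can be merged and deduplicated on it. The following is a minimal sketch using only the standard library; the sample rows, the `source` column, and the `deduplicate` helper are hypothetical, and only the SMILES_RDKit column name comes from this README.

```python
import csv
import io

# Hypothetical merged output; a real file would come from the pipelines.
HARMONIZED_CSV = """id,source,SMILES_RDKit
1,pubchem,CCO
2,chembl,CCO
3,coconut,C1=CC=CC=C1
"""


def deduplicate(rows, key="SMILES_RDKit"):
    """Keep the first row seen for each harmonized SMILES string."""
    seen = set()
    unique = []
    for row in rows:
        smiles = row[key]
        if smiles in seen:
            continue  # same molecule already kept from an earlier source
        seen.add(smiles)
        unique.append(row)
    return unique


rows = list(csv.DictReader(io.StringIO(HARMONIZED_CSV)))
unique_rows = deduplicate(rows)
print([row["id"] for row in unique_rows])
```

Exact string matching like this only works because the strings were harmonized first; comparing raw SMILES from different sources would miss duplicates.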
Input Format
| Pipeline | Required columns |
|---|---|
| PubChemIngest | id (optional), PubChem CID |
| ChEMBLIngest | id (optional), ChEMBL ID |
| SMILESPrep | id (optional), `<smiles_col>` (any name) |
Supported file formats: CSV, TSV, XLSX, XLS.
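As an illustration of the expected layout, the sketch below writes a small PubChem input table with the standard library. It assumes the header names are exactly `id` and `PubChem CID` as listed above; the `write_pubchem_input` helper and the `cmpd_N` identifiers are hypothetical.

```python
import csv
import io


def write_pubchem_input(handle, cids):
    """Write an input table with the columns listed for PubChemIngest."""
    writer = csv.writer(handle)
    writer.writerow(["id", "PubChem CID"])
    for index, cid in enumerate(cids, start=1):
        # The id column is optional; these labels are placeholders.
        writer.writerow([f"cmpd_{index}", cid])


# An in-memory buffer stands in for a file opened with newline="".
buffer = io.StringIO()
write_pubchem_input(buffer, [2723949, 2244])
print(buffer.getvalue())
```

To produce a real file, pass a handle from `open(path, "w", newline="")` instead of the in-memory buffer.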
Roadmap
- v0.2.0: CoconutIngest, which recognizes the COCONUT 2.0 schema automatically (canonical_smiles, identifier, molecular properties).
- v0.3.0: ML-ready features, including ECFP fingerprints (with/without chirality) and InChI/InChIKey for deduplication and robust cross-database matching.
Development
Project Structure
HARMONSMILE/
├── harmonsmile/
│ ├── __init__.py # Public API
│ ├── __main__.py # python -m harmonsmile entry point
│ ├── _cli.py # CLI implementation
│ ├── chembl.py # ChEMBL REST client
│ ├── config.py # Config dataclass
│ ├── io.py # Table I/O utilities
│ ├── pipelines.py # PubChemIngest, ChEMBLIngest, SMILESPrep
│ ├── pubchem.py # PubChem REST client
│ ├── standardize.py # RDKitStandardizer
│ └── version.py # Package version metadata
├── tests/ # Unit test suite (pytest) — 119 tests
├── examples/ # Example scripts and datasets
├── results/ # Output data (not installed)
├── logs/ # Error logs (not installed)
├── pyproject.toml
├── environment.yml
├── requirements-dev.txt
├── CHANGELOG.md
├── CITATION.cff
├── COPYING
├── COPYING.LESSER
├── LICENSE
└── README.md
Running Tests
python -m pytest tests -p no:cacheprovider --basetemp .pytest_tmp
Contributing
Contributions are welcome. Please open an issue before submitting a pull request. Follow the existing code style: NumPy-style docstrings, type hints, and SPDX license headers in all source files.
Citation
If you use HARMONSMILE in your research, please cite it using the metadata in CITATION.cff or the format below:
Contreras-Torres, F. F. (2026). HARMONSMILE: Harmonize SMILES Strings for
Cheminformatics and Machine Learning (v0.1.3). Tecnologico de Monterrey.
https://github.com/NanoBiostructuresRG/harmonsmile
Author
Developed by Flavio F. Contreras-Torres (Tecnológico de Monterrey), Monterrey, Mexico, May 2026
License
This project is licensed under the terms of the
GNU Lesser General Public License v3.0 or later.
SPDX identifier: LGPL-3.0-or-later.