A cheminformatics algorithm to classify homologous series.

An Algorithm to Classify Homologous Series

Introduction

Homologous series are groups of chemical compounds that share the same core structure(s) but differ in the number of repeating units (RUs) connected end-to-end.

This is an open-source algorithm to classify homologous series within compound datasets provided as SMILES, implemented using the RDKit.

For example, these series were classified in COCONUT and the NORMAN Suspect List Exchange, datasets containing natural products and environmental chemicals respectively.

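CH2 Repeating Unit: [figure: example homologous series from COCONUT]

CF2 Repeating Unit: [figure: example homologous series from the NORMAN Suspect List Exchange]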

Requirements

The algorithm requires RDKit to be installed via conda-forge.

$ conda create -c conda-forge -n my-rdkit-env rdkit
$ conda activate my-rdkit-env

Installation

$ git clone https://github.com/adelenelai/onglai-classify-homologues
$ cd onglai-classify-homologues
$ pip install -e .

Note that installing the package with pip alone is not enough: the repository must also be cloned from GitHub, because the algorithm is run as a script (see Usage).

Alternatively:

#from PyPI
$ pip install onglai-classify-homologues

Usage

Run:

$ python nextgen_classify_homols.py [-in <arg>] [-s <arg>] [-n <arg>] [-ru <arg>] [-min <arg>] [-max <arg>] [-f <arg>] 2>log

Flag                      Description
-in   --input_csv         path to input CSV containing SMILES and Name columns
-s    --smiles            name of column containing SMILES. Default is SMILES.
-n    --names             name of column containing Names. Default is Name.
-ru   --repeatingunits    chemical RU as SMARTS, enclosed within quotation marks. Default is CH2, i.e., '[#6&H2]'.
-min  --min_RU_in         minimum length of RU chain. Default is 3.
-max  --max__RU_in        maximum length of RU chain. Default is 30.
-f    --frag_steps        number of times to fragment molecules to obtain cores. Default is 2.
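
To show how the flags and defaults listed above fit together, here is a minimal sketch of an equivalent command-line interface built with Python's argparse. It is an assumption made for illustration only, not the source of nextgen_classify_homols.py: the function name build_parser and the help strings are hypothetical, and the long option names are copied exactly as listed above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Sketch of the documented command-line interface."
    )
    parser.add_argument("-in", "--input_csv",
                        help="path to input CSV containing SMILES and Name columns")
    parser.add_argument("-s", "--smiles", default="SMILES",
                        help="name of column containing SMILES")
    parser.add_argument("-n", "--names", default="Name",
                        help="name of column containing Names")
    parser.add_argument("-ru", "--repeatingunits", default="[#6&H2]",
                        help="chemical RU as SMARTS")
    parser.add_argument("-min", "--min_RU_in", type=int, default=3,
                        help="minimum length of RU chain")
    # Long option spelled as it appears in the flag list above.
    parser.add_argument("-max", "--max__RU_in", type=int, default=30,
                        help="maximum length of RU chain")
    parser.add_argument("-f", "--frag_steps", type=int, default=2,
                        help="number of fragmentation steps used to obtain cores")
    return parser


# Parse an example command line from a list instead of sys.argv.
args = build_parser().parse_args(["-in", "../../tests/test1_23.csv", "-min", "4"])
print(vars(args))
```

Any flag that is omitted falls back to its documented default, so only -in needs to be supplied for an input file that already uses the SMILES and Name column headers.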

Try:

$ python nextgen_classify_homols.py -in ../../tests/test1_23.csv -s SMILES -n Name -ru '[#6&H2]' -min 3 -max 30 -f 2 2>log
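
The input passed to -in is a CSV with one column of SMILES and one column of names. As a rough sketch of preparing such a file with Python's standard csv module, assuming the default column headers SMILES and Name (the helper write_input_csv, the file name and the three example compounds are illustrative and not part of the package):

```python
import csv
import tempfile
from pathlib import Path


def write_input_csv(path, compounds, smiles_col="SMILES", name_col="Name"):
    """Write (name, SMILES) pairs to a CSV with the two expected columns."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[smiles_col, name_col])
        writer.writeheader()
        for name, smiles in compounds:
            writer.writerow({smiles_col: smiles, name_col: name})


# Illustrative compounds: straight-chain carboxylic acids differing by CH2 units.
compounds = [
    ("butanoic acid", "CCCC(=O)O"),
    ("pentanoic acid", "CCCCC(=O)O"),
    ("hexanoic acid", "CCCCCC(=O)O"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "my_compounds.csv"  # hypothetical file name
    write_input_csv(csv_path, compounds)
    # Read the file back to check the header and rows.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows_read = list(csv.DictReader(handle))

print(rows_read[0])
```

If the column headers in an existing file differ from the defaults, they can be named with -s and -n instead of rewriting the file.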

Successful classification will generate an output directory containing the following files:

  1. A TXT file containing the summary of classification results and explanation of outputs (series_no codes)
  2. A CSV file containing 8 columns: series_no, cpd_name, CanoSmiles_FinalCores, SMILES, InChI, InChIKey, molecular_formula and monoisotopic_mass. The first column series_no contains the results of the homologous series classification. CanoSmiles_FinalCores indicates the common core shared by all members within a given series. The remaining columns contain information calculated based on the SMILES.
  3. A TXT file of unparseable SMILES that were removed (empty if all SMILES were parsed successfully)
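
Once the output CSV exists, its series_no column can be used to collect the members of each series. The following is a minimal sketch using only the standard library, assuming the eight column names listed above. The in-memory sample stands in for a real output file: its series_no values (A, B) and compound names are placeholders, the remaining columns are left empty, and the real series_no codes are explained in the summary TXT file. The helper group_by_series is hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_series(rows):
    """Map each series_no value to the names of the compounds assigned to it."""
    series = defaultdict(list)
    for row in rows:
        series[row["series_no"]].append(row["cpd_name"])
    return dict(series)


# Placeholder stand-in for the output CSV (values are not real results).
sample_csv = io.StringIO(
    "series_no,cpd_name,CanoSmiles_FinalCores,SMILES,InChI,InChIKey,"
    "molecular_formula,monoisotopic_mass\n"
    "A,compound_1,,,,,,\n"
    "A,compound_2,,,,,,\n"
    "B,compound_3,,,,,,\n"
)

grouped = group_by_series(csv.DictReader(sample_csv))
for series_no, names in grouped.items():
    print(series_no, names)
```

For a real run, the StringIO object would be replaced by the opened output CSV file.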

Reproducing Classification described in Lai et al.

The classification used the default settings described above. The commands below run on the sample datasets provided in input/; the full datasets have been archived on Zenodo (amend -in accordingly to classify them).

#activate your rdkit environment

#NORMAN-SLE
$ python nextgen_classify_homols.py -in ../../input/pubchem_norman_sle_tree_parentcid_98116_2022-03-21_from115115_trial.csv -s isosmiles -n cmpdname 2>log

#PubChemLite
$ python nextgen_classify_homols.py -in ../../input/PubChemLite_exposomics_20220225_trial.csv -n CompoundName 2>log

#COCONUT
$ python nextgen_classify_homols.py -in ../../input/COCONUT_DB_2021-11_trial.txt 2>log

References and Links

  • Lai, A., Schaub, J., Steinbeck, C., Schymanski, E. L. An Algorithm to Classify Homologous Series in Compound Datasets (preprint).
  • Poster presented at the 17th German Cheminformatics Conference, Garmisch-Partenkirchen, Germany (May 8-10, 2022)

License

This project is licensed under Apache 2.0 - see LICENSE for details.

Our Research Groups

Environmental Cheminformatics Group at the

