Name-to-SMILES conversion

cholla_chem

cholla_chem is a library for performant, comprehensive, and customizable name-to-SMILES conversion.

This library can use existing name-to-SMILES resolvers, including OPSIN, PubChem, and CIRpy.

This library also implements the following new resolvers:

  • Manually curated dataset of common names not correctly resolved by other resolvers (e.g. 'NaH')
  • Structural formula resolver (e.g. 'CH3CH2CH2COOH')
  • Inorganic shorthand resolver (e.g. '[Cp*RhCl2]2')

The following string editing/manipulation strategies may be applied to compounds to assist with name-to-SMILES resolution:

  • String sanitization for special characters and mojibake encoding errors
  • Name correction for OCR errors, typos, pagination errors, etc.
  • Splitting compounds on common delimiters (useful for mixtures of compounds, e.g. 'BH3•THF')
  • Peptide shorthand expansion (e.g. 'cyclo(Asp-Arg-Val-Tyr-Ile-His-Pro-Phe)' -> 'cyclo(l-aspartyl-l-arginyl-l-valyl-l-tyrosyl-l-isoleucyl-l-histidyl-l-prolyl-l-phenylalanyl)')
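Two of the strategies above, delimiter splitting and peptide shorthand expansion, can be illustrated with a minimal sketch. This is not cholla_chem's implementation: the function names `split_on_delimiters` and `expand_peptide_shorthand`, the delimiter set, and the residue table are assumptions made for illustration, and the table covers only the residues that appear in the example above.

```python
import re

# Hypothetical delimiter set; the delimiters cholla_chem actually splits on may differ.
DELIMITERS = "•·"

# Three-letter residue codes mapped to acyl names (only the eight used in the example).
RESIDUE_ACYL_NAMES = {
    "Asp": "l-aspartyl",
    "Arg": "l-arginyl",
    "Val": "l-valyl",
    "Tyr": "l-tyrosyl",
    "Ile": "l-isoleucyl",
    "His": "l-histidyl",
    "Pro": "l-prolyl",
    "Phe": "l-phenylalanyl",
}


def split_on_delimiters(name: str) -> list[str]:
    """Split a mixture name into component names, dropping empty pieces."""
    parts = re.split(f"[{re.escape(DELIMITERS)}]", name)
    return [part.strip() for part in parts if part.strip()]


def expand_peptide_shorthand(name: str) -> str:
    """Expand 'cyclo(Xaa-Yaa-...)' into acyl names; return other input unchanged."""
    match = re.fullmatch(r"cyclo\(([A-Za-z]{3}(?:-[A-Za-z]{3})*)\)", name)
    if match is None:
        return name
    residues = match.group(1).split("-")
    if any(residue not in RESIDUE_ACYL_NAMES for residue in residues):
        return name  # unknown residue: leave the name untouched
    return "cyclo(" + "-".join(RESIDUE_ACYL_NAMES[r] for r in residues) + ")"


print(split_on_delimiters("BH3•THF"))
print(expand_peptide_shorthand("cyclo(Asp-Arg-Val-Tyr-Ile-His-Pro-Phe)"))
```

Returning the input unchanged when a pattern does not match keeps each strategy safe to apply to every name, so a resolver still sees the original string when no rewrite applies.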

When resolvers disagree on the SMILES for a given compound, a variety of SMILES selection methods can be employed to determine the "best" SMILES for a given compound name. See the documentation for more details.

Installation

Install cholla_chem with pip directly from this repo:

pip install git+https://github.com/denovochem/cholla_chem.git

Basic usage

Resolve chemical names to SMILES by passing a string or a list of strings:

from cholla_chem import resolve_compounds_to_smiles

resolved_smiles = resolve_compounds_to_smiles(compounds_list=['aspirin'])

"{'aspirin': 'CC(=O)Oc1ccccc1C(=O)O'}"

Pass detailed_name_dict=True to see detailed information, including which resolver returned which SMILES:

from cholla_chem import resolve_compounds_to_smiles

resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['2-acetyloxybenzoic acid'], 
    detailed_name_dict=True
)

"{'2-acetyloxybenzoic acid': {
    'SMILES': 'CC(=O)Oc1ccccc1C(=O)O',
    'SMILES_source': ['pubchem_default', 'opsin_default'],
    'SMILES_dict': {
        'CC(=O)Oc1ccccc1C(=O)O': ['pubchem_default', 'opsin_default']
    },
    'additional_info': {}
}}"
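The detailed output lends itself to simple post-processing. The sketch below assumes the result is a Python dict with the shape shown above; the helper name `summarize_detailed_result` is hypothetical and not part of cholla_chem. It reports, for each compound, the selected SMILES and how many distinct candidate SMILES the resolvers returned.

```python
def summarize_detailed_result(detailed: dict) -> list[tuple[str, str, int]]:
    """Return (name, selected SMILES, number of distinct candidate SMILES) per compound."""
    rows = []
    for name, info in detailed.items():
        candidates = info.get("SMILES_dict", {})
        rows.append((name, info.get("SMILES", ""), len(candidates)))
    return rows


# Literal copied from the example output above, so no resolver call is needed here.
example = {
    "2-acetyloxybenzoic acid": {
        "SMILES": "CC(=O)Oc1ccccc1C(=O)O",
        "SMILES_source": ["pubchem_default", "opsin_default"],
        "SMILES_dict": {
            "CC(=O)Oc1ccccc1C(=O)O": ["pubchem_default", "opsin_default"]
        },
        "additional_info": {},
    }
}

for name, smiles, n_candidates in summarize_detailed_result(example):
    flag = "resolvers disagree" if n_candidates > 1 else "single candidate"
    print(f"{name}\t{smiles}\t{flag}")
```

More than one key in SMILES_dict means the resolvers returned different structures, which makes those compounds natural candidates for manual review.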

Advanced usage

Many aspects of the name-to-SMILES resolution process can be customized, including the resolvers that are used, the configuration of those resolvers, and the strategy used to pick the best SMILES.

In this example, we resolve chemical names with OPSIN, PubChem, and CIRpy, and use a custom consensus weighting approach to pick the best SMILES:

from cholla_chem import (
    OpsinNameResolver,
    PubChemNameResolver,
    CIRpyNameResolver,
    resolve_compounds_to_smiles,
)

opsin_resolver = OpsinNameResolver(
    resolver_name='opsin',
    resolver_weight=4
)
pubchem_resolver = PubChemNameResolver(
    resolver_name='pubchem',
    resolver_weight=3
)
cirpy_resolver = CIRpyNameResolver(
    resolver_name='cirpy',
    resolver_weight=2
)

resolved_smiles = resolve_compounds_to_smiles(
    compounds_list=['2-acetyloxybenzoic acid'],
    resolvers_list=[opsin_resolver, pubchem_resolver, cirpy_resolver],
    smiles_selection_mode='weighted',
    detailed_name_dict=True
)

"{'2-acetyloxybenzoic acid': {
    'SMILES': 'CC(=O)Oc1ccccc1C(=O)O',
    'SMILES_source': ['opsin', 'pubchem', 'cirpy'],
    'SMILES_dict': {
        'CC(=O)Oc1ccccc1C(=O)O': ['opsin', 'pubchem', 'cirpy']
    },
    'additional_info': {}
}}"
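To make the idea of weighted consensus concrete, here is a minimal sketch of one way such a vote could work. It is an assumption about the general technique, not a description of what smiles_selection_mode='weighted' does internally. The function name `pick_weighted_smiles` is hypothetical, the candidate SMILES are invented to show a disagreement, and the sketch assumes the candidate strings have already been canonicalized so that identical structures compare equal.

```python
from collections import defaultdict


def pick_weighted_smiles(
    smiles_dict: dict[str, list[str]],
    weights: dict[str, float],
) -> str:
    """Return the SMILES whose supporting resolvers have the largest total weight."""
    if not smiles_dict:
        raise ValueError("no candidate SMILES to choose from")
    totals = defaultdict(float)
    for smiles, resolvers in smiles_dict.items():
        for resolver in resolvers:
            totals[smiles] += weights.get(resolver, 1.0)  # unknown resolvers count as 1
    # max() keeps the first candidate on ties, so ordering in smiles_dict breaks ties.
    return max(smiles_dict, key=lambda s: totals[s])


# Weights taken from the example above; the candidates are invented for illustration.
weights = {"opsin": 4, "pubchem": 3, "cirpy": 2}
candidates = {"CCO": ["opsin"], "CCN": ["pubchem", "cirpy"]}

print(pick_weighted_smiles(candidates, weights))
```

Here the two lower-weighted resolvers together (3 + 2 = 5) outvote the single higher-weighted one (4), which is the behavior that distinguishes a weighted vote from simply trusting the top-ranked resolver.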

Command line interface

cholla_chem can also be used as a command line tool. It can resolve chemical names passed directly as arguments or read them from a file.

Resolve compounds directly from the command line:

cholla-chem "aspirin"

Resolve compounds from a file:

cholla-chem --input names.txt --output results.tsv
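The same batch workflow can be scripted from Python. The sketch below is a hedged illustration, not the CLI's implementation: it assumes the input file holds one name per line and writes a two-column TSV, neither of which is confirmed by this README. The function name `resolve_names_file` is hypothetical, and the resolver is passed in as a callable so the sketch runs without network access.

```python
import csv
from pathlib import Path
from typing import Callable


def resolve_names_file(
    input_path: str,
    output_path: str,
    resolve: Callable[[list[str]], dict[str, str]],
) -> int:
    """Read names (assumed one per line), resolve them, and write name/SMILES rows as TSV."""
    lines = Path(input_path).read_text(encoding="utf-8").splitlines()
    names = [line.strip() for line in lines if line.strip()]
    resolved = resolve(names)
    with open(output_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["name", "smiles"])
        for name in names:
            # Names the resolver did not return get an empty SMILES column.
            writer.writerow([name, resolved.get(name, "")])
    return len(names)
```

With cholla_chem installed, the callable could be `lambda names: resolve_compounds_to_smiles(compounds_list=names)`, which relies on the dict return value shown under Basic usage.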

See help for more options:

cholla-chem --help

See the documentation for more details.

Documentation

Full documentation is available here

Contributing

  • Feature ideas and bug reports are welcome on the Issue Tracker.
  • Fork the source code on GitHub, make changes and file a pull request.

License

cholla_chem is licensed under the MIT license.
