
Atom-in-SMILES tokenizer for SMILES strings

Project description

License: CC BY-NC 4.0

Atom-in-SMILES tokenization.

Tokenization is an important preprocessing step in natural language processing that can significantly influence prediction quality. This work shows that traditional SMILES tokenization has a limitation: its tokens are too generic to reflect the true nature of molecules. To address this, we developed the atom-in-SMILES tokenization scheme, which eliminates the ambiguity of generic SMILES tokens. Our results on multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating high-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degree of token degeneration in various schemes and analyzed its adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples are included for qualitative examination. We believe atom-in-SMILES tokenization has great potential for adoption by the broader scientific community, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.


Installation

It can be installed using pip:

pip install atomInSmiles

or clone it from the GitHub repository and install it locally:

git clone https://github.com/snu-lcbc/atom-in-SMILES
cd atom-in-SMILES
python setup.py install

Usage & Demo

Brief descriptions of the main functions:

Function                 Description
atomInSmiles.encode      Converts a SMILES string into Atom-in-SMILES tokens.
atomInSmiles.decode      Converts Atom-in-SMILES tokens into a SMILES string.
atomInSmiles.similarity  Calculates the Tanimoto coefficient between two Atom-in-SMILES token sequences.

import atomInSmiles

smiles = 'NCC(=O)O'

# SMILES -> atom-in-SMILES 
ais_tokens = atomInSmiles.encode(smiles) # '[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]'

# atom-in-SMILES -> SMILES
decoded_smiles = atomInSmiles.decode(ais_tokens) #'NCC(=O)O'

assert smiles == decoded_smiles
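
The similarity function compares two token sequences via a Tanimoto coefficient. As a rough, stdlib-only sketch of what such a calculation looks like (treating each unique token as a fingerprint bit; the actual implementation may weight or hash tokens differently, and the second token string below is an illustrative, hand-written example):

```python
def tanimoto(tokens_a, tokens_b):
    """Tanimoto coefficient over the sets of unique tokens: |A & B| / |A | B|."""
    a, b = set(tokens_a.split()), set(tokens_b.split())
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

glycine = '[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]'
other = '[NH2;!R;C] [CH;!R;CCN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C] [CH3;!R;C]'
print(tanimoto(glycine, glycine))  # identical inputs give 1.0
print(tanimoto(glycine, other))    # shared tokens give a value between 0 and 1
```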

NOTE: By default, the input SMILES is canonicalized first. To obtain Atom-in-SMILES tokens that follow the same atom order as the input SMILES, provide the input with atom map numbers.

from rdkit.Chem import MolFromSmiles, MolToSmiles
import atomInSmiles

# Ensuring the atom order of the input SMILES is preserved in Atom-in-SMILES.
smiles = 'NCC(=O)O'
mol = MolFromSmiles(smiles)
random_smiles = MolToSmiles(mol, doRandom=True)  # e.g. 'C(C(=O)O)N'

# mapping atom IDs into the SMILES string
tmp = MolFromSmiles(random_smiles)
for atom in tmp.GetAtoms():
    atom.SetAtomMapNum(atom.GetIdx())
smiles_1 = MolToSmiles(tmp) # 'C([C:1](=[O:2])[OH:3])[NH2:4]' 

# SMILES -> atom-in-SMILES
ais_tokens_1 = atomInSmiles.encode(smiles_1, with_atomMap=True) # '[CH2;!R;CN] ( [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C] ) [NH2;!R;C]'

# atom-in-SMILES -> SMILES
decoded_smiles_1 = atomInSmiles.decode(ais_tokens_1) # 'C(C(=O)O)N'

assert random_smiles == decoded_smiles_1
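
Judging from the examples above, each atom token appears to have the form [atom;ring-flag;neighbors]: for instance, [CH2;!R;CN] is a CH2 that is not in a ring (!R) and has carbon and nitrogen neighbors. A hedged, stdlib-only parser sketch of that three-field layout (not part of the package API):

```python
import re

def parse_ais_token(token):
    """Split an AIS atom token into (atom, in_ring, neighbors).

    Returns None for structural tokens such as '(', ')', '=', or ring-closure digits.
    """
    m = re.fullmatch(r'\[([^;\]]+);(R|!R);([^;\]]*)\]', token)
    if not m:
        return None
    atom, ring, neighbors = m.groups()
    return atom, ring == 'R', neighbors

print(parse_ais_token('[CH2;!R;CN]'))  # ('CH2', False, 'CN')
print(parse_ais_token('[c;R;CCS]'))    # ('c', True, 'CCS') -- aromatic ring atom
```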

Implementations & Results

Implementation                     Item                                 Description
Single-step retrosynthesis         python src/predict.py                Runs inference with a trained model.
                                   --model_type                         SMILES, SELFIES, DeepSmiles, SmilesPE, or AIS
                                   --checkpoint_name                    Name of the checkpoint file
                                   --input                              Tokenized input sequence
Molecular property prediction      Molecular-property-prediction.ipynb  MoleculeNet: regression (ESOL, FreeSolv, Lipo.), classification (BBBP, BACE, HIV)
Normalized repetition rate         Normalized-Repetition-Rates.ipynb    Natural products, drugs, metal complexes, lipids, steroids, isomers
Fingerprint nature of AIS          AIS-as-fingerprint.ipynb             AIS fingerprint resolution
Single-token repetition (rep-l)    rep-l_USPTO50k.ipynb                 USPTO-50K, retrosynthetic translations
Input-output equivalent mapping    GDB13-results.ipynb                  Augmented subset of GDB-13, noncanonical-to-canonical translations

For example, in the retrosynthesis task:

python src/predict.py --model_type AIS  --checkpoint_name AIS_checkpoint.pth
 --input='[CH3;!R;O] [O;!R;CC] [C;!R;COO] ( = [O;!R;C] ) [c;R;CCS] 1 [cH;R;CC] [c;R;CCC] ( [CH2;!R;CC] [CH2;!R;CC] [CH2;!R;CC] [c;R;CCN] 2 [cH;R;CC] [c;R;CCC] 3 [c;R;CNO] ( = [O;!R;C] ) [nH;R;CC] [c;R;NNN] ( [NH2;!R;C] ) [n;R;CC] [c;R;CNN] 3 [nH;R;CC] 2 ) [cH;R;CS] [s;R;CC] 1'
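
The rep-l metric listed in the table above measures single-token repetition: the fraction of tokens in a sequence that duplicate one of the l tokens immediately preceding them. A hedged stdlib sketch of that idea (the notebook's exact definition and window size may differ):

```python
def rep_l(tokens, l=4):
    """Fraction of tokens that repeat a token within the previous l positions."""
    if not tokens:
        return 0.0
    hits = sum(1 for i, t in enumerate(tokens) if t in tokens[max(0, i - l):i])
    return hits / len(tokens)

print(rep_l('A B A B A B'.split(), l=2))  # alternating pattern scores high
print(rep_l('A B C D'.split(), l=4))      # all-distinct sequence scores 0.0
```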

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
