Atom-in-SMILES tokenizer for SMILES strings
Atom-in-SMILES tokenization.
Tokenization is an important preprocessing step in natural language processing that can significantly influence prediction quality. This research showed that traditional SMILES tokenization has a limitation: its tokens fail to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme, which eliminates the ambiguities arising from the generic nature of SMILES tokens. Our results on multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method for generating higher-quality SMILES sequences from AI-based chemical models than other tokenization and representation schemes. We investigated the degree of token degeneration of various schemes and analyzed its adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were included for qualitative examination. We believe that atom-in-SMILES tokenization has great potential for adoption by a broad range of related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.
Installation
It can be installed using pip.
pip install atomInSmiles
Alternatively, clone the GitHub repository and install it locally.
git clone https://github.com/snu-lcbc/atom-in-SMILES
cd atom-in-SMILES
python setup.py install
Usage & Demo
Brief descriptions of the main functions:
Function | Description |
---|---|
atomInSmiles.encode | Converts a SMILES string into Atom-in-SMILES tokens. |
atomInSmiles.decode | Converts Atom-in-SMILES tokens into a SMILES string. |
atomInSmiles.similarity | Calculates the Tanimoto coefficient of two Atom-in-SMILES token sequences. |
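The table describes the similarity function as a Tanimoto coefficient over Atom-in-SMILES tokens. The quantity itself can be illustrated in plain Python. This is a minimal sketch, not the package's implementation: it assumes tokens are whitespace-separated and that every distinct token (including structural tokens such as parentheses) counts once. The helper name tanimoto and the fragment string are hypothetical; the two glycine token strings are taken from the usage examples in this README.

```python
def tanimoto(tokens_a: str, tokens_b: str) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of whitespace-separated tokens."""
    set_a = set(tokens_a.split())
    set_b = set(tokens_b.split())
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty token sets are identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Glycine tokens in canonical order and in a randomized atom order.
glycine = "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"
glycine_reordered = "[CH2;!R;CN] ( [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C] ) [NH2;!R;C]"

# Hypothetical partial token string, used only to produce a value below 1.
fragment = "[C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"

print(tanimoto(glycine, glycine_reordered))  # same token set, different order
print(tanimoto(glycine, fragment))           # 6 shared tokens out of 8 distinct
```

Because the comparison is set-based, the two orderings of glycine score as identical, while the fragment shares only part of the token set.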
import atomInSmiles
smiles = 'NCC(=O)O'
# SMILES -> atom-in-SMILES
ais_tokens = atomInSmiles.encode(smiles) # '[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]'
# atom-in-SMILES -> SMILES
decoded_smiles = atomInSmiles.decode(ais_tokens) #'NCC(=O)O'
assert smiles == decoded_smiles
NOTE: By default, encode first canonicalizes the input SMILES. To obtain atom-in-SMILES tokens in the same atom order as the input SMILES, provide the input with atom map numbers and pass with_atomMap=True.
from rdkit.Chem import MolFromSmiles, MolToSmiles
import atomInSmiles
# Preserve the atom order of the input SMILES in the atom-in-SMILES tokens.
smiles = 'NCC(=O)O'
mol = MolFromSmiles(smiles)
random_smiles = MolToSmiles(mol, doRandom=True) # e.g. 'C(C(=O)O)N'
# Write each atom index into the SMILES string as an atom map number.
tmp = MolFromSmiles(random_smiles)
for atom in tmp.GetAtoms():
    atom.SetAtomMapNum(atom.GetIdx())
smiles_1 = MolToSmiles(tmp) # 'C([C:1](=[O:2])[OH:3])[NH2:4]'
# SMILES -> atom-in-SMILES
ais_tokens_1 = atomInSmiles.encode(smiles_1, with_atomMap=True) # '[CH2;!R;CN] ( [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C] ) [NH2;!R;C]'
# atom-in-SMILES -> SMILES
decoded_smiles_1 = atomInSmiles.decode(ais_tokens_1) # 'C(C(=O)O)N'
assert random_smiles == decoded_smiles_1
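The tokens in the examples above follow a bracketed, semicolon-separated pattern such as [CH2;!R;CN]. The sketch below splits one token into its three fields using only string operations. The reading of the fields (central atom with its hydrogens, ring membership written as R or !R, and neighboring atoms) is an assumption based on the example tokens, and the names AisAtom and parse_token are hypothetical helpers, not part of the atomInSmiles package.

```python
from typing import NamedTuple, Optional


class AisAtom(NamedTuple):
    atom: str       # first field, e.g. "NH2" (assumed: central atom plus hydrogens)
    in_ring: bool   # second field: True for "R", False for "!R"
    neighbors: str  # third field kept as written, e.g. "CN"


def parse_token(token: str) -> Optional[AisAtom]:
    """Split a bracketed three-field token; return None for any other token."""
    if not (token.startswith("[") and token.endswith("]")):
        return None  # structural tokens such as "(", "=", or ring digits
    fields = token[1:-1].split(";")
    if len(fields) != 3:
        return None
    atom, ring, neighbors = fields
    return AisAtom(atom, ring == "R", neighbors)


ais_tokens = "[NH2;!R;C] [CH2;!R;CN] [C;!R;COO] ( = [O;!R;C] ) [OH;!R;C]"
for token in ais_tokens.split():
    print(token, "->", parse_token(token))
```

The neighbor field is left as a raw string because splitting it into individual element symbols would need extra rules for two-letter elements.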
Implementations & Results
Implementation | Items | Description |
---|---|---|
Single-step retrosynthesis | python src/predict.py | Runs inference with the trained model |
| | --model_type | One of SMILES, SELFIES, DeepSmiles, SmilesPE, AIS |
| | --checkpoint_name | Name of the checkpoint file (checkpoints files) |
| | --input | Tokenized input sequence |
Molecular property prediction | Molecular-property-prediction.ipynb | MoleculeNet: Regression (ESOL, FreeSolv, Lipo.), Classification (BBBP, BACE, HIV) |
Normalized repetition rate | Normalized-Repetition-Rates.ipynb | Natural products, drugs, metal complexes, lipids, steroids, isomers |
Fingerprint nature of AIS | AIS-as-fingerprint.ipynb | AIS fingerprint resolution |
Single-token repetition (rep-l) | rep-l_USPTO50k.ipynb | USPTO-50K, retrosynthetic translations |
Input-output equivalent mapping | GDB13-results.ipynb | Augmented subset of GDB-13, noncanon-2-canon translations |
For example, for the retrosynthesis task:
python src/predict.py --model_type AIS --checkpoint_name AIS_checkpoint.pth \
--input='[CH3;!R;O] [O;!R;CC] [C;!R;COO] ( = [O;!R;C] ) [c;R;CCS] 1 [cH;R;CC] [c;R;CCC] ( [CH2;!R;CC] [CH2;!R;CC] [CH2;!R;CC] [c;R;CCN] 2 [cH;R;CC] [c;R;CCC] 3 [c;R;CNO] ( = [O;!R;C] ) [nH;R;CC] [c;R;NNN] ( [NH2;!R;C] ) [n;R;CC] [c;R;CNN] 3 [nH;R;CC] 2 ) [cH;R;CS] [s;R;CC] 1'
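The results table lists a single-token repetition metric (rep-l) for translated sequences such as the output of the command above. As a rough illustration, the sketch below uses one common formulation: the fraction of positions whose token already occurred among the preceding l tokens. Whether rep-l_USPTO50k.ipynb uses exactly this definition is an assumption, and the function name rep_l is hypothetical.

```python
from typing import Sequence


def rep_l(tokens: Sequence[str], l: int) -> float:
    """Fraction of positions whose token appears among the previous l tokens."""
    if l < 1:
        raise ValueError("window size l must be at least 1")
    if not tokens:
        return 0.0
    repeats = sum(
        1
        for i, token in enumerate(tokens)
        if token in tokens[max(0, i - l):i]  # window of up to l preceding tokens
    )
    return repeats / len(tokens)


plain = "C C C O".split()
ais = "[CH3;!R;C] [CH2;!R;CC] [CH2;!R;CO] [OH;!R;C]".split()

print(rep_l(plain, 1))  # adjacent identical tokens count as repeats
print(rep_l(ais, 1))    # every token here is distinct
```

In this toy comparison the generic tokens repeat while the environment-specific tokens do not, which is the kind of difference a repetition metric is meant to expose.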
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.