Tokenize SMILES with substructure units
Project description
SMILES Pair Encoding (SmilesPE).
SMILES Pair Encoding (SmilesPE) trains a substructure tokenizer from a large set of SMILES strings (e.g., ChEMBL) based on byte-pair-encoding (BPE).
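To make the BPE idea concrete, here is a minimal sketch of pair merging over atom-level tokens. It is an illustration only and not SmilesPE's training code. The function name `learn_merges` and the stopping rule (stop when the best pair occurs fewer than twice) are assumptions made for this example.

```python
from collections import Counter

def learn_merges(corpus, num_merges):
    """Learn BPE-style merges over pre-tokenized sequences (illustrative sketch).

    corpus: list of token lists, e.g. atom-level SMILES tokens.
    Returns (ordered list of merged pairs, corpus after applying the merges).
    """
    seqs = [list(toks) for toks in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), freq = max(pairs.items(), key=lambda kv: kv[1])
        if freq < 2:  # assumed stopping rule: nothing left worth merging
            break
        merges.append((a, b))
        # Replace each left-to-right occurrence of the pair with one merged token.
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

corpus = [['C', 'C', 'O'], ['C', 'C', 'N'], ['C', 'C', 'O']]
merges, merged = learn_merges(corpus, num_merges=5)
print(merges)  # [('C', 'C'), ('CC', 'O')]
```

Each learned merge becomes a substructure token, so frequent fragments end up as single vocabulary entries.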
Overview
Installation
pip install SmilesPE
Usage Instructions
Basic Tokenizers
- Atom-level Tokenizer
from SmilesPE.pretokenizer import atomwise_tokenizer
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = atomwise_tokenizer(smi)
print(toks)
['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']
- K-mer Tokenizer
from SmilesPE.pretokenizer import kmer_tokenizer
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = kmer_tokenizer(smi, ngram=4)
print(toks)
['CC[N+](', 'C[N+](C', '[N+](C)', '(C)(', 'C)(C', ')(C)', '(C)C', 'C)Cc', ')Cc1', 'Cc1c', 'c1cc', '1ccc', 'cccc', 'cccc', 'ccc1', 'cc1Br']
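For readers who want to see what these two tokenizers do, the following is a rough sketch that reproduces the outputs above for this one molecule. It is not the library's implementation: the regular expression and the names `simple_atomwise` and `simple_kmers` are assumptions for illustration, and the pattern covers only bracket atoms, the organic subset, ring labels, and common bond and branch symbols.

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"             # bracket atoms such as [N+]
    r"|Br|Cl"                 # two-letter halogens, tried before single letters
    r"|[BCNOPSFI]|[bcnops]"   # organic-subset atoms, aliphatic and aromatic
    r"|%[0-9]{2}|[0-9]"       # ring-closure labels
    r"|[()=#\-+\\/:.~@?>*$]"  # branches, bonds, and other symbols
)

def simple_atomwise(smi):
    """Split a SMILES string into atom-level tokens (illustrative sketch)."""
    tokens = SMILES_TOKEN.findall(smi)
    # The tokens must reassemble the input, otherwise a character was skipped.
    if ''.join(tokens) != smi:
        raise ValueError(f'unrecognized characters in {smi!r}')
    return tokens

def simple_kmers(tokens, ngram=4):
    """Slide a window of `ngram` tokens over the list and join each window."""
    return [''.join(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)]

toks = simple_atomwise('CC[N+](C)(C)Cc1ccccc1Br')
kmers = simple_kmers(toks, ngram=4)
print(len(toks), len(kmers))  # 19 16
```

A sequence of n tokens yields n - ngram + 1 k-mers, which is why the 19 atom-level tokens above produce 16 4-mers.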
The basic tokenizers also work with SELFIES and DeepSMILES strings. The selfies and deepsmiles packages must be installed separately.
Example of SELFIES
import selfies
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
sel = selfies.encoder(smi)
print(f'SELFIES string: {sel}')
>>> SELFIES string: [C][C][N+][Branch1_2][epsilon][C][Branch1_3][epsilon][C][C][c][c][c][c][c][c][Ring1][Branch1_1][Br]
toks = atomwise_tokenizer(sel)
print(toks)
>>> ['[C]', '[C]', '[N+]', '[Branch1_2]', '[epsilon]', '[C]', '[Branch1_3]', '[epsilon]', '[C]', '[C]', '[c]', '[c]', '[c]', '[c]', '[c]', '[c]', '[Ring1]', '[Branch1_1]', '[Br]']
toks = kmer_tokenizer(sel, ngram=4)
print(toks)
>>> ['[C][C][N+][Branch1_2]', '[C][N+][Branch1_2][epsilon]', '[N+][Branch1_2][epsilon][C]', '[Branch1_2][epsilon][C][Branch1_3]', '[epsilon][C][Branch1_3][epsilon]', '[C][Branch1_3][epsilon][C]', '[Branch1_3][epsilon][C][C]', '[epsilon][C][C][c]', '[C][C][c][c]', '[C][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][Ring1]', '[c][c][Ring1][Branch1_1]', '[c][Ring1][Branch1_1][Br]']
Example of DeepSMILES
import deepsmiles
converter = deepsmiles.Converter(rings=True, branches=True)
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
deepsmi = converter.encode(smi)
print(f'DeepSMILES string: {deepsmi}')
>>> DeepSMILES string: CC[N+]C)C)Ccccccc6Br
toks = atomwise_tokenizer(deepsmi)
print(toks)
>>> ['C', 'C', '[N+]', 'C', ')', 'C', ')', 'C', 'c', 'c', 'c', 'c', 'c', 'c', '6', 'Br']
toks = kmer_tokenizer(deepsmi, ngram=4)
print(toks)
>>> ['CC[N+]C', 'C[N+]C)', '[N+]C)C', 'C)C)', ')C)C', 'C)Cc', ')Ccc', 'Cccc', 'cccc', 'cccc', 'cccc', 'ccc6', 'cc6Br']
Use the Pre-trained SmilesPE Tokenizer
Download 'SPE_ChEMBL.txt'.
import codecs
from SmilesPE.tokenizer import SPE_Tokenizer
spe_vob = codecs.open('../SPE_ChEMBL.txt')
spe = SPE_Tokenizer(spe_vob)
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
spe.tokenize(smi)
>>> 'CC [N+](C) (C)C c1ccccc1 Br'
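Because the tokenizer returns a single space-separated string, downstream code usually needs to split it and map tokens to integer ids. The sketch below shows one way to do that. The `build_vocab` helper and the `<pad>` and `<unk>` special tokens are hypothetical and not part of SmilesPE, and the tokenized string is copied from the output above so that the example runs without the vocabulary file.

```python
# Output of spe.tokenize(smi) from the example above, pasted as a literal.
tokenized = 'CC [N+](C) (C)C c1ccccc1 Br'
tokens = tokenized.split(' ')

def build_vocab(token_lists, specials=('<pad>', '<unk>')):
    """Assign ids in first-seen order, after the (assumed) special tokens."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for toks in token_lists:
        for tok in toks:
            vocab.setdefault(tok, len(vocab))
    return vocab

vocab = build_vocab([tokens])
# Unknown tokens fall back to the <unk> id.
ids = [vocab.get(tok, vocab['<unk>']) for tok in tokens]

print(tokens)  # ['CC', '[N+](C)', '(C)C', 'c1ccccc1', 'Br']
print(ids)     # [2, 3, 4, 5, 6]
```

Joining the tokens without spaces gives back the original SMILES string, which is a quick check that nothing was lost in tokenization.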
Train a SmilesPE Tokenizer with a Custom Dataset
See train_SPE.ipynb for an example of training an SPE tokenizer on ChEMBL data.
Download files
Source Distribution
SmilesPE-0.0.3.tar.gz (15.6 kB)
Built Distribution
SmilesPE-0.0.3-py3-none-any.whl (15.7 kB)
File details
Details for the file SmilesPE-0.0.3.tar.gz.
File metadata
- Download URL: SmilesPE-0.0.3.tar.gz
- Upload date:
- Size: 15.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.6.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7ceebc7d314e456a08f77d45f08fe4b638886901c0eac50f0cdb005b9f0912bc
MD5 | ac151f898f038aab0f6becc2f620e78d
BLAKE2b-256 | 5e5ca638fd96cdf4499eaed76d5dbcec734d98c4ddaf2a8f9e13e44e5151fa29
File details
Details for the file SmilesPE-0.0.3-py3-none-any.whl.
File metadata
- Download URL: SmilesPE-0.0.3-py3-none-any.whl
- Upload date:
- Size: 15.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.6.7
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9f74279daa14945859546fb2de11c208b5116927ce5fe03b3cf46bcba96f5e58
MD5 | d9faf4f4f324a7018a099d8f9a933d6c
BLAKE2b-256 | 6df9273f54d9d4b42779926291c82a5b3ffea30cff2492ebbe4ce08dccdcc949
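If you download one of these files manually, you can compare it against the listed SHA256 digest. The following is a small sketch using only the standard library. The helper names `sha256_of_file` and `matches` are made up for this example, and the file path in the commented usage assumes the wheel was saved to the current directory.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected_hex):
    """True if the file's SHA256 equals the expected hex digest."""
    return sha256_of_file(path) == expected_hex.lower()

# Example usage (assumes the wheel is in the current directory):
# matches('SmilesPE-0.0.3-py3-none-any.whl',
#         '9f74279daa14945859546fb2de11c208b5116927ce5fe03b3cf46bcba96f5e58')
```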