Tokenize SMILES with substructure units

SMILES Pair Encoding (SmilesPE).

SMILES Pair Encoding (SmilesPE) trains a substructure tokenizer from a large set of SMILES strings (e.g., ChEMBL) based on byte-pair-encoding (BPE).
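To make the BPE idea concrete, here is a minimal sketch of the merge loop over SMILES that have already been split into atom-level tokens. It is not the SmilesPE implementation: the names `learn_merges` and `merge_pair`, the toy corpus, and the stopping rule (stop when the best pair occurs fewer than twice) are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(tokens, left, right):
    """Replace each adjacent (left, right) occurrence with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Greedily merge the most frequent adjacent token pair, BPE-style."""
    corpus = [list(tokens) for tokens in corpus]  # work on a copy
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:  # assumed stopping rule: nothing left worth merging
            break
        merges.append((left, right))
        corpus = [merge_pair(tokens, left, right) for tokens in corpus]
    return merges, corpus


# Hypothetical toy corpus of atom-level tokens (benzene, toluene, ethanol).
toy_corpus = [
    ['c', '1', 'c', 'c', 'c', 'c', 'c', '1'],
    ['C', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1'],
    ['C', 'C', 'O'],
]
merges, segmented = learn_merges(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

Each learned merge becomes a new vocabulary entry, so frequent substructures such as aromatic ring fragments end up as single tokens.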

Overview

Installation

pip install SmilesPE

Usage Instructions

Basic Tokenizers

Atom-level Tokenizer

from SmilesPE.pretokenizer import atomwise_tokenizer

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = atomwise_tokenizer(smi)
print(toks)

['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']
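For readers who want to see what an atom-level split involves, the following is a minimal regex-based sketch. It uses a pattern that is common in the SMILES tokenization literature; whether `atomwise_tokenizer` uses this exact pattern is an assumption, and `SMILES_TOKEN_PATTERN` and `regex_atomwise` are illustrative names, not part of SmilesPE.

```python
import re

# Bracket atoms first, then two-letter halogens, then single-character
# atoms, bonds, branches, and ring-closure digits.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def regex_atomwise(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Fail loudly if the pattern skipped any character.
    if ''.join(tokens) != smiles:
        raise ValueError(f'Could not tokenize all of: {smiles}')
    return tokens


print(regex_atomwise('CC[N+](C)(C)Cc1ccccc1Br'))
```

Matching bracket atoms and `Br`/`Cl` before single letters is what keeps `[N+]` and `Br` intact as single tokens.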

K-mer Tokenizer

from SmilesPE.pretokenizer import kmer_tokenizer

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
toks = kmer_tokenizer(smi, ngram=4)
print(toks)

['CC[N+](', 'C[N+](C', '[N+](C)', '(C)(', 'C)(C', ')(C)', '(C)C', 'C)Cc', ')Cc1', 'Cc1c', 'c1cc', '1ccc', 'cccc', 'cccc', 'ccc1', 'cc1Br']
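The output above is consistent with a sliding window of `ngram` atom-level tokens moved one token at a time. A minimal sketch of that idea, assuming the input is already split into atom-level tokens (`kmers` is an illustrative helper, not the SmilesPE function):

```python
def kmers(tokens, n):
    """Join every run of n consecutive tokens into one k-mer string."""
    return [''.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Atom-level tokens of 'CC[N+](C)(C)Cc1ccccc1Br', as printed earlier.
atom_tokens = ['C', 'C', '[N+]', '(', 'C', ')', '(', 'C', ')', 'C',
               'c', '1', 'c', 'c', 'c', 'c', 'c', '1', 'Br']

print(kmers(atom_tokens, 4))
```

With 19 atom-level tokens and a window of 4, this yields 19 - 4 + 1 = 16 k-mers, matching the length of the list above. Note that this sketch returns an empty list when the input has fewer than `n` tokens.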

The basic tokenizers also work on SELFIES and DeepSMILES strings. The selfies and deepsmiles packages must be installed separately.

Example of SELFIES

import selfies
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
sel = selfies.encoder(smi)
print(f'SELFIES string: {sel}')

>>> SELFIES string: [C][C][N+][Branch1_2][epsilon][C][Branch1_3][epsilon][C][C][c][c][c][c][c][c][Ring1][Branch1_1][Br]

toks = atomwise_tokenizer(sel)
print(toks)

>>> ['[C]', '[C]', '[N+]', '[Branch1_2]', '[epsilon]', '[C]', '[Branch1_3]', '[epsilon]', '[C]', '[C]', '[c]', '[c]', '[c]', '[c]', '[c]', '[c]', '[Ring1]', '[Branch1_1]', '[Br]']

toks = kmer_tokenizer(sel, ngram=4)
print(toks)

>>> ['[C][C][N+][Branch1_2]', '[C][N+][Branch1_2][epsilon]', '[N+][Branch1_2][epsilon][C]', '[Branch1_2][epsilon][C][Branch1_3]', '[epsilon][C][Branch1_3][epsilon]', '[C][Branch1_3][epsilon][C]', '[Branch1_3][epsilon][C][C]', '[epsilon][C][C][c]', '[C][C][c][c]', '[C][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][c]', '[c][c][c][Ring1]', '[c][c][Ring1][Branch1_1]', '[c][Ring1][Branch1_1][Br]']

Example of DeepSMILES

import deepsmiles
converter = deepsmiles.Converter(rings=True, branches=True)
smi = 'CC[N+](C)(C)Cc1ccccc1Br'
deepsmi = converter.encode(smi)
print(f'DeepSMILES string: {deepsmi}')

>>> DeepSMILES string: CC[N+]C)C)Ccccccc6Br

toks = atomwise_tokenizer(deepsmi)
print(toks)

>>> ['C', 'C', '[N+]', 'C', ')', 'C', ')', 'C', 'c', 'c', 'c', 'c', 'c', 'c', '6', 'Br']

toks = kmer_tokenizer(deepsmi, ngram=4)
print(toks)

>>> ['CC[N+]C', 'C[N+]C)', '[N+]C)C', 'C)C)', ')C)C', 'C)Cc', ')Ccc', 'Cccc', 'cccc', 'cccc', 'cccc', 'ccc6', 'cc6Br']

Use the Pre-trained SmilesPE Tokenizer

Download 'SPE_ChEMBL.txt'.

import codecs
from SmilesPE.tokenizer import SPE_Tokenizer

# Open the downloaded vocabulary file (adjust the path to where you saved it)
spe_vob = codecs.open('../SPE_ChEMBL.txt')
spe = SPE_Tokenizer(spe_vob)

smi = 'CC[N+](C)(C)Cc1ccccc1Br'
spe.tokenize(smi)

>>> 'CC [N+](C) (C)C c1ccccc1 Br'
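As the output above shows, `tokenize` returns one space-delimited string rather than a list. A short sketch of turning that string into a token list and integer ids for a downstream model follows; the `vocab` mapping here is built on the fly purely for illustration and is not the SPE_ChEMBL vocabulary.

```python
# Output string copied from the example above.
spe_output = 'CC [N+](C) (C)C c1ccccc1 Br'

# Split on spaces to recover the substructure tokens.
tokens = spe_output.split(' ')
print(tokens)

# Sanity check: concatenating the tokens gives back the input SMILES.
assert ''.join(tokens) == 'CC[N+](C)(C)Cc1ccccc1Br'

# Hypothetical token-to-id mapping, for illustration only.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]
print(ids)
```

Splitting on spaces is safe here because SMILES strings themselves contain no spaces.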

Train a SmilesPE Tokenizer with a Custom Dataset

See train_SPE.ipynb for an example of training an SPE tokenizer on ChEMBL data.

Release history

0.0.3 (this version): Apr 7, 2020
0.0.2: Mar 18, 2020
0.0.1: Mar 18, 2020

Download files

Source Distribution

SmilesPE-0.0.3.tar.gz (15.6 kB)

Uploaded Apr 7, 2020 Source

Built Distribution

SmilesPE-0.0.3-py3-none-any.whl (15.7 kB)

Uploaded Apr 7, 2020 Python 3

File details

Details for the file SmilesPE-0.0.3.tar.gz.

File metadata

Download URL: SmilesPE-0.0.3.tar.gz
Upload date: Apr 7, 2020
Size: 15.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.6.7

File hashes

Hashes for SmilesPE-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`7ceebc7d314e456a08f77d45f08fe4b638886901c0eac50f0cdb005b9f0912bc`
MD5	`ac151f898f038aab0f6becc2f620e78d`
BLAKE2b-256	`5e5ca638fd96cdf4499eaed76d5dbcec734d98c4ddaf2a8f9e13e44e5151fa29`

File details

Details for the file SmilesPE-0.0.3-py3-none-any.whl.

File metadata

Download URL: SmilesPE-0.0.3-py3-none-any.whl
Upload date: Apr 7, 2020
Size: 15.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.6.7

File hashes

Hashes for SmilesPE-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9f74279daa14945859546fb2de11c208b5116927ce5fe03b3cf46bcba96f5e58`
MD5	`d9faf4f4f324a7018a099d8f9a933d6c`
BLAKE2b-256	`6df9273f54d9d4b42779926291c82a5b3ffea30cff2492ebbe4ce08dccdcc949`
