
Subword tokenizer for neural natural language processing


SubTokenizer

A subword tokenizer based on the Google code from tensor2tensor. In addition to the Google tokenizer's behavior, it supports tags and combined tokens.

  • Tags are tokens starting with @; they are never split into parts.
  • The no-break symbol ¬ ('\xac') joins several words into a single token (see the example below).

The tokenizer performs Unicode normalization and escapes control characters. It can also encode rare symbols so that the subword algorithm can split them into parts.
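
For example, input text can be marked up like this (an illustrative sketch; the token boundaries described in the comments are assumptions based on the description above, not documented output):

line = "send @USER_TAG to New¬York"
# "@USER_TAG" is a tag, so it is kept as a single token
# "New¬York" uses the no-break symbol, so both words end up in one token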

The original Google subword tokenizer: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py

By default, SubTokenizer encodes all control characters plus the @ and ¬ symbols before learning and tokenizing. To use tags, run the encoding first, then add the tags, and only then learn/tokenize with encode_controls=False (or --no_encode_controls in command-line mode).

Install:

 pip install subtokenizer

Usage:

cat text_file.txt | subtokenizer learn -o bpe.file -s 1000 -r reserved_tokens.txt
cat text_file.txt | subtokenizer tokenize -s bpe.file > tokenized_file.txt
cat tokenized_file.txt | subtokenizer detokenize -s bpe.file > text_file.txt
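
If the input already contains tags (added after a prior encoding pass, as described above), run the same pipeline with control encoding disabled; tagged_file.txt is a placeholder name here:

cat tagged_file.txt | subtokenizer tokenize -s bpe.file --no_encode_controls > tokenized_file.txt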

Or from Python:

from subtokenizer import SubTokenizer

# words_count: word frequencies gathered from the training corpus
# (presumably a mapping from word to count).
tokenizer = SubTokenizer.learn(words_count)
tokenizer.save(subwords_filename)

# Reload the learned subwords and round-trip a line of text.
tokenizer = SubTokenizer.load(subwords_filename)
tokens = tokenizer.tokenize(line)
line = tokenizer.detokenize(tokens)
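
The tag workflow from the note above looks roughly like this in the Python API, as a minimal sketch (the tag name is hypothetical, and passing encode_controls to tokenize is assumed from that note):

tokenizer = SubTokenizer.load(subwords_filename)

# The line is assumed to be already encoded (encoding runs by default
# during learn/tokenize) and tagged afterwards.
tagged_line = "@GREETING hello world"  # hypothetical tagged input
tokens = tokenizer.tokenize(tagged_line, encode_controls=False)
text = tokenizer.detokenize(tokens)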

