
Subword tokenizer for neural natural language processing


SubTokenizer

A subword tokenizer based on the Google code from tensor2tensor. In addition to the Google tokenizer's behavior, it supports tags and combined tokens.

  • Tags are tokens starting with @; they are never split into parts.
  • The no-break symbol ¬ ('\xac') joins several words into a single token (see the example below).

The tokenizer performs Unicode normalization and escapes control characters. It can also encode rare symbols so that the subword algorithm can split them into parts.
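
For example, input text can be marked up like this (an illustrative sketch; the token boundaries described in the comments are assumptions based on the description above, not documented output):

line = "send @USER_TAG to New¬York"
# "@USER_TAG" is a tag, so it is kept as a single token
# "New¬York" uses the no-break symbol, so both words end up in one token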

The original Google subword tokenizer: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py

By default, SubTokenizer encodes all control characters plus the @ and ¬ symbols before learning and tokenizing. To use tags, run the encoding first, then add the tags, and only then learn/tokenize with encode_controls=False (or --no_encode_controls in command-line mode).

Install:

 pip install subtokenizer

Usage:

cat text_file.txt | subtokenizer learn -o bpe.file -s 1000 -r reserved_tokens.txt
cat text_file.txt | subtokenizer tokenize -s bpe.file > tokenized_file.txt
cat tokenized_file.txt | subtokenizer detokenize -s bpe.file > text_file.txt
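
If the input already contains tags (added after a prior encoding pass, as described above), run the same pipeline with control encoding disabled; tagged_file.txt is a placeholder name here:

cat tagged_file.txt | subtokenizer tokenize -s bpe.file --no_encode_controls > tokenized_file.txt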

Or from Python:

from subtokenizer import SubTokenizer

# words_count: word frequencies gathered from the training corpus
# (presumably a mapping from word to count).
tokenizer = SubTokenizer.learn(words_count)
tokenizer.save(subwords_filename)

# Reload the learned subwords and round-trip a line of text.
tokenizer = SubTokenizer.load(subwords_filename)
tokens = tokenizer.tokenize(line)
line = tokenizer.detokenize(tokens)
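
The tag workflow from the note above looks roughly like this in the Python API, as a minimal sketch (the tag name is hypothetical, and passing encode_controls to tokenize is assumed from that note):

tokenizer = SubTokenizer.load(subwords_filename)

# The line is assumed to be already encoded (encoding runs by default
# during learn/tokenize) and tagged afterwards.
tagged_line = "@GREETING hello world"  # hypothetical tagged input
tokens = tokenizer.tokenize(tagged_line, encode_controls=False)
text = tokenizer.detokenize(tokens)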

