Unsupervised text tokenizer focused on computational efficiency

YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than Hugging Face, fastBPE and SentencePiece. In some test cases, it is 90 times faster. Check out our benchmark results.

Key advantages:

  • Multithreading for training and tokenization
  • The algorithm has O(N) complexity, where N is the length of training data
  • Highly efficient implementation in C++
  • Python wrapper and command-line interface

Extra features:

As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. As in SentencePiece, all space symbols are replaced by the meta symbol "▁" (U+2581). This allows sequences of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase Blazingly fast tokenization! can be tokenized into

['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
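
The round trip described above can be sketched without the library: concatenate the subwords, then turn each "▁" back into a space. This is a minimal illustration of the idea, not YouTokenToMe's own decoder; the helper name `detokenize` is hypothetical.

```python
META = "\u2581"  # the meta symbol "▁" that stands in for a space


def detokenize(subwords):
    """Join subwords and restore spaces (hypothetical helper, for illustration)."""
    text = "".join(subwords).replace(META, " ")
    # The first word carries a leading meta symbol, which becomes a leading space.
    return text[1:] if text.startswith(" ") else text


tokens = ["▁Bl", "az", "ingly", "▁fast", "▁token", "ization", "!"]
print(detokenize(tokens))
```

Because every word starts with the meta symbol, no separate list of word boundaries is needed to rebuild the text.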

Installation

pip install youtokentome

Python interface

Example

Let's start with a self-contained example.

import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generating random file with training data
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generating random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Training model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Loading model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))

 

Training model

youtokentome.BPE.train(data, model, vocab_size, coverage=1.0, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

Trains BPE model and saves to file.

Args:

  • data: string, path to file with training data
  • model: string, path to where the trained model will be saved
  • vocab_size: int, number of tokens in the final vocabulary
  • coverage: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
  • n_threads: int, number of parallel threads used to run. If -1 is passed, all available threads are used. Note that the number of threads is limited to 8 (see benchmark).
  • pad_id: int, reserved id for padding
  • unk_id: int, reserved id for unknown symbols
  • bos_id: int, reserved id for the beginning-of-sentence token
  • eos_id: int, reserved id for the end-of-sentence token

Returns: Class youtokentome.BPE with the loaded model.
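
To make the coverage parameter more concrete, the sketch below picks the most frequent characters until their occurrences account for at least the requested fraction of the text; anything left out would have to be treated as unknown. This is only an assumption about what "fraction of characters covered" means, written as a standalone illustration. The function name `chars_for_coverage` is hypothetical, and the library's actual selection rule may differ.

```python
from collections import Counter


def chars_for_coverage(text, coverage):
    """Return the most frequent characters that together cover `coverage` of `text`."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in the range [0, 1]")
    counts = Counter(text)
    total = sum(counts.values())
    kept, covered = [], 0
    # Walk characters from most to least frequent until the target is reached.
    for char, count in counts.most_common():
        if covered >= coverage * total:
            break
        kept.append(char)
        covered += count
    return kept


sample = "a" * 90 + "b" * 9 + "c"
print(chars_for_coverage(sample, 0.95))  # the rare character "c" is left out
print(chars_for_coverage(sample, 1.0))   # every character is kept
```

A value slightly below 1, such as the suggested 0.9999, keeps very rare characters from taking up vocabulary slots.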

 

Model loading

youtokentome.BPE(model, n_threads=-1)

Class constructor. Loads the trained model.

  • model: string, path to the trained model
  • n_threads: int, number of parallel threads used to run. If equal to -1, then the maximum number of threads available will be used.

 

Methods

Class youtokentome.BPE has the following methods:

encode

encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)

Args:

  • sentences: list of strings, sentences for tokenization.
  • output_type: enum, sentence can be tokenized to ids or subwords. Use OutputType.ID for ids and OutputType.SUBWORD for subwords.
  • bos: bool, if True then token “beginning of sentence” will be added
  • eos: bool, if True then token “end of sentence” will be added
  • reverse: bool, if True the output sequence of tokens will be reversed
  • dropout_prob: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

Returns: If output_type is youtokentome.OutputType.ID, a list of lists of integers. If output_type is youtokentome.OutputType.SUBWORD, a list of lists of strings.
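
The dropout_prob parameter refers to BPE-dropout, where each candidate merge may be skipped with the given probability, so the same word can be segmented in different ways. The toy encoder below illustrates that idea on a hand-written merge table. The merge list, the function name `bpe_encode`, and the simplified loop are all assumptions made for illustration; they do not reflect the library's C++ implementation.

```python
import random

# Hypothetical merge table: earlier pairs have higher priority.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
RANK = {pair: i for i, pair in enumerate(MERGES)}


def bpe_encode(word, dropout_prob=0.0, rng=random):
    """Segment `word` with merges; each candidate is dropped with `dropout_prob`."""
    symbols = list(word)
    while True:
        # Collect adjacent pairs that are known merges and survive dropout.
        candidates = [
            (RANK[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in RANK and rng.random() >= dropout_prob
        ]
        if not candidates:
            return symbols
        _, i = min(candidates)  # best-ranked surviving merge, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]


print(bpe_encode("lower"))                    # no dropout: merges run to completion
print(bpe_encode("lower", dropout_prob=1.0))  # every merge dropped: single characters
print(bpe_encode("lower", dropout_prob=0.3, rng=random.Random(0)))  # depends on the seed
```

With dropout_prob=0 the segmentation is deterministic; higher values yield finer and more varied segmentations of the same input.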

 

vocab

vocab(self)

Returns: A list of vocab_size strings. The i-th string in the list corresponds to the i-th subword.

 

vocab_size

vocab_size(self)

Returns: int. Size of vocabulary.

 

subword_to_id

subword_to_id(self, subword)

Args:

  • subword: string.

Returns: Integer in the range [0, vocab_size-1]: the id of the subword, or unk_id if the subword is not in the vocabulary.

 

id_to_subword

id_to_subword(self, id)

Args:

  • id: int, must be in the range [0, vocab_size-1]

Returns: string. Subword from vocabulary by id.
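
The contract of subword_to_id and id_to_subword can be mimicked with a plain list and a dictionary. The vocabulary below is invented for illustration: the reserved ids follow the default values listed under Training model, but the special-token strings and all other entries are assumptions. A trained model exposes its real vocabulary through vocab().

```python
# Hypothetical toy vocabulary; ids 0-3 mirror the default reserved ids.
TOY_VOCAB = ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁", "a", "b", "▁ab"]
UNK_ID = 1
SUBWORD_TO_ID = {subword: i for i, subword in enumerate(TOY_VOCAB)}


def subword_to_id(subword):
    """Return the id of `subword`, or UNK_ID if it is not in the vocabulary."""
    return SUBWORD_TO_ID.get(subword, UNK_ID)


def id_to_subword(idx):
    """Return the subword for `idx`; ids must lie in [0, vocab_size-1]."""
    if not 0 <= idx < len(TOY_VOCAB):
        raise ValueError("id out of range")
    return TOY_VOCAB[idx]


print(subword_to_id("▁ab"))  # known subword
print(subword_to_id("xyz"))  # unknown subword falls back to UNK_ID
print(id_to_subword(5))
```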

 

decode

decode(self, ids, ignore_ids=None)

Convert each id to subword and concatenate with space symbol.

Args:

  • ids: list of lists of integers. All integers must be in the range [0, vocab_size-1]
  • ignore_ids: collection of integers. These ids are skipped during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

Returns: List of strings.
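
A rough sketch of how ignore_ids might be used: ids in the collection are skipped before the remaining subwords are joined back into text. The toy vocabulary and the function below are hypothetical stand-ins, not the library's decoder, and they assume that the meta symbol "▁" is turned back into a space.

```python
# Hypothetical toy vocabulary; ids 0-3 mirror the default reserved ids.
TOY_VOCAB = ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁fast", "▁token", "ization"]


def decode(ids, ignore_ids=None):
    """Turn each list of ids into text, skipping any id listed in `ignore_ids`."""
    ignored = set(ignore_ids or ())
    sentences = []
    for sentence_ids in ids:
        subwords = [TOY_VOCAB[i] for i in sentence_ids if i not in ignored]
        text = "".join(subwords).replace("\u2581", " ")
        sentences.append(text.lstrip(" "))
    return sentences


# One sentence with BOS, EOS and trailing padding ids.
batch = [[2, 4, 5, 6, 3, 0, 0]]
print(decode(batch, ignore_ids=[0, 2, 3]))
```

Passing the padding, bos and eos ids this way keeps the special tokens out of the decoded text.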

Command line interface

Example

$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA 

Supported commands

YouTokenToMe supports the following commands:

$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.

Command bpe trains a Byte Pair Encoding model on a text file.

$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.

Command encode applies BPE encoding to a corpus of sentences. It reads input from stdin and writes output to stdout.

By default, encoding works in parallel using n_threads threads. The number of threads is limited to 8 (see benchmark).

With the --stream option, --n_threads will be ignored and all sentences will be processed one by one. Each sentence will be tokenized and written to the stdout before the next sentence is read.

$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add tab 'begin of sentence'.
  --eos                Add tab 'end of sentence'.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.
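
The behaviour the --stream option describes can be pictured as a loop that emits each result before reading the next line. The sketch below uses a whitespace split as a stand-in tokenizer; that is an assumption purely for illustration, it is not BPE and not how yttm is implemented. The function name `encode_stream` is hypothetical.

```python
import io
import sys


def encode_stream(lines, out=sys.stdout, tokenize=str.split):
    """Tokenize lines one at a time, flushing each result before reading the next."""
    for line in lines:
        out.write(" ".join(tokenize(line)) + "\n")
        out.flush()  # make the result visible immediately, e.g. to a waiting pipe


# Demo with in-memory streams; with real pipes, pass sys.stdin as `lines`.
source = io.StringIO("first  sentence\nsecond sentence\n")
sink = io.StringIO()
encode_stream(source, out=sink)
print(sink.getvalue(), end="")
```

Processing one line at a time gives up batching across threads in exchange for low latency per sentence, which matches the note that --n_threads is ignored in this mode.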

Command vocab prints the vocabulary. This can be useful for understanding the model.

$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.

Command decode converts ids back to text. It reads input from stdin and writes output to stdout.

$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
