YouTokenToMe

YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE) [Sennrich et al.]. Our implementation is much faster in training and tokenization than both fastBPE and SentencePiece. In some test cases, it is 90 times faster. Check out our benchmark results.

Key advantages:

  • Multithreading for training and tokenization
  • The algorithm has O(N) complexity, where N is the length of training data
  • Highly efficient implementation in C++
  • Python wrapper and command-line interface

Extra features:

As in the algorithm from the original paper, ours does not consider tokens that cross word boundaries. As in SentencePiece, all space symbols are replaced by the meta symbol "▁" (U+2581). This allows a sequence of tokens to be converted back to text and word boundaries to be restored.

For example, the phrase Blazingly fast tokenization! can be tokenized into

['▁Bl', 'az', 'ingly', '▁fast', '▁token', 'ization', '!']
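
The role of the meta symbol can be illustrated without the library. The sketch below is plain Python, not YouTokenToMe code: the helper names mark_word_boundaries and restore_text are hypothetical, and stripping the leading space on restoration is an assumption made for this illustration.

```python
META = "\u2581"  # "▁", the meta symbol that marks the start of a word


def mark_word_boundaries(text):
    """Prefix every whitespace-separated word with the meta symbol."""
    return [META + word for word in text.split()]


def restore_text(tokens):
    """Concatenate tokens and turn each meta symbol back into a space."""
    # Assumption: the leading space produced by the first word is dropped.
    return "".join(tokens).replace(META, " ").strip()


tokens = ["▁Bl", "az", "ingly", "▁fast", "▁token", "ization", "!"]
print(mark_word_boundaries("Blazingly fast tokenization!"))
print(restore_text(tokens))
```

Because every token that starts a word carries the meta symbol, no separate record of word boundaries is needed to rebuild the text.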

Installation

pip install youtokentome

Python interface

Example

Let's start with a self-contained example.

import random

import youtokentome as yttm

train_data_path = "train_data.txt"
model_path = "example.model"

# Generating random file with training data
# 10000 lines with 100 characters in each line
n_lines = 10000
n_characters = 100
with open(train_data_path, "w") as fout:
    for _ in range(n_lines):
        print("".join([random.choice("abcd ") for _ in range(n_characters)]), file=fout)

# Generating random text
test_text = "".join([random.choice("abcde ") for _ in range(100)])

# Training model
yttm.BPE.train(data=train_data_path, vocab_size=5000, model=model_path)

# Loading model
bpe = yttm.BPE(model=model_path)

# Two types of tokenization
print(bpe.encode([test_text], output_type=yttm.OutputType.ID))
print(bpe.encode([test_text], output_type=yttm.OutputType.SUBWORD))

 

Training model

youtokentome.BPE.train(data, model, vocab_size, coverage=1.0, n_threads=-1, pad_id=0, unk_id=1, bos_id=2, eos_id=3)

Trains BPE model and saves to file.

Args:

  • data: string, path to file with training data
  • model: string, path to where the trained model will be saved
  • vocab_size: int, number of tokens in the final vocabulary
  • coverage: float, fraction of characters covered by the model. Must be in the range [0, 1]. A good value to use is about 0.9999.
  • n_threads: int, number of parallel threads used to run. If -1 is passed, all available threads are used. Note that the number of threads is limited to 8 (see benchmark).
  • pad_id: int, reserved id for padding
  • unk_id: int, reserved id for unknown symbols
  • bos_id: int, reserved id for the beginning-of-sentence token
  • eos_id: int, reserved id for the end-of-sentence token

Returns: Class youtokentome.BPE with the loaded model.
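
To make the coverage argument more concrete, here is a minimal sketch of one way to read it: keep the most frequent characters until they account for the requested fraction of all character occurrences, and treat the rest as unknown. The function covered_characters is hypothetical, and the greedy selection rule is an assumption for illustration; the library's exact rule may differ.

```python
from collections import Counter


def covered_characters(text, coverage):
    """Return the most frequent characters that together reach `coverage`."""
    counts = Counter(text)
    total = sum(counts.values())
    kept, covered = set(), 0
    # Walk characters from most to least frequent.
    for ch, n in counts.most_common():
        if covered >= coverage * total:
            break  # the requested fraction is already covered
        kept.add(ch)
        covered += n
    return kept


text = "a" * 90 + "b" * 9 + "c"  # "c" is a rare character
print(sorted(covered_characters(text, 0.95)))
print(sorted(covered_characters(text, 1.0)))
```

With a coverage below 1.0 the rare character is left out, which under this reading means it would be encoded with unk_id instead of getting its own vocabulary entry.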

 

Model loading

youtokentome.BPE(model, n_threads=-1)

Class constructor. Loads the trained model.

  • model: string, path to the trained model
  • n_threads: int, number of parallel threads used to run. If equal to -1, then the maximum number of threads available will be used.

 

Methods

Class youtokentome.BPE has the following methods:

encode

encode(self, sentences, output_type=yttm.OutputType.ID, bos=False, eos=False, reverse=False, dropout_prob=0)

Args:

  • sentences: list of strings, sentences for tokenization.
  • output_type: enum, sentence can be tokenized to ids or subwords. Use OutputType.ID for ids and OutputType.SUBWORD for subwords.
  • bos: bool, if True then token “beginning of sentence” will be added
  • eos: bool, if True then token “end of sentence” will be added
  • reverse: bool, if True the output sequence of tokens will be reversed
  • dropout_prob: float, BPE-dropout probability (the probability of a merge being dropped). Must be in the range [0, 1].

Returns: a list of lists of integers if output_type is youtokentome.OutputType.ID, or a list of lists of strings if output_type is youtokentome.OutputType.SUBWORD.
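
The effect of dropout_prob can be sketched with a toy encoder. This is a simplified illustration of BPE-dropout, not the library's C++ implementation: the function bpe_encode_word and the merge table are made up for the example. At every step each applicable merge is skipped with probability dropout_prob, so the same word can be split in different ways.

```python
import random


def bpe_encode_word(word, merges, dropout_prob=0.0, rng=None):
    """Apply merges in priority order, randomly dropping candidates."""
    rng = rng or random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and rng.random() >= dropout_prob
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank = learned earliest
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge table, ordered from first to last learned merge.
merges = [("▁", "f"), ("a", "s"), ("▁f", "as"), ("▁fas", "t")]
print(bpe_encode_word("▁fast", merges, dropout_prob=0.0))
print(bpe_encode_word("▁fast", merges, dropout_prob=1.0))
print(bpe_encode_word("▁fast", merges, dropout_prob=0.3, rng=random.Random(0)))
```

With a probability of 0 the segmentation is deterministic, and with a probability of 1 no merge is ever applied, so the word falls back to single characters.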

 

vocab

vocab(self)

Returns: A list of vocab_size strings. The i-th string in the list corresponds to the i-th subword.

 

vocab_size

vocab_size(self)

Returns: int. Size of vocabulary.

 

subword_to_id

subword_to_id(self, subword)

Args:

  • subword: string.

Returns: Integer from the range [0, vocab_size-1]. Id of subword or, if there is no such subword in the vocabulary, unk_id will be returned.

 

id_to_subword

id_to_subword(self, id)

Args:

  • id: int, must be in the range [0, vocab_size-1]

Returns: string. Subword from vocabulary by id.
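
The two lookups behave like a pair of inverse mappings with a fallback for unknown subwords. The class below is a toy stand-in written for illustration, not the youtokentome.BPE class; the name ToyVocab, the vocabulary contents, and the ValueError for out-of-range ids are all assumptions.

```python
class ToyVocab:
    """Tiny model of the subword <-> id lookups described above."""

    def __init__(self, subwords, unk_id=1):
        self.subwords = list(subwords)
        self.unk_id = unk_id
        self.index = {s: i for i, s in enumerate(self.subwords)}

    def subword_to_id(self, subword):
        # Unknown subwords fall back to unk_id instead of raising.
        return self.index.get(subword, self.unk_id)

    def id_to_subword(self, id_):
        if not 0 <= id_ < len(self.subwords):
            raise ValueError("id must be in the range [0, vocab_size-1]")
        return self.subwords[id_]


# The first four entries stand in for the reserved pad/unk/bos/eos ids.
vocab = ToyVocab(["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁", "a", "b", "▁ab"])
print(vocab.subword_to_id("▁ab"))
print(vocab.subword_to_id("zzz"))
print(vocab.id_to_subword(5))
```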

 

decode

decode(self, ids, ignore_ids=None)

Convert each id to subword and concatenate with space symbol.

Args:

  • ids: list of lists of integers. All integers must be in the range [0, vocab_size-1]
  • ignore_ids: collection of integers. These ids are skipped during decoding. All integers must be in the range [0, vocab_size-1] [default: None]

Returns: List of strings.
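
A typical use of ignore_ids is to drop reserved tokens from the decoded text. The sketch below models that behavior in plain Python; decode_ids and the toy vocabulary are hypothetical, and the whitespace handling (meta symbol to space, leading space stripped) is an assumption, not a statement about the library's exact output.

```python
def decode_ids(ids, vocab, ignore_ids=None):
    """Map ids to subwords, skip ignored ids, and rebuild each sentence."""
    ignore = set(ignore_ids or ())
    texts = []
    for sentence in ids:
        pieces = [vocab[i] for i in sentence if i not in ignore]
        texts.append("".join(pieces).replace("\u2581", " ").strip())
    return texts


# Toy vocabulary: ids 0-3 stand in for the reserved pad/unk/bos/eos ids.
vocab = ["<PAD>", "<UNK>", "<BOS>", "<EOS>", "▁fast", "▁token", "ization"]
ids = [[2, 4, 5, 6, 3]]
print(decode_ids(ids, vocab))
print(decode_ids(ids, vocab, ignore_ids=[2, 3]))
```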

Command line interface

Example

$ yttm bpe --data TRAINING_DATA_FILE --model OUTPUT_MODEL_FILE --vocab_size 2000
$ yttm encode --model OUTPUT_MODEL_FILE --output_type subword < TEST_DATA_FILE > ENCODED_DATA 

Supported commands

YouTokenToMe supports the following commands:

$ yttm --help

Usage: yttm [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  bpe     Train BPE model.
  decode  Decode ids to text.
  encode  Encode text to ids or subwords.
  vocab   Print list of learned subwords.

The bpe command trains a Byte Pair Encoding model on a text file.

$ yttm bpe --help

Usage: yttm bpe [OPTIONS]

  Train BPE model.

Options:
  --data PATH           Training data file path.  [required]
  --model PATH          Output model file path.  [required]
  --vocab_size INTEGER  Number of tokens in the final vocabulary.  [required]
  --coverage FLOAT      Fraction of characters covered by the model.  [default: 1.0]
  --n_threads INTEGER   Number of threads.  [default: -1]
  --pad_id INTEGER      Padding token id.  [default: 0]
  --unk_id INTEGER      Unknown token id.  [default: 1]
  --bos_id INTEGER      'Begin of sentence' token id.  [default: 2]
  --eos_id INTEGER      'End of sentence' token id.  [default: 3]
  --help                Show this message and exit.

The encode command applies BPE encoding to a corpus of sentences. It reads from stdin and writes to stdout.

By default, encoding runs in parallel using n_threads threads. The number of threads is limited to 8 (see benchmark).

With the --stream option, --n_threads will be ignored and all sentences will be processed one by one. Each sentence will be tokenized and written to the stdout before the next sentence is read.
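
The streaming behavior described above amounts to a read, tokenize, write, flush loop. The following sketch shows that pattern in plain Python with a stand-in tokenizer; encode_stream and char_tokenize are hypothetical names and do not reflect how the yttm command is implemented internally.

```python
import io
import sys


def char_tokenize(line):
    """Stand-in tokenizer: one token per character, spaces become "▁"."""
    return list(line.replace(" ", "\u2581"))


def encode_stream(tokenize, source=sys.stdin, sink=sys.stdout):
    """Tokenize and emit each line before the next one is read."""
    for line in source:
        tokens = tokenize(line.rstrip("\n"))
        sink.write(" ".join(tokens) + "\n")
        sink.flush()  # make the result visible immediately


# In-memory streams are used here so the example runs without a terminal.
source = io.StringIO("ab c\nde\n")
sink = io.StringIO()
encode_stream(char_tokenize, source, sink)
print(sink.getvalue(), end="")
```

Flushing after every line trades throughput for latency, which is why a batch run over a large file is usually better served by the default parallel mode.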

$ yttm encode --help

Usage: yttm encode [OPTIONS]

  Encode text to ids or subwords.

Options:
  --model PATH         Path to file with learned model.  [required]
  --output_type TEXT   'id' or 'subword'.  [required]
  --n_threads INTEGER  Number of threads.  [default: -1]
  --bos                Add tab 'begin of sentence'.
  --eos                Add tab 'end of sentence'.
  --reverse            Reverse output sequence of tokens.
  --stream             Process each line before reading the next one.
  --dropout_prob       BPE-dropout probability (the probability of a merge being dropped). [default: 0]
  --help               Show this message and exit.

The vocab command prints the vocabulary. This can be useful for understanding the model.

$ yttm vocab --help

Usage: yttm vocab [OPTIONS]

  Print list of learned subwords.

Options:
  --model PATH  Path to file with learned model.  [required]
  --verbose     Add merging rules.
  --help        Show this message and exit.

The decode command converts ids back to text. It reads from stdin and writes to stdout.

$ yttm decode --help

Usage: yttm decode [OPTIONS]

  Decode ids to text.

Options:
  --model PATH  Path to file with learned model.  [required]
  --ignore_ids  List of indices to ignore for decoding. Example: --ignore_ids=1,2,3
  --help        Show this message and exit.
