Fast bare-bones BPE for modern tokenizer training

bpeasy

Overview

bpeasy is a Python package that provides a tokenizer trainer, implementing an efficient version of Byte Pair Encoding (BPE) in 400 lines of Rust. The implementation largely follows the Hugging Face tokenizers library, but makes opinionated decisions to simplify tokenizer training, specifically to:

  1. Treat text data at the byte level first: all text is converted to bytes before training, rather than using a character-level approach (as in Hugging Face).
  2. Always use a regex-based split pre-tokenizer. This customisable regex is applied to the text before training; it decides where the text is split and so limits which tokens are possible. This is technically possible in Hugging Face but is not well documented. We also use the fancy-regex crate, which supports a richer set of regex features than the regex crate used in Hugging Face.
  3. Use int64 types for counting, which allows training on much larger datasets without the risk of overflow.
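To make the three decisions above concrete, here is a minimal pure-Python sketch of byte-level BPE training with a regex pre-split. It is not the bpeasy implementation: the function name `toy_train_bpe` is hypothetical, the standard-library `re` module stands in for fancy-regex (so the pattern is a simplified one without `\p{L}` classes), and none of the efficiency tricks of the Rust code are shown.

```python
import re
from collections import Counter


def toy_train_bpe(texts, pattern, max_token_length, vocab_size):
    """Toy byte-level BPE trainer; returns a dict mapping token bytes to rank."""
    # Pre-tokenize with the regex and count identical pieces once.
    # Each piece is stored as a tuple of single-byte tokens.
    pieces = Counter()
    for text in texts:
        for match in re.finditer(pattern, text):
            data = match.group().encode("utf-8")
            pieces[tuple(bytes([b]) for b in data)] += 1

    # Start from all 256 single bytes.
    vocab = {bytes([i]): i for i in range(256)}

    while len(vocab) < vocab_size:
        # Count adjacent pairs inside each piece; pairs never span two pieces.
        pair_counts = Counter()
        for piece, count in pieces.items():
            for left, right in zip(piece, piece[1:]):
                if len(left) + len(right) <= max_token_length:
                    pair_counts[(left, right)] += count
        if not pair_counts:
            break

        # Pick the most frequent pair; ties are broken deterministically.
        (left, right), _ = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        merged = left + right
        if merged not in vocab:
            vocab[merged] = len(vocab)

        # Replace every occurrence of the pair with the merged token.
        new_pieces = Counter()
        for piece, count in pieces.items():
            out, i = [], 0
            while i < len(piece):
                if i + 1 < len(piece) and piece[i] == left and piece[i + 1] == right:
                    out.append(merged)
                    i += 2
                else:
                    out.append(piece[i])
                    i += 1
            new_pieces[tuple(out)] += count
        pieces = new_pieces

    return vocab


toy_vocab = toy_train_bpe(
    ["low lower lowest", "low low"],
    r"\s?\w+|\s?[^\w\s]+|\s+",  # simplified split pattern for the sketch
    max_token_length=8,
    vocab_size=260,
)
```

Because pairs are only counted inside a regex match, the split pattern bounds which tokens can ever be learned, and the byte-level start means any input can be encoded without an unknown token.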

You can think of bpeasy as the tiktoken training code that never was.

See the benchmarks section for a comparison with the Hugging Face library.

Installation

Simply install the package using pip:

pip install bpeasy

Training

The training function is designed to be bare-bones and returns the trained tokenizer vocab as a dictionary of bytes to integers. This allows maximum flexibility in how you use the tokenizer. For example, you can then port the vocab to tiktoken or Hugging Face tokenizers (see below).

import bpeasy

# should be an iterator over str
# (jsonl_content_iterator and args stand in for your own data loader and arguments)
iterator = jsonl_content_iterator(args)
# example regex from GPT-4
regex_pattern = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

# returns the vocab (dict[bytes, int])
vocab = bpeasy.train_bpe(
    iterator,
    regex_pattern,
    args.max_sentencepiece_length, # max length of tokens
    args.vocab_size, # max size of vocab
)
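The example above leaves `jsonl_content_iterator` undefined. As a minimal sketch, assuming the training data is a JSON Lines file with the text stored under a `"text"` key, such a helper could look like the following. The signature (`path`, `key`) is hypothetical and differs from the `args` object used above.

```python
import json
from typing import Iterator


def jsonl_content_iterator(path: str, key: str = "text") -> Iterator[str]:
    """Yield one string per JSONL record, skipping blank lines and missing keys."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            value = record.get(key)
            # Only yield real strings, since the trainer expects an iterator over str.
            if isinstance(value, str):
                yield value
```

Because this is a generator, the file is streamed line by line rather than loaded into memory, which matters for large training corpora.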

Alternatively, you can also train using the basic tokenizer class provided:

from bpeasy.tokenizer import BPEasyTokenizer

tokenizer = BPEasyTokenizer.train(
    iterator, # iterator over str
    vocab_size=vocab_size,
    max_token_length=max_token_length,
    regex_pattern=regex_pattern,
    special_tokens=["<s>", "<pad>", "</s>"],
    fill_to_nearest_multiple_of_eight=True,
    name="bpeasy",
)
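The `fill_to_nearest_multiple_of_eight` option pads the vocabulary size up to a multiple of 8. The sketch below shows one way such padding could work, assuming it is done by appending placeholder special tokens; the function name and the `<|dummy_i|>` token names are hypothetical and may not match what bpeasy does internally.

```python
from typing import List


def pad_special_tokens(vocab_size: int, special_tokens: List[str], multiple: int = 8) -> List[str]:
    """Append placeholder tokens until vocab_size + len(tokens) is a multiple of `multiple`."""
    padded = list(special_tokens)
    total = vocab_size + len(padded)
    # Number of extra tokens needed to reach the next multiple (0 if already aligned).
    n_missing = (-total) % multiple
    padded += [f"<|dummy_{i}|>" for i in range(n_missing)]
    return padded


padded_tokens = pad_special_tokens(32000, ["<s>", "<pad>", "</s>"])
```

Vocabulary sizes that are multiples of 8 are a common choice because embedding and output matrices with such dimensions tend to map better onto GPU hardware.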

Encoding/Decoding

To test your tokenizer, you can use the BPEasyTokenizer class, a wrapper around the tiktoken.Encoding class that simplifies the handling of vocabularies, special tokens, and regex patterns for tokenization.

from bpeasy.tokenizer import BPEasyTokenizer

your_special_tokens = ["<s>", "<pad>", "</s>"]

tokenizer = BPEasyTokenizer(
    vocab=vocab,
    regex_pattern=regex_pattern,
    special_tokens=your_special_tokens,
    fill_to_nearest_multiple_of_eight=True, # pad vocab to multiple of 8
    name="bpeasy" # optional name for the tokenizer
)

test = "hello_world"

# encode and decode use the tiktoken functions
encoded = tokenizer.encode(test)
decoded = tokenizer.decode(encoded)
# decoded == "hello_world"

You can also use tiktoken directly, but you would need to handle the special tokens and regex pattern yourself:

import bpeasy
import tiktoken

vocab = bpeasy.train_bpe(...)
special_tokens = ["<s>", "<pad>", "</s>"]

# Sort the vocab by rank
sorted_vocab = sorted(list(vocab.items()), key=lambda x: x[1])

# add special tokens
special_token_ranks = {}
for special_token in special_tokens:
    special_token_ranks[special_token] = len(sorted_vocab)
    sorted_vocab.append((special_token.encode("utf-8"), len(sorted_vocab)))

full_vocab = dict(sorted_vocab)

encoder = tiktoken.Encoding(
    name="bpeasy",  # any name for the encoding
    pat_str=regex_pattern,
    mergeable_ranks=full_vocab,
    special_tokens=special_token_ranks,
)

Save/Load tokenizer from file

We provide basic utility functions to save the tokenizer to and load it from a JSON file.

tokenizer.save("path_to_file.json")

tokenizer = BPEasyTokenizer.from_file("path_to_file.json")

Export to Hugging Face format

We also support exporting the tokenizer to the Hugging Face format, which can then be used directly with the Hugging Face transformers library.

from bpeasy.tokenizer import BPEasyTokenizer

tokenizer = BPEasyTokenizer(
    ...
)

tokenizer.export_to_huggingface_format("hf_tokenizer.json")

from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="hf_tokenizer.json")

Export vocab to tiktoken txt format

import bpeasy
from bpeasy import save_vocab_to_tiktoken  # assumed to be exported from the package root

vocab = bpeasy.train_bpe(...)

# saves the vocab to a tiktoken txt file format
save_vocab_to_tiktoken(vocab, "vocab.txt", special_tokens=["<s>", "<pad>", "</s>"])

If you want to use the tiktoken txt format, you will still need to handle the regex and special tokens yourself, as shown above.
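For reference, the txt format read by tiktoken's BPE loader stores one token per line as the base64-encoded token bytes, a space, and the integer rank. The helpers below are a hypothetical sketch of that layout using only the standard library; they are not bpeasy functions, and how `save_vocab_to_tiktoken` writes special tokens is not covered here.

```python
import base64


def write_tiktoken_txt(vocab, path):
    """Write a bytes-to-rank vocab as '<base64 token> <rank>' lines, sorted by rank."""
    with open(path, "wb") as f:
        for token, rank in sorted(vocab.items(), key=lambda kv: kv[1]):
            f.write(base64.b64encode(token) + b" " + str(rank).encode("ascii") + b"\n")


def read_tiktoken_txt(path):
    """Read the same format back into a dict mapping token bytes to rank."""
    vocab = {}
    with open(path, "rb") as f:
        for line in f:
            if line.strip():
                token_b64, rank = line.split()
                vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab
```

Base64 keeps the file plain ASCII even though tokens are arbitrary byte sequences, including bytes that are not valid UTF-8 on their own.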

Contributing

Contributions are welcome! Please open an issue if you have any suggestions or improvements.

License

This project is licensed under the MIT License.
