
A fast, modular and light-weight BPE tokenizer for NLP research and prototyping.

Project description

ByteTok


ByteTok implements Byte Pair Encoding (BPE) at the byte-level with a Rust-accelerated core for training and encoding. Text is first converted to raw bytes (0-255), then iteratively merged using learned pair statistics.

The training algorithm is based on the optimized BPE algorithm from the paper A Formal Perspective on Byte-Pair Encoding, which enables ByteTok to achieve O(N log V) training time and O(N log N) encoding time, compared with the naive O(NV) approach.

Here, N denotes the length of the input text and V is the tokenizer's vocabulary size.
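
For intuition, the core byte-level merge loop can be sketched in plain Python. This is only an illustrative naive O(NV) baseline, not ByteTok's Rust implementation, and train_bpe is a hypothetical helper name:

from collections import Counter

def train_bpe(text: str, vocab_size: int) -> list[tuple[int, int]]:
    # Start from raw UTF-8 bytes; values 0-255 form the base vocabulary.
    ids = list(text.encode("utf-8"))
    merges: list[tuple[int, int]] = []
    next_id = 256
    while next_id < vocab_size:
        # Count adjacent pairs and merge the most frequent one into a new token.
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                merged.append(next_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
        next_id += 1
    return merges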

Features

  • High-performance Rust-powered training, encoding, and decoding: Engineered from the ground up with a parallel processing pipeline to handle large-scale NLP datasets (1 GB+) efficiently for modern LLM applications.
  • Built-in regex patterns: Choose from pre-tokenization regex presets matching GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, and DeepSeek.
  • Custom regex patterns: Supported alongside the built-in presets.
  • Special token strategies: Control how special tokens are handled during encoding.
  • Serialization: Supports versioned .model / .vocab file formats for saving tokenizer state, as well as easy loading via a from_pretrained() function.

History

This project started as a weekend experiment with BPE for text compression. I later needed a tokenizer for my custom GPT, which was bottlenecked by context length due to character-level encoding. I wanted a simple API that did four things correctly at a reasonable speed:

  • Train on custom text
  • Save learned encodings
  • Encode text
  • Decode text

Feel free to check out robust libraries such as OpenAI's tiktoken and Google's sentencepiece, which are widely adopted in production environments. tiktoken resembles ByteTok the most, but ByteTok provides a training pipeline, which tiktoken lacks.

In contrast, ByteTok was developed with a different focus: it prioritizes simplicity and usability, offering a clear API that efficiently maps strings to lists of token IDs without burdening users with complex configuration or excessive parameters.

Benchmarks

These benchmarks were conducted on a Linux x86_64 system with an Intel Core i7-12700H processor (14 cores / 20 threads, up to 4.70 GHz) and 32 GB DDR5 RAM. Encoding and decoding throughput measure the speed of encode_batch() and decode_batch() operations, respectively.
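
As a point of reference, encoding throughput can be reproduced with a simple wall-clock measurement of encode_batch(). The sketch below is illustrative; measure_encode_throughput is a hypothetical helper, not part of ByteTok:

import time

import bytetok as btok


def measure_encode_throughput(tokenizer, texts: list[str]) -> float:
    # Throughput in MB/sec of raw UTF-8 input processed by encode_batch().
    total_mb = sum(len(t.encode("utf-8")) for t in texts) / 1_000_000
    start = time.perf_counter()
    tokenizer.encode_batch(texts, show_progress=False)
    return total_mb / (time.perf_counter() - start)


tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)
print(f"{measure_encode_throughput(tokenizer, ['another document...'] * 10_000):.2f} MB/sec")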

Dataset: Sci-Fi Books (Gutenberg)

| Corpus Size | Vocab Size | Training Time | Encoding Throughput | Decoding Throughput | Compression Ratio | Size Reduction |
|---|---|---|---|---|---|---|
| 132.36 MB | 10,000 | 4.58 mins | 16.12 MB/sec | 82.4M tokens/sec | 1.38x | 27.5% |
| 216.96 MB | 10,000 | 8.75 mins | 13.82 MB/sec | 81.4M tokens/sec | 1.60x | 37.7% |
| 216.96 MB | 25,000 | 9.74 mins | 14.55 MB/sec | 70.2M tokens/sec | 1.68x | 40.6% |
| 216.96 MB | 50,000 | 10.67 mins | 14.99 MB/sec | 77.0M tokens/sec | 1.75x | 42.7% |
| 326.96 MB | 50,000 | 16.19 mins | 14.61 MB/sec | 79.3M tokens/sec | 1.44x | 30.7% |
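
The last two columns are linked: assuming the compression ratio is the original byte length divided by the encoded size, the size reduction is roughly 1 - 1/ratio, which matches the table to within rounding:

# Size reduction implied by each compression ratio (assumed relation: 1 - 1/ratio).
for ratio in (1.38, 1.60, 1.68, 1.75, 1.44):
    print(f"{ratio:.2f}x -> {1 - 1 / ratio:.1%} smaller")
# 1.38x -> 27.5%, 1.60x -> 37.5%, 1.68x -> 40.5%, 1.75x -> 42.9%, 1.44x -> 30.6%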

Requirements

  • Python >= 3.12

Installation

Install from PyPI:

# with pip
pip install bytetok

# or with uv (recommended)
uv add bytetok

Building from Source

If you want to develop or build from source, you will need the Rust toolchain, installed via rustup.

# clone the repository and enter it
git clone https://github.com/VihangaFTW/bytetok.git
cd bytetok

# install with uv
uv sync

# or build with maturin
uv sync --group dev
uv run maturin develop --release

Quick Start

Here you will find the primary workflows for using ByteTok tokenizers. For detailed API usage and additional features, see the full documentation in the Wiki.

Basics

The API has been designed with simplicity in mind:

import bytetok as btok


# Create a tokenizer with a built-in pattern (default: gpt4o).
tokenizer = btok.get_tokenizer("gpt4o")

# Train on text.
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode and decode.
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
assert text == "Hello, world!"

# Save and reload.
tokenizer.save("my_tokenizer")
reloaded = btok.from_pretrained("my_tokenizer.model")

Custom regex patterns can be used for pre-tokenization:

import bytetok as btok


# Create a tokenizer with a custom pattern
# For example, split on whitespace and punctuation.
tokenizer = btok.get_tokenizer(custom_pattern=r"\w+|[^\w\s]")

For best results, it is recommended to choose from the built-in presets, which have been extensively validated.

Parallel Encoding

ByteTok supports parallel encoding and decoding for faster processing of large batches of text. Use encode_batch to encode a collection of texts in parallel, then decode the resulting token sequences in parallel with decode_batch:

import bytetok as btok


tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Encode a batch of texts in parallel.
texts = ["First document...", "Second document...", "Third document..."]
encoded = tokenizer.encode_batch(texts, show_progress=False)

# Decode the batch in parallel.
decoded = tokenizer.decode_batch(encoded, errors="replace", show_progress=False)
assert decoded[0] == "First document..."
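
A note on the errors argument: it presumably mirrors the error-handling modes of Python's built-in bytes.decode(), so "replace" substitutes invalid byte sequences with U+FFFD rather than raising. For reference:

# Standard-library behaviour that errors="replace" is assumed to mirror:
bad_bytes = b"Hello \xff world"                       # \xff is not valid UTF-8
print(bad_bytes.decode("utf-8", errors="replace"))    # Hello � world
# errors="strict" would raise UnicodeDecodeError on the same input.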

Special Tokens

Register special tokens after training, then encode with a strategy to control how they are handled:

import bytetok as btok


tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)

# Set special tokens (IDs must be >= vocab size).
tokenizer.set_special_tokens({"<|endoftext|>": 15005, "<|pad|>": 13005})

# Encode with strategy: "all" allows special tokens in text; "none" ignores them.
strategy = btok.get_strategy("all")
tokens = tokenizer.encode("Hello<|endoftext|>world", strategy=strategy)

# Batch encoding with special tokens.
encoded = tokenizer.encode_batch(
    ["Doc one.", "Doc two<|pad|>padding", "Doc three."],
    strategy=strategy,
)
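
Continuing the example above, here is a sketch of how the two strategies differ on the same input, assuming "all" maps the registered special string to its ID and "none" treats it as ordinary text, per the comment above:

# "all": the special string becomes its single registered ID (15005 here).
with_specials = tokenizer.encode("Hello<|endoftext|>world", strategy=btok.get_strategy("all"))

# "none": the same text is tokenized as ordinary bytes, so no special ID appears.
without_specials = tokenizer.encode("Hello<|endoftext|>world", strategy=btok.get_strategy("none"))

assert 15005 in with_specials
assert 15005 not in without_specials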

ByteTok automatically checks for conflicts, such as special tokens that would replace existing tokens in the vocabulary or duplicate entries.

Acknowledgment

ByteTok is inspired by Andrej Karpathy's minbpe. A walkthrough of the minbpe repository is available on his YouTube channel.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Project details


Download files

Download the file for your platform.

Source Distribution

bytetok-0.2.1.tar.gz (40.0 kB)

Uploaded: Source

Built Distributions


bytetok-0.2.1-cp312-abi3-win_amd64.whl (1.1 MB)

Uploaded: CPython 3.12+, Windows x86-64

bytetok-0.2.1-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB)

Uploaded: CPython 3.12+, manylinux: glibc 2.17+ ARM64

bytetok-0.2.1-cp312-abi3-macosx_11_0_arm64.whl (1.2 MB)

Uploaded: CPython 3.12+, macOS 11.0+ ARM64

bytetok-0.2.1-cp312-abi3-macosx_10_12_x86_64.whl (1.3 MB)

Uploaded: CPython 3.12+, macOS 10.12+ x86-64

bytetok-0.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)

Uploaded: CPython 3.8, manylinux: glibc 2.17+ x86-64

File details

Details for the file bytetok-0.2.1.tar.gz.

File metadata

  • Download URL: bytetok-0.2.1.tar.gz
  • Upload date:
  • Size: 40.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.5

File hashes

Hashes for bytetok-0.2.1.tar.gz
Algorithm Hash digest
SHA256 c7d37b63583cc12ce51d1a3751a8268f8c49cc837af6122b36a6d2b22ee77998
MD5 70ddbc0284fc01b286b3aa0ffcf0ae1c
BLAKE2b-256 00dfdfac74e23f11bcaf3828c534fadca15237d32bacd7d32e1089512646cade


File details

Details for the file bytetok-0.2.1-cp312-abi3-win_amd64.whl.

File metadata

  • Download URL: bytetok-0.2.1-cp312-abi3-win_amd64.whl
  • Upload date:
  • Size: 1.1 MB
  • Tags: CPython 3.12+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.12.5

File hashes

Hashes for bytetok-0.2.1-cp312-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 720faf5c50fdcaf50b2adcf5c6c152d881f7a5e7d4250441f9758456b10d03e4
MD5 b86e587594d4562659cd652b0b7e3af3
BLAKE2b-256 ae754d372227b480ab714c08e401a18c09ad99ecdacd024206839bb8bec004f4


File details

Details for the file bytetok-0.2.1-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bytetok-0.2.1-cp312-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 de2110c187dbc36b50e5337c9af77e5a5cfa1b73ecbf350f4dcbaf3416c55ca2
MD5 25eec3ce9230c6ebd0b1347a7700a645
BLAKE2b-256 99bbc08f052cd2edc0ad58767488154dff81dcd00bac0c9e9ebc13aee93ec454


File details

Details for the file bytetok-0.2.1-cp312-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bytetok-0.2.1-cp312-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d9d810bb8cb9a4aec91b084e7aa5a688fb6be6418b095551a89b6ddc8bcf5334
MD5 a4195fd22b0905501aac6e42f8630208
BLAKE2b-256 dd014f7aedb5b2f59aba7f1662cf233cf749c1f7566c290670c707e12f0a7e67


File details

Details for the file bytetok-0.2.1-cp312-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for bytetok-0.2.1-cp312-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 907f0465c025844d6dac8c2a9d4a0908a5db9122405fef6cb225158f9bd48bd9
MD5 c31ef0951e78a3613328c4b079c12f89
BLAKE2b-256 bf0712ecf07cf601b20c4e7e0246f4464a0aa8c100a4d14de884369e43b0ecef


File details

Details for the file bytetok-0.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bytetok-0.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 ca7e28dd6f90cb45d58013f0af70b8d653b35ea3b3ded55cc2e5e1077a9b9b6f
MD5 f78313c000cf51a15e8d2c74854943c9
BLAKE2b-256 561ce702ca3e3505b7cf7cdc245b6a90e967ead18016cf58a3e472ffd1e83eaf

