A fast, modular, and lightweight BPE tokenizer for NLP research and prototyping.
ByteTok
ByteTok implements Byte Pair Encoding (BPE) at the byte level with a Rust-accelerated core for training and encoding. Text is first converted to raw bytes (0-255), then adjacent byte pairs are iteratively merged using learned pair statistics.
The training algorithm is based on an optimized BPE algorithm from the paper A Formal Perspective on Byte-Pair Encoding. This enables ByteTok to achieve O(N log V) training time and O(N log N) encoding time, versus the naive O(NV) approach.
Here, N denotes the length of the input text and V is the tokenizer's vocabulary size.
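For intuition, the classic training loop that the optimized algorithm improves on can be sketched as follows. This is the naive O(NV) variant, not ByteTok's Rust core, and the function name is illustrative rather than part of the ByteTok API:

```python
from collections import Counter

def train_bpe(text: str, vocab_size: int) -> list[tuple[int, int]]:
    """Naive byte-level BPE training: repeatedly merge the most frequent
    adjacent pair until the vocabulary reaches vocab_size. Each pass
    rescans the whole sequence, which is what makes this O(N*V)."""
    ids = list(text.encode("utf-8"))  # raw bytes, IDs 0-255
    merges: list[tuple[int, int]] = []
    next_id = 256  # first ID above the raw byte range
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        # Replace every occurrence of the best pair with the new token ID.
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        merges.append(best)
        next_id += 1
    return merges
```

The optimized algorithm avoids the full rescan per merge by maintaining pair counts incrementally, which is where the O(N log V) bound comes from.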
Features
- High-performance Rust-powered training, encoding, and decoding: Engineered from the ground up with a parallel processing pipeline to handle large-scale NLP datasets (1 GB+) efficiently for modern LLM applications.
- Built-in regex patterns: Choose from pre-tokenization regex presets, including GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, and DeepSeek.
- Custom regex patterns: Supported alongside the built-in presets.
- Special token strategies: Control how special tokens are handled during encoding.
- Serialization: Supports versioned .model/.vocab file formats for saving tokenizer state, as well as easy loading via a from_pretrained() function.
History
This project started as a weekend experiment with BPE for text compression. I later needed a tokenizer for my custom GPT, which was bottlenecked by context length due to character-level encoding. I wanted a simple API that did four things correctly at a reasonable speed:
- Train on custom text
- Save learned encodings
- Encode text
- Decode text
Feel free to check out robust libraries such as OpenAI's tiktoken and Google's sentencepiece, which are widely adopted in production environments. tiktoken resembles ByteTok the most, though ByteTok provides a training pipeline, which tiktoken lacks.
ByteTok was developed with a different focus: it prioritizes simplicity and usability, offering a clear API that efficiently maps strings to lists of token IDs without burdening users with complex configuration or excessive parameters.
Benchmarks
These benchmarks were conducted on a Linux x86_64 system equipped with an Intel Core i7-12700H processor (14 cores, 20 threads @ up to 4.70 GHz) and 32GB DDR5 RAM. Encoding and decoding throughput represent the speed of encode_batch() and decode_batch() operations, respectively.
Dataset: Sci-Fi Books (Gutenberg)
| Corpus Size | Vocab Size | Training Time | Encoding Throughput | Decoding Throughput | Compression Ratio | Size Reduction |
|---|---|---|---|---|---|---|
| 132.36 MB | 10,000 | 4.58 mins | 16.12 MB/sec | 82.4M tokens/sec | 1.38x | 27.5% |
| 216.96 MB | 10,000 | 8.75 mins | 13.82 MB/sec | 81.4M tokens/sec | 1.60x | 37.7% |
| 216.96 MB | 25,000 | 9.74 mins | 14.55 MB/sec | 70.2M tokens/sec | 1.68x | 40.6% |
| 216.96 MB | 50,000 | 10.67 mins | 14.99 MB/sec | 77.0M tokens/sec | 1.75x | 42.7% |
| 326.96 MB | 50,000 | 16.19 mins | 14.61 MB/sec | 79.3M tokens/sec | 1.44x | 30.7% |
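The last two columns of the table are related. Assuming the compression ratio is input bytes per output token, the size-reduction column follows directly from it; the helper below is a hypothetical illustration, not part of the ByteTok API:

```python
def compression_stats(num_bytes: int, num_tokens: int) -> tuple[float, float]:
    """Compression ratio as input bytes per output token, and the size
    reduction that ratio implies relative to byte-level tokenization."""
    ratio = num_bytes / num_tokens
    reduction = 1.0 - num_tokens / num_bytes  # equivalently, 1 - 1/ratio
    return ratio, reduction
```

For example, a 1.38x ratio implies roughly a 27.5% reduction, matching the first table row.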
Requirements
- Python >= 3.12
Installation
Install from PyPI:
# with pip
pip install bytetok
# or with uv (recommended)
uv add bytetok
Building from Source
If you want to develop or build from source, you will need the Rust toolchain (installed via rustup).
# clone the repository
git clone https://github.com/VihangaFTW/bytetok.git
# install with uv
uv sync
# or build with maturin
uv sync --group dev
uv run maturin develop --release
Quick Start
Here you will find the primary workflows for using ByteTok tokenizers. For detailed API usage and additional features, see the full documentation in the Wiki.
Basics
The API has been designed with simplicity in mind:
import bytetok as btok
# Create a tokenizer with a built-in pattern (default: gpt4o).
tokenizer = btok.get_tokenizer("gpt4o")
# Train on text.
tokenizer.train("your training corpus here...", vocab_size=1000)
# Encode and decode.
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
assert text == "Hello, world!"
# Save and reload.
tokenizer.save("my_tokenizer")
reloaded = btok.from_pretrained("my_tokenizer.model")
Custom regex patterns can be used for pre-tokenization:
import bytetok as btok
# Create a tokenizer with a custom pattern
# For example, split on whitespace and punctuation.
tokenizer = btok.get_tokenizer(custom_pattern=r"\w+|[^\w\s]")
For best results, choose one of the built-in presets, which have been extensively validated.
Parallel Encoding
ByteTok supports parallel encoding and decoding for faster processing of large batches of text.
Use encode_batch to encode large collections of texts in parallel, then decode the resulting list of token sequences in parallel with decode_batch:
import bytetok as btok
tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)
# Encode a batch of texts in parallel.
texts = ["First document...", "Second document...", "Third document..."]
encoded = tokenizer.encode_batch(texts)
# Decode the batch in parallel.
decoded = tokenizer.decode_batch(encoded, errors="replace")
assert decoded[0] == "First document..."
Special Tokens
Register special tokens after training, then encode with a strategy to control how they are handled:
import bytetok as btok
tokenizer = btok.get_tokenizer("gpt4o")
tokenizer.train("your training corpus here...", vocab_size=1000)
# Set special tokens (IDs must be >= vocab size).
tokenizer.set_special_tokens({"<|endoftext|>": 15005, "<|pad|>": 13005})
# Encode with strategy: "all" allows special tokens in text; "none" ignores them.
strategy = btok.get_strategy("all")
tokens = tokenizer.encode("Hello<|endoftext|>world", strategy=strategy)
# Batch encoding with special tokens.
encoded = tokenizer.encode_batch(
["Doc one.", "Doc two<|pad|>padding", "Doc three."],
strategy=strategy,
)
ByteTok automatically checks for conflicts: special tokens whose IDs would replace existing tokens in the vocabulary, and special tokens with duplicate IDs.
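The checks described above might look roughly like the sketch below. This is a hypothetical validator written for illustration, not ByteTok's actual implementation:

```python
def validate_special_tokens(special: dict[str, int], vocab_size: int) -> None:
    """Reject special tokens that would collide with the learned vocabulary
    (IDs below vocab_size) or that duplicate each other's IDs."""
    seen: dict[int, str] = {}
    for token, token_id in special.items():
        if token_id < vocab_size:
            raise ValueError(
                f"{token!r} has id {token_id}, which collides with the vocabulary"
            )
        if token_id in seen:
            raise ValueError(f"{token!r} and {seen[token_id]!r} share id {token_id}")
        seen[token_id] = token
```

Keeping special-token IDs at or above the vocabulary size guarantees they can never shadow a learned merge.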
Acknowledgment
ByteTok is inspired by Andrej Karpathy's minbpe. A walkthrough of the minbpe repository is available on his YouTube channel.
License
This project is licensed under the MIT License. See the LICENSE file for details.