ByteTok
A fast, modular and lightweight BPE tokenizer for NLP research and prototyping.
ByteTok implements Byte Pair Encoding (BPE) at the byte-level with a Rust-accelerated core for training and encoding. Text is first converted to raw bytes (0-255), then iteratively merged using learned pair statistics. The training algorithm is based on Algorithm 2 from "A Formal Perspective on Byte-Pair Encoding", achieving O(N log V) training and O(N log N) encoding versus the naive O(NV) approach.
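For intuition, here is a minimal sketch of the naive O(NV) training loop that the Rust core replaces. This is illustrative Python only, not ByteTok's actual implementation:

```python
# Naive byte-level BPE training (illustrative only; ByteTok's Rust core
# uses the faster O(N log V) algorithm instead of this O(NV) loop).
from collections import Counter

def naive_bpe_train(text: str, num_merges: int) -> dict[tuple[int, int], int]:
    ids = list(text.encode("utf-8"))  # start from raw bytes 0-255
    merges: dict[tuple[int, int], int] = {}
    for new_id in range(256, 256 + num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))  # full rescan every iteration
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]   # most frequent adjacent pair
        merges[best] = new_id
        # replace every occurrence of the best pair with the new token
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return merges
```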
History
This project started as a weekend experiment with BPE for text compression. I later needed a tokenizer for my custom GPT, which was bottlenecked by context length due to character-level encoding. I wanted a simple API that did four things correctly at a reasonable speed:
- Train on custom text
- Save learned encodings
- Encode text
- Decode text
Libraries like OpenAI's tiktoken and Google's sentencepiece exist and are probably better for production work. But ByteTok wasn't designed to compete with them or benchmaxx. I wanted a straightforward API that took a string and returned a list of integers, not something that forced me to read through documentation for 200 function arguments (looking at you, sentencepiece).
As my dataset requirements grew, the naive BPE implementation started struggling. So I rewrote the trainer and encoder in Rust using a much more efficient algorithm 😎.
Features
- Fast Rust-backed training and encoding via PyO3/maturin for datasets larger than 100 MB; 600-1000x faster than a naive O(NV) implementation.
- Built-in regex patterns from GPT-2, GPT-4, GPT-4o, LLaMA 3, Qwen 2, DeepSeek, StarCoder, Falcon, and BLOOM.
- Custom patterns supported alongside the built-in presets.
- Special token strategies for controlling how special tokens are handled during encoding.
- Serialization with a versioned `.model`/`.vocab` file format and a `from_pretrained()` loader.
Benchmarks
Benchmarks were run on Linux x86_64 with an Intel Core i7-12700H (14 cores / 20 threads, up to 4.70 GHz) and 32 GB DDR5 RAM.
| Dataset | Corpus Size | Vocab Size | Training Time | Encoding Throughput | Decoding Throughput | Compression Ratio | Size Reduction |
|---|---|---|---|---|---|---|---|
| Sci-Fi Books (Gutenberg) | 88.85 MB (93M chars) | 25,000 | 198s (~3.3 mins) | 2.99M chars/sec (2.85 MB/sec) | 19.2M tokens/sec | 1.43x | 30.3% |
| Sci-Fi Books (Gutenberg) | 216.96 MB (227M chars) | 10,000 | 523s (~8.7 mins) | 2.86M chars/sec (2.73 MB/sec) | 17.1M tokens/sec | 1.60x | 37.7% |
| Sci-Fi Books (Gutenberg) | 216.96 MB (227M chars) | 25,000 | 579s (~9.65 mins) | 2.85M chars/sec (2.72 MB/sec) | 17.0M tokens/sec | 1.68x | 40.6% |
| Sci-Fi Books (Gutenberg) | 216.96 MB (227M chars) | 50,000 | 640s (~10.7 mins) | 2.76M chars/sec (2.63 MB/sec) | 16.4M tokens/sec | 1.75x | 42.7% |
| Sci-Fi Books (Gutenberg) | 326.96 MB (343M chars) | 50,000 | 1048s (~17.5 mins) | 2.82M chars/sec (2.69 MB/sec) | 7.02M tokens/sec | 1.44x | 30.7% |
Requirements
- Python >= 3.13
Installation
Install from PyPI:
```bash
# with pip
pip install bytetok

# or with uv (recommended)
uv add bytetok
```
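To check that the install worked, import the package and list the built-in patterns:

```python
import bytetok

# sanity check: should print the built-in pattern names
print(bytetok.list_patterns())
```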
Building from Source
If you want to develop or build from source, you'll need a Rust toolchain (rustup):
```bash
# clone the repository
git clone https://github.com/vihanga-malaviarachchi/bytetok.git
cd bytetok

# install with uv
uv sync

# or build with maturin
pip install maturin
maturin develop
```
Quick Start
```python
import bytetok

# create a tokenizer with a built-in pattern (default: gpt4o)
tokenizer = bytetok.get_tokenizer("gpt4o")

# train on text
tokenizer.train("your training corpus here...", vocab_size=1000)

# encode and decode
tokens = tokenizer.encode("Hello, world!")
text = tokenizer.decode(tokens)
assert text == "Hello, world!"

# save and reload
tokenizer.save("my_tokenizer")
reloaded = bytetok.from_pretrained("my_tokenizer.model")
```
API Reference
Factory Functions
bytetok.get_tokenizer(pattern="gpt4o", *, custom_pattern=None)
Create a RegexTokenizer with a built-in or custom regex pattern.
- pattern (`str`) -- Name of a built-in pattern. Ignored when `custom_pattern` is set. Default: `"gpt4o"`.
- custom_pattern (`str | None`) -- A custom regex pattern string. Overrides `pattern` when provided.
- Returns: `Tokenizer`
- Raises: `PatternError` if the custom pattern is invalid regex.
```python
# built-in pattern
tokenizer = bytetok.get_tokenizer("llama3")

# custom pattern
tokenizer = bytetok.get_tokenizer(custom_pattern=r"'s|'t|'re|'ve|'m|'ll|'d| ?\w+")
```
bytetok.from_pretrained(model_path)
Load a previously saved tokenizer from a .model file. The tokenizer type is auto-detected from the file header.
- model_path (`str`) -- Path to the `.model` file.
- Returns: `Tokenizer` (either `BasicTokenizer` or `RegexTokenizer`, depending on what was saved).
- Raises: `ModelLoadError` if the file does not exist, has the wrong extension, contains an unknown tokenizer type, or has a version mismatch.
```python
tokenizer = bytetok.from_pretrained("my_tokenizer.model")
```
bytetok.get_strategy(name="none-raise", allowed_subset=None)
Create a special token handling strategy for use with encode().
- name (`"all" | "none" | "none-raise" | "custom"`) -- Strategy name.
- allowed_subset (`set[str] | None`) -- Required when `name="custom"`. The set of special token strings to allow.
- Returns: `SpecialTokenStrategy`
- Raises: `StrategyError` if the name is unknown or `"custom"` is used without `allowed_subset`.
```python
strategy = bytetok.get_strategy("all")
strategy = bytetok.get_strategy("custom", allowed_subset={"<|endoftext|>"})
```
bytetok.list_patterns()
Return the names of all available built-in regex patterns.
- Returns: `list[str]`
```python
bytetok.list_patterns()
# ['GPT2', 'GPT4', 'GPT4O', 'LLAMA3', 'QWEN2', 'DEEPSEEK_CODER', 'DEEPSEEK_LLM',
#  'STARCODER', 'FALCON', 'BLOOM']
```
bytetok.get_pattern(name)
Get the regex pattern string for a specific built-in pattern by name.
- name (`str`) -- Name of the built-in pattern (case-insensitive).
- Returns: `str` -- The regex pattern string.
- Raises: `PatternError` if the pattern name is unknown.
```python
# get a specific pattern string
pattern_str = bytetok.get_pattern("llama3")

# use it to create a tokenizer
tokenizer = bytetok.RegexTokenizer(pattern=pattern_str)
```
bytetok.list_strategies()
Return the names of all available special token strategies.
- Returns: `list[str]`
```python
bytetok.list_strategies()
# ['all', 'none', 'none-raise', 'custom']
```
Tokenizer Classes
All tokenizers inherit from the abstract base class Tokenizer. The two concrete implementations are BasicTokenizer and RegexTokenizer.
The `BasicTokenizer` serves as documentation for the simplest implementation of a BPE tokenizer. It is not recommended for actual use because its decoding of multi-byte UTF-8 sequences is lossy.

All of ByteTok's factory methods default to `RegexTokenizer`. For custom extensions or implementations, always inherit from `RegexTokenizer`.
Tokenizer (abstract base class)
Manages vocabulary, byte pair merges, and serialization. You do not instantiate this directly; use RegexTokenizer or the factory functions instead.
Attributes
| Attribute | Type | Description |
|---|---|---|
| `merges` | `dict[tuple[int, int], int]` | Byte pair -> merged token ID mapping. |
| `vocab` | `dict[int, bytes]` | Token ID -> byte sequence mapping. |
| `pat` | `str` | Regex pattern used for text splitting (if any). |
| `special_toks` | `dict[str, int]` | Special token string -> token ID mapping. |
train(text, vocab_size, verbose=False)
Train the tokenizer by learning byte pair merges from the input.
- text (`str | list[str]`) -- Training corpus. Lists are concatenated.
- vocab_size (`int`) -- Target vocabulary size. Must be > 256.
- verbose (`bool`) -- Log each merge operation. Default: `False`.
- Raises: `VocabularyError` if `vocab_size <= 256`. `TrainingError` if the input is empty.
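A typical call, reading a corpus from disk (the corpus file path here is hypothetical):

```python
tok = bytetok.get_tokenizer("gpt2")

# train on a single string read from disk
with open("corpus.txt", encoding="utf-8") as f:
    tok.train(f.read(), vocab_size=5000, verbose=True)

# a list of documents is also accepted and concatenated internally:
# tok.train(["first document", "second document"], vocab_size=300)
```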
encode(text, strategy=None)
Encode text into a list of integer token IDs.
- text (`str`) -- Text to encode.
- strategy (`SpecialTokenStrategy | None`) -- How to handle special tokens. `None` means no special token handling.
- Returns: `list[int]`
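For example:

```python
ids = tok.encode("Hello, world!")  # no special token handling

# honour any registered special tokens during encoding
ids = tok.encode("Hello<|endoftext|>", strategy=bytetok.get_strategy("all"))
```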
decode(tokens)
Decode a list of token IDs back into text.
- tokens (`list[int]`) -- Token IDs to decode.
- Returns: `str`
- Raises: `VocabularyError` if a token ID is not in the vocabulary (`RegexTokenizer`).
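Encoding and decoding round-trip losslessly; unknown IDs raise `VocabularyError`:

```python
from bytetok.errors import VocabularyError

ids = tok.encode("Hello, world!")
assert tok.decode(ids) == "Hello, world!"

try:
    tok.decode([10**9])  # not a valid token ID
except VocabularyError as e:
    print(e)
```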
save(file_prefix)
Save the trained tokenizer to disk. Creates two files:
- `<file_prefix>.model` -- Machine-readable merge mappings (used by `load()` / `from_pretrained()`).
- `<file_prefix>.vocab` -- Human-readable token representations.
Parameters:
- file_prefix (`str`) -- Path prefix for the output files.
```python
tokenizer.save("models/my_tok")
# creates models/my_tok.model and models/my_tok.vocab
```
load(model_filename)
Load tokenizer state from a .model file. Restores merges, special tokens, and rebuilds the vocabulary.
- model_filename (`str`) -- Path to the `.model` file.
- Raises: `ModelLoadError` on missing file, wrong extension, version mismatch, or type mismatch.
```python
tokenizer = bytetok.RegexTokenizer()
tokenizer.load("models/my_tok.model")
```
BasicTokenizer()
Tokenizer that operates directly on raw byte sequences without any regex splitting. Does not support special token strategies.
```python
tok = bytetok.BasicTokenizer()
tok.train("Hello world", vocab_size=300)
tokens = tok.encode("Hello")
text = tok.decode(tokens)
```
All methods are inherited from Tokenizer. The strategy parameter on encode() is accepted but ignored.
It is recommended not to use this class. Use RegexTokenizer instead.
RegexTokenizer(pattern=None)
Tokenizer that splits text with a regex pattern before applying BPE. Supports special token registration and strategies.
- pattern (`str | None`) -- Regex pattern for text splitting. Defaults to the `gpt4o` pattern when `None`.
```python
tok = bytetok.RegexTokenizer()                   # default gpt4o pattern
tok = bytetok.RegexTokenizer(pattern=r"\w+|\S")  # custom pattern
```
In addition to the methods inherited from Tokenizer, RegexTokenizer provides:
register_special_tokens(special_toks)
Register special tokens with auto-assigned IDs. Must be called after training. Token IDs are assigned sequentially starting from the current vocabulary size.
- special_toks (`list[str]`) -- Special token strings to register.
- Raises: `SpecialTokenError` if the tokenizer has not been trained yet.
```python
tok.train(text, vocab_size=1000)
tok.register_special_tokens(["<|endoftext|>", "<|pad|>", "<|start|>"])

# encode with special token awareness
strategy = bytetok.get_strategy("all")
tokens = tok.encode("Hello<|endoftext|>", strategy=strategy)
text = tok.decode(tokens)
```
TokenPattern
TokenPattern is a str enum containing pre-defined regex patterns sourced from popular tokenizer implementations.
TokenPattern.get(name)
Look up a pattern by name (case-insensitive).
- name (`str`) -- Pattern name.
- Returns: `str` -- The regex pattern string.
- Raises: `PatternError` if the name is unknown.
```python
pattern = bytetok.TokenPattern.get("gpt4o")
```
Available Patterns
| Name | Source |
|---|---|
| `gpt2` | OpenAI GPT-2 |
| `gpt4` | OpenAI GPT-4 |
| `gpt4o` | OpenAI GPT-4o |
| `llama3` | Meta LLaMA 3 |
| `qwen2` | Alibaba Qwen 2 |
| `deepseek-coder` | DeepSeek Coder |
| `deepseek-llm` | DeepSeek LLM |
| `starcoder` | BigCode StarCoder |
| `falcon` | TII Falcon |
| `bloom` | BigScience BLOOM |
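To see what each preset looks like, combine `list_patterns()` with `get_pattern()` (lookup is case-insensitive, so the enum-style names returned by `list_patterns()` work directly):

```python
for name in bytetok.list_patterns():
    # print a truncated preview of each pattern string
    print(f"{name}: {bytetok.get_pattern(name)[:60]}...")
```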
Special Token Strategies
Strategies control how special tokens are recognised during encode(). Pass a strategy instance as the strategy parameter.
SpecialTokenStrategy (abstract base class)
Base class. Subclass this to implement custom strategies.
handle(text, special_toks)
- text (`str`) -- The text being encoded.
- special_toks (`dict[str, int]`) -- All registered special tokens.
- Returns: `dict[str, int]` -- The subset of special tokens to apply.
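As a sketch of what a custom subclass might look like (a hypothetical strategy; this assumes `SpecialTokenStrategy` is importable from the top-level package, as the built-in strategies are):

```python
class OnlyPresentStrategy(bytetok.SpecialTokenStrategy):
    """Hypothetical strategy: allow only the special tokens that
    actually occur in the text being encoded."""

    def handle(self, text: str, special_toks: dict[str, int]) -> dict[str, int]:
        return {tok: tid for tok, tid in special_toks.items() if tok in text}

tokens = tok.encode("Hello<|endoftext|>", strategy=OnlyPresentStrategy())
```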
AllowAllStrategy
Allows all registered special tokens to be recognised during encoding.
AllowNoneStrategy
Silently ignores all special tokens. They are treated as regular text.
AllowNoneRaiseStrategy
Raises SpecialTokenError if any registered special token is found in the input text.
AllowCustomStrategy(allowed_subset)
Allows only a specified subset of special tokens.
- allowed_subset (`set[str]`) -- The special token strings to allow.
```python
# via factory (recommended)
strategy = bytetok.get_strategy("custom", allowed_subset={"<|endoftext|>"})

# or instantiate directly
strategy = bytetok.AllowCustomStrategy({"<|endoftext|>"})
```
Exceptions
All exceptions inherit from ByteTokError (importable from bytetok.errors).
| Exception | Raised when |
|---|---|
| `ByteTokError` | Base exception for all bytetok errors. |
| `VocabularyError` | `vocab_size <= 256` during training, or unknown token ID during decode. |
| `TrainingError` | Training input is empty or too short. |
| `ModelLoadError` | Loading a `.model` file fails (missing, wrong format, version mismatch). |
| `PatternError` | A regex pattern fails to compile. |
| `SpecialTokenError` | Special token handling fails (e.g. `AllowNoneRaiseStrategy` finds one). |
| `StrategyError` | Unknown strategy name or missing `allowed_subset` for custom strategy. |
| `TokenizationError` | General tokenization failure. |
```python
from bytetok.errors import ModelLoadError

try:
    tok = bytetok.from_pretrained("missing.model")
except ModelLoadError as e:
    print(e)
```
Model File Format
save() produces two files:
`.model` -- Machine-readable format used by `load()` and `from_pretrained()`:
```
ByteTok 0.1.0
type regex
re <pattern>
---
<n_special_tokens>
<special_token_string> <token_id>
...
---
<tok_a> <tok_b> <merged_tok>
...
```
`.vocab` -- Human-readable vocabulary for inspection:
```
ST [256] <|endoftext|>
[0] \u0000
...
[258] [he][llo] -> hello
```
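Given the documented layout, you can sanity-check a saved model's header by hand. A minimal sketch; `my_tokenizer.model` stands in for whatever path you passed to `save()`:

```python
with open("my_tokenizer.model", encoding="utf-8") as f:
    magic, version = f.readline().split()  # "ByteTok" and the format version
    _, tok_type = f.readline().split()     # tokenizer type, e.g. "regex"

print(magic, version, tok_type)
```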
Acknowledgment
ByteTok is inspired by Andrej Karpathy's minbpe. A walkthrough of the minbpe repository is available on his YouTube channel.
License
This project is licensed under the MIT License. See the LICENSE file for details.