Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.
fast-bpe-rs
A high-performance Byte Pair Encoding (BPE) tokenizer written in Rust, with Python bindings.
Why this exists
BPE is at the heart of every major LLM today — GPT, LLaMA, Mistral, and friends all use it to convert raw text into the token sequences the model actually sees. Getting tokenizer training right, and fast, matters.
The standard Python BPE implementations are correct but slow — training on large corpora becomes a real bottleneck. Existing Rust ports are faster by virtue of the language, but most carry over the same naïve O(n·V) algorithm. This project starts from Rust and rethinks the algorithm itself, using a doubly-linked list to represent token chains and a frequency-indexed BTreeMap to find the next best merge in O(log V) instead of a full scan.
Algorithm improvements
| Phase | Naïve BPE | fast-bpe-rs |
|---|---|---|
| Per-merge rescan | O(n) | O(kᵢ): only occurrences of the merged pair |
| Max-pair lookup | O(V) | O(log V): first entry of the frequency-ordered BTreeMap |
| Merge application | O(n) | O(kᵢ): in-place linked-list edits |
| Total training | O(n · V) | O(Σ kᵢ · log V) ≈ O(n log V) |
Where n is corpus size, V is vocabulary size, and kᵢ is the number of occurrences of the pair merged at step i.
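The near-linear total can be sketched as follows. This is a back-of-the-envelope argument, not a proof taken from the project: it assumes n counts the initial tokens (bytes), that step i replaces kᵢ non-overlapping occurrences, and that each replacement triggers a constant number of ordered-map updates. The symbols nᵢ (sequence length before step i) and P (number of distinct pairs held in the map) are introduced here for illustration.

```latex
\begin{align*}
n_{i+1} &= n_i - k_i, \qquad n_0 = n
  && \text{each merged occurrence removes one token} \\
\sum_i k_i &= n_0 - n_{\text{final}} \le n - 1
  && \text{the sequence cannot shrink below one token} \\
P &\le V^2 \;\Longrightarrow\; \log P \le 2\log V
  && \text{pairs are drawn from at most } V \text{ token types} \\
\text{total} &= O\Big(\sum_i k_i \cdot \log P\Big) = O(n \log V)
\end{align*}
```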
The key insight: after each merge, only the immediate neighbours of every affected position change. Instead of rescanning the whole corpus, the linked-list structure lets us jump directly to those positions and update counts locally. The BTreeMap keeps pairs ordered by frequency so the next best merge is always at the front.
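The idea can be sketched in a few lines of Python. This is an illustrative toy, not the project's Rust implementation: it stores the linked list as index arrays, and because Python has no built-in ordered map it swaps in a `heapq` with lazy deletion where the text describes a BTreeMap. The function name `train_incremental` and all helper names are hypothetical, and ties between equally frequent pairs are broken by the smallest pair, which may differ from the real library.

```python
import heapq
from collections import defaultdict


def train_incremental(ids, num_merges, first_new_id=256):
    """Learn up to num_merges merges; return (merges, final token list)."""
    n = len(ids)
    if n == 0:
        return [], []
    tok = list(ids)                  # token held by each node; None once unlinked
    prev = list(range(-1, n - 1))    # left neighbour index, -1 at the head
    nxt = list(range(1, n + 1))      # right neighbour index, -1 at the tail
    nxt[-1] = -1

    positions = defaultdict(set)     # pair -> node indices where that pair starts
    for i in range(n - 1):
        positions[(tok[i], tok[i + 1])].add(i)
    # Max-heap via negated counts; stale entries are skipped when popped.
    heap = [(-len(nodes), pair) for pair, nodes in positions.items()]
    heapq.heapify(heap)

    def pair_at(i):
        j = nxt[i]
        return None if j == -1 else (tok[i], tok[j])

    def register(i):
        pair = pair_at(i)
        if pair is not None:
            positions[pair].add(i)
            heapq.heappush(heap, (-len(positions[pair]), pair))

    def unregister(i):
        pair = pair_at(i)
        if pair is not None:
            positions[pair].discard(i)
            if positions[pair]:
                heapq.heappush(heap, (-len(positions[pair]), pair))

    merges = []
    while heap and len(merges) < num_merges:
        neg_count, pair = heapq.heappop(heap)
        if -neg_count != len(positions[pair]):
            continue                 # stale: the count changed after this push
        new_id = first_new_id + len(merges)
        for i in sorted(positions[pair]):       # snapshot, left to right
            if tok[i] is None or pair_at(i) != pair:
                continue             # consumed by an overlapping match ("aaa")
            j, p = nxt[i], prev[i]
            for node in (p, i, j):   # drop the pairs touching this occurrence
                if node != -1:
                    unregister(node)
            tok[i], tok[j] = new_id, None       # node i now holds the merged token
            nxt[i] = nxt[j]                     # unlink node j
            if nxt[j] != -1:
                prev[nxt[j]] = i
            if p != -1:
                register(p)          # new pair (left neighbour, new_id)
            register(i)              # new pair (new_id, right neighbour)
        merges.append((pair, new_id))

    out, i = [], 0                   # node 0 always remains the head
    while i != -1:
        out.append(tok[i])
        i = nxt[i]
    return merges, out


merges, out = train_incremental(list(b"aaabdaaabac"), num_merges=3)
print(merges)  # [((97, 97), 256), ((97, 98), 257), ((256, 257), 258)]
print(out)     # [258, 100, 258, 97, 99]
```

Only the nodes adjacent to a merged occurrence are touched, so the work per merge scales with the number of occurrences rather than with the corpus length. The `sorted` call adds a small extra factor that a production implementation would likely avoid.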
Results
Benchmarks below use a 5 MB corpus and compare fast-bpe-rs against minbpe and rustbpe. They are intended to show relative behavior rather than serve as a hardware-independent standard.
Training (vocab size = 4,096)
| System | Time (s) | Throughput (MB/s) | Peak RAM (MB) | Speedup vs. minbpe BasicTokenizer |
|---|---|---|---|---|
| minbpe BasicTokenizer | 447.3 | 0.011 | 418 | 1.0× |
| minbpe RegexTokenizer | 583.1 | 0.009 | 521 | 0.77× |
| rustbpe | 25.4 | 0.197 | 63 | 17.6× |
| fast-bpe-rs | 6.0 | 0.83 | 48 | 74.5× |
Even on a single thread, fast-bpe-rs is about 4.2× faster than rustbpe in this setup, largely because incremental updates avoid repeating most of the pair-counting work after each merge.
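The derived columns follow directly from the measured times. As a quick check, assuming throughput is simply corpus size divided by wall-clock time and speedup is the ratio of times (the helper names below are illustrative):

```python
CORPUS_MB = 5.0

# Training times in seconds, copied from the table above.
train_seconds = {
    "minbpe BasicTokenizer": 447.3,
    "minbpe RegexTokenizer": 583.1,
    "rustbpe": 25.4,
    "fast-bpe-rs": 6.0,
}


def throughput(seconds, size_mb=CORPUS_MB):
    """MB processed per second."""
    return size_mb / seconds


def speedup(baseline_seconds, seconds):
    """How many times faster than the baseline."""
    return baseline_seconds / seconds


baseline = train_seconds["minbpe BasicTokenizer"]
for name, secs in train_seconds.items():
    print(f"{name:<22} {throughput(secs):6.3f} MB/s  {speedup(baseline, secs):6.2f}x")

# Ratio behind the "about 4.2x faster than rustbpe" statement.
print(f"{speedup(train_seconds['rustbpe'], train_seconds['fast-bpe-rs']):.2f}x")
```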
Encoding
Encoding applies the learned merge rules to unseen text from the same 5 MB corpus.
| System | Time (s) | Throughput (MB/s) | Peak RAM (MB) |
|---|---|---|---|
| minbpe BasicTokenizer | 1.47 | 3.40 | 52 |
| minbpe RegexTokenizer | 1.82 | 2.75 | 67 |
| rustbpe | 0.178 | 28.1 | 24 |
| fast-bpe-rs | 0.120 | 41.7 | 19 |
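For readers unfamiliar with what "applying the learned merge rules" involves, here is a minimal reference sketch in the style of minbpe's basic tokenizer. It is not how fast-bpe-rs encodes internally (that is not described here), and the `encode` function and the `merges` mapping are illustrative names: `merges` maps a token pair to its new token id, with lower ids meaning earlier merges.

```python
def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Greedy BPE encoding: repeatedly apply the earliest-learned merge present."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        # The pair with the lowest merged id was learned first, so it goes first.
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # nothing left to merge
        new_id = merges[pair]
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


merges = {(97, 97): 256, (97, 98): 257, (256, 257): 258}
print(encode("aaabdaaabac", merges))  # [258, 100, 258, 97, 99]
```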
Decoding
Decoding is dominated by token-to-bytes lookup, so the gap is smaller but still measurable.
| System | Time (s) | Throughput (MB/s) | Peak RAM (MB) |
|---|---|---|---|
| minbpe BasicTokenizer | 0.391 | 12.8 | 38 |
| minbpe RegexTokenizer | 0.387 | 12.9 | 41 |
| rustbpe | 0.057 | 87.3 | 16 |
| fast-bpe-rs | 0.053 | 94.2 | 14 |
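The token-to-bytes lookup mentioned above can be illustrated with a short sketch. It assumes a byte-level vocabulary in which ids 0 to 255 are raw bytes and each merged id expands to the concatenation of its two parts. `build_vocab` and `decode` are hypothetical helpers, not part of the fast-bpe-rs API.

```python
def build_vocab(merges: dict[tuple[int, int], int]) -> dict[int, bytes]:
    """Expand every token id to the byte string it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    # Process merges in the order they were learned so both parts already exist.
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def decode(ids: list[int], vocab: dict[int, bytes]) -> str:
    """One dictionary lookup per token, then a single UTF-8 decode."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges = {(97, 97): 256, (97, 98): 257, (256, 257): 258}
vocab = build_vocab(merges)
print(decode([258, 100, 258, 97, 99], vocab))  # aaabdaaabac
```

Because the per-token work is a single lookup in every implementation, there is less algorithmic headroom here than in training.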
Training throughput vs. vocabulary size
As vocabulary size grows, the benefit of incremental updates becomes more pronounced: the naïve training cost grows roughly linearly with the number of merges, while fast-bpe-rs only updates the neighborhoods touched by each merge.
| Vocab size | minbpe Regex (MB/s) | rustbpe (MB/s) | fast-bpe-rs (MB/s) | fast-bpe-rs speedup vs. minbpe Regex |
|---|---|---|---|---|
| 1,024 | 0.038 | 0.47 | 1.62 | 43× |
| 2,048 | 0.018 | 0.28 | 1.12 | 62× |
| 4,096 | 0.009 | 0.197 | 0.83 | 92× |
| 8,192 | 0.004 | 0.11 | 0.61 | 153× |
In other words, the advantage widens as the merge schedule gets longer, which matches the asymptotic behavior described above.
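A small calculation on the table values makes the trend concrete. The sketch below (helper names are illustrative) computes how much throughput drops each time the vocabulary doubles, plus the resulting speedup column. The published speedups may have been derived from unrounded measurements, so the last digit can differ slightly.

```python
# (vocab size, minbpe Regex MB/s, rustbpe MB/s, fast-bpe-rs MB/s), from the table above.
ROWS = [
    (1024, 0.038, 0.47, 1.62),
    (2048, 0.018, 0.28, 1.12),
    (4096, 0.009, 0.197, 0.83),
    (8192, 0.004, 0.11, 0.61),
]


def drop_per_doubling(column: int) -> list[float]:
    """Throughput ratio between consecutive rows (each row doubles the vocab)."""
    values = [row[column] for row in ROWS]
    return [a / b for a, b in zip(values, values[1:])]


def speedup_vs_minbpe() -> list[float]:
    """fast-bpe-rs throughput divided by minbpe Regex throughput, per row."""
    return [row[3] / row[1] for row in ROWS]


print([round(r, 2) for r in drop_per_doubling(1)])  # minbpe Regex
print([round(r, 2) for r in drop_per_doubling(3)])  # fast-bpe-rs
print([round(s, 1) for s in speedup_vs_minbpe()])
```

In these numbers minbpe loses roughly half its throughput per doubling, which is what a cost linear in the number of merges predicts, while fast-bpe-rs loses noticeably less.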
Quick start
Installation
pip install fast-bpe-rs
If no prebuilt wheel exists for your platform, pip will compile from source — you'll need a recent Rust toolchain installed.
Train
from fast_bpe_rs import BPE
# The argument is a regex pattern used to pre-split text into chunks.
# r"(?s).+" treats the whole input as one chunk (simplest case).
bpe = BPE(r"(?s).+")
# Learn 258 merges on the given corpus
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])
A GPT-style split pattern for real corpora:
bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)
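To get a feel for what a split pattern does, the sketch below mimics pre-splitting with Python's standard `re` module. It assumes the tokenizer collects successive regex matches as chunks and never merges across chunk boundaries, which is how GPT-style tokenizers commonly work. The `pre_split` helper and the simplified word pattern are illustrative only. The stdlib `re` module does not support `\p{L}`, so the full GPT-style pattern above cannot be used with it.

```python
import re


def pre_split(pattern: str, text: str) -> list[str]:
    """Return the chunks a split pattern would produce (successive matches)."""
    return re.findall(pattern, text)


# The whole-input pattern yields a single chunk, newlines included.
print(pre_split(r"(?s).+", "low lower\nnewest"))
# ['low lower\nnewest']

# A simplified, ASCII-friendly stand-in for a GPT-style pattern.
SIMPLE = r" ?\w+| ?[^\w\s]+|\s+"
print(pre_split(SIMPLE, "low lower, newest"))
# ['low', ' lower', ',', ' newest']
```

With the second pattern, a merge can never span a word boundary or glue punctuation onto a word, which tends to produce more reusable tokens on natural-language text.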
Encode
ids = bpe.encode("low lower newest")
print(ids) # e.g. [260, 262, 259, 261, ...]
Decode
text = bpe.decode_to_string(ids)
print(text) # "low lower newest"
License