Skip to main content

Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

Project description

fast-bpe-rs

A high-performance Byte Pair Encoding (BPE) tokenizer written in Rust, with Python bindings.

Why this exists

BPE is at the heart of every major LLM today — GPT, LLaMA, Mistral, and friends all use it to convert raw text into the token sequences the model actually sees. Getting tokenizer training right, and fast, matters.

The standard Python BPE implementations are correct but slow — training on large corpora becomes a real bottleneck. Existing Rust ports are faster by virtue of the language, but most carry over the same naïve O(n·V) algorithm. This project starts from Rust and rethinks the algorithm itself, using a doubly-linked list to represent token chains and a frequency-indexed BTreeMap to find the next best merge in O(log V) instead of a full scan.

Algorithm improvements

Phase Naïve BPE fast-bpe-rs
Per-merge rescan O(n) O(kᵢ) — only occurrences of merged pair
Max-pair lookup O(V) O(log V) — BTreeMap min
Merge application O(n) O(kᵢ) — in-place linked-list edits
Total training O(n · V) O(Σ kᵢ · log V) ≈ O(n log V)

Where n is corpus size, V is vocabulary size, and kᵢ is the number of occurrences of the pair merged at step i.

The key insight: after each merge, only the immediate neighbours of every affected position change. Instead of rescanning the whole corpus, the linked-list structure lets us jump directly to those positions and update counts locally. The BTreeMap keeps pairs ordered by frequency so the next best merge is always at the front.

Results

Benchmarks below use a 5 MB corpus and compare fast-bpe-rs against minbpe and rustbpe. They are intended to show relative behavior rather than serve as a hardware-independent standard.

Training (vocab size = 4,096)

System Time (s) Throughput (MB/s) Peak RAM (MB) Speedup vs. minbpe BasicTokenizer
minbpe BasicTokenizer 447.3 0.011 418 1.0×
minbpe RegexTokenizer 583.1 0.009 521 0.77×
rustbpe 25.4 0.197 63 17.6×
fast-bpe-rs 6.0 0.83 48 74.5×

Even on a single thread, fast-bpe-rs is about 4.2× faster than rustbpe in this setup, largely because incremental updates avoid repeating most of the pair-counting work after each merge.

Encoding

Encoding applies the learned merge rules to unseen text from the same 5 MB corpus.

System Time (s) Throughput (MB/s) Peak RAM (MB)
minbpe BasicTokenizer 1.47 3.40 52
minbpe RegexTokenizer 1.82 2.75 67
rustbpe 0.178 28.1 24
fast-bpe-rs 0.120 41.7 19

Decoding

Decoding is mostly dominated by token-to-bytes lookup, so the gap is smaller but still measurable.

System Time (s) Throughput (MB/s) Peak RAM (MB)
minbpe BasicTokenizer 0.391 12.8 38
minbpe RegexTokenizer 0.387 12.9 41
rustbpe 0.057 87.3 16
fast-bpe-rs 0.053 94.2 14

Training throughput vs. vocabulary size

As vocabulary size grows, the benefit of incremental updates becomes more pronounced: the naïve training cost grows roughly linearly with the number of merges, while fast-bpe-rs only updates the neighborhoods touched by each merge.

Vocab size minbpe Regex (MB/s) rustbpe (MB/s) fast-bpe-rs (MB/s) fast-bpe-rs speedup vs. minbpe Regex
1,024 0.038 0.47 1.62 43×
2,048 0.018 0.28 1.12 62×
4,096 0.009 0.197 0.83 92×
8,192 0.004 0.11 0.61 153×

In other words, the advantage widens as the merge schedule gets longer, which matches the asymptotic behavior described above.

Quick start

Installation

pip install fast-bpe-rs

If no prebuilt wheel exists for your platform, pip will compile from source — you'll need a recent Rust toolchain installed.

Train

from fast_bpe_rs import BPE

# The argument is a regex pattern used to pre-split text into chunks.
# r"(?s).+" treats the whole input as one chunk (simplest case).
bpe = BPE(r"(?s).+")

# Learn 258 merges on the given corpus
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

A GPT-style split pattern for real corpora:

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)

Encode

ids = bpe.encode("low lower newest")
print(ids)  # e.g. [260, 262, 259, 261, ...]

Decode

text = bpe.decode_to_string(ids)
print(text)  # "low lower newest"

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fast_bpe_rs-0.1.2.tar.gz (82.6 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

fast_bpe_rs-0.1.2-cp310-abi3-win_amd64.whl (909.3 kB view details)

Uploaded CPython 3.10+Windows x86-64

fast_bpe_rs-0.1.2-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.34+ x86-64

fast_bpe_rs-0.1.2-cp310-abi3-macosx_11_0_arm64.whl (898.0 kB view details)

Uploaded CPython 3.10+macOS 11.0+ ARM64

File details

Details for the file fast_bpe_rs-0.1.2.tar.gz.

File metadata

  • Download URL: fast_bpe_rs-0.1.2.tar.gz
  • Upload date:
  • Size: 82.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.1.2.tar.gz
Algorithm Hash digest
SHA256 87f1e3734570c9ddd7130c904aca3a52ce1fbe79fd2265d05bf8fc025bbfa9c2
MD5 a681f329fc52ba5d309e52e89d162f0f
BLAKE2b-256 1dfd316f39e2fe99284290d1c40e2775fd8116e813812da63300d243fa92e3a0

See more details on using hashes here.

Provenance

The following attestation bundles were made for fast_bpe_rs-0.1.2.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_bpe_rs-0.1.2-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: fast_bpe_rs-0.1.2-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 909.3 kB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.1.2-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 0906a9961c075b7858b6200280e4a269eabf11e1ff3db946c93d73d1876b2256
MD5 0ab1c0aaf4e35d4b61ccb8113cf76ce9
BLAKE2b-256 d3c8eb06be3f50d770a7cae0634418473d1d9866cad61d2517d245eb8e2ddc1e

See more details on using hashes here.

Provenance

The following attestation bundles were made for fast_bpe_rs-0.1.2-cp310-abi3-win_amd64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_bpe_rs-0.1.2-cp310-abi3-manylinux_2_34_x86_64.whl.

File metadata

File hashes

Hashes for fast_bpe_rs-0.1.2-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 c47583ed1dc6b7243170c7f6d4429e777d59752d795d21a8e241e6e4370c7a3b
MD5 437a3cc0cfd4c78dbaedf5982c76aba2
BLAKE2b-256 67c8d6ed57c929acb27f8afa565be495f86f7219ba040cb400e7df9db3e66e92

See more details on using hashes here.

Provenance

The following attestation bundles were made for fast_bpe_rs-0.1.2-cp310-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_bpe_rs-0.1.2-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for fast_bpe_rs-0.1.2-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4ba39e806745ddb3280ca6e5436a80f4c29f326e4584357a71da68bc663f0c90
MD5 2d9e9ff0cf189acf69ec875dd5565a09
BLAKE2b-256 bb93721b2b694e3177d94a1f081cbad7913a93283f5f4e2a765e9a0fccf9d5b3

See more details on using hashes here.

Provenance

The following attestation bundles were made for fast_bpe_rs-0.1.2-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page