Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs

A blazing-fast Byte Pair Encoding tokenizer — written in Rust, usable from Python.


What makes it fast?

Naïve BPE implementations rescan the entire corpus after every merge. fast-bpe-rs doesn't.

It represents each token sequence as a doubly-linked list and keeps pair frequencies in a frequency-indexed BTreeMap, which it queries for the next best merge. After each merge, only the immediate neighbours of the merged positions are updated, so the rest of the corpus is left untouched.

| Phase | Naïve BPE | fast-bpe-rs |
|---|---|---|
| Per-merge rescan | O(n) | O(kᵢ), only affected positions |
| Max-pair lookup | O(V) | O(log V), BTreeMap |
| Total training | O(n · V) | O(n log V) |
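
The bookkeeping described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the crate's actual code: it trains on the raw UTF-8 bytes of a single string (no regex pre-splitting), stores the linked list as parallel `prev`/`nxt` arrays, and swaps the BTreeMap for a `heapq` max-heap with lazy invalidation. The names `train_bpe`, `add`, and `remove` are made up for this example, and ties between equally frequent pairs go to the smallest pair tuple, which may differ from what fast-bpe-rs does.

```python
import heapq
from collections import defaultdict


def train_bpe(text, num_merges):
    """Learn up to `num_merges` byte-level merges; return (merges, final ids)."""
    ids = list(text.encode("utf-8"))
    n = len(ids)
    # Doubly-linked list as parallel arrays; -1 means "no neighbour".
    prev = list(range(-1, n - 1))
    nxt = list(range(1, n)) + [-1] if n else []
    alive = [True] * n
    counts = defaultdict(int)  # pair -> current frequency
    where = defaultdict(set)   # pair -> positions where the pair starts
    heap = []                  # (-count, pair); stale entries are skipped on pop

    def add(i):
        """Register the pair starting at node i, if it has a right neighbour."""
        if i == -1 or nxt[i] == -1:
            return
        pair = (ids[i], ids[nxt[i]])
        counts[pair] += 1
        where[pair].add(i)
        heapq.heappush(heap, (-counts[pair], pair))

    def remove(i):
        """Unregister the pair starting at node i, if it has a right neighbour."""
        if i == -1 or nxt[i] == -1:
            return
        pair = (ids[i], ids[nxt[i]])
        counts[pair] -= 1
        where[pair].discard(i)

    for i in range(n):
        add(i)

    merges = {}
    while heap and len(merges) < num_merges:
        neg, pair = heapq.heappop(heap)
        if counts[pair] != -neg:  # stale entry: re-queue with the true count
            if counts[pair] > 0:
                heapq.heappush(heap, (-counts[pair], pair))
            continue
        new_id = 256 + len(merges)
        merges[pair] = new_id
        for i in sorted(where[pair]):  # copy: the set shrinks while we merge
            j = nxt[i] if alive[i] else -1
            if j == -1 or (ids[i], ids[j]) != pair:
                continue  # consumed by an overlapping merge, e.g. "aaa"
            p = prev[i]
            # Only three pairs change: (p, i), (i, j) and (j, next of j).
            remove(p)
            remove(i)
            remove(j)
            ids[i] = new_id
            alive[j] = False
            nxt[i] = nxt[j]
            if nxt[j] != -1:
                prev[nxt[j]] = i
            add(p)
            add(i)

    # Walk the list from the head to collect the merged sequence.
    out, i = [], (0 if n else -1)
    while i != -1:
        out.append(ids[i])
        i = nxt[i]
    return merges, out


merges, final_ids = train_bpe("aaabdaaabac", 3)
print(merges)     # {(97, 97): 256, (97, 98): 257, (256, 257): 258}
print(final_ids)  # [258, 100, 258, 97, 99]
```

Each merge touches only the occurrences of the chosen pair and their direct neighbours, which is the O(kᵢ) term in the table; the heap stands in for the O(log V) max-pair lookup.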

Benchmarks

Tested on a 5 MB corpus. The numbers are meant to show how the implementations compare with each other, not absolute performance on any particular hardware.

Training (vocab size = 4,096)

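| Implementation | Time (s) | Throughput (MB/s) | Peak RAM (MB) | Speedup vs. minbpe BasicTokenizer |
|---|---|---|---|---|
| minbpe BasicTokenizer | 447.3 | 0.011 | 418 | 1.00× (baseline) |
| minbpe RegexTokenizer | 583.1 | 0.009 | 521 | 0.77× |
| rustbpe | 25.4 | 0.197 | 63 | 17.6× |
| fast-bpe-rs | 6.0 | 0.83 | 48 | 74.5× |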
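
The throughput and speedup columns follow directly from the training times. The short check below recomputes them, assuming the corpus is exactly 5.0 MB and that the speedup baseline is minbpe BasicTokenizer (both are read off the table, not stated separately in the source).

```python
CORPUS_MB = 5.0  # assumption: the "5 MB corpus" is exactly 5.0 MB

# Training times in seconds, copied from the table above.
train_seconds = {
    "minbpe BasicTokenizer": 447.3,
    "minbpe RegexTokenizer": 583.1,
    "rustbpe": 25.4,
    "fast-bpe-rs": 6.0,
}

baseline = train_seconds["minbpe BasicTokenizer"]

# name -> (throughput in MB/s, speedup relative to the baseline)
derived = {
    name: (CORPUS_MB / seconds, baseline / seconds)
    for name, seconds in train_seconds.items()
}

for name, (mb_per_s, speedup) in derived.items():
    print(f"{name:<24}{mb_per_s:>8.3f} MB/s{speedup:>8.2f}x")
```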

Encoding & Decoding

| Implementation | Encode (MB/s) | Decode (MB/s) |
|---|---|---|
| minbpe BasicTokenizer | 3.40 | 12.8 |
| rustbpe | 28.1 | 87.3 |
| fast-bpe-rs | 41.7 | 94.2 |

The advantage grows with vocabulary size

| Vocab size | fast-bpe-rs speedup vs. minbpe RegexTokenizer |
|---|---|
| 1,024 | 43× |
| 2,048 | 62× |
| 4,096 | 92× |
| 8,192 | 153× |

The longer the merge schedule, the more work is skipped — which matches the algorithm's design.


Quick start

Install

pip install fast-bpe-rs

No prebuilt wheel for your platform? pip will compile from source. You'll need a Rust toolchain installed first.

Train

from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")  # regex pattern for pre-splitting text
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

For real corpora, use a GPT-style split pattern:

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)
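
To see what pre-splitting does, the sketch below cuts a sentence into chunks with Python's standard `re` module. Because `re` does not support `\p{L}` or `\p{N}`, it uses a simplified ASCII-only stand-in pattern (`SIMPLE_SPLIT`, a name made up here), not the pattern shown above. In GPT-style tokenizers, merges are learned within such chunks and never across their boundaries.

```python
import re

# Simplified, ASCII-only stand-in for a GPT-style split pattern (an assumption
# for illustration; it is NOT the pattern passed to BPE above).
SIMPLE_SPLIT = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[A-Za-z]+"            # a word, with an optional leading space
    r"| ?[0-9]{1,3}"           # digits in groups of at most three
    r"| ?[^\sA-Za-z0-9]+"      # punctuation runs
    r"|\s+(?!\S)|\s+"          # whitespace
)

text = "Hello world, it's 2024!"
chunks = SIMPLE_SPLIT.findall(text)
print(chunks)
# ['Hello', ' world', ',', " it", "'s", ' 202', '4', '!']

# The chunks partition the input: joining them restores the original text.
print("".join(chunks) == text)  # True
```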

Encode & Decode

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
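
For intuition about what encoding and decoding do, here is a minimal pure-Python sketch of byte-level BPE applied with a fixed merge table. It is not the library's implementation: `encode_with_merges`, `decode_ids`, and the two-entry `merges` table are hypothetical, and pre-splitting is left out for brevity.

```python
def encode_with_merges(text, merges):
    """Apply learned merges, earliest-learned (lowest new id) first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        pair = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no adjacent pair can be merged any further
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode_ids(ids, merges):
    """Expand every id back to bytes, then decode the bytes as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


# Hypothetical merge table: "l"+"o" -> 256, then (256)+"w" -> 257.
merges = {(108, 111): 256, (256, 119): 257}

sample = "low lower newest"
sample_ids = encode_with_merges(sample, merges)
print(sample_ids)
# [257, 32, 257, 101, 114, 32, 110, 101, 119, 101, 115, 116]
print(decode_ids(sample_ids, merges) == sample)  # True
```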

How should I write my commits?

This project uses Release Please to prepare releases, so commit messages should follow the Conventional Commits format. That helps the release automation determine version bumps and generate changelog entries correctly.

Examples:

  • feat: add support for custom regex patterns
  • fix: handle invalid split regex in python bindings
  • docs: clarify installation requirements
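
A rough way to check a commit message locally is sketched below. It follows the SemVer mapping in the Conventional Commits specification (`fix` is a patch, `feat` is a minor, a breaking change is a major); Release Please's actual bump also depends on its configuration, so treat `HEADER` and `classify` as hypothetical helpers rather than a description of the tool.

```python
import re

# Header shape: type, optional (scope), optional "!", then ": " and a subject.
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<subject>\S.*)$"
)


def classify(message):
    """Return 'major', 'minor', 'patch' or 'none'; None if the header is malformed."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        return None
    footer_breaking = any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:"))
        for line in lines[1:]
    )
    if match["breaking"] or footer_breaking:
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match["type"], "none")


print(classify("feat: add support for custom regex patterns"))         # minor
print(classify("fix: handle invalid split regex in python bindings"))  # patch
print(classify("docs: clarify installation requirements"))             # none
print(classify("Added some stuff"))                                    # None
```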

License

Apache 2.0
