Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.
fast-bpe-rs: A Fast Rust BPE Library
A blazing-fast Rust Byte Pair Encoding (BPE) tokenizer with Python bindings for training, encoding, and decoding BPE models.
Naive vs. fast-bpe-rs
Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.
| Aspect | Naive BPE trainer | fast-bpe-rs |
|---|---|---|
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN) because each merge triggers another global count pass | O(N) setup, then per merge roughly O(k) local updates instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
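To make the right-hand column more concrete, here is a minimal Python sketch of two of the ideas in the table: storing each repeated chunk once with a frequency weight, and updating pair counts only for the chunks touched by a merge. It is not the fast-bpe-rs implementation. The names `train_sketch`, `merge_seq`, and `pairs_of` are hypothetical, the tie-breaking rule (lowest pair wins) is an assumption, and the sketch works at whole-chunk granularity with a linear scan for the best pair, whereas the table describes exact pair locations in a linked structure and count buckets.

```python
from collections import Counter, defaultdict


def pairs_of(seq):
    """Adjacent token pairs in a sequence, in order."""
    return list(zip(seq, seq[1:]))


def merge_seq(seq, pair, new_id):
    """Replace non-overlapping occurrences of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_sketch(chunks, num_merges):
    # Deduplicate: each distinct chunk is stored once with a frequency weight.
    weights = Counter(chunks)
    seqs = [list(chunk.encode("utf-8")) for chunk in weights]
    freqs = list(weights.values())

    # One O(N) setup pass: weighted pair counts plus a pair -> chunk index.
    counts = Counter()
    where = defaultdict(set)
    for idx, seq in enumerate(seqs):
        for p in pairs_of(seq):
            counts[p] += freqs[idx]
            where[p].add(idx)

    merges = []
    for step in range(num_merges):
        if not counts:
            break
        # Highest count wins; ties go to the smallest pair (an arbitrary choice).
        best = min(counts, key=lambda p: (-counts[p], p))
        new_id = 256 + step
        # Only chunks that contain `best` are revisited.
        for idx in list(where[best]):
            for p in pairs_of(seqs[idx]):
                counts[p] -= freqs[idx]
                where[p].discard(idx)
            seqs[idx] = merge_seq(seqs[idx], best, new_id)
            for p in pairs_of(seqs[idx]):
                counts[p] += freqs[idx]
                where[p].add(idx)
        counts = +counts  # drop pairs whose count fell to zero
        merges.append((best, new_id))
    return merges


words = ["low"] * 4 + ["lower"] * 2 + ["newest"] * 3
merges = train_sketch(words, 2)
```

With nine input chunks but only three distinct ones, the setup pass touches three sequences, and each merge revisits only the chunks indexed under the chosen pair.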
Setup
Install from PyPI
```shell
pip install fast-bpe-rs
```
Use
Small example
```python
from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])
ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
```
GPT-style split pattern
```python
from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
# corpus_lines is assumed to be a list of training strings loaded elsewhere.
bpe.train(32768, corpus_lines)
```
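To see what a split pattern of this shape does before any merges are learned, the sketch below applies an approximation of it with Python's standard `re` module. This is an illustration only: `re` does not support `\p{L}` or `\p{N}`, so letters are approximated with `[^\W\d_]` and numbers with `\d`, and the results can differ from the real pattern on some Unicode input. The names `APPROX_SPLIT` and `split_chunks` are hypothetical and not part of fast-bpe-rs.

```python
import re

LETTER = r"[^\W\d_]"           # stand-in for \p{L}
LEAD = r"(?:[^\r\n\w]|_)"      # stand-in for [^\r\n\p{L}\p{N}]
OTHER = r"(?:[^\s\w]|_)"       # stand-in for [^\s\p{L}\p{N}]

APPROX_SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"
    + "|" + LEAD + "?" + LETTER + "+"
    + r"|\d{1,3}"
    + "| ?" + OTHER + r"+[\r\n]*"
    + r"|\s*[\r\n]+|\s+(?!\S)|\s+"
)


def split_chunks(text):
    """Return the chunks that would each be tokenized independently."""
    return APPROX_SPLIT.findall(text)


chunks = split_chunks("Hello world 12345!")
# chunks == ['Hello', ' world', ' ', '123', '45', '!']
```

Merges never cross chunk boundaries, so the pattern determines which byte sequences can ever become a single token: here a word keeps its leading space, and digit runs are capped at three characters.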
Special tokens
```python
from fast_bpe_rs import BPE

bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)
bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")
```
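One common way for a tokenizer to honor special tokens is to cut the text at each special-token occurrence before ordinary BPE runs on the remaining pieces. The sketch below shows that idea in plain Python. It is an assumption about the general technique, not a description of how fast-bpe-rs handles special tokens internally, and `split_on_special` is a hypothetical helper.

```python
import re


def split_on_special(text, special_tokens):
    """Split text into (piece, token_id) pairs; token_id is None for ordinary text."""
    if not special_tokens:
        return [(text, None)] if text else []
    # Longest names first, so a longer special token wins over a shorter prefix.
    names = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(name) for name in names) + ")"
    pieces = re.split(pattern, text)
    return [(piece, special_tokens.get(piece)) for piece in pieces if piece]


pieces = split_on_special("a<pad><eos>a", {"<pad>": 600, "<eos>": 601})
# pieces == [('a', None), ('<pad>', 600), ('<eos>', 601), ('a', None)]
```

Under this scheme the pieces tagged `None` would go through the normal split pattern and merges, while the tagged pieces map directly to their reserved ids.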
API
- BPE(split_pattern, special_tokens=None)
- train(vocab_size, docs)
- encode(text) -> list[int]
- decode(token_ids) -> bytes
- decode_to_string(token_ids) -> str
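The split between a bytes-returning `decode` and a string-returning `decode_to_string` makes sense for a byte-level tokenizer, because a slice of token ids can end in the middle of a multi-byte UTF-8 character. The pure-Python sketch below illustrates that with a stand-in byte vocabulary. `VOCAB` and this `decode` function are hypothetical, and how fast-bpe-rs's `decode_to_string` treats invalid UTF-8 is not stated here.

```python
# Stand-in vocabulary: ids 0..255 map to single bytes, as in byte-level BPE.
VOCAB = {i: bytes([i]) for i in range(256)}


def decode(token_ids):
    """Concatenate the byte strings for each token id."""
    return b"".join(VOCAB[i] for i in token_ids)


ids = list("é".encode("utf-8"))      # two bytes, hence two base tokens
whole = decode(ids).decode("utf-8")  # the full sequence is valid UTF-8

partial = decode(ids[:1])            # only the first byte of the character
try:
    partial.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lossy decode substitutes U+FFFD for the incomplete byte sequence.
lossy = partial.decode("utf-8", errors="replace")
```

Working with the bytes form is the safer choice when decoding partial outputs, such as streamed tokens, and converting to text only once a complete sequence is available.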
License