Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

Project description

fast-bpe-rs

A blazing-fast Byte Pair Encoding tokenizer — written in Rust, usable from Python.


Quick start

Install

pip install fast-bpe-rs

No prebuilt wheel for your platform? pip will compile from source. You'll need a Rust toolchain installed first.

Train

from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")  # regex pattern for pre-splitting text; (?s).+ keeps each input string as one chunk
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])
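To make the training call more concrete, here is a minimal pure-Python sketch of byte-level BPE training. It is not the library's implementation. It assumes that the first argument to train is the target vocabulary size and that the base vocabulary is the 256 byte values, so 258 would correspond to two learned merges. The function name train_bpe_sketch and its tie-breaking rule are hypothetical.

```python
from collections import Counter


def train_bpe_sketch(vocab_size, texts):
    """Learn byte-level BPE merges until the vocabulary reaches vocab_size.

    Illustrative only: assumes ids 0..255 are raw bytes and treats each
    input string as a single pre-split chunk.
    """
    chunks = [list(text.encode("utf-8")) for text in texts]
    merges = []  # learned merges in order: ((left_id, right_id), new_id)
    next_id = 256

    while next_id < vocab_size:
        # Count adjacent id pairs across all chunks.
        pair_counts = Counter()
        for chunk in chunks:
            pair_counts.update(zip(chunk, chunk[1:]))
        if not pair_counts:
            break  # nothing left to merge

        # Most frequent pair wins; ties go to the smallest pair (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append((best, next_id))

        # Replace every occurrence of the best pair with the new id.
        for i, chunk in enumerate(chunks):
            merged, j = [], 0
            while j < len(chunk):
                if j + 1 < len(chunk) and (chunk[j], chunk[j + 1]) == best:
                    merged.append(next_id)
                    j += 2
                else:
                    merged.append(chunk[j])
                    j += 1
            chunks[i] = merged
        next_id += 1

    return merges


merges = train_bpe_sketch(
    258, ["low low low low", "lower lower", "newest newest newest"]
)
print(merges)  # [((108, 111), 256), ((256, 119), 257)]
```

Under these assumptions the two merges are "l" + "o", then "lo" + "w". The real tokenizer may break ties or count pairs differently.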

For real corpora, use a GPT-style split pattern:

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)
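To see what a split pattern of this shape does to text, the sketch below uses a simplified, ASCII-only approximation that runs on Python's standard re module. It is not the pattern above: \p{L} and \p{N} are replaced by [A-Za-z] and [0-9], which is an assumption made only so the example needs no extra dependencies. The names ASCII_SPLIT and pre_split are hypothetical.

```python
import re

# Simplified ASCII-only stand-in for the GPT-style pattern (assumption:
# letters are A-Z/a-z and digits are 0-9; the real pattern uses \p{L}, \p{N}).
ASCII_SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"   # common English contractions
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+"    # a word, optionally led by one non-alphanumeric char
    r"|[0-9]{1,3}"                    # digits in groups of at most three
    r"| ?[^\sA-Za-z0-9]+[\r\n]*"      # punctuation runs
    r"|\s*[\r\n]+"                    # newlines with leading whitespace
    r"|\s+(?!\S)"                     # trailing whitespace
    r"|\s+"                           # any remaining whitespace
)


def pre_split(text):
    """Return the chunks that would be tokenized independently."""
    return ASCII_SPLIT.findall(text)


print(pre_split("low lower newest"))  # ['low', ' lower', ' newest']
print(pre_split("It's 2024!"))        # ['It', "'s", ' ', '202', '4', '!']
```

Because every character falls into one of the alternatives, joining the chunks reproduces the input, so the split itself loses no text.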

Encode & Decode

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
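The round trip can be pictured with a small pure-Python sketch. It does not call the library. It assumes byte-level ids (0..255 are raw bytes) and uses a hypothetical two-entry merge table; encode_sketch and decode_sketch are illustrative names, and pre-splitting is skipped for brevity.

```python
# Hypothetical merge table: "l" + "o" -> 256, then id 256 + "w" -> 257.
MERGES = [((108, 111), 256), ((256, 119), 257)]


def encode_sketch(text, merges=MERGES):
    """Apply each learned merge, in training order, to the UTF-8 bytes of text."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges:
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode_sketch(ids, merges=MERGES):
    """Expand every id back to its byte string and decode the result as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


ids = encode_sketch("low lower newest")
print(ids)  # [257, 32, 257, 101, 114, 32, 110, 101, 119, 101, 115, 116]
print(decode_sketch(ids))  # low lower newest
```

With these two merges, 16 input bytes become 12 ids, and decoding restores the original string exactly.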

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

fast_bpe_rs-0.2.1.tar.gz (211.0 kB)

Uploaded: Source

Built Distributions

fast_bpe_rs-0.2.1-cp310-abi3-win_amd64.whl (959.5 kB)

Uploaded: CPython 3.10+, Windows x86-64

fast_bpe_rs-0.2.1-cp310-abi3-manylinux_2_34_x86_64.whl (1.1 MB)

Uploaded: CPython 3.10+, manylinux: glibc 2.34+ x86-64

fast_bpe_rs-0.2.1-cp310-abi3-macosx_11_0_arm64.whl (959.4 kB)

Uploaded: CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file fast_bpe_rs-0.2.1.tar.gz.

File metadata

  • Download URL: fast_bpe_rs-0.2.1.tar.gz
  • Upload date:
  • Size: 211.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.2.1.tar.gz
Algorithm Hash digest
SHA256 682d526435a8d825ca54be1e0e7bdb6f2edf79e429550cb660d0495d15e48a8a
MD5 ce1d292b28a0414f990a31790e9cae20
BLAKE2b-256 523669b9d88932dde82c4f25c9d565d2b3cb28f2a28ace794830340335528243

See more details on using hashes here.
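A downloaded file can be checked against the published SHA256 digest with the standard library alone. The sketch below is one way to do it; the helper names and the local path in the usage comment are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with the digest listed for it."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (the path is hypothetical; the digest is the one listed for the sdist):
# matches_published_hash(
#     "fast_bpe_rs-0.2.1.tar.gz",
#     "682d526435a8d825ca54be1e0e7bdb6f2edf79e429550cb660d0495d15e48a8a",
# )
```

Reading in chunks keeps memory use flat regardless of file size. SHA256 is the digest to rely on here; MD5 is not collision-resistant.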

Provenance

The following attestation bundles were made for fast_bpe_rs-0.2.1.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_bpe_rs-0.2.1-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: fast_bpe_rs-0.2.1-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 959.5 kB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.2.1-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 982d721a2f7ffaff7bbbc2fac888e8dd20b79ad240f6d41404c12b611e9c0bd9
MD5 bbb3de2f863350067984b6fe2760041e
BLAKE2b-256 485e8ae8e328dbcb742ace1a8243bf78c4fb9073a65721ea5ce9f4e41e2433ca

Provenance

The following attestation bundles were made for fast_bpe_rs-0.2.1-cp310-abi3-win_amd64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.2.1-cp310-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for fast_bpe_rs-0.2.1-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 9dbe31a03159dce26d30560e0e3666393c8754505e053cc45ce6a6317a73d680
MD5 ddba6cf04ff76ebd905b5a2089cabc1c
BLAKE2b-256 dfbbd2089c09b040b644abf033393eb819cde8803761b2724ba8f8b6aa0ba666

Provenance

The following attestation bundles were made for fast_bpe_rs-0.2.1-cp310-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.2.1-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for fast_bpe_rs-0.2.1-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 94f073b8864804bc2046b033e5d590f2396b8bf2567a4ad79ee9d54d6e4474a8
MD5 a2c6d48805f61eefe5bc771fd9bd6dc7
BLAKE2b-256 c9ff4da115652293eb91b7ba0bf76cc54bd85d4962c8a147eb4538639f6a35e3

Provenance

The following attestation bundles were made for fast_bpe_rs-0.2.1-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs
