Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs

A fast Byte Pair Encoding tokenizer, written in Rust and usable from Python.


Quick start

Install

pip install fast-bpe-rs

No prebuilt wheel for your platform? pip will compile from source. You'll need a Rust toolchain installed first.

Train

from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")  # pre-split regex; (?s).+ keeps each input text as one chunk
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])
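
To make the training call above less opaque, here is a minimal pure-Python sketch of byte-level BPE training. It is not the library's Rust implementation. It assumes the first argument to train is the target vocabulary size, that the vocabulary starts from the 256 byte values (so 258 leaves room for two merges), and that ties are broken by the smallest pair; the README confirms none of these details. The names train_bpe and merge_pair are hypothetical.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(texts, vocab_size):
    """Toy byte-level BPE: ids 0..255 are raw bytes, each merge adds one new id."""
    # Treat each text as a single chunk, as the (?s).+ pattern would.
    seqs = [list(t.encode("utf-8")) for t in texts]
    merges = {}  # (left_id, right_id) -> new_id
    for new_id in range(256, vocab_size):
        # Count adjacent id pairs across all sequences.
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent pair wins; ties go to the smallest pair (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], -p[0], -p[1]))
        merges[best] = new_id
        seqs = [merge_pair(seq, best, new_id) for seq in seqs]
    return merges


merges = train_bpe(
    ["low low low low", "lower lower", "newest newest newest"], 258
)
print(merges)
```

Under these assumptions the two learned merges are "l" + "o" and then "lo" + "w", because those pairs occur six times each in the sample corpus.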

For real corpora, use a GPT-style split pattern:

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)  # corpus_lines: your own list of training strings
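
To see what a GPT-style split pattern does before training, the sketch below pre-splits a string with Python's standard re module. This is an approximation, not the pattern the library runs: re has no \p{L} or \p{N}, so the stand-in classes LETTER, NUMBER, OTHER and OTHER_OR_SPACE (all hypothetical names) are used in their place, and they may differ from the Unicode properties on unusual characters.

```python
import re

# Stand-ins for Unicode property classes, which the stdlib `re` lacks.
LETTER = r"[^\W\d_]"                  # approximates \p{L}
NUMBER = r"\d"                        # approximates \p{N}
OTHER = r"(?:[^\s\w]|_)"              # not whitespace, letter, or number
OTHER_OR_SPACE = r"(?:[^\r\n\w]|_)"   # not newline, letter, or number

SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"                 # common English contractions
    + "|" + OTHER_OR_SPACE + "?" + LETTER + "+"     # word with optional leading char
    + "|" + NUMBER + "{1,3}"                        # digits in groups of up to three
    + "| ?" + OTHER + r"+[\r\n]*"                   # punctuation runs
    + r"|\s*[\r\n]+"                                # newlines with leading whitespace
    + r"|\s+(?!\S)"                                 # trailing whitespace
    + r"|\s+"                                       # any remaining whitespace
)

text = "Hello world, it's 2024!"
pieces = SPLIT.findall(text)
print(pieces)
# ['Hello', ' world', ',', ' it', "'s", ' ', '202', '4', '!']
```

Merges never cross the boundaries between these pieces, which keeps tokens from spanning, for example, a word and the punctuation after it.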

Encode & Decode

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
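
As a rough illustration of the round trip above, the following pure-Python sketch encodes with a fixed merge table and decodes back to text. It is not the library's implementation. The MERGES table is a hypothetical result of training on the small sample corpus, and the free functions encode and decode_to_string only mirror the method names used above.

```python
# Hypothetical merge table: "l"+"o" -> 256, then "lo"+"w" -> 257.
MERGES = {(108, 111): 256, (256, 119): 257}


def encode(text, merges):
    """Turn text into token ids by applying merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode_to_string(ids, merges):
    """Map ids back to bytes and decode them as UTF-8."""
    # Base vocabulary: one entry per byte value.
    vocab = {i: bytes([i]) for i in range(256)}
    # Each merged id is the concatenation of its two parts.
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


ids = encode("low lower newest", MERGES)
text = decode_to_string(ids, MERGES)
print(ids)
print(text)
```

With this table, "low" collapses to the single id 257 wherever it appears, while "newest" stays as raw bytes because no merge covers it.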

License

Apache 2.0

Download files

Source Distribution

fast_bpe_rs-0.3.4.tar.gz (105.2 kB)

Built Distributions

fast_bpe_rs-0.3.4-cp310-abi3-win_amd64.whl (965.2 kB): CPython 3.10+, Windows x86-64

fast_bpe_rs-0.3.4-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB): CPython 3.10+, manylinux (glibc 2.34+), x86-64

fast_bpe_rs-0.3.4-cp310-abi3-macosx_11_0_arm64.whl (921.0 kB): CPython 3.10+, macOS 11.0+, ARM64

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.4.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs
