Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

Project description

fast-bpe-rs

A blazing-fast Byte Pair Encoding tokenizer — written in Rust, usable from Python.


Quick start

Install

pip install fast-bpe-rs

If no prebuilt wheel matches your platform, pip builds the package from the source distribution. This requires a Rust toolchain to be installed first.

Train

from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")  # regex pattern for pre-splitting text
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])
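To make the training step concrete, here is a minimal pure-Python sketch of byte-level BPE training. It is not the library's implementation. It assumes that the first argument to `train` is the target vocabulary size and that the base vocabulary is the 256 byte values, so 258 would mean two learned merges. The helper names `train_bpe` and `merge_pair` and the tie-breaking rule are hypothetical choices made for this illustration.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(texts, vocab_size):
    """Toy byte-level BPE trainer; returns the learned merges in order.

    Assumption: ids 0..255 are the raw byte values, and merge number i
    creates token id 256 + i.
    """
    # Each text is treated as one chunk, mirroring the "(?s).+" pre-split.
    sequences = [list(text.encode("utf-8")) for text in texts]
    merges = []
    while 256 + len(merges) < vocab_size:
        # Count adjacent token pairs across all sequences.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair wins; ties go to the smallest pair
        # (an arbitrary rule chosen for this sketch).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        new_id = 256 + len(merges)
        merges.append(best)
        sequences = [merge_pair(seq, best, new_id) for seq in sequences]
    return merges


merges = train_bpe(
    ["low low low low", "lower lower", "newest newest newest"], 258
)
print(merges)
```

A real trainer would use incremental pair-count updates instead of recounting every pair on each iteration; the loop above favors readability over speed.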

For real corpora, use a GPT-style split pattern:

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)
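The split pattern decides which chunks of text BPE merges are learned within. The sketch below illustrates the idea with Python's standard `re` module. It does not call the library, and `SIMPLE_WORDS` is a simplified, ASCII-only stand-in (an assumption for this example) for the GPT-style pattern above, which uses `\p{L}` and `\p{N}` classes that `re` does not support.

```python
import re

# Matches the entire text, newlines included, as a single chunk.
WHOLE_TEXT = r"(?s).+"

# Hypothetical simplified pattern: words and digit runs with an optional
# leading space, runs of punctuation, and runs of whitespace.
SIMPLE_WORDS = r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"


def pre_split(pattern, text):
    """Return the successive non-overlapping matches of `pattern` in `text`."""
    return [m.group(0) for m in re.finditer(pattern, text)]


text = "low lower newest"
print(pre_split(WHOLE_TEXT, text))    # one chunk: merges may span spaces
print(pre_split(SIMPLE_WORDS, text))  # word-level chunks: merges stay inside a word
```

With the whole-text pattern, a merge can join the end of one word to the start of the next. With a word-level pattern, merges stay within word boundaries, which is why word-level splitting is the usual choice for real corpora.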

Encode & Decode

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)

License

Apache 2.0

Download files

Source Distribution

fast_bpe_rs-0.3.2.tar.gz (105.1 kB)

Uploaded: Source

Built Distributions

fast_bpe_rs-0.3.2-cp310-abi3-win_amd64.whl (944.5 kB)

Uploaded: CPython 3.10+, Windows x86-64

fast_bpe_rs-0.3.2-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB)

Uploaded: CPython 3.10+, manylinux (glibc 2.34+) x86-64

fast_bpe_rs-0.3.2-cp310-abi3-macosx_11_0_arm64.whl (904.9 kB)

Uploaded: CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file fast_bpe_rs-0.3.2.tar.gz.

File metadata

  • Download URL: fast_bpe_rs-0.3.2.tar.gz
  • Upload date:
  • Size: 105.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.3.2.tar.gz
Algorithm Hash digest
SHA256 62ae089b970e0bdee0b2a7407c983f92c2d7ac2be20578d48a87a19eef093fa9
MD5 5275db666b47511b8885e657901cc73f
BLAKE2b-256 2ee11cec7ebaf612fa2448a3030420eee3494923ada4d32e2ad1c262d28320b7

See more details on using hashes here.
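As one way to use the digests listed above, the following sketch computes the SHA256 of a downloaded file with Python's standard `hashlib` so it can be compared against the published value. The function name `sha256_of_file` and the local file path are hypothetical. pip can also enforce hashes itself through its hash-checking mode.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Published digest for the source distribution, copied from the listing above.
EXPECTED_SDIST_SHA256 = (
    "62ae089b970e0bdee0b2a7407c983f92c2d7ac2be20578d48a87a19eef093fa9"
)

# Usage, assuming the archive was downloaded to the current directory:
# assert sha256_of_file("fast_bpe_rs-0.3.2.tar.gz") == EXPECTED_SDIST_SHA256
```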

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.2.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_bpe_rs-0.3.2-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: fast_bpe_rs-0.3.2-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 944.5 kB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.3.2-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 dc8b85942ca8cc773ae139a1e29c8694c52b18cf69d520b0f886e686c07da8cb
MD5 fb45e348f4ee7600696cdbb40306e277
BLAKE2b-256 3c4594904de9e1e22425b900f124027e51572e91f1300e4ec0070f2e09958e2d

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.2-cp310-abi3-win_amd64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.3.2-cp310-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for fast_bpe_rs-0.3.2-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 63283a75bee6509b66a51307d20e737cfc98c5c3083a7d555cbaa0b28b0c81ec
MD5 464bfccdffc9fb95fc4b00a0f725fa9c
BLAKE2b-256 ae56082e7221891a2fd462f1cc86a50cb0e2a4132e7b2bbbc0d34c22fd47f23f

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.2-cp310-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.3.2-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for fast_bpe_rs-0.3.2-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 173ca86859234d51a006f1e0d735dcc271fedfc31546f362ac68fbaaf245d8c8
MD5 c51ef0842b6e0ea550da8f8488240d34
BLAKE2b-256 80cb6c5441532fab1c8ae40063b57340670f9d78c982aeeccdd4ac826b9d1474

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.2-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs
