Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs

A fast Byte Pair Encoding (BPE) tokenizer, written in Rust and usable from Python.


Quick start

Install

pip install fast-bpe-rs

No prebuilt wheel for your platform? pip will compile from source. You'll need a Rust toolchain installed first.

Train

from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")  # regex pattern for pre-splitting text
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])
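To make the training call above more concrete, here is a toy byte-level BPE trainer in pure Python. It sketches the general technique and is not this package's implementation: `train_bpe` and `merge_pair` are hypothetical helpers, and the sketch assumes that the first argument to `train` is a target vocabulary size that already includes the 256 single-byte tokens, so 258 would leave room for two merges.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(vocab_size, texts):
    """Toy trainer (hypothetical helper, not the fast_bpe_rs API).

    Assumes ids 0..255 are the raw bytes and that each text is one
    pre-split chunk, which is what the pattern (?s).+ produces.
    """
    chunks = [list(text.encode("utf-8")) for text in texts]
    vocab = {i: bytes([i]) for i in range(256)}
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for new_id in range(256, vocab_size):
        counts = Counter()
        for ids in chunks:
            counts.update(zip(ids, ids[1:]))  # adjacent pairs within a chunk
        if not counts:
            break  # nothing left to merge
        best = max(counts, key=counts.get)  # on ties, the first pair seen wins
        merges[best] = new_id
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        chunks = [merge_pair(ids, best, new_id) for ids in chunks]
    return merges, vocab


merges, vocab = train_bpe(
    258, ["low low low low", "lower lower", "newest newest newest"]
)
print([vocab[i] for i in merges.values()])
```

On this tiny corpus the two learned tokens cover `lo` and then `low`, since those pairs are the most frequent. A production trainer would typically update pair counts incrementally instead of recounting after every merge.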

For real corpora, use a GPT-style split pattern:

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)  # corpus_lines: your own list of training strings
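To see what a split pattern like this does to text, the sketch below runs an approximation of it through Python's standard `re` module. Two assumptions apply: `\p{L}` and `\p{N}` are replaced with the ASCII classes `A-Za-z` and `0-9` because `re` has no `\p{...}` support, and pre-splitting is assumed to mean "take successive matches of the pattern". `pre_split` is a hypothetical helper, not part of the package.

```python
import re

# ASCII-only approximation of the split pattern above: \p{L} -> A-Za-z and
# \p{N} -> 0-9, because the standard-library `re` module has no \p{...} classes.
SPLIT_PATTERN = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\nA-Za-z0-9]?[A-Za-z]+|[0-9]{1,3}"
    r"| ?[^\sA-Za-z0-9]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)


def pre_split(text):
    """Return the successive pattern matches as a list of chunks."""
    return SPLIT_PATTERN.findall(text)


text = "Hello world, it's 2024!"
chunks = pre_split(text)
print(chunks)
```

Words keep their leading space, contractions such as `'s` are split off, and digit runs are cut into groups of at most three. Because merges are learned within chunks, a pattern like this keeps tokens from spanning word boundaries, whereas `(?s).+` treats each input string as one chunk.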

Encode & Decode

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
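For intuition about what encoding and decoding involve, the following pure-Python sketch replays a small merge table over the UTF-8 bytes of the input and then reverses it. Everything here is illustrative: `MERGES`, `toy_encode`, and `toy_decode` are hypothetical, pre-splitting is skipped, and the package's real ids and merge order may differ.

```python
# Hypothetical merge table: (left_id, right_id) -> new_id, in learned order.
# "l" + "o" -> 256, then 256 + "w" -> 257; illustrative values only.
MERGES = {(108, 111): 256, (256, 119): 257}


def toy_encode(text, merges=MERGES):
    """Turn text into token ids by replaying the merges in the order learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def toy_decode(ids, merges=MERGES):
    """Expand each id back to its bytes, then decode the bytes as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


ids = toy_encode("low lower newest")
print(ids)
print(toy_decode(ids))
```

With this table, both occurrences of `low` collapse to the single id 257 while the remaining characters stay as byte values, and decoding returns the original string.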

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

fast_bpe_rs-0.3.3.tar.gz (105.5 kB)

Uploaded Source

Built Distributions

fast_bpe_rs-0.3.3-cp310-abi3-win_amd64.whl (945.8 kB)

Uploaded CPython 3.10+, Windows x86-64

fast_bpe_rs-0.3.3-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB)

Uploaded CPython 3.10+, manylinux: glibc 2.34+ x86-64

fast_bpe_rs-0.3.3-cp310-abi3-macosx_11_0_arm64.whl (906.5 kB)

Uploaded CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file fast_bpe_rs-0.3.3.tar.gz.

File metadata

  • Download URL: fast_bpe_rs-0.3.3.tar.gz
  • Size: 105.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.3.3.tar.gz
Algorithm Hash digest
SHA256 0b4e3e59e6b4993b2dca7d5aa017a0eda0a234729258f9e010298c083b604fa3
MD5 bd139ba54e5ccd2281cbe4811f8df447
BLAKE2b-256 5098ad5a194729399529df8bef6588573b5160b1354f22ed2fb6796b89e9b437
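
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib`. This is a minimal sketch: `sha256_of` and `matches_published_hash` are hypothetical helper names, and the local file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 listed above for fast_bpe_rs-0.3.3.tar.gz
EXPECTED_SDIST_SHA256 = (
    "0b4e3e59e6b4993b2dca7d5aa017a0eda0a234729258f9e010298c083b604fa3"
)
```

After downloading the source distribution, `matches_published_hash("fast_bpe_rs-0.3.3.tar.gz", EXPECTED_SDIST_SHA256)` should return `True` if the file is intact.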

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.3.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_bpe_rs-0.3.3-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: fast_bpe_rs-0.3.3-cp310-abi3-win_amd64.whl
  • Size: 945.8 kB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.3.3-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 390981f36afbe5bdf425db3a3d4941bb73ef70dd677df9864d75f27b4c4b3f1d
MD5 9733480b2578b5dec42f66c34a2522e7
BLAKE2b-256 2897bd559339f70d869028b8a620984af548c43148988048b52d35bf401aa55e

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.3-cp310-abi3-win_amd64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.3.3-cp310-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for fast_bpe_rs-0.3.3-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 ce2746396dff0e268c72068ef8b5c81cc1cf19008dcb570ce229b045641d7fd4
MD5 06f06068b6ec9c323b58bec68f051bc6
BLAKE2b-256 ef14ab3d534e3000b075b4c5ab775c6d9cfe97c9dd5ed06845d3ca65c1fadf15

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.3-cp310-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.3.3-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for fast_bpe_rs-0.3.3-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 97f19669c69ef0b43e43f9af1448fb61526d01e49d719188178de8423723e3da
MD5 bcc479d34f36760d140e3b0affe331c4
BLAKE2b-256 937b6edd60ec2df4f9a5485c7fcf77e29de7b2f70ef5d4c8e9ff0c940c730646

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.3-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs
