
Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.


fast-bpe-rs

A blazing-fast Byte Pair Encoding tokenizer — written in Rust, usable from Python.


Quick start

Install

pip install fast-bpe-rs

No prebuilt wheel for your platform? pip will compile from source. You'll need a Rust toolchain installed first.

Train

from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")  # regex pattern for pre-splitting text
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])
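To make the training call above more concrete, here is a minimal pure-Python sketch of byte-level BPE training. It is an illustration only, not the Rust implementation behind fast-bpe-rs. It assumes the first argument to train is the target vocabulary size and that the base vocabulary is the 256 single-byte tokens, so 258 would mean two learned merges. The names train_bpe_sketch and merge_pair are hypothetical, and the tie-breaking rule (first pair seen wins) is a choice made for this sketch.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe_sketch(vocab_size, texts):
    """Learn merges until the vocabulary reaches `vocab_size` (assumed semantics)."""
    # Each text is treated as a single chunk, mirroring the r"(?s).+" pattern.
    seqs = [list(t.encode("utf-8")) for t in texts]
    vocab = {i: bytes([i]) for i in range(256)}  # base vocabulary: single bytes
    merges = []
    while len(vocab) < vocab_size:
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))  # count adjacent token pairs
        if not counts:
            break  # nothing left to merge
        # Most frequent pair; ties go to the pair seen first (sketch-only rule).
        pair = max(counts, key=counts.get)
        new_id = len(vocab)
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        merges.append(pair)
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, vocab


merges, vocab = train_bpe_sketch(
    258, ["low low low low", "lower lower", "newest newest newest"]
)
print(merges)
print(vocab[256], vocab[257])
```

The library may break ties or count pairs differently, so the merges it learns are not guaranteed to match this sketch.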

For real corpora, use a GPT-style split pattern:

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)  # corpus_lines: your training texts, a list of strings

Encode & Decode

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
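As a rough picture of what encoding and decoding involve, the sketch below applies a list of learned merges to UTF-8 bytes and then reverses the process. It is a simplified illustration, not the library's code: encode_sketch and decode_sketch are hypothetical names, the merges list is hard-coded for the example, and pre-splitting with a regex pattern is left out.

```python
def encode_sketch(text, merges):
    """Apply merges in the order they were learned; merge k yields token id 256 + k."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        new_id = 256 + rank
        out = []
        i = 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode_sketch(ids, merges):
    """Rebuild the byte string for each token id, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (a, b) in enumerate(merges):
        vocab[256 + rank] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


# Hypothetical merges: "l"+"o" becomes 256, then "lo"+"w" becomes 257.
merges = [(108, 111), (256, 119)]
ids = encode_sketch("low lower newest", merges)
text = decode_sketch(ids, merges)
print(ids)
print(text)
```

Because every token id maps back to a fixed byte string, decoding the output of encoding returns the original text.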

License

Apache 2.0


Download files

Source Distribution

fast_bpe_rs-0.3.0.tar.gz (212.0 kB)

Built Distributions

fast_bpe_rs-0.3.0-cp310-abi3-win_amd64.whl (959.5 kB): CPython 3.10+, Windows x86-64

fast_bpe_rs-0.3.0-cp310-abi3-manylinux_2_34_x86_64.whl (1.1 MB): CPython 3.10+, manylinux (glibc 2.34+), x86-64

fast_bpe_rs-0.3.0-cp310-abi3-macosx_11_0_arm64.whl (958.6 kB): CPython 3.10+, macOS 11.0+, ARM64

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.0.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

