Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

Project description

fast-bpe-rs

A fast Byte Pair Encoding (BPE) tokenizer, written in Rust and usable from Python.


Quick start

Install

pip install fast-bpe-rs

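Prebuilt wheels for 0.3.1 cover CPython 3.10+ on Windows x86-64, Linux x86-64 (glibc 2.34+), and macOS 11.0+ ARM64. On any other platform, pip compiles from the source distribution, so you'll need a Rust toolchain installed first.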

Train

from fast_bpe_rs import BPE

# The regex pre-splits text into chunks; (?s).+ keeps each input string as one chunk.
bpe = BPE(r"(?s).+")
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])
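To make the training call above concrete, here is a minimal pure-Python sketch of byte-level BPE training. It is not the fast-bpe-rs implementation. It assumes that ids 0 to 255 are the raw byte values, so a target size of 258 leaves room for exactly two merges, and it breaks frequency ties by first occurrence, which the library may do differently. The names train_bpe_sketch and merge_pair are hypothetical.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe_sketch(vocab_size, texts):
    """Learn merges until the vocabulary reaches `vocab_size` (assumes 256 base byte ids)."""
    # Each text is one chunk, mirroring the (?s).+ pattern (no pre-splitting).
    chunks = [list(t.encode("utf-8")) for t in texts]
    merges = []
    for new_id in range(256, vocab_size):
        counts = Counter()
        for ids in chunks:
            counts.update(zip(ids, ids[1:]))  # count adjacent id pairs
        if not counts:
            break  # nothing left to merge
        # Most frequent pair; ties go to the pair seen first (an assumption).
        pair = max(counts, key=counts.get)
        merges.append((pair, new_id))
        chunks = [merge_pair(ids, pair, new_id) for ids in chunks]
    return merges


merges = train_bpe_sketch(258, ["low low low low", "lower lower", "newest newest newest"])
print(merges)  # [((108, 111), 256), ((256, 119), 257)], i.e. "lo", then "low"
```

With this toy corpus the pairs ("l", "o") and ("o", "w") both occur six times, so the first merge depends on the tie-breaking rule; the ids a real tokenizer assigns may differ.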

For real corpora, use a GPT-style split pattern:

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)  # corpus_lines: your own list of training strings (not defined here)
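To see what a split pattern of this shape does to a sentence, the sketch below runs an approximation of it with Python's standard re module. The standard module has no \p{L} or \p{N} classes, so the pattern here substitutes [^\W\d_] for letters and \d for numbers; that substitution is an assumption made for illustration and is not exactly equivalent to the Unicode classes above. APPROX_SPLIT is a hypothetical name, and nothing here calls fast-bpe-rs.

```python
import re

# Approximation of the GPT-style pattern using only stdlib `re` character classes.
APPROX_SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"         # common English contractions
    r"|(?:[^\r\n\w]|_)?[^\W\d_]+"           # optional leading symbol/space + letters
    r"|\d{1,3}"                             # digits in groups of at most three
    r"| ?(?:[^\s\w]|_)+[\r\n]*"             # punctuation run, optional leading space
    r"|\s*[\r\n]+"                          # newlines with preceding whitespace
    r"|\s+(?!\S)"                           # trailing whitespace
    r"|\s+"                                 # any remaining whitespace
)

text = "Hello world, it's 2024!"
pieces = APPROX_SPLIT.findall(text)
print(pieces)
# ['Hello', ' world', ',', ' it', "'s", ' ', '202', '4', '!']
```

Merges are then learned within each piece only, so a token never spans a word boundary, a contraction boundary, or more than three digits.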

Encode & Decode

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
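As a rough illustration of what encoding and decoding involve, the sketch below applies a hand-written merge table to raw bytes and then reverses it. The table (l + o becomes 256, then 256 + w becomes 257), the assumption that ids 0 to 255 are raw bytes, and the function names sketch_encode and sketch_decode are all hypothetical; the ids returned by fast-bpe-rs depend on the trained model and may look different.

```python
# Hypothetical merge table: (left id, right id) -> new id, in learned order.
MERGES = {(108, 111): 256, (256, 119): 257}  # "l"+"o" -> 256, 256+"w" -> 257


def sketch_encode(text, merges=MERGES):
    """Turn text into ids by applying each merge in the order it was learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def sketch_decode(ids, merges=MERGES):
    """Expand ids back to bytes, then decode the bytes as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


ids = sketch_encode("low lower newest")
print(ids)  # [257, 32, 257, 101, 114, 32, 110, 101, 119, 101, 115, 116]
print(sketch_decode(ids))  # low lower newest
```

Decoding is lossless for any ids produced by encoding, because every merged id expands back to the exact bytes it replaced.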

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

  • fast_bpe_rs-0.3.1.tar.gz (104.9 kB): Source

Built Distributions

  • fast_bpe_rs-0.3.1-cp310-abi3-win_amd64.whl (943.8 kB): CPython 3.10+, Windows x86-64
  • fast_bpe_rs-0.3.1-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB): CPython 3.10+, manylinux (glibc 2.34+) x86-64
  • fast_bpe_rs-0.3.1-cp310-abi3-macosx_11_0_arm64.whl (904.3 kB): CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file fast_bpe_rs-0.3.1.tar.gz.

File metadata

  • Download URL: fast_bpe_rs-0.3.1.tar.gz
  • Size: 104.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.3.1.tar.gz
Algorithm Hash digest
SHA256 df9ce223cf4e308be69d789ebf0729b95dc24fd7470bada6c14a650c8142b2ed
MD5 90a003aa99fcb5fbfb5daa17f23727cb
BLAKE2b-256 4d03ffcaa4f65f0a898695062fe06572e13596f2a93b81f4fd24275c37ff07d9

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.1.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_bpe_rs-0.3.1-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: fast_bpe_rs-0.3.1-cp310-abi3-win_amd64.whl
  • Size: 943.8 kB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.3.1-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 fe8308f9b15b89f093a1e82b46b0110823149d0dba16c14b452bab86c2637186
MD5 92bb1f97e4927a879c38bb6d6c20fe96
BLAKE2b-256 91443740bee6c345b2f63f55ea073fd94bce663c37eb321be06b0ac277e7bda4

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.1-cp310-abi3-win_amd64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.3.1-cp310-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for fast_bpe_rs-0.3.1-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 6da04c5238233a29686fb19926d776b2d2f3a7740aff62fee34fbe2b678722cb
MD5 d47623ab5db5ddf63a545866a28debe5
BLAKE2b-256 2adc86ea08d5bdb4fd10eaecf2396f46ee5bf57bae455841ee8ddb249583030c

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.1-cp310-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.3.1-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for fast_bpe_rs-0.3.1-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f0fec2880f20f29dac0b33c95eab2bc14814b9102466ea306afd04d60d23a0aa
MD5 7b1de429591de11498600c4c71848997
BLAKE2b-256 6e9a7494e2a080f9a49c4e3e6d6a49a6d45a4398312bf738984a7971b6a791d1

Provenance

The following attestation bundles were made for fast_bpe_rs-0.3.1-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs
