Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

Project description

fast-bpe-rs

A fast Byte Pair Encoding (BPE) tokenizer written in Rust and usable from Python.


Quick start

Install

pip install fast-bpe-rs

No prebuilt wheel for your platform? pip will compile from source. You'll need a Rust toolchain installed first.

Train

from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")  # regex pattern for pre-splitting text
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

For real corpora, use a GPT-style split pattern:

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)
bpe.train(50_000, corpus_lines)
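
To get a feel for what a split pattern does before any merges are learned, here is a rough sketch using Python's standard `re` module. It is not how fast-bpe-rs works internally: `re` has no `\p{L}` or `\p{N}` classes, so the pattern below is a simplified ASCII-only stand-in for the GPT-style pattern above, and `pre_split` is a hypothetical helper, not part of the package.

```python
import re

# Simplified, ASCII-only stand-in for the GPT-style pattern (assumption:
# letters are [A-Za-z] and digits are [0-9]). Not the pattern shown above.
SPLIT = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"    # common English contractions
    r"| ?[A-Za-z]+"             # a word, with an optional leading space
    r"| ?[0-9]{1,3}"            # up to three digits at a time
    r"| ?[^\sA-Za-z0-9]+"       # runs of punctuation or symbols
    r"|\s+(?!\S)|\s+"           # whitespace
)

def pre_split(text: str) -> list[str]:
    """Cut text into chunks; a BPE trainer would then learn merges per chunk."""
    return SPLIT.findall(text)

chunks = pre_split("It's 2024, low lower!")
print(chunks)
# ['It', "'s", ' 202', '4', ',', ' low', ' lower', '!']
```

Joining the chunks gives back the original text, so pre-splitting only decides where merges may happen; it does not drop any characters.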

Encode & Decode

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
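
For intuition about what training, encoding, and decoding do, here is a toy byte-level BPE in pure Python. It is a minimal sketch under stated assumptions, not the fast-bpe-rs implementation: it assumes the first argument to `train` is the target vocabulary size (256 byte tokens plus learned merges), treats each text as a single chunk (as with the `(?s).+` pattern), and breaks ties by first occurrence. The function names here are hypothetical and unrelated to the package's API.

```python
from collections import Counter

def merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train(vocab_size, texts):
    """Learn merges until 256 byte tokens + merges reach `vocab_size` (assumed meaning)."""
    seqs = [list(t.encode("utf-8")) for t in texts]
    merges = []
    for new_id in range(256, vocab_size):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        seqs = [merge(s, best, new_id) for s in seqs]
    return merges

def encode(text, merges):
    """Apply the learned merges, in the order they were learned, to the UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    for k, pair in enumerate(merges):
        ids = merge(ids, pair, 256 + k)
    return ids

def decode(ids, merges):
    """Expand each token id back to its bytes and decode as UTF-8."""
    table = {i: bytes([i]) for i in range(256)}
    for k, (a, b) in enumerate(merges):
        table[256 + k] = table[a] + table[b]
    return b"".join(table[i] for i in ids).decode("utf-8")

merges = train(258, ["low low low low", "lower lower", "newest newest newest"])
ids = encode("low lower newest", merges)
assert decode(ids, merges) == "low lower newest"
```

With a target of 258, this sketch learns two merges (`l`+`o`, then `lo`+`w`), so the 16-byte input encodes to 12 tokens and decodes back to the same string.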

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

  • fast_bpe_rs-0.2.0.tar.gz (211.0 kB): Source

Built Distributions

  • fast_bpe_rs-0.2.0-cp310-abi3-win_amd64.whl (958.7 kB): CPython 3.10+, Windows x86-64
  • fast_bpe_rs-0.2.0-cp310-abi3-manylinux_2_34_x86_64.whl (1.1 MB): CPython 3.10+, manylinux (glibc 2.34+) x86-64
  • fast_bpe_rs-0.2.0-cp310-abi3-macosx_11_0_arm64.whl (959.7 kB): CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file fast_bpe_rs-0.2.0.tar.gz.

File metadata

  • Download URL: fast_bpe_rs-0.2.0.tar.gz
  • Size: 211.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.2.0.tar.gz
Algorithm Hash digest
SHA256 b4426edf43098ae8733b481b68693b5e91fdd0c2bdb9a12b3a3e3d6d3b8297ed
MD5 5a3b58e61da1e33eb44e81671565f151
BLAKE2b-256 ba7a9fb4fb62ce18e257704c9b037920e35cac11428e9392cd6129bfc8c59c55
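
If you download a file manually, you can compare it against the SHA256 digest listed above. The following is a small sketch using only Python's standard `hashlib`; `sha256_of` and `verify` are hypothetical helper names, and the file path is whatever location you saved the download to.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def verify(path: str, expected_hex: str) -> bool:
    """True if the file's SHA256 digest matches the published one (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify("fast_bpe_rs-0.2.0.tar.gz", "b4426edf43098ae8733b481b68693b5e91fdd0c2bdb9a12b3a3e3d6d3b8297ed")` should return `True` for an intact copy of the source distribution.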

Provenance

The following attestation bundles were made for fast_bpe_rs-0.2.0.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_bpe_rs-0.2.0-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: fast_bpe_rs-0.2.0-cp310-abi3-win_amd64.whl
  • Size: 958.7 kB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.2.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 5431f095b13b87e43fb2290d19587d7958e5a54bd8ced8d9e919586bf2a983f6
MD5 b3e50ffb51d9f7724f3f87b6ca0af79c
BLAKE2b-256 095b134dd0f32dbf431e3bfb65799e52df9e810cc9201d027cf99e74006ee0f7

Provenance

The following attestation bundles were made for fast_bpe_rs-0.2.0-cp310-abi3-win_amd64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.2.0-cp310-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for fast_bpe_rs-0.2.0-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 fad6e78fbdcc3afcda1984d939c9a78f168d35f51877c72f535c0935c781bd3f
MD5 c670648ca93eaac03bc05a264adc9d8f
BLAKE2b-256 7d418e17c130f3f8ce8707f8efd496bc6a2c47aa228f68943d301c7e4de46cce

Provenance

The following attestation bundles were made for fast_bpe_rs-0.2.0-cp310-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.2.0-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for fast_bpe_rs-0.2.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d96dd060c8960d7c49de59ee01e32f625fecee1ddb8321c46a22ccf2b2fb3c76
MD5 f9c3a9605b9c53755d28412c34701ead
BLAKE2b-256 294915c4ff22b2a8a237f5d4b617c59944aa7232798adfbc2ec020c2f04e1930

Provenance

The following attestation bundles were made for fast_bpe_rs-0.2.0-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs
