Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs: A Fast Rust BPE Library

A fast Byte Pair Encoding (BPE) tokenizer written in Rust, with Python bindings for training BPE models and for encoding and decoding text.

Naive vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences of the pair being merged in the current step.

| Aspect | Naive BPE trainer | fast-bpe-rs |
|---|---|---|
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN) because each merge triggers another global count pass | O(N) setup, then per merge roughly O(k) local updates instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
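
The "Naive BPE trainer" side of the comparison above can be sketched in a few lines of Python. This is a minimal illustration of the O(MN) loop, not the fast-bpe-rs implementation. It borrows one idea from the fast-bpe-rs side: each distinct chunk is stored once with a frequency weight. The function names, the tie-breaking rule, treating each document as a single chunk, and starting new IDs at 256 are all assumptions made for this sketch.

```python
from collections import Counter


def count_pairs(chunks):
    """Count adjacent token pairs, weighting each chunk by its frequency."""
    pairs = Counter()
    for tokens, freq in chunks.items():
        for pair in zip(tokens, tokens[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(tokens, pair, new_id):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def naive_train(docs, num_merges):
    """Return the learned merges as a list of ((left, right), new_id)."""
    # Each document is one chunk of UTF-8 bytes; duplicates collapse into a count.
    chunks = Counter(tuple(doc.encode("utf-8")) for doc in docs)
    merges = []
    for step in range(num_merges):
        pairs = count_pairs(chunks)  # full recount on every merge
        if not pairs:
            break
        # Assumed tie-break: highest count first, then the smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        new_id = 256 + step  # assumed: IDs 0..255 are the raw bytes
        merges.append((best, new_id))
        rewritten = Counter()
        for tokens, freq in chunks.items():
            rewritten[merge_pair(tokens, best, new_id)] += freq
        chunks = rewritten
    return merges


print(naive_train(["aaab", "aaab", "ab"], 2))
# [((97, 97), 256), ((97, 98), 257)]
```

The call to count_pairs inside the loop is the global pass that an incremental trainer avoids: after a merge, only the pairs adjacent to the merged positions change, so their counts can be adjusted locally instead of being recomputed.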

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

from fast_bpe_rs import BPE

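# (?s).+ matches a whole document, newlines included, so each document is one chunk.
bpe = BPE(r"(?s).+")
# Train to a target vocabulary size of 258 on three short documents.
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

ids = bpe.encode("low lower newest")  # list[int]
text = bpe.decode_to_string(ids)  # str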

GPT-style split pattern

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is a list of training strings that you supply.
bpe.train(32768, corpus_lines)
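
To see what a split pattern like this does to a string before any merges are learned, the sketch below applies an ASCII-only stand-in using Python's standard `re` module. The standard module does not support `\p{L}` or `\p{N}`, so here they are replaced with `A-Za-z` and `0-9`; the name `split_chunks` is hypothetical, and the regex engine inside fast-bpe-rs may differ in details.

```python
import re

# ASCII-only stand-in for the pattern above:
# \p{L} becomes A-Za-z and \p{N} becomes 0-9.
ASCII_SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\nA-Za-z0-9]?[A-Za-z]+|[0-9]{1,3}"
    r"| ?[^\sA-Za-z0-9]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)


def split_chunks(text):
    """Return the chunks the pattern matches, in order."""
    return ASCII_SPLIT.findall(text)


chunks = split_chunks("I've got 12345 apples!")
print(chunks)
# ['I', "'ve", ' got', ' ', '123', '45', ' apples', '!']
```

Words keep their leading space, contractions are split off, and digit runs are cut into groups of at most three. Merges are then learned within chunks, so a pair never spans a chunk boundary.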

Special tokens

from fast_bpe_rs import BPE

# Optional second argument: special-token strings mapped to fixed token IDs.
bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")
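
One common way for an encoder to handle special tokens is to cut the input at each special-token string first, then run ordinary BPE only on the segments in between. Whether fast-bpe-rs works this way is not stated here; the sketch below shows the general idea in plain Python, and `split_on_specials` is a hypothetical helper, not part of the library.

```python
import re


def split_on_specials(text, special_tokens):
    """Split text into (segment, is_special) pairs, dropping empty segments."""
    if not special_tokens:
        return [(text, False)] if text else []
    # Try longer names first so one special token cannot shadow a longer one.
    names = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(name) for name in names) + ")"
    parts = re.split(pattern, text)
    return [(part, part in special_tokens) for part in parts if part]


specials = {"<pad>": 600, "<eos>": 601}
print(split_on_specials("a<pad><eos>a", specials))
# [('a', False), ('<pad>', True), ('<eos>', True), ('a', False)]
```

In such a scheme the segments flagged True map directly to their fixed IDs (600 and 601 here), and the remaining segments go through the learned merges.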

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
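
The API offers both decode (bytes) and decode_to_string (str). Assuming the tokenizer works on raw bytes, which the bytes return type suggests, a slice of token IDs need not end on a UTF-8 character boundary, so the raw bytes are not always valid text. How decode_to_string treats invalid sequences is not stated here. The sketch below uses only the Python standard library to show the underlying issue; `bytes_to_text` is a hypothetical helper.

```python
def bytes_to_text(data: bytes) -> str:
    """Decode UTF-8, replacing invalid sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


full = "é".encode("utf-8")  # two bytes: b"\xc3\xa9"
partial = full[:1]          # only the first byte of the character

print(bytes_to_text(full))           # é
print(repr(bytes_to_text(partial)))  # '\ufffd'
```

When the IDs may be a partial sequence, for example while streaming, working with the bytes from decode and converting to text only at a known-complete boundary avoids this problem.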

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

  • fast_bpe_rs-0.5.3.tar.gz (94.8 kB): Source

Built Distributions

  • fast_bpe_rs-0.5.3-cp310-abi3-win_amd64.whl (982.8 kB): CPython 3.10+, Windows x86-64
  • fast_bpe_rs-0.5.3-cp310-abi3-manylinux_2_34_x86_64.whl (1.1 MB): CPython 3.10+, manylinux (glibc 2.34+), x86-64
  • fast_bpe_rs-0.5.3-cp310-abi3-macosx_11_0_arm64.whl (945.6 kB): CPython 3.10+, macOS 11.0+, ARM64

Provenance

The following attestation bundles were made for fast_bpe_rs-0.5.3.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

