Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs: A Fast Rust BPE Library

A fast Byte Pair Encoding (BPE) tokenizer written in Rust, with Python bindings for training BPE models and for encoding and decoding text.

Naive vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
| --- | --- | --- |
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN), because each merge triggers another global count pass | O(N) setup, then roughly O(k) local updates per merge instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
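The difference between the two columns can be illustrated with a small pure-Python sketch. It is not the library's Rust implementation: the function names are hypothetical, chunks are split on whitespace for simplicity, and the incremental version rebuilds each touched sequence instead of splicing a linked structure. It only shows that keeping pair counts and pair locations up to date gives the same merges as recounting the whole corpus every round.

```python
from collections import Counter, defaultdict


def merge_seq(seq, pair, new_id):
    """Replace non-overlapping occurrences of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def pick_best(counts):
    # Highest count wins; ties go to the larger pair (an arbitrary but fixed rule).
    return max(counts, key=lambda p: (counts[p], p))


def train_naive(chunks, num_merges):
    """Recount every pair over the whole corpus before each merge: O(N) per merge."""
    seqs = [list(chunk.encode("utf-8")) for chunk in chunks]
    merges = []
    for step in range(num_merges):
        counts = Counter()
        for seq in seqs:
            for pair in zip(seq, seq[1:]):
                counts[pair] += 1
        if not counts:
            break
        best = pick_best(counts)
        seqs = [merge_seq(seq, best, 256 + step) for seq in seqs]
        merges.append(best)
    return merges


def train_incremental(chunks, num_merges):
    """Count once, then update only the sequences that contain the merged pair."""
    freq = Counter(chunks)  # repeated chunks are stored once with a weight
    seqs = [list(chunk.encode("utf-8")) for chunk in freq]
    weights = list(freq.values())
    counts = Counter()          # pair -> weighted count
    where = defaultdict(set)    # pair -> indices of sequences containing it
    for i, (seq, w) in enumerate(zip(seqs, weights)):
        for pair in zip(seq, seq[1:]):
            counts[pair] += w
            where[pair].add(i)
    merges = []
    for step in range(num_merges):
        if not counts:
            break
        best = pick_best(counts)
        for i in list(where[best]):
            # Remove this sequence's old contribution to the statistics.
            for pair in zip(seqs[i], seqs[i][1:]):
                counts[pair] -= weights[i]
                if counts[pair] == 0:
                    del counts[pair]
                where[pair].discard(i)
            seqs[i] = merge_seq(seqs[i], best, 256 + step)
            # Add the contribution of the rewritten sequence.
            for pair in zip(seqs[i], seqs[i][1:]):
                counts[pair] += weights[i]
                where[pair].add(i)
        merges.append(best)
    return merges


chunks = "low low low low lower lower newest newest newest".split()
assert train_naive(chunks, 5) == train_incremental(chunks, 5)
```

In this sketch the best pair is still found with a linear scan over the counts. The count -> set of pairs buckets described in the table exist to avoid that scan.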

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

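from fast_bpe_rs import BPE

# Treat each document as a single chunk: (?s) lets "." match newlines too.
bpe = BPE(r"(?s).+")
# Train to a target vocabulary size of 258 on three short documents.
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

ids = bpe.encode("low lower newest")  # list[int]
text = bpe.decode_to_string(ids)      # str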
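To make concrete what encoding and decoding compute, here is a minimal pure-Python sketch of byte-level BPE given an already learned merge list. It is an illustration under assumptions, not the library's code: `bpe_encode`, `bpe_decode`, and the example merges are hypothetical, IDs 0 to 255 are assumed to be raw bytes, merge number i is assumed to produce ID 256 + i, and the split pattern and special tokens are ignored.

```python
def bpe_encode(text, merges):
    """Apply merges in the order they were learned to the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    for rank, pair in enumerate(merges):
        new_id = 256 + rank
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def bpe_decode(ids, merges):
    """Expand every token ID back into the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for rank, (left, right) in enumerate(merges):
        vocab[256 + rank] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids)


# Hypothetical merges: "o"+"w" -> 256, then "l"+256 -> 257 (the bytes of "low").
merges = [(111, 119), (108, 256)]
ids = bpe_encode("low lower", merges)
assert ids == [257, 32, 257, 101, 114]
assert bpe_decode(ids, merges).decode("utf-8") == "low lower"
```

Applying merges in learned order is enough here because a merge can only create pairs that involve its own new ID, and those pairs can only have been learned later.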

GPT-style split pattern

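from fast_bpe_rs import BPE

# GPT-style pre-tokenization: contractions, letter runs, digit runs of up to
# three, punctuation runs, and whitespace become separate chunks.
bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is not defined here; supply your own list of training strings.
bpe.train(32768, corpus_lines)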

Special tokens

from fast_bpe_rs import BPE

# The second argument maps each special token string to its token ID.
bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")
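Tokenizers commonly handle special tokens by cutting the input at each special token before ordinary BPE runs on the remaining text. The sketch below shows that general technique with the standard library only. Whether fast-bpe-rs does it this way is an assumption, and `split_on_special` is a hypothetical helper, not part of the package.

```python
import re


def split_on_special(text, special_tokens):
    """Split `text` into (piece, special_id) pairs; special_id is None for plain text."""
    if not special_tokens:
        return [(text, None)] if text else []
    # Try longer tokens first so a token is never shadowed by a shorter prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(tok) for tok in ordered) + ")")
    segments = []
    for piece in pattern.split(text):
        if piece:  # re.split yields empty strings between adjacent matches
            segments.append((piece, special_tokens.get(piece)))
    return segments


segments = split_on_special("a<pad><eos>a", {"<pad>": 600, "<eos>": 601})
assert segments == [("a", None), ("<pad>", 600), ("<eos>", 601), ("a", None)]
```

Pieces tagged with an ID would be emitted directly, while pieces tagged None would go through the split pattern and the learned merges.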

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
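One likely reason for offering both a bytes-returning and a str-returning decode is that byte-level tokens do not have to end on UTF-8 character boundaries, so a slice of a token sequence may not be valid text on its own. The standard-library sketch below shows the underlying issue. `try_decode` is a hypothetical helper, and how decode_to_string treats invalid UTF-8 is not stated here.

```python
def try_decode(chunk):
    """Return the decoded string, or None if `chunk` is not valid UTF-8."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return None


data = "héllo".encode("utf-8")    # "é" takes two bytes: 0xC3 0xA9
left, right = data[:2], data[2:]  # the cut falls inside "é"

assert try_decode(left) is None               # partial character: not valid text
assert try_decode(left + right) == "héllo"    # the full byte string decodes fine
assert left.decode("utf-8", errors="replace") == "h\ufffd"
```

Working with bytes until the full sequence is assembled avoids this problem, for example when decoding tokens one at a time in a streaming setting.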

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

  • fast_bpe_rs-0.5.0.tar.gz (93.7 kB): source

Built Distributions

  • fast_bpe_rs-0.5.0-cp310-abi3-win_amd64.whl (946.3 kB): CPython 3.10+, Windows x86-64
  • fast_bpe_rs-0.5.0-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB): CPython 3.10+, manylinux (glibc 2.34+), x86-64
  • fast_bpe_rs-0.5.0-cp310-abi3-macosx_11_0_arm64.whl (898.6 kB): CPython 3.10+, macOS 11.0+, ARM64
