Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs: A Fast Rust BPE Library

A fast Byte Pair Encoding (BPE) tokenizer written in Rust, with Python bindings for training, encoding, and decoding BPE models.

Naive vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
| --- | --- | --- |
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN), because each merge triggers another global count pass | O(N) setup, then roughly O(k) local updates per merge instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
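
To make the "naive" column concrete, here is a minimal pure-Python sketch of a baseline trainer. It is not the fast-bpe-rs implementation: the function names (count_pairs, merge_pair, train_naive), the tie-breaking rule, and the choice of 256 as the first merged ID are assumptions made for illustration. It does borrow one idea from the right-hand column, storing each repeated chunk once with a frequency weight, but it still recounts every pair after every merge, which is the O(MN) behavior the table describes.

```python
from collections import Counter


def count_pairs(chunks):
    """Count adjacent token pairs, weighting each chunk by its frequency."""
    pairs = Counter()
    for tokens, weight in chunks.items():
        for pair in zip(tokens, tokens[1:]):
            pairs[pair] += weight
    return pairs


def merge_pair(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_naive(words, num_merges):
    """Baseline trainer: one full recount per merge, so roughly O(M * N) overall."""
    # Deduplicate: each distinct chunk is stored once with a frequency weight.
    chunks = Counter(tuple(w.encode("utf-8")) for w in words)
    merges = []
    for step in range(num_merges):
        pairs = count_pairs(chunks)  # global pass over the whole corpus
        if not pairs:
            break
        # Highest count wins; ties go to the smallest pair (an arbitrary rule for this sketch).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        new_id = 256 + step  # assumed layout: IDs 0..255 are the raw bytes
        merges.append((best, new_id))
        merged = Counter()
        for tokens, weight in chunks.items():
            merged[merge_pair(tokens, best, new_id)] += weight
        chunks = merged
    return merges


# The input is assumed to be already split into chunks (here, single words).
merges = train_naive(["low"] * 4 + ["lower"] * 2 + ["newest"] * 3, num_merges=2)
print(merges)  # [((108, 111), 256), ((256, 119), 257)], i.e. "l"+"o", then "lo"+"w"
```

An incremental trainer would replace the count_pairs call inside the loop with local updates to the counts of pairs adjacent to each merged occurrence, which is where the O(k) per-merge cost in the table comes from.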

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

from fast_bpe_rs import BPE

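# "(?s).+" matches an entire document, so each document is treated as one chunk.
bpe = BPE(r"(?s).+")
# Train to a target vocabulary size of 258 on three short documents.
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

# Encode text to token IDs, then decode the IDs back to a string.
ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)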

GPT-style split pattern

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is assumed to be a list of str documents loaded elsewhere.
bpe.train(32768, corpus_lines)
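
To illustrate what a split pattern does before any merges are learned, the sketch below uses Python's standard re module. The pattern is a simplified, ASCII-only stand-in for the GPT-style pattern above (re does not support \p{L}), and the names SPLIT and split_chunks are hypothetical. How fast-bpe-rs applies its pattern internally is not shown here.

```python
import re
from collections import Counter

# Simplified ASCII-only pattern (an assumption for illustration): an optional
# leading space followed by letters, digits, or other symbols, or a run of whitespace.
SPLIT = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def split_chunks(text):
    """Return the non-overlapping matches of SPLIT, in order."""
    return SPLIT.findall(text)


text = "low low low lower 42"
chunks = split_chunks(text)
print(chunks)  # ['low', ' low', ' low', ' lower', ' 42']

# Repeated chunks can then be stored once with a frequency weight.
weights = Counter(chunks)
print(weights[" low"])  # 2

# By contrast, "(?s).+" keeps the whole document as a single chunk.
whole = re.findall(r"(?s).+", text)
print(whole == [text])  # True
```

Because every character matches one of the alternatives, joining the chunks reproduces the input exactly; splitting only determines where chunk boundaries fall.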

Special tokens

from fast_bpe_rs import BPE

# The second argument maps special-token strings to their token IDs.
bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
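
One plausible reason to offer both decode (bytes) and decode_to_string (str) is that a byte-level token boundary can fall inside a multi-byte UTF-8 character, so a partial list of token IDs need not decode to valid text. The sketch below shows this with plain Python bytes only; it does not call fast-bpe-rs, the split point is hypothetical, and whether decode_to_string is strict or lossy on invalid input is not stated in this document.

```python
data = "h\u00e9llo".encode("utf-8")  # b'h\xc3\xa9llo'; the accented letter takes two bytes

# Hypothetical token boundary that falls inside the two-byte character.
left, right = data[:2], data[2:]

try:
    left.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True  # b'h\xc3' ends with an incomplete sequence

# A lossy decode substitutes U+FFFD for the incomplete sequence.
lossy = left.decode("utf-8", errors="replace")

# Concatenating the raw bytes first and decoding once is always safe here.
joined = (left + right).decode("utf-8")

print(strict_failed, repr(lossy), joined)
```

When decoding tokens in pieces (for example, while streaming), accumulating bytes and converting to str once at the end avoids this problem.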

License

Apache 2.0

Download files

Source Distribution

  • fast_bpe_rs-0.4.2.tar.gz (93.6 kB)

Built Distributions

  • fast_bpe_rs-0.4.2-cp310-abi3-win_amd64.whl (952.4 kB): CPython 3.10+, Windows x86-64
  • fast_bpe_rs-0.4.2-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB): CPython 3.10+, manylinux (glibc 2.34+), x86-64
  • fast_bpe_rs-0.4.2-cp310-abi3-macosx_11_0_arm64.whl (903.1 kB): CPython 3.10+, macOS 11.0+, ARM64

File hashes

fast_bpe_rs-0.4.2.tar.gz

  • SHA256: 0bedd639fc7dfa77a81b4e35bdfe6b1558536ac344318269690377e27a447df0
  • MD5: da9990c9b665d6c9a14f0fbfe84d79f7
  • BLAKE2b-256: d5f85f44dc61ef3c59cefa50886357ba69553ac34732667a247363633d749639

fast_bpe_rs-0.4.2-cp310-abi3-win_amd64.whl

  • SHA256: 57b6d2f0ef508a8646611ae38adf13b76fe38662461e25b6400e97525a3ca901
  • MD5: c8d4705db07df447e32fe21feb39e192
  • BLAKE2b-256: 43720c061252c4c045884914d553b6a6b1593282fc67645d265c3f7ffc5f2288

fast_bpe_rs-0.4.2-cp310-abi3-manylinux_2_34_x86_64.whl

  • SHA256: ae32c3c9328c4102de725b82ec9fe531a9acb66d3c6bc76ab7596e812665e61c
  • MD5: 00db464d576d904a3f76faf6ef22669e
  • BLAKE2b-256: 8a1bc897827d5e16af44c85b228738913be9fab843f70756c58b79e0af4fb734

fast_bpe_rs-0.4.2-cp310-abi3-macosx_11_0_arm64.whl

  • SHA256: 4719b85a3b0e0060bbfc542284a0d852a35e705ab6f0d2bc285234356ca66aca
  • MD5: 88120ee4272122c7cbb36ce4f6311cd7
  • BLAKE2b-256: 6d81fb3403468570e880d3741e89b98ebb1688fe02c54c009c7638a79f6e51de

Provenance

Attestation bundles for all four files were published by release.yml on zhixiangli/fast-bpe-rs. Attested values reflect the state when the release was signed and may no longer be current. The source distribution and the Windows wheel are listed as uploaded using Trusted Publishing, via twine/6.1.0 CPython/3.13.7.
