Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs: A Fast Rust BPE Library

A blazing-fast Rust Byte Pair Encoding (BPE) tokenizer with Python bindings for training, encoding, and decoding BPE models.

Naive vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
| --- | --- | --- |
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN), because each merge triggers another global count pass | O(N) setup, then roughly O(k) local updates per merge instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
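
The incremental idea in the right-hand column can be illustrated with a simplified pure-Python sketch. This is not the library's implementation: it keeps the deduplicated weighted chunks and the incrementally maintained pair counts, but it rebuilds each touched sequence instead of splicing a linked structure, and it finds the best pair with a linear scan instead of count buckets. The names train, pairs_of, merge_seq, and the default split pattern are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def pairs_of(seq):
    """Adjacent token pairs of one sequence, in order (duplicates kept)."""
    return list(zip(seq, seq[1:]))


def merge_seq(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair`, left to right, with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train(docs, vocab_size, split_pattern=r"\s*\S+"):
    # Repeated chunks are stored once, with a frequency weight.
    chunk_freq = Counter(
        m.group(0) for doc in docs for m in re.finditer(split_pattern, doc)
    )
    seqs = [list(chunk.encode("utf-8")) for chunk in chunk_freq]
    weights = list(chunk_freq.values())

    # One setup pass: weighted pair counts plus an index of where each pair occurs.
    counts = Counter()
    where = defaultdict(set)  # pair -> indices of sequences containing it
    for idx, seq in enumerate(seqs):
        for p in pairs_of(seq):
            counts[p] += weights[idx]
            where[p].add(idx)

    merges = []
    next_id = 256  # ids 0..255 are the raw byte values
    while next_id < vocab_size and counts:
        # Highest count wins; ties go to the smaller pair so the result is deterministic.
        best = min(counts, key=lambda p: (-counts[p], p))
        merges.append((best, next_id))
        # Only sequences that actually contain `best` are revisited.
        for idx in list(where[best]):
            old = seqs[idx]
            new = merge_seq(old, best, next_id)
            for p in pairs_of(old):
                counts[p] -= weights[idx]
                if counts[p] == 0:
                    del counts[p]
                where[p].discard(idx)
            for p in pairs_of(new):
                counts[p] += weights[idx]
                where[p].add(idx)
            seqs[idx] = new
        next_id += 1
    return merges


merges = train(["low low low low", "lower lower", "newest newest newest"], 258)
print(merges)  # [((108, 111), 256), ((256, 119), 257)]
```

Here 108, 111, and 119 are the byte values of "l", "o", and "w", so the two learned merges are "lo" and then "low". A naive trainer would reach the same merges by recounting every pair after each merge; the sketch only touches the sequences indexed under the chosen pair.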

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

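from fast_bpe_rs import BPE

# (?s).+ matches the whole document, newlines included, so each document is one chunk.
bpe = BPE(r"(?s).+")
# Train to a vocabulary of 258 tokens on three small documents.
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

ids = bpe.encode("low lower newest")  # list[int]
text = bpe.decode_to_string(ids)      # str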
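
To make the encode and decode steps concrete, here is a minimal pure-Python sketch of byte-level BPE encoding and decoding. It does not call fast-bpe-rs, and the MERGES list is a hypothetical result (the pairs "l"+"o" and "lo"+"w") chosen for illustration; the merges the library actually learns may differ. Like the (?s).+ pattern, the sketch treats the whole text as one chunk.

```python
# Hypothetical merge list: (left id, right id) -> new id, in learned order.
MERGES = [((108, 111), 256), ((256, 119), 257)]


def encode(text, merges):
    """Start from UTF-8 bytes and apply each merge in the order it was learned."""
    ids = list(text.encode("utf-8"))
    for (a, b), new_id in merges:
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and ids[i] == a and ids[i + 1] == b:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode(ids, merges):
    """Expand every id back to its byte string and concatenate."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges:
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids)


ids = encode("low lower newest", MERGES)
print(ids)  # [257, 32, 257, 101, 114, 32, 110, 101, 119, 101, 115, 116]
print(decode(ids, MERGES).decode("utf-8"))  # low lower newest
```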

GPT-style split pattern

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is assumed to be a list of str documents prepared by the caller.
bpe.train(32768, corpus_lines)
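
To see what a split pattern of this shape does before any merges are learned, the sketch below approximates it with Python's standard re module. The approximation is an assumption of this example: re has no \p{L} or \p{N}, so [^\W\d_] stands in for letters and \d for numbers, and edge cases around unusual Unicode characters may differ from the pattern above. It does not show how fast-bpe-rs evaluates the pattern internally.

```python
import re

# Approximation of the GPT-style pattern using only stdlib `re` character classes.
SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"     # common English contractions
    r"|(?:[^\r\n\w]|_)?[^\W\d_]+"       # letters, with one optional leading non-alphanumeric char
    r"|\d{1,3}"                         # digits in groups of at most three
    r"| ?(?:[^\s\w]|_)+[\r\n]*"         # punctuation, optionally preceded by a space
    r"|\s*[\r\n]+"                      # newlines with any leading whitespace
    r"|\s+(?!\S)"                       # trailing whitespace
    r"|\s+"                             # any remaining whitespace
)

text = "Hello world, it's 2024!"
chunks = SPLIT.findall(text)
print(chunks)
# ['Hello', ' world', ',', ' it', "'s", ' ', '202', '4', '!']

# The chunks partition the text: nothing is dropped or duplicated.
assert "".join(chunks) == text
```

Merges are then learned within chunks, so a pattern like this keeps words, numbers, and punctuation from being merged into one another.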

Special tokens

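from fast_bpe_rs import BPE

# The second argument maps each special token string to its token id.
bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
# Special tokens can appear in the text passed to encode.
ids = bpe.encode("a<pad><eos>a")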
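
One common way to handle special tokens is to cut them out of the text first and map them straight to their ids, so they never take part in merging. The sketch below shows that idea in pure Python; whether fast-bpe-rs works this way internally is an assumption, and split_specials and encode_with_specials are hypothetical helper names. Ordinary text is emitted as raw bytes here as a stand-in for real BPE encoding.

```python
import re

SPECIAL = {"<pad>": 600, "<eos>": 601}


def split_specials(text, special):
    """Split text into ordinary runs and special tokens, preserving order."""
    # Longest first, so a longer special token is not shadowed by a shorter prefix.
    alternation = "|".join(
        re.escape(tok) for tok in sorted(special, key=len, reverse=True)
    )
    # The capture group makes re.split keep the special tokens in the result.
    return [part for part in re.split(f"({alternation})", text) if part]


def encode_with_specials(text, special):
    ids = []
    for part in split_specials(text, special):
        if part in special:
            ids.append(special[part])         # special token: fixed id, no merging
        else:
            ids.extend(part.encode("utf-8"))  # stand-in for BPE encoding of ordinary text
    return ids


print(split_specials("a<pad><eos>a", SPECIAL))        # ['a', '<pad>', '<eos>', 'a']
print(encode_with_specials("a<pad><eos>a", SPECIAL))  # [97, 600, 601, 97]
```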

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
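
The split between a bytes-returning decode and a str-returning decode_to_string matters because byte-level tokens can cut a multi-byte UTF-8 character in half. The sketch below demonstrates this with plain Python bytes and no merges; it does not call fast-bpe-rs, and how decode_to_string treats invalid UTF-8 (raising or substituting a replacement character) is not stated above, so both options are shown.

```python
text = "naïve"
ids = list(text.encode("utf-8"))  # byte-level ids; "ï" occupies two bytes
print(ids)  # [110, 97, 195, 175, 118, 101]

# Decoding the full sequence round-trips cleanly.
assert bytes(ids).decode("utf-8") == text

# A prefix that ends inside "ï" is valid as bytes but not as UTF-8 text.
partial = bytes(ids[:3])
print(partial)  # b'na\xc3'

try:
    partial.decode("utf-8")  # strict decoding rejects the dangling lead byte
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

lenient = partial.decode("utf-8", errors="replace")  # substitutes U+FFFD
print(strict_ok, repr(lenient))  # False 'na\ufffd'
```

When decoding partial output, such as tokens arriving one at a time, working with bytes and converting to str only at character boundaries avoids this problem.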

License

Apache 2.0

Download files

Source Distribution

fast_bpe_rs-0.4.4.tar.gz (93.3 kB)

Built Distributions

fast_bpe_rs-0.4.4-cp310-abi3-win_amd64.whl (948.5 kB): CPython 3.10+, Windows x86-64

fast_bpe_rs-0.4.4-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB): CPython 3.10+, manylinux (glibc 2.34+), x86-64

fast_bpe_rs-0.4.4-cp310-abi3-macosx_11_0_arm64.whl (899.7 kB): CPython 3.10+, macOS 11.0+, ARM64

Provenance

The following attestation bundles were made for fast_bpe_rs-0.4.4.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

