Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs: A Fast Rust BPE Library

A blazing-fast Rust Byte Pair Encoding (BPE) tokenizer with Python bindings for training, encoding, and decoding BPE models.

Naive vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
| --- | --- | --- |
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN) because each merge triggers another global count pass | O(N) setup, then per merge roughly O(k) local updates instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
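
For reference, the naive column can be sketched in a few lines of pure Python. This is an illustrative baseline, not code from fast-bpe-rs: the function names, the byte-level starting vocabulary of 256 ids, and the tie-break rule (highest count, then smallest pair) are all assumptions made for the example.

```python
from collections import Counter


def count_pairs(seqs):
    """Count adjacent token pairs with one full pass over the corpus, O(N)."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair, new_id):
    """Return a copy of seq with each non-overlapping occurrence of pair replaced."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def naive_train(docs, num_merges):
    """Learn up to num_merges merges, recounting every pair before each one."""
    # Assumption: start from raw UTF-8 bytes, so ids 0..255 are already taken.
    seqs = [list(doc.encode("utf-8")) for doc in docs]
    merges = []
    for step in range(num_merges):
        counts = count_pairs(seqs)  # global recount: the O(N) cost paid M times
        if not counts:
            break
        # Hypothetical tie-break: highest count first, then the smallest pair.
        best = min(counts, key=lambda p: (-counts[p], p))
        new_id = 256 + step
        merges.append((best, new_id))
        seqs = [merge_pair(seq, best, new_id) for seq in seqs]  # full rewrite
    return merges


print(naive_train(["aaab", "aab"], 2))
# [((97, 97), 256), ((97, 98), 257)]
```

Both the recount and the rewrite walk the whole corpus on every merge, which is where the O(MN) total in the table comes from. The fast-bpe-rs column describes avoiding that cost by updating only the k occurrences a merge touches.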

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

from fast_bpe_rs import BPE

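# (?s).+ matches an entire document as one chunk, newlines included.
bpe = BPE(r"(?s).+")
# Train to a target vocabulary size of 258 on three short documents.
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

ids = bpe.encode("low lower newest")  # list[int]
text = bpe.decode_to_string(ids)      # str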

GPT-style split pattern

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is a placeholder for your own list of training strings.
bpe.train(32768, corpus_lines)
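
To see roughly which chunks a pattern like this produces, the sketch below runs a standard-library approximation of it. It is an assumption-laden illustration and not how fast-bpe-rs applies the pattern: Python's `re` has no `\p{L}` or `\p{N}`, so letters are written as `[^\W\d_]` and numbers as `\d`, which is close but not identical for every character. The names `APPROX_SPLIT` and `pre_split` are hypothetical.

```python
import re

# Approximation of the GPT-style pattern above using only stdlib `re`.
# Letters: [^\W\d_]   Numbers: \d   "Neither letter nor number": [^...\w] or "_"
APPROX_SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"        # common English contractions
    r"|(?:[^\r\n\w]|_)?[^\W\d_]+"          # optional leading symbol/space + letters
    r"|\d{1,3}"                            # digits in groups of at most three
    r"| ?(?:[^\s\w]|_)+[\r\n]*"            # punctuation run, optional leading space
    r"|\s*[\r\n]+"                         # whitespace ending in line breaks
    r"|\s+(?!\S)"                          # trailing whitespace
    r"|\s+"                                # any remaining whitespace
)


def pre_split(text):
    """Split text into the chunks the approximate pattern matches, in order."""
    return APPROX_SPLIT.findall(text)


print(pre_split("Hello world 12345!"))
# ['Hello', ' world', ' ', '123', '45', '!']
```

Note how a leading space stays attached to the following word and long digit runs are cut into groups of at most three, so a tokenizer that splits this way would never see a pair spanning two chunks.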

Special tokens

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")
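
One common way to honor special tokens during encoding is to cut them out of the text first and map them straight to their reserved ids, leaving only the ordinary spans for BPE. The sketch below shows that idea in pure Python; it is an assumption about the general technique, not a description of what fast-bpe-rs does internally, and `split_on_special` is a hypothetical helper.

```python
import re


def split_on_special(text, special_tokens):
    """Split text into (piece, special_id) tuples; special_id is None for plain text."""
    if not special_tokens:
        return [(text, None)] if text else []
    # Try longer names first so a token that contains another one wins.
    names = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(name) for name in names) + ")"
    pieces = []
    # The capturing group makes re.split keep the matched special tokens.
    for piece in re.split(pattern, text):
        if piece:
            pieces.append((piece, special_tokens.get(piece)))
    return pieces


specials = {"<pad>": 600, "<eos>": 601}
print(split_on_special("a<pad><eos>a", specials))
# [('a', None), ('<pad>', 600), ('<eos>', 601), ('a', None)]
```

Pieces tagged `None` would go through the regular split pattern and merges, while tagged pieces would be emitted as a single id each.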

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
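
The split between `decode` (bytes) and `decode_to_string` (str) matters because a byte-level token sequence does not always end on a character boundary. The standard-library sketch below shows the underlying issue with plain UTF-8 bytes; how `decode_to_string` itself treats invalid bytes is not stated here, so the lossy fallback is only an assumption about one possible behavior, and `decode_lossy` is a hypothetical helper.

```python
def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, replacing malformed sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


data = "café".encode("utf-8")  # b'caf\xc3\xa9': the final character takes two bytes
truncated = data[:4]           # cuts that character in half

try:
    truncated.decode("utf-8")  # strict decoding rejects the partial character
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)

print(decode_lossy(truncated))  # 'caf' followed by the U+FFFD replacement character
print(decode_lossy(data))       # café
```

When decoding partial output, such as a streamed prefix of token ids, working with the raw bytes avoids this failure mode; converting to a string is safe once the sequence is complete.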

License

Apache 2.0

Download files

Download the file for your platform.

Source Distribution

  • fast_bpe_rs-0.6.0.tar.gz (200.2 kB)

Built Distributions

  • fast_bpe_rs-0.6.0-cp310-abi3-win_amd64.whl (1.2 MB): CPython 3.10+, Windows x86-64
  • fast_bpe_rs-0.6.0-cp310-abi3-manylinux_2_34_x86_64.whl (1.3 MB): CPython 3.10+, manylinux (glibc 2.34+) x86-64
  • fast_bpe_rs-0.6.0-cp310-abi3-macosx_11_0_arm64.whl (1.2 MB): CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file fast_bpe_rs-0.6.0.tar.gz.

File metadata

  • Download URL: fast_bpe_rs-0.6.0.tar.gz
  • Upload date:
  • Size: 200.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for fast_bpe_rs-0.6.0.tar.gz
Algorithm Hash digest
SHA256 a9524b4a79c7158fa89c1ff9c800069369db336c1077d235cb4e1520330f1c46
MD5 f432e8c963d052e074396cd28e0494cc
BLAKE2b-256 40386d858f8224c0725ec0c8628a3a49539d41ae882b16f1f8aa274d7b083dab

Provenance

The following attestation bundles were made for fast_bpe_rs-0.6.0.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
