Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs: A Fast Rust BPE Library

fast-bpe-rs is a Rust Byte Pair Encoding (BPE) tokenizer with Python bindings (built with PyO3) for training, encoding, and decoding BPE models.

Naive vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
| --- | --- | --- |
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN) because each merge triggers another global count pass | O(N) setup, then per merge roughly O(k) local updates instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
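
Two of the ideas in the table, weighted deduplicated chunks and count buckets for best-pair lookup, can be sketched in a few lines of pure Python. This is only an illustration of the data structures as the table describes them, not the library's Rust implementation: the names `weighted_chunks`, `PairStats`, and `build_stats` are hypothetical, the tie-break in `best()` is an arbitrary choice, and the neighbor-pair updates a real merge needs are omitted.

```python
from collections import Counter, defaultdict


def weighted_chunks(chunks):
    """Store each distinct chunk once (as a tuple of bytes) with its frequency."""
    return Counter(tuple(c.encode("utf-8")) for c in chunks)


class PairStats:
    """Keeps pair -> count and count -> set of pairs in sync."""

    def __init__(self):
        self.count = {}                  # pair -> weighted count
        self.buckets = defaultdict(set)  # count -> pairs with that count
        self.max_count = 0               # upper bound on the best count

    def add(self, pair, delta):
        """Adjust one pair's count and move it to the matching bucket."""
        old = self.count.get(pair, 0)
        new = old + delta
        if old:
            self.buckets[old].discard(pair)
        if new > 0:
            self.count[pair] = new
            self.buckets[new].add(pair)
            self.max_count = max(self.max_count, new)
        else:
            self.count.pop(pair, None)

    def best(self):
        """Return (pair, count) from the highest non-empty bucket, or None."""
        while self.max_count > 0 and not self.buckets[self.max_count]:
            self.max_count -= 1
        if self.max_count == 0:
            return None
        # Tie-break by smallest pair: an arbitrary choice for this sketch.
        return min(self.buckets[self.max_count]), self.max_count


def build_stats(chunks):
    """One pass over the distinct chunks, weighting each pair by chunk frequency."""
    stats = PairStats()
    for seq, weight in weighted_chunks(chunks).items():
        for pair in zip(seq, seq[1:]):
            stats.add(pair, weight)
    return stats


stats = build_stats(["low", "low", "low", "lower", "newest", "newest"])
print(stats.best())  # ((108, 111), 4): the byte pair "l", "o" with weight 4
```

Because "low" is stored once with weight 3, its pairs are visited once rather than three times. After a merge, only the pairs next to the merged occurrences would need `add` calls, which is the O(k) local update the table refers to.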

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

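from fast_bpe_rs import BPE

# (?s).+ matches an entire document, newlines included, so each document is one chunk.
bpe = BPE(r"(?s).+")
# Train to a vocabulary size of 258 on three short documents.
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

ids = bpe.encode("low lower newest")  # list[int]
text = bpe.decode_to_string(ids)      # str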
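
To make the train / encode / decode round trip concrete, here is a minimal pure-Python sketch of naive byte-level BPE on the same toy corpus. It is not how fast-bpe-rs is implemented, and the function names (`naive_train`, `naive_encode`, `naive_decode`) are hypothetical. It assumes the base vocabulary is the 256 byte values, so a vocabulary size of 258 leaves room for two merges; the library's actual token IDs and tie-breaking may differ.

```python
from collections import Counter


def merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def naive_train(vocab_size, docs):
    """Learn merges by recounting every pair in the corpus before each merge."""
    seqs = [list(d.encode("utf-8")) for d in docs]
    merges = {}  # (left_id, right_id) -> new_id, in learned order
    for new_id in range(256, vocab_size):
        counts = Counter()
        for seq in seqs:  # full rescan: the O(N) cost paid once per merge
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges[best] = new_id
        seqs = [merge(seq, best, new_id) for seq in seqs]
    return merges


def naive_encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        seq = merge(seq, pair, new_id)
    return seq


def naive_decode(ids, merges):
    """Expand token IDs back to bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids)


merges = naive_train(258, ["low low low low", "lower lower", "newest newest newest"])
ids = naive_encode("low lower newest", merges)
print(naive_decode(ids, merges).decode("utf-8"))  # low lower newest
```

On this corpus the two learned merges build the token for "low", so the 16 input bytes encode to 12 IDs.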

GPT-style split pattern

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is assumed to be a list of training strings defined elsewhere.
bpe.train(32768, corpus_lines)
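
To see what a split pattern like this does to a piece of text, the sketch below chunks a string with Python's standard `re` module. The pattern is an approximation, not the one above: `re` has no `\p{L}` or `\p{N}`, so `[^\W\d_]` stands in for letters and `\d` for numbers (which covers decimal digits only). `APPROX_SPLIT` and `split_chunks` are hypothetical names, and whether fast-bpe-rs chunks text exactly this way is an assumption.

```python
import re

# Stdlib approximation of the GPT-style split pattern:
#   [^\W\d_]  ~ \p{L}  (letters)
#   \d        ~ \p{N}  (decimal digits only)
#   [\W_]     ~ "neither letter nor number"
APPROX_SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"       # common English contractions
    r"|(?:(?![\r\n])[\W_])?[^\W\d_]+"     # optional leading symbol/space + letters
    r"|\d{1,3}"                           # digits in groups of at most three
    r"| ?(?:(?!\s)[\W_])+[\r\n]*"         # punctuation run, optional leading space
    r"|\s*[\r\n]+"                        # newlines with preceding whitespace
    r"|\s+(?!\S)"                         # whitespace not followed by a non-space
    r"|\s+"                               # any remaining whitespace
)


def split_chunks(text):
    """Return the chunks the approximate pattern matches, in order."""
    return APPROX_SPLIT.findall(text)


print(split_chunks("Hello world 12345!"))
# ['Hello', ' world', ' ', '123', '45', '!']
```

Each chunk keeps its leading space, and long digit runs are capped at three characters.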

Special tokens

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?s).+",
    # Special tokens and their fixed IDs.
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
# Special tokens may appear in the text passed to encode.
ids = bpe.encode("a<pad><eos>a")
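
One common way for a tokenizer to honor special tokens is to cut them out of the input before ordinary BPE runs on the remaining text. The sketch below shows that idea in pure Python. Whether fast-bpe-rs does this internally is an assumption, and `split_on_special` is a hypothetical helper, not part of the library's API.

```python
import re


def split_on_special(text, special_tokens=None):
    """Split `text` into (piece, special_id) pairs.

    `special_id` is the mapped ID for a special token and None for ordinary
    text that would still need regular BPE encoding.
    """
    if not special_tokens:
        return [(text, None)] if text else []
    # Try longer tokens first so a token that contains another one wins.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    pieces = re.split(pattern, text)  # the capture group keeps the separators
    return [(p, special_tokens.get(p)) for p in pieces if p]


specials = {"<pad>": 600, "<eos>": 601}
print(split_on_special("a<pad><eos>a", specials))
# [('a', None), ('<pad>', 600), ('<eos>', 601), ('a', None)]
```

The ordinary pieces (here the two "a" strings) would then go through normal encoding, while each special token maps straight to its ID.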

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
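
The API offers both a bytes-returning `decode` and a `decode_to_string`. A likely reason, stated here as an assumption, is that byte-level tokens need not align with UTF-8 character boundaries, so a partial list of token IDs can decode to bytes that are not valid text. The sketch below shows the effect using only standard Python; how `decode_to_string` treats invalid UTF-8 is not documented here, and `to_text` is a hypothetical helper.

```python
def to_text(raw, strict=True):
    """Decode bytes to str, either strictly or replacing invalid sequences."""
    return raw.decode("utf-8", errors="strict" if strict else "replace")


data = "héllo".encode("utf-8")   # 'é' takes two bytes: b'\xc3\xa9'
head, tail = data[:2], data[2:]  # the cut lands in the middle of 'é'

try:
    to_text(head)
except UnicodeDecodeError:
    print("head is not valid UTF-8 on its own")

print(to_text(head, strict=False))  # 'h' followed by U+FFFD
print(to_text(head + tail))         # héllo
```

When decoding partial outputs, for example while streaming, one option is to concatenate the bytes first and convert to a string only once the sequence is complete.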

License

Apache 2.0

Download files

Source distribution

  • fast_bpe_rs-0.5.2.tar.gz (94.4 kB): source

Built distributions

  • fast_bpe_rs-0.5.2-cp310-abi3-win_amd64.whl (981.7 kB): CPython 3.10+, Windows x86-64
  • fast_bpe_rs-0.5.2-cp310-abi3-manylinux_2_34_x86_64.whl (1.1 MB): CPython 3.10+, manylinux (glibc 2.34+), x86-64
  • fast_bpe_rs-0.5.2-cp310-abi3-macosx_11_0_arm64.whl (945.3 kB): CPython 3.10+, macOS 11.0+, ARM64
