Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs: A Fast Rust BPE Library

A blazing-fast Rust Byte Pair Encoding (BPE) tokenizer with Python bindings for training, encoding, and decoding BPE models.

Naive vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
| --- | --- | --- |
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN) because each merge triggers another global count pass | O(N) setup, then per merge roughly O(k) local updates instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
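
The contrast in the table can be sketched in pure Python. This is an illustration only, not the library's Rust implementation: the names `pairs_of`, `merge_pair`, `train_naive`, and `train_incremental` are hypothetical, the tie-breaking rule (highest count, then smallest pair) is an arbitrary assumption, and the incremental version works at chunk granularity with a linear scan for the best pair, which is coarser than the position-level linked structure and count buckets described above.

```python
from collections import Counter, defaultdict


def pairs_of(tokens):
    """Adjacent token pairs of one sequence."""
    return list(zip(tokens, tokens[1:]))


def merge_pair(tokens, pair, new_id):
    """Replace each left-to-right, non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def pick_best(counts):
    # Assumed tie-break: highest count first, then the smallest pair.
    return min(counts, key=lambda p: (-counts[p], p))


def train_naive(chunk_strings, num_merges):
    """Naive column: plain token lists and a full recount before every merge."""
    seqs = [tuple(s.encode("utf-8")) for s in chunk_strings]
    merges = []
    for step in range(num_merges):
        counts = Counter(p for seq in seqs for p in pairs_of(seq))  # O(N) pass
        if not counts:
            break
        best, new_id = pick_best(counts), 256 + step
        merges.append((best, new_id))
        seqs = [merge_pair(seq, best, new_id) for seq in seqs]
    return merges


def train_incremental(chunk_strings, num_merges):
    """Deduplicated weighted chunks; only chunks containing the pair are touched."""
    freq = Counter(tuple(s.encode("utf-8")) for s in chunk_strings)
    seqs = list(freq)                        # chunk id -> token tuple
    weights = [freq[seq] for seq in seqs]    # chunk id -> frequency weight
    counts = Counter()                       # pair -> weighted count
    where = defaultdict(set)                 # pair -> ids of chunks containing it
    for cid, tokens in enumerate(seqs):
        for p in pairs_of(tokens):
            counts[p] += weights[cid]
            where[p].add(cid)
    merges = []
    for step in range(num_merges):
        if not counts:
            break
        best, new_id = pick_best(counts), 256 + step
        merges.append((best, new_id))
        for cid in list(where[best]):
            old = seqs[cid]
            new = merge_pair(old, best, new_id)
            for p in pairs_of(old):          # retract the old chunk's pairs
                counts[p] -= weights[cid]
                if counts[p] == 0:
                    del counts[p]
                where[p].discard(cid)
            for p in pairs_of(new):          # add the rewritten chunk's pairs
                counts[p] += weights[cid]
                where[p].add(cid)
            seqs[cid] = new
    return merges


words = "low low low low lower lower newest newest newest".split()
print(train_naive(words, 2))        # [((108, 111), 256), ((256, 119), 257)]
print(train_incremental(words, 2))  # same merges, without full recounts
```

Both functions learn the same merges on this input; the difference is how much state each merge has to revisit.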

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)

GPT-style split pattern

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is a placeholder for your own training documents (a list of str).
bpe.train(32768, corpus_lines)
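
To see what a split pattern of this shape does to text before any merges are learned, here is a rough stand-alone sketch using Python's built-in `re` module. It is an approximation, not the pattern engine the library uses: `re` has no `\p{L}` or `\p{N}`, so `[^\W\d_]` stands in for letters and `\d` for numbers, and the names `SPLIT` and `split_chunks` are hypothetical.

```python
import re

LETTER = r"[^\W\d_]"        # stand-in for \p{L}
OTHER = r"(?:[^\s\w]|_)"    # stand-in for [^\s\p{L}\p{N}]

SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"            # common English contractions
    + r"|(?:[^\r\n\w]|_)?" + LETTER + r"+"     # optional lead character + letters
    + r"|\d{1,3}"                              # digits in groups of up to three
    + r"| ?" + OTHER + r"+[\r\n]*"             # punctuation runs
    + r"|\s*[\r\n]+"                           # line breaks
    + r"|\s+(?!\S)"                            # trailing whitespace
    + r"|\s+"                                  # any remaining whitespace
)


def split_chunks(text):
    """Return the chunks that BPE merges would be learned within."""
    return SPLIT.findall(text)


chunks = split_chunks("Hello world, it's 12345!")
print(chunks)
# ['Hello', ' world', ',', ' it', "'s", ' ', '123', '45', '!']
```

Merges never cross chunk boundaries in this scheme, which is why the choice of split pattern shapes the learned vocabulary.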

Special tokens

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")
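
One common way to handle special tokens is to match them verbatim in the input and emit their reserved IDs directly, so they never pass through the learned merges. The sketch below shows that idea in plain Python; whether fast-bpe-rs does exactly this internally is not stated here, and `split_on_special` is a hypothetical helper, not part of the package.

```python
import re


def split_on_special(text, special_tokens):
    """Split text into plain-text pieces (str) and special-token IDs (int)."""
    if not special_tokens:
        return [text] if text else []
    # Longest first, so a longer token wins over one that is its prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    pieces = []
    for part in re.split(pattern, text):
        if part in special_tokens:
            pieces.append(special_tokens[part])   # reserved ID, no BPE applied
        elif part:
            pieces.append(part)                   # plain text, still to be encoded
    return pieces


specials = {"<pad>": 600, "<eos>": 601}
print(split_on_special("a<pad><eos>a", specials))
# ['a', 600, 601, 'a']
```

In such a scheme the remaining string pieces would then go through the regular split pattern and merge steps.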

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
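
The API offers both a bytes-returning `decode` and a str-returning `decode_to_string`. A likely reason, assuming token IDs map to byte sequences as the `bytes` return type suggests, is that an arbitrary slice of token IDs need not end on a UTF-8 character boundary. The standard-library sketch below shows the underlying issue; `try_decode` is a hypothetical helper, and how `decode_to_string` treats invalid UTF-8 is not specified here.

```python
def try_decode(raw):
    """Return the text for raw UTF-8 bytes, or None if they are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


data = "café".encode("utf-8")    # 'é' takes two bytes: b'caf\xc3\xa9'
truncated = data[:-1]            # cuts the final character in half

print(try_decode(data))          # café
print(try_decode(truncated))     # None
print(truncated.decode("utf-8", errors="replace"))  # 'caf' plus U+FFFD
```

Working with bytes is the safer choice when decoding partial token sequences, for example while streaming output.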

License

Apache 2.0
