Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs: A Fast Rust BPE Library

A blazing-fast Rust Byte Pair Encoding (BPE) tokenizer with Python bindings for training, encoding, and decoding BPE models.

Naive vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
| --- | --- | --- |
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN) because each merge triggers another global count pass | O(N) setup, then per merge roughly O(k) local updates instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
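The contrast in the comparison above can be sketched in pure Python. This is only an illustration under stated assumptions, not the library's Rust implementation: it tracks which sequences contain each pair rather than individual occurrences in a linked structure, it picks the best pair with a linear scan instead of count buckets, and the tie-break rule, the function names, and the choice to start new token IDs at 256 are all hypothetical.

```python
from collections import Counter, defaultdict

def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def pick_best(pair_count):
    # Hypothetical tie-break: highest count first, then the smallest pair.
    return min(pair_count, key=lambda p: (-pair_count[p], p))

def train_naive(chunks, num_merges):
    """Baseline: recount every pair in every chunk before each merge."""
    seqs = [list(c) for c in chunks]
    merges = []
    for step in range(num_merges):
        pair_count = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pair_count:
            break
        best = pick_best(pair_count)
        seqs = [merge_pair(s, best, 256 + step) for s in seqs]
        merges.append(best)
    return merges

def train_incremental(chunks, num_merges):
    """Keep pair counts up to date and touch only sequences containing the merged pair."""
    weights = Counter(chunks)            # repeated chunks stored once, with a weight
    seqs = [list(c) for c in weights]
    freq = list(weights.values())
    pair_count = Counter()               # pair -> weighted count
    where = defaultdict(set)             # pair -> indexes of sequences containing it
    for i, s in enumerate(seqs):
        for p, n in Counter(zip(s, s[1:])).items():
            pair_count[p] += n * freq[i]
            where[p].add(i)
    merges = []
    for step in range(num_merges):
        if not pair_count:
            break
        best = pick_best(pair_count)
        for i in sorted(where[best]):    # sorted() copies, so `where` can be mutated
            old = seqs[i]
            # Retract this sequence's old contributions.
            for p, n in Counter(zip(old, old[1:])).items():
                pair_count[p] -= n * freq[i]
                where[p].discard(i)
                if pair_count[p] == 0:
                    del pair_count[p]
            # Rewrite the sequence and add its new contributions.
            seqs[i] = merge_pair(old, best, 256 + step)
            for p, n in Counter(zip(seqs[i], seqs[i][1:])).items():
                pair_count[p] += n * freq[i]
                where[p].add(i)
        merges.append(best)
    return merges

chunks = [b"low"] * 3 + [b"lower"] + [b"newest"] * 2
merges = train_incremental(chunks, 2)
```

Both functions learn the same merges on the same input; the difference is that the incremental version never rescans sequences that do not contain the pair being merged.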

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

from fast_bpe_rs import BPE

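# (?s).+ matches the whole input, newlines included.
bpe = BPE(r"(?s).+")
# Train to a target vocabulary size of 258 on three short documents.
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

# Round trip: text -> token IDs -> text.
ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)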

GPT-style split pattern

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is a placeholder for your own list of training strings.
bpe.train(32768, corpus_lines)
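To get a feel for what a split pattern like this does to text before training, the sketch below approximates it with Python's standard re module. This is an assumption-laden stand-in, not the regex engine fast-bpe-rs uses: re has no \p{L} or \p{N}, so letters are approximated by [^\W\d_] and numbers by \d, and the helper name split_chunks is hypothetical.

```python
import re

# Approximation of the GPT-style pattern using only stdlib `re` syntax.
SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"   # common English contractions
    r"|(?:[^\r\n\w]|_)?[^\W\d_]+"     # optional leading symbol/space, then letters
    r"|\d{1,3}"                       # digits in groups of at most three
    r"| ?(?:[^\s\w]|_)+[\r\n]*"       # punctuation runs, optional leading space
    r"|\s*[\r\n]+"                    # newlines with preceding whitespace
    r"|\s+(?!\S)"                     # whitespace not directly before a non-space character
    r"|\s+"                           # any remaining whitespace
)

def split_chunks(text):
    """Return the chunks the approximate pattern produces, in order."""
    return SPLIT.findall(text)

chunks = split_chunks("I'll pay 12345 dollars!")
```

Each chunk is then tokenized on its own, so merges never cross chunk boundaries such as the one between a word and the following digits.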

Special tokens

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
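One reason a byte-level tokenizer may offer both a bytes-returning and a str-returning decode is that a single token can hold only part of a multi-byte UTF-8 character. The sketch below illustrates this with a hypothetical three-entry vocabulary; it does not call fast-bpe-rs, and how decode_to_string treats invalid UTF-8 is not stated here.

```python
# Hypothetical vocabulary (token ID -> bytes), invented for illustration.
vocab = {0: b"caf", 1: b"\xc3", 2: b"\xa9"}

def decode(token_ids):
    """Concatenate the raw bytes of each token."""
    return b"".join(vocab[t] for t in token_ids)

# A complete sequence decodes cleanly: the bytes C3 A9 together encode "é".
full = decode([0, 1, 2]).decode("utf-8")

# A truncated sequence ends in the middle of a character.
partial = decode([0, 1])
try:
    partial.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lossy decode substitutes U+FFFD for the incomplete bytes.
lossy = partial.decode("utf-8", errors="replace")
```

Working with bytes keeps the output exact when token sequences are sliced or streamed; converting to str is safest once the full sequence is available.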

License

Apache 2.0

Download files

Source Distribution

  • fast_bpe_rs-0.5.1.tar.gz (93.9 kB)

Built Distributions

  • fast_bpe_rs-0.5.1-cp310-abi3-win_amd64.whl (947.5 kB): CPython 3.10+, Windows x86-64
  • fast_bpe_rs-0.5.1-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB): CPython 3.10+, manylinux (glibc 2.34+), x86-64
  • fast_bpe_rs-0.5.1-cp310-abi3-macosx_11_0_arm64.whl (899.2 kB): CPython 3.10+, macOS 11.0+, ARM64

File hashes

fast_bpe_rs-0.5.1.tar.gz

  • SHA256: 598b716ea81f06626daf9e25f69cf94ccd96dee28ba0907f065d90350f2b569e
  • MD5: c572aa9e4d7dd3f2899b9918a55b34ed
  • BLAKE2b-256: 25935d8334c0d816b61ee92695f4e063f5fc6efcb0f4cd5519e12ffe3a714566

fast_bpe_rs-0.5.1-cp310-abi3-win_amd64.whl

  • SHA256: 994ba879aeaf5da1621cda9aff9615ebc8bed5c0afb5ab67c1a86268361ee597
  • MD5: 178b6756f6328cc2f090b032f652230f
  • BLAKE2b-256: 143f15d59926b0f929db9449c324ebad42341e2c88609bbc9ce95471b1c5a3d2

fast_bpe_rs-0.5.1-cp310-abi3-manylinux_2_34_x86_64.whl

  • SHA256: 827a0bfc48417e8936a758ee7de7aec5d30efa5e59f67688ee7ad058fe663ce6
  • MD5: 2672a8bbae7fb62002dda3178b16ea7c
  • BLAKE2b-256: 4e707c966bf6702f8d1e84d7f031090036e64487e387e16ad373f3111b3ce32b

fast_bpe_rs-0.5.1-cp310-abi3-macosx_11_0_arm64.whl

  • SHA256: 33e4d5a3df597dadeaa291507e8fc70f499dd0547696fad1d5b73882aa86543b
  • MD5: 949db4f8c4446a22ffe5fbc6898f16d0
  • BLAKE2b-256: 9f879403e3b7d9729359ffd6f025e66eb0cf7a5ddb48431ba2745e71097f7bdd

Provenance

Attestation bundles were made for all four files.

Publisher: release.yml on zhixiangli/fast-bpe-rs

The file metadata shown for the source distribution and the Windows wheel lists Trusted Publishing, with uploads via twine/6.1.0 CPython/3.13.7.

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.