Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

Project description

fast-bpe-rs: A Fast Rust BPE Library

A blazing-fast Rust Byte Pair Encoding (BPE) tokenizer with Python bindings for training, encoding, and decoding BPE models.

Naive BPE vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
| --- | --- | --- |
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN), because each merge triggers another global count pass | O(N) setup, then roughly O(k) local updates per merge instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
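
To make the "Naive BPE trainer" column concrete, here is a minimal Python sketch of a trainer that recounts every pair on each merge. It is an illustrative toy, not code from fast-bpe-rs: the function names, the byte-level base vocabulary (ids 0 to 255), and the tie-breaking rule are assumptions made for this example. It borrows one idea from the right-hand column (storing repeated chunks once with a frequency weight) so that the per-merge recount is the only remaining cost.

```python
from collections import Counter


def count_pairs(chunks):
    """Count adjacent token pairs, weighting each unique chunk by its frequency."""
    pairs = Counter()
    for tokens, freq in chunks.items():
        for pair in zip(tokens, tokens[1:]):
            pairs[pair] += freq
    return pairs


def merge_chunk(tokens, pair, new_id):
    """Replace non-overlapping occurrences of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_naive(words, num_merges):
    """Learn up to `num_merges` merges; returns a list of ((left, right), new_id)."""
    # Deduplicate: each unique chunk is stored once with a frequency weight.
    chunks = Counter(tuple(w.encode("utf-8")) for w in words)
    merges = []
    for step in range(num_merges):
        pairs = count_pairs(chunks)  # full recount: the O(N) pass on every merge
        if not pairs:
            break
        # Assumed tie-break: highest count first, then the smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        merges.append((best, new_id))
        rebuilt = Counter()
        for tokens, freq in chunks.items():
            rebuilt[merge_chunk(tokens, best, new_id)] += freq
        chunks = rebuilt
    return merges


words = ["low"] * 4 + ["lower"] * 2 + ["newest"] * 3
print(train_naive(words, 2))
```

Every iteration of the loop calls count_pairs over the whole corpus, which is where the O(MN) bound in the table comes from. An incremental trainer would instead adjust only the counts of pairs adjacent to each merged occurrence.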

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

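from fast_bpe_rs import BPE

# The split pattern "(?s).+" matches each whole document as a single chunk.
bpe = BPE(r"(?s).+")
# Train to a vocabulary size of 258 on three short documents.
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

ids = bpe.encode("low lower newest")  # list[int]
text = bpe.decode_to_string(ids)      # str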

GPT-style split pattern

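from fast_bpe_rs import BPE

# The pattern is split across two raw strings; Python concatenates them.
bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is not defined here; supply your own list of strings.
bpe.train(32768, corpus_lines)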
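
To see what a split pattern of this shape does to text before any merges are learned, the sketch below chunks a string with Python's standard re module. This is an approximation written for illustration only: re has no \p{L} or \p{N}, so letters are written as [^\W\d_] and numbers as \d, and the helper name split_chunks is hypothetical. It does not use the library's regex engine and may differ from it on unusual Unicode input.

```python
import re

# Stdlib approximation of the GPT-style pattern (assumption: \p{L} is replaced
# by [^\W\d_] and \p{N} by \d, which is close but not identical).
PARTS = [
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)",   # common English contractions
    r"(?:[^\r\n\w]|_)?[^\W\d_]+",      # optional leading symbol/space + letters
    r"\d{1,3}",                        # digits in groups of at most three
    r" ?(?:[^\s\w]|_)+[\r\n]*",        # punctuation run, optional leading space
    r"\s*[\r\n]+",                     # newlines with preceding whitespace
    r"\s+(?!\S)",                      # trailing whitespace
    r"\s+",                            # any other whitespace
]
APPROX_PATTERN = re.compile("|".join(PARTS))


def split_chunks(text):
    """Return the chunks the approximate pattern carves out of `text`."""
    return APPROX_PATTERN.findall(text)


print(split_chunks("It's 12345 tokens!"))
# → ['It', "'s", ' ', '123', '45', ' tokens', '!']
```

BPE merges are then learned inside each chunk, so a merge cannot span a chunk boundary such as the one between "It" and "'s".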

Special tokens

from fast_bpe_rs import BPE

# The optional second argument maps special-token strings to their ids.
bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")  # list[int]
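
One common way to handle special tokens is to cut them out of the text first and encode only the remaining pieces with BPE. The sketch below shows that idea in plain Python. It is a conceptual illustration under stated assumptions, not a description of how fast-bpe-rs does it: the helper names are hypothetical, and ordinary text is emitted as raw UTF-8 byte values as a stand-in for real BPE encoding.

```python
import re


def split_on_specials(text, special_tokens):
    """Split `text` into (piece, is_special) tuples, keeping the original order."""
    if not special_tokens:
        return [(text, False)] if text else []
    # Longest token first, so a longer token wins over one that is its prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    alternation = "|".join(re.escape(tok) for tok in ordered)
    # A capturing group makes re.split keep the separators in the result.
    pieces = re.split(f"({alternation})", text)
    return [(p, p in special_tokens) for p in pieces if p]


def encode_with_specials(text, special_tokens):
    """Map special tokens to their ids; encode everything else byte by byte."""
    ids = []
    for piece, is_special in split_on_specials(text, special_tokens):
        if is_special:
            ids.append(special_tokens[piece])
        else:
            # Stand-in for BPE: raw UTF-8 byte values, with no merges applied.
            ids.extend(piece.encode("utf-8"))
    return ids


specials = {"<pad>": 600, "<eos>": 601}
print(encode_with_specials("a<pad><eos>a", specials))
# → [97, 600, 601, 97]
```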

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
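
The API offers both a bytes-returning decode and a str-returning decode_to_string. One reason a byte-level tokenizer typically exposes both is that an arbitrary run of token ids need not end on a UTF-8 character boundary. The sketch below illustrates this with a hypothetical toy vocabulary in plain Python. It does not call the library, and how decode_to_string treats invalid UTF-8 is not stated here, so the errors="replace" behavior shown is only an assumption about one possible policy.

```python
def tokens_to_bytes(token_ids, vocab):
    """Concatenate the byte string of every token id."""
    return b"".join(vocab[t] for t in token_ids)


# Hypothetical vocabulary: 256 single-byte tokens plus one merged token.
vocab = {i: bytes([i]) for i in range(256)}
vocab[256] = b"caf"

ids = [256, 0xC3, 0xA9]  # "café": the character "é" is the two bytes C3 A9
raw = tokens_to_bytes(ids, vocab)
print(raw.decode("utf-8"))  # the full sequence is valid UTF-8

# Dropping the last id cuts "é" in half, so the bytes are no longer valid UTF-8.
partial = tokens_to_bytes(ids[:-1], vocab)
try:
    partial.decode("utf-8")
except UnicodeDecodeError:
    print("strict decoding fails on", partial)

# A lossy decode substitutes U+FFFD for the incomplete character.
print(partial.decode("utf-8", errors="replace"))
```

When token sequences are sliced or streamed, working with the bytes result and decoding only at the end avoids this problem.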

License

Apache 2.0

Download files

Source Distribution

  • fast_bpe_rs-0.4.0.tar.gz (93.1 kB): Source

Built Distributions

  • fast_bpe_rs-0.4.0-cp310-abi3-win_amd64.whl (930.4 kB): CPython 3.10+, Windows x86-64
  • fast_bpe_rs-0.4.0-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB): CPython 3.10+, manylinux (glibc 2.34+), x86-64
  • fast_bpe_rs-0.4.0-cp310-abi3-macosx_11_0_arm64.whl (885.2 kB): CPython 3.10+, macOS 11.0+, ARM64

File details

Details for the file fast_bpe_rs-0.4.0.tar.gz.

File metadata

  • Download URL: fast_bpe_rs-0.4.0.tar.gz
  • Upload date:
  • Size: 93.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.4.0.tar.gz
Algorithm Hash digest
SHA256 e71c44b9dc106d1de2d4103132a2ab14dd4ed633a7356933ed3a9cc966e3c5de
MD5 d6772e1f190caa3a6b001aebce31e8fb
BLAKE2b-256 aaf8bf88901d749d29ac1173425ecc4b66a1dba9d2fa9bdd6fae213688cb8af3

See more details on using hashes here.

Provenance

The following attestation bundles were made for fast_bpe_rs-0.4.0.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fast_bpe_rs-0.4.0-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: fast_bpe_rs-0.4.0-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 930.4 kB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fast_bpe_rs-0.4.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 640a3ae2f3a0df274e23a45a6442c9c19e87cfcfbfdd98e6c19eb005353443f6
MD5 aac1a9921542f8c9d83dd30fee34c3e3
BLAKE2b-256 0d071c624ec7edfb0d255e8d0c4b8fadc7ae3efce608321b253340a3f414379a

Provenance

The following attestation bundles were made for fast_bpe_rs-0.4.0-cp310-abi3-win_amd64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.4.0-cp310-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for fast_bpe_rs-0.4.0-cp310-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 4e856db928f2af5a6c8631de90b19b6d9adc23c976de8fd5df2936135e9f69ef
MD5 b2b9938cd6994bfb4d2d322fdd70ff69
BLAKE2b-256 a346efdb3403c3b78da09f735f4015507befd91e9f6917b4db02608022576998

Provenance

The following attestation bundles were made for fast_bpe_rs-0.4.0-cp310-abi3-manylinux_2_34_x86_64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs

File details

Details for the file fast_bpe_rs-0.4.0-cp310-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for fast_bpe_rs-0.4.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 7ac18d94ac70006bde603b13f0a481edf15c4b2d875d7cf73d5e427dfb931ada
MD5 bb53b7e9fd136c6ac52c5275d0ba1b95
BLAKE2b-256 5c1caff86bb5e5c092e2be8c1bcf8d93806ede97d0bb3981a03d613522633651

Provenance

The following attestation bundles were made for fast_bpe_rs-0.4.0-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on zhixiangli/fast-bpe-rs
