Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.

fast-bpe-rs: A Fast Rust BPE Library

fast-bpe-rs is a fast Byte Pair Encoding (BPE) tokenizer written in Rust, with Python bindings for training BPE models and for encoding and decoding text with them.

Naive vs. fast-bpe-rs

Let N be the total number of token positions in the corpus after splitting, M the number of merges to learn, and k the number of pair occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
| --- | --- | --- |
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN), because each merge triggers another global count pass | O(N) setup, then roughly O(k) local updates per merge instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
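
The incremental bookkeeping described in the fast-bpe-rs column can be sketched in pure Python. This is only an illustration, not the library's Rust implementation: the class name `PairBuckets` and its methods are hypothetical, and the sketch models just the `pair -> count` and `count -> set of pairs` maps, leaving out location tracking and the linked sequence structure.

```python
from collections import defaultdict


class PairBuckets:
    """Hypothetical sketch: pair counts plus an index from count to pairs."""

    def __init__(self):
        self.count = {}                  # pair -> current count
        self.buckets = defaultdict(set)  # count -> pairs that have that count
        self.max_count = 0               # upper bound on the highest non-empty bucket

    def add(self, pair, delta):
        """Adjust one pair's count by delta and move it between buckets."""
        old = self.count.get(pair, 0)
        new = old + delta
        if old > 0:
            self.buckets[old].discard(pair)
            if not self.buckets[old]:
                del self.buckets[old]
        if new > 0:
            self.count[pair] = new
            self.buckets[new].add(pair)
            self.max_count = max(self.max_count, new)
        else:
            self.count.pop(pair, None)

    def best(self):
        """Return a pair from the highest non-empty bucket, or None."""
        while self.max_count > 0 and self.max_count not in self.buckets:
            self.max_count -= 1
        if self.max_count == 0:
            return None
        # Pick the smallest pair so the choice is deterministic in this sketch.
        return min(self.buckets[self.max_count])


stats = PairBuckets()
for pair, weight in [(("l", "o"), 6), (("o", "w"), 6), (("w", "e"), 5)]:
    stats.add(pair, weight)

first = stats.best()

# Simulate merging (l, o) into "lo": only the touched pairs are updated.
stats.add(("l", "o"), -6)
stats.add(("o", "w"), -6)
stats.add(("lo", "w"), 6)
second = stats.best()
```

Only the pairs adjacent to a merged occurrence are adjusted, which is the per-merge O(k) behavior the table refers to; nothing in the sketch rescans the corpus.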

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

from fast_bpe_rs import BPE

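# (?s).+ matches the whole input, newlines included, so each document stays one chunk.
bpe = BPE(r"(?s).+")
# Train to a target vocabulary size of 258 on three small documents.
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

# Encode text to token ids, then decode the ids back to a string.
ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)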
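
For comparison with the "Naive BPE trainer" column, the same toy corpus can be run through a naive reference trainer in pure Python. This sketch assumes a byte-level base vocabulary of 256 entries (so a vocabulary size of 258 leaves room for two merges) and breaks ties by first occurrence. Neither assumption is confirmed by the source, and the function names are illustrative, so the merges fast-bpe-rs actually learns may differ.

```python
from collections import Counter


def count_pairs(seqs):
    """Count adjacent token pairs across every sequence (a full O(N) pass)."""
    counts = Counter()
    for seq in seqs:
        for pair in zip(seq, seq[1:]):
            counts[pair] += 1
    return counts


def merge_pair(seq, pair, new_id):
    """Rewrite one sequence, replacing each occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_naive(docs, num_merges):
    """Naive trainer: recount all pairs before every merge, O(M * N) overall."""
    # Assumption: each document is one chunk of raw UTF-8 byte values.
    seqs = [list(doc.encode("utf-8")) for doc in docs]
    merges = []
    for step in range(num_merges):
        counts = count_pairs(seqs)
        if not counts:
            break
        # max() keeps the first pair seen among ties (an arbitrary choice here).
        best = max(counts, key=counts.get)
        new_id = 256 + step
        seqs = [merge_pair(seq, best, new_id) for seq in seqs]
        merges.append((best, new_id))
    return merges


merges = train_naive(["low low low low", "lower lower", "newest newest newest"], 2)
```

Under these assumptions the two merges are (l, o) and then (lo, w), each found only after a complete recount of the corpus.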

GPT-style split pattern

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is assumed to be a list of training documents (str) defined elsewhere.
bpe.train(32768, corpus_lines)
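
To show what a split pattern does to the input, here is a small standard-library sketch. Python's `re` module does not support `\p{L}` or `\p{N}`, so `SPLIT` below is a simplified approximation of the pattern above, not an equivalent. The helper name `split_chunks` is hypothetical, and whether fast-bpe-rs produces exactly these chunks is an assumption.

```python
import re
from collections import Counter

# Simplified approximation of a GPT-style split pattern:
# contractions | optional space + letters | 1-3 digits |
# optional space + punctuation | trailing whitespace | other whitespace
SPLIT = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)| ?[^\W\d_]+|\d{1,3}| ?[^\s\w]+|\s+(?!\S)|\s+"
)


def split_chunks(text):
    """Return the consecutive chunks the pattern matches in text."""
    return SPLIT.findall(text)


text = "Hello world, it's 2024!"
chunks = split_chunks(text)

# Identical chunks can be stored once with a frequency weight.
weights = Counter(split_chunks("low low low low"))
```

Pairs are then counted only inside chunks, so a merge never spans a chunk boundary, and a chunk such as " low" that appears three times needs to be stored only once with weight 3, as in the "Repeated chunks" row of the comparison.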

Special tokens

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")
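
A common way to handle special tokens is to cut them out of the text before ordinary encoding, so they always map to their reserved ids. The sketch below illustrates that idea with the standard library only. It is an assumption about the general technique, not a description of how fast-bpe-rs implements it, and `encode_with_specials` is a hypothetical helper that uses raw byte values in place of real BPE encoding.

```python
import re

special_tokens = {"<pad>": 600, "<eos>": 601}

# Longest-first alternation so a token that is a prefix of another cannot shadow it.
special_re = re.compile(
    "("
    + "|".join(re.escape(tok) for tok in sorted(special_tokens, key=len, reverse=True))
    + ")"
)


def encode_with_specials(text):
    """Map special tokens to their reserved ids and other text to byte values."""
    ids = []
    for part in special_re.split(text):
        if not part:
            continue  # re.split yields empty strings between adjacent matches
        if part in special_tokens:
            ids.append(special_tokens[part])
        else:
            # Stand-in for BPE encoding: raw UTF-8 byte values, no merges applied.
            ids.extend(part.encode("utf-8"))
    return ids


ids = encode_with_specials("a<pad><eos>a")
```

In this sketch the two special tokens come out as 600 and 601, with the byte value of "a" on either side.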

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
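
One likely reason for offering both `decode` (bytes) and `decode_to_string` (str) is that a byte-level token boundary can fall inside a multi-byte UTF-8 character, so an arbitrary slice of token ids may not decode to valid text. The standard-library sketch below shows the underlying issue with plain bytes. It does not call the library, and the source does not say how `decode_to_string` treats invalid UTF-8.

```python
text = "naïve"
data = text.encode("utf-8")        # 6 bytes: "ï" occupies two of them
left, right = data[:3], data[3:]   # the cut falls in the middle of "ï"

# A partial byte sequence is not valid UTF-8 on its own.
try:
    left.decode("utf-8")
    left_is_valid = True
except UnicodeDecodeError:
    left_is_valid = False

# Joining the pieces as bytes first restores the original text.
roundtrip = (left + right).decode("utf-8")

# A lossy decode substitutes U+FFFD for the incomplete character.
lossy = left.decode("utf-8", errors="replace")
```

When token ids are decoded in pieces (for example while streaming), concatenating the byte results before converting to a string avoids this problem.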

License

Apache 2.0
