
Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3.


fast-bpe-rs: A Fast Rust BPE Library

A fast Byte Pair Encoding (BPE) tokenizer written in Rust, with Python bindings for training, encoding, and decoding BPE models.

Naive vs. fast-bpe-rs

Let N be the total number of token positions after splitting, M the number of merges to learn, and k the number of occurrences touched by the current merge.

| Aspect | Naive BPE trainer | fast-bpe-rs |
|---|---|---|
| Corpus representation | Plain token lists such as Vec<Vec<u32>> | Deduplicated weighted merge sequences |
| Per-merge work | Recount pairs across the full corpus | Update only neighborhoods touched by the merge |
| Sequence updates | Rebuild or rewrite token lists repeatedly | In-place splicing in a sparse linked structure backed by Vec<Option<MergeNode>> |
| Pair statistics | Recomputed from scratch each round | Maintained incrementally as pair -> {count, locations} and count -> set of pairs |
| Best-pair lookup | Usually depends on the latest full recount | Pulled from the highest non-empty count bucket |
| Repeated chunks | Counted again and again | Stored once with a frequency weight |
| Parallelism | Often minimal in simple implementations | Parallel chunk counting and initial pair aggregation with rayon |
| Training time complexity | Typically O(MN), because each merge triggers another global count pass | O(N) setup, then roughly O(k) local updates per merge instead of O(N) rescans |
| Space complexity | Usually O(N) plus temporary pair counts | Higher than naive: O(N) corpus state plus pair-location indexes and count buckets |
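
To make the naive column concrete, here is a minimal Python sketch of a naive trainer. It is not code from fast-bpe-rs. It stores repeated chunks once with a weight (as in the "Repeated chunks" row) but still recounts every pair before each merge, which is where the O(MN) term comes from. The function names, the starting id of 256, and the tie-break rule are assumptions made for illustration.

```python
from collections import Counter


def count_pairs(chunks):
    """Count adjacent token pairs over all chunks, scaled by chunk weight."""
    pairs = Counter()
    for tokens, weight in chunks.items():
        for a, b in zip(tokens, tokens[1:]):
            pairs[(a, b)] += weight
    return pairs


def merge_chunk(tokens, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train_naive(chunks, num_merges, first_new_id=256):
    """Learn up to `num_merges` merges; every round does a full recount."""
    chunks = Counter(chunks)
    merges = []
    for step in range(num_merges):
        pairs = count_pairs(chunks)  # global pass over the corpus: O(N)
        if not pairs:
            break
        # Assumed tie-break: highest count, then largest pair. Real trainers differ.
        best = max(pairs, key=lambda p: (pairs[p], p))
        new_id = first_new_id + step
        merges.append((best, new_id))
        rewritten = Counter()
        for tokens, weight in chunks.items():
            rewritten[merge_chunk(tokens, best, new_id)] += weight
        chunks = rewritten
    return merges


# "low" seen 4 times and "lower" seen 2 times, each stored once with a weight.
example_chunks = {tuple(b"low"): 4, tuple(b"lower"): 2}
example_merges = train_naive(example_chunks, num_merges=2)
```

An incremental trainer avoids the `count_pairs` call inside the loop by adjusting only the counts of pairs adjacent to each merged occurrence.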

Setup

Install from PyPI

pip install fast-bpe-rs

Use

Small example

from fast_bpe_rs import BPE

bpe = BPE(r"(?s).+")
bpe.train(258, ["low low low low", "lower lower", "newest newest newest"])

ids = bpe.encode("low lower newest")
text = bpe.decode_to_string(ids)
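
For readers new to BPE, the following pure-Python sketch shows what encoding and decoding with a learned merge list amounts to. It assumes a byte-level base vocabulary (ids 0 to 255) and merges applied in the order they were learned; the merge list is hand-written for illustration, and the ids that fast-bpe-rs actually assigns may differ. `encode_sketch` and `decode_sketch` are hypothetical names, not library functions.

```python
def encode_sketch(text, merges):
    """Apply each merge, in learned order, to the UTF-8 bytes of `text`."""
    tokens = list(text.encode("utf-8"))
    for (a, b), new_id in merges:
        out = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


def decode_sketch(ids, merges):
    """Expand every id back to the bytes it stands for."""
    table = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges:
        table[new_id] = table[a] + table[b]
    return b"".join(table[i] for i in ids)


# Hypothetical merges: 256 = "ow", 257 = "l" + "ow" = "low".
sketch_merges = [((111, 119), 256), ((108, 256), 257)]
sketch_ids = encode_sketch("low lower", sketch_merges)
sketch_bytes = decode_sketch(sketch_ids, sketch_merges)
```

With these two merges, "low lower" shrinks from nine byte tokens to five, and decoding restores the original bytes.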

GPT-style split pattern

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

# corpus_lines is assumed to be a list of training documents (strings) loaded elsewhere.
bpe.train(32768, corpus_lines)
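
To see what such a split pattern does to a string before any merges are learned, the sketch below runs an ASCII-only approximation of it through Python's built-in `re` module. This is an assumption-laden stand-in: `re` does not support `\p{L}` or `\p{N}`, so they are replaced with `A-Za-z` and `0-9`, and the chunks produced by the library's own regex engine on non-ASCII text will differ. `split_chunks` is a hypothetical helper name.

```python
import re

# ASCII approximation of the GPT-style pattern (\p{L} -> A-Za-z, \p{N} -> 0-9).
ASCII_SPLIT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"   # common English contractions
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+"    # a word, optionally with one leading non-alphanumeric
    r"|[0-9]{1,3}"                    # digits in groups of at most three
    r"| ?[^\sA-Za-z0-9]+[\r\n]*"      # punctuation run, optional leading space
    r"|\s*[\r\n]+"                    # newlines with preceding whitespace
    r"|\s+(?!\S)"                     # trailing whitespace
    r"|\s+"                           # any other whitespace
)


def split_chunks(text):
    """Return the chunks the approximate pattern produces, in order."""
    return ASCII_SPLIT.findall(text)


chunks = split_chunks("Hello world 12345!")
# ['Hello', ' world', ' ', '123', '45', '!']
```

Merges are then learned inside each chunk only, so a token never spans, for example, a word and the number that follows it.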

Special tokens

from fast_bpe_rs import BPE

bpe = BPE(
    r"(?s).+",
    {
        "<pad>": 600,
        "<eos>": 601,
    },
)

bpe.train(605, ["a<pad>a"])
ids = bpe.encode("a<pad><eos>a")
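
One common way to handle special tokens is to cut them out of the text first, so each one maps directly to its reserved id and only the remaining spans go through BPE. The sketch below illustrates that idea in plain Python; whether fast-bpe-rs works this way internally is an assumption, and `split_on_special` is a hypothetical helper.

```python
import re


def split_on_special(text, special_tokens):
    """Split `text` into (piece, special_id_or_None) tuples."""
    if not special_tokens:
        return [(text, None)] if text else []
    # Longest tokens first so a longer token wins over one that is its prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in ordered) + ")"
    pieces = []
    for piece in re.split(pattern, text):
        if piece:  # re.split yields empty strings between adjacent matches
            pieces.append((piece, special_tokens.get(piece)))
    return pieces


specials = {"<pad>": 600, "<eos>": 601}
pieces = split_on_special("a<pad><eos>a", specials)
```

Here `pieces` holds four entries: the two ordinary "a" spans with `None`, and the two special tokens paired with 600 and 601.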

API

  • BPE(split_pattern, special_tokens=None)
  • train(vocab_size, docs)
  • encode(text) -> list[int]
  • decode(token_ids) -> bytes
  • decode_to_string(token_ids) -> str
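
The separate bytes-returning `decode` matters because byte-level tokens need not align with UTF-8 character boundaries, so a slice of token ids can decode to bytes that are not valid text on their own. The sketch below demonstrates the issue with plain Python bytes. It does not call the library, and how `decode_to_string` treats invalid UTF-8 (raise or replace) is not stated in this README; `lossy_text` is a hypothetical helper.

```python
def lossy_text(data: bytes) -> str:
    """Decode bytes to text, substituting U+FFFD for invalid sequences."""
    return data.decode("utf-8", errors="replace")


euro = "€".encode("utf-8")   # three bytes: e2 82 ac
partial = euro[:2]           # what a token boundary inside the character leaves

try:
    partial.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False        # strict decoding rejects the truncated character

replaced = lossy_text(partial)
whole = lossy_text(euro)
```

When streaming or slicing token ids, concatenating the raw bytes and converting to text once at the end avoids this problem.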

License

Apache 2.0

Download files

Source distribution: fast_bpe_rs-0.4.3.tar.gz (93.1 kB)

Built distributions (CPython 3.10+, abi3):

  • fast_bpe_rs-0.4.3-cp310-abi3-win_amd64.whl (947.8 kB): Windows x86-64
  • fast_bpe_rs-0.4.3-cp310-abi3-manylinux_2_34_x86_64.whl (1.0 MB): manylinux, glibc 2.34+, x86-64
  • fast_bpe_rs-0.4.3-cp310-abi3-macosx_11_0_arm64.whl (899.6 kB): macOS 11.0+, ARM64

Provenance

The following attestation bundles were made for fast_bpe_rs-0.4.3.tar.gz:

Publisher: release.yml on zhixiangli/fast-bpe-rs
