Skip to main content

A blazing-fast BPE tokenizer for Qwen models, built with Rust. Achieves 5x faster tokenization with parallelism compared to HuggingFace tokenizers.

Project description

bpe-qwen

A blazing-fast BPE tokenizer for Qwen models, built with Rust and the rust-gems BPE crate. Achieves 5x faster tokenization with parallelism compared to HuggingFace tokenizers.

Features

  • 🚀 Linear-time tokenization using optimized Rust implementation
  • 🐍 Python bindings via PyO3 for seamless integration
  • 📦 Native BPE format support (vocab.json + merges.txt)
  • 5x faster encoding with parallelism and 2x faster decoding compared to HuggingFace
  • 🎯 Pretokenization support for Qwen's pretokenization pattern
  • 100% accuracy verified across comprehensive test suite, including special tokens

Installation

pip install bpe-qwen

Usage

Quick Start

Use bpe-qwen as a drop-in replacement for HuggingFace tokenizers:

# Patch transformers to use bpe-qwen for Qwen models
from bpe_qwen import AutoLinearTokenizer

# This automatically uses bpe-qwen under the hood
tokenizer = AutoLinearTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Use it exactly like a HuggingFace tokenizer
outputs = tokenizer(
    "Hello, world!",
    return_tensors="pt",
    padding=True,
    truncation=True
)
print(outputs["input_ids"])

# Batch processing with native HuggingFace API
batch = tokenizer(
    ["Text 1", "Text 2", "Text 3"],
    padding=True,
    return_attention_mask=True
)

Benchmark Results

Performance comparison with HuggingFace tokenizers on various text samples:

Metric bpe-qwen (Rust) HuggingFace Speedup
Encoding Speed 19.22M chars/sec 3.35M chars/sec 5.73x
Decoding Speed 12.34M tokens/sec 5.33M tokens/sec 2.32x
Load Time ~3.3 seconds ~2.0 seconds 1.65x

Technical Implementation

Performance Optimization Journey

We systematically optimized the tokenizer through multiple iterations with significant performance improvements:

Core Optimizations

  1. HashMap → Vec mapping: Replaced HashMap<u32, u32> with Vec<u32> for O(1) token ID mapping
  2. ASCII normalization skip: Fast-path ASCII text to skip Unicode normalization
  3. Vector pre-allocation: Optimal 128-token capacity reduces reallocation overhead

Advanced Optimizations

  1. SIMD ASCII detection: Process 8 bytes at once using u64 chunks instead of byte-by-byte checks
  2. Memory pool: Reuse Vec<u32> allocations between tokenization calls to reduce allocation pressure
  3. True SIMD intrinsics: NEON on ARM, SSE2 on x86_64 for 16-byte parallel processing
  4. Zero-copy strings: Use Cow<str> to avoid allocations for ASCII text and when normalization not needed

Experiment Results Table

Optimization Encoding Speed Encoding vs HF Decoding Speed Decoding vs HF Status
Baseline 5.36M tok/s 6.39x 11.47M tok/s 2.22x ✅ Kept
+ SIMD ASCII 5.57M tok/s 6.87x - - ✅ Kept
+ Memory Pool 5.85M tok/s 7.30x 11.47M tok/s 2.22x ✅ Kept
+ String Interning 6.05M tok/s 7.72x 7.55M tok/s 1.38x ❌ Reverted
- String Interning 5.93M tok/s 6.99x 11.39M tok/s 2.12x ✅ Kept
+ True SIMD 6.12M tok/s 7.28x 12.04M tok/s 2.21x ✅ Kept
+ Batch API 6.06M tok/s 7.50x 12.04M tok/s 2.32x ❌ Reverted
+ Zero-Copy 6.30M tok/s 7.83x 12.34M tok/s 2.32x ✅ Kept
+ Jemalloc 5.70M tok/s 8.91x 11.01M tok/s 2.19x ❌ Reverted
+ Parallel Batch (8 workers) 31.43M tok/s 18.13x* - - ✅ Kept

Development

Building from Source

# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/sweepai/bpe-qwen.git
cd bpe-qwen
maturin develop --release

# Run tests
python test_simple.py
python benchmark.py

Running Benchmarks

# Run comprehensive benchmarks
python benchmark.py

# Compare against HuggingFace
# (automatically downloads HF tokenizer if needed)

Limitations

  • Currently supports Qwen models with GPT-2 style byte-level BPE
  • Requires vocab.json and merges.txt files (not tokenizer.json)
  • Some special tokens may need manual configuration

Future Improvements

Potential Optimizations

  • Rayon parallelization: Multi-threaded tokenization for large texts using data parallelism
  • True SIMD intrinsics: Explicit vector instructions for even faster ASCII detection and token processing
  • Custom allocators: Specialized memory management for tokenization workloads
  • Profile-guided optimization: Workload-specific optimizations based on production usage patterns

Feature Enhancements

  • Early stopping for tokenization based on token count
  • Support for more model architectures
  • Batch processing optimizations

Acknowledgments

  • Built on top of the excellent rust-gems BPE crate
  • Inspired by the need for faster tokenization in production ML pipelines

This entire project was written by Sweep AI, an AI plugin for JetBrains IDEs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

bpe_qwen-0.1.1-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.1-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.1-cp310-cp310-macosx_11_0_arm64.whl (1.2 MB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

bpe_qwen-0.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.4 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ ARM64

File details

Details for the file bpe_qwen-0.1.1-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6cac7c2c5c7950eff0e78d9283998f52e1a5f07f0975c4514b3b0533a785bdd1
MD5 d71b6e6f81be3b40545c340516580371
BLAKE2b-256 78967ecbd1d3af07b240536f909a37665b95b6f9c9c887795ba1058d03b015a7

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 c06bfd2b081208351e226ffed92cfe2dcfb3b43477268df56ee6a42793b47109
MD5 524721e3f55f184ffe7c594a4b702e54
BLAKE2b-256 d01dde53969897c2cc725c671fc603db6406a2bcb0b546deb1d5c3195a110e1d

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 2da363a7bb504f6ffc225390937b343dde0ec326d12b3e00ff755acc8e61a0f5
MD5 7f5ba804812c8537a6807d08291bbc0d
BLAKE2b-256 e67ae7111df3ddd255ee37ade39ecde8797cf1e1b67ae4608cbe97c70f3b9123

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 e98de5945a2c6371e8f907c31aa96dbbb806d11f1d59093306a023af6b5f75b6
MD5 4381719c6c322c5d25f440d19b28b79c
BLAKE2b-256 967f9dfd453ae64160699e346c7727d806f5d372853ccc3d64683902777d9127

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 80773f98c8c7356181cb57594c1ce8e6016c5024def4a0b7eeb85018276c9b6b
MD5 49da3f9dcbf450fb7090789c043ad504
BLAKE2b-256 16cd94f8c6577cb93a3f2c320a09802de66a669bb8e9c1c847f003d26b3e84d9

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 32231a068382306c312e90009d57000690ddd09b122857f20ee109702b353501
MD5 0e1aac53534ce9579783ed4e2ddfa18e
BLAKE2b-256 15edb0f2cba64774d85f11c8577534e766fa58c84e2b86ea4597fa6a535e9e80

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 31aa060c5c7e70329e47450d101dd686b78a4bd466f4ca1f3292fcd2a232bab9
MD5 5cff0a5de2330ea964a76da977e683ef
BLAKE2b-256 c8d93ac26437a2ca76a9cfe63b71d799cd21d02fe0edac4f858eb4c0b85fdc17

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 2aa44751cff43fc4d7d8b980045bbeea11df8de4ed09828391b5a0a0115b3c4b
MD5 e7e37c10be043510a1d177e1c9fae8d1
BLAKE2b-256 104729416aaea43671501ae92bec9fa7fe9a3ee076b58fdb23fdfd6ec132ef0b

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 9819c5b8818e0d9c838dd84e8f42d6eadff12bd7e96dd4d7d85f42d6ce7af1ab
MD5 d03a18fbc6f9673553ebb070cc09588a
BLAKE2b-256 3c7f051eaeeaec64249ecc2620d1d045ecd52aa9fd457a7174683137f4dd9cb0

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 1eda08fa3ee97a3cc02953ffc9aa1edc6e981d6093a67de857446758e88a8e0f
MD5 e64744ebe9294654d8de406b69a521c7
BLAKE2b-256 3920fbf0aea8d7075060c463dea28105e1ece67b91ce0e877ba7baa656ce837b

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 032d740b9e365ff48d161e48f90e21e86ef9c4139fbb5ea3a956c28154348b19
MD5 91a59bd43d1d3ccea70c9f1720392c64
BLAKE2b-256 c779d5c81cc08a32f170ef1fdf4152b68bd026382a41e8feaa43e5662b03a383

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 54e00d057935173d0a508ec1ec366823c2baffcc3d83e950e6785e0bbde97c29
MD5 19299b3924d136dd4e60191a4e4f1233
BLAKE2b-256 d099c84d24e106d263da780a513019de1154aa647efbfe6eec1ff00de28f5653

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 8fca3f68d7881b8bbb672462f9aa43e86b2eee022f01eabf40946ea0630f299f
MD5 7e9c3288c0eba9754101fd230bbea3ec
BLAKE2b-256 57fa374eae78127e23aa602c37a2c2857642aad3be9aa688904a446c6de3d5a8

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 4caa82a11af5a784676ca7f8272aea7886c076f39c0d66b4d45a3b10adc33dfc
MD5 1cd659a7470ed11796600cc4dc30f690
BLAKE2b-256 2431d1791cd3b789d7d09bb419f5f6a68330df14b2f67275af745d04f127aa59

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.1-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 363dbe8dfd881897985d2e557a1b20d9e535d1013811673e63a97ca2a1e19e26
MD5 e5048f2792839aa6dbe1ae9249a87c04
BLAKE2b-256 550bec4a432e26cd761fe3a67c0bcbc4ef0f2368978c77de39607e3aa7afde91

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page