bpe-qwen
A blazing-fast BPE tokenizer for Qwen models, built with Rust and the rust-gems BPE crate. Achieves 5x faster tokenization with parallelism compared to HuggingFace tokenizers.
Features
- 🚀 Linear-time tokenization using optimized Rust implementation
- 🐍 Python bindings via PyO3 for seamless integration
- 📦 Native BPE format support (vocab.json + merges.txt)
- ⚡ 5x faster encoding with parallelism and 2x faster decoding compared to HuggingFace
- 🎯 Pretokenization support for Qwen's pretokenization pattern
- ✅ 100% accuracy verified across comprehensive test suite, including special tokens
Installation
pip install bpe-qwen
Usage
Quick Start
Use bpe-qwen as a drop-in replacement for HuggingFace tokenizers:
# Patch transformers to use bpe-qwen for Qwen models
from bpe_qwen import AutoLinearTokenizer

# This automatically uses bpe-qwen under the hood
tokenizer = AutoLinearTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Use it exactly like a HuggingFace tokenizer
outputs = tokenizer(
    "Hello, world!",
    return_tensors="pt",
    padding=True,
    truncation=True,
)
print(outputs["input_ids"])

# Batch processing with native HuggingFace API
batch = tokenizer(
    ["Text 1", "Text 2", "Text 3"],
    padding=True,
    return_attention_mask=True,
)
Benchmark Results
Performance comparison with HuggingFace tokenizers on various text samples:
| Metric | bpe-qwen (Rust) | HuggingFace | Speedup |
|---|---|---|---|
| Encoding Speed | 19.22M chars/sec | 3.35M chars/sec | 5.73x |
| Decoding Speed | 12.34M tokens/sec | 5.33M tokens/sec | 2.32x |
| Load Time | ~3.3 seconds | ~2.0 seconds | 1.65x slower |
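The throughput figures above are amounts of text processed per second. A minimal sketch of how such a number could be measured follows, assuming a hypothetical `encode` callable that maps a string to a list of token IDs. The helper names `chars_per_second` and `speedup` are illustrative and are not part of bpe-qwen or its benchmark.py.

```python
import time
from typing import Callable, List, Sequence


def chars_per_second(encode: Callable[[str], List[int]],
                     texts: Sequence[str],
                     repeats: int = 3) -> float:
    """Return the best observed encoding throughput in characters per second."""
    total_chars = sum(len(t) for t in texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading on very coarse timers.
    return total_chars / max(best, 1e-9)


def speedup(candidate: float, baseline: float) -> float:
    """Ratio of two throughputs; values above 1.0 mean the candidate is faster."""
    return candidate / baseline
```

With the table's rounded decoding figures, `speedup(12.34, 5.33)` comes out at roughly 2.32, which matches the decoding row.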
Technical Implementation
Performance Optimization Journey
We optimized the tokenizer over several iterations, benchmarking each change against HuggingFace and keeping only the changes that helped:
Core Optimizations
- HashMap → Vec mapping: Replaced `HashMap<u32, u32>` with `Vec<u32>` for O(1) token ID mapping
- ASCII normalization skip: Fast-path ASCII text to skip Unicode normalization
- Vector pre-allocation: Optimal 128-token capacity reduces reallocation overhead
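The first two ideas can be illustrated outside Rust. The sketch below is Python, not the crate's actual code, and rests on two assumptions: that NFC is the normalization form being skipped (the list above only says "Unicode normalization"), and that token IDs are small non-negative integers, so a list can be indexed by them directly. The function names are hypothetical.

```python
import unicodedata
from typing import Dict, List


def normalize_fast(text: str) -> str:
    """Skip normalization for pure-ASCII input, which NFC leaves unchanged."""
    if text.isascii():
        return text
    return unicodedata.normalize("NFC", text)


def dense_table(mapping: Dict[int, int], missing: int = -1) -> List[int]:
    """Turn a sparse id -> id dict into a list indexed by the source id."""
    table = [missing] * (max(mapping) + 1) if mapping else []
    for src, dst in mapping.items():
        table[src] = dst
    return table
```

A lookup then becomes `table[token_id]`, a plain index with no hashing. The trade-off is memory proportional to the largest ID rather than to the number of entries.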
Advanced Optimizations
- SIMD ASCII detection: Process 8 bytes at once using u64 chunks instead of byte-by-byte checks
- Memory pool: Reuse `Vec<u32>` allocations between tokenization calls to reduce allocation pressure
- True SIMD intrinsics: NEON on ARM, SSE2 on x86_64 for 16-byte parallel processing
- Zero-copy strings: Use `Cow<str>` to avoid allocations for ASCII text and when normalization is not needed
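The u64 chunk trick relies on the fact that a byte is ASCII exactly when its top bit is clear, so eight bytes can be checked with a single mask. The following is a Python illustration of that idea only. It is not the Rust implementation, does not use NEON or SSE2 intrinsics, and will not be faster than Python's built-in `bytes.isascii()`. The name `is_ascii_chunked` is hypothetical.

```python
HIGH_BITS = 0x8080808080808080  # top bit of each of the 8 bytes in a 64-bit word


def is_ascii_chunked(data: bytes) -> bool:
    """Check for non-ASCII bytes 8 at a time, then handle the remaining tail."""
    n = len(data)
    full = n - (n % 8)
    for i in range(0, full, 8):
        word = int.from_bytes(data[i:i + 8], "little")
        if word & HIGH_BITS:
            # At least one byte in this chunk has its top bit set.
            return False
    # Fewer than 8 bytes remain; check them individually.
    return all(b < 0x80 for b in data[full:])
```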
Experiment Results Table
| Optimization | Encoding Speed | Encoding vs HF | Decoding Speed | Decoding vs HF | Status |
|---|---|---|---|---|---|
| Baseline | 5.36M tok/s | 6.39x | 11.47M tok/s | 2.22x | ✅ Kept |
| + SIMD ASCII | 5.57M tok/s | 6.87x | - | - | ✅ Kept |
| + Memory Pool | 5.85M tok/s | 7.30x | 11.47M tok/s | 2.22x | ✅ Kept |
| + String Interning | 6.05M tok/s | 7.72x | 7.55M tok/s | 1.38x | ❌ Reverted |
| - String Interning | 5.93M tok/s | 6.99x | 11.39M tok/s | 2.12x | ✅ Kept |
| + True SIMD | 6.12M tok/s | 7.28x | 12.04M tok/s | 2.21x | ✅ Kept |
| + Batch API | 6.06M tok/s | 7.50x | 12.04M tok/s | 2.32x | ❌ Reverted |
| + Zero-Copy | 6.30M tok/s | 7.83x | 12.34M tok/s | 2.32x | ✅ Kept |
| + Jemalloc | 5.70M tok/s | 8.91x | 11.01M tok/s | 2.19x | ❌ Reverted |
| + Parallel Batch (8 workers) | 31.43M tok/s | 18.13x* | - | - | ✅ Kept |
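The last row encodes a batch with 8 workers. A rough Python sketch of that shape is shown below, assuming a hypothetical `encode` callable. Threads only give a real speedup here if `encode` releases the GIL, as native extensions can. Whether bpe-qwen does so, and how its own batch path is implemented, is not stated above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def encode_batch_parallel(encode: Callable[[str], List[int]],
                          texts: Sequence[str],
                          workers: int = 8) -> List[List[int]]:
    """Encode texts on a worker pool, returning results in input order."""
    if not texts:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves the order of its input iterable.
        return list(pool.map(encode, texts))
```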
Development
Building from Source
# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone and build
git clone https://github.com/sweepai/bpe-qwen.git
cd bpe-qwen
maturin develop --release
# Run tests
python test_simple.py
python benchmark.py
Running Benchmarks
# Run comprehensive benchmarks
python benchmark.py
# Compare against HuggingFace
# (automatically downloads HF tokenizer if needed)
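A speed comparison only means something if both tokenizers produce the same token IDs. The helper below is a minimal sketch of such a check. `reference` and `candidate` are hypothetical callables (for example, thin wrappers around each tokenizer's encode method), and this is not a description of what benchmark.py or test_simple.py actually do.

```python
from typing import Callable, List, Sequence, Tuple


def find_mismatches(reference: Callable[[str], List[int]],
                    candidate: Callable[[str], List[int]],
                    samples: Sequence[str]) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, reference_ids, candidate_ids) for every sample that differs."""
    mismatches = []
    for text in samples:
        expected = list(reference(text))
        actual = list(candidate(text))
        if expected != actual:
            mismatches.append((text, expected, actual))
    return mismatches
```

An empty result means the two callables agreed on every sample in the list.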
Limitations
- Currently supports only Qwen models that use GPT-2 style byte-level BPE
- Requires vocab.json and merges.txt files (not tokenizer.json)
- Some special tokens may need manual configuration
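If only a tokenizer.json is at hand, the two required files can usually be derived from it. The sketch below assumes the HuggingFace tokenizers layout for BPE models, where `model.vocab` is a token-to-ID mapping and `model.merges` is a list of merges. The `#version: 0.2` header line follows the GPT-2 convention, and whether bpe-qwen expects it is an assumption. The function name `export_bpe_files` is hypothetical.

```python
import json
from pathlib import Path


def export_bpe_files(tokenizer_json: Path, out_dir: Path) -> None:
    """Write vocab.json and merges.txt from a BPE tokenizer.json."""
    model = json.loads(tokenizer_json.read_text(encoding="utf-8"))["model"]
    out_dir.mkdir(parents=True, exist_ok=True)

    (out_dir / "vocab.json").write_text(
        json.dumps(model["vocab"], ensure_ascii=False), encoding="utf-8")

    # Conventional GPT-2 header line; assumed, not confirmed for bpe-qwen.
    lines = ["#version: 0.2"]
    for merge in model["merges"]:
        # Merges are stored either as "left right" strings or as [left, right] pairs.
        lines.append(merge if isinstance(merge, str) else " ".join(merge))
    (out_dir / "merges.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

This sketch does not carry over the `added_tokens` section of tokenizer.json, so special tokens would still need the manual configuration mentioned above.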
Future Improvements
Potential Optimizations
- Rayon parallelization: Multi-threaded tokenization for large texts using data parallelism
- True SIMD intrinsics: Explicit vector instructions for even faster ASCII detection and token processing
- Custom allocators: Specialized memory management for tokenization workloads
- Profile-guided optimization: Workload-specific optimizations based on production usage patterns
Feature Enhancements
- Early stopping for tokenization based on token count
- Support for more model architectures
- Batch processing optimizations
Acknowledgments
- Built on top of the excellent rust-gems BPE crate
- Inspired by the need for faster tokenization in production ML pipelines
This entire project was written by Sweep AI, an AI plugin for JetBrains IDEs