A blazing-fast BPE tokenizer for Qwen models, built with Rust. Achieves 5x faster tokenization by default, and 10x faster with parallelism, compared to HuggingFace tokenizers.

bpe-qwen

A blazing-fast BPE tokenizer for Qwen models, built with Rust and the rust-gems BPE crate. Achieves 5x faster tokenization by default and 10x faster with parallelization compared to HuggingFace tokenizers.

Features

  • 🚀 Linear-time tokenization using optimized Rust implementation
  • 🐍 Python bindings via PyO3 for seamless integration
  • 📦 Native BPE format support (vocab.json + merges.txt)
  • 5x faster encoding by default, 10x faster with parallelism, and 2x faster decoding compared to HuggingFace
  • 🎯 Pretokenization support for Qwen's pretokenization pattern
  • 100% accuracy verified across comprehensive test suite, including special tokens

Installation

pip install bpe-qwen

Usage

Quick Start

Use bpe-qwen as a drop-in replacement for HuggingFace tokenizers:

# AutoLinearTokenizer mirrors the HuggingFace AutoTokenizer interface for Qwen models
from bpe_qwen import AutoLinearTokenizer

# This automatically uses bpe-qwen under the hood
tokenizer = AutoLinearTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Use it exactly like a HuggingFace tokenizer
outputs = tokenizer(
    "Hello, world!",
    return_tensors="pt",
    padding=True,
    truncation=True
)
print(outputs["input_ids"])

# Batch processing with native HuggingFace API
batch = tokenizer(
    ["Text 1", "Text 2", "Text 3"],
    padding=True,
    return_attention_mask=True
)

Benchmark Results

Performance comparison with HuggingFace tokenizers on various text samples:

| Metric | bpe-qwen (Rust) | HuggingFace | Speedup |
|---|---|---|---|
| Encoding Speed | 19.22M chars/sec | 3.35M chars/sec | 5.73x |
| Decoding Speed | 12.34M tokens/sec | 5.33M tokens/sec | 2.32x |
| Load Time | ~3.3 seconds | ~2.0 seconds | 0.61x (1.65x slower) |
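
For readers who want to reproduce figures of this kind, here is a minimal timing sketch. It is not the project's benchmark.py; the function name measure_throughput, the best-of-N timing strategy, and the stand-in encoder are all assumptions made for illustration. Any callable that takes a string can be passed as encode.

```python
import time


def measure_throughput(encode, texts, repeats=5):
    """Return characters processed per second (best of `repeats` runs).

    `encode` is any callable taking a single string; this helper is a
    hypothetical harness, not part of bpe-qwen.
    """
    total_chars = sum(len(t) for t in texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero interval on very coarse clocks.
    return total_chars / max(best, 1e-12)


def toy_encode(text):
    """Stand-in encoder so the sketch runs without any tokenizer installed."""
    return list(text.encode("utf-8"))


samples = ["def hello():\n    return 'world'\n"] * 1000
rate = measure_throughput(toy_encode, samples)
print(f"{rate / 1e6:.2f}M chars/sec")
```

A speedup is then the ratio of two such rates measured on the same samples and the same machine.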

Technical Implementation

Performance Optimization Journey

We optimized the tokenizer over several iterations, benchmarking each change against HuggingFace and keeping only the changes that paid off:

Core Optimizations

  1. HashMap → Vec mapping: Replaced HashMap<u32, u32> with Vec<u32> so that token ID mapping is a direct array index with no hashing
  2. ASCII normalization skip: Fast-path ASCII text to skip Unicode normalization
  3. Vector pre-allocation: Optimal 128-token capacity reduces reallocation overhead
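
The first two ideas above can be sketched in Python. This is an illustration of the techniques only, not the crate's Rust code: the helper names are hypothetical, and NFC is assumed as the normalization form (the fast path is valid for any Unicode normalization form, because ASCII text is unchanged by all of them).

```python
import unicodedata
from typing import Dict, List


def build_dense_table(mapping: Dict[int, int], missing: int = -1) -> List[int]:
    """Turn a sparse id -> id dict into a list indexed directly by id."""
    size = max(mapping) + 1 if mapping else 0
    table = [missing] * size
    for src, dst in mapping.items():
        table[src] = dst
    return table


def normalize_fast(text: str, form: str = "NFC") -> str:
    """Skip Unicode normalization entirely for pure-ASCII input."""
    if text.isascii():
        return text  # ASCII is already in normalized form
    return unicodedata.normalize(form, text)


table = build_dense_table({0: 5, 2: 7})
print(table[2])                      # lookup is a plain index, no hashing
print(normalize_fast("plain text"))  # returned unchanged via the fast path
```

The dense table trades memory for speed: it needs one slot per possible ID, which is acceptable when IDs are small and mostly contiguous, as token IDs are.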

Advanced Optimizations

  1. SIMD ASCII detection: Process 8 bytes at once using u64 chunks instead of byte-by-byte checks
  2. Memory pool: Reuse Vec<u32> allocations between tokenization calls to reduce allocation pressure
  3. True SIMD intrinsics: NEON on ARM, SSE2 on x86_64 for 16-byte parallel processing
  4. Zero-copy strings: Use Cow<str> to avoid allocations for ASCII text and whenever normalization is not needed
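
The 8-bytes-at-a-time check from item 1 relies on a simple bit trick: a byte is non-ASCII exactly when its high bit is set, so one AND against a mask of eight 0x80 bytes tests a whole 64-bit word. The sketch below shows the idea in Python purely for illustration; the function name is hypothetical, and in real Python code the built-in bytes.isascii() is the practical choice.

```python
HIGH_BITS = 0x8080808080808080  # high bit of each of the 8 bytes in a word


def is_ascii_chunked(data: bytes) -> bool:
    """Check for ASCII by testing 8 bytes per step, then the leftover tail."""
    full = len(data) - len(data) % 8
    for i in range(0, full, 8):
        word = int.from_bytes(data[i:i + 8], "little")
        if word & HIGH_BITS:
            return False  # at least one byte in this chunk is >= 0x80
    # Fewer than 8 bytes remain; check them one by one.
    return all(b < 0x80 for b in data[full:])


print(is_ascii_chunked(b"fn main() { println!(\"hi\"); }"))
print(is_ascii_chunked("naïve".encode("utf-8")))
```

The explicit NEON and SSE2 variants in item 3 apply the same high-bit test to 16 bytes per instruction.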

Experiment Results Table

| Optimization | Encoding Speed | Encoding vs HF | Decoding Speed | Decoding vs HF | Status |
|---|---|---|---|---|---|
| Baseline | 5.36M tok/s | 6.39x | 11.47M tok/s | 2.22x | ✅ Kept |
| + SIMD ASCII | 5.57M tok/s | 6.87x | - | - | ✅ Kept |
| + Memory Pool | 5.85M tok/s | 7.30x | 11.47M tok/s | 2.22x | ✅ Kept |
| + String Interning | 6.05M tok/s | 7.72x | 7.55M tok/s | 1.38x | ❌ Reverted |
| - String Interning | 5.93M tok/s | 6.99x | 11.39M tok/s | 2.12x | ✅ Kept |
| + True SIMD | 6.12M tok/s | 7.28x | 12.04M tok/s | 2.21x | ✅ Kept |
| + Batch API | 6.06M tok/s | 7.50x | 12.04M tok/s | 2.32x | ❌ Reverted |
| + Zero-Copy | 6.30M tok/s | 7.83x | 12.34M tok/s | 2.32x | ✅ Kept |
| + Jemalloc | 5.70M tok/s | 8.91x | 11.01M tok/s | 2.19x | ❌ Reverted |
| + Parallel Batch (8 workers) | 31.43M tok/s | 18.13x* | - | - | ✅ Kept |
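
The Parallel Batch row spreads a batch of texts across 8 workers. A rough Python sketch of that shape follows; it is not the project's implementation, and the function name and stand-in encoder are assumptions. In CPython, threads only give a real speedup when the encode call releases the GIL, as native extensions can; whether bpe-qwen does so is not stated here.

```python
from concurrent.futures import ThreadPoolExecutor


def encode_batch_parallel(encode, texts, workers=8):
    """Encode each text on a worker pool, preserving input order."""
    if not texts:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(encode, texts))


def toy_encode(text):
    """Stand-in encoder so the sketch runs without any tokenizer installed."""
    return [ord(c) for c in text]


print(encode_batch_parallel(toy_encode, ["ab", "c", ""]))
```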

Development

Building from Source

# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

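# Clone and build (maturin compiles the PyO3 extension; install it first if needed)
git clone https://github.com/sweepai/bpe-qwen.git
cd bpe-qwen
pip install maturin
maturin develop --release

# Run the tests, then the benchmark
python test_simple.py
python benchmark.py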

Running Benchmarks

# Run comprehensive benchmarks, including the comparison against HuggingFace
# (automatically downloads the HF tokenizer if needed)
python benchmark.py

Limitations

  • Currently supports Qwen models with GPT-2 style byte-level BPE
  • Requires vocab.json and merges.txt files (not tokenizer.json)
  • Some special tokens may need manual configuration
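
For orientation, the two required files follow the GPT-2 convention: vocab.json maps token strings to integer IDs, and merges.txt lists one merge pair per line, optionally preceded by a "#version" header. The sketch below reads both under that assumption; load_bpe_files is a hypothetical helper, not part of the bpe-qwen API.

```python
import json
from pathlib import Path


def load_bpe_files(vocab_path, merges_path):
    """Read a GPT-2 style vocab.json and merges.txt pair.

    Returns (vocab, merges): a token -> id dict and a list of
    (left, right) merge pairs in priority order.
    """
    vocab = json.loads(Path(vocab_path).read_text(encoding="utf-8"))
    merges = []
    for line in Path(merges_path).read_text(encoding="utf-8").splitlines():
        # Skip blank lines and the optional "#version: ..." header.
        if not line or line.startswith("#version"):
            continue
        parts = line.split(" ")
        if len(parts) != 2:
            raise ValueError(f"malformed merge line: {line!r}")
        merges.append((parts[0], parts[1]))
    return vocab, merges
```

Merge order matters: earlier lines have higher priority, so the list must not be sorted or deduplicated.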

Future Improvements

Potential Optimizations

  • Rayon parallelization: Multi-threaded tokenization for large texts using data parallelism
  • Wider SIMD coverage: Extend explicit vector instructions beyond ASCII detection to token processing
  • Custom allocators: Specialized memory management for tokenization workloads
  • Profile-guided optimization: Workload-specific optimizations based on production usage patterns

Feature Enhancements

  • Early stopping for tokenization based on token count
  • Support for more model architectures
  • Batch processing optimizations

Acknowledgments

  • Built on top of the excellent rust-gems BPE crate
  • Inspired by the need for faster tokenization in production ML pipelines

This entire project was written by Sweep AI, an AI plugin for JetBrains IDEs.

Built Distributions

Release 0.1.2 ships prebuilt wheels only (about 1.2 to 1.4 MB each); no source distribution is available.

  • CPython 3.8, 3.9, 3.10, 3.11, 3.12, 3.13: manylinux (glibc 2.17+) on x86-64 and ARM64
  • CPython 3.10: macOS 11.0+ on ARM64
  • PyPy 3.10: manylinux (glibc 2.17+) on x86-64 and ARM64
