Skip to main content

A blazing-fast BPE tokenizer for Qwen models, built with Rust. Achieves 5x faster tokenization with parallelism compared to HuggingFace tokenizers.

Project description

bpe-qwen

A blazing-fast BPE tokenizer for Qwen models, built with Rust and the rust-gems BPE crate. Achieves 6x faster tokenization by default and 12x faster with parallelization compared to HuggingFace tokenizers.

Features

  • 🚀 Linear-time tokenization based on the rust-gems BPE crate for fast tokenization
  • 🎯 Optimized pretokenization for Qwen's pretokenization pattern using a two-pass approach instead of the base lookahead regex
  • 🐍 Python bindings via PyO3 for seamless integration
  • 📦 Native BPE format support (vocab.json + merges.txt)
  • 6x faster encoding by default, 12x faster with parallelism, and 2x faster decoding compared to HuggingFace
  • 100% accuracy verified across comprehensive test suite, including special tokens

Installation

pip install bpe-qwen

Usage

Quick Start

Use bpe-qwen as a drop-in replacement for HuggingFace tokenizers:

# Patch transformers to use bpe-qwen for Qwen models
from bpe_qwen import AutoLinearTokenizer

# This automatically uses bpe-qwen under the hood
tokenizer = AutoLinearTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Use it exactly like a HuggingFace tokenizer
outputs = tokenizer(
    "Hello, world!",
    return_tensors="pt",
    padding=True,
    truncation=True
)
print(outputs["input_ids"])

# Batch processing with native HuggingFace API
batch = tokenizer(
    ["Text 1", "Text 2", "Text 3"],
    padding=True,
    return_attention_mask=True
)

Benchmark Results

Performance comparison with HuggingFace tokenizers on WikiText dataset (2,891 texts, 1.3M characters):

Sequential Performance:

Tokenizer Speed Speedup vs HF
bpe-qwen 6.40M tokens/sec 6.28x
HuggingFace 1.02M tokens/sec 1.00x

Parallel Performance (8 workers):

Tokenizer Speed Speedup vs HF Parallel Benefit
bpe-qwen 33.08M tokens/sec 12.52x 5.17x vs sequential
HuggingFace 2.64M tokens/sec 1.00x 2.59x vs sequential

Token consistency verified: All methods produce identical 298,938 tokens

Development

Building from Source

# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/sweepai/bpe-qwen.git
cd bpe-qwen
maturin develop --release

# Run tests
python test_simple.py
python benchmark.py

Limitations

  • Requires vocab.json and merges.txt files (not tokenizer.json)
  • Some multi-byte UTF-8 characters are not handled correctly

Future Improvements

Potential Optimizations

  • True SIMD intrinsics: Explicit vector instructions for even faster ASCII detection and token processing
  • Custom allocators: Specialized memory management for tokenization workloads

Feature Enhancements

  • Early stopping for tokenization based on token count
  • Support for more model architectures
  • Batch processing optimizations

Acknowledgments

  • Built on top of the excellent rust-gems BPE crate
  • Inspired by the need for faster tokenization in production ML pipelines

This entire project was written by Sweep AI, an AI plugin for JetBrains IDEs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

bpe_qwen-0.1.3-cp312-cp312-macosx_11_0_arm64.whl (989.7 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

bpe_qwen-0.1.3-cp310-cp310-macosx_11_0_arm64.whl (990.0 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

File details

Details for the file bpe_qwen-0.1.3-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.3-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 bdddebcbec0b232dc5505562ce4d48b6cb1a3cca8abe15a4d08dba409ae5907e
MD5 9bf91670c04214b1218fcfc47ce83e84
BLAKE2b-256 9fa5b5d4c05bdb00d5db8d08cde77632b36454163642b51dc0aaf85e61e751ad

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.3-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.3-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fc79142c44bb7335326cc15bbd7163c7223dd530e0c90ffb88d78dd5355ec326
MD5 5892177c05b2730a9fefd9661371fbc3
BLAKE2b-256 b26b4a489642a67a52be42834987b817bdf96e1be2beeb373c1832ead88086e9

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page