Skip to main content

A blazing-fast BPE tokenizer for Qwen models, built with Rust. Achieves 5x faster tokenization with parallelism compared to HuggingFace tokenizers.

Project description

bpe-qwen

A blazing-fast BPE tokenizer for Qwen models, built with Rust and the rust-gems BPE crate. Achieves 6x faster tokenization by default and 12x faster with parallelization compared to HuggingFace tokenizers.

Features

  • 🚀 Linear-time tokenization based on the rust-gems BPE crate for fast tokenization
  • 🎯 Optimized pretokenization for Qwen's pretokenization pattern using a two-pass approach instead of the base lookahead regex
  • 🐍 Python bindings via PyO3 for seamless integration
  • 📦 Native BPE format support (vocab.json + merges.txt)
  • 6x faster encoding by default, 12x faster with parallelism, and 2x faster decoding compared to HuggingFace
  • 100% accuracy verified across comprehensive test suite, including special tokens

Installation

pip install bpe-qwen

Usage

Quick Start

Use bpe-qwen as a drop-in replacement for HuggingFace tokenizers:

# Patch transformers to use bpe-qwen for Qwen models
from bpe_qwen import AutoLinearTokenizer

# This automatically uses bpe-qwen under the hood
tokenizer = AutoLinearTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Use it exactly like a HuggingFace tokenizer
outputs = tokenizer(
    "Hello, world!",
    return_tensors="pt",
    padding=True,
    truncation=True
)
print(outputs["input_ids"])

# Batch processing with native HuggingFace API
batch = tokenizer(
    ["Text 1", "Text 2", "Text 3"],
    padding=True,
    return_attention_mask=True
)

Benchmark Results

Performance comparison with HuggingFace tokenizers on WikiText dataset (2,891 texts, 1.3M characters):

Sequential Performance:

Tokenizer Speed Speedup vs HF
bpe-qwen 6.40M tokens/sec 6.28x
HuggingFace 1.02M tokens/sec 1.00x

Parallel Performance (8 workers):

Tokenizer Speed Speedup vs HF Parallel Benefit
bpe-qwen 33.08M tokens/sec 12.52x 5.17x vs sequential
HuggingFace 2.64M tokens/sec 1.00x 2.59x vs sequential

Token consistency verified: All methods produce identical 298,938 tokens

Development

Building from Source

# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/sweepai/bpe-qwen.git
cd bpe-qwen
maturin develop --release

# Run tests
python test_simple.py
python benchmark.py

Limitations

  • Requires vocab.json and merges.txt files (not tokenizer.json)
  • Some multi-byte UTF-8 characters are not handled correctly

Future Improvements

Potential Optimizations

  • True SIMD intrinsics: Explicit vector instructions for even faster ASCII detection and token processing
  • Custom allocators: Specialized memory management for tokenization workloads

Feature Enhancements

  • Early stopping for tokenization based on token count
  • Support for more model architectures
  • Batch processing optimizations

Acknowledgments

  • Built on top of the excellent rust-gems BPE crate
  • Inspired by the need for faster tokenization in production ML pipelines

This entire project was written by Sweep AI, an AI plugin for JetBrains IDEs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bpe_qwen-0.1.4.tar.gz (312.6 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

bpe_qwen-0.1.4-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.4-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.4-cp312-cp312-macosx_11_0_arm64.whl (989.2 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

bpe_qwen-0.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.4-cp311-cp311-macosx_11_0_arm64.whl (989.4 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

bpe_qwen-0.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.4-cp310-cp310-macosx_11_0_arm64.whl (989.7 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

File details

Details for the file bpe_qwen-0.1.4.tar.gz.

File metadata

  • Download URL: bpe_qwen-0.1.4.tar.gz
  • Upload date:
  • Size: 312.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.15

File hashes

Hashes for bpe_qwen-0.1.4.tar.gz
Algorithm Hash digest
SHA256 d83f0e1e2a33419ad6179c8ffe3eb4a75d11e32d0dc45c1d2bc79b39076415bd
MD5 800b12ae216f97afc13cda35d795d305
BLAKE2b-256 1be8f9c3e1334b749bec70b223932541361e6755758e8f40c0cc244594f5bd75

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 656b327ebc12aed0a50976c95aff4cd77905618a9c16a0f4d7adc3fa2c217085
MD5 165d0911ff8bfb881392d4a8228794d0
BLAKE2b-256 3d8ad39f3d2a1693bc71f20a38e067c63b701f95d3edb6bbb3d99bc7e5dbcca8

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 54837f57cf9a344c000c9001d0886017aa6951c5fc2ff38f3dfc8faf757c7da4
MD5 9a414a90371e6102b0200557e64a2293
BLAKE2b-256 be3e389750ec99086d4dc3fe1476b449a180543bc0f339b2181a0dfbec07a629

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 195cbccd4865056ad1b5dae8e4958023cf1b412b75d68694ae0c02c4a57f8750
MD5 1f682f13e077c601701a3f09f03a67c6
BLAKE2b-256 fb62222bcc85c8365e7b89631b2fecc0e278266c73f9ae5f88d3a0f4201d98ec

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 f0b6979ae4093498b3e6c39bff1de08cd154f5f5ddede7bfda6d5c5d2f0384d1
MD5 89f20c65975637dec2dc923bf67ace2b
BLAKE2b-256 abb1b0e926315e04eced78819e26c87aeee81d1ba4b77fa05d85862f535ea01e

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 c34a2c2d3607f7bdc15fd519e87df0228bdd42afb4dd7f28e2660eef773b679a
MD5 60732e79de2a00face4f87a7dd7b49e2
BLAKE2b-256 0b4fa1dd9cd71572625c02ee848b6745660742e48dcc2ef8c8d10f71867b804c

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b03da242a04d5cae64c6930cb4cb22c8372719509deef37d7816473ffd3b81ff
MD5 afcef2db72d3e7ceb1c13291661973d4
BLAKE2b-256 29e5dd5bc13037c3884ed30e8a2cdb49dda5949214eed16fec4da47b79130609

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 dc38a5701adce5d77c01d4fef034ac427492fae78339c0430d56a1060c8b7049
MD5 dac7af70e84b8689c3b1d02e48f375d0
BLAKE2b-256 70d140125211b5fae57b6300adbf46c742f9b90c9fa37461ff7241308a0a9abf

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ad21ee8d8c10026c4c914c0da04014bdd246e26fed7f9ac660bed0202bcd72d5
MD5 c03d8b0b68009c4a2868413100d78e8a
BLAKE2b-256 4fcf4feef74d6bf1898d335ba83fbf61b44bada93989dfad1d97cc9136d146f3

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 765ca682a66a6c4890aab40f26185fd527c1cafec15c734e9a70720dfedf6939
MD5 eab9474391cc1e9402a1584c45239719
BLAKE2b-256 f2081b5e1463a2514d61efe0a382ae1396b25dac4348fffd3f59110c06c1b016

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 e6e6d26d9a36d39fd8a1fcf62678a6adf0001d048f507e021c5ab2ff2356306b
MD5 dc5d43e3f26c9a7ae314afbc25f90a8a
BLAKE2b-256 9db57d59b11f83e5956ab5801e3e39e9bb76048bf2818d6a7966c9a802980ca1

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.4-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.4-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5a5e88c57a744eecc2b0f84c6a2f20edd6b443f7732d2115676bfaee477e145c
MD5 5a05711fdfbd0d720534b1a827e8ea7c
BLAKE2b-256 a932f5beb26751b8d045ac0a89a449c63a0bd2bd6351d4e187f007104ca383e4

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page