Skip to main content

A blazing-fast BPE tokenizer for Qwen models, built with Rust. Achieves 6x faster tokenization with parallelism compared to HuggingFace tokenizers.

Project description

bpe-qwen

A blazing-fast BPE tokenizer for Qwen models, built with Rust and the rust-gems BPE crate. Achieves 6x faster tokenization by default and 12x faster with parallelization compared to HuggingFace tokenizers.

Features

  • 🚀 Linear-time tokenization based on the rust-gems BPE crate for fast tokenization
  • 🎯 Optimized pretokenization for Qwen's pretokenization pattern using a two-pass approach instead of the base lookahead regex
  • 🐍 Python bindings via PyO3 for seamless integration
  • 📦 Native BPE format support (vocab.json + merges.txt)
  • 6x faster encoding by default, 12x faster with parallelism, and 2x faster decoding compared to HuggingFace
  • 100% accuracy verified across comprehensive test suite, including special tokens

Installation

pip install bpe-qwen

Usage

Quick Start

Use bpe-qwen as a drop-in replacement for HuggingFace tokenizers:

# Patch transformers to use bpe-qwen for Qwen models
from bpe_qwen import AutoLinearTokenizer

# This automatically uses bpe-qwen under the hood
tokenizer = AutoLinearTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")

# Use it exactly like a HuggingFace tokenizer
outputs = tokenizer(
    "Hello, world!",
    return_tensors="pt",
    padding=True,
    truncation=True
)
print(outputs["input_ids"])

# Batch processing with native HuggingFace API
batch = tokenizer(
    ["Text 1", "Text 2", "Text 3"],
    padding=True,
    return_attention_mask=True
)

Benchmark Results

Performance comparison with HuggingFace tokenizers on WikiText dataset (2,891 texts, 1.3M characters):

Sequential Performance:

Tokenizer Speed Speedup vs HF
bpe-qwen 6.40M tokens/sec 6.28x
HuggingFace 1.02M tokens/sec 1.00x

Parallel Performance (8 workers):

Tokenizer Speed Speedup vs HF Parallel Benefit
bpe-qwen 33.08M tokens/sec 12.52x 5.17x vs sequential
HuggingFace 2.64M tokens/sec 1.00x 2.59x vs sequential

Token consistency verified: All methods produce identical 298,938 tokens

Development

Building from Source

# Install Rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/sweepai/bpe-qwen.git
cd bpe-qwen
maturin develop --release

# Run tests
python test_simple.py
python benchmark.py

Limitations

  • Requires vocab.json and merges.txt files (not tokenizer.json)
  • Some multi-byte UTF-8 characters are not handled correctly

Future Improvements

Potential Optimizations

  • True SIMD intrinsics: Explicit vector instructions for even faster ASCII detection and token processing
  • Custom allocators: Specialized memory management for tokenization workloads

Feature Enhancements

  • Early stopping for tokenization based on token count
  • Support for more model architectures
  • Batch processing optimizations

Acknowledgments

  • Built on top of the excellent rust-gems BPE crate
  • Inspired by the need for faster tokenization in production ML pipelines

This entire project was written by Sweep AI, an AI plugin for JetBrains IDEs

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bpe_qwen-0.1.5.tar.gz (312.8 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

bpe_qwen-0.1.5-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.5-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded PyPymanylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.5-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.5-cp312-cp312-macosx_11_0_arm64.whl (986.8 kB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

bpe_qwen-0.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.5-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.5-cp311-cp311-macosx_11_0_arm64.whl (986.9 kB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

bpe_qwen-0.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

bpe_qwen-0.1.5-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.1 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ ARM64

bpe_qwen-0.1.5-cp310-cp310-macosx_11_0_arm64.whl (987.5 kB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

File details

Details for the file bpe_qwen-0.1.5.tar.gz.

File metadata

  • Download URL: bpe_qwen-0.1.5.tar.gz
  • Upload date:
  • Size: 312.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.15

File hashes

Hashes for bpe_qwen-0.1.5.tar.gz
Algorithm Hash digest
SHA256 67e2c4a6f75daabc14cef47d8c4e00cd56791ccdf99e331388628c7e18e8d880
MD5 f3fd599d599a6383588801eced0290e6
BLAKE2b-256 80a7124e83786c4aa75706b18ea003a6023cf44155c8b93ee00b01ef94d70cc0

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 c0a75d50c99459c90ffaee78082caab55e594e2053ad9c32f86065c05eb46ed2
MD5 5a785355b330952c4329b2b8afe8737e
BLAKE2b-256 1e5f0cf43281b281fde57cba70b1bf8bdf21e1e4e109dff855bc947842c47555

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-pp310-pypy310_pp73-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 d61237530f17fd74d9fd9def41221a8b6cec4f437e1eb3675fa7705cc4a2115b
MD5 fb0b3c4b3f31eb7ecc97e0b4c44dc9f7
BLAKE2b-256 a65ab4d3263f19133d8cc05016e9648c409c319d79fcd598f89ff0761b14d6c3

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d7504544b5c08532068f4b7e718a7d127b81761313a56ea58df3265e29f57965
MD5 9f553670263887575e571ba8d41525e4
BLAKE2b-256 2bffe6ca8338d4482a9a0b01052dabcf03cc20fd976b2118ff86937e39e5d8cb

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 d1a6f2cf1416a340faee84b5f42994b56025c04363c60e4782ef7199a5e0dcbc
MD5 15db05e9e203f933b24e23af07eec32f
BLAKE2b-256 de6351379bb93d172d554977a44692f5e1ee316170dab502066e748b269c2e70

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8cf6d6870c4cc0e375a277e2f1dd3f23151ef48d3961cb64bbc9c013d40fdff6
MD5 61349696fe754a360cf83084c4d52196
BLAKE2b-256 f735010cd7e9a5d4f16d8e65cb3c02b64b766a7a7bd7d46bfd6dac861b5f2819

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 cb0967a8281b4d2f25c787f215a9e49dafacd2f00be2948f71e06c6ba99e9127
MD5 2d0072367e861ce1d18c7dca389d02bc
BLAKE2b-256 30b95e2f095512c7213ba221f2fa05433e68297dd1e85b1476574f014ca29c07

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 74f06aff7467b806725335848360f49aa795fec9e89130b8eab7f0864dd1af94
MD5 06d58ec628064f12f989f31bb4fe017b
BLAKE2b-256 81f8022c0e757a00666c3149346f8c3ef43c131ef5123946bcc1afca57d39762

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f545dc93f067d5e3413d6e7df03a18b7d902383eea7928a589abe757f6bd2947
MD5 2954789e5645365382008fe22e7b7725
BLAKE2b-256 24e5ac0e091ae9666910bc1aad23e6dc787ab231367f85ae312e688aefc134e9

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 37c53d3a98da0c02288948ed8e250d87849f3d8e6c361cb7c8993694e5eee8eb
MD5 99dc75d0016a77b60e6fa45b0c2d804f
BLAKE2b-256 0d091862ed9b93a8fc1cff37f7f316d3451978c1f68ee6af9849eff68116b32c

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 13c8cdef91dff2ff2eb9ce40e7ebafce678d8c7f79a688aeb545820be464eb66
MD5 be411da1277bce64ba2b6592981bab47
BLAKE2b-256 8c68eb3fff5429c7b6008fec02d9fb68dc1c123e39e9511a40a3b9239615f91c

See more details on using hashes here.

File details

Details for the file bpe_qwen-0.1.5-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for bpe_qwen-0.1.5-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 f43f93c9b968566928a308d6ca5052a8055176ef80a94adb18005c9788f54c83
MD5 a0dd5cda497ee09aea94941a3562195c
BLAKE2b-256 222c4781d5f9147e24bb739efc97d3bd6baf729ddaa7927d90907cd9a8edf541

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page