Qwen Tokenizer Lite
A lightweight, standalone Python package for Qwen tokenization, extracted and optimized for simple installation via PyPI. It wraps tiktoken to provide fast encoding/decoding and includes utility functions for text processing.
Installation
Install the package directly from PyPI:
pip install q-tokenizer
Quick Start
The package initializes the tokenizer automatically using the bundled vocabulary file, so you can start tokenizing immediately.
from q_tokenizer import count_tokens, has_chinese_chars, remove_chinese_chars
# 1. Count tokens easily
text = "Hello, this is a test string."
print(count_tokens(text))
# 2. Check for Chinese characters
print(has_chinese_chars("Hello World")) # False
print(has_chinese_chars("Hello 世界")) # True
# 3. Remove Chinese characters
clean_text = remove_chinese_chars("Hello 世界")
print(clean_text) # Output: "Hello "
Advanced Usage
For more control over the tokenization process, you can use the QWenTokenizer class directly.
Initialization
The default vocabulary (qwen.tiktoken) is included in the package. If you have a custom vocabulary, you can pass the file path.
from q_tokenizer import QWenTokenizer
# Initialize with default bundled vocab
tokenizer = QWenTokenizer()
# OR initialize with extra vocab
# tokenizer = QWenTokenizer(extra_vocab_file='path/to/extra_vocab.tiktoken')
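For reference, tiktoken-format vocabulary files such as qwen.tiktoken store one base64-encoded token and its integer rank per line. A minimal loader sketch (the function name load_tiktoken_vocab is illustrative; this is not part of the q-tokenizer API, and tiktoken itself ships an equivalent helper):

```python
import base64

def load_tiktoken_vocab(path):
    """Parse a .tiktoken file: each non-empty line is '<base64 token> <rank>'."""
    ranks = {}
    with open(path, "rb") as f:
        for line in f:
            if not line.strip():
                continue
            token_b64, rank = line.split()
            ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks
```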
Encoding and Decoding
text = "<|im_start|>user你好<|im_end|>"
# Tokenize
tokens = tokenizer.tokenize(text)
print(tokens)
# Encode to IDs
ids = tokenizer.encode(text)
print(ids)
# Decode back to string
decoded_text = tokenizer.decode(ids)
print(decoded_text)
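Under the hood, tiktoken-style tokenizers match special tokens such as <|im_start|> as whole units before byte-level BPE runs on the remaining text; otherwise they would be encoded as ordinary characters. A rough illustration of that pre-split step (SPECIAL_TOKENS and split_on_special are hypothetical names for this sketch, not part of the package API):

```python
import re

# Two of Qwen's chat special tokens; the real tokenizer registers more.
SPECIAL_TOKENS = ["<|im_start|>", "<|im_end|>"]

def split_on_special(text):
    # A capturing group makes re.split keep the special tokens themselves.
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    return [part for part in re.split(pattern, text) if part]
```

Each special-token segment is then mapped directly to its reserved ID, while ordinary segments go through BPE.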
Truncation
Truncate text based on token count, useful for LLM context limits.
long_text = """
A tokenizer whispers, each fragment to break,
From words into tokens, the journey begins.
Qwen counts the pieces, where meanings spin,
The rhythm of language yours, forever mine.
"""
# Standard truncation (keeps the start)
short_text = tokenizer.truncate(long_text, max_token=10)
# Smart truncation (keeps start and end, adds "..." in the middle)
smart_text = tokenizer.truncate(long_text, max_token=10, keep_both_sides=True)
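The keep_both_sides behavior can be sketched at the token level roughly as follows (a simplified illustration of the idea, not the package's actual implementation, which operates on token IDs and decodes back to text):

```python
def truncate_middle(tokens, max_token, keep_both_sides=False):
    """Keep at most max_token tokens; optionally keep head and tail,
    marking the removed middle with an ellipsis."""
    if len(tokens) <= max_token:
        return tokens
    if not keep_both_sides:
        return tokens[:max_token]
    head = max_token // 2
    tail = max_token - head
    return tokens[:head] + ["..."] + tokens[-tail:]
```

Keeping both ends is often preferable for prompts whose instructions appear at the start and whose most recent context appears at the end.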
Features
- Plug & Play: No need to manually download qwen.tiktoken; it is included in the package.
- Efficient: Built on top of tiktoken for high performance.
- Special Tokens: Full support for <|im_start|>, <|im_end|>, and other special tokens.
- Chinese Utilities: Includes helper functions to detect and strip Chinese characters from text.
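The Chinese-character helpers can be approximated with a check against the CJK Unified Ideographs range (a minimal sketch, assuming the package uses a similar Unicode-range test; its actual ranges may be broader):

```python
def _is_chinese(ch: str) -> bool:
    # CJK Unified Ideographs (U+4E00..U+9FFF); the package may cover more blocks.
    return "\u4e00" <= ch <= "\u9fff"

def has_chinese_chars(text: str) -> bool:
    return any(_is_chinese(ch) for ch in text)

def remove_chinese_chars(text: str) -> str:
    return "".join(ch for ch in text if not _is_chinese(ch))
```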
License & Attribution
This project is licensed under the Apache License, Version 2.0.
Origin
The core tokenization logic in this package is derived from the Qwen-Agent repository.
Why this exists
The original repository includes a wide range of dependencies. q-tokenizer was created for users who only need the essential tokenization functions without the overhead of the entire library.
Key modifications made to the original code:
- Modularized: Extracted only the essential QWenTokenizer class and BPE logic.
- Utility Additions: Added standalone helper functions like has_chinese_chars and remove_chinese_chars.
File details
Details for the file q_tokenizer-0.1.0.tar.gz.
File metadata
- Download URL: q_tokenizer-0.1.0.tar.gz
- Upload date:
- Size: 10.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 86b936bc8533fcae11389c5437af73e6e70e088f828af4baac07df18742e2abd |
| MD5 | eea5d4a47b5ef323851b186a6ea1a0a6 |
| BLAKE2b-256 | 7c5aa453c95c3e0364bceb96df66d22f97db7f9832c479fb81a5864cb0d34dd0 |
File details
Details for the file q_tokenizer-0.1.0-py3-none-any.whl.
File metadata
- Download URL: q_tokenizer-0.1.0-py3-none-any.whl
- Upload date:
- Size: 11.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a101b90037986f9410db501bf120a9439ce2f6117eff06a73971de69e804983 |
| MD5 | 66d90de34c851b2e6fe3862be6806c41 |
| BLAKE2b-256 | e1d514b8cabbfe159a6ce6060f59f62e5cdfc428e4d1bd489dff16f0c6768496 |