
A lightweight, standalone Qwen tokenizer with string utilities.

Project description

Qwen Tokenizer Lite


A lightweight, standalone Python package for Qwen tokenization, extracted and optimized for simple installation via PyPI. It wraps tiktoken to provide fast encoding/decoding and includes utility functions for text processing.

Installation

Install the package directly from PyPI:

pip install q-tokenizer

Quick Start

The package initializes the tokenizer automatically using the bundled vocabulary file, so you can start tokenizing immediately.

from q_tokenizer import count_tokens, has_chinese_chars, remove_chinese_chars

# 1. Count tokens easily
text = "Hello, this is a test string."
print(count_tokens(text)) 

# 2. Check for Chinese characters
print(has_chinese_chars("Hello World"))        # False
print(has_chinese_chars("Hello 世界"))         # True

# 3. Remove Chinese characters
clean_text = remove_chinese_chars("Hello 世界")
print(clean_text)  # Output: "Hello "

Advanced Usage

For more control over the tokenization process, you can use the QWenTokenizer class directly.

Initialization

The default vocabulary (qwen.tiktoken) is included in the package. If you have a custom vocabulary, you can pass the file path.

from q_tokenizer import QWenTokenizer

# Initialize with default bundled vocab
tokenizer = QWenTokenizer()

# OR initialize with extra vocab
# tokenizer = QWenTokenizer(extra_vocab_file='path/to/extra_vocab.tiktoken')

Encoding and Decoding

text = "<|im_start|>user你好<|im_end|>"

# Tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to IDs
ids = tokenizer.encode(text)
print(ids)

# Decode back to string
decoded_text = tokenizer.decode(ids)
print(decoded_text)
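Markers like <|im_start|> and <|im_end|> are encoded as single reserved IDs rather than being broken apart by BPE. A common way to achieve this, sketched below as an illustration (not the package's actual implementation), is to split the input on the special-token markers first and only run BPE on the ordinary segments:

```python
import re

# Illustrative sketch: split input text into special-token markers and
# ordinary segments, so markers can be mapped to reserved IDs while the
# rest goes through BPE. SPECIAL_TOKENS here is a hypothetical subset.
SPECIAL_TOKENS = ("<|im_start|>", "<|im_end|>")
_pattern = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")

def split_on_special(text: str) -> list:
    """Split text into special-token markers and ordinary segments."""
    return [part for part in _pattern.split(text) if part]

print(split_on_special("<|im_start|>user你好<|im_end|>"))
# → ['<|im_start|>', 'user你好', '<|im_end|>']
```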

Truncation

Truncate text to a maximum token count, which is useful for fitting text into LLM context limits.

long_text = """
A tokenizer whispers, each fragment to break,
From words into tokens, the journey begins.
Qwen counts the pieces, where meaning spin,
The rhythm of language yours, forever mine.
"""

# Standard truncation (keeps the start)
short_text = tokenizer.truncate(long_text, max_token=10)

# Smart truncation (keeps start and end, adds "..." in the middle)
smart_text = tokenizer.truncate(long_text, max_token=10, keep_both_sides=True)
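The two truncation modes can be sketched as follows. This is an illustration of the semantics only, with a whitespace split standing in for Qwen's BPE tokenizer; the package's real implementation operates on BPE tokens:

```python
# Illustrative sketch of token-count truncation (NOT the package's actual
# implementation). text.split() stands in for tokenizer.tokenize(text).
def truncate(text: str, max_token: int, keep_both_sides: bool = False) -> str:
    tokens = text.split()
    if len(tokens) <= max_token:
        return text
    if keep_both_sides:
        # Keep the head and tail, joined by an ellipsis marker.
        head = max_token // 2
        tail = max_token - head
        tokens = tokens[:head] + ["..."] + tokens[-tail:]
    else:
        # Standard truncation: keep only the start.
        tokens = tokens[:max_token]
    return " ".join(tokens)

print(truncate("one two three four five six", 4))
# → 'one two three four'
print(truncate("one two three four five six", 4, keep_both_sides=True))
# → 'one two ... five six'
```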

Features

  • Plug & Play: No need to manually download qwen.tiktoken; it is included in the package.
  • Efficient: Built on top of tiktoken for high performance.
  • Special Tokens: Full support for <|im_start|>, <|im_end|>, and other special tokens.
  • Chinese Utilities: Includes helper functions to detect and strip Chinese characters from text.
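A minimal sketch of how the Chinese-character helpers could work, assuming detection is based on the CJK Unified Ideographs block (U+4E00 to U+9FFF); the package may cover additional Unicode ranges:

```python
import re

# Hypothetical sketch of the helpers, assuming the basic CJK Unified
# Ideographs range; the real package may match a wider set of ranges.
_CHINESE = re.compile(r"[\u4e00-\u9fff]")

def has_chinese_chars(text: str) -> bool:
    """Return True if the text contains at least one Chinese character."""
    return _CHINESE.search(text) is not None

def remove_chinese_chars(text: str) -> str:
    """Strip all Chinese characters from the text."""
    return _CHINESE.sub("", text)

print(has_chinese_chars("Hello World"))    # → False
print(has_chinese_chars("Hello 世界"))     # → True
print(remove_chinese_chars("Hello 世界"))  # → 'Hello '
```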

License & Attribution

This project is licensed under the Apache License, Version 2.0.

Origin

The core tokenization logic in this package is derived from the Qwen-Agent Repository.

Why this exists

The original repository includes a wide range of dependencies. q-tokenizer was created for users who only need the essential tokenization functions without the overhead of the entire library.

Key modifications made to the original code:

  • Modularized: Extracted only the essential QWenTokenizer class and BPE logic.
  • Utility Additions: Added standalone helper functions like has_chinese_chars and remove_chinese_chars.

Project details


Download files

Download the file for your platform.

Source Distribution

q_tokenizer-0.1.0.tar.gz (10.2 kB)

Built Distribution


q_tokenizer-0.1.0-py3-none-any.whl (11.3 kB)

File details

Details for the file q_tokenizer-0.1.0.tar.gz.

File metadata

  • Download URL: q_tokenizer-0.1.0.tar.gz
  • Size: 10.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for q_tokenizer-0.1.0.tar.gz:

  • SHA256: 86b936bc8533fcae11389c5437af73e6e70e088f828af4baac07df18742e2abd
  • MD5: eea5d4a47b5ef323851b186a6ea1a0a6
  • BLAKE2b-256: 7c5aa453c95c3e0364bceb96df66d22f97db7f9832c479fb81a5864cb0d34dd0


File details

Details for the file q_tokenizer-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: q_tokenizer-0.1.0-py3-none-any.whl
  • Size: 11.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for q_tokenizer-0.1.0-py3-none-any.whl:

  • SHA256: 2a101b90037986f9410db501bf120a9439ce2f6117eff06a73971de69e804983
  • MD5: 66d90de34c851b2e6fe3862be6806c41
  • BLAKE2b-256: e1d514b8cabbfe159a6ce6060f59f62e5cdfc428e4d1bd489dff16f0c6768496

