
A lightweight, standalone Qwen tokenizer with string utilities.

Project description

Qwen Tokenizer Lite


A lightweight, standalone Python package for Qwen tokenization, extracted and optimized for simple installation via PyPI. It wraps tiktoken to provide fast encoding/decoding and includes utility functions for text processing.

Installation

Install the package directly from PyPI:

pip install q-tokenizer

Quick Start

The package initializes the tokenizer automatically using the bundled vocabulary file, so you can start tokenizing immediately.

from q_tokenizer import count_tokens, has_chinese_chars, remove_chinese_chars

# 1. Count tokens easily
text = "Hello, this is a test string."
print(count_tokens(text)) 

# 2. Check for Chinese characters
print(has_chinese_chars("Hello World"))        # False
print(has_chinese_chars("Hello 世界"))         # True

# 3. Remove Chinese characters
clean_text = remove_chinese_chars("Hello 世界")
print(clean_text)  # Output: "Hello "

Advanced Usage

For more control over the tokenization process, you can use the QWenTokenizer class directly.

Initialization

The default vocabulary (qwen.tiktoken) is included in the package. If you have a custom vocabulary, you can pass the file path.

from q_tokenizer import QWenTokenizer

# Initialize with default bundled vocab
tokenizer = QWenTokenizer()

# OR initialize with extra vocab
# tokenizer = QWenTokenizer(extra_vocab_file='path/to/extra_vocab.tiktoken')

Encoding and Decoding

text = "<|im_start|>user你好<|im_end|>"

# Tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to IDs
ids = tokenizer.encode(text)
print(ids)

# Decode back to string
decoded_text = tokenizer.decode(ids)
print(decoded_text)
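Markers like <|im_start|> and <|im_end|> are encoded as single reserved IDs rather than being broken apart by BPE. A common way to achieve this, sketched below as an illustration (not the package's actual implementation), is to split the input on the special-token markers first and only run BPE on the ordinary segments:

```python
import re

# Illustrative sketch: split input text into special-token markers and
# ordinary segments, so markers can be mapped to reserved IDs while the
# rest goes through BPE. SPECIAL_TOKENS here is a hypothetical subset.
SPECIAL_TOKENS = ("<|im_start|>", "<|im_end|>")
_pattern = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")

def split_on_special(text: str) -> list:
    """Split text into special-token markers and ordinary segments."""
    return [part for part in _pattern.split(text) if part]

print(split_on_special("<|im_start|>user你好<|im_end|>"))
# → ['<|im_start|>', 'user你好', '<|im_end|>']
```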

Truncation

Truncate text to a maximum token count, which is useful for fitting text into LLM context limits.

long_text = """
A tokenizer whispers, each fragment to break,
From words into tokens, the journey begins.
Qwen counts the pieces, where meaning spin,
The rhythm of language yours, forever mine.
"""

# Standard truncation (keeps the start)
short_text = tokenizer.truncate(long_text, max_token=10)

# Smart truncation (keeps start and end, adds "..." in the middle)
smart_text = tokenizer.truncate(long_text, max_token=10, keep_both_sides=True)
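The two truncation modes can be sketched as follows. This is an illustration of the semantics only, with a whitespace split standing in for Qwen's BPE tokenizer; the package's real implementation operates on BPE tokens:

```python
# Illustrative sketch of token-count truncation (NOT the package's actual
# implementation). text.split() stands in for tokenizer.tokenize(text).
def truncate(text: str, max_token: int, keep_both_sides: bool = False) -> str:
    tokens = text.split()
    if len(tokens) <= max_token:
        return text
    if keep_both_sides:
        # Keep the head and tail, joined by an ellipsis marker.
        head = max_token // 2
        tail = max_token - head
        tokens = tokens[:head] + ["..."] + tokens[-tail:]
    else:
        # Standard truncation: keep only the start.
        tokens = tokens[:max_token]
    return " ".join(tokens)

print(truncate("one two three four five six", 4))
# → 'one two three four'
print(truncate("one two three four five six", 4, keep_both_sides=True))
# → 'one two ... five six'
```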

Features

  • Plug & Play: No need to manually download qwen.tiktoken; it is included in the package.
  • Efficient: Built on top of tiktoken for high performance.
  • Special Tokens: Full support for <|im_start|>, <|im_end|>, and other special tokens.
  • Chinese Utilities: Includes helper functions to detect and strip Chinese characters from text.
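A minimal sketch of how the Chinese-character helpers could work, assuming detection is based on the CJK Unified Ideographs block (U+4E00 to U+9FFF); the package may cover additional Unicode ranges:

```python
import re

# Hypothetical sketch of the helpers, assuming the basic CJK Unified
# Ideographs range; the real package may match a wider set of ranges.
_CHINESE = re.compile(r"[\u4e00-\u9fff]")

def has_chinese_chars(text: str) -> bool:
    """Return True if the text contains at least one Chinese character."""
    return _CHINESE.search(text) is not None

def remove_chinese_chars(text: str) -> str:
    """Strip all Chinese characters from the text."""
    return _CHINESE.sub("", text)

print(has_chinese_chars("Hello World"))    # → False
print(has_chinese_chars("Hello 世界"))     # → True
print(remove_chinese_chars("Hello 世界"))  # → 'Hello '
```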

License & Attribution

This project is licensed under the Apache License, Version 2.0.

Origin

The core tokenization logic in this package is derived from the Qwen-Agent Repository.

Why this exists

The original repository includes a wide range of dependencies. q-tokenizer was created for users who only need the essential tokenization functions without the overhead of the entire library.

Key modifications made to the original code:

  • Modularized: Extracted only the essential QWenTokenizer class and BPE logic.
  • Utility Additions: Added standalone helper functions like has_chinese_chars and remove_chinese_chars.

Project details


Download files

Download the file for your platform.

Source Distribution

q_tokenizer-0.1.0.tar.gz (10.2 kB)

Built Distribution


q_tokenizer-0.1.0-py3-none-any.whl (11.3 kB)

File details

Details for the file q_tokenizer-0.1.0.tar.gz.

File metadata

  • Download URL: q_tokenizer-0.1.0.tar.gz
  • Size: 10.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for q_tokenizer-0.1.0.tar.gz:

  • SHA256: 86b936bc8533fcae11389c5437af73e6e70e088f828af4baac07df18742e2abd
  • MD5: eea5d4a47b5ef323851b186a6ea1a0a6
  • BLAKE2b-256: 7c5aa453c95c3e0364bceb96df66d22f97db7f9832c479fb81a5864cb0d34dd0


File details

Details for the file q_tokenizer-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: q_tokenizer-0.1.0-py3-none-any.whl
  • Size: 11.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for q_tokenizer-0.1.0-py3-none-any.whl:

  • SHA256: 2a101b90037986f9410db501bf120a9439ce2f6117eff06a73971de69e804983
  • MD5: 66d90de34c851b2e6fe3862be6806c41
  • BLAKE2b-256: e1d514b8cabbfe159a6ce6060f59f62e5cdfc428e4d1bd489dff16f0c6768496

