
Project description


Toke(n)icer

A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.


News

  • 03/03/2026 0.0.7: Fix Qwen 3.5 MoE compat.

  • 02/09/2026 0.0.6: Fix ChatGLM compat.

  • 09/04/2025 0.0.5: Fix pad_token_id detection for LongCat model.

  • 02/21/2025 0.0.4: ⚡ A tokenicer instance now dynamically inherits the native tokenizer.__class__ of the tokenizer passed in or loaded via our tokenicer.load() api. CI now tests tokenizer compat across 64 different models.

  • 02/10/2025 0.0.2: 🤗 Initial release!

Features:

  • Compatible with all tokenizers recognized by HF Transformers
  • Auto-fix models that do not set padding_token
  • Auto-fix models released with the wrong padding_token: many models incorrectly reuse eos_token as pad_token, which leads to subtle, hidden errors in post-training and inference whenever batching is used (which is almost always)
  • Zero external dependencies outside of Transformers
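To make the eos-as-pad pitfall concrete, here is a minimal, self-contained sketch (plain Python, not Tokenicer internals; the token ids and `loss_mask` helper are illustrative): when pad_token shares an id with eos_token, the loss mask that ignores padding also silently drops the genuine end-of-sequence token the model is supposed to learn.

```python
# Illustrative only: why reusing eos_token as pad_token is risky in training.
EOS_ID = 2
PAD_ID = 3  # a dedicated pad token, distinct from EOS

def loss_mask(ids, pad_id):
    # 1 = position contributes to the training loss, 0 = ignored as padding
    return [0 if t == pad_id else 1 for t in ids]

# With a distinct pad token, the real EOS position is still trained.
seq = [5, 9, EOS_ID, PAD_ID, PAD_ID]
assert loss_mask(seq, PAD_ID) == [1, 1, 1, 0, 0]

# With pad_token == eos_token, the real EOS is masked out along with padding.
seq_shared = [5, 9, EOS_ID, EOS_ID, EOS_ID]
assert loss_mask(seq_shared, EOS_ID) == [1, 1, 0, 0, 0]
```

The same ambiguity affects attention masks during batched inference, which is why a dedicated pad token matters.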

Upcoming Features:

  • Add automatic tokenizer validation to model training and subsequent inference so that not only the tokenizer config but the actual encode/decode behavior is fully re-validated on model load. Inference and training engines often modify the original tokenizer, causing subtle, inaccurate output when inference is performed on a platform disjoint from the trainer.
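A hedged sketch of what such a round-trip check could look like (the `validate_round_trip` helper and the toy character-level tokenizer below are illustrative stand-ins, not the planned Tokenicer API): encode a set of probe strings, decode them back, and report any text that fails to survive the trip.

```python
# Illustrative sketch of encode/decode re-validation on model load.
def validate_round_trip(encode, decode, probes):
    """Return the probe strings that do not survive an encode/decode round trip."""
    return [text for text in probes if decode(encode(text)) != text]

# Toy tokenizer: maps each character to its Unicode codepoint and back.
enc = lambda s: [ord(c) for c in s]
dec = lambda ids: "".join(chr(i) for i in ids)

assert validate_round_trip(enc, dec, ["hello", "café"]) == []

# A lossy decoder (here: one that lowercases) is caught by the check.
lossy_dec = lambda ids: dec(ids).lower()
assert validate_round_trip(enc, lossy_dec, ["Hello", "world"]) == ["Hello"]
```

A real implementation would run probes chosen to exercise special tokens, whitespace, and multilingual text, since those are where engine-modified tokenizers typically drift.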

Install

PIP/UV

pip install -v tokenicer
uv pip install -v tokenicer

Install from source

# clone repo
git clone https://github.com/ModelCloud/Tokenicer.git && cd Tokenicer

# install from source
pip install -v .

Usage

  • Replace all calls to AutoTokenizer.from_pretrained() with Tokenicer.load(): args are 100% compatible with AutoTokenizer
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# With `Tokenicer.load()`
from tokenicer import Tokenicer

# Returns `Tokenicer` instance that inherits original `Qwen2TokenizerFast` type.
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')

# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be `trained` and `inferenced` correctly with `batch` and `masks`.
# Now use the new tokenizer like any normal HF PretrainedTokenizer(Fast)
print(f"pad_token: `{tokenizer.pad_token}`")
  • If you already have a loaded or composite config, pass it directly so Tokenicer can normalize the resolved text config in-place:
tokenizer = Tokenicer.load(tokenizer, model_config=model.config)

Citation

@misc{tokenicer,
    author = {ModelCloud.ai and qubitium@modelcloud.ai},
    title = {Toke(n)icer},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/modelcloud/tokenicer}},
    note = {Contact: qubitium@modelcloud.ai}
}

Project details


Download files

Download the file for your platform.

Source Distribution

tokenicer-0.0.12.tar.gz (14.1 kB)

Uploaded Source

File details

Details for the file tokenicer-0.0.12.tar.gz.

File metadata

  • Download URL: tokenicer-0.0.12.tar.gz
  • Upload date:
  • Size: 14.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for tokenicer-0.0.12.tar.gz
Algorithm Hash digest
SHA256 f5048d65e1d8bb01945d6e3ba7b78c993ddf408ecaa5b60f2f01d937fc7d1b44
MD5 bcbc5e55fcbe96b745a052399435045a
BLAKE2b-256 9c35c0027b7657ea03f48c8f6453c21aaea9f7bdaea3e7f40b1bff3061d9b80f

