Toke(n)icer
A (nicer) tokenizer you want to use for model inference and training, with all known preventable gotchas normalized or auto-fixed.
News
- 03/03/2026 0.0.7: Fix Qwen 3.5 MoE compat.
- 02/09/2026 0.0.6: Fix ChatGLM compat.
- 09/04/2025 0.0.5: Fix `pad_token_id` detection for the `LongCat` model.
- 02/21/2025 0.0.4: ⚡ A `Tokenicer` instance now dynamically inherits the native tokenizer `__class__` of the tokenizer passed in or loaded via our `Tokenicer.load()` api. CI now tests tokenizer compat across 64 different models.
- 02/10/2025 0.0.2: 🤗 Initial release!
Features:
- Compatible with all HF `Transformers`-recognized tokenizers
- Auto-fixes models that do not set a `padding_token`
- Auto-fixes models released with the wrong `padding_token`: many models incorrectly use `eos_token` as `pad_token`, which leads to subtle and hidden errors in post-training and inference whenever batching is used (which is almost always)
- Zero external dependencies outside of `Transformers`
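The batching pitfall above can be shown with a toy sketch (hypothetical illustration, not the Tokenicer API): when `pad_token` reuses `eos_token`, an attention mask built by padding id cannot distinguish real end-of-sequence tokens from padding.

```python
# Toy illustration of the eos-as-pad gotcha. The ids below are made up.
EOS_ID = 2
PAD_ID = EOS_ID  # the common mistake: pad_token reuses eos_token

def attention_mask(ids, pad_id):
    # 1 = attend to this position, 0 = treat as padding and ignore.
    return [0 if t == pad_id else 1 for t in ids]

# A sequence that legitimately ends with EOS, then padded to length 6.
seq = [5, 7, 9, EOS_ID, PAD_ID, PAD_ID]
print(attention_mask(seq, PAD_ID))  # [1, 1, 1, 0, 0, 0]
# The genuine EOS at position 3 is masked out along with the padding,
# so the model never sees the end-of-sequence signal. A distinct
# pad_token id keeps the two roles separable.
```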
Upcoming Features:
- Add automatic tokenizer validation to model training and subsequent inference, so that not only the tokenizer config but the actual `decode`/`encode` behavior is 100% re-validated on model load. Inference and training engines often modify the traditional tokenizers, causing subtle and inaccurate output when inference is performed on a platform disjoint from the trainer.
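A minimal sketch of what such a round-trip check could look like (hypothetical, not part of Tokenicer; `FakeTokenizer` is a stand-in for any object exposing `encode`/`decode`):

```python
# Sketch: validate that encode -> decode reproduces probe strings exactly.
def validate_roundtrip(tokenizer, probes):
    """Return the probe strings whose encode/decode round trip drifts."""
    failures = []
    for text in probes:
        ids = tokenizer.encode(text)
        if tokenizer.decode(ids) != text:
            failures.append(text)
    return failures

class FakeTokenizer:
    # Trivial stand-in tokenizer: one token per character codepoint.
    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, ids):
        return "".join(chr(i) for i in ids)

print(validate_roundtrip(FakeTokenizer(), ["hello", "padding test"]))
# -> [] (no drift for this trivial tokenizer)
```

A real implementation would run probes covering special tokens, whitespace, and non-ASCII text, since those are where engine-modified tokenizers typically drift.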
Install
PIP/UV
pip install -v tokenicer
uv pip install -v tokenicer
Install from source
# clone repo
git clone https://github.com/ModelCloud/Tokenicer.git && cd Tokenicer
# compile
pip install -v .
Usage
- Replace all calls to `AutoTokenizer.from_pretrained()` with `Tokenicer.load()`: args are 100% compatible with `AutoTokenizer`
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')
# With `Tokenicer.load()`
from tokenicer import Tokenicer
# Returns `Tokenicer` instance that inherits original `Qwen2TokenizerFast` type.
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')
# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be `trained` and `inferenced` correctly with `batch` and `masks`.
# Now use the new tokenizer like any normal HF PretrainedTokenizer(Fast)
print(f"pad_token: `{tokenizer.pad_token}`")
- If you already have a loaded or composite config, pass it directly so Tokenicer can normalize the resolved text config in-place:
tokenizer = Tokenicer.load(tokenizer, model_config=model.config)
Citation
@misc{tokenicer,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {Toke(n)icer},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/tokenicer}},
note = {Contact: qubitium@modelcloud.ai}
}
Source Distribution

Details for the file tokenicer-0.0.12.tar.gz:

- Upload date:
- Size: 14.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | f5048d65e1d8bb01945d6e3ba7b78c993ddf408ecaa5b60f2f01d937fc7d1b44 |
| MD5 | bcbc5e55fcbe96b745a052399435045a |
| BLAKE2b-256 | 9c35c0027b7657ea03f48c8f6453c21aaea9f7bdaea3e7f40b1bff3061d9b80f |