
Project description


Toke(n)icer

A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.


News

  • 03/03/2026 0.0.7: Fix Qwen 3.5 MoE compat.

  • 02/09/2026 0.0.6: Fix ChatGLM compat.

  • 09/04/2025 0.0.5: Fix pad_token_id detection for LongCat model.

  • 02/21/2025 0.0.4: ⚡ A tokenicer instance now dynamically inherits the native tokenizer.__class__ of the tokenizer passed in or loaded via our tokenicer.load() api. CI now tests tokenizer compat across 64 different models.

  • 02/10/2025 0.0.2: 🤗 Initial release!

Features:

  • Compatible with all tokenizers recognized by HF Transformers
  • Auto-fix models that do not set padding_token
  • Auto-fix models released with the wrong padding_token: many models incorrectly reuse eos_token as pad_token, which leads to subtle, hidden errors in post-training and inference whenever batching is used (which is almost always)
  • Zero external dependencies outside of Transformers
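To make the eos-as-pad pitfall concrete, here is a minimal, self-contained sketch (plain Python, not Tokenicer internals; the token ids and `loss_mask` helper are illustrative): when pad_token shares an id with eos_token, the loss mask that ignores padding also silently drops the genuine end-of-sequence token the model is supposed to learn.

```python
# Illustrative only: why reusing eos_token as pad_token is risky in training.
EOS_ID = 2
PAD_ID = 3  # a dedicated pad token, distinct from EOS

def loss_mask(ids, pad_id):
    # 1 = position contributes to the training loss, 0 = ignored as padding
    return [0 if t == pad_id else 1 for t in ids]

# With a distinct pad token, the real EOS position is still trained.
seq = [5, 9, EOS_ID, PAD_ID, PAD_ID]
assert loss_mask(seq, PAD_ID) == [1, 1, 1, 0, 0]

# With pad_token == eos_token, the real EOS is masked out along with padding.
seq_shared = [5, 9, EOS_ID, EOS_ID, EOS_ID]
assert loss_mask(seq_shared, EOS_ID) == [1, 1, 0, 0, 0]
```

The same ambiguity affects attention masks during batched inference, which is why a dedicated pad token matters.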

Upcoming Features:

  • Add automatic tokenizer validation to model training and subsequent inference so that not only the tokenizer config but the actual encode/decode behavior is fully re-validated on model load. Inference and training engines often modify the original tokenizer, causing subtle, inaccurate output when inference is performed on a platform disjoint from the trainer.
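A hedged sketch of what such a round-trip check could look like (the `validate_round_trip` helper and the toy character-level tokenizer below are illustrative stand-ins, not the planned Tokenicer API): encode a set of probe strings, decode them back, and report any text that fails to survive the trip.

```python
# Illustrative sketch of encode/decode re-validation on model load.
def validate_round_trip(encode, decode, probes):
    """Return the probe strings that do not survive an encode/decode round trip."""
    return [text for text in probes if decode(encode(text)) != text]

# Toy tokenizer: maps each character to its Unicode codepoint and back.
enc = lambda s: [ord(c) for c in s]
dec = lambda ids: "".join(chr(i) for i in ids)

assert validate_round_trip(enc, dec, ["hello", "café"]) == []

# A lossy decoder (here: one that lowercases) is caught by the check.
lossy_dec = lambda ids: dec(ids).lower()
assert validate_round_trip(enc, lossy_dec, ["Hello", "world"]) == ["Hello"]
```

A real implementation would run probes chosen to exercise special tokens, whitespace, and multilingual text, since those are where engine-modified tokenizers typically drift.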

Install

PIP/UV

pip install -v tokenicer
uv pip install -v tokenicer

Install from source

# clone repo
git clone https://github.com/ModelCloud/Tokenicer.git && cd Tokenicer

# install from source
pip install -v .

Usage

  • Replace all calls to AutoTokenizer.from_pretrained() with Tokenicer.load(): args are 100% compatible with AutoTokenizer
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# With `Tokenicer.load()`
from tokenicer import Tokenicer

# Returns `Tokenicer` instance that inherits original `Qwen2TokenizerFast` type.
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')

# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be `trained` and `inferenced` correctly with `batch` and `masks`.
# Now use the new tokenizer like any normal HF PretrainedTokenizer(Fast)
print(f"pad_token: `{tokenizer.pad_token}`")
  • If you already have a loaded or composite config, pass it directly so Tokenicer can normalize the resolved text config in-place:
tokenizer = Tokenicer.load(tokenizer, model_config=model.config)

Citation

@misc{tokenicer,
    author = {ModelCloud.ai and qubitium@modelcloud.ai},
    title = {Toke(n)icer},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/modelcloud/tokenicer}},
    note = {Contact: qubitium@modelcloud.ai}
}

Project details


Download files

Download the file for your platform.

Source Distribution

tokenicer-0.0.12.tar.gz (14.1 kB)

Uploaded Source

File details

Details for the file tokenicer-0.0.12.tar.gz.

File metadata

  • Download URL: tokenicer-0.0.12.tar.gz
  • Upload date:
  • Size: 14.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.12

File hashes

Hashes for tokenicer-0.0.12.tar.gz
Algorithm Hash digest
SHA256 f5048d65e1d8bb01945d6e3ba7b78c993ddf408ecaa5b60f2f01d937fc7d1b44
MD5 bcbc5e55fcbe96b745a052399435045a
BLAKE2b-256 9c35c0027b7657ea03f48c8f6453c21aaea9f7bdaea3e7f40b1bff3061d9b80f

