
Datamodels for HF tokenizers


Skeletoken

This package contains Pydantic datamodels that fully describe the tokenizer.json file used in transformers via Tokenizers. This is useful because the format is complicated and hard to edit by hand.

Rationale

In one sentence: Validate, edit, and transform Hugging Face tokenizers safely.

The Hugging Face tokenizers representation does not reliably allow you to edit tokenizers as a structured object. This means that complex changes to tokenizers require you to edit the tokenizer.json file manually. This is annoying, because the format of this file is complicated.

Furthermore, tokenizers does not give reasonable errors when parsing a tokenizer fails. It does give line/character numbers, but those point to the last character of the section where the parsing fails. For example, inserting an illegal vocabulary item just tells you that there is an issue in the vocabulary somewhere by pointing out the last character of the vocabulary as the place where the error occurs.

This package contains Pydantic datamodels that enforce the same constraints as the tokenizers package. In other words, if you can create a model in this package, the tokenizers package can parse it. This allows you to progressively edit tokenizer.json files while getting useful error messages.
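
To illustrate the difference between "there is an issue in the vocabulary somewhere" and an error that names the offending entry, here is a minimal standard-library sketch. It is not skeletoken's implementation: the function name `validate_vocab` and the two checks shown are assumptions chosen for illustration, and the real constraint set is larger.

```python
import json


def validate_vocab(vocab: dict) -> list[str]:
    """Return one message per invalid entry, naming the exact key path."""
    errors = []
    for token, token_id in vocab.items():
        path = f"model.vocab[{token!r}]"
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(token_id, bool) or not isinstance(token_id, int):
            errors.append(f"{path}: id must be an integer, got {token_id!r}")
        elif token_id < 0:
            errors.append(f"{path}: id must be non-negative, got {token_id}")
    return errors


# A tiny tokenizer.json fragment with two illegal vocabulary items.
raw = '{"model": {"type": "BPE", "vocab": {"a": 0, "b": "1", "c": -3}}}'
errors = validate_vocab(json.loads(raw)["model"]["vocab"])
for message in errors:
    print(message)
# model.vocab['b']: id must be an integer, got '1'
# model.vocab['c']: id must be non-negative, got -3
```

Each message points at the entry that failed instead of the end of the enclosing section, which is the kind of feedback that makes step-by-step editing practical.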

Installation

Install it via pip:

pip install skeletoken

What can it do?

skeletoken allows you to:

  • validate tokenizer.json with human-readable errors
  • edit tokenizers as typed objects (Pydantic)
  • apply common transformations (decasing, greedy merges, etc.)
  • auto-fix common inconsistencies
  • round-trip to tokenizers and transformers
  • apply tokenization changes to transformers, sentence-transformers and pylate models

Example

Here are some examples of what skeletoken can do:

Autofixing a tokenizer

skeletoken autofixes any tokenizer you load. See automatic checks for what gets fixed. For example, the Qwen/Qwen3-0.6B tokenizer has a number of special tokens that are not part of the regular tokenizer vocabulary. This leads to a mismatch between the reported vocabulary size (vocab_size) and the number of tokens the tokenizer can actually produce (len(tokenizer)). skeletoken adds these tokens to the vocabulary automatically.

from transformers import AutoTokenizer
from skeletoken import TokenizerModel

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
# Mismatch due to missing special tokens
print(tokenizer.vocab_size)  # 151643
print(len(tokenizer))  # 151669

# Load a model from the hub.
tokenizer_model = TokenizerModel.from_pretrained("Qwen/Qwen3-0.6B")
# Convert the tokenizer to transformers
tokenizer = tokenizer_model.to_transformers()
# All missing special tokens have been added to the vocabulary
print(tokenizer.vocab_size)  # 151669
print(len(tokenizer))  # 151669
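
The fix can be pictured at the level of the raw tokenizer.json structure. The sketch below is an assumption about what such a repair might look like, not skeletoken's code: `add_missing_added_tokens` is a hypothetical helper, the data is a toy example, and it assumes a model whose vocabulary is a token-to-id mapping (as in BPE or WordPiece).

```python
def add_missing_added_tokens(tokenizer_json: dict) -> list[str]:
    """Copy added tokens that are absent from the model vocabulary into it.

    Mutates tokenizer_json in place and returns the tokens that were added.
    """
    vocab = tokenizer_json["model"]["vocab"]
    added = []
    for entry in tokenizer_json.get("added_tokens", []):
        if entry["content"] not in vocab:
            vocab[entry["content"]] = entry["id"]
            added.append(entry["content"])
    return added


# Toy data: one added token is already in the vocabulary, one is not.
data = {
    "added_tokens": [
        {"id": 2, "content": "<|endoftext|>", "special": True},
        {"id": 3, "content": "<think>", "special": False},
    ],
    "model": {
        "type": "BPE",
        "vocab": {"a": 0, "b": 1, "<|endoftext|>": 2},
        "merges": [],
    },
}

added = add_missing_added_tokens(data)
print(added)  # ['<think>']
print(len(data["model"]["vocab"]))  # 4
```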

Adding components to a tokenizer

skeletoken can add components to a tokenizer. First we load one, and inspect it:

from skeletoken import TokenizerModel

# Directly pull a tokenizer from the hub
tokenizer_model = TokenizerModel.from_pretrained("gpt2")

print(tokenizer_model.model.type)
# ModelType.BPE
print(tokenizer_model.pre_tokenizer.type)
# PreTokenizerType.BYTELEVEL

We can then add a digit splitter to the tokenizer.

from skeletoken import TokenizerModel
from skeletoken.pretokenizers import DigitsPreTokenizer

model = TokenizerModel.from_pretrained("gpt2")
tok = model.to_tokenizer()

# Create the digits pretokenizer
digits = DigitsPreTokenizer(individual_digits=True)
model = model.add_pre_tokenizer(digits)

new_tok = model.to_tokenizer()
print(tok.encode("hello 123").tokens)
# ['hello', 'Ġ123']
print(new_tok.encode("hello 123").tokens)
# ['hello', 'Ġ', '1', '2', '3']
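
For comparison, here is a sketch of the same change done by hand on the tokenizer.json dictionary, using the serialized forms of the Digits and Sequence pretokenizers from tokenizers. The helper name `with_digit_splitting` is hypothetical, and appending the digit splitter after the existing pretokenizer is an assumption; skeletoken may order or merge components differently.

```python
def with_digit_splitting(tokenizer_json: dict) -> dict:
    """Return a copy whose pre_tokenizer also splits digits individually."""
    digits = {"type": "Digits", "individual_digits": True}
    existing = tokenizer_json.get("pre_tokenizer")
    if existing is None:
        combined = digits
    elif existing.get("type") == "Sequence":
        # Extend an existing sequence instead of nesting sequences.
        combined = {
            "type": "Sequence",
            "pretokenizers": [*existing["pretokenizers"], digits],
        }
    else:
        combined = {"type": "Sequence", "pretokenizers": [existing, digits]}
    return {**tokenizer_json, "pre_tokenizer": combined}


byte_level = {
    "type": "ByteLevel",
    "add_prefix_space": False,
    "trim_offsets": True,
    "use_regex": True,
}
original = {"pre_tokenizer": byte_level}
updated = with_digit_splitting(original)

print(updated["pre_tokenizer"]["type"])  # Sequence
print([p["type"] for p in updated["pre_tokenizer"]["pretokenizers"]])
# ['ByteLevel', 'Digits']
```

Doing this by hand means remembering the Sequence wrapper and its key names, which is the bookkeeping the typed helper takes care of.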

Decasing a tokenizer

For background, see this blog post. Decasing is super easy using skeletoken.

from tokenizers import Tokenizer
from skeletoken import TokenizerModel

model_name = "intfloat/multilingual-e5-small"

tokenizer = Tokenizer.from_pretrained(model_name)

print([tokenizer.encode(x).tokens for x in ["Amsterdam", "amsterdam"]])
# [['<s>', '▁Amsterdam', '</s>'], ['<s>', '▁am', 'ster', 'dam', '</s>']]

model = TokenizerModel.from_pretrained(model_name)
model = model.decase_vocabulary()

lower_tokenizer = model.to_tokenizer()
print([lower_tokenizer.encode(x).tokens for x in ["Amsterdam", "amsterdam"]])
# [['<s>', '▁amsterdam', '</s>'], ['<s>', '▁amsterdam', '</s>']]
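
The reason "amsterdam" splits into three pieces in the original tokenizer is that only the capitalized form is a single vocabulary item. The following sketch finds such tokens in a toy vocabulary. It is a diagnostic for illustration only: `cased_only_tokens` is a hypothetical helper and says nothing about how `decase_vocabulary` works internally.

```python
def cased_only_tokens(vocab: dict[str, int]) -> list[str]:
    """Return tokens that contain uppercase characters and whose
    lowercase form is missing from the vocabulary."""
    return sorted(
        token
        for token in vocab
        if token != token.lower() and token.lower() not in vocab
    )


# Toy vocabulary in the style of a SentencePiece-based tokenizer.
vocab = {"▁Amsterdam": 0, "▁am": 1, "ster": 2, "dam": 3, "▁The": 4, "▁the": 5}
print(cased_only_tokens(vocab))  # ['▁Amsterdam']
```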

Making a tokenizer greedy

For background, see this blog post. Like decasing, turning any tokenizer into a greedy one is super easy using skeletoken.

from tokenizers import Tokenizer
from skeletoken import TokenizerModel

model_name = "gpt2"

tokenizer = Tokenizer.from_pretrained(model_name)

print([tokenizer.encode(x).tokens for x in [" hellooo", " bluetooth"]])
# [['Ġhell', 'ooo'], ['Ġblu', 'etooth']]

model = TokenizerModel.from_pretrained(model_name)
model = model.make_model_greedy()
greedy_tokenizer = model.to_tokenizer()
print([greedy_tokenizer.encode(x).tokens for x in [" hellooo", " bluetooth"]])
# [['Ġhello', 'oo'], ['Ġblue', 'too', 'th']]
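
As a rough picture of what greedy tokenization means, the sketch below assumes that "greedy" refers to longest-prefix matching against the vocabulary, which is consistent with the outputs above. The function `greedy_tokenize` and the toy vocabulary are illustrative and are not part of skeletoken.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Repeatedly take the longest vocabulary item that prefixes the rest."""
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first and shrink until one matches.
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                tokens.append(text[start:end])
                start = end
                break
        else:
            raise ValueError(f"no token matches at position {start}")
    return tokens


# Toy vocabulary; "Ġ" marks a leading space as in byte-level BPE.
vocab = {"Ġhello", "Ġhell", "ooo", "oo", "o"}
print(greedy_tokenize("Ġhellooo", vocab))  # ['Ġhello', 'oo']
```

A merge-based BPE tokenizer can end up with 'Ġhell' + 'ooo' for the same input because the order of merges, not token length, decides the segmentation.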

Roadmap

Here's a rough roadmap:

  • ✅ Add automated lowercasing (see blog)
  • ✅ Add vocabulary changes + checks (e.g., check the merge table if a token is added)
  • ✅ Add helper functions for adding modules
  • ✅ Add secondary constraints (e.g., if an AddedToken refers to a vocabulary item that does not exist, we should crash.)
  • ✅ Add a front end for the Hugging Face trainer
  • ✅ Add automatic model editing
  • Consistent tokenizer hashing: instantly know if two tokenizers implement the same thing.
  • Add a front end for sentencepiece training.
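
One simple starting point for the hashing item above is to hash a canonical JSON serialization. This is only a sketch under that assumption, not a planned design: `tokenizer_fingerprint` is a hypothetical name, and this approach only detects tokenizers whose tokenizer.json content is identical up to key order, not tokenizers that behave the same while being configured differently.

```python
import hashlib
import json


def tokenizer_fingerprint(tokenizer_json: dict) -> str:
    """Hash a canonical serialization so key order does not matter."""
    canonical = json.dumps(
        tokenizer_json, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = {"model": {"type": "BPE", "vocab": {"a": 0, "b": 1}}, "version": "1.0"}
# Same content, different key order.
second = {"version": "1.0", "model": {"vocab": {"b": 1, "a": 0}, "type": "BPE"}}
# Different vocabulary.
third = {"model": {"type": "BPE", "vocab": {"a": 0, "c": 1}}, "version": "1.0"}

print(tokenizer_fingerprint(first) == tokenizer_fingerprint(second))  # True
print(tokenizer_fingerprint(first) == tokenizer_fingerprint(third))  # False
```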

License

MIT

Author

Stéphan Tulkens

Citation

If you use skeletoken in your work, please cite:

@software{stephan_tulkens_2026_18501953,
  author       = {Stephan Tulkens},
  title        = {skeletoken},
  month        = feb,
  year         = 2026,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.18501953},
  url          = {https://doi.org/10.5281/zenodo.18501953},
}
