# 🤗→⏳ TikTokenizer

Convert HuggingFace tokenizers to tiktoken format.
TikTokenizer allows you to use any compatible HuggingFace tokenizer with OpenAI's fast tiktoken library. It automatically handles the conversion from HuggingFace's tokenizer format to tiktoken's encoding format, with built-in caching for fast subsequent loads.
## Installation

```shell
pip install tiktokenizer
```

Or install from source (using uv):

```shell
git clone https://github.com/shakedzy/tiktokenizer.git
cd tiktokenizer
uv sync
uv run pip install -e .
```
## Quick Start

```python
from tiktokenizer import TikTokenizer

# Create a tiktoken encoding from any compatible HuggingFace model
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")

# Use it like any tiktoken encoding
tokens = encoding.encode("Hello, world!")
text = encoding.decode(tokens)

print(tokens)  # [9707, 11, 1879, 0]
print(text)    # Hello, world!
```
## Usage

### Creating an Encoding

```python
from tiktokenizer import TikTokenizer

# Basic usage - caches to ~/.cache/tiktokenizer/
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")

# Custom cache directory
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B", cache_dir="./my-cache")
```

### Loading from Cache

```python
# Load a previously cached encoding (no HuggingFace download needed)
encoding = TikTokenizer.load("Qwen/Qwen3-0.6B")
```

### Checking Compatibility

```python
# Check if a model is compatible before attempting conversion
if TikTokenizer.is_compatible("some-model/name"):
    encoding = TikTokenizer.create("some-model/name")
else:
    print("Model uses incompatible tokenizer")
```

### Special Tokens

```python
# Encode text containing special tokens
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")
text = "<|im_start|>user\nHello!<|im_end|>"
tokens = encoding.encode(text, allowed_special="all")
```
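Note that tiktoken refuses, by default, to encode text containing special tokens; `allowed_special="all"` permits them, so they are looked up as single tokens rather than BPE-merged. A rough stdlib-only sketch of the splitting step behind this (the helper name is illustrative, not tiktoken's API):

```python
import re

def split_on_special(text: str, special_tokens: set[str]) -> list[str]:
    """Split text so each special token becomes its own piece."""
    # Longest tokens first, so overlapping markers match greedily
    pattern = "(" + "|".join(
        re.escape(t) for t in sorted(special_tokens, key=len, reverse=True)
    ) + ")"
    # re.split with a capturing group keeps the special tokens in the result
    return [piece for piece in re.split(pattern, text) if piece]

pieces = split_on_special(
    "<|im_start|>user\nHello!<|im_end|>",
    {"<|im_start|>", "<|im_end|>"},
)
print(pieces)  # ['<|im_start|>', 'user\nHello!', '<|im_end|>']
```

Each special-token piece is then mapped directly to its ID, while the ordinary text in between goes through BPE as usual.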
## Compatible Models

TikTokenizer works with models that use byte-level BPE tokenizers (GPT-2/GPT-4 style):

| Model Family | Example | Compatible |
|---|---|---|
| Qwen | `Qwen/Qwen3-0.6B` | ✅ |
| GPT-2 | `openai-community/gpt2` | ✅ |
| Phi | `microsoft/phi-2` | ✅ |
| Mistral | `mistralai/Mistral-7B-v0.1` | ❌ |
| LLaMA | `meta-llama/Llama-2-7b` | ❌ |
| BERT | `bert-base-uncased` | ❌ |
### Why Some Models Are Incompatible

TikTokenizer only supports byte-level BPE tokenizers that use the GPT-2 byte-encoding scheme. Models using other tokenizer architectures are not compatible:

- **SentencePiece** (Mistral, LLaMA, T5): uses `▁` for spaces and a different byte encoding
- **WordPiece** (BERT, DistilBERT): uses `##` subword prefixes
- **Unigram** (XLNet, ALBERT): a different algorithm entirely
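One way such a compatibility check can work (a sketch, not necessarily TikTokenizer's exact implementation) is to inspect the model's `tokenizer.json`: byte-level BPE tokenizers declare a `BPE` model and a `ByteLevel` pre-tokenizer, while SentencePiece and WordPiece models do not.

```python
def looks_byte_level_bpe(tokenizer_json: dict) -> bool:
    """Heuristic: a BPE model plus a ByteLevel pre-tokenizer (possibly in a Sequence)."""
    if tokenizer_json.get("model", {}).get("type") != "BPE":
        return False
    pre = tokenizer_json.get("pre_tokenizer") or {}
    if pre.get("type") == "ByteLevel":
        return True
    if pre.get("type") == "Sequence":
        return any(p.get("type") == "ByteLevel" for p in pre.get("pretokenizers", []))
    return False

# Abridged configs for illustration
gpt2_like = {"model": {"type": "BPE"}, "pre_tokenizer": {"type": "ByteLevel"}}
bert_like = {"model": {"type": "WordPiece"}, "pre_tokenizer": {"type": "BertPreTokenizer"}}

print(looks_byte_level_bpe(gpt2_like))  # True
print(looks_byte_level_bpe(bert_like))  # False
```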
## API Reference

### `TikTokenizer.create(model_name, cache_dir=None)`

Create a tiktoken `Encoding` from a HuggingFace model.

**Parameters:**

- `model_name` (str): HuggingFace model name or path
- `cache_dir` (str | Path | None): Cache directory. Defaults to `~/.cache/tiktokenizer/`

**Returns:** `tiktoken.Encoding`

**Raises:**

- `FileExistsError`: If the encoder already exists and `override=False`
- `IncompatibleTokenizerError`: If the model's tokenizer is not compatible
### `TikTokenizer.load(model_name, cache_dir=None)`

Load a tiktoken `Encoding` from a cached file.

**Parameters:**

- `model_name` (str): HuggingFace model name or path
- `cache_dir` (str | Path | None): Cache directory. Defaults to `~/.cache/tiktokenizer/`

**Returns:** `tiktoken.Encoding`

**Raises:** `FileNotFoundError` if the cache file doesn't exist
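Since `load` raises `FileNotFoundError` on a cache miss, a load-or-create fallback is a natural pattern. A generic sketch, with stand-in functions in place of `TikTokenizer.load` / `TikTokenizer.create` (which take the same arguments, per the reference above):

```python
def load_or_create(load, create, model_name, cache_dir=None):
    """Try the fast cached path first; fall back to a full conversion on cache miss."""
    try:
        return load(model_name, cache_dir=cache_dir)
    except FileNotFoundError:
        return create(model_name, cache_dir=cache_dir)

# Stand-ins for demonstration only
def fake_load(model_name, cache_dir=None):
    raise FileNotFoundError("no cached encoding")

def fake_create(model_name, cache_dir=None):
    return f"encoding for {model_name}"

print(load_or_create(fake_load, fake_create, "Qwen/Qwen3-0.6B"))  # encoding for Qwen/Qwen3-0.6B
```

With the real API this would be `load_or_create(TikTokenizer.load, TikTokenizer.create, "Qwen/Qwen3-0.6B")`.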
### `TikTokenizer.is_compatible(model_name)`

Check whether a HuggingFace model's tokenizer can be converted.

**Parameters:**

- `model_name` (str): HuggingFace model name or path

**Returns:** `bool`
## How It Works

1. **Load** the HuggingFace tokenizer using `transformers.AutoTokenizer`
2. **Check compatibility** by verifying the tokenizer uses byte-level BPE with a ByteLevel pre-tokenizer
3. **Convert the vocabulary** from HuggingFace's string format to tiktoken's bytes format using the GPT-2 byte-to-unicode mapping
4. **Extract the regex pattern** for pre-tokenization from the tokenizer config
5. **Extract special tokens** and map them to their IDs
6. **Create a `tiktoken.Encoding`** with the converted data
7. **Cache to disk** as JSON for fast subsequent loads
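The byte-to-unicode mapping used in the vocabulary conversion is well known from the original GPT-2 code: printable Latin-1 bytes map to themselves, and the remaining byte values are shifted into unused codepoints above 255, so every token string in a HuggingFace byte-level vocab can be decoded back to raw bytes. A sketch:

```python
def gpt2_bytes_to_unicode() -> dict[int, str]:
    """Map each of the 256 byte values to a printable unicode character (GPT-2 scheme)."""
    # Bytes kept as-is: printable ASCII and most of Latin-1
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Everything else (control chars, space, ...) is shifted above 255
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

# Invert the mapping to turn a HuggingFace token string back into raw bytes
byte_encoder = gpt2_bytes_to_unicode()
byte_decoder = {c: b for b, c in byte_encoder.items()}

def token_str_to_bytes(token: str) -> bytes:
    return bytes(byte_decoder[ch] for ch in token)

print(token_str_to_bytes("Ġworld"))  # b' world'
```

This is why byte-level BPE vocabularies show `Ġ` where a leading space belongs: byte 32 (space) is shifted to codepoint 288, which renders as `Ġ`.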
## Command Line Interface

After installation, the `tiktokenizer` command is available globally:

### Create an encoding

```shell
# Create and cache a tiktoken encoding from a HuggingFace model
tiktokenizer create Qwen/Qwen3-0.6B

# Create with custom cache directory
tiktokenizer create Qwen/Qwen3-0.6B --cache-dir ./my-cache

# Create and test with custom text
tiktokenizer create Qwen/Qwen3-0.6B --test "Hello, world!"
```

### Load a cached encoding

```shell
# Load and display info about a cached encoding
tiktokenizer load Qwen/Qwen3-0.6B

# Load and test with text
tiktokenizer load Qwen/Qwen3-0.6B --test "Test text"
```

### Check compatibility

```shell
# Check if a model is compatible before creating
tiktokenizer check Qwen/Qwen3-0.6B
# ✓ Qwen/Qwen3-0.6B is compatible with TikTokenizer

tiktokenizer check mistralai/Mistral-7B-v0.1
# ✗ mistralai/Mistral-7B-v0.1 is NOT compatible with TikTokenizer
```

### Using as a module

```shell
# You can also run as a Python module
python -m tiktokenizer create Qwen/Qwen3-0.6B
```
## Why Use This?

- **Speed**: tiktoken is significantly faster than HuggingFace tokenizers
- **Simplicity**: a single cached encoding file; no HuggingFace dependency at runtime after caching
- **Compatibility**: works anywhere tiktoken works
- **Offline**: once cached, no internet connection is needed
## License

MIT License - see LICENSE for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## File details

Details for the file `tiktokenizer-0.1.1.tar.gz`.

**File metadata**

- Download URL: tiktokenizer-0.1.1.tar.gz
- Size: 8.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.13

**File hashes**

| Algorithm | Hash digest |
|---|---|
| SHA256 | `36d0fce1965777bb9ce271b0c2645591ca00861fcb7188025e2530d857f3eb8b` |
| MD5 | `d23d76fb4c0f07e4ec3d783f81e506bd` |
| BLAKE2b-256 | `83977521f5a0980f0ad0d669a319019aad0a282d51aa5ff7c9b6ba36fd4b2bbe` |
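A downloaded artifact can be checked against the published SHA256 above; a generic chunked-hashing sketch (the helper name is illustrative):

```python
import hashlib
from pathlib import Path

# Published SHA256 for tiktokenizer-0.1.1.tar.gz (from the table above)
EXPECTED_SHA256 = "36d0fce1965777bb9ce271b0c2645591ca00861fcb7188025e2530d857f3eb8b"

def sha256_of_file(path: Path, chunk_size: int = 1 << 16) -> str:
    """Hash a file in chunks so large downloads never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# Usage: sha256_of_file(Path("tiktokenizer-0.1.1.tar.gz")) == EXPECTED_SHA256
```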
## File details

Details for the file `tiktokenizer-0.1.1-py3-none-any.whl`.

**File metadata**

- Download URL: tiktokenizer-0.1.1-py3-none-any.whl
- Size: 10.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.13

**File hashes**

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f83f334d975e269157ad576c9faea63a9e2266a0cf8fbd6be6b5a73317568920` |
| MD5 | `b4c9117dc87a1be8808f76a2a172bf96` |
| BLAKE2b-256 | `a11d7ce78e19db6e6b1229576e78f5c0d955c97c1fc90594355753e703c5f539` |