
Project description

🤗→⏳ TikTokenizer

Convert HuggingFace tokenizers to tiktoken format.

TikTokenizer allows you to use any compatible HuggingFace tokenizer with OpenAI's fast tiktoken library. It automatically handles the conversion from HuggingFace's tokenizer format to tiktoken's encoding format, with built-in caching for fast subsequent loads.

Installation

pip install tiktokenizer

Or install from source (using uv):

git clone https://github.com/shakedzy/tiktokenizer.git
cd tiktokenizer
uv sync
uv run pip install -e .

Quick Start

from tiktokenizer import TikTokenizer

# Create a tiktoken encoding from any compatible HuggingFace model
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")

# Use it like any tiktoken encoding
tokens = encoding.encode("Hello, world!")
text = encoding.decode(tokens)

print(tokens)  # [9707, 11, 1879, 0]
print(text)    # Hello, world!

Usage

Creating an Encoding

from tiktokenizer import TikTokenizer

# Basic usage - caches to ~/.cache/tiktokenizer/
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")

# Custom cache directory
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B", cache_dir="./my-cache")

Loading from Cache

# Load a previously cached encoding (no HuggingFace download needed)
encoding = TikTokenizer.load("Qwen/Qwen3-0.6B")
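
Since load raises FileNotFoundError when no cached encoding exists (see the API reference below), a common pattern is to fall back to create on first use. A minimal sketch:

from tiktokenizer import TikTokenizer

def get_encoding(model_name: str):
    """Return the cached encoding, creating (and caching) it on first use."""
    try:
        return TikTokenizer.load(model_name)
    except FileNotFoundError:
        return TikTokenizer.create(model_name)

encoding = get_encoding("Qwen/Qwen3-0.6B")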

Checking Compatibility

# Check if a model is compatible before attempting conversion
if TikTokenizer.is_compatible("some-model/name"):
    encoding = TikTokenizer.create("some-model/name")
else:
    print("Model uses incompatible tokenizer")

Special Tokens

# Encode text containing special tokens
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")

text = "<|im_start|>user\nHello!<|im_end|>"
tokens = encoding.encode(text, allowed_special="all")
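
By default, tiktoken refuses to encode text containing special tokens unless they are explicitly allowed, so instead of "all" you can pass a specific set. A short sketch using tiktoken's standard allowed_special argument:

# Allow only these two special tokens; any other special token in the text
# would raise an error (tiktoken's default behavior)
tokens = encoding.encode(text, allowed_special={"<|im_start|>", "<|im_end|>"})
print(encoding.decode(tokens) == text)  # True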

Compatible Models

TikTokenizer works with models that use byte-level BPE tokenizers (GPT-2/GPT-4 style):

Model Family   Example                      Compatible
Qwen           Qwen/Qwen3-0.6B              ✓
GPT-2          openai-community/gpt2        ✓
Phi            microsoft/phi-2              ✓
Mistral        mistralai/Mistral-7B-v0.1    ✗
LLaMA          meta-llama/Llama-2-7b        ✗
BERT           bert-base-uncased            ✗

Why Some Models Are Incompatible

TikTokenizer only supports byte-level BPE tokenizers that use the GPT-2 byte encoding scheme. Models using different tokenizer architectures are not compatible:

  • SentencePiece (Mistral, LLaMA, T5): Uses ▁ for spaces and a different byte encoding
  • WordPiece (BERT, DistilBERT): Uses ## subword prefixes
  • Unigram (XLNet, ALBERT): Different algorithm entirely
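
If you want to check this yourself, one rough way (an illustrative sketch, not part of TikTokenizer's API) is to inspect which pre-tokenizer the fast tokenizer reports:

from transformers import AutoTokenizer

# Byte-level BPE models (GPT-2 style) typically report a ByteLevel pre-tokenizer;
# SentencePiece/WordPiece-based models report something else (or None)
tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
print(type(tok.backend_tokenizer.pre_tokenizer).__name__)  # expected: ByteLevel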

API Reference

TikTokenizer.create(model_name, cache_dir=None)

Create a tiktoken Encoding from a HuggingFace model.

Parameters:

  • model_name (str): HuggingFace model name or path
  • cache_dir (str | Path | None): Cache directory. Defaults to ~/.cache/tiktokenizer/

Returns: tiktoken.Encoding

Raises:

  • FileExistsError: if a cached encoding for this model already exists and override=False
  • IncompatibleTokenizerError: if the model's tokenizer is not compatible
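
To guard against incompatible models at call time, a minimal sketch (assuming IncompatibleTokenizerError can be imported from the tiktokenizer package):

from tiktokenizer import TikTokenizer, IncompatibleTokenizerError  # import location assumed

try:
    encoding = TikTokenizer.create("bert-base-uncased")
except IncompatibleTokenizerError:
    print("This model's tokenizer cannot be converted to tiktoken format")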

TikTokenizer.load(model_name, cache_dir=None)

Load a tiktoken Encoding from a cached file.

Parameters:

  • model_name (str): HuggingFace model name or path
  • cache_dir (str | Path | None): Cache directory. Defaults to ~/.cache/tiktokenizer/

Returns: tiktoken.Encoding

Raises:

  • FileNotFoundError: if the cache file doesn't exist


TikTokenizer.is_compatible(model_name)

Check if a HuggingFace model's tokenizer can be converted.

Parameters:

  • model_name (str): HuggingFace model name or path

Returns: bool

How It Works

  1. Load HuggingFace tokenizer using transformers.AutoTokenizer
  2. Check compatibility by verifying the tokenizer uses byte-level BPE with ByteLevel pre-tokenizer
  3. Convert vocabulary from HuggingFace's string format to tiktoken's bytes format using the GPT-2 byte-to-unicode mapping
  4. Extract regex pattern for pre-tokenization from the tokenizer config
  5. Extract special tokens and map them to their IDs
  6. Create tiktoken.Encoding with the converted data
  7. Cache to disk as JSON for fast subsequent loads
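
Step 3 is the heart of the conversion: HuggingFace stores byte-level BPE vocabularies as strings under GPT-2's reversible byte-to-unicode mapping, while tiktoken expects raw bytes mapped to ranks. A simplified sketch of that step (illustrative only, not the package's actual code; the tiktoken.Encoding constructor arguments are the library's real ones):

def gpt2_bytes_to_unicode() -> dict[int, str]:
    """GPT-2's mapping from raw byte values to printable unicode characters."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

def convert_vocab(hf_vocab: dict[str, int]) -> dict[bytes, int]:
    """Turn HuggingFace's {token string: id} vocab into tiktoken's {bytes: rank}."""
    unicode_to_byte = {c: b for b, c in gpt2_bytes_to_unicode().items()}
    return {bytes(unicode_to_byte[ch] for ch in token): rank
            for token, rank in hf_vocab.items()}

# The converted pieces are then assembled roughly like this (steps 3-6):
# import tiktoken
# encoding = tiktoken.Encoding(
#     name="my-model",
#     pat_str=pre_tokenization_regex,        # step 4
#     mergeable_ranks=convert_vocab(vocab),  # step 3
#     special_tokens=special_tokens,         # step 5
# )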

Command Line Interface

After installation, the tiktokenizer command is available in your environment:

Create an encoding

# Create and cache a tiktoken encoding from a HuggingFace model
tiktokenizer create Qwen/Qwen3-0.6B

# Create with custom cache directory
tiktokenizer create Qwen/Qwen3-0.6B --cache-dir ./my-cache

# Create and test with custom text
tiktokenizer create Qwen/Qwen3-0.6B --test "Hello, world!"

Load a cached encoding

# Load and display info about a cached encoding
tiktokenizer load Qwen/Qwen3-0.6B

# Load and test with text
tiktokenizer load Qwen/Qwen3-0.6B --test "Test text"

Check compatibility

# Check if a model is compatible before creating
tiktokenizer check Qwen/Qwen3-0.6B
# ✓ Qwen/Qwen3-0.6B is compatible with TikTokenizer

tiktokenizer check mistralai/Mistral-7B-v0.1
# ✗ mistralai/Mistral-7B-v0.1 is NOT compatible with TikTokenizer

Using as a module

# You can also run as a Python module
python -m tiktokenizer create Qwen/Qwen3-0.6B

Why Use This?

  • Speed: tiktoken is significantly faster than HuggingFace tokenizers
  • Simplicity: Single-file encoding, no need for HuggingFace at runtime after caching
  • Compatibility: Works anywhere tiktoken works
  • Offline: Once cached, no internet connection needed
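
If you want to verify the speed difference on your own model, a rough benchmark sketch (numbers vary by model, text, and hardware):

import time
from transformers import AutoTokenizer
from tiktokenizer import TikTokenizer

model = "Qwen/Qwen3-0.6B"
text = "The quick brown fox jumps over the lazy dog. " * 100

hf_tok = AutoTokenizer.from_pretrained(model)
tt_enc = TikTokenizer.create(model)

start = time.perf_counter()
for _ in range(1_000):
    hf_tok.encode(text)
print(f"HuggingFace: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
for _ in range(1_000):
    tt_enc.encode(text)
print(f"tiktoken:    {time.perf_counter() - start:.2f}s")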

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

Download the file for your platform.

Source Distribution

tiktokenizer-0.1.1.tar.gz (8.7 kB)

Built Distribution

tiktokenizer-0.1.1-py3-none-any.whl (10.7 kB)

File details

Details for the file tiktokenizer-0.1.1.tar.gz.

File metadata

  • Download URL: tiktokenizer-0.1.1.tar.gz
  • Upload date:
  • Size: 8.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.13

File hashes

Hashes for tiktokenizer-0.1.1.tar.gz
Algorithm Hash digest
SHA256 36d0fce1965777bb9ce271b0c2645591ca00861fcb7188025e2530d857f3eb8b
MD5 d23d76fb4c0f07e4ec3d783f81e506bd
BLAKE2b-256 83977521f5a0980f0ad0d669a319019aad0a282d51aa5ff7c9b6ba36fd4b2bbe

File details

Details for the file tiktokenizer-0.1.1-py3-none-any.whl.

File hashes

Hashes for tiktokenizer-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 f83f334d975e269157ad576c9faea63a9e2266a0cf8fbd6be6b5a73317568920
MD5 b4c9117dc87a1be8808f76a2a172bf96
BLAKE2b-256 a11d7ce78e19db6e6b1229576e78f5c0d955c97c1fc90594355753e703c5f539
