
🤗→⏳ TikTokenizer

Convert HuggingFace tokenizers to tiktoken format.

TikTokenizer allows you to use any compatible HuggingFace tokenizer with OpenAI's fast tiktoken library. It automatically handles the conversion from HuggingFace's tokenizer format to tiktoken's encoding format, with built-in caching for fast subsequent loads.

Installation

Install from source:

git clone https://github.com/shakedzy/tiktokenizer.git
cd tiktokenizer
pip install -e .

Quick Start

from tiktokenizer import TikTokenizer

# Create a tiktoken encoding from any compatible HuggingFace model
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")

# Use it like any tiktoken encoding
tokens = encoding.encode("Hello, world!")
text = encoding.decode(tokens)

print(tokens)  # [9707, 11, 1879, 0]
print(text)    # Hello, world!

Usage

Creating an Encoding

from tiktokenizer import TikTokenizer

# Basic usage - caches to ~/.cache/tiktokenizer/
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")

# Custom cache directory
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B", cache_dir="./my-cache")

Loading from Cache

# Load a previously cached encoding (no HuggingFace download needed)
encoding = TikTokenizer.load("Qwen/Qwen3-0.6B")
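
Since load raises FileNotFoundError when no cached file exists (see the API reference below), a simple load-or-create pattern covers both cases:

# Use the cache when available, convert and cache otherwise
try:
    encoding = TikTokenizer.load("Qwen/Qwen3-0.6B")
except FileNotFoundError:
    encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")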

Checking Compatibility

# Check if a model is compatible before attempting conversion
if TikTokenizer.is_compatible("some-model/name"):
    encoding = TikTokenizer.create("some-model/name")
else:
    print("Model uses incompatible tokenizer")

Special Tokens

# Encode text containing special tokens
encoding = TikTokenizer.create("Qwen/Qwen3-0.6B")

text = "<|im_start|>user\nHello!<|im_end|>"
tokens = encoding.encode(text, allowed_special="all")
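
Note that tiktoken's default is disallowed_special="all", so the allowed_special argument above is required; without it, the same call raises:

# Omitting allowed_special makes tiktoken treat special tokens in the
# input as an error rather than encoding them
encoding.encode(text)  # raises ValueError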

Compatible Models

TikTokenizer works with models that use byte-level BPE tokenizers (GPT-2/GPT-4 style):

Model Family   Example                     Compatible
Qwen           Qwen/Qwen3-0.6B             ✓
GPT-2          openai-community/gpt2       ✓
Phi            microsoft/phi-2             ✓
Mistral        mistralai/Mistral-7B-v0.1   ✗
LLaMA          meta-llama/Llama-2-7b       ✗
BERT           bert-base-uncased           ✗

Why Some Models Are Incompatible

TikTokenizer only supports byte-level BPE tokenizers that use the GPT-2 byte encoding scheme. Models using different tokenizer architectures are not compatible:

  • SentencePiece (Mistral, LLaMA, T5): Uses ▁ (U+2581) for spaces and a different byte encoding
  • WordPiece (BERT, DistilBERT): Uses ## subword prefixes
  • Unigram (XLNet, ALBERT): Different algorithm entirely
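
If you want to inspect a model yourself, the tokenizers backend exposes its pre-tokenizer. This pokes at transformers internals and is shown only for illustration; TikTokenizer.is_compatible is the supported check:

from transformers import AutoTokenizer

# Byte-level BPE models report a ByteLevel pre-tokenizer
tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
print(type(tok.backend_tokenizer.pre_tokenizer).__name__)  # ByteLevel

# WordPiece models like BERT report a different pre-tokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(type(tok.backend_tokenizer.pre_tokenizer).__name__)  # BertPreTokenizer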

API Reference

TikTokenizer.create(model_name, cache_dir=None)

Create a tiktoken Encoding from a HuggingFace model.

Parameters:

  • model_name (str): HuggingFace model name or path
  • cache_dir (str | Path | None): Cache directory. Defaults to ~/.cache/tiktokenizer/

Returns: tiktoken.Encoding

Raises:

  • FileExistsError: If a cached encoding already exists and override=False
  • IncompatibleTokenizerError: If the model's tokenizer is not compatible
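
A defensive call site might look like the following; the import path for IncompatibleTokenizerError is an assumption here, so adjust it to wherever the package exports the exception:

# Import path for the exception is assumed for illustration
from tiktokenizer import TikTokenizer, IncompatibleTokenizerError

try:
    encoding = TikTokenizer.create("bert-base-uncased")
except IncompatibleTokenizerError:
    print("WordPiece model; keep using the HuggingFace tokenizer")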

TikTokenizer.load(model_name, cache_dir=None)

Load a tiktoken Encoding from a cached file.

Parameters:

  • model_name (str): HuggingFace model name or path
  • cache_dir (str | Path | None): Cache directory. Defaults to ~/.cache/tiktokenizer/

Returns: tiktoken.Encoding

Raises:

  • FileNotFoundError: If the cache file doesn't exist


TikTokenizer.is_compatible(model_name)

Check if a HuggingFace model's tokenizer can be converted.

Parameters:

  • model_name (str): HuggingFace model name or path

Returns: bool

How It Works

  1. Load HuggingFace tokenizer using transformers.AutoTokenizer
  2. Check compatibility by verifying the tokenizer uses byte-level BPE with ByteLevel pre-tokenizer
  3. Convert vocabulary from HuggingFace's string format to tiktoken's bytes format using the GPT-2 byte-to-unicode mapping (see the sketch after this list)
  4. Extract regex pattern for pre-tokenization from the tokenizer config
  5. Extract special tokens and map them to their IDs
  6. Create tiktoken.Encoding with the converted data
  7. Cache to disk as JSON for fast subsequent loads
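
Step 3 is the core of the conversion. Below is a minimal sketch of the GPT-2 byte-to-unicode table and its inversion; the function names are illustrative, not TikTokenizer's actual internals:

def gpt2_byte_to_unicode() -> dict[int, str]:
    """GPT-2's table mapping every byte 0-255 to a printable unicode character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # map remaining bytes to code points above 255
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}

def token_to_bytes(token: str) -> bytes:
    """Invert the table: HF byte-level token string -> raw bytes for tiktoken."""
    unicode_to_byte = {c: b for b, c in gpt2_byte_to_unicode().items()}
    return bytes(unicode_to_byte[ch] for ch in token)

print(token_to_bytes("Ġworld"))  # b' world' ('Ġ' encodes a leading space)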

Command Line Interface

After installation, the tiktokenizer command is available globally:

Create an encoding

# Create and cache a tiktoken encoding from a HuggingFace model
tiktokenizer create Qwen/Qwen3-0.6B

# Create with custom cache directory
tiktokenizer create Qwen/Qwen3-0.6B --cache-dir ./my-cache

# Create and test with custom text
tiktokenizer create Qwen/Qwen3-0.6B --test "Hello, world!"

Load a cached encoding

# Load and display info about a cached encoding
tiktokenizer load Qwen/Qwen3-0.6B

# Load and test with text
tiktokenizer load Qwen/Qwen3-0.6B --test "Test text"

Check compatibility

# Check if a model is compatible before creating
tiktokenizer check Qwen/Qwen3-0.6B
# ✓ Qwen/Qwen3-0.6B is compatible with TikTokenizer

tiktokenizer check mistralai/Mistral-7B-v0.1
# ✗ mistralai/Mistral-7B-v0.1 is NOT compatible with TikTokenizer

Using as a module

# You can also run as a Python module
python -m tiktokenizer create Qwen/Qwen3-0.6B

Why Use This?

  • Speed: tiktoken is significantly faster than HuggingFace tokenizers (see the rough comparison below)
  • Simplicity: Single-file encoding, no need for HuggingFace at runtime after caching
  • Compatibility: Works anywhere tiktoken works
  • Offline: Once cached, no internet connection needed
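
The speed claim is easy to sanity-check on your own hardware. The snippet below is a rough probe rather than a rigorous benchmark; timings vary with model, text, and machine:

import time
from transformers import AutoTokenizer
from tiktokenizer import TikTokenizer

text = "The quick brown fox jumps over the lazy dog. " * 200

try:
    tt = TikTokenizer.load("Qwen/Qwen3-0.6B")
except FileNotFoundError:
    tt = TikTokenizer.create("Qwen/Qwen3-0.6B")
hf = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

start = time.perf_counter()
for _ in range(100):
    tt.encode(text)
print(f"tiktoken:    {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
for _ in range(100):
    hf.encode(text, add_special_tokens=False)
print(f"HuggingFace: {time.perf_counter() - start:.3f}s")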

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
