toksight
Tokenizer analysis toolkit. Compare vocabulary coverage, compression ratios, and token boundaries across GPT-4o, Llama 3, Mistral, and any HuggingFace tokenizer.
Why toksight?
Every LLM developer deals with tokenizers daily, whether for cost estimation, multilingual planning, or context window budgeting, yet there is little dedicated tooling for analyzing or comparing them. tiktoken, sentencepiece, and tokenizers are implementations; toksight is the microscope.
| Question | Before toksight | With toksight |
|---|---|---|
| "How does GPT-4o handle Korean vs Llama 3?" | Manual guessing | toksight coverage --blocks Hangul |
| "What's the vocabulary overlap?" | Nobody knows | toksight compare gpt-4o llama3 |
| "How much more expensive is this corpus on GPT-4 vs Claude?" | Manual counting | toksight cost --corpus data.txt |
| "Does this tokenizer have glitch tokens?" | Run each token manually | toksight audit gpt-4o |
Install
pip install toksight
With tokenizer backends:
pip install toksight[tiktoken] # OpenAI tokenizers
pip install toksight[transformers] # HuggingFace tokenizers
pip install toksight[all] # Everything
pip install toksight[cli] # CLI tools
Quick Start
Compression Analysis
from toksight import load_tiktoken
from toksight.compression import compute_compression
tok = load_tiktoken("cl100k_base")
stats = compute_compression(tok, ["Hello world! This is a test."])
print(f"Bytes/token: {stats.bytes_per_token:.2f}")
print(f"Fertility: {stats.fertility:.2f} tokens/word")
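For intuition, both metrics can be worked out by hand. The sketch below is not toksight's implementation: it assumes bytes/token means UTF-8 bytes divided by token count and fertility means tokens per whitespace-separated word, and `toy_encode` is a hypothetical stand-in for a real tokenizer.

```python
import re


def toy_encode(text: str) -> list[str]:
    """Hypothetical tokenizer: one token per word or punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", text)


def compression_stats(encode, texts):
    """Aggregate bytes/token and tokens/word over a list of texts."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return {
        # Higher means each token carries more text (better compression).
        "bytes_per_token": total_bytes / total_tokens if total_tokens else 0.0,
        # Higher means words are split into more pieces.
        "fertility": total_tokens / total_words if total_words else 0.0,
    }


stats = compression_stats(toy_encode, ["Hello world! This is a test."])
print(f"Bytes/token: {stats['bytes_per_token']:.2f}")
print(f"Fertility: {stats['fertility']:.2f} tokens/word")
```

With the toy tokenizer the sentence is 28 bytes, 6 words, and 8 tokens, so the two numbers are easy to check by hand.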
Unicode Coverage
from toksight.coverage import analyze_coverage
result = analyze_coverage(tok, blocks=["CJK Unified", "Hangul Syllables", "Arabic"])
for block, info in result.blocks_analyzed.items():
    print(f"{block}: {info['ratio']:.1%} coverage")
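How a coverage ratio could be defined is worth spelling out. The sketch below assumes one possible definition, the share of a block's code points that appear in at least one vocabulary token; toksight may define it differently. `BLOCKS`, `block_coverage`, and `toy_vocab` are illustrative names, and the ranges are the standard Unicode block boundaries.

```python
# Unicode block ranges (inclusive), taken from the Unicode block list.
BLOCKS = {
    "Hangul Syllables": (0xAC00, 0xD7AF),
    "Arabic": (0x0600, 0x06FF),
}


def block_coverage(vocab, block):
    """Fraction of the block's code points seen in any vocabulary token."""
    start, end = BLOCKS[block]
    seen = {ch for token in vocab for ch in token if start <= ord(ch) <= end}
    return len(seen) / (end - start + 1)


# Hypothetical vocabulary of already-decoded token strings.
toy_vocab = ["한", "국", "한국", "어", "hello", "م"]

for name in BLOCKS:
    print(f"{name}: {block_coverage(toy_vocab, name):.3%} coverage")
```

Byte-level BPE vocabularies store byte sequences rather than strings, so a real analysis would first need to decode tokens or test whether each character encodes and decodes cleanly.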
Vocabulary Comparison
from toksight.compare import compare_vocabularies, compare_on_corpus
tok_a = load_tiktoken("cl100k_base") # GPT-4
tok_b = load_tiktoken("o200k_base") # GPT-4o
result = compare_vocabularies(tok_a, tok_b)
print(f"Overlap: {result.vocab_overlap:,} tokens")
print(f"Jaccard: {result.jaccard_similarity:.2%}")
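The two reported numbers correspond to simple set operations. As a minimal sketch, assuming overlap is the count of token strings present in both vocabularies and Jaccard similarity is that count divided by the size of the union (`vocab_overlap` is a hypothetical helper, not part of toksight):

```python
def vocab_overlap(vocab_a, vocab_b):
    """Return (shared token count, Jaccard similarity) for two vocabularies."""
    a, b = set(vocab_a), set(vocab_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


# Toy vocabularies; note that "the" and " the" are distinct tokens.
overlap, jaccard = vocab_overlap(
    ["the", " the", "ing", "er"],
    ["the", "ing", "tion", "ed", "er"],
)
print(f"Overlap: {overlap:,} tokens")
print(f"Jaccard: {jaccard:.2%}")
```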
Token Mapping
from toksight.mapping import map_tokens
mapping = map_tokens(tok_a, tok_b, "Artificial intelligence is transforming healthcare")
for entry in mapping:
    src = entry["source_token"]
    targets = [t["text"] for t in entry["target_tokens"]]
    print(f" {src!r} → {targets}")
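One way to think about such a mapping is as an alignment of character spans: each source token maps to every target token whose span overlaps its own. The sketch below illustrates that idea with hand-written token lists. It is an assumption about the general approach, not toksight's algorithm, and `spans` and `map_by_span` are hypothetical helpers.

```python
def spans(tokens):
    """Character (start, end) offsets of each token in the joined text."""
    out, pos = [], 0
    for tok in tokens:
        out.append((pos, pos + len(tok)))
        pos += len(tok)
    return out


def map_by_span(source_tokens, target_tokens):
    """Map each source token to the target tokens that overlap its span."""
    if "".join(source_tokens) != "".join(target_tokens):
        raise ValueError("both tokenizations must cover the same text")
    target_spans = spans(target_tokens)
    mapping = []
    for tok, (start, end) in zip(source_tokens, spans(source_tokens)):
        overlapping = [
            t for t, (t_start, t_end) in zip(target_tokens, target_spans)
            if t_start < end and t_end > start
        ]
        mapping.append({"source": tok, "targets": overlapping})
    return mapping


# Two hypothetical tokenizations of the same string.
source = ["Art", "ificial", " intelligence"]
target = ["Artificial", " intel", "ligence"]
mapping = map_by_span(source, target)
for entry in mapping:
    print(f" {entry['source']!r} -> {entry['targets']}")
```

A one-to-many entry such as `' intelligence'` mapping to two target tokens is where the second tokenizer fragments the text more finely.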
Tokenizer Audit
from toksight.audit import audit
result = audit(tok)
for finding in result.findings[:10]:
    print(f"[{finding.severity}] {finding.category}: {finding.description}")
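To make the audit categories concrete, here is a minimal sketch of the kind of per-token checks an audit might run. The three heuristics (empty tokens, tokens that decode to the U+FFFD replacement character, and tokens containing control characters) and the category names are assumptions for illustration; toksight's actual checks may differ.

```python
import unicodedata

# Common whitespace controls that are expected in a vocabulary.
ALLOWED_CONTROLS = {"\n", "\t", "\r"}


def audit_vocab(vocab):
    """Return (token_id, category) pairs for suspicious decoded tokens."""
    findings = []
    for token_id, text in vocab.items():
        if text == "":
            findings.append((token_id, "empty"))
        elif "\ufffd" in text:
            # Replacement character: the token's bytes are not valid UTF-8 alone.
            findings.append((token_id, "replacement_char"))
        elif any(
            unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
            for ch in text
        ):
            findings.append((token_id, "control_char"))
    return findings


# Hypothetical id -> decoded-text vocabulary.
toy_vocab = {0: "hello", 1: "", 2: "\x00", 3: "caf\ufffd", 4: "\n"}
findings = audit_vocab(toy_vocab)
for token_id, category in findings:
    print(f"token {token_id}: {category}")
```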
Cost Estimation
from toksight.cost import compare_costs
corpus = ["Long document text..."] * 1000
costs = compare_costs(
    [(tok_a, "gpt-4o"), (tok_b, "gpt-4o-mini")],
    corpus,
)
for name, est in costs.estimates.items():
    print(f"{name}: {est['total_tokens']:,} tokens, ${est['input_cost_usd']:.4f}")
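The arithmetic behind an input-cost estimate is token count times price per token. The sketch below assumes per-million-token pricing; the model names, token counts, and prices are hypothetical placeholders rather than real provider rates, and `estimate_costs` is not a toksight function.

```python
def estimate_costs(token_counts, usd_per_million):
    """Input cost per model, given token counts and USD prices per 1M tokens."""
    return {
        name: {
            "total_tokens": n_tokens,
            "input_cost_usd": n_tokens * usd_per_million[name] / 1_000_000,
        }
        for name, n_tokens in token_counts.items()
    }


# Hypothetical numbers: the same corpus tokenized by two different tokenizers.
token_counts = {"model-a": 1_200_000, "model-b": 1_000_000}
usd_per_million = {"model-a": 2.00, "model-b": 0.50}

estimates = estimate_costs(token_counts, usd_per_million)
for name, est in estimates.items():
    print(f"{name}: {est['total_tokens']:,} tokens, ${est['input_cost_usd']:.4f}")
```

Because the token count depends on the tokenizer, the same corpus can cost more on one model even at an identical per-token price.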
CLI
# Vocabulary stats
toksight info cl100k_base
# Compression analysis
toksight compress cl100k_base --corpus data.txt
# Unicode coverage
toksight coverage cl100k_base
# Tokenizer audit
toksight audit cl100k_base --max-tokens 5000
Modules
| Module | Purpose |
|---|---|
| toksight.loader | Unified tokenizer loading (tiktoken, HuggingFace, SentencePiece, custom) |
| toksight.compression | Compression ratios, bytes/token, fertility analysis |
| toksight.coverage | Unicode block coverage, script analysis, roundtrip testing |
| toksight.compare | Vocabulary overlap, boundary alignment, fragmentation mapping |
| toksight.mapping | Token-to-token mapping between tokenizers |
| toksight.audit | Glitch token detection, degenerate tokens, control chars |
| toksight.cost | Provider cost estimation from tokenization differences |
| toksight.stats | Vocabulary statistics, length distributions, script coverage |
Supported Backends
| Backend | Install | Tokenizers |
|---|---|---|
| tiktoken | pip install toksight[tiktoken] | cl100k_base (GPT-4), o200k_base (GPT-4o) |
| HuggingFace | pip install toksight[transformers] | Any model on HuggingFace Hub |
| SentencePiece | pip install toksight[sentencepiece] | Any .model file |
| Custom | Built-in | Any encode/decode functions |
See Also
Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:
| Project | What it does |
|---|---|
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality — dedup, PII, contamination |
| castwright | Synthetic instruction data generation |
| datamix | Dataset mixing & curriculum optimization |
| trainpulse | Training health monitoring |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| infermark | Inference benchmarking |
| modeldiff | Behavioral regression testing |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |
License
Apache-2.0