Tokenizer analysis toolkit. Compare vocabulary coverage, compression ratios, and token boundaries across GPT-4o, Llama 3, Mistral, and any HuggingFace tokenizer.

These details have not been verified by PyPI

Project links

Project description

toksight

Tokenizer analysis toolkit. Compare vocabulary coverage, compression ratios, and token boundaries across GPT-4o, Llama 3, Mistral, and any HuggingFace tokenizer.

toksight coverage analysis

Why toksight?

Every LLM developer interacts with tokenizers daily — for cost estimation, multilingual planning, context window budgeting — yet there is zero tooling for analyzing or comparing them. tiktoken, sentencepiece, and tokenizers are implementations; toksight is the microscope.

Question	Before toksight	With toksight
"How does GPT-4o handle Korean vs Llama 3?"	Manual guessing	`toksight coverage --blocks Hangul`
"What's the vocabulary overlap?"	Nobody knows	`toksight compare gpt-4o llama3`
"How much more expensive is this corpus on GPT-4 vs Claude?"	Manual counting	`toksight cost --corpus data.txt`
"Does this tokenizer have glitch tokens?"	Run each token manually	`toksight audit gpt-4o`

Install

pip install toksight

With tokenizer backends:

pip install toksight[tiktoken]       # OpenAI tokenizers
pip install toksight[transformers]   # HuggingFace tokenizers
pip install toksight[all]            # Everything
pip install toksight[cli]            # CLI tools

Quick Start

Compression Analysis

from toksight import load_tiktoken
from toksight.compression import compute_compression

tok = load_tiktoken("cl100k_base")
stats = compute_compression(tok, ["Hello world! This is a test."])
print(f"Bytes/token: {stats.bytes_per_token:.2f}")
print(f"Fertility:   {stats.fertility:.2f} tokens/word")

Unicode Coverage

from toksight.coverage import analyze_coverage

result = analyze_coverage(tok, blocks=["CJK Unified", "Hangul Syllables", "Arabic"])
for block, info in result.blocks_analyzed.items():
    print(f"{block}: {info['ratio']:.1%} coverage")

Vocabulary Comparison

from toksight.compare import compare_vocabularies, compare_on_corpus

tok_a = load_tiktoken("cl100k_base")   # GPT-4
tok_b = load_tiktoken("o200k_base")    # GPT-4o

result = compare_vocabularies(tok_a, tok_b)
print(f"Overlap:  {result.vocab_overlap:,} tokens")
print(f"Jaccard:  {result.jaccard_similarity:.2%}")

toksight audit

Token Mapping

from toksight.mapping import map_tokens

mapping = map_tokens(tok_a, tok_b, "Artificial intelligence is transforming healthcare")
for entry in mapping:
    src = entry["source_token"]
    targets = [t["text"] for t in entry["target_tokens"]]
    print(f"  {src!r} → {targets}")

Tokenizer Audit

from toksight.audit import audit

result = audit(tok)
for finding in result.findings[:10]:
    print(f"[{finding.severity}] {finding.category}: {finding.description}")

Cost Estimation

from toksight.cost import compare_costs

corpus = ["Long document text..."] * 1000
costs = compare_costs(
    [(tok_a, "gpt-4o"), (tok_b, "gpt-4o-mini")],
    corpus,
)
for name, est in costs.estimates.items():
    print(f"{name}: {est['total_tokens']:,} tokens, ${est['input_cost_usd']:.4f}")

CLI

# Vocabulary stats
toksight info cl100k_base

# Compression analysis
toksight compress cl100k_base --corpus data.txt

# Unicode coverage
toksight coverage cl100k_base

# Tokenizer audit
toksight audit cl100k_base --max-tokens 5000

Modules

Module	Purpose
`toksight.loader`	Unified tokenizer loading (tiktoken, HuggingFace, SentencePiece, custom)
`toksight.compression`	Compression ratios, bytes/token, fertility analysis
`toksight.coverage`	Unicode block coverage, script analysis, roundtrip testing
`toksight.compare`	Vocabulary overlap, boundary alignment, fragmentation mapping
`toksight.mapping`	Token-to-token mapping between tokenizers
`toksight.audit`	Glitch token detection, degenerate tokens, control chars
`toksight.cost`	Provider cost estimation from tokenization differences
`toksight.stats`	Vocabulary statistics, length distributions, script coverage

Supported Backends

Backend	Install	Tokenizers
tiktoken	`pip install toksight[tiktoken]`	cl100k_base (GPT-4), o200k_base (GPT-4o)
HuggingFace	`pip install toksight[transformers]`	Any model on HuggingFace Hub
SentencePiece	`pip install toksight[sentencepiece]`	Any .model file
Custom	Built-in	Any encode/decode functions

Project	What it does
tokonomics	Token counting & cost management for LLM APIs
datacrux	Training data quality — dedup, PII, contamination
castwright	Synthetic instruction data generation
datamix	Dataset mixing & curriculum optimization
trainpulse	Training health monitoring
ckpt	Checkpoint inspection, diffing & merging
quantbench	Quantization quality analysis
infermark	Inference benchmarking
modeldiff	Behavioral regression testing
vibesafe	AI-generated code safety scanner
injectionguard	Prompt injection detection

License

Apache-2.0

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.3.0

Apr 10, 2026

This version

0.2.0

Apr 10, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

toksight-0.2.0.tar.gz (41.2 kB view details)

Uploaded Apr 10, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

toksight-0.2.0-py3-none-any.whl (29.1 kB view details)

Uploaded Apr 10, 2026 Python 3

File details

Details for the file toksight-0.2.0.tar.gz.

File metadata

Download URL: toksight-0.2.0.tar.gz
Upload date: Apr 10, 2026
Size: 41.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for toksight-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`d7aaf58e364bd0417d5335e62697b742f386b39903381bd93516b02eb20d3ec2`
MD5	`6b900ac6f812d81178e785113c6c49a7`
BLAKE2b-256	`ccddba31eacc0a62279669d8a35f2a023cd07694b560ed2e8d0547a69e642d2a`

See more details on using hashes here.

File details

Details for the file toksight-0.2.0-py3-none-any.whl.

File metadata

Download URL: toksight-0.2.0-py3-none-any.whl
Upload date: Apr 10, 2026
Size: 29.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for toksight-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5977f4047516f2ba7d73ab0b69019b50743d66b44ee694e157b1db05dd14c917`
MD5	`1d14a06a70f889eb9dc375eeffdb23fb`
BLAKE2b-256	`d9b57b38e5bca84e732daa20dfb716676df2244f0f7403ea431b73c2077eb7f5`

See more details on using hashes here.

toksight 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

toksight

Why toksight?

Install

Quick Start

Compression Analysis

Unicode Coverage

Vocabulary Comparison

Token Mapping

Tokenizer Audit

Cost Estimation

CLI

Modules

Supported Backends

See Also

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes