Tokenizer analysis toolkit. Compare vocabulary coverage, compression ratios, and token boundaries across GPT-4o, Llama 3, Mistral, and any HuggingFace tokenizer.

toksight

Python 3.9+ | License: Apache 2.0

Why toksight?

Every LLM developer interacts with tokenizers daily, whether for cost estimation, multilingual planning, or context window budgeting, yet there is little dedicated tooling for analyzing or comparing them. tiktoken, sentencepiece, and tokenizers are implementations; toksight is the microscope.

| Question | Before toksight | With toksight |
| --- | --- | --- |
| "How does GPT-4o handle Korean vs Llama 3?" | Manual guessing | toksight coverage --blocks Hangul |
| "What's the vocabulary overlap?" | Nobody knows | toksight compare gpt-4o llama3 |
| "How much more expensive is this corpus on GPT-4 vs Claude?" | Manual counting | toksight cost --corpus data.txt |
| "Does this tokenizer have glitch tokens?" | Run each token manually | toksight audit gpt-4o |

Install

pip install toksight

With tokenizer backends:

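# Quote the extras so shells such as zsh do not expand the brackets
pip install "toksight[tiktoken]"        # OpenAI tokenizers
pip install "toksight[transformers]"    # HuggingFace tokenizers
pip install "toksight[sentencepiece]"   # SentencePiece .model files
pip install "toksight[all]"             # Everything
pip install "toksight[cli]"             # CLI tools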

Quick Start

Compression Analysis

from toksight import load_tiktoken
from toksight.compression import compute_compression

tok = load_tiktoken("cl100k_base")
stats = compute_compression(tok, ["Hello world! This is a test."])
print(f"Bytes/token: {stats.bytes_per_token:.2f}")
print(f"Fertility:   {stats.fertility:.2f} tokens/word")
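
For orientation, the two metrics printed above are commonly defined as UTF-8 bytes per token and tokens per whitespace-separated word. The sketch below computes both with a stand-in tokenizer. `toy_encode` and `compression_metrics` are hypothetical names that are not part of toksight, and toksight's own definitions may differ in detail, for example in how words are split.

```python
import re
from typing import Callable, Iterable, List, Tuple


def toy_encode(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: word runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def compression_metrics(
    encode: Callable[[str], List[str]], texts: Iterable[str]
) -> Tuple[float, float]:
    """Return (bytes_per_token, fertility) over a list of texts.

    bytes_per_token = total UTF-8 bytes / total tokens
    fertility       = total tokens / total whitespace-separated words
    """
    texts = list(texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    bytes_per_token = total_bytes / total_tokens if total_tokens else 0.0
    fertility = total_tokens / total_words if total_words else 0.0
    return bytes_per_token, fertility


bpt, fert = compression_metrics(toy_encode, ["Hello world! This is a test."])
print(f"Bytes/token: {bpt:.2f}")
print(f"Fertility:   {fert:.2f} tokens/word")
```

Higher bytes/token means better compression; higher fertility means words are split into more pieces.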

Unicode Coverage

from toksight.coverage import analyze_coverage

# `tok` is the tokenizer loaded in the Compression Analysis example
result = analyze_coverage(tok, blocks=["CJK Unified", "Hangul Syllables", "Arabic"])
for block, info in result.blocks_analyzed.items():
    print(f"{block}: {info['ratio']:.1%} coverage")
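
One plausible way to define block coverage is the share of assigned code points in a Unicode block that encode to exactly one token. The sketch below illustrates that definition only; it is an assumption, not necessarily how `analyze_coverage` computes its ratio. `make_toy_encoder` and `block_coverage` are hypothetical helpers, and the toy tokenizer falls back to one token per UTF-8 byte for characters outside its vocabulary.

```python
import unicodedata
from typing import Callable, List, Set


def make_toy_encoder(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Hypothetical tokenizer: one token per known character, else byte fallback."""
    def encode(text: str) -> List[str]:
        tokens: List[str] = []
        for ch in text:
            if ch in vocab:
                tokens.append(ch)
            else:
                # Unknown characters become one token per UTF-8 byte.
                tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
        return tokens
    return encode


def block_coverage(encode: Callable[[str], List[str]], start: int, end: int) -> float:
    """Fraction of assigned code points in [start, end] that encode to one token."""
    single = total = 0
    for cp in range(start, end + 1):
        ch = chr(cp)
        # Skip unassigned code points (Cn) and surrogates (Cs).
        if unicodedata.category(ch) in ("Cn", "Cs"):
            continue
        total += 1
        if len(encode(ch)) == 1:
            single += 1
    return single / total if total else 0.0


# Hangul Syllables block: U+AC00..U+D7AF
encode = make_toy_encoder({"가", "나", "다"})
print(f"Hangul Syllables: {block_coverage(encode, 0xAC00, 0xD7AF):.2%} coverage")
```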

Vocabulary Comparison

from toksight import load_tiktoken
from toksight.compare import compare_vocabularies, compare_on_corpus

tok_a = load_tiktoken("cl100k_base")   # GPT-4
tok_b = load_tiktoken("o200k_base")    # GPT-4o

result = compare_vocabularies(tok_a, tok_b)
print(f"Overlap:  {result.vocab_overlap:,} tokens")
print(f"Jaccard:  {result.jaccard_similarity:.2%}")
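
The Jaccard similarity of two vocabularies is the size of their intersection divided by the size of their union. A minimal sketch on plain token strings follows; `vocab_overlap` is a hypothetical helper, and toksight may normalize tokens (for example bytes versus strings) before comparing.

```python
from typing import Iterable, Tuple


def vocab_overlap(vocab_a: Iterable[str], vocab_b: Iterable[str]) -> Tuple[int, float]:
    """Return (number of shared tokens, Jaccard similarity)."""
    a, b = set(vocab_a), set(vocab_b)
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


overlap, jaccard = vocab_overlap(["a", "b", "c"], ["b", "c", "d", "e"])
print(f"Overlap:  {overlap:,} tokens")
print(f"Jaccard:  {jaccard:.2%}")
```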

Token Mapping

from toksight.mapping import map_tokens

mapping = map_tokens(tok_a, tok_b, "Artificial intelligence is transforming healthcare")
for entry in mapping:
    src = entry["source_token"]
    targets = [t["text"] for t in entry["target_tokens"]]
    print(f"  {src!r} -> {targets}")
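
A common way to relate two tokenizations of the same text is to compare character spans: each source token maps to every target token whose span overlaps it. The sketch below shows that idea on plain lists of token strings. `spans` and `map_by_offsets` are hypothetical names, the return shape is simplified compared with the entries above, and this is not necessarily how `map_tokens` works internally.

```python
from typing import List, Tuple


def spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Character (start, end) offsets of each token in the concatenated text."""
    out, pos = [], 0
    for tok in tokens:
        out.append((pos, pos + len(tok)))
        pos += len(tok)
    return out


def map_by_offsets(source: List[str], target: List[str]) -> List[Tuple[str, List[str]]]:
    """Pair each source token with the target tokens that overlap its span."""
    if "".join(source) != "".join(target):
        raise ValueError("both tokenizations must spell out the same text")
    target_spans = spans(target)
    mapping = []
    for tok, (start, end) in zip(source, spans(source)):
        overlapping = [
            t for t, (t_start, t_end) in zip(target, target_spans)
            if t_start < end and start < t_end  # half-open intervals intersect
        ]
        mapping.append((tok, overlapping))
    return mapping


source = ["Art", "ificial", " intelligence"]
target = ["Artificial", " intel", "ligence"]
for src, targets in map_by_offsets(source, target):
    print(f"  {src!r} -> {targets}")
```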

Tokenizer Audit

from toksight.audit import audit

result = audit(tok)
for finding in result.findings[:10]:
    print(f"[{finding.severity}] {finding.category}: {finding.description}")
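
To illustrate the kind of checks an audit can run, the sketch below scans a toy vocabulary for empty tokens, tokens that decode to the replacement character U+FFFD, and tokens containing control characters. `audit_tokens`, its category labels, and the tuple shape are hypothetical and do not reflect toksight's finding objects. Detecting glitch tokens in the stricter sense generally needs re-encoding checks or model access, which this sketch does not attempt.

```python
import unicodedata
from typing import Dict, List, Tuple

# Control characters that are routine in text and not worth flagging.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def audit_tokens(vocab: Dict[int, str]) -> List[Tuple[int, str, str]]:
    """Return (token_id, category, description) for each suspicious token."""
    findings = []
    for token_id, text in sorted(vocab.items()):
        if text == "":
            findings.append((token_id, "degenerate", "empty token text"))
        elif "\ufffd" in text:
            findings.append((token_id, "encoding", "decodes to U+FFFD"))
        elif any(
            unicodedata.category(c) == "Cc" and c not in ALLOWED_CONTROLS
            for c in text
        ):
            findings.append((token_id, "control", "contains a control character"))
    return findings


toy_vocab = {0: "hello", 1: "", 2: "\ufffd", 3: "\x00", 4: "\n"}
for token_id, category, description in audit_tokens(toy_vocab):
    print(f"[{category}] token {token_id}: {description}")
```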

Cost Estimation

from toksight.cost import compare_costs

corpus = ["Long document text..."] * 1000
costs = compare_costs(
    [(tok_a, "gpt-4o"), (tok_b, "gpt-4o-mini")],
    corpus,
)
for name, est in costs.estimates.items():
    print(f"{name}: {est['total_tokens']:,} tokens, ${est['input_cost_usd']:.4f}")
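
The arithmetic behind an input-cost estimate is token count times price per token, with prices usually quoted per million tokens. The sketch below uses placeholder model names, token counts, and prices; none of them are real rates or toksight data, and `input_cost_usd` is a hypothetical helper.

```python
from typing import Dict


def input_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of sending `total_tokens` input tokens at a per-million-token price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical prices (USD per 1M input tokens) and token counts for one corpus.
prices: Dict[str, float] = {"model-a": 2.50, "model-b": 0.15}
token_counts: Dict[str, int] = {"model-a": 1_200_000, "model-b": 1_000_000}

for name, tokens in token_counts.items():
    cost = input_cost_usd(tokens, prices[name])
    print(f"{name}: {tokens:,} tokens, ${cost:.4f}")
```

The same corpus can cost more on one model either because its tokenizer produces more tokens or because its per-token price is higher, so both factors need to be compared.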

CLI

# Vocabulary stats
toksight info cl100k_base

# Compression analysis
toksight compress cl100k_base --corpus data.txt

# Unicode coverage
toksight coverage cl100k_base

# Tokenizer audit
toksight audit cl100k_base --max-tokens 5000

Modules

| Module | Purpose |
| --- | --- |
| toksight.loader | Unified tokenizer loading (tiktoken, HuggingFace, SentencePiece, custom) |
| toksight.compression | Compression ratios, bytes/token, fertility analysis |
| toksight.coverage | Unicode block coverage, script analysis, roundtrip testing |
| toksight.compare | Vocabulary overlap, boundary alignment, fragmentation mapping |
| toksight.mapping | Token-to-token mapping between tokenizers |
| toksight.audit | Glitch token detection, degenerate tokens, control chars |
| toksight.cost | Provider cost estimation from tokenization differences |
| toksight.stats | Vocabulary statistics, length distributions, script coverage |

Supported Backends

| Backend | Install | Tokenizers |
| --- | --- | --- |
| tiktoken | pip install toksight[tiktoken] | cl100k_base (GPT-4), o200k_base (GPT-4o) |
| HuggingFace | pip install toksight[transformers] | Any model on HuggingFace Hub |
| SentencePiece | pip install toksight[sentencepiece] | Any .model file |
| Custom | Built-in | Any encode/decode functions |

See Also

Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:

| Project | What it does |
| --- | --- |
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality: dedup, PII, contamination |
| castwright | Synthetic instruction data generation |
| datamix | Dataset mixing & curriculum optimization |
| trainpulse | Training health monitoring |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| infermark | Inference benchmarking |
| modeldiff | Behavioral regression testing |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |

License

Apache-2.0
