TokLens: Looking Beyond Fertility in Tokenizer Evaluation

Project description

TokLens: A Multilingual Lens on Tokenizer Quality for LLMs

Accepted to ACL 2026 SRW. 🎉

Open-source toolkit for evaluating tokenizer quality across languages using six intrinsic metrics. We evaluate 24 tokenizers from major LLM families across 15 typologically diverse languages and correlate the metrics with downstream performance.

Key Findings

  1. Stark cross-lingual disparities persist. GPT-2 produces 56x more tokens per word in Japanese than in English; Qwen2.5 and Gemma-2 shrink this gap to under 4x.
  2. No metric predicts English benchmark performance once model size is controlled for (Bonferroni-corrected): tokenizer quality does not drive English leaderboard scores.
  3. STRR significantly predicts multilingual performance. On MMLU-ProX, linear mixed-effects models estimate a large positive STRR effect (β = +5.7, z = 18.5, p < 0.001).
  4. Higher STRR correlates with steeper scaling. A controlled experiment on the Qwen2.5 family (fixed tokenizer, varying model size) shows that languages with higher STRR scale more steeply (ρ = 0.91, p < 0.001); a rough illustration of this correlation appears after the figure captions below.

[Figure: Benchmark correlations]

[Figure: Per-language scaling slope vs. tokenizer metrics (Qwen2.5 family)]
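As a rough illustration of the finding-4 analysis (not the paper's code), the Spearman correlation between per-language STRR and per-language scaling slope can be computed as below; all numbers are made-up placeholders, not the paper's data:

from scipy.stats import spearmanr

# Hypothetical per-language values; placeholders, not the paper's data.
strr   = [0.81, 0.45, 0.32, 0.67, 0.90, 0.28, 0.55]  # single-token retention rate
slopes = [4.2, 2.1, 1.5, 3.3, 4.8, 1.1, 2.6]         # benchmark gain per model-size step

rho, p = spearmanr(strr, slopes)
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")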

Metrics

Metric             Description
Fertility          Tokens per whitespace-delimited word. Lower = better compression.
CPT                Characters per token.
Compression ratio  Bytes per token.
NSL                Normalized sequence length relative to a reference tokenizer.
STRR               Single-token retention rate: the fraction of words encoded as a single token.
Parity             Cross-lingual fairness: a language's token count relative to English on parallel sentences.
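To make the definitions concrete, here is a minimal sketch of fertility and STRR computed with a Hugging Face tokenizer. It illustrates the definitions above and is not the toklens implementation; in particular, encoding words in isolation for STRR ignores context effects such as leading-space merges.

from transformers import AutoTokenizer

# Illustrative sketch only, not the toklens implementation.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenizer quality varies widely across languages."
words = text.split()  # whitespace-delimited words

# Fertility: tokens per whitespace-delimited word.
fertility = len(tokenizer.encode(text, add_special_tokens=False)) / len(words)

# STRR: fraction of words encoded as a single token (words are encoded
# in isolation here, which ignores leading-space merge rules).
single = sum(len(tokenizer.encode(w, add_special_tokens=False)) == 1 for w in words)
strr = single / len(words)

print(f"fertility={fertility:.2f}  STRR={strr:.2f}")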

Models and Languages

22 models with Open LLM Leaderboard v2 scores, plus two additional tokenizers (Qwen3, DeepSeek-V3) included for metric-only analysis.

15 languages across 6 scripts: English, Chinese, Japanese, Arabic, Hindi, German, Turkish, Korean, Thai, Russian, French, Spanish, Portuguese, Vietnamese, Indonesian.

Quickstart

Install from PyPI:

pip install toklens

Python API:

from toklens import Analyzer

analyzer = Analyzer.from_pretrained("meta-llama/Llama-3.1-8B")
report = analyzer.evaluate(langs=["en", "zh", "ja", "ar"])
report.print_table()

Command-line interface:

toklens eval meta-llama/Llama-3.1-8B --langs en zh ja ar
toklens compare meta-llama/Llama-2-7b-hf meta-llama/Llama-3.1-8B

Experiments

The pipeline below reproduces the full evaluation. Run the steps in order:

uv run python -m experiments.pipeline.01_collect_benchmarks
uv run python -m experiments.pipeline.02_compute_metrics
uv run python -m experiments.pipeline.03_correlation
uv run python -m experiments.pipeline.04_figures

Supplementary analyses (LME models, Qtok comparison, BPB, Qwen scaling) are in experiments/analyses/. See experiments/README.md for details.
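For orientation, here is a minimal sketch of the kind of linear mixed-effects fit behind key finding 3, using statsmodels on synthetic data. The column names (score, strr, log_params, model) and the random-intercept-per-model structure are illustrative assumptions, not the repo's actual schema or model specification.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# Synthetic long-format table: one row per (model, language) pair.
df = pd.DataFrame({
    "model": np.repeat([f"m{i}" for i in range(8)], 5),
    "strr": rng.uniform(0.2, 0.9, 40),
    "log_params": np.repeat(rng.uniform(0.5, 2.0, 8), 5),
})
df["score"] = 20 + 5 * df["strr"] + 10 * df["log_params"] + rng.normal(0, 1, 40)

# Fixed effects for STRR and model size; random intercept per model.
fit = smf.mixedlm("score ~ strr + log_params", df, groups=df["model"]).fit()
print(fit.summary())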

License

MIT

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

toklens-0.1.1.tar.gz (29.8 MB; see file details below)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

toklens-0.1.1-py3-none-any.whl (12.7 kB; see file details below)

Uploaded Python 3

File details

Details for the file toklens-0.1.1.tar.gz.

File metadata

  • Download URL: toklens-0.1.1.tar.gz
  • Upload date:
  • Size: 29.8 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.8

File hashes

Hashes for toklens-0.1.1.tar.gz
Algorithm Hash digest
SHA256 e9d8cc964e9049b406715cc133ffd0f71b93626f6dcf671489e8cce0d035b227
MD5 62bd0b36943e9eee99c5047c6a1c43c4
BLAKE2b-256 cb04730afdde45f6e81c2f2f23171be7446002545bdd12ca226493ef143c09bb

See more details on using hashes here.

File details

Details for the file toklens-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: toklens-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 12.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.8

File hashes

Hashes for toklens-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 7907d2025e2811459fe095d430fd911d0e04b131b6caf4189e6d4363bd5ff5d7
MD5 0aa3ba48de7a52da43aaa9beb341133a
BLAKE2b-256 47eecc414bf94f2451cc50048a70765b4fcadea27615e0bc47e416342d0e3202

See more details on using hashes here.
