Tokenizer fertility, cost, and multi-turn context-budget analyzer for low-resource Asian languages.

asia-fertility 🌏

The hidden multilingual tax in your tokenizer — measured before you deploy.

asia-fertility measures the structural cost penalty that LLM tokenizers impose on lower-resource Asian languages. On cl100k_base, the same content takes 11.66× the tokens in Burmese that it takes in English, which silently inflates API bills, shrinks the usable context window, and leaves room for fewer in-context examples.
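To make the context-window effect concrete, here is a minimal arithmetic sketch. It is not part of the asia-fertility API: the function name and the 128,000-token window are assumptions chosen for illustration, and the premiums are the cl100k_base figures reported under Key findings.

```python
def english_equivalent_budget(context_tokens: int, premium: float) -> float:
    """Tokens' worth of English-equivalent content that fits in the window
    when the same content costs `premium` times more tokens."""
    if premium <= 0:
        raise ValueError("premium must be positive")
    return context_tokens / premium


# Hypothetical window size, used only for illustration.
CONTEXT_TOKENS = 128_000

# cl100k_base premiums quoted under Key findings.
premiums = {"Tamil": 7.61, "Burmese": 11.66}

for lang, premium in premiums.items():
    budget = english_equivalent_budget(CONTEXT_TOKENS, premium)
    print(f"{lang}: {premium:.2f}x tokens -> ~{budget:,.0f} English-equivalent tokens")
```

At a fixed per-token price the bill scales by the same factor, so the premium is also the cost multiplier.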

Quickstart

pip install "asia-fertility[oai]"

# Measure your own text
asia-fertility measure --text "தமிழ் ஒரு செம்மொழி" --lang tam --tokenizer openai/o200k_base

# Compare cost across providers (in your local currency)
asia-fertility cost --text "Xin chào" --lang vie \
  --models openai/gpt-4o,openai/gpt-3.5-turbo \
  --currencies USD,VND

# Reproduce the full 16-language × 9-tokenizer leaderboard
asia-fertility run --config configs/study_main.yaml
asia-fertility figures --run runs/main --out runs/main/figures
asia-fertility leaderboard --run runs/main --out runs/main/leaderboard.json

What's inside

  • 16 languages: 15 lower-resource Asian languages (Vietnamese, Indonesian, Malay, Filipino, Thai, Hindi, Bengali, Sinhala, Tamil, Telugu, Kannada, Malayalam, Burmese, Khmer, Lao) plus an English baseline.
  • 9 tokenizers measured: OpenAI o200k_base, cl100k_base, and o200k_harmony; Mistral Tekken; Qwen3; DeepSeek v3; BLOOM; Gemma-2; Aya Expanse. Three more (Llama-3.1 and others) are registered but sit behind license walls.
  • 5 metrics with 95% bootstrap CIs: fertility, premium, same-content cost ratio, characters/token (CPT), and bytes/token (BPT). BPT is the only one of the five that compares fairly across scripts.
  • NIAH benchmark: script-native needle-in-haystack across gpt-4o-mini, gpt-3.5-turbo, llama-3.1-8b-instruct on Tamil/Hindi/Burmese/Lao haystacks.
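The metrics above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: `byte_tokenizer` is a hypothetical stand-in (one token per UTF-8 byte), fertility is taken to be tokens per whitespace-delimited word, the same-content cost ratio is taken to be total target-language tokens over total English tokens on parallel sentences, and the confidence interval is a simple percentile bootstrap over sentence pairs.

```python
import random
from typing import Callable, List, Sequence, Tuple

Tokenizer = Callable[[str], List[int]]
Pair = Tuple[str, str]  # (target-language sentence, English sentence)


def byte_tokenizer(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def text_metrics(text: str, tokenize: Tokenizer) -> dict:
    """Fertility, characters/token, and bytes/token for one text."""
    n_tokens = len(tokenize(text))
    n_words = len(text.split())  # whitespace words: unreliable for unspaced scripts
    return {
        "fertility": n_tokens / n_words,
        "cpt": len(text) / n_tokens,
        "bpt": len(text.encode("utf-8")) / n_tokens,
    }


def cost_ratio(pairs: Sequence[Pair], tokenize: Tokenizer) -> float:
    """Total target-language tokens divided by total English tokens."""
    target = sum(len(tokenize(t)) for t, _ in pairs)
    english = sum(len(tokenize(e)) for _, e in pairs)
    return target / english


def bootstrap_ci(pairs: Sequence[Pair], tokenize: Tokenizer,
                 n_resamples: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI, resampling sentence pairs with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        cost_ratio([rng.choice(pairs) for _ in pairs], tokenize)
        for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    return stats[lo_idx], stats[n_resamples - 1 - lo_idx]


# Toy parallel data; a real run would use a parallel corpus such as FLORES-200.
pairs = [("Xin chào", "Hello"), ("Cảm ơn", "Thank you"), ("Tạm biệt", "Goodbye")]
point = cost_ratio(pairs, byte_tokenizer)
low, high = bootstrap_ci(pairs, byte_tokenizer)
print(text_metrics("Xin chào", byte_tokenizer))
print(f"cost ratio {point:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

The word-based denominator is the weak point of fertility: Thai, Lao, Khmer, and Burmese do not put spaces between words, so `text.split()` undercounts words there. Bytes per token needs no segmentation at all, which is one reason a byte-based comparator travels better across scripts.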

Key findings (v0.2.0)

  • Same content costs 7–12× more tokens on cl100k_base for Brahmic-derived scripts (Tamil 7.61×, Burmese 11.66×).
  • Switching to o200k_base reduces the penalty by a factor of 3–6 (Tamil 7.61× → 1.98×, Burmese 11.66× → 3.18×).
  • Gemma-2 is the best open-weight tokenizer for South Asian workloads (Tamil 2.58×, Burmese 4.80×).
  • NIAH recall collapses to 0–7% on Hindi/Tamil/Burmese/Lao with script-native markers, even at 4k context — across all three frontier models tested. See paper §4.4.
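For readers unfamiliar with needle-in-a-haystack tests, the shape of such a benchmark can be sketched as follows. This is a hypothetical harness, not the one used in the paper: the function names, the bracketed needle format, the substring-match scoring rule, and the stub models are all assumptions. The only script-specific fact it relies on is that Tamil digits occupy U+0BE6 to U+0BEF.

```python
from typing import Callable, Dict, List

# Map ASCII digits onto Tamil digits (U+0BE6..U+0BEF).
TAMIL_DIGITS = str.maketrans(
    "0123456789", "".join(chr(0x0BE6 + i) for i in range(10))
)


def native_marker(number: int) -> str:
    """Render a number with Tamil digits so the needle is script-native."""
    return str(number).translate(TAMIL_DIGITS)


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` among filler sentences at a relative depth in [0, 1]."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    position = int(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def recall(model: Callable[[str], str], cases: List[Dict[str, str]]) -> float:
    """Fraction of cases whose marker appears verbatim in the model's answer."""
    hits = sum(1 for case in cases if case["marker"] in model(case["prompt"]))
    return hits / len(cases)


# Filler reuses the Tamil sentence from the Quickstart; a real haystack
# would use varied native-script text.
filler = ["தமிழ் ஒரு செம்மொழி."] * 200
cases = []
for depth, number in [(0.1, 4271), (0.5, 9035), (0.9, 1688)]:
    marker = native_marker(number)
    prompt = build_haystack(filler, f"[{marker}]", depth)
    cases.append({"prompt": prompt, "marker": marker})


def echo_model(prompt: str) -> str:
    """Stub that returns the whole prompt, so it always contains the marker."""
    return prompt


def silent_model(prompt: str) -> str:
    """Stub that returns nothing, so it never contains the marker."""
    return ""


print(recall(echo_model, cases), recall(silent_model, cases))
```

In a real run the stubs would be replaced by calls to the models under test, and recall would be reported per language, context length, and needle depth.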

Paper

The full writeup is at paper/paper.pdf (11 pages). Cite as:

@misc{pedretti2026asianlanguagetax,
  title  = {The Asian Language Tax: Quantifying the Cost, Context, and Recall Penalty of Tokenizing Lower-Resource Asian Languages in Frontier LLMs},
  author = {Pedretti, Antoine},
  year   = {2026},
  url    = {https://github.com/Helmo21/asia-fertility},
}

Data

The full leaderboard and the NIAH benchmark results are published as a Hugging Face dataset:

from datasets import load_dataset

ds   = load_dataset("Helmo21/asia-fertility", "leaderboard")  # 144 (lang × tokenizer) rows
niah = load_dataset("Helmo21/asia-fertility", "niah")          # 536 NIAH cells

License

MIT © 2026 Antoine Pedretti. Bundled FLORES-200 data: CC-BY-SA 4.0 (Meta NLLB).
