Measure the tokenization tax on African languages across frontier LLMs
afri-fertility
Measure the tokenization "tax" on African languages.
Commercial LLM APIs bill, rate-limit, and budget context by the token. Because the same meaning takes more tokens in African languages than in English, speakers and builders pay a structural cost, latency, and context penalty before the model is even invoked. afri-fertility measures that penalty precisely.
It is the open measurement engine behind The African Language Tax paper, the public African Tokenization Tax Leaderboard, and the cost-calculator widget at datalens.africa.
afri-fertility reproduce
Tokenizer           Language   Fertility   Premium
──────────────────────────────────────────────────
openai/o200k_base   amh            8.500     7.83×
openai/o200k_base   yor            2.674     2.46×
openai/o200k_base   swh            1.800     1.66×
openai/o200k_base   fra            1.265     1.17×
Install
pip install afri-fertility # core: tiktoken + HF backends
pip install "afri-fertility[api]" # + Claude / Gemini count-only
pip install "afri-fertility[viz]" # + matplotlib figures
pip install "afri-fertility[dev]" # + pytest, hypothesis
Requires Python 3.11+. The core path is CPU-only and key-free.
Quickstart
Single-text measurement
from afri_fertility import measure_text
m = measure_text("Àwọn ará Nàìjíríà", tokenizer="openai/o200k_base")
print(f"tokens={m.tokens} fertility={m.fertility:.2f} cpt={m.cpt:.2f}")
# tokens=8 fertility=2.67 cpt=2.25
afri-fertility measure \
--text "Àwọn ará Nàìjíríà tó ń gbé ní ìlú Èkó" \
--lang yor \
--models openai/o200k_base,openai/cl100k_base
Cost calculator
from afri_fertility import cost_of
results = cost_of("Àwọn ará Nàìjíríà", lang="yor", models=["openai/o200k_base"])
for r in results:
print(f"{r.tokenizer}: ${r.total_cost_usd:.6f} NGN {r.costs_local['NGN']:.4f}")
afri-fertility cost \
--text "Àwọn ará Nàìjíríà" \
--lang yor \
--models openai/o200k_base,mistral/tekken \
--currencies NGN,ZAR,KES
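The arithmetic behind a per-model cost figure can be sketched as follows. This is a minimal illustration, not the package's cost model: the function name, the dataclass, the price of $2.50 per million tokens, and the NGN rate of 1500 are all placeholder assumptions, and it only covers input tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    """Hypothetical container; not a type exported by afri-fertility."""
    tokens: int
    usd: float
    local: dict[str, float]


def estimate_cost(
    tokens: int,
    usd_per_million_tokens: float,
    fx_rates: dict[str, float],
) -> CostEstimate:
    """Convert a token count into USD and local-currency cost."""
    usd = tokens * usd_per_million_tokens / 1_000_000
    # fx_rates maps a currency code to units of that currency per 1 USD.
    local = {currency: usd * rate for currency, rate in fx_rates.items()}
    return CostEstimate(tokens=tokens, usd=usd, local=local)


# Placeholder price and FX rate, chosen only to make the arithmetic visible.
est = estimate_cost(tokens=8, usd_per_million_tokens=2.50, fx_rates={"NGN": 1500.0})
print(f"${est.usd:.6f}  NGN {est.local['NGN']:.4f}")
```

Because cost is linear in the token count, a language with a 2.46× premium costs roughly 2.46× as much as the English equivalent at the same price point.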
Offline credibility demo
afri-fertility reproduce
Runs the bundled reference suite (10 parallel sentences, 7 languages) against all available tokenizers. No network, no API keys.
Full study
Run the complete pre-registered study from the locked config:
afri-fertility run --config configs/study_main.yaml
Results land in runs/main/:
runs/main/
├── results.parquet / results.csv / results.json
├── leaderboard.json
├── manifest.json
└── figures/
├── fig1_heatmap.{png,svg}
├── fig2_premium_script.{png,svg}
├── fig3_cost.{png,svg}
├── fig4_context.{png,svg}
├── fig5_general_indomain.{png,svg}
└── fig6_premium_accuracy.{png,svg}
For HF-gated tokenizers (Llama, Gemma, …):
afri-fertility run --config configs/study_main.yaml --hf-token $HF_TOKEN
# or: export HF_TOKEN=hf_... && afri-fertility run ...
Python API
from afri_fertility import measure_text, cost_of, run_study, load_tokenizer
from afri_fertility.config import StudyConfig
# Single-text metrics
m = measure_text("Ẹ káàbọ̀", tokenizer="openai/o200k_base")
# Widget backend: cost per model
results = cost_of("Ẹ káàbọ̀", lang="yor", models=["openai/o200k_base"])
# Full study
config = StudyConfig.from_yaml("configs/study_main.yaml")
result = run_study(config)
df = result.dataframe # pandas DataFrame
result.figures("runs/figs") # generate all 6 figures
lb = result.to_leaderboard() # list[dict] for the frontend
CLI reference
| Command | What it does |
|---|---|
| `afri-fertility measure` | Token count, fertility, CPT, BPT for input text |
| `afri-fertility cost` | Cost per model per language (widget backend) |
| `afri-fertility run` | Full study from YAML config |
| `afri-fertility figures` | Regenerate figures from an existing run |
| `afri-fertility leaderboard` | Emit leaderboard JSON from a run |
| `afri-fertility reproduce` | Offline reference suite: one-command credibility demo |
| `afri-fertility tokenizers list` | Registry of all tokenizers and their availability |
| `afri-fertility corpora list` | Registered corpus loaders |
| `afri-fertility languages list` | All 22 study languages with ISO codes and scripts |
Global flags: --hf-token, --cache-dir, --log-level, --json.
Full documentation: docs/usage.md
Metrics
All metrics are computed on parallel corpora (same meaning, different languages), so the language effect is isolated from content.
| Metric | Formula | Meaning |
|---|---|---|
| Fertility F(L,T) | `tokens / words` | Tokens per word. Lower = more efficient. |
| Premium P(L,T) | `F(L,T) / F(eng,T)` | How many times more tokens L uses vs English. |
| CPT | `chars / tokens` | Characters packed per token. |
| BPT | `utf8_bytes / tokens` | Bytes per token (cross-script fair). |
| Context efficiency | `window_size × CPT` | Effective real chars in a fixed context window. |
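The formulas in the table can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation: words are split on whitespace, spaces count as characters, the token count of 18, the English fertility of 1.25, and the 128,000-token window are made-up inputs, and the helper names are hypothetical.

```python
import unicodedata


def text_metrics(text: str, tokens: int) -> dict[str, float]:
    """Fertility, CPT, and BPT for one text, given its token count."""
    text = unicodedata.normalize("NFC", text)
    words = len(text.split())            # assumption: whitespace segmentation
    chars = len(text)                    # assumption: spaces count as characters
    utf8_bytes = len(text.encode("utf-8"))
    return {
        "fertility": tokens / words,
        "cpt": chars / tokens,
        "bpt": utf8_bytes / tokens,
    }


def premium(fertility_lang: float, fertility_eng: float) -> float:
    """How many times more tokens per word than the English baseline."""
    return fertility_lang / fertility_eng


def context_chars(window_size: int, cpt: float) -> float:
    """Characters of real text that fit in a fixed token window."""
    return window_size * cpt


# Hypothetical token count for a 9-word Yoruba sentence.
m = text_metrics("Àwọn ará Nàìjíríà tó ń gbé ní ìlú Èkó", tokens=18)
print(f"fertility={m['fertility']:.2f} cpt={m['cpt']:.2f} bpt={m['bpt']:.2f}")
print(f"premium={premium(m['fertility'], 1.25):.2f}")
print(f"context_chars={context_chars(128_000, m['cpt']):.0f}")
```

For text with diacritics, BPT comes out higher than CPT because each accented character occupies more than one UTF-8 byte.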
Aggregation: sum-then-divide over all sentences (not mean-of-ratios). Bootstrap 95% CIs over sentences. Baseline: English. Normalization: NFC.
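The difference between sum-then-divide and mean-of-ratios, and a bootstrap over sentences, can be sketched as below. The sentence counts are made up, the function names are hypothetical, and the percentile method with 2,000 resamples is an assumption; the source only states that bootstrap 95% CIs are computed over sentences.

```python
import random

# Each pair is (tokens, words) for one sentence; values are made up.
Pair = tuple[int, int]


def pooled_fertility(pairs: list[Pair]) -> float:
    """Sum-then-divide: total tokens over total words."""
    return sum(t for t, _ in pairs) / sum(w for _, w in pairs)


def mean_of_ratios(pairs: list[Pair]) -> float:
    """Average of per-sentence ratios; short sentences weigh as much as long ones."""
    return sum(t / w for t, w in pairs) / len(pairs)


def bootstrap_ci(
    pairs: list[Pair],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for pooled fertility, resampling whole sentences."""
    rng = random.Random(seed)
    stats = sorted(
        pooled_fertility(rng.choices(pairs, k=len(pairs)))
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


sentences = [(30, 10), (4, 1), (12, 6)]
print(f"pooled={pooled_fertility(sentences):.3f}")      # 46 / 17
print(f"mean of ratios={mean_of_ratios(sentences):.3f}")  # (3 + 4 + 2) / 3
lo, hi = bootstrap_ci(sentences)
print(f"95% CI=({lo:.3f}, {hi:.3f})")
```

In this toy data the one-word sentence with four tokens pulls the mean of ratios up to 3.0, while the pooled figure stays near 2.7, which is why pooling is the less outlier-sensitive choice.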
Supported tokenizers
| Tokenizer id | Backend | Notes |
|---|---|---|
| `openai/o200k_base` | tiktoken | GPT-4o / GPT-4.1 / o-series |
| `openai/o200k_harmony` | tiktoken | OpenAI open-weight (gpt-oss) |
| `openai/cl100k_base` | tiktoken | GPT-3.5 / GPT-4 (legacy) |
| `meta/llama-3.1` | HF | Gated; needs `HF_TOKEN` |
| `meta/llama-4` | HF | Gated; needs `HF_TOKEN` |
| `google/gemma-4` | HF | Gated; needs `HF_TOKEN` |
| `mistral/tekken` | HF | |
| `qwen/qwen3` | HF | |
| `deepseek/v3` | HF | |
| `bigscience/bloom` | HF | Multilingual baseline |
| `cohere/aya-expanse` | HF | Multilingual-optimized baseline |
| `anthropic/claude` | API | `[api]` extra, count-only; needs `ANTHROPIC_API_KEY` |
| `google/gemini` | API | `[api]` extra, count-only; needs `GEMINI_API_KEY` |
Unavailable tokenizers are skipped with a warning; they never crash a run. See docs/adding_a_tokenizer.md to add your own.
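The skip-with-a-warning behaviour can be illustrated with a small sketch. It does not reproduce the package's registry: `load_available`, the loader functions, and the `demo/...` ids are hypothetical, and a whitespace counter stands in for a real tokenizer.

```python
import warnings
from typing import Callable

TokenCounter = Callable[[str], int]


def load_available(
    loaders: dict[str, Callable[[], TokenCounter]],
) -> dict[str, TokenCounter]:
    """Try every loader; keep the ones that work and warn about the rest."""
    available: dict[str, TokenCounter] = {}
    for name, loader in loaders.items():
        try:
            available[name] = loader()
        except Exception as exc:  # missing package, missing key, gated model, ...
            warnings.warn(f"skipping tokenizer {name!r}: {exc}")
    return available


def _whitespace_loader() -> TokenCounter:
    # Stand-in for a real tokenizer backend.
    return lambda text: len(text.split())


def _gated_loader() -> TokenCounter:
    # Simulates a gated tokenizer without credentials.
    raise PermissionError("HF_TOKEN not set")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    counters = load_available(
        {"demo/whitespace": _whitespace_loader, "demo/gated": _gated_loader}
    )

print(sorted(counters))   # only the loader that succeeded
print(len(caught))        # one warning for the skipped loader
```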
Supported languages
23 languages across 5 tiers:
Core (6): Yoruba · Hausa · Igbo · Wolof · Swahili · Amharic
Latin breadth (11): Zulu · Xhosa · Shona · Kinyarwanda · Luganda · Akan/Twi · Lingala · Oromo · Nigerian Pidgin · Sesotho · Bambara
Non-Latin (3): Tigrinya · Hausa-Ajami · N'Ko
Control (1): Afrikaans
Baselines (2): English · French
afri-fertility languages list
Reproducing the paper numbers
git clone https://github.com/ciphersenseai/afri-fertility
cd afri-fertility
pip install -e ".[dev,viz]"
afri-fertility reproduce # offline check, no keys
export HF_TOKEN=hf_...
afri-fertility run --config configs/study_main.yaml # full locked study
afri-fertility figures --run runs/main
afri-fertility leaderboard --run runs/main --out leaderboard.json
All tokenizer versions, price snapshot date, FX rates, and config hash are recorded in runs/main/manifest.json.
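One way such a manifest could be assembled is sketched below. The field names, the use of SHA-256 over the raw config text, and all example values are assumptions for illustration; the actual layout of `manifest.json` is not described here.

```python
import hashlib
import json


def config_hash(config_text: str) -> str:
    """Hex SHA-256 of the config file contents (assumed hashing scheme)."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def build_manifest(
    config_text: str,
    tokenizer_versions: dict[str, str],
    price_snapshot_date: str,
    fx_rates: dict[str, float],
) -> dict:
    """Collect everything needed to tell two runs apart (hypothetical fields)."""
    return {
        "config_sha256": config_hash(config_text),
        "tokenizer_versions": tokenizer_versions,
        "price_snapshot_date": price_snapshot_date,
        "fx_rates": fx_rates,
    }


# Placeholder inputs; none of these values come from the real study.
manifest = build_manifest(
    config_text="languages: [yor, swh]\n",
    tokenizer_versions={"openai/o200k_base": "placeholder-version"},
    price_snapshot_date="2026-01-01",
    fx_rates={"NGN": 1500.0},
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Hashing the config text means any edit to the study configuration, however small, shows up as a different hash in the manifest.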
Project structure
afri-fertility/
├── src/afri_fertility/
│ ├── core/ # segmentation, metrics, aggregation (pure functions)
│ ├── tokenizers/ # tiktoken + HF + API adapters + registry
│ ├── corpora/ # FLORES-200, SIB-200, MAFAND-MT, custom JSONL/CSV
│ ├── cost/ # cost model, price/FX snapshots
│ ├── study/ # orchestrator, accuracy linkage
│ ├── report/ # tables, 6 figures, leaderboard JSON
│ ├── cli.py # typer CLI
│ └── config.py # pydantic StudyConfig
├── configs/ # locked study config + pinned price/FX snapshots
├── data/
│ ├── languages.yaml # 22-language registry
│ └── reference_suite/ # offline reproduce dataset
└── tests/ # unit · golden · integration
Citation
@misc{somide2026african,
title = {The African Language Tax: Quantifying the Cost, Latency,
and Context Penalty of Tokenizing African Languages in Frontier LLMs},
author = {Somide, Anthony Olaoye and {DataLens Africa Research}},
year = {2026},
eprint = {2606.24460},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2606.24460}
}
License
Apache-2.0 · © 2026 DataLens Africa Research