# llm-tokens-atlas

Open benchmark of LLM tokenization — offline vs empirical calibration deltas. v0.1.0 ships 3 providers (Anthropic, OpenAI, Mistral); Google + Cohere are pending for v0.2.0.
## What this is
llm-tokens-atlas is a reproducible, open dataset and analysis pipeline that measures how offline tokenizers (e.g. tiktoken proxies, the published BPE vocabularies) compare against the empirical token counts returned by each provider's own API or OSS tokenizer. v0.1.0 covers 3 providers (Anthropic claude-opus-4-7, OpenAI gpt-4o, Mistral mistral-large-latest), 5 prompt formats (Markdown, XML, JSON, YAML, plain text), and 7,485 real-world prompt requests (499 unique prompts × 5 formats × 3 providers; n=2,495 per provider) drawn from open corpora. Google gemini-2.5-pro and Cohere command-r parity sweeps are scheduled for v0.2.0 — the schema is already validated for both providers; the only missing piece is the empirical sweep.
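To make the shape of the data concrete, here is a hypothetical sketch of a single dataset row. The field names below are assumptions for illustration only; the published schema in the repo and the Hugging Face dataset card is authoritative.

```python
# Hypothetical row layout -- field names are illustrative, not the published schema.
row = {
    "provider": "anthropic",
    "model": "claude-opus-4-7",
    "format": "markdown",        # one of the 5 prompt formats
    "prompt_id": "prompt-0001",
    "offline_tokens": 812,       # count from the offline tokenizer proxy
    "empirical_tokens": 1308,    # count returned by the provider
}
print(sorted(row))
```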
The output is a per-provider, per-format calibration delta distribution — so anyone estimating cost or context-window budgets ahead of an API call can quantify the bias of their offline counter instead of treating it as exact.
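As a minimal sketch of what such a delta could look like: the helper below computes relative bias as a fraction of the offline estimate. This is one plausible definition, not necessarily the one used for the published numbers; the repo's analysis code is authoritative.

```python
def calibration_delta(offline_count: int, empirical_count: int) -> float:
    """Relative bias of an offline token count against the provider's
    empirical count, expressed as a fraction of the offline estimate.
    One plausible definition, chosen here for illustration."""
    return (empirical_count - offline_count) / offline_count

# An offline counter returning 100 tokens where the API bills 161
# underestimates by 61% under this definition.
print(calibration_delta(100, 161))
```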
This project builds on tokenometer, which surfaced the underlying methodology and a notable preliminary finding: cl100k_base underestimates claude-opus-4-7 tokens by ~62% (median).
## Status
v0.1.0 (released 2026-05-11) — 3-provider coverage (Anthropic + OpenAI + Mistral). Schema, drivers, and data are stable for the three shipped providers; Google + Cohere rows will be added in v0.2.0 without a breaking schema change.
## Headline findings (v0.1.0)
Released 2026-05-11. n = 7,485 rows (2,495 per provider). Detailed numbers live in analysis/results.json.
| Provider | Model | Median offline-vs-empirical delta | OLS slope | R² |
|---|---|---|---|---|
| anthropic | claude-opus-4-7 | +41.3% (cl100k_base underestimates) | 1.611 | 0.9956 |
| openai | gpt-4o | 0.0% (tiktoken-as-truth oracle, mean +3.0%) | 1.024 | 0.9986 |
| mistral | mistral-large-latest | −0.1% (mistral-tokenizer-js, mean +1.9%) | 1.016 | 0.9993 |
The Anthropic row is the headline: the publicly recommended offline tokenizer underestimates real claude-opus-4-7 cost by ~41% across thousands of prompts, and 100% of rows underestimate (no exact or overestimate cases). OpenAI and Mistral are baselines confirming the offline-vs-empirical pipeline is calibrated correctly when the provider's own tokenizer is the oracle.
## Install

TBD — see `pyproject.toml`. The recommended workflow uses uv:

```shell
uv sync
```
## Usage

TBD. See `llm_tokens_atlas/` for collection and counting drivers, and `analysis/notebooks/` for plots.
## Reproducing results

```shell
make reproduce
```

This regenerates the dataset from scratch. Tokenizer and provider API versions are pinned (see `data/lockfile.json` once published).

See `docs/REPRODUCING.md` for full instructions — required API keys per provider, expected runtime at each scale, output sizes, and a CI-friendly tiny variant (`make reproduce-tiny`).
## Tokenometer integration

Atlas reuses tokenometer's multi-provider tokenizer logic (5 providers supported upstream; Atlas v0.1.0 exercises 3 of them — Anthropic, OpenAI, Mistral — with Google + Cohere sweeps arriving in v0.2.0) instead of reimplementing it in Python. The integration lives in two files:

- `llm_tokens_atlas/tokenometer_bridge.py` — Python facade over the tokenometer CLI. Exposes `count_offline`, `count_empirical`, `list_providers`, `list_models`, `list_formats`, plus a `count_offline_batch`/`count_empirical_batch` pair for the high-throughput atlas pipeline.
- `llm_tokens_atlas/install_tokenometer.sh` — idempotent installer; `make install` runs it. Finds tokenometer via (1) `tokenometer` on PATH, (2) a sibling `../tokenometer/` repo build, (3) builds the sibling if source is present, or (4) fails with an install hint.

Any new Python code that needs token counts should import from `llm_tokens_atlas.tokenometer_bridge`. Do not invoke the tokenometer CLI directly from other modules.
## Publishing the dataset

The canonical home for the dataset is https://huggingface.co/datasets/faraa2m/llm-tokens-atlas. The Hugging Face dataset card lives at `data/README.md`. The upload script is `llm_tokens_atlas/publish_to_hf.py`; set `HF_TOKEN` in your env and run it with `--dataset llm-tokens-atlas`.
## Reproducing

- `docs/REPRODUCING.md` — `make reproduce` mechanics + expected runtime.
## Citation

Released 2026-05-11. Cite as v0.1.0 (3-provider coverage). Coverage will expand to 5 providers in v0.2.0 (Google + Cohere); cite the version you used. Until the paper is on arXiv, cite the GitHub repo and the Hugging Face dataset directly:

```bibtex
@misc{llm-tokens-atlas-2026,
  author       = {Faraazuddin Mohammed},
  title        = {{llm-tokens-atlas}: An Open Benchmark of LLM Tokenization Calibration},
  year         = {2026},
  version      = {v0.1.0},
  howpublished = {\url{https://github.com/faraa2m/llm-tokens-atlas}},
  note         = {3-provider coverage (Anthropic, OpenAI, Mistral); v0.2.0 adds Google + Cohere. Companion arXiv preprint forthcoming.}
}
```
## License

- Code: Apache-2.0
- Data (everything under `data/`): CC-BY-4.0