Open benchmark of LLM tokenization across 5 providers — offline vs empirical calibration deltas.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

faraa2m

These details have not been verified by PyPI

Project links

Documentation

Project description

llm-tokens-atlas

Open benchmark of LLM tokenization — offline vs empirical calibration deltas. v0.1.0 ships 3 providers (Anthropic, OpenAI, Mistral); Google + Cohere are scheduled for v0.2.0.

What this is

llm-tokens-atlas is a reproducible, open dataset and analysis pipeline that measures how offline tokenizers (e.g. tiktoken proxies, the published BPE vocabularies) compare against the empirical token counts returned by each provider's own API or OSS tokenizer. v0.1.0 covers 3 providers (Anthropic claude-opus-4-7, OpenAI gpt-4o, Mistral mistral-large-latest), 5 prompt formats (Markdown, XML, JSON, YAML, plain text), and 7,485 real-world prompt requests (499 unique prompts × 5 formats × 3 providers; n=2,495 per provider) drawn from open corpora. Google gemini-2.5-pro and Cohere command-r parity sweeps are scheduled for v0.2.0 — the schema is already validated for both providers; the only missing piece is the empirical sweep.

The output is a per-provider, per-format calibration delta distribution — so anyone estimating cost or context-window budgets ahead of an API call can quantify the bias of their offline counter instead of treating it as exact.

This project builds on tokenometer, which surfaced the underlying methodology and a notable preliminary finding: cl100k_base underestimates claude-opus-4-7 tokens by ~62% (median).

Status

v0.1.0 (released 2026-05-11) — 3-provider coverage (Anthropic + OpenAI + Mistral). Schema, drivers, and data are stable for the three shipped providers; Google + Cohere rows will be added in v0.2.0 without a breaking schema change.

Headline findings (v0.1.0)

Released 2026-05-11. n = 7,485 rows (2,495 per provider). Detailed numbers live in analysis/results.json.

Provider	Model	Median offline-vs-empirical delta	OLS slope	R²
anthropic	`claude-opus-4-7`	+41.3% (cl100k_base underestimates)	1.611	0.9956
openai	`gpt-4o`	0.0% (tiktoken-as-truth oracle, mean +3.0%)	1.024	0.9986
mistral	`mistral-large-latest`	−0.1% (mistral-tokenizer-js, mean +1.9%)	1.016	0.9993

The Anthropic row is the headline: the publicly-recommended offline tokenizer underestimates real claude-opus-4-7 cost by ~41% across thousands of prompts, and 100% of rows underestimate (no exact / overestimate cases). OpenAI and Mistral are baselines confirming the offline-vs-empirical pipeline is calibrated correctly when the provider's own tokenizer is the oracle.

Install

The recommended local workflow uses uv:

uv sync
make install

For library use from a project environment:

pip install llm-tokens-atlas

Usage

Load the published dataset from Hugging Face:

from datasets import load_dataset

dataset = load_dataset("faraa2m/llm-tokens-atlas")
df = dataset["train"].to_pandas()

anthropic = df[df["provider"] == "anthropic"]
print(anthropic["delta_pct"].median())

Run a small credentials-free reproduction locally:

make reproduce-tiny

Run the full pipeline with the default provider set:

make reproduce

Use the Python bridge when another analysis script needs token counts through the same Tokenometer path as the published dataset:

from llm_tokens_atlas.tokenometer_bridge import count_offline

result = count_offline(
    text="Summarize this support ticket.",
    provider="openai",
    model="gpt-4o",
    format="markdown",
)
print(result)

See docs/REPRODUCING.md for provider keys, expected runtime, generated files, and CI-sized runs.

Calibration Examples

Use Atlas when an offline tokenizer needs a correction factor before a large batch job:

Claude budgeting — the v0.1.0 Anthropic sweep shows systematic undercounting versus empirical provider counts, so production budgets should include a provider-specific calibration margin.
OpenAI sanity checks — the gpt-4o row acts as an oracle-style baseline for o200k_base counting.
Mistral validation — the Mistral row validates the OSS tokenizer path for SentencePiece-family models.

Generated result tables live in analysis/results.json when the analysis pipeline has been run. Generated figures are expected under analysis/figures/.

Reproducing results

make reproduce

This regenerates the dataset from scratch. Tokenizer and provider API versions are pinned (see data/lockfile.json once published).

See docs/REPRODUCING.md for full instructions — required API keys per provider, expected runtime at each scale, output sizes, and a CI-friendly tiny variant (make reproduce-tiny).

Tokenometer integration

Atlas reuses tokenometer's multi-provider tokenizer logic (5 providers supported upstream; Atlas v0.1.0 exercises 3 of them — Anthropic, OpenAI, Mistral — with Google + Cohere exercises arriving in v0.2.0) instead of reimplementing it in Python. The integration lives in a single module:

llm_tokens_atlas/tokenometer_bridge.py — Python facade over the tokenometer CLI. Exposes count_offline, count_empirical, list_providers, list_models, list_formats, plus a count_offline_batch / count_empirical_batch pair for the high-throughput atlas pipeline.
llm_tokens_atlas/install_tokenometer.sh — idempotent installer; make install runs it. Finds tokenometer via (1) tokenometer on PATH, (2) a sibling ../tokenometer/ repo build, (3) builds the sibling if source is present, or (4) fails with an install hint.

Any new Python code that needs token counts should import from llm_tokens_atlas.tokenometer_bridge. Do not invoke the tokenometer CLI directly from other modules.

Publishing the dataset

The canonical home for the dataset is https://huggingface.co/datasets/faraa2m/llm-tokens-atlas. The Hugging Face dataset card lives at data/README.md. The upload script is llm_tokens_atlas/publish_to_hf.py; set HF_TOKEN in your env and run it with --dataset llm-tokens-atlas.

Reproducing

docs/REPRODUCING.md — make reproduce mechanics + expected runtime.

Citation

Released 2026-05-11. Cite as v0.1.0 (3-provider coverage). Coverage will expand to 5 providers in v0.2.0 (Google + Cohere); cite the version you used. Until the paper is on arxiv, cite the GitHub repo and the HuggingFace dataset directly:

@misc{llm-tokens-atlas-2026,
  author       = {Faraazuddin Mohammed},
  title        = {{llm-tokens-atlas}: An Open Benchmark of LLM Tokenization Calibration},
  year         = {2026},
  version      = {v0.1.0},
  howpublished = {\url{https://github.com/faraa2m/llm-tokens-atlas}},
  note         = {3-provider coverage (Anthropic, OpenAI, Mistral); v0.2.0 adds Google + Cohere. Companion arxiv preprint forthcoming.}
}

License

Code: Apache-2.0
Data (everything under data/): CC-BY-4.0

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

faraa2m

These details have not been verified by PyPI

Project links

Documentation

Release history Release notifications | RSS feed

This version

0.0.3

May 18, 2026

0.0.1

May 12, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

llm_tokens_atlas-0.0.3.tar.gz (91.6 kB view details)

Uploaded May 18, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

llm_tokens_atlas-0.0.3-py3-none-any.whl (69.9 kB view details)

Uploaded May 18, 2026 Python 3

File details

Details for the file llm_tokens_atlas-0.0.3.tar.gz.

File metadata

Download URL: llm_tokens_atlas-0.0.3.tar.gz
Upload date: May 18, 2026
Size: 91.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_tokens_atlas-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`9521447495cc5f4dbeec68ead74f2db443eeb007223df56b5953a667ddeb92bf`
MD5	`e60d4607676474b27b53ad2cbd7d709a`
BLAKE2b-256	`f6e12a745fb6349f6d63dff00f298fa4ed0efacc9a215345c8a81c9b32ee9d90`

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_tokens_atlas-0.0.3.tar.gz:

Publisher: publish.yml on faraa2m/llm-tokens-atlas

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_tokens_atlas-0.0.3.tar.gz
- Subject digest: 9521447495cc5f4dbeec68ead74f2db443eeb007223df56b5953a667ddeb92bf
- Sigstore transparency entry: 1568758158
- Sigstore integration time: May 18, 2026
Source repository:
- Permalink: faraa2m/llm-tokens-atlas@87ed60837df8947afaf7dcb9e312806aec94469a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/faraa2m
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@87ed60837df8947afaf7dcb9e312806aec94469a
- Trigger Event: workflow_dispatch

File details

Details for the file llm_tokens_atlas-0.0.3-py3-none-any.whl.

File metadata

Download URL: llm_tokens_atlas-0.0.3-py3-none-any.whl
Upload date: May 18, 2026
Size: 69.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llm_tokens_atlas-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`68797635ff7023c57aecb051efa051bc48c8935f94bd026bebd23d41500f4e46`
MD5	`1b1eff8db18504e973b15d67c553b05e`
BLAKE2b-256	`16481ca69ca1e4cc691f9de19e6749d4bfd88a237420c1a522449a596bcb49d7`

See more details on using hashes here.

Provenance

The following attestation bundles were made for llm_tokens_atlas-0.0.3-py3-none-any.whl:

Publisher: publish.yml on faraa2m/llm-tokens-atlas

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llm_tokens_atlas-0.0.3-py3-none-any.whl
- Subject digest: 68797635ff7023c57aecb051efa051bc48c8935f94bd026bebd23d41500f4e46
- Sigstore transparency entry: 1568758201
- Sigstore integration time: May 18, 2026
Source repository:
- Permalink: faraa2m/llm-tokens-atlas@87ed60837df8947afaf7dcb9e312806aec94469a
- Branch / Tag: refs/heads/main
- Owner: https://github.com/faraa2m
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@87ed60837df8947afaf7dcb9e312806aec94469a
- Trigger Event: workflow_dispatch

llm-tokens-atlas 0.0.3

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

llm-tokens-atlas

What this is

Status

Headline findings (v0.1.0)

Install

Usage

Calibration Examples

Reproducing results

Tokenometer integration

Publishing the dataset

Reproducing

Citation

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance