Benchmarking + evaluation harness that proves TokenJam's savings with executable-accuracy ground truth
Evidence-based LLM benchmarking
Does TokenJam's "downsize this model" recommendation hold up? This runs the cheaper model against executable benchmarks and tells you, with statistics, where it keeps up and where it breaks.
pip install -e .
tjb run # zero-flag offline proof — writes a stamped artifact
tjb serve # browse the bundled real evidence in the dashboard
No cloud · No signup · Offline-first
It answers the one question TokenJam itself can't: when TokenJam says "downsize this model," does the cheaper model still get the work right — and how much does it actually save?
benchmark tasks ─▶ run on ORIGINAL model ─▶ score (pass/fail) + cost
─▶ run on CANDIDATE model ─▶ score (pass/fail) + cost
─▶ proof: Δaccuracy (objective) + Δcost, stamped to tokenjam vX.Y.Z
How it's a proof of TokenJam (not a generic model comparison)
- The cheaper candidate is the model TokenJam's own downsize analyzer would route to, pulled live from tokenjam.core.optimize.DOWNGRADE_CANDIDATES.
- Cost is priced with TokenJam's own pricing table (tokenjam.core.pricing.get_rates), so the dollars match what TokenJam reports.
- Accuracy is the benchmark pass-rate against real test suites: a measurement, never a judgment.
- Every result is stamped with the exact TokenJam version under test.
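As a rough illustration of how the two deltas in a proof could be derived, the sketch below computes Δaccuracy and Δcost from per-model totals. Everything in it is an assumption made for illustration: the `RunTotals` dataclass, the `ILLUSTRATIVE_RATES` table (a stand-in for `tokenjam.core.pricing.get_rates`, whose signature is not shown here), the per-million-token unit, and all numbers.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a pricing lookup: USD per million tokens (input, output).
# These figures are made up; real rates would come from the installed tokenjam package.
ILLUSTRATIVE_RATES = {
    "original-model": (15.00, 75.00),
    "candidate-model": (1.00, 5.00),
}


@dataclass
class RunTotals:
    """Aggregated outcome of running one model over a benchmark (hypothetical shape)."""
    model: str
    passed: int
    total: int
    input_tokens: int
    output_tokens: int


def cost_usd(run: RunTotals) -> float:
    """Price a run from its token counts using the illustrative rate table."""
    rate_in, rate_out = ILLUSTRATIVE_RATES[run.model]
    return (run.input_tokens * rate_in + run.output_tokens * rate_out) / 1_000_000


def proof_headline(original: RunTotals, candidate: RunTotals, tokenjam_version: str) -> dict:
    """Return accuracy delta (percentage points), cost delta (percent), and the version stamp."""
    acc_original = original.passed / original.total
    acc_candidate = candidate.passed / candidate.total
    cost_original = cost_usd(original)
    cost_candidate = cost_usd(candidate)
    return {
        "delta_accuracy_pp": round((acc_candidate - acc_original) * 100, 1),
        "delta_cost_pct": round((cost_candidate - cost_original) / cost_original * 100, 1),
        "tokenjam_version": tokenjam_version,
    }


if __name__ == "__main__":
    original = RunTotals("original-model", 45, 50, 10_000, 20_000)
    candidate = RunTotals("candidate-model", 28, 50, 10_000, 20_000)
    print(proof_headline(original, candidate, "0.5.2"))
```

Both deltas are reported as candidate minus original, so a negative accuracy delta is a regression and a negative cost delta is a saving.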
Real evidence: downsizing Claude to Haiku breaks code, not math
The 2026-06-26 multipair run benchmarked 18 model-pair × suite configurations at real measured rates (tokenjam 0.5.2). The sharpest result comes from routing claude-opus-4-7 down to claude-haiku-4-5. The cheaper model costs far less. It holds on grade-school math and falls apart on code.
| suite | pass rate (opus → haiku) | Δ accuracy [95% CI] | measured cost Δ | McNemar | verdict |
|---|---|---|---|---|---|
| HumanEval (code) | 90% → 56% | −34pp [−47.1, −20.9] | −81.6% | p<0.001 | significant_regression |
| GSM8K (math) | 98% → 96% | −2pp [−5.9, +1.9] | −59.2% | p=1.000 | no_significant_regression |
Same downsize, opposite calls. On code the bench flags a regression you would not want to ship. On math the cheaper model is statistically indistinguishable from the original while costing about 60% less. Reporting a single blended "accuracy" would have hidden both facts, so the bench reports per benchmark and lets the McNemar test decide each one.
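To make the per-suite decision concrete, here is a minimal sketch of an exact McNemar test and a paired delta interval. The function names and the Wald-style interval formula are assumptions, not the project's confirmed implementation. The discordant counts in the usage section are hypothetical: the source does not state n or the discordant counts, and these values were chosen only because they are consistent with the two reported rows if each suite ran 50 tasks.

```python
from math import comb, sqrt


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b = tasks the original passed and the candidate failed
    c = tasks the original failed and the candidate passed
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_delta_ci(b: int, c: int, n: int, z: float = 1.96):
    """Accuracy delta (candidate - original) with a Wald interval for paired data."""
    delta = (c - b) / n
    se = sqrt((b + c) - (b - c) ** 2 / n) / n
    return delta, delta - z * se, delta + z * se


if __name__ == "__main__":
    # Hypothetical counts consistent with the HumanEval row (90% -> 56%).
    print(mcnemar_exact(17, 0), paired_delta_ci(17, 0, 50))
    # Hypothetical counts consistent with the GSM8K row (98% -> 96%).
    print(mcnemar_exact(1, 0), paired_delta_ci(1, 0, 50))
```

Only the discordant pairs carry information in a paired design, which is why a single flipped task out of 50 yields p = 1.0 while 17 one-directional flips yield a p-value far below 0.001.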
The same code regression shows up for claude-sonnet-4-6 → claude-haiku-4-5 (94% → 56%, significant_regression). The gpt-4o → gpt-4o-mini and o3 → o4-mini downsizes pass HumanEval at this sample size.
Full run (18 configs, 7 suites, real rates): docs/evidence/live/2026-06-26-multipair/. Browse it with tjb serve.
Five Proof Modes
Quickstart (offline, no keys)
pip install -e .
tjb run # runs the `samples` benchmark, anthropic:claude-opus-4-7 → its TokenJam candidate
tjb serve # browse the bundled real evidence in the dashboard
tjb run with no flags is offline-first: with no provider key in the environment it auto-enables mock mode (no SDKs, no keys, no spend — numbers illustrative, plumbing real) and writes a version-stamped artifact. Set a provider key (e.g. ANTHROPIC_API_KEY) and it runs for real.
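The offline-first decision described above could be sketched roughly as follows. The function name, the `force_mock` parameter, and the exact set of variables checked are assumptions for illustration; only the three environment variable names come from this README.

```python
import os
from typing import Mapping, Optional

# Provider key variables named in this README; the real check may cover more.
PROVIDER_KEY_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "DEEPSEEK_API_KEY")


def should_use_mock(env: Optional[Mapping[str, str]] = None, force_mock: bool = False) -> bool:
    """Return True when the run should fall back to mock mode.

    Mock mode is chosen when explicitly forced, or when no provider key
    variable holds a non-blank value.
    """
    env = os.environ if env is None else env
    if force_mock:
        return True
    return not any(env.get(name, "").strip() for name in PROVIDER_KEY_VARS)


if __name__ == "__main__":
    print("mock mode:", should_use_mock())
```

Treating a blank or whitespace-only value as "no key" avoids attempting a live run with an empty credential.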
Real proof (live, multi-provider)
pip install -e ".[providers,datasets]"
export ANTHROPIC_API_KEY=... # and/or OPENAI_API_KEY / DEEPSEEK_API_KEY
tjb run --benchmark humaneval \
--original anthropic:claude-opus-4-7 \
--limit 50 --html
Replay your own sessions (DeepSeek)
pip install -e ".[providers,judge]"
export DEEPSEEK_API_KEY=...
TJBENCH_JUDGE=deepseek tjb replay \
--telemetry sessions.jsonl \
--candidate deepseek:deepseek-chat \
--judge deepseek --limit 50 --html
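For orientation, a replay run has to read the session file and decide which model produced most of the history. The sketch below is one plausible way to do that and is not the project's loader: the `model` field name and both function names are hypothetical, since the telemetry schema is not shown here.

```python
import json
from collections import Counter
from typing import Iterable, List, Optional


def load_sessions(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL text: one JSON object per non-blank line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        records.append(json.loads(line))
    return records


def dominant_model(records: List[dict]) -> Optional[str]:
    """Return the most frequent value of the (hypothetical) 'model' field, or None."""
    counts = Counter(r["model"] for r in records if r.get("model"))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    sample = ['{"model": "model-a"}', "", '{"model": "model-b"}', '{"model": "model-a"}']
    print(dominant_model(load_sessions(sample)))
```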
Benchmarks
| name | ground truth | needs | notes |
|---|---|---|---|
| samples | built-in code + math tasks | nothing (offline) | smoke test, always runs |
| humaneval | unit-test pass/fail | [datasets] | executable: runs model code in a subprocess |
| gsm8k | numeric exact-match | [datasets] | |
| judged | DeepEval correctness | [judge,providers] | LLM-as-judge; key-gated |
| sample-agent | tool use + safety gate | nothing (offline) | agent benchmark |
| swe-bench-lite | ⚠️ none (experimental scaffold) | SWE-bench dataset | tool/prompt scaffold only; fix-verification not implemented, scoring disabled, so not a real pass-rate |
| replay | agreement with the original model's own historical output | [providers] + telemetry | your real sessions; measures agreement-with-history, not correctness/safety |
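As an example of what "numeric exact-match" ground truth can look like, the sketch below pulls the last number out of a model's answer and compares it with the expected value. It is an illustrative assumption, not the project's scorer: the function names, the regular expression, and the "last number wins" rule are all hypothetical choices.

```python
import re
from typing import Optional

# Matches integers and decimals, allowing thousands separators such as 1,234.50.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number that appears in the text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def numeric_exact_match(answer: str, expected: str, tol: float = 1e-9) -> bool:
    """Pass only when both sides contain a number and the final numbers agree."""
    got = extract_final_number(answer)
    want = extract_final_number(expected)
    return got is not None and want is not None and abs(got - want) <= tol


if __name__ == "__main__":
    print(numeric_exact_match("So the total is $1,234.00.", "1234"))
    print(numeric_exact_match("I think the answer is 17", "18"))
```

Normalising currency symbols and separators before comparing keeps formatting differences from being scored as wrong answers.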
Statistics
Every proof carries its full statistical block — no single fabricated confidence scalar:
- Wilson score interval: 46/50 becomes 92% [95% CI 81–97%], not just 92%
- McNemar's exact paired test: correct for same-task paired comparison (not a two-proportion z-test)
- Paired delta CI: 95% CI on the accuracy delta itself, consistent with McNemar
- pass@k: unbiased Chen et al. 2021 estimator for multi-sample runs
- Verdicts: no_significant_regression · significant_regression · insufficient_evidence, never SAFE
Zero external dependencies (scipy, numpy not required). Pure Python from first principles. Full writeup →
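For readers who want to see the arithmetic, here is a dependency-free sketch of two of the estimators listed above: the Wilson score interval and the unbiased pass@k estimator. The function names and the fixed z = 1.96 are assumptions; the project's own code may differ in structure and edge-case handling.

```python
from math import comb, sqrt
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage)."""
    if total == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), for n samples with c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    low, high = wilson_interval(46, 50)
    print(f"46/50 -> 92% [95% CI {low:.0%}, {high:.0%}]")
    print(pass_at_k(10, 3, 1))
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and is not symmetric around the observed rate, which matters at pass rates near 100%.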
TokenJam changes every day — that's the design center
TokenJam is consumed as a published package, never vendored:
make update-tokenjam # pip install -U tokenjam; prints the new version
tjb version # shows the exact tokenjam build proofs will stamp
tjb run ... # every artifact records tokenjam_version
tjb matrix # diff proofs across releases; exits non-zero on regression
Because every artifact in results/ carries tokenjam_version, you can re-run the same benchmark across releases and catch the day a TokenJam change moves the numbers. tjb matrix doubles as a CI guard.
Testing
pip install -e ".[dev]"
pytest
| Module | Tests | What's covered |
|---|---|---|
| test_stats.py | 9 | Wilson CI, McNemar exact, paired delta CI, pass@k edge cases |
| test_pipeline_offline.py | 8 | Full proof pipeline, mock + real candidate, savings direction |
| test_matrix.py | 8 | Cross-version regression detection, sorting, serialisation |
| test_replay.py | 7 | JSONL + DuckDB loader, dominant_model, replay pipeline wiring |
| test_judge.py | 6 | MockJudge overlap, DeepEval backend resolution, metric routing |
| test_deepseek.py | 6 | OpenAI-compatible client, mock-directive stripping, provider registry |
| test_real_scenarios.py | 6 | Real-scenarios benchmark scoring + agent trace validation |
| test_agent_validation.py | 4 | Safety gate, tool ordering, forbidden-tool blocking |
| test_scoring.py | 4 | Code extraction, exact-match normalisation, number parsing |
| test_report_html.py | 4 | HTML render, artifact load, regression flag rendering |
| test_agent_pipeline_offline.py | 3 | Agent proof pipeline, token aggregation, assemble_proof parity |
| test_agent_runner.py | 3 | Multi-turn loop, max-turns guard, tool execution |
| test_dashboard.py | 3 | Dashboard artifact loading, serve routing |
| test_report.py | 3 | ProofResult derived properties, headline, write/round-trip |
| test_swe_bench_lite.py | 19 | swe-bench scaffold: scoring-disabled gate + tools/parsing |
| test_version_stamp.py | 1 | resolve_tokenjam_build metadata read |
All tests run offline with no provider SDKs or keys. Live-provider tests skip cleanly without [providers] installed.
Honesty
Accuracy = pass-rate on the chosen benchmark suite — never a general "quality preserved" claim. Reports record n, flag --mock runs as illustrative, and flag when cost fell back to TokenJam's $0.50/$2.00 default placeholder rates. There is deliberately no single confidence = 95% scalar — the honest expression is the CI + p-value. Executing model-generated code (HumanEval) happens in a timed subprocess; run only trusted benchmark suites on a machine you control.
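The timed-subprocess idea can be sketched as below. This is a minimal illustration under stated assumptions, not the project's exec_sandbox.py: the function name, the temporary-directory layout, and the 10-second default are hypothetical. A timeout limits runaway code but does not isolate the filesystem or network, which is why the advice about trusted suites and a machine you control still applies.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def run_with_timeout(code: str, tests: str, timeout_s: float = 10.0) -> Tuple[bool, str]:
    """Run model-generated code plus its unit tests in a child interpreter.

    Returns (passed, detail). The candidate passes only if the child exits
    with status 0 before the timeout elapses.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode, ignores user site and env vars
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, proc.stderr[-500:]


if __name__ == "__main__":
    print(run_with_timeout("def add(a, b):\n    return a + b", "assert add(1, 2) == 3"))
```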
Boundary statement. tokenjam-bench is an offline measurement layer. It tells you whether a downsize candidate held up on these benchmark suites and, for replay, whether it agreed with your own historical outputs (the original model's past output is the reference, so replay measures agreement-with-history, not correctness and not safety). It is not a runtime safety certification: it does not monitor production traffic, gate live requests, or guarantee the candidate will behave on inputs outside what was measured here.
FAQ
Does a passing verdict mean the cheaper model is safe to deploy?
No. A verdict is the pass-rate on a specific suite plus a McNemar test. no_significant_regression means the bench did not detect a drop on that suite at this sample size. It is not a runtime safety certification, and it says nothing about inputs outside the measured set.
Why did the same downsize pass on math and fail on code?
Accuracy is per-suite. claude-haiku-4-5 matched the original on GSM8K and dropped 34 points on HumanEval. A single blended number would bury that, so the bench scores each benchmark on its own.
Do I need API keys?
No. tjb run is offline-first: with no provider key set it auto-enables mock mode and still writes a stamped artifact. Add a key like ANTHROPIC_API_KEY to run for real.
What are the verdicts, and why no confidence score?
The three are no_significant_regression, significant_regression, and insufficient_evidence. There is no single confidence percentage on purpose. The honest expression of certainty is the Wilson CI plus the McNemar p-value.
Why are some old runs priced with default rates?
A provider without a TokenJam rate falls back to a $0.50/$2.00 placeholder. Those runs are flagged and kept under docs/evidence/archive/, out of the dashboard's headline path. The 2026-06-26 multipair run uses real rates.
How is this a proof of TokenJam and not a generic comparison?
The candidate model and the cost rates both come from the installed tokenjam package, and every artifact is stamped with the exact version that produced it.
Project Layout
cli.py CLI entry point (tjb version | recommend | run | agent | replay | matrix | serve | report)
pipeline.py Single-shot proof pipeline (run_proof, assemble_proof)
agent_pipeline.py Agent proof pipeline (run_agent_proof)
replay.py Telemetry loader (.jsonl / .duckdb read-only)
replay_pipeline.py Replay measurement — real sessions, judge-scored agreement with the original model's historical output
report.py ProofResult, ProofStats, TaskOutcome dataclasses
stats.py Wilson, McNemar exact, paired delta CI, pass@k (zero deps)
cost.py Pricing via tokenjam.core.pricing.get_rates
recommend.py Candidate resolution via tokenjam.core.optimize.DOWNGRADE_CANDIDATES
version.py Installed tokenjam version stamp
exec_sandbox.py Subprocess sandbox for model-generated code (HumanEval)
judge.py MockJudge (offline) + DeepEvalJudge (key-gated)
deepeval_judge.py DeepEval adapter (correctness / relevancy / faithfulness / task-completion)
matrix.py Cross-version regression detection
dashboard.py Local proof browser (auto-refresh)
report_html.py Self-contained HTML report renderer
models/ Provider-agnostic model clients
base.py Completion dataclass, ModelClient protocol
registry.py parse_spec, get_client (live + mock)
openai_compatible.py OpenAI + DeepSeek (+ future) via one abstraction
anthropic_client.py Live Anthropic single-shot
google_client.py Live Google Gemini
mock_client.py Offline deterministic
anthropic_agent_client.py Live Anthropic tool-calling
mock_agent_client.py Offline deterministic tool-calling
tool_calling.py ToolCall, AssistantTurn, ToolCallingClient protocol
benchmarks/ Benchmark definitions + scoring
base.py Task, ScoreResult, Benchmark protocol
scoring.py score_code (subprocess), score_exact_match
samples.py Built-in offline benchmark (code + math)
humaneval.py HumanEval loader
gsm8k.py GSM8K loader
judged.py LLM-judged benchmark (DeepEval)
real_scenarios.py Real-world scenario tasks
agent_base.py AgentTask, AgentBenchmark protocol
sample_agent.py Offline agent benchmark (tool use + safety)
swe_bench_lite.py experimental scaffold — tool-usage only (no fix verification)
agents/ Multi-turn agent execution
runner.py AgentRunner — the multi-turn loop
tools.py Tool, ToolResult, ToolRegistry
trace.py AgentTrace, TurnRecord, ToolCallRecord
validation.py validate_tools — safety gate
results/ Version-stamped JSON + HTML proof artifacts
docs/ Full documentation
docs/brand/ Logo lockup + jar icon (SVG + PNG)
docs/evidence/ Real proof runs; headline set in live/2026-06-26-multipair/
docs/evidence/archive/ Pre-multipair runs, kept for provenance (non-headline)
tests/ Offline pytest suite (no keys, no network)
Documentation
| Doc | Contents |
|---|---|
| docs/README.md | Documentation index |
| docs/overview.md | What this project is and why it exists |
| docs/architecture.md | System design, data flow, module relationships |
| docs/quickstart.md | Get running in 5 minutes |
| docs/cli-reference.md | Complete tjb command reference |
| docs/pipelines.md | Single-shot and agent proof pipelines |
| docs/replay.md | Replay validation — your real sessions |
| docs/models.md | Model client adapters and protocols |
| docs/benchmarks.md | Available benchmarks and scoring |
| docs/agents.md | Multi-turn agent execution framework |
| docs/statistics.md | Statistical methods used for proof |
| docs/cost-pricing.md | How costs are computed |
| docs/tokenjam-integration.md | How we consume TokenJam |
| docs/development.md | Contributing, testing, extending |
| docs/api-reference.md | Module-level API documentation |
| docs/proof-runbook.md | Reproduce the first evidence runs |
Related Projects
- TokenJam — The main cost-optimization and observability platform
- TokenJam Docs — TokenJam's own documentation
- TokenJam Python SDK — SDK for instrumenting agents