bellwether
The cost-and-failure-mode benchmark for LLM agents. Methodology plus Python package for honest, reproducible cross-provider agent evaluation.
Live leaderboard · Methodology
Why
Cross-provider LLM benchmarks today rank capability ("which model is smarter on average"). HELM and Chatbot Arena own that ground.
Practitioners building production systems need a different answer: which provider for THIS task, at THIS cost when retries and failures are accounted for, with THESE failure modes that map to my product's tolerance.
bellwether answers the procurement question and ships the toolkit anyone can run on their own prompts.
What it measures
- effective_TCoT: total cost per successfully completed task, including the cost of failed retries. The procurement-question metric, not the average-quality one.
- Failure-mode taxonomy: classify how models fail, not just whether (refusal, confabulation, schema break, truncation, partial, off-task, timeout, error). Maps to product-tolerance decisions.
- Machine-checkable ground truth only. No LLM-as-judge. Sidesteps the well-documented judge-bias issue.
- Prompt portability. Headline numbers use one canonical prompt across providers; portability cost (tuned vs canonical) is a v1 promise with a real contract.
See METHODOLOGY.md for formulas, retry policy, validator contract, and reproducibility caveats.
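METHODOLOGY.md holds the actual formulas. As a rough illustration of the effective_TCoT definition above, the sketch below divides total spend (failed attempts included) by the number of tasks that eventually succeeded, and tallies failures by mode. The `Attempt` record, the `Outcome` enum, the function names, and the dollar figures are all hypothetical and are not bellwether's schema or data; treating an all-failure run as infinite cost is likewise an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Success plus the eight failure modes named in the list above."""
    SUCCESS = "success"
    REFUSAL = "refusal"
    CONFABULATION = "confabulation"
    SCHEMA_BREAK = "schema_break"
    TRUNCATION = "truncation"
    PARTIAL = "partial"
    OFF_TASK = "off_task"
    TIMEOUT = "timeout"
    ERROR = "error"


@dataclass(frozen=True)
class Attempt:
    """One provider call (hypothetical record shape, not bellwether's own)."""
    task_id: str
    cost_usd: float
    outcome: Outcome


def effective_tcot(attempts: list[Attempt]) -> float:
    """Total spend, failed attempts included, per task that succeeded."""
    total_cost = sum(a.cost_usd for a in attempts)
    solved = {a.task_id for a in attempts if a.outcome is Outcome.SUCCESS}
    if not solved:
        # Assumption: with no successes, cost per success is unbounded.
        return float("inf")
    return total_cost / len(solved)


def failure_counts(attempts: list[Attempt]) -> Counter:
    """Tally how attempts failed, skipping successes."""
    return Counter(a.outcome for a in attempts if a.outcome is not Outcome.SUCCESS)


# Made-up numbers, for illustration only.
attempts = [
    Attempt("t1", 0.25, Outcome.SCHEMA_BREAK),  # failed try, still billed
    Attempt("t1", 0.25, Outcome.SUCCESS),       # retry succeeds
    Attempt("t2", 0.5, Outcome.SUCCESS),
    Attempt("t3", 0.5, Outcome.TIMEOUT),        # never succeeds
]
print(effective_tcot(attempts))  # 1.5 USD spent / 2 solved tasks = 0.75
print(failure_counts(attempts))
```

In this toy run the two solved tasks cost 0.75 USD each once the failed schema-break attempt and the timed-out task are billed against them, which is the gap between effective_TCoT and a naive per-call average.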
Install
From source (current; PyPI publish pending):
git clone https://github.com/cartesianxr7/bellwether
cd bellwether
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env # add ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY
pre-commit install # optional, gates secret leaks
pytest # 120+ tests; all should pass
Once v0.1.0 is published to PyPI:
pip install bellwether
Run
bellwether list providers # show registered provider adapters
bellwether list tasks # show registered tasks
# Smoke test: 2 instances, 1 run each, $1 cap, takes ~10 seconds and ~$0.01:
bellwether run --instances 2 --n 1 --max-cost 1
# Standard bench: 5 instances, 3 runs per instance, all 3 providers, $5 cap:
bellwether run --instances 5 --n 3 --max-cost 5
# Re-render leaderboard from existing results without re-running:
bellwether report results
The cost guardrail (--max-cost USD) is a hard cap on total spend per invocation. Strongly recommended.
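To make the idea of a hard cap concrete, here is a minimal sketch of a spend guard that refuses any call that would push cumulative cost past the limit. It is not bellwether's implementation: `CostGuard`, `BudgetExceeded`, and `charge` are hypothetical names, and the sketch assumes the cost of each call is known (or estimated) before it is charged.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push total spend past the cap."""


class CostGuard:
    """Hypothetical hard cap on cumulative spend for one invocation."""

    def __init__(self, max_cost_usd: float) -> None:
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a call's cost, or refuse it if the cap would be exceeded."""
        if self.spent_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"cap {self.max_cost_usd:.2f} USD would be exceeded "
                f"(spent {self.spent_usd:.2f}, next call {cost_usd:.2f})"
            )
        self.spent_usd += cost_usd


guard = CostGuard(max_cost_usd=1.0)
for cost in (0.5, 0.5, 0.25):  # illustrative per-call costs
    try:
        guard.charge(cost)
    except BudgetExceeded as exc:
        print(f"stopping run: {exc}")
        break
```

Checking before charging, rather than after, is what makes the cap hard in this sketch: the third call is rejected and total spend stays at the limit instead of overshooting it.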
Status
v0.1: methodology, package, CLI, structured-output extraction task across Claude Sonnet 4.6, GPT-4o, and Gemini 2.5 Flash Lite. 1-task leaderboard, 3-pass reproducibility data.
v0.2 through v0.5: function calling (BFCL), RAG (FinanceBench/NQ-open/HotpotQA), multi-step reasoning (GAIA validation set), long-context summarization (GovReport). One task per release.
v1: code-generation task with sandboxing, OpenRouter open-weights, tuned-prompt-track formalization, plugin loader.
Repository
- Code: github.com/cartesianxr7/bellwether
- Leaderboard: cartesianxr7.github.io/bellwether
- Methodology: cartesianxr7.github.io/bellwether/methodology.html
- Raw results JSON: results/
Contributing
See CONTRIBUTING.md. Adding a task or a provider adapter is a single PR; the contract is documented and small.
License
MIT. See LICENSE.
Author
Stephen Hedrick.
File details
Details for the file bellwether-0.1.0.tar.gz.
File metadata
- Download URL: bellwether-0.1.0.tar.gz
- Upload date:
- Size: 33.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 70c65d886bebd5f28bdbe337cafc63c2409685127e5d12b65a0565eccb7a721e |
| MD5 | d8c3d677b726ee0f70069d1c59676eb8 |
| BLAKE2b-256 | 3c5b642b9d411550253eda651bc1824455b1c9bd65802ae63ace919f6c63cafa |
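A downloaded archive can be checked against the SHA256 digest listed above for bellwether-0.1.0.tar.gz. The following is a small sketch using only Python's standard `hashlib`; the helper names and the local file path in the usage note are assumptions for illustration.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for bellwether-0.1.0.tar.gz.
EXPECTED_SHA256 = "70c65d886bebd5f28bdbe337cafc63c2409685127e5d12b65a0565eccb7a721e"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_published_digest(Path("bellwether-0.1.0.tar.gz"))` would return `True` for an intact copy saved under that (assumed) name in the working directory. Passing the wheel's digest as `expected` checks the built distribution the same way.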
Provenance
The following attestation bundles were made for bellwether-0.1.0.tar.gz:
Publisher: publish.yml on CartesianXR7/bellwether
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: bellwether-0.1.0.tar.gz
  - Subject digest: 70c65d886bebd5f28bdbe337cafc63c2409685127e5d12b65a0565eccb7a721e
- Sigstore transparency entry: 1449390619
- Sigstore integration time:
- Permalink: CartesianXR7/bellwether@06042ddd03303cad4c4bcb20f1475a156075df11
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/CartesianXR7
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@06042ddd03303cad4c4bcb20f1475a156075df11
- Trigger Event: workflow_dispatch
File details
Details for the file bellwether-0.1.0-py3-none-any.whl.
File metadata
- Download URL: bellwether-0.1.0-py3-none-any.whl
- Upload date:
- Size: 29.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 16b87a7a19da6773483d7ad19b8f17b6293d43f90303dfc09877421c88cdb259 |
| MD5 | 353195f8f90e0a2b1dc55ad461059372 |
| BLAKE2b-256 | ece12daa038d64b9dda4e597c4c80083662a899ecf24fb26d763177413e123c5 |
Provenance
The following attestation bundles were made for bellwether-0.1.0-py3-none-any.whl:
Publisher: publish.yml on CartesianXR7/bellwether
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: bellwether-0.1.0-py3-none-any.whl
  - Subject digest: 16b87a7a19da6773483d7ad19b8f17b6293d43f90303dfc09877421c88cdb259
- Sigstore transparency entry: 1449391019
- Sigstore integration time:
- Permalink: CartesianXR7/bellwether@06042ddd03303cad4c4bcb20f1475a156075df11
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/CartesianXR7
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@06042ddd03303cad4c4bcb20f1475a156075df11
- Trigger Event: workflow_dispatch