bellwether

The cost-and-failure-mode benchmark for LLM agents. Methodology plus Python package for honest, reproducible cross-provider agent evaluation.

Live leaderboard · Methodology

Why

Cross-provider LLM benchmarks today rank capability ("which model is smarter on average"). HELM and Chatbot Arena own that ground.

Practitioners building production systems need a different answer: which provider for THIS task, at THIS cost when retries and failures are accounted for, with THESE failure modes that map to my product's tolerance.

bellwether answers the procurement question and ships the toolkit anyone can run on their own prompts.

What it measures

  • effective_TCoT: total cost per successfully completed task, including the cost of failed retries. The procurement-question metric, not the average-quality one.
  • Failure-mode taxonomy: classify how models fail, not just whether (refusal, confabulation, schema break, truncation, partial, off-task, timeout, error). Maps to product-tolerance decisions.
  • Machine-checkable ground truth only. No LLM-as-judge. Sidesteps the well-documented judge-bias issue.
  • Prompt portability. Headline numbers use one canonical prompt across providers; portability cost (tuned vs canonical) is a v1 promise with a real contract.
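As a rough illustration of the first metric, here is a minimal sketch of one plausible reading of effective_TCoT: all spend, including failed attempts, divided by the number of tasks that eventually succeeded. The `Attempt` record and the `effective_tcot` function are hypothetical names invented for this example, not part of the bellwether package; the authoritative formula and retry policy are in METHODOLOGY.md.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One provider call for a task instance (hypothetical record shape)."""
    task_id: str
    cost_usd: float
    success: bool


def effective_tcot(attempts: list[Attempt]) -> float:
    """Total spend, failed attempts included, per successfully completed task."""
    total_cost = sum(a.cost_usd for a in attempts)
    completed = {a.task_id for a in attempts if a.success}
    if not completed:
        # Assumption: with zero successes, cost per success is treated as unbounded.
        return float("inf")
    return total_cost / len(completed)


attempts = [
    Attempt("t1", 0.25, False),  # failed first try still costs money
    Attempt("t1", 0.25, True),   # retry succeeds
    Attempt("t2", 0.25, True),
    Attempt("t3", 0.25, False),  # never succeeds, but its cost is counted
]
print(effective_tcot(attempts))
```

In this toy data, four calls at $0.25 each complete two tasks, so the effective cost per completed task is double the per-call price.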

See METHODOLOGY.md for formulas, retry policy, validator contract, and reproducibility caveats.
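To make the taxonomy and the "machine-checkable ground truth" idea concrete, the sketch below classifies a structured-output response against a known expected record without any LLM judge. The eight mode names come from the list above; the `FailureMode` enum, the `classify` function, and its decision rules are assumptions made for illustration and may differ from bellwether's actual validator contract.

```python
import json
from enum import Enum
from typing import Optional


class FailureMode(Enum):
    """The failure modes named in this README (enum itself is hypothetical)."""
    REFUSAL = "refusal"
    CONFABULATION = "confabulation"
    SCHEMA_BREAK = "schema_break"
    TRUNCATION = "truncation"
    PARTIAL = "partial"
    OFF_TASK = "off_task"
    TIMEOUT = "timeout"
    ERROR = "error"


def classify(raw: str, expected: dict, truncated: bool = False) -> Optional[FailureMode]:
    """Return None on success, otherwise a failure mode. Rules are illustrative only."""
    if truncated:
        return FailureMode.TRUNCATION
    try:
        got = json.loads(raw)
    except json.JSONDecodeError:
        return FailureMode.SCHEMA_BREAK  # output is not valid JSON
    if not isinstance(got, dict) or set(got) - set(expected):
        return FailureMode.SCHEMA_BREAK  # wrong shape or unexpected keys
    if set(expected) - set(got):
        return FailureMode.PARTIAL       # some required fields are missing
    if got != expected:
        return FailureMode.CONFABULATION  # right shape, wrong values
    return None


expected = {"name": "Ada", "year": 1843}
print(classify('{"name": "Ada", "year": 1843}', expected))
print(classify('{"name": "Ada"}', expected))
print(classify("not json", expected))
```

Because every branch is a deterministic check against ground truth, two runs over the same outputs always produce the same labels.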

Install

From source (current; PyPI publish pending):

git clone https://github.com/cartesianxr7/bellwether
cd bellwether
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cp .env.example .env       # add ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY
pre-commit install         # optional, gates secret leaks
pytest                     # 120+ tests; all should pass

After v0.1.0 publish to PyPI:

pip install bellwether

Run

bellwether list providers           # show registered provider adapters
bellwether list tasks               # show registered tasks

# Smoke test: 2 instances, 1 run each, $1 cap, takes ~10 seconds and ~$0.01:
bellwether run --instances 2 --n 1 --max-cost 1

# Standard bench: 5 instances, 3 runs per instance, all 3 providers, $5 cap:
bellwether run --instances 5 --n 3 --max-cost 5

# Re-render leaderboard from existing results without re-running:
bellwether report results

The cost guardrail (--max-cost USD) is a hard cap on total spend per invocation. Strongly recommended.
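For readers wiring a similar cap into their own harness, a hard spend limit can be sketched as a small guard that refuses to start a call once the running total would exceed the budget. This README does not describe how bellwether enforces --max-cost internally, so the `CostGuard` class and `BudgetExceeded` exception below are hypothetical and only show the general pattern.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the next call would push spend past the cap."""


class CostGuard:
    """Tracks cumulative spend against a hard USD cap (hypothetical helper)."""

    def __init__(self, max_cost_usd: float) -> None:
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def reserve(self, estimated_usd: float) -> None:
        """Check the cap before a call is made; raise instead of overspending."""
        if self.spent_usd + estimated_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"cap ${self.max_cost_usd:.2f} would be exceeded "
                f"(spent ${self.spent_usd:.2f}, next call ~${estimated_usd:.2f})"
            )

    def record(self, actual_usd: float) -> None:
        """Add the actual cost of a finished call to the running total."""
        self.spent_usd += actual_usd


guard = CostGuard(max_cost_usd=1.0)
completed = 0
try:
    for _ in range(5):
        guard.reserve(0.5)  # assumed per-call estimate
        guard.record(0.5)   # assumed actual cost of the call
        completed += 1
except BudgetExceeded as exc:
    print(f"stopped after {completed} calls: {exc}")
```

Checking before each call, rather than after, is what makes the cap hard: the run stops while total spend is still at or below the limit.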

Status

v0.1: methodology, package, CLI, structured-output extraction task across Claude Sonnet 4.6, GPT-4o, and Gemini 2.5 Flash Lite. 1-task leaderboard, 3-pass reproducibility data.

v0.2 through v0.5: function calling (BFCL), RAG (FinanceBench/NQ-open/HotpotQA), multi-step reasoning (GAIA validation set), long-context summarization (GovReport). One task per release.

v1: code-generation task with sandboxing, OpenRouter open-weights, tuned-prompt-track formalization, plugin loader.

Contributing

See CONTRIBUTING.md. Adding a task or a provider adapter is a single PR; the contract is documented and small.

License

MIT. See LICENSE.

Author

Stephen Hedrick.
