Skip to main content

Reliability and reproducibility statistics for stochastic agent / tool-use evals.

Project description

agentrel

Reliability & reproducibility statistics for stochastic agent / tool-use evals.

CI License: MIT Python 3.10+

The problem

Agent / tool-use evals — multi-turn, partial-credit, trajectory-scored — are the hottest and least reproducible corner of LLM evaluation. The same agent on the same task scores differently run to run, because the agent is stochastic. That breaks the habits people carry over from single-shot benchmarks:

  • A single run is a coin flip, not a measurement. Reporting one pass/fail per task throws away the run-to-run variance that dominates these evals.
  • A headline number with no interval invites false conclusions. "Agent A beats agent B by 3 points" is meaningless without knowing the 3 points clears the noise — and partial-credit, multi-run scores aren't binomial, so the textbook proportion CI doesn't apply.
  • pass@k quietly conflates capability with luck. Best-of-k looks great while the agent silently fails most of the time; the reliability question ("does it succeed every time?") needs pass^k, which is rarely reported.
  • "How many runs do I need?" usually goes unasked, so half the per-task pass rates in a report have confidence intervals too wide to support any claim.

agentrel is a small, framework-agnostic library that makes all of this explicit: it decomposes how much of a score is the agent vs. luck, attaches honest confidence intervals to every number, does paired model-vs-model comparison with multiple-comparison control, and stamps results for reproducibility.

Keyless & offline. agentrel never calls a model. It analyzes eval results you already have (or the built-in simulate_agent_runs() generator). Bring your own runs via the JSON/CSV or Inspect-log adapters.

Install

git clone https://github.com/yongzhe2160cs/agent-eval-reliability
cd agent-eval-reliability
uv venv && uv pip install -e ".[dev]"

Worked example

examples/synthetic_runs.json is a synthetic multi-run agent-eval log: 16 tasks, 10 runs each, partial-credit scores in [0, 1]. Load it and ask for a report:

import agentrel as ar

runs = ar.from_json("examples/synthetic_runs.json", agent="demo-agent")
print(ar.reliability_report(runs, ks=(1, 2, 5)).summary())
Reliability report — agent='demo-agent'
  design: 16 tasks x ~10.0 runs (160 runs)
  mean score:   0.6261 [0.4951, 0.7554] (95% CI, mean-score cluster-bootstrap)
  variance:     between-task=0.0577  within-task(luck)=0.0849
  ICC(1):       0.404  (luck-dominated; single runs are noisy)
  pass@1:       0.4688 [0.2938, 0.6438] (95% CI, pass@1 cluster-bootstrap)
  pass@2:       0.6181 [0.4194, 0.7833] (95% CI, pass@2 cluster-bootstrap)
  pass@5:       0.8090 [0.5816, 0.9187] (95% CI, pass@5 cluster-bootstrap)
  pass^1:       0.4688 [0.2938, 0.6438] (95% CI, pass^1 cluster-bootstrap)
  pass^2:       0.3194 [0.1556, 0.5223] (95% CI, pass^2 cluster-bootstrap)
  pass^5:       0.1947 [0.0506, 0.4137] (95% CI, pass^5 cluster-bootstrap)
  flakiness:    6/16 tasks have a pass-rate CI wider than 0.4 (too few runs to trust)
  provenance:   input=e1b47bf4082d5e13 numpy=2.4.6 agentrel=0.1.0 seed=0

How to read it:

  • ICC(1) = 0.40 — only ~40% of the score variance is the agent's consistent per-task skill; the other 60% is run-to-run luck. A single run on this suite is genuinely noisy, and the verdict line says so.
  • pass@5 (0.81) vs pass^5 (0.19) — best-of-5 looks strong, but the agent almost never solves a task on all 5 attempts. That gap is the reliability story, and it is invisible if you only report pass@k.
  • Every number carries a 95% CI computed by hierarchical bootstrap (resampling tasks and runs), because partial-credit multi-run scores are not binomial.
  • 6 of 16 tasks are flagged flaky — their pass-rate CI is wider than 0.4, i.e. too few runs to trust. Those are the first place to spend more compute.

Full runnable walkthrough — ingest → report → power → flakiness → paired comparison → determinism check — is in examples/demo.py:

uv run python examples/demo.py

Comparing two agents

Paired comparison on a shared task set, with multiple-comparison control across tasks:

import agentrel as ar

a = ar.simulate_agent_runs(n_tasks=40, runs_per_task=10, base_skill=0.62, agent="candidate", seed=11)
b = ar.simulate_agent_runs(n_tasks=40, runs_per_task=10, base_skill=0.45, agent="baseline", seed=11)

cmp = ar.compare_agents(a, b)
print(cmp.paired.delta)            # 0.1214 [0.0788, 0.1642] (95% CI, paired-delta bootstrap)
print(cmp.paired.significant)      # True  (paired t-test on per-task means)
print(cmp.n_sig_holm, cmp.n_sig_bh)  # per-task discoveries after Holm / BH correction

The aggregate paired delta answers "is A better overall?" (respecting the pairing — same tasks). The per-task tests answer "on which tasks?" — and are corrected by both Holm (controls family-wise error) and Benjamini-Hochberg (controls false discovery rate), so testing 40 tasks at once doesn't manufacture false positives.

What's in the box

Area API What it gives you
Data model RunSet, TaskRun per-task, multi-run results with partial-credit scores
Ingest from_json, from_csv, from_records, from_inspect_log generic adapters + an Inspect-AI .eval-log stub
Variance variance_components, icc one-way random-effects decomposition; ICC(1) = "agent vs luck"
Coverage pass_at_k, pass_hat_k unbiased best-of-k and all-of-k, with bootstrap CIs
Uncertainty mean_score_ci cluster-bootstrap CI for the task-averaged mean score
Power min_runs_for_ci_width minimum runs/task for a target CI half-width
Comparison compare_agents, paired_delta, holm, benjamini_hochberg paired delta + per-task tests with multiplicity control
Reproducibility stamp, determinism_check, flakiness_report provenance, pipeline determinism, flaky-task flagging
One call reliability_report bundles all of the above into a readable summary
Demo data simulate_agent_runs synthetic stochastic agent runs (tune ICC via skill-spread vs luck)

Methods notes

  • ICC(1) uses the standard one-way random-effects ANOVA estimator with the n0 correction for unbalanced designs (different #runs per task). Verified against a hand-worked example in the tests.
  • pass@k / pass^k use the unbiased combinatorial estimators (pass@k = 1 - C(n-c,k)/C(n,k), the Chen et al. 2021 / HumanEval form; pass^k = C(c,k)/C(n,k)), verified against brute-force enumeration over all size-k subsets. CIs come from a hierarchical (task-then-run) bootstrap.
  • Paired comparison uses a paired t-test and Wilcoxon on per-task mean scores plus a paired bootstrap CI; per-task Welch tests feed Holm and BH.
  • Partial-credit, multi-run scores are not binomial, so aggregate CIs are bootstrap-based rather than Wald/Wilson. (Wilson is used for per-task binary pass-rate flakiness, where the binomial model applies.)

Why simulation, and what a full empirical version adds

This toolkit is the statistics layer; it is deliberately model-free so it stays keyless and reproducible in CI. simulate_agent_runs() stands in for real runs by modeling the structure that makes agent evals hard (between-task skill spread vs within-task luck), which is exactly what's needed to unit-test the estimators.

A full empirical study on top would add: real multi-epoch agent runs against a live model + tool sandbox; a non-stub Inspect adapter reading .eval logs directly; trajectory-level features (tool-call counts, tokens, wall-clock) joined to scores; and a recommended-N report driven by observed within-task variance. None of that changes the math here — it just feeds it real data.

Development

uv run pytest          # 83 tests, verified against closed-form / brute-force references
uv run ruff check .    # lint
uv run ruff format .   # format

License

MIT — see LICENSE.


agentrel is part of a statistical-rigor-for-AI-evals toolkit: deltagate (paired-delta validation for eval comparisons), calibstats (calibration metrics with confidence intervals), leaderboard-ci (leaderboard re-ranking with CIs and tie bands). Full portfolio: github.com/yongzhe2160cs.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

agentrel-0.1.0.tar.gz (27.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

agentrel-0.1.0-py3-none-any.whl (23.8 kB view details)

Uploaded Python 3

File details

Details for the file agentrel-0.1.0.tar.gz.

File metadata

  • Download URL: agentrel-0.1.0.tar.gz
  • Upload date:
  • Size: 27.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for agentrel-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e97042f32449c6b8be5b18ad74c37dd5711a4e2da520365e6cb637958c319eb6
MD5 185ca990a416e984dc0586c903ecc323
BLAKE2b-256 ea3aad3accc8a58fbc9dc256ab469b7ab2dc77a16b56091d28be52c56e0e0519

See more details on using hashes here.

File details

Details for the file agentrel-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: agentrel-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 23.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for agentrel-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d498c15d9a330883f4539ec0a2e8787722dae7ab36610724a39f176e355b5f8f
MD5 6bd24ac7a8dd9bab71616d1d29f04303
BLAKE2b-256 5036135f41e188d9b191be3c8105b5e16f2f8ed956ac9d940cfea4139e825c0f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page