Reliability-science metrics for repeated-attempt evaluation of long-horizon LLM agents: pass@k, pass^k, Bayesian posteriors, RDC/VAF/GDS/MOP.
passwedge
Reliability-science metrics for repeated-attempt evaluation of long-horizon LLM agents.
pass@1 tells you whether a model can do a task once. It says nothing about whether an
agent does so consistently across repeated attempts — the property that actually matters
when a task takes hundreds of tool calls and one slip ends the run. passwedge measures that
gap: capability vs. reliability.
- Pure Python (`numpy` + `scipy` only), CPU-only, no GPU, no network, no API keys.
- Every metric is implemented exactly as defined in its source paper, with the equation and citation in `docs/DEFINITIONS.md`. Where a paper gives a concept without one canonical estimator, passwedge documents its operationalization and never presents it as a verbatim reproduction.
- Ships three ways: a library, a pytest plugin (`@pytest.mark.passk`), and a GitHub Action that comments a reliability table on your PRs.
Alpha (`0.0.1a1`). The API may change before `0.1.0`.
Install
pip install passwedge
Quickstart (library)
import passwedge as pw
# 12 repeated attempts at one task; True = the attempt passed.
trial = pw.coerce_trial([True, True, False, True, True, False, True, True, True, False, True, True])
print("pass@1 :", pw.pass_at_k(trial.n, trial.c, 1)) # >=1 of 1 succeeds
print("pass@5 :", pw.pass_at_k(trial.n, trial.c, 5)) # >=1 of 5 succeeds
print("pass^5 :", pw.pass_pow_k(trial.n, trial.c, 5)) # ALL 5 succeed (reliability)
# Bayesian posterior over the task's true success probability (Jeffreys prior).
post = pw.beta_posterior(trial.c, trial.n)
print("posterior mean :", round(post.mean(), 3))
print("95% credible :", tuple(round(x, 3) for x in post.credible_interval(0.95)))
print("E[p^5] :", round(post.expected_pow_k(5), 3)) # expected all-5-success
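To see what the three point estimates mean, they can be reproduced by hand. The sketch below assumes the standard combinatorial estimators: the unbiased pass@k of Chen et al. 2021, 1 - C(n-c, k)/C(n, k), and its all-success analogue C(c, k)/C(n, k) for pass^k. The helper names are hypothetical, and the definitions passwedge actually implements are the ones recorded in `docs/DEFINITIONS.md`, so check that file if the numbers differ.

```python
from math import comb


def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts, drawn without replacement
    from n recorded attempts (c of them passing), is a pass."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when there are fewer than k failures to draw.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k_sketch(n: int, c: int, k: int) -> float:
    """Chance that all k attempts, drawn without replacement from n
    recorded attempts (c of them passing), are passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Same 12 outcomes as the quickstart trial: 9 passes, 3 failures.
outcomes = [True, True, False, True, True, False,
            True, True, True, False, True, True]
n, c = len(outcomes), sum(outcomes)

print("pass@1 :", pass_at_k_sketch(n, c, 1))
print("pass@5 :", pass_at_k_sketch(n, c, 5))
print("pass^5 :", round(pass_pow_k_sketch(n, c, 5), 4))
```

Under these assumptions, with only 3 failures among 12 attempts any draw of 5 must contain a pass, so pass@5 saturates at 1.0 while pass^5 stays far below pass@1. That spread is the capability-versus-reliability gap the package is built to measure.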
Quickstart (CI gate / GitHub Action)
Given a JSON file of task outcomes:
[
{"task_id": "t1", "duration_bucket": "short", "outcomes": [true, true, true, false]},
{"task_id": "t2", "duration_bucket": "long", "outcomes": [true, false, false, false]}
]
passwedge ci --input trials.json --k 1,2 --metric pass_pow_k --fail-under 0.5
The command prints a Markdown report, emits `$GITHUB_OUTPUT` values, and exits non-zero if the chosen metric is below the threshold. In a workflow, use the bundled action:
- uses: hinanohart/passwedge@v0.0.1a1
  with:
    input: trials.json
    fail-under: "0.5"
    metric: pass_pow_k
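The input file can come from any eval harness. A minimal sketch of producing it, assuming your harness keeps results in a dict keyed by task id (the `results` structure and the `to_trials` helper are hypothetical; only the three JSON keys come from the example above):

```python
import json

# Hypothetical harness results: task id -> (duration bucket, per-attempt pass/fail).
results = {
    "t1": ("short", [True, True, True, False]),
    "t2": ("long", [True, False, False, False]),
}


def to_trials(results):
    """Convert harness results into the list-of-objects layout shown above."""
    return [
        {
            "task_id": task_id,
            "duration_bucket": bucket,
            "outcomes": [bool(o) for o in outcomes],  # force JSON true/false
        }
        for task_id, (bucket, outcomes) in results.items()
    ]


payload = json.dumps(to_trials(results), indent=2)
print(payload)
```

Redirecting the script's output to `trials.json` gives a file in the same shape as the example input.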
Quickstart (pytest plugin)
import pytest

@pytest.mark.passk(attempts=20, k=5, min_pass_pow_k=0.9)
def test_agent_is_reliable():
    assert run_agent().solved  # executed 20×; passes iff pass^5 >= 0.9
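For intuition about how strict such a gate is, the same check can be sketched without the plugin. This assumes the combinatorial estimator C(c, k)/C(n, k) for pass^k, which may not be what the plugin computes (see `docs/DEFINITIONS.md`). `flaky_agent` is a hypothetical stand-in for `run_agent().solved`, and `gate` is an illustrative helper, not part of passwedge.

```python
import random
from math import comb


def pass_pow_k_estimate(n: int, c: int, k: int) -> float:
    """Assumed estimator: chance that k attempts drawn from n (c passing) all pass."""
    return comb(c, k) / comb(n, k)


def gate(attempt, attempts=20, k=5, min_pass_pow_k=0.9):
    """Run `attempt` repeatedly; return (pass^k estimate, whether the gate passes)."""
    passes = sum(bool(attempt()) for _ in range(attempts))
    estimate = pass_pow_k_estimate(attempts, passes, k)
    return estimate, estimate >= min_pass_pow_k


rng = random.Random(0)  # seeded so the run is repeatable


def flaky_agent() -> bool:
    """Hypothetical agent that solves the task about 95% of the time."""
    return rng.random() < 0.95


estimate, passed = gate(flaky_agent)
print(f"pass^5 estimate = {estimate:.3f}, gate passed = {passed}")

# One failure in 20 attempts already drops the estimate to 15/20.
print("19 of 20 passing:", pass_pow_k_estimate(20, 19, 5))
```

Under this assumed estimator, a threshold of 0.9 with 20 attempts and k=5 tolerates no failures at all, which is worth keeping in mind when choosing `attempts` and `min_pass_pow_k`.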
Metrics
| Metric | Meaning | Source |
|---|---|---|
| `pass_at_k` | probability ≥1 of k attempts succeeds | Chen et al. 2021 (arXiv:2107.03374) |
| `pass_pow_k` | probability all k attempts succeed | Beyond pass@1 (arXiv:2603.29231), Def. 2 |
| `reliability_decay_curve` / `_slope` | how pass^k decays with task duration (RDC/RDS) | arXiv:2603.29231, Def. 3 |
| `variance_amplification_factor` | VAF: variance ratio long-vs-short bucket | arXiv:2603.29231, Def. 4 |
| `graceful_degradation_score` | GDS: weighted partial credit over subtasks | arXiv:2603.29231, Def. 5 |
| `meltdown_onset_point` | MOP: tool-call entropy collapse detector | arXiv:2603.29231, Def. 6 |
| `beta_posterior` / `dirichlet_posterior` | Bayesian posterior mean + credible interval, E[p^k] | Don't Pass@k (arXiv:2510.04265) |
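The Bayesian quantities have a closed form in the binary pass/fail case. A minimal sketch, assuming a symmetric Beta(a0, a0) prior on the success probability p (a0 = 0.5 is the Jeffreys prior mentioned in the library quickstart, a0 = 1 is uniform): after c passes in n attempts the posterior is Beta(a0 + c, a0 + n - c), and E[p^k] is the product of (a + i)/(a + b + i) for i = 0..k-1. The function names are illustrative and are not passwedge's API.

```python
def beta_posterior_params(c: int, n: int, prior: float = 0.5):
    """Posterior Beta(a, b) after c passes in n attempts under a Beta(prior, prior) prior."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    return prior + c, prior + (n - c)


def posterior_mean(a: float, b: float) -> float:
    """Mean of Beta(a, b)."""
    return a / (a + b)


def expected_pow_k(a: float, b: float, k: int) -> float:
    """k-th raw moment of Beta(a, b): the expected chance that k fresh attempts all pass."""
    result = 1.0
    for i in range(k):
        result *= (a + i) / (a + b + i)
    return result


# 9 passes in 12 attempts, Jeffreys prior.
a, b = beta_posterior_params(c=9, n=12)
print("posterior mean :", round(posterior_mean(a, b), 3))
print("E[p^5]         :", round(expected_pow_k(a, b, 5), 3))
```

E[p^k] averages p^k over the posterior rather than raising the posterior mean to the k-th power, so it accounts for uncertainty about p when only a handful of attempts are available.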
Honest-marketing note. arXiv:2510.04265 does not define a metric called "Bayes@k"; passwedge's Bayesian helpers are our operationalization of that paper's Dirichlet framework. The `scorio` package (the paper's reference implementation) is the numeric baseline our test suite reproduces under a uniform prior.

MOP thresholds are dataset-specific. `meltdown_onset_point` requires explicit `theta_h`, `delta`, `w` (no defaults). The paper's calibration is exposed as `MOP_PAPER_DEFAULTS` but applying it verbatim to other data produces false positives.
Where it fits
passwedge is the measurement layer: it consumes repeated-attempt outcomes (a bool list,
a JSON file, or a trace export) and reports reliability. It deliberately does not run
rollouts, score reward/fitness functions, audit reward gameability, or detect reward hacking
— those are separate concerns handled by tools such as
scorewright (fitness scoring + anti-gaming),
rewardfuzz (reward gameability auditing), and
mav-bench (multi-agent verification). passwedge
sits one layer up from those: feed it the per-attempt pass/fail outcomes they (or any eval
harness) produce, and it tells you how reliably the agent succeeds. For confidence /
hallucination fragility see yuragi; for streaming
inference verification see conformlock.
License