Reliability-science metrics for repeated-attempt evaluation of long-horizon LLM agents: pass@k, pass^k, Bayesian posteriors, RDC/VAF/GDS/MOP.

passwedge

Reliability-science metrics for repeated-attempt evaluation of long-horizon LLM agents.

pass@1 tells you whether a model can do a task once. It says nothing about whether an agent does so consistently across repeated attempts — the property that actually matters when a task takes hundreds of tool calls and one slip ends the run. passwedge measures that gap: capability vs. reliability.

Pure Python (numpy + scipy only), CPU-only, no GPU, no network, no API keys.
Every metric is implemented exactly as defined in its source paper, with the equation and citation in docs/DEFINITIONS.md. Where a paper gives a concept without one canonical estimator, passwedge documents its operationalization and never presents it as a verbatim reproduction.
Ships three ways: a library, a pytest plugin (@pytest.mark.passk), and a GitHub Action that comments a reliability table on your PRs.

Alpha (0.0.1a2). The API may change before 0.1.0.

Install

pip install passwedge

Quickstart (library)

import passwedge as pw

# 12 repeated attempts at one task; True = the attempt passed.
trial = pw.coerce_trial([True, True, False, True, True, False, True, True, True, False, True, True])

print("pass@1 :", pw.pass_at_k(trial.n, trial.c, 1))   # >=1 of 1 succeeds
print("pass@5 :", pw.pass_at_k(trial.n, trial.c, 5))   # >=1 of 5 succeeds
print("pass^5 :", pw.pass_pow_k(trial.n, trial.c, 5))  # ALL 5 succeed (reliability)

# Bayesian posterior over the task's true success probability (Jeffreys prior).
post = pw.beta_posterior(trial.c, trial.n)
print("posterior mean :", round(post.mean(), 3))
print("95% credible   :", tuple(round(x, 3) for x in post.credible_interval(0.95)))
print("E[p^5]         :", round(post.expected_pow_k(5), 3))  # expected all-5-success
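
For intuition, the quickstart numbers can be approximated by hand with the standard library alone. The sketch below assumes the unbiased pass@k estimator from Chen et al. 2021. For pass^k it shows two common estimators, a combinatorial one and a plug-in one, because this page does not state which one `pw.pass_pow_k` uses. All function names here are illustrative and are not part of the passwedge API.

```python
from math import comb


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """P(at least one of k sampled attempts passes): 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k_combinatorial(n: int, c: int, k: int) -> float:
    """P(all k attempts drawn without replacement pass): C(c, k) / C(n, k)."""
    return comb(c, k) / comb(n, k)


def pass_pow_k_plugin(n: int, c: int, k: int) -> float:
    """Plug-in alternative: (c / n) ** k, treating c / n as the true pass rate."""
    return (c / n) ** k


# The quickstart trial: 12 attempts, 9 of which passed.
n, c = 12, 9
print("pass@1               :", pass_at_k_unbiased(n, c, 1))
print("pass@5               :", pass_at_k_unbiased(n, c, 5))
print("pass^5 (combinatorial):", round(pass_pow_k_combinatorial(n, c, 5), 4))
print("pass^5 (plug-in)      :", round(pass_pow_k_plugin(n, c, 5), 4))
```

With 9 passes out of 12, pass@5 is 1.0 under this estimator because only 3 attempts failed, so any 5 sampled attempts must include a pass. Both pass^5 estimates fall well below pass@1, which is the capability-versus-reliability gap described at the top of this page.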

Quickstart (CI gate / GitHub Action)

Given a JSON file of task outcomes:

[
  {"task_id": "t1", "duration_bucket": "short", "outcomes": [true, true, true, false]},
  {"task_id": "t2", "duration_bucket": "long",  "outcomes": [true, false, false, false]}
]

passwedge ci --input trials.json --k 1,2 --metric pass_pow_k --fail-under 0.5

This command prints a Markdown report, emits $GITHUB_OUTPUT values, and exits non-zero if the chosen metric is below the threshold. In a workflow, use the bundled action:

- uses: hinanohart/passwedge@v0.0.1a2
  with:
    input: trials.json
    fail-under: "0.5"
    metric: pass_pow_k
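
To make the gate logic concrete, here is a minimal sketch of what a threshold check over that JSON shape could look like. It is not the `passwedge ci` implementation. The combinatorial pass^k estimator, the mean across tasks, and the rule that every requested k must clear the threshold are all assumptions made for illustration, and `gate` is a hypothetical helper.

```python
import json
from math import comb

# Same shape as the trials.json example above.
TRIALS_JSON = """
[
  {"task_id": "t1", "duration_bucket": "short", "outcomes": [true, true, true, false]},
  {"task_id": "t2", "duration_bucket": "long",  "outcomes": [true, false, false, false]}
]
"""


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Assumed estimator: C(c, k) / C(n, k)."""
    return comb(c, k) / comb(n, k)


def gate(trials: list, ks: list, fail_under: float):
    """Return ({k: mean pass^k across tasks}, whether every k clears the threshold)."""
    report = {}
    for k in ks:
        per_task = [
            pass_pow_k(len(t["outcomes"]), sum(t["outcomes"]), k) for t in trials
        ]
        report[k] = sum(per_task) / len(per_task)
    ok = all(value >= fail_under for value in report.values())
    return report, ok


trials = json.loads(TRIALS_JSON)
report, ok = gate(trials, ks=[1, 2], fail_under=0.5)
print(report)
print("gate passed:", ok)
```

Under these assumptions the mean pass^1 is 0.5 and the mean pass^2 is 0.25, so the gate fails at a threshold of 0.5. A real CI step would turn that result into a non-zero exit code.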

Quickstart (pytest plugin)

import pytest

@pytest.mark.passk(attempts=20, k=5, min_pass_pow_k=0.9)
def test_agent_is_reliable():
    assert run_agent().solved   # executed 20×; passes iff pass^5 >= 0.9
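
The semantics described in the comment above (run the test body 20 times, pass if and only if pass^5 is at least 0.9) can be sketched without pytest. This is an illustration under stated assumptions, not the plugin's code: `passes_reliability_gate` and `make_fake_agent` are hypothetical names, and the combinatorial pass^k estimator is assumed.

```python
from math import comb
from typing import Callable


def passes_reliability_gate(
    attempt: Callable[[], bool], attempts: int, k: int, min_pass_pow_k: float
) -> bool:
    """Run `attempt` repeatedly and compare an estimated pass^k to a threshold."""
    successes = sum(1 for _ in range(attempts) if attempt())
    # Assumed estimator: C(successes, k) / C(attempts, k).
    estimate = comb(successes, k) / comb(attempts, k)
    return estimate >= min_pass_pow_k


def make_fake_agent(fail_every: int) -> Callable[[], bool]:
    """Deterministic stand-in for an agent: fails on every `fail_every`-th call."""
    calls = {"n": 0}

    def attempt() -> bool:
        calls["n"] += 1
        return calls["n"] % fail_every != 0

    return attempt


# Never fails within 20 attempts, so the gate passes.
print(passes_reliability_gate(make_fake_agent(21), attempts=20, k=5, min_pass_pow_k=0.9))
# Fails on attempt 20 only, so 19 of 20 succeed and the gate does not pass.
print(passes_reliability_gate(make_fake_agent(20), attempts=20, k=5, min_pass_pow_k=0.9))
```

Under this estimator a single failure in 20 attempts gives pass^5 = C(19, 5) / C(20, 5) = 0.75, which is below 0.9. A threshold like the one in the example therefore tolerates no failures at all, which is worth keeping in mind when choosing `attempts`, `k`, and the minimum.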

Metrics

| Metric | Meaning | Source |
| --- | --- | --- |
| `pass_at_k` | probability ≥1 of k attempts succeeds | Chen et al. 2021 (arXiv:2107.03374) |
| `pass_pow_k` | probability all k attempts succeed | Beyond pass@1 (arXiv:2603.29231), Def. 2 |
| `reliability_decay_curve` / `_slope` | how pass^k decays with task duration (RDC/RDS) | arXiv:2603.29231, Def. 3 |
| `variance_amplification_factor` | VAF: variance ratio long-vs-short bucket | arXiv:2603.29231, Def. 4 |
| `graceful_degradation_score` | GDS: weighted partial credit over subtasks | arXiv:2603.29231, Def. 5 |
| `meltdown_onset_point` | MOP: tool-call entropy collapse detector | arXiv:2603.29231, Def. 6 |
| `beta_posterior` / `dirichlet_posterior` | Bayesian posterior mean + credible interval, `E[p^k]` | Don't Pass@k (arXiv:2510.04265) |

Honest-marketing note. arXiv:2510.04265 does not define a metric called "Bayes@k"; passwedge's Bayesian helpers are our operationalization of that paper's Dirichlet framework. The scorio package (the paper's reference implementation) is the numeric baseline our test suite reproduces under a uniform prior.
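
For binary pass/fail outcomes a two-category Dirichlet prior reduces to a Beta prior, and the posterior mean and `E[p^k]` have closed forms. The sketch below works them out with the standard library for both a Jeffreys prior (the one named in the quickstart) and a uniform prior. It is a hand derivation for illustration. The function names are hypothetical, and whether passwedge computes these quantities the same way internally is an assumption.

```python
def beta_posterior_params(c: int, n: int, prior: float = 0.5):
    """Posterior Beta(prior + c, prior + n - c) after c passes in n attempts.

    prior=0.5 is the Jeffreys prior; prior=1.0 is the uniform prior.
    """
    return prior + c, prior + (n - c)


def posterior_mean(a: float, b: float) -> float:
    """Mean of Beta(a, b)."""
    return a / (a + b)


def expected_pow_k(a: float, b: float, k: int) -> float:
    """E[p^k] under Beta(a, b): the product of (a + i) / (a + b + i) for i < k."""
    out = 1.0
    for i in range(k):
        out *= (a + i) / (a + b + i)
    return out


# Quickstart trial: 9 passes out of 12 attempts.
for name, prior in [("Jeffreys", 0.5), ("uniform", 1.0)]:
    a, b = beta_posterior_params(c=9, n=12, prior=prior)
    print(name, "mean:", round(posterior_mean(a, b), 3),
          "E[p^5]:", round(expected_pow_k(a, b, 5), 3))
```

Note that `E[p^k]` is not the posterior mean raised to the k-th power. It averages `p^k` over the whole posterior, so uncertainty about `p` is carried into the reliability estimate.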

MOP thresholds are dataset-specific. meltdown_onset_point requires explicit theta_h, delta, and w arguments (no defaults). The paper's calibration is exposed as MOP_PAPER_DEFAULTS, but applying it verbatim to other data produces false positives.
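
One way to see why such thresholds do not transfer between datasets is to look at the Shannon entropy of tool-call names over a sliding window. The sketch below is not the paper's detector and not `meltdown_onset_point`. It computes only a windowed entropy series, with `w` read as a window length (an assumption about the parameter's meaning), and leaves out any onset rule.

```python
from collections import Counter
from math import log2


def window_entropy(calls: list) -> float:
    """Shannon entropy (in bits) of the tool-name distribution in one window."""
    total = len(calls)
    counts = Counter(calls)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def sliding_entropy(tool_calls: list, w: int) -> list:
    """Entropy of every length-w window over the trace."""
    return [
        window_entropy(tool_calls[i:i + w])
        for i in range(len(tool_calls) - w + 1)
    ]


# Hypothetical trace: varied tool use, then the agent gets stuck repeating one call.
trace = ["search", "read", "edit", "test"] * 2 + ["search"] * 8
series = sliding_entropy(trace, w=4)
print([round(h, 2) for h in series])
```

The maximum possible entropy of a window is `log2` of the number of distinct tools it can contain, so an agent with 2 tools can never exceed 1 bit while an agent with 16 tools can reach 4 bits. A fixed entropy threshold calibrated on one tool vocabulary therefore means something different on another, which is consistent with the requirement to pass thresholds explicitly.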

Where it fits

passwedge is the measurement layer: it consumes repeated-attempt outcomes (a bool list, a JSON file, or a trace export) and reports reliability. It deliberately does not run rollouts, score reward/fitness functions, audit reward gameability, or detect reward hacking — those are separate concerns handled by tools such as scorewright (fitness scoring + anti-gaming), rewardfuzz (reward gameability auditing), and mav-bench (multi-agent verification). passwedge sits one layer up from those: feed it the per-attempt pass/fail outcomes they (or any eval harness) produce, and it tells you how reliably the agent succeeds. For confidence / hallucination fragility see yuragi; for streaming inference verification see conformlock.

License

Apache-2.0.

Release history

0.0.1a2 (pre-release, this version): May 26, 2026
0.0.1a1 (pre-release): May 26, 2026

Download files

Source Distribution

passwedge-0.0.1a2.tar.gz (33.0 kB view details)

Uploaded May 26, 2026 Source

Built Distribution

passwedge-0.0.1a2-py3-none-any.whl (29.8 kB view details)

Uploaded May 26, 2026 Python 3

File details

Details for the file passwedge-0.0.1a2.tar.gz.

File metadata

Download URL: passwedge-0.0.1a2.tar.gz
Upload date: May 26, 2026
Size: 33.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for passwedge-0.0.1a2.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `1add953ac1589dd550e63df6c24bc2e32605b407639c8f35033ab22d5be4302c` |
| MD5 | `c08c686ce46c8282deb087f38372d8b2` |
| BLAKE2b-256 | `f4055bee57b20b0fce998f8622dc045a658d7f082545687633884471d757a936` |

File details

Details for the file passwedge-0.0.1a2-py3-none-any.whl.

File metadata

Download URL: passwedge-0.0.1a2-py3-none-any.whl
Upload date: May 26, 2026
Size: 29.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for passwedge-0.0.1a2-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `e9a42b38ac90f4cf27ff52b0a2516ecd6148315327d48c76e7fa343ff9ed521f` |
| MD5 | `43e0bd397539a34cdeba67e30c0844a5` |
| BLAKE2b-256 | `348403908646bb1f337ff0b9a466012744a5a1bf5e345dc452d71387708a2ff8` |
