Reliability-science metrics for repeated-attempt evaluation of long-horizon LLM agents: pass@k, pass^k, Bayesian posteriors, RDC/VAF/GDS/MOP.

passwedge

Reliability-science metrics for repeated-attempt evaluation of long-horizon LLM agents.

pass@1 tells you whether a model can do a task once. It says nothing about whether an agent does so consistently across repeated attempts — the property that actually matters when a task takes hundreds of tool calls and one slip ends the run. passwedge measures that gap: capability vs. reliability.

  • Pure Python (numpy + scipy only), CPU-only, no GPU, no network, no API keys.
  • Every metric is implemented exactly as defined in its source paper, with the equation and citation in docs/DEFINITIONS.md. Where a paper gives a concept without one canonical estimator, passwedge documents its operationalization and never presents it as a verbatim reproduction.
  • Ships three ways: a library, a pytest plugin (@pytest.mark.passk), and a GitHub Action that comments a reliability table on your PRs.

Alpha (0.0.1a2). The API may change before 0.1.0.

Install

pip install passwedge

Quickstart (library)

import passwedge as pw

# 12 repeated attempts at one task; True = the attempt passed.
trial = pw.coerce_trial([True, True, False, True, True, False, True, True, True, False, True, True])

print("pass@1 :", pw.pass_at_k(trial.n, trial.c, 1))   # >=1 of 1 succeeds
print("pass@5 :", pw.pass_at_k(trial.n, trial.c, 5))   # >=1 of 5 succeeds
print("pass^5 :", pw.pass_pow_k(trial.n, trial.c, 5))  # ALL 5 succeed (reliability)

# Bayesian posterior over the task's true success probability (Jeffreys prior).
post = pw.beta_posterior(trial.c, trial.n)
print("posterior mean :", round(post.mean(), 3))
print("95% credible   :", tuple(round(x, 3) for x in post.credible_interval(0.95)))
print("E[p^5]         :", round(post.expected_pow_k(5), 3))  # expected all-5-success
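
The combinatorial estimators behind the first three prints can be sketched with the standard library alone. This is a minimal sketch, not passwedge's implementation: `pass_at_k_sketch` follows the unbiased estimator of Chen et al. 2021, while `pass_pow_k_sketch` assumes the analogous "all k drawn attempts passed" estimator C(c, k) / C(n, k). Both function names are hypothetical, and the estimator passwedge actually uses for pass^k may differ.

```python
from math import comb


def pass_at_k_sketch(n: int, c: int, k: int) -> float:
    """P(at least one of k attempts drawn without replacement from n passes)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k_sketch(n: int, c: int, k: int) -> float:
    """P(all k attempts drawn without replacement from n pass). ASSUMED estimator."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


# Same 12 outcomes as the quickstart: 9 passes, 3 failures.
outcomes = [True, True, False, True, True, False,
            True, True, True, False, True, True]
n, c = len(outcomes), sum(outcomes)

print("pass@1 :", pass_at_k_sketch(n, c, 1))   # 0.75
print("pass@5 :", pass_at_k_sketch(n, c, 5))   # 1.0
print("pass^5 :", round(pass_pow_k_sketch(n, c, 5), 4))
```

With only three failures in twelve attempts, any draw of five must contain a pass, so pass@5 saturates at 1.0, while the all-five-pass score under this estimator is 126/792, roughly 0.16. That spread is the capability vs. reliability gap described above.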

Quickstart (CI gate / GitHub Action)

Given a JSON file of task outcomes:

[
  {"task_id": "t1", "duration_bucket": "short", "outcomes": [true, true, true, false]},
  {"task_id": "t2", "duration_bucket": "long",  "outcomes": [true, false, false, false]}
]

passwedge ci --input trials.json --k 1,2 --metric pass_pow_k --fail-under 0.5

prints a Markdown report, emits $GITHUB_OUTPUT values, and exits non-zero if the chosen metric is below the threshold. In a workflow, use the bundled action:

- uses: hinanohart/passwedge@v0.0.1a2
  with:
    input: trials.json
    fail-under: "0.5"
    metric: pass_pow_k
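
The gating logic can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the CLI's actual behavior: the function `gate_sketch` is hypothetical, the per-task score uses the assumed estimator C(c, k) / C(n, k), and averaging across tasks is an assumption, since how passwedge aggregates over tasks and over multiple k values is not stated here.

```python
import json
from math import comb

TRIALS_JSON = """
[
  {"task_id": "t1", "duration_bucket": "short", "outcomes": [true, true, true, false]},
  {"task_id": "t2", "duration_bucket": "long",  "outcomes": [true, false, false, false]}
]
"""


def gate_sketch(trials, k, fail_under):
    """Return (mean score, per-task scores, passed?) for an assumed pass^k gate."""
    per_task = {}
    for trial in trials:
        outcomes = trial["outcomes"]
        n, c = len(outcomes), sum(bool(o) for o in outcomes)
        per_task[trial["task_id"]] = comb(c, k) / comb(n, k)
    mean_score = sum(per_task.values()) / len(per_task)  # ASSUMED aggregation
    # The gate fails only when the score is strictly below the threshold.
    return mean_score, per_task, mean_score >= fail_under


trials = json.loads(TRIALS_JSON)
for k in (1, 2):
    mean_score, per_task, ok = gate_sketch(trials, k, fail_under=0.5)
    exit_code = 0 if ok else 1
    print(f"k={k} mean pass^k={mean_score:.3f} per-task={per_task} exit={exit_code}")
```

Under these assumptions the example data passes at k=1 (mean 0.5, not below the threshold) and fails at k=2 (mean 0.25), which shows why the choice of k matters for a reliability gate.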

Quickstart (pytest plugin)

import pytest

@pytest.mark.passk(attempts=20, k=5, min_pass_pow_k=0.9)
def test_agent_is_reliable():
    assert run_agent().solved   # executed 20×; passes iff pass^5 >= 0.9
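
To see what such a marker implies, the repeat-then-gate idea can be sketched without pytest. This is an illustration only: `reliability_gate_sketch` and `flaky_test` are hypothetical names, the score uses the assumed estimator C(c, k) / C(n, k), and the real plugin's handling of errors, fixtures, and reporting is not shown.

```python
from itertools import count
from math import comb
from typing import Callable, Tuple


def reliability_gate_sketch(
    test_fn: Callable[[], None], attempts: int, k: int, min_pass_pow_k: float
) -> Tuple[float, bool]:
    """Run test_fn `attempts` times; an AssertionError counts as a failed attempt."""
    passed = 0
    for _ in range(attempts):
        try:
            test_fn()
        except AssertionError:
            continue
        passed += 1
    score = comb(passed, k) / comb(attempts, k)  # ASSUMED pass^k estimator
    return score, score >= min_pass_pow_k


_calls = count(1)


def flaky_test() -> None:
    """Deterministic stand-in for an agent that fails every 10th attempt."""
    assert next(_calls) % 10 != 0


score, ok = reliability_gate_sketch(flaky_test, attempts=20, k=5, min_pass_pow_k=0.9)
print(f"pass^5 = {score:.3f}, gate passed: {ok}")
```

Here 18 of 20 attempts pass, yet the score is 8568/15504, about 0.55, so the gate fails. Under this assumed estimator even 19 of 20 gives 0.75, so a threshold of 0.9 at k=5 effectively demands a clean sweep of all 20 attempts.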

Metrics

| Metric | Meaning | Source |
| --- | --- | --- |
| pass_at_k | probability ≥1 of k attempts succeeds | Chen et al. 2021 (arXiv:2107.03374) |
| pass_pow_k | probability all k attempts succeed | Beyond pass@1 (arXiv:2603.29231), Def. 2 |
| reliability_decay_curve / _slope | how pass^k decays with task duration (RDC/RDS) | arXiv:2603.29231, Def. 3 |
| variance_amplification_factor | VAF: variance ratio long-vs-short bucket | arXiv:2603.29231, Def. 4 |
| graceful_degradation_score | GDS: weighted partial credit over subtasks | arXiv:2603.29231, Def. 5 |
| meltdown_onset_point | MOP: tool-call entropy collapse detector | arXiv:2603.29231, Def. 6 |
| beta_posterior / dirichlet_posterior | Bayesian posterior mean + credible interval, E[p^k] | Don't Pass@k (arXiv:2510.04265) |

Honest-marketing note. arXiv:2510.04265 does not define a metric called "Bayes@k"; passwedge's Bayesian helpers are our operationalization of that paper's Dirichlet framework. The scorio package (the paper's reference implementation) is the numeric baseline our test suite reproduces under a uniform prior.
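
For binary pass/fail outcomes a Dirichlet posterior reduces to a Beta posterior, whose mean and E[p^k] have closed forms. The sketch below illustrates that arithmetic only; `BetaPosteriorSketch` and `beta_posterior_sketch` are hypothetical names, the symmetric `prior` parameter is an assumption (0.5 for Jeffreys, 1.0 for uniform), and none of this is taken from passwedge or scorio source code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosteriorSketch:
    a: float  # prior pseudo-count + observed passes
    b: float  # prior pseudo-count + observed failures

    def mean(self) -> float:
        return self.a / (self.a + self.b)

    def expected_pow_k(self, k: int) -> float:
        """E[p^k] for p ~ Beta(a, b): product of (a + i) / (a + b + i), i < k."""
        out = 1.0
        for i in range(k):
            out *= (self.a + i) / (self.a + self.b + i)
        return out


def beta_posterior_sketch(c: int, n: int, prior: float = 0.5) -> BetaPosteriorSketch:
    """Posterior after c passes in n attempts under a symmetric Beta(prior, prior)."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    return BetaPosteriorSketch(prior + c, prior + (n - c))


jeffreys = beta_posterior_sketch(9, 12, prior=0.5)  # Beta(9.5, 3.5)
uniform = beta_posterior_sketch(9, 12, prior=1.0)   # Beta(10, 4)
print("Jeffreys mean, E[p^5]:", round(jeffreys.mean(), 3), round(jeffreys.expected_pow_k(5), 3))
print("uniform  mean, E[p^5]:", round(uniform.mean(), 3), round(uniform.expected_pow_k(5), 3))
```

Because E[p^k] averages over uncertainty in p, it generally differs from the plug-in value (c/n)^k, and the gap shrinks as the number of attempts grows.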

MOP thresholds are dataset-specific. meltdown_onset_point requires explicit theta_h, delta, w (no defaults). The paper's calibration is exposed as MOP_PAPER_DEFAULTS but applying it verbatim to other data produces false positives.
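
To give a feel for why thresholds depend on the data, here is a generic sliding-window entropy check over a tool-call trace. It is deliberately not the paper's Def. 6 and not passwedge's meltdown_onset_point: the function names are hypothetical, and the `window` and `threshold` parameters are illustrative stand-ins that are not claimed to correspond to theta_h, delta, or w.

```python
from collections import Counter
from math import log2
from typing import Optional, Sequence


def window_entropy(calls: Sequence[str]) -> float:
    """Shannon entropy (bits) of the tool-name distribution in one window."""
    total = len(calls)
    return -sum((m / total) * log2(m / total) for m in Counter(calls).values())


def first_low_entropy_step(
    tool_calls: Sequence[str], window: int, threshold: float
) -> Optional[int]:
    """Index of the last call in the first window whose entropy drops below threshold."""
    for end in range(window, len(tool_calls) + 1):
        if window_entropy(tool_calls[end - window:end]) < threshold:
            return end - 1
    return None


# A varied trace that degenerates into repeating a single tool call.
trace = ["search", "read", "edit", "test"] * 3 + ["edit"] * 6
print(first_low_entropy_step(trace, window=4, threshold=1.0))  # 13
```

An agent whose healthy behavior already relies on one or two tools would sit below this threshold from the start, which is the kind of false positive the note above warns about when a calibration is reused on different data.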

Where it fits

passwedge is the measurement layer: it consumes repeated-attempt outcomes (a bool list, a JSON file, or a trace export) and reports reliability. It deliberately does not run rollouts, score reward/fitness functions, audit reward gameability, or detect reward hacking — those are separate concerns handled by tools such as scorewright (fitness scoring + anti-gaming), rewardfuzz (reward gameability auditing), and mav-bench (multi-agent verification). passwedge sits one layer up from those: feed it the per-attempt pass/fail outcomes they (or any eval harness) produce, and it tells you how reliably the agent succeeds. For confidence / hallucination fragility see yuragi; for streaming inference verification see conformlock.

License

Apache-2.0.
