Reliability-science metrics for repeated-attempt evaluation of long-horizon LLM agents: pass@k, pass^k, Bayesian posteriors, RDC/VAF/GDS/MOP.

passwedge

Reliability-science metrics for repeated-attempt evaluation of long-horizon LLM agents.

pass@1 tells you whether a model can do a task once. It says nothing about whether an agent does so consistently across repeated attempts — the property that actually matters when a task takes hundreds of tool calls and one slip ends the run. passwedge measures that gap: capability vs. reliability.

  • Pure Python; depends only on numpy and scipy. Needs no GPU, network access, or API keys.
  • Every metric is implemented exactly as defined in its source paper, with the equation and citation in docs/DEFINITIONS.md. Where a paper gives a concept without one canonical estimator, passwedge documents its operationalization and never presents it as a verbatim reproduction.
  • Ships three ways: a library, a pytest plugin (@pytest.mark.passk), and a GitHub Action that comments a reliability table on your PRs.

Alpha (0.0.1a1). The API may change before 0.1.0.

Install

pip install passwedge

Quickstart (library)

import passwedge as pw

# 12 repeated attempts at one task; True = the attempt passed.
trial = pw.coerce_trial([True, True, False, True, True, False, True, True, True, False, True, True])

print("pass@1 :", pw.pass_at_k(trial.n, trial.c, 1))   # >=1 of 1 succeeds
print("pass@5 :", pw.pass_at_k(trial.n, trial.c, 5))   # >=1 of 5 succeeds
print("pass^5 :", pw.pass_pow_k(trial.n, trial.c, 5))  # ALL 5 succeed (reliability)

# Bayesian posterior over the task's true success probability (Jeffreys prior).
post = pw.beta_posterior(trial.c, trial.n)
print("posterior mean :", round(post.mean(), 3))
print("95% credible   :", tuple(round(x, 3) for x in post.credible_interval(0.95)))
print("E[p^5]         :", round(post.expected_pow_k(5), 3))  # expected all-5-success
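
For readers who want to see what the posterior lines above compute, here is a minimal standard-library sketch of the textbook Beta-Binomial update under a Jeffreys prior, Beta(1/2, 1/2). It is not passwedge's implementation: the function names are hypothetical, and the credible interval is left out because it needs an inverse incomplete beta function (available, for example, in scipy).

```python
def jeffreys_posterior(c: int, n: int) -> tuple[float, float]:
    """Return (a, b) of the Beta posterior after c successes in n attempts.

    Assumes a Jeffreys prior Beta(0.5, 0.5); hypothetical helper, not passwedge API.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    return c + 0.5, (n - c) + 0.5


def posterior_mean(a: float, b: float) -> float:
    """Mean of Beta(a, b)."""
    return a / (a + b)


def expected_pow_k(a: float, b: float, k: int) -> float:
    """E[p^k] for p ~ Beta(a, b): the product of (a + i) / (a + b + i) for i < k."""
    result = 1.0
    for i in range(k):
        result *= (a + i) / (a + b + i)
    return result


# Same data as the quickstart: 9 passes out of 12 attempts.
a, b = jeffreys_posterior(c=9, n=12)
print("posterior      :", f"Beta({a}, {b})")
print("posterior mean :", round(posterior_mean(a, b), 3))
print("E[p^5]         :", round(expected_pow_k(a, b, 5), 3))
```

Because p^k is convex in p, E[p^k] is never smaller than the posterior mean raised to the k-th power, so the two should not be used interchangeably.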

Quickstart (CI gate / GitHub Action)

Given a JSON file of task outcomes:

[
  {"task_id": "t1", "duration_bucket": "short", "outcomes": [true, true, true, false]},
  {"task_id": "t2", "duration_bucket": "long",  "outcomes": [true, false, false, false]}
]

running

passwedge ci --input trials.json --k 1,2 --metric pass_pow_k --fail-under 0.5

prints a Markdown report, emits $GITHUB_OUTPUT values, and exits non-zero if the chosen metric falls below the threshold. In a workflow, use the bundled action:

- uses: hinanohart/passwedge@v0.0.1a1
  with:
    input: trials.json
    fail-under: "0.5"
    metric: pass_pow_k
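
The gating logic described above can be sketched in plain Python. This is an illustration, not the passwedge CLI: it assumes the combinatorial pass^k estimator C(c, k) / C(n, k) and a simple mean across tasks, neither of which is confirmed here as passwedge's aggregation rule, and `gate` is a hypothetical helper.

```python
import json
from math import comb

# Same shape as the trials.json example above.
TRIALS_JSON = """
[
  {"task_id": "t1", "duration_bucket": "short", "outcomes": [true, true, true, false]},
  {"task_id": "t2", "duration_bucket": "long",  "outcomes": [true, false, false, false]}
]
"""


def pass_pow_k(n: int, c: int, k: int) -> float:
    """Chance that k attempts drawn without replacement from n all passed (assumed estimator)."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def gate(tasks: list[dict], k: int, fail_under: float) -> tuple[float, bool]:
    """Average per-task pass^k (assumed aggregation) and compare it to the threshold."""
    scores = [
        pass_pow_k(len(t["outcomes"]), sum(t["outcomes"]), k)
        for t in tasks
    ]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score >= fail_under


tasks = json.loads(TRIALS_JSON)
for k in (1, 2):
    score, ok = gate(tasks, k, fail_under=0.5)
    print(f"pass^{k} = {score:.3f} -> {'ok' if ok else 'FAIL'}")
```

Under these assumptions the two example tasks score 0.5 at k=1 and 0.25 at k=2, so a 0.5 threshold passes the first and fails the second. A real CI step would turn the failing case into a non-zero exit code.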

Quickstart (pytest plugin)

import pytest

# run_agent() stands for your own agent harness; it is not part of passwedge.
@pytest.mark.passk(attempts=20, k=5, min_pass_pow_k=0.9)
def test_agent_is_reliable():
    assert run_agent().solved   # executed 20 times; passes iff pass^5 >= 0.9
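
What the marker does can be approximated without pytest. The decorator below is a hypothetical stand-in, not the plugin's code: it reruns a test function, counts the attempts that raise no AssertionError, and applies the combinatorial pass^k estimator C(c, k) / C(n, k), which is an assumption about how the plugin scores.

```python
import functools
from math import comb


def repeat_and_gate(attempts: int, k: int, min_pass_pow_k: float):
    """Hypothetical decorator: run fn `attempts` times and gate on pass^k."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            passed = 0
            for _ in range(attempts):
                try:
                    fn(*args, **kwargs)
                    passed += 1
                except AssertionError:
                    pass  # a failed attempt only lowers the pass count
            score = comb(passed, k) / comb(attempts, k)
            if score < min_pass_pow_k:
                raise AssertionError(
                    f"pass^{k} = {score:.3f} < {min_pass_pow_k} "
                    f"({passed}/{attempts} attempts passed)"
                )
            return score
        return wrapper
    return decorator


# Deterministic stand-in for a flaky agent: fails on every 10th call.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    assert calls["n"] % 10 != 0


gated = repeat_and_gate(attempts=20, k=5, min_pass_pow_k=0.9)(flaky)
try:
    gated()
except AssertionError as exc:
    print("gate failed:", exc)
```

Here 18 of 20 attempts pass, a 90% per-attempt success rate, yet C(18, 5) / C(20, 5) = 8568 / 15504 ≈ 0.553, well under the 0.9 gate. That difference is the capability-versus-reliability gap the package is built to expose.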

Metrics

| Metric | Meaning | Source |
| --- | --- | --- |
| pass_at_k | probability ≥1 of k attempts succeeds | Chen et al. 2021 (arXiv:2107.03374) |
| pass_pow_k | probability all k attempts succeed | Beyond pass@1 (arXiv:2603.29231), Def. 2 |
| reliability_decay_curve / _slope | how pass^k decays with task duration (RDC/RDS) | arXiv:2603.29231, Def. 3 |
| variance_amplification_factor | VAF: variance ratio long-vs-short bucket | arXiv:2603.29231, Def. 4 |
| graceful_degradation_score | GDS: weighted partial credit over subtasks | arXiv:2603.29231, Def. 5 |
| meltdown_onset_point | MOP: tool-call entropy collapse detector | arXiv:2603.29231, Def. 6 |
| beta_posterior / dirichlet_posterior | Bayesian posterior mean + credible interval, E[p^k] | Don't Pass@k (arXiv:2510.04265) |
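
As a worked illustration of the first two rows, the sketch below computes pass@k with the unbiased estimator from Chen et al. 2021, 1 - C(n-c, k) / C(n, k), and pass^k in two ways: the combinatorial estimator C(c, k) / C(n, k) and the plug-in value (c/n)^k. Which pass^k estimator passwedge implements is specified in docs/DEFINITIONS.md and is not assumed here; the function names are hypothetical and use the standard library only.

```python
from math import comb


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al. 2021): 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_pow_k_combinatorial(n: int, c: int, k: int) -> float:
    """Chance that k attempts drawn without replacement from the n all passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def pass_pow_k_plugin(n: int, c: int, k: int) -> float:
    """Plug-in alternative: the empirical success rate raised to the k-th power."""
    return (c / n) ** k


# Quickstart data: n = 12 attempts, c = 9 passes.
n, c = 12, 9
print("pass@1                 :", pass_at_k_unbiased(n, c, 1))
print("pass@5                 :", pass_at_k_unbiased(n, c, 5))
print("pass^5 (combinatorial) :", round(pass_pow_k_combinatorial(n, c, 5), 4))
print("pass^5 (plug-in)       :", round(pass_pow_k_plugin(n, c, 5), 4))
```

With only three failures among twelve attempts, any five attempts must include a pass, so pass@5 is exactly 1.0, while the chance that all five pass is far lower. The two pass^5 values also differ from each other, which is why it matters to state the estimator alongside the number.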

Honest-marketing note. arXiv:2510.04265 does not define a metric called "Bayes@k"; passwedge's Bayesian helpers are our operationalization of that paper's Dirichlet framework. The scorio package (the paper's reference implementation) is the numeric baseline our test suite reproduces under a uniform prior.

MOP thresholds are dataset-specific. meltdown_onset_point requires explicit theta_h, delta, w (no defaults). The paper's calibration is exposed as MOP_PAPER_DEFAULTS but applying it verbatim to other data produces false positives.

Where it fits

passwedge is the measurement layer: it consumes repeated-attempt outcomes (a bool list, a JSON file, or a trace export) and reports reliability. It deliberately does not run rollouts, score reward/fitness functions, audit reward gameability, or detect reward hacking — those are separate concerns handled by tools such as scorewright (fitness scoring + anti-gaming), rewardfuzz (reward gameability auditing), and mav-bench (multi-agent verification). passwedge sits one layer up from those: feed it the per-attempt pass/fail outcomes they (or any eval harness) produce, and it tells you how reliably the agent succeeds. For confidence / hallucination fragility see yuragi; for streaming inference verification see conformlock.

License

Apache-2.0.
