A pre-flight auditor for rubric-based LLM rewards: detect gameable reward signals before you spend GPU on RL.

rubric-reward-lens

A pre-flight auditor for rubric-based LLM rewards. Point it at your rubric, your grader, and a few example answers — it tells you, before you spend GPU on RL, whether the reward signal actually measures quality or can be gamed.

⚠️ Status: v0.1, early. The API may change.

The problem

To train an LLM on a "soft" task with no single right answer (long-form writing, medical Q&A, support replies), you need a score for every attempt. The popular recipe — Rubrics as Rewards (Gunjal et al., 2025, arXiv:2507.17746) — is to write a checklist (a rubric), have an LLM grade each answer against it, and use that grade as the reward.
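As a minimal sketch of that recipe, the snippet below turns per-criterion verdicts into a single reward using a normalized weighted sum, which is one common way to aggregate a rubric. The criterion names, the weights, and the `rubric_reward` helper are illustrative assumptions and not part of rubric-reward-lens; in a real setup the verdicts would come from an LLM grader rather than a hard-coded dict.

```python
def rubric_reward(verdicts: dict[str, bool], weights: dict[str, float]) -> float:
    """Weighted fraction of rubric criteria judged as satisfied (0.0 to 1.0)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    # A criterion with no verdict is treated as not satisfied.
    earned = sum(w for name, w in weights.items() if verdicts.get(name, False))
    return earned / total


# Hypothetical rubric for a medical Q&A answer.
weights = {"mentions_dosage": 2.0, "advises_followup": 1.0, "avoids_jargon": 1.0}

# Stand-in for the grader's output on one answer.
verdicts = {"mentions_dosage": True, "advises_followup": False, "avoids_jargon": True}

reward = rubric_reward(verdicts, weights)
print(reward)  # 0.75
```

Everything downstream in RL optimizes this one number, so any way of flipping a verdict to `True` without improving the answer shows up directly as unearned reward.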

That reward is fragile and fails silently, and recent work shows it can be hacked.

You usually discover the reward was gameable only after an expensive RL run produces a worse model that scores higher. rubric-reward-lens turns that detection into an off-the-shelf, pip install-able check you run first.

It audits a different object from the reward-model benchmarks (RewardBench 2, arXiv:2506.01937; RM-Bench, arXiv:2410.16184): those score learned reward models on preference accuracy. This scores a rubric + LLM-grader reward signal for gameability.

Install

pip3 install rubric-reward-lens

Requires Python ≥ 3.11. Inference-only: needs numpy, pyyaml, httpx. No GPU, no torch.

From a clone (for development, or to run the examples/ scripts):

git clone https://github.com/aishwaryawambule/rubric-reward-lens
cd rubric-reward-lens
pip3 install .          # add ".[dev]" for pytest

Use a regular pip3 install ., not the editable pip3 install -e . — on recent Python (3.14) the editable install can silently produce a non-importable package.

Quickstart

Two ways to use it — as a Python library (embed it in an eval pipeline / CI gate / RL setup) or from the CLI (quick one-off audits). Both are shown below; see Usage for the full guide, custom graders, and integration patterns.

from rubric_reward_lens import Rubric, OpenRouterGrader, audit, load_demo

# Bring your own, or try the bundled synthetic demo:
rubric, responses = load_demo()

grader = OpenRouterGrader(model="anthropic/claude-haiku-4.5")  # needs OPENROUTER_API_KEY
card = audit(rubric, grader, responses, human_labels=True)

print(card.verdict)            # e.g. "⚠️ Hackable — reward can be gamed for +0.34 ..."
card.to_html("report.html")    # full report card

Prefer a local model (no API key, no cost, no data leaves your machine)? Use a model served by Ollama:

from rubric_reward_lens import OllamaGrader, audit, load_demo

rubric, responses = load_demo()
grader = OllamaGrader(model="qwen2.5:14b")   # needs `ollama serve` + the model pulled
card = audit(rubric, grader, responses, human_labels=True)

No grader at all? The CLI runs the whole pipeline offline with a deterministic grader:

rrl demo --out report.html
rrl audit --rubric examples/rubric.yaml --grader examples/grader.fake.yaml \
          --responses responses.json --out report.html

Grader configs for the CLI: {type: fake}, {type: ollama, model: ...}, or {type: openrouter, model: ...} — see examples/.

What it checks (the diagnostics)

All diagnostics are label-free except the last; bootstrap confidence intervals are reported throughout (following Adding Error Bars to Evals, arXiv:2411.00640).

| Diagnostic | Question it answers |
| --- | --- |
| Reward hacking | Do cheap "wins" (keyword-stuffing, verbosity padding, confident-wrong claims, format mimicry) earn reward they shouldn't? |
| Discrimination / monotonicity | When an answer is progressively degraded, does the reward actually fall? |
| Grader stability | Re-grading the same answer, how much does the reward wobble? |
| Criterion structure | Are criteria redundant, low-signal, or does the reward over-depend on one? |
| Criterion-order invariance | Does the reward change when the rubric's criteria are listed in a different order? (judge position bias, arXiv:2602.02219) |
| Human alignment (optional) | When you have human scores, how well does the reward agree (QWK / κ / calibration)? |
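To make the reward-hacking row and the bootstrap intervals concrete, here is a rough sketch of a percentile bootstrap over the per-response reward gain produced by a probe. It is not the library's implementation: the `bootstrap_mean_ci` helper is hypothetical, the reward values are made up for illustration, and only numpy is assumed.

```python
import numpy as np


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = values.size
    # Each row is one resample of the data, drawn with replacement.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = values[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(values.mean()), float(lower), float(upper)


# Made-up rewards for eight answers, before and after a keyword-stuffing probe.
original = np.array([0.50, 0.40, 0.60, 0.55, 0.45, 0.50, 0.35, 0.70])
hacked = np.array([0.80, 0.80, 0.80, 0.90, 0.70, 0.80, 0.80, 0.80])

gap, lower, upper = bootstrap_mean_ci(hacked - original)
print(f"hack gap = {gap:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

An interval that sits clearly above zero suggests the probe earns reward it should not; an interval that straddles zero means the sample cannot tell.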

The report card combines these into a composite trust score and a one-line verdict, followed by a per-diagnostic table. Every score in the table runs from 0 to 1 (1 = good) and comes with a plain-English "what it means" column; raw metrics are kept in the JSON output:

⚠️ Caution — trust score 0.63; review the diagnostics before training.
- Composite trust score: 0.63  (0–1, higher is better)

## Diagnostics
| Diagnostic   | Score | What it means                                |
| hacking      | 0.97  | not gameable                                 |
| monotonicity | 0.69  | mostly tracks quality (7 inversions)         |
| stability    | 1.00  | identical on re-grade                        |
| structure    | 0.50  | 2 low-signal criteria: accurate, not_evasive |
| alignment    | 0.00  | does not match human scores                  |
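Assuming the composite is the unweighted mean of exactly the five sub-scores shown above (the Limitations section describes it as a plain average), the 0.63 can be reproduced by hand. The second set of scores is hypothetical, added only to show how a badly gameable reward can still average out to a comfortable-looking number.

```python
def trust_score(sub_scores: dict[str, float]) -> float:
    """Plain average of the 0-1 sub-scores (assumed aggregation)."""
    return sum(sub_scores.values()) / len(sub_scores)


# Sub-scores copied from the sample report above.
sample = {
    "hacking": 0.97,
    "monotonicity": 0.69,
    "stability": 1.00,
    "structure": 0.50,
    "alignment": 0.00,
}
print(f"{trust_score(sample):.2f}")  # 0.63

# Hypothetical: very gameable, but perfect on everything else.
gameable = {
    "hacking": 0.10,
    "monotonicity": 1.00,
    "stability": 1.00,
    "structure": 1.00,
    "alignment": 1.00,
}
print(f"{trust_score(gameable):.2f}")  # 0.82
```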

See a full generated report (the output of rrl demo): sample_report.md (renders right here on GitHub) or sample_report.html (styled — open in a browser).

Two short docs explain the output: Interpreting the report card (how to read it and decide if a reward is safe to train on) and Metrics & scores reference (the precise definition of every metric and score).

Using an LLM-as-a-judge for anything?

You don't need to be doing RL. If you grade anything with an LLM against a rubric (evals, autograding, quality gates), the same machinery tells you whether that judge can be gamed and how stable it is. Use audit(...) with your own Grader.

How it's validated

The tool has its own falsification test (tests/test_validation.py): it must flag a presence-only rubric graded by a keyword-matching grader as hackable, and clear a grader that resists the same probes as robust — reproducing the qualitative pattern of arXiv:2606.04923. If it can't separate the two, it doesn't ship.
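The idea behind that test can be sketched without the library. The toy below is not the contents of tests/test_validation.py: the keyword list, both graders, the stuffing probe, and the 0.1 threshold are all assumptions chosen to illustrate the "must separate the two" requirement.

```python
KEYWORDS = ("dosage", "side effects", "see a doctor")


def keyword_grader(answer: str) -> float:
    """Presence-only grading: credit for every keyword that appears anywhere."""
    text = answer.lower()
    return sum(k in text for k in KEYWORDS) / len(KEYWORDS)


def sentence_grader(answer: str, min_words: int = 5) -> float:
    """Toy resistant grader: a keyword counts only inside a full-length sentence."""
    sentences = [s.strip().lower() for s in answer.split(".") if s.strip()]
    long_sentences = [s for s in sentences if len(s.split()) >= min_words]
    return sum(any(k in s for s in long_sentences) for k in KEYWORDS) / len(KEYWORDS)


def stuff_keywords(answer: str) -> str:
    """Hack probe: append the rubric keywords as bare fragments."""
    return answer + " " + " ".join(k.capitalize() + "." for k in KEYWORDS)


def hack_gap(grader, answers) -> float:
    """Mean reward gained by applying the probe to each answer."""
    return sum(grader(stuff_keywords(a)) - grader(a) for a in answers) / len(answers)


def is_hackable(grader, answers, threshold: float = 0.1) -> bool:
    return hack_gap(grader, answers) > threshold


answers = [
    "Rest and drink fluids.",
    "Take the usual dosage printed on the label and rest.",
]

print(is_hackable(keyword_grader, answers))   # True
print(is_hackable(sentence_grader, answers))  # False
```

A check that returned the same answer for both graders would be uninformative, which is why the test requires a clear split.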

Limitations

v0.1 is deliberately a small, honest core. Know these before you rely on it:

  • The four hack probes are not exhaustive. They model the documented failure modes (presence, verbosity, confidence, format). Domain-specific gaming — sycophancy, prompt-injection, self-preference, fabricated citations — is not covered. Add your own probe (a function (response, rubric) -> response) for those.
  • Custom probes and diagnostics work — in code, not from the CLI. audit(..., probes=[...], extra_diagnostics=[...]) accepts your own hack probes (each a Probe wrapping a (response, rubric) -> response transform) and your own diagnostics (a callable (rubric, grader, responses) -> DiagnosticResult). The CLI (rrl audit) still uses the built-ins only — a YAML config can't hold a Python function. If you drive the tool from a shell and need a custom probe/diagnostic, write a short script that calls audit(...). (Named/registered probes for the CLI are on the roadmap.)
  • Diagnostics need a real sample. With very few responses, hacking / monotonicity / structure / alignment are statistical artifacts (any 2 points force a correlation of ±1). Aim for ~15–20+ responses; grader stability and criterion-order are the exceptions — they measure the judge, so they're meaningful at any n.
  • Auditing an LLM judge is not free. Each response costs ~20 grade calls (probes + degradation ladder + stability re-grades + criterion-order permutations), so total time ≈ 20 × number of responses × time per grade call. On a large local model this is minutes to hours; use a fast/cheap grader or a smaller sample.
  • The trust score is a rough heuristic, not a calibrated probability. It is a plain average of the sub-scores, so 0.75 does not mean "75% safe", and because hacking is only one term in that average, a gameable reward can still show a middling trust score. Do not rely on the composite alone. Check the Hackable verdict first: it disqualifies a reward on its own, no matter how high the trust score. Read the per-diagnostic table for the rest.
  • Early and evolving (v0.1). Not yet stable software — the API, report format, and verdict thresholds may change between versions, so pin the version if you depend on it.
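The custom-probe bullet above describes a probe as a transform `(response, rubric) -> response`. A minimal sketch of one such transform, aimed at sycophancy-style flattery, might look like the following. It assumes a response is a plain string and ignores the rubric argument; both are assumptions, since the actual response and rubric types are not specified here. Wrapping the function in a `Probe` and passing it through `audit(..., probes=[...])` is left out because the `Probe` constructor is not described in this README.

```python
# Hypothetical flattery text; tune it to the gaming pattern you worry about.
FLATTERY = "What an insightful question! You clearly know this topic well. "


def flattery_probe(response: str, rubric: object = None) -> str:
    """Prepend flattery while leaving the substantive answer unchanged.

    A grader whose reward rises on the transformed text is rewarding tone,
    not content. The rubric argument is accepted but unused in this sketch.
    """
    return FLATTERY + response


original = "Take ibuprofen with food and see a doctor if the pain persists."
probed = flattery_probe(original)
print(probed)
```

A useful probe changes only the surface feature under test, so any reward difference can be attributed to that feature.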

See ROADMAP.md for what's planned next.
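Two of the limitations above are simple arithmetic, and a short sketch can help when sizing an audit. The 3 seconds per grade call is an assumed figure, the `estimate_audit_seconds` helper is hypothetical, and the estimate assumes grade calls run one at a time.

```python
import numpy as np

CALLS_PER_RESPONSE = 20  # approximate figure from the limitations list


def estimate_audit_seconds(n_responses: int, seconds_per_call: float) -> float:
    """Rough wall-clock estimate for a sequential audit."""
    return CALLS_PER_RESPONSE * n_responses * seconds_per_call


# 20 responses at an assumed 3 s per grade call is 400 calls in total.
minutes = estimate_audit_seconds(20, 3.0) / 60
print(f"about {minutes:.0f} minutes")

# Why tiny samples mislead: two distinct (reward, human score) pairs always
# lie on a straight line, so their correlation is +1 or -1 regardless of
# how good the reward actually is.
two_point_r = np.corrcoef([0.2, 0.9], [3.0, 7.0])[0, 1]
print(round(float(two_point_r), 6))
```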

Note on the demo data

The bundled healthbench_demo dataset is synthetic and hand-authored — inspired by the shape of HealthBench (OpenAI, arXiv:2505.08775) (physician-style rubric + graded responses + human scores), but not real HealthBench data, to avoid any licensing question. Bring your own data to audit a real reward.

License

MIT
