A pre-flight auditor for rubric-based LLM rewards: detect gameable reward signals before you spend GPU on RL.

These details have not been verified by PyPI

Project links

Project description

rubric-reward-lens

A pre-flight auditor for rubric-based LLM rewards. Point it at your rubric, your grader, and a few example answers — it tells you, before you spend GPU on RL, whether the reward signal actually measures quality or can be gamed.

⚠️ Status: v0.1, early. The API may change.

The problem

To train an LLM on a "soft" task with no single right answer (long-form writing, medical Q&A, support replies), you need a score for every attempt. The popular recipe — Rubrics as Rewards (Gunjal et al., 2025, arXiv:2507.17746) — is to write a checklist (a rubric), have an LLM grade each answer against it, and use that grade as the reward.

That reward is fragile and fails silently. Recent work shows it gets hacked:

Reward Hacking in Rubric-Based RL (arXiv:2605.12474) — gains concentrate in presence/completeness criteria ("did it mention X") while factual correctness, conciseness, and relevance decline; stronger graders don't fix an under-specified rubric.
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based RL (arXiv:2606.04923) — characterizes how to detect it.

You usually discover the reward was gameable only after an expensive RL run produces a worse model that scores higher. rubric-reward-lens turns that detection into an off-the-shelf, pip install-able check you run first.

It audits a different object from the reward-model benchmarks (RewardBench 2, arXiv:2506.01937; RM-Bench, arXiv:2410.16184): those score learned reward models on preference accuracy. This scores a rubric + LLM-grader reward signal for gameability.

Install

pip3 install rubric-reward-lens

Requires Python ≥ 3.11. Inference-only: needs numpy, pyyaml, httpx. No GPU, no torch.

From a clone (for development, or to run the examples/ scripts):

git clone https://github.com/aishwaryawambule/rubric-reward-lens
cd rubric-reward-lens
pip3 install .          # add ".[dev]" for pytest

Use a regular pip3 install ., not the editable pip3 install -e . — on recent Python (3.14) the editable install can silently produce a non-importable package.

Quickstart

Two ways to use it — as a Python library (embed it in an eval pipeline / CI gate / RL setup) or from the CLI (quick one-off audits). Both are shown below; see Usage for the full guide, custom graders, and integration patterns.

from rubric_reward_lens import Rubric, OpenRouterGrader, audit, load_demo

# Bring your own, or try the bundled synthetic demo:
rubric, responses = load_demo()

grader = OpenRouterGrader(model="anthropic/claude-haiku-4.5")  # needs OPENROUTER_API_KEY
card = audit(rubric, grader, responses, human_labels=True)

print(card.verdict)            # e.g. "⚠️ Hackable — reward can be gamed for +0.34 ..."
card.to_html("report.html")    # full report card

Prefer a local model (no API key, no cost, no data leaves your machine)? Use a model served by Ollama:

from rubric_reward_lens import OllamaGrader, audit, load_demo

rubric, responses = load_demo()
grader = OllamaGrader(model="qwen2.5:14b")   # needs `ollama serve` + the model pulled
card = audit(rubric, grader, responses, human_labels=True)

No grader at all? The CLI runs the whole pipeline offline with a deterministic grader:

rrl demo --out report.html
rrl audit --rubric examples/rubric.yaml --grader examples/grader.fake.yaml \
          --responses responses.json --out report.html

Grader configs for the CLI: {type: fake}, {type: ollama, model: ...}, or {type: openrouter, model: ...} — see examples/.

What it checks (the diagnostics)

All diagnostics are label-free except the last; bootstrap confidence intervals are reported throughout (following Adding Error Bars to Evals, arXiv:2411.00640).

Diagnostic	Question it answers
Reward hacking	Do cheap "wins" — keyword-stuffing, verbosity padding, confident-wrong claims, format mimicry — earn reward they shouldn't?
Discrimination / monotonicity	When an answer is progressively degraded, does the reward actually fall?
Grader stability	Re-grading the same answer, how much does the reward wobble?
Criterion structure	Are criteria redundant, low-signal, or does the reward over-depend on one?
Criterion-order invariance	Does the reward change when the rubric's criteria are listed in a different order? (judge position bias, arXiv:2602.02219)
Human alignment (optional)	When you have human scores, how well does the reward agree (QWK / κ / calibration)?

The report card combines these into a composite trust score and a one-line verdict, over a per-diagnostic table where every score reads 0–1 (1 = good) with a plain-English "what it means" — raw metrics are kept in the JSON output:

⚠️ Caution — trust score 0.63; review the diagnostics before training.
- Composite trust score: 0.63  (0–1, higher is better)

## Diagnostics
| Diagnostic   | Score | What it means                                |
| hacking      | 0.97  | not gameable                                 |
| monotonicity | 0.69  | mostly tracks quality (7 inversions)         |
| stability    | 1.00  | identical on re-grade                        |
| structure    | 0.50  | 2 low-signal criteria: accurate, not_evasive |
| alignment    | 0.00  | does not match human scores                  |

See a full generated report (the output of rrl demo): sample_report.md (renders right here on GitHub) or sample_report.html (styled — open in a browser).

Two short docs explain the output: Interpreting the report card (how to read it and decide if a reward is safe to train on) and Metrics & scores reference (the precise definition of every metric and score).

Using an LLM-as-a-judge for anything?

You don't need to be doing RL. If you grade anything with an LLM against a rubric (evals, autograding, quality gates), the same machinery tells you whether that judge can be gamed and how stable it is. Use audit(...) with your own Grader.

How it's validated

The tool has its own falsification test (tests/test_validation.py): it must flag a presence-only rubric graded by a keyword-matching grader as hackable, and clear a grader that resists the same probes as robust — reproducing the qualitative pattern of arXiv:2606.04923. If it can't separate the two, it doesn't ship.

Limitations

v0.1 is deliberately a small, honest core. Know these before you rely on it:

The four hack probes are not exhaustive. They model the documented failure modes (presence, verbosity, confidence, format). Domain-specific gaming — sycophancy, prompt-injection, self-preference, fabricated citations — is not covered. Add your own probe (a function (response, rubric) -> response) for those.
Custom probes and diagnostics work — in code, not from the CLI. audit(..., probes=[...], extra_diagnostics=[...]) accepts your own hack probes (each a Probe wrapping a (response, rubric) -> response transform) and your own diagnostics (a callable (rubric, grader, responses) -> DiagnosticResult). The CLI (rrl audit) still uses the built-ins only — a YAML config can't hold a Python function. If you drive the tool from a shell and need a custom probe/diagnostic, write a short script that calls audit(...). (Named/registered probes for the CLI are on the roadmap.)
Diagnostics need a real sample. With very few responses, hacking / monotonicity / structure / alignment are statistical artifacts (any 2 points force a correlation of ±1). Aim for ~15–20+ responses; grader stability and criterion-order are the exceptions — they measure the judge, so they're meaningful at any n.
Auditing an LLM judge is not free. Each response costs ~20 grade calls (probes + degradation ladder + stability re-grades + criterion-order permutations), so cost ≈ 20 × responses × grader-speed. On a large local model this is minutes-to-hours; use a fast/cheap grader or a smaller sample.
The trust score is a rough heuristic, not a calibrated probability. It's a plain average of the sub-scores — 0.75 doesn't mean "75% safe" — and because hacking is only one term in that average, a gameable reward can still show a middling trust score. So don't rely on the composite alone. The Hackable verdict is the single most important thing to check — always heed it: it disqualifies a reward on its own, no matter how high the trust score. Read the per-diagnostic table for the rest.
Early and evolving (v0.1). Not yet stable software — the API, report format, and verdict thresholds may change between versions, so pin the version if you depend on it.

See ROADMAP.md for what's planned next.

Note on the demo data

The bundled healthbench_demo dataset is synthetic and hand-authored — inspired by the shape of HealthBench (OpenAI, arXiv:2505.08775) (physician-style rubric + graded responses + human scores), but not real HealthBench data, to avoid any licensing question. Bring your own data to audit a real reward.

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.2

Jul 1, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rubric_reward_lens-0.1.2.tar.gz (62.0 kB view details)

Uploaded Jul 1, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rubric_reward_lens-0.1.2-py3-none-any.whl (34.1 kB view details)

Uploaded Jul 1, 2026 Python 3

File details

Details for the file rubric_reward_lens-0.1.2.tar.gz.

File metadata

Download URL: rubric_reward_lens-0.1.2.tar.gz
Upload date: Jul 1, 2026
Size: 62.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for rubric_reward_lens-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`876df46d1b3425bcb7941c2c87aa89a80d3e7ae2425527d9f10c1778e8f1b135`
MD5	`8e92f64ff6024a0153755b45d2dc2e7e`
BLAKE2b-256	`772e1121e75ef993101af365c6a756d1a44289e483f296b7acd3d8ecbe7a7dde`

See more details on using hashes here.

File details

Details for the file rubric_reward_lens-0.1.2-py3-none-any.whl.

File metadata

Download URL: rubric_reward_lens-0.1.2-py3-none-any.whl
Upload date: Jul 1, 2026
Size: 34.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for rubric_reward_lens-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a62ae258bc44dc4ddd4e0a5d0fdc17c4b90b21dbcc7f2b9e8d414371ea2b556c`
MD5	`bc6dec0387036c017a57daeaca326ea3`
BLAKE2b-256	`5609df84c4b5d02f790cf1ab9f669e9f4dbc71757cd57e0145aebadd55806952`

See more details on using hashes here.

rubric-reward-lens 0.1.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

rubric-reward-lens

The problem

Install

Quickstart

What it checks (the diagnostics)

Using an LLM-as-a-judge for anything?

How it's validated

Limitations

Note on the demo data

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes