Pre-training stress-testing for reward functions. Find bugs in minutes on CPU instead of days into a $10K training run.
Project description
rewardprobe
Know what your model will learn — before you train.
You write a reward function. You're about to spend $10K on a GRPO training run. rewardprobe tells you what the model will actually learn to do:
rewardprobe simulate — production_math_rlvr
50 completions across 5 tasks
2 critical found
1. critical
'Shortcut' strategy scores 0.71
A model using the shortcut strategy earns 103% of what a correct
answer earns. It will learn to skip computation and take shortcuts
because that's easier AND scores higher.
2. critical
'Lazy correct' strategy scores only 0.07
A correct answer without formatting scores near zero. Your reward
function punishes correct-but-unformatted answers more than it
punishes wrong-but-formatted ones.
Strategy scoreboard:
perfect ████████████████████ 1.00
correct_verbose ████████████████████ 1.00
shortcut ██████████████░░░░░░ 0.71 ← problem
near_miss █████░░░░░░░░░░░░░░░ 0.29
format_only █████░░░░░░░░░░░░░░░ 0.29
garbage ███░░░░░░░░░░░░░░░░░ 0.18
correct_lazy █░░░░░░░░░░░░░░░░░░░ 0.07 ← problem
The strategy scoreboard shows exactly how your reward function scores different model behaviors. If a lazy or wrong strategy scores close to a correct one, the model will learn the lazy path. You see this in 30 seconds instead of discovering it 3 days into training.
The Problem
You write a reward function for RL training. It looks correct. You start training. Days later, the model is gaming the reward — outputting shortcuts, copying format without thinking, or guessing. OpenAI documented this happening with exit(0) and raise SkipTest. METR found frontier models monkey-patching their own graders.
The fix is to test reward functions before training, the same way you test code before deploying.
Install
pip install rewardprobe
Three Modes
1. Quick Check (free, instant, no API key)
30 deterministic probes. Catches parser bugs, edge cases, format tricks. Runs in under a second on CPU.
rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl
rewardprobe — my_reward
1 critical, 2 warning found
1. critical
Correct answer in reasoning section scores 1.0 even when the
answer field contains a wrong answer.
2. warning
Different scores depending on answer tag order.
28/30 checks passed.
2. Deep Analysis (needs API key)
Claude reads your source code, understands what each function does, and generates realistic adversarial completions. Finds bugs that static probes can't.
export ANTHROPIC_API_KEY=sk-...
rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --deep
This adds:
- Code analysis — Claude identifies logic bugs by reading your Python code
- Adversarial completions — wrong-but-plausible model outputs tested against your function
- False positive filtering — classifies each function (correctness/format/auxiliary) so findings are precise
3. Simulate (needs API key)
The flagship feature. Generates diverse completions spanning the full range of what a model might produce during training — from perfect solutions to garbage — and maps the reward landscape.
rewardprobe simulate my_reward.py::my_fn --dataset tasks.jsonl
The strategy scoreboard shows you at a glance:
- Green strategies (perfect, correct_lazy, correct_verbose) — what you WANT the model to learn
- Red strategies (shortcut, format_only, hedge, garbage) — what you DON'T want
If a red strategy scores close to or higher than a green one, your reward function has a problem.
What We Found
We ran rewardprobe against reward functions from 4 major RL codebases plus 3 non-math domains. Results:
| Codebase | Domain | Key Finding |
|---|---|---|
| verifiers/gsm8k (Prime Intellect) | Math | Model can skip reasoning — correct_lazy scores 1.0 |
| Open-R1 (HuggingFace) | Math | first_match mode lets models hedge with multiple answers |
| verl (ByteDance) | Math | format_score parameter can reward wrong answers |
| willccbb GRPO gist | Math | Returns 2.0 (outside [0,1]); rejects "42.0" for "42" |
| Custom code reward | Code | Off-by-one bugs score 0.83 — substring matching misses logic errors |
| Sentiment classifier | Text | Reasoned answers score 0.0, bare labels score 1.0 |
Works With Any Framework
Auto-detects your reward function's signature. No configuration.
# Any of these just work:
def my_reward(completion, answer): ... # Raw Python
def accuracy_reward(completions, solution, **kwargs): ... # TRL / GRPO
def correctness(prompts, completions, answer, **kwargs): ... # TRL with prompts
async def correct_answer(completion, answer): ... # verifiers
def compute_score(solution_str, ground_truth): ... # ByteDance verl
rewardprobe test file.py::fn --dataset tasks.jsonl # Just works
rewardprobe test environments/gsm8k.py # verifiers environments too
GitHub Action
- run: pip install rewardprobe
- run: rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --ci
Exit code 1 on critical findings. Add --deep with ANTHROPIC_API_KEY secret for AI analysis in CI.
Python API
from rewardprobe import Probe
# Quick check
report = Probe().test_fn(my_reward, tasks)
print(report.passed) # True / False
# Deep analysis
report = Probe(deep=True).test_fn(my_reward, tasks)
# Simulate
from rewardprobe.simulator import simulate, print_simulation
from rewardprobe.tier2.client import get_client
from rewardprobe.adapters.auto import auto_adapt
env = auto_adapt(my_reward, tasks)
result = simulate(env, get_client("sonnet"), n_tasks=5)
print_simulation(result)
How It Works
Quick Check generates adversarial inputs (empty strings, format tricks, parser exploits, wrong-but-formatted answers) and tests your reward function against them. 30 probes across 6 families, all deterministic, all on CPU.
Deep Analysis uses Claude to read your reward function's Python source code. It understands what the function checks, identifies logic bugs, and generates realistic wrong completions that a model might produce during training. Each completion is actually run against your function — only real exploits are reported.
Simulate uses Claude to generate 10 diverse completions per task, each representing a different strategy a model might learn (perfect, lazy, shortcut, hedging, garbage, etc). Scores them all against your reward function. The strategy scoreboard shows which behaviors your reward function actually incentivizes.
What rewardprobe Is NOT
- Not a training monitor. We run before training starts.
- Not a formal prover. We find bugs empirically with concrete inputs.
- Not a guarantee. A clean report means "we tested these patterns and found nothing." The nastiest reward hacks are novel and environment-specific.
Contributing
See CLAUDE.md for architecture, how to add attacks, and how the simulator works.
git clone https://github.com/rewardprobe/rewardprobe && cd rewardprobe
uv sync --extra dev && pytest tests/
Apache 2.0
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file rewardprobe-0.1.0.tar.gz.
File metadata
- Download URL: rewardprobe-0.1.0.tar.gz
- Upload date:
- Size: 60.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
698ec92a8498437a1a9b4c00fd9fa08701e86b63c7df79e32d07fd65500b71ff
|
|
| MD5 |
cf32b22f6cd2a7e4bc15c841be44e840
|
|
| BLAKE2b-256 |
b6245eb545ba1dfd7a379f00cd8d16d3a97910a5da4c5faf30db49d13ffe46df
|
File details
Details for the file rewardprobe-0.1.0-py3-none-any.whl.
File metadata
- Download URL: rewardprobe-0.1.0-py3-none-any.whl
- Upload date:
- Size: 61.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3b776c3ab69d30c9f991456b0766d41bf7a648d4f81e54122616eb261312b889
|
|
| MD5 |
39d78d8e4cfeebbb43826d97aa15cdd4
|
|
| BLAKE2b-256 |
21ed42aaebc63876095b713f16e9142493dca0d9f0791975383eaa9020fb26ed
|