
Pre-training stress-testing for reward functions. Find bugs in minutes on CPU instead of days into a $10K training run.


rewardprobe

Know what your model will learn — before you train.



You write a reward function. You're about to spend $10K on a GRPO training run. rewardprobe tells you what the model will actually learn to do:

rewardprobe simulate — production_math_rlvr
  50 completions across 5 tasks

  2 critical found

  1.  critical
     'Shortcut' strategy scores 0.71
     A model using the shortcut strategy earns 103% of what a correct
     answer earns. It will learn to skip computation and take shortcuts
     because that's easier AND scores higher.

  2.  critical
     'Lazy correct' strategy scores only 0.07
     A correct answer without formatting scores near zero. Your reward
     function punishes correct-but-unformatted answers more than it
     punishes wrong-but-formatted ones.

  Strategy scoreboard:
    perfect              ████████████████████ 1.00
    correct_verbose      ████████████████████ 1.00
    shortcut             ██████████████░░░░░░ 0.71  ← problem
    near_miss            █████░░░░░░░░░░░░░░░ 0.29
    format_only          █████░░░░░░░░░░░░░░░ 0.29
    garbage              ███░░░░░░░░░░░░░░░░░ 0.18
    correct_lazy         █░░░░░░░░░░░░░░░░░░░ 0.07  ← problem

The strategy scoreboard shows exactly how your reward function scores different model behaviors. If a lazy or wrong strategy scores close to a correct one, the model will learn the lazy path. You see this in 30 seconds instead of discovering it 3 days into training.


The Problem

You write a reward function for RL training. It looks correct. You start training. Days later, the model is gaming the reward — outputting shortcuts, copying format without thinking, or guessing. OpenAI documented this happening with exit(0) and raise SkipTest. METR found frontier models monkey-patching their own graders.

The fix is to test reward functions before training, the same way you test code before deploying.
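As an illustration of how easy these bugs are to write, here is a hypothetical reward function (not from any real codebase) with a classic gameable flaw:

```python
# Hypothetical reward function with a gameable bug (illustrative only).
def reward(completion: str, answer: str) -> float:
    # BUG: substring matching — any completion that merely *mentions*
    # the answer scores full reward, even if its final answer is wrong.
    return 1.0 if answer in completion else 0.0

# A model can hedge by listing many candidates:
print(reward("The answer could be 17, 19, or 23.", "19"))   # 1.0 — hedging pays
print(reward("After careful work, the answer is 19.", "19"))  # 1.0 — same score
```

The function looks reasonable in review, yet it pays the hedging strategy exactly as much as honest work — precisely the kind of gap that only shows up when you score adversarial completions against it.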


Install

pip install rewardprobe

Three Modes

1. Quick Check (free, instant, no API key)

30 deterministic probes. Catches parser bugs, edge cases, format tricks. Runs in under a second on CPU.

rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl
rewardprobe — my_reward

  1 critical, 2 warning found

  1.  critical
     Correct answer in reasoning section scores 1.0 even when the
     answer field contains a wrong answer.

  2.  warning
     Different scores depending on answer tag order.

  28/30 checks passed.

2. Deep Analysis (needs API key)

Claude reads your source code, understands what each function does, and generates realistic adversarial completions. Finds bugs that static probes can't.

export ANTHROPIC_API_KEY=sk-...
rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --deep

This adds:

  • Code analysis — Claude identifies logic bugs by reading your Python code
  • Adversarial completions — wrong-but-plausible model outputs tested against your function
  • False positive filtering — classifies each function (correctness/format/auxiliary) so findings are precise

3. Simulate (needs API key)

The flagship feature. Generates diverse completions spanning the full range of what a model might produce during training — from perfect solutions to garbage — and maps the reward landscape.

rewardprobe simulate my_reward.py::my_fn --dataset tasks.jsonl

The strategy scoreboard shows you at a glance:

  • Green strategies (perfect, correct_lazy, correct_verbose) — what you WANT the model to learn
  • Red strategies (shortcut, format_only, hedge, garbage) — what you DON'T want

If a red strategy scores close to or higher than a green one, your reward function has a problem.


What We Found

We ran rewardprobe against reward functions from four major RL codebases, plus custom reward functions in non-math domains. Results:

  Codebase                            Domain   Key finding
  verifiers/gsm8k (Prime Intellect)   Math     Model can skip reasoning — correct_lazy scores 1.0
  Open-R1 (HuggingFace)               Math     first_match mode lets models hedge with multiple answers
  verl (ByteDance)                    Math     format_score parameter can reward wrong answers
  willccbb GRPO gist                  Math     Returns 2.0 (outside [0,1]); rejects "42.0" for "42"
  Custom code reward                  Code     Off-by-one bugs score 0.83 — substring matching misses logic errors
  Sentiment classifier                Text     Reasoned answers score 0.0, bare labels score 1.0
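The last two findings in the math rows are easy to reproduce with a toy function. This is a sketch of the pattern, not the actual code from that gist:

```python
def reward(completion: str, answer: str) -> float:
    # Toy reconstruction of two common failure patterns:
    # 1) the return value 2.0 falls outside the expected [0, 1] range,
    #    which silently rescales advantages during GRPO training;
    # 2) exact string comparison rejects numerically equal answers.
    extracted = completion.strip().split()[-1]
    return 2.0 if extracted == answer else 0.0

print(reward("The answer is 42", "42"))    # 2.0 — outside [0, 1]
print(reward("The answer is 42.0", "42"))  # 0.0 — "42.0" != "42"
```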

Works With Any Framework

Auto-detects your reward function's signature. No configuration.

# Any of these signatures just work:
def my_reward(completion, answer): ...                        # Raw Python
def accuracy_reward(completions, solution, **kwargs): ...     # TRL / GRPO
def correctness(prompts, completions, answer, **kwargs): ...  # TRL with prompts
async def correct_answer(completion, answer): ...             # verifiers
def compute_score(solution_str, ground_truth): ...            # ByteDance verl

rewardprobe test file.py::fn --dataset tasks.jsonl    # just works
rewardprobe test environments/gsm8k.py                # verifiers environments too
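The commands above all take --dataset tasks.jsonl. The exact schema is not documented here, so treat this as a hypothetical illustration — field names like prompt and answer are assumptions, not rewardprobe's confirmed format:

```
{"prompt": "What is 3 * 4?", "answer": "12"}
{"prompt": "What is 17 + 5?", "answer": "22"}
```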

GitHub Action

- run: pip install rewardprobe
- run: rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --ci

Exit code 1 on critical findings. Add --deep with ANTHROPIC_API_KEY secret for AI analysis in CI.
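The two steps slot into an ordinary workflow. A minimal sketch — the workflow name, trigger, and Python version are placeholders, not rewardprobe requirements:

```yaml
name: reward-check
on: [pull_request]
jobs:
  probe:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install rewardprobe
      # --ci makes the command exit nonzero on critical findings,
      # which fails the job and blocks the merge.
      - run: rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --ci
```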


Python API

from rewardprobe import Probe

# Quick check
report = Probe().test_fn(my_reward, tasks)
print(report.passed)  # True / False

# Deep analysis
report = Probe(deep=True).test_fn(my_reward, tasks)

# Simulate
from rewardprobe.simulator import simulate, print_simulation
from rewardprobe.tier2.client import get_client
from rewardprobe.adapters.auto import auto_adapt

env = auto_adapt(my_reward, tasks)
result = simulate(env, get_client("sonnet"), n_tasks=5)
print_simulation(result)

How It Works

Quick Check generates adversarial inputs (empty strings, format tricks, parser exploits, wrong-but-formatted answers) and tests your reward function against them. 30 probes across 6 families, all deterministic, all on CPU.
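The idea behind deterministic probing can be sketched in a few lines. This is a toy harness and a toy reward under test, not rewardprobe's internals:

```python
def reward(completion: str, answer: str) -> float:
    # Toy reward under test: extract text after the last "Answer:" tag.
    if "Answer:" not in completion:
        return 0.0
    tail = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if tail == answer else 0.0

# Deterministic adversarial probes, in the spirit of Quick Check
# (illustrative families, not rewardprobe's actual probe set):
probes = {
    "empty": "",
    "answer_only": "Answer: 7",             # no reasoning at all still scores 1.0
    "double_tag": "Answer: 3\nAnswer: 7",   # reveals that the last tag wins
    "wrong_but_formatted": "Answer: 3",
}
for name, completion in probes.items():
    print(f"{name:20s} -> {reward(completion, '7')}")
```

Running fixed inputs like these through the function is cheap and deterministic, which is why this mode needs no API key and finishes in under a second.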

Deep Analysis uses Claude to read your reward function's Python source code. It understands what the function checks, identifies logic bugs, and generates realistic wrong completions that a model might produce during training. Each completion is actually run against your function — only real exploits are reported.

Simulate uses Claude to generate 10 diverse completions per task, each representing a different strategy a model might learn (perfect, lazy, shortcut, hedging, garbage, etc). Scores them all against your reward function. The strategy scoreboard shows which behaviors your reward function actually incentivizes.
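A stripped-down scoreboard works like this — score one completion per strategy and sort. The reward function and completions below are hypothetical, and the toy reward deliberately has the shortcut problem the scoreboard is designed to surface:

```python
def reward(completion: str, answer: str) -> float:
    # Toy reward: 0.75 for ending with the right answer, 0.25 for
    # showing any reasoning ("because"). Trivially gameable.
    has_answer = completion.strip().endswith(answer)
    has_steps = "because" in completion
    return 0.75 * has_answer + 0.25 * has_steps

# One hypothetical completion per strategy:
strategies = {
    "perfect": "It is 12 because 3 * 4 = 12. 12",
    "correct_lazy": "12",
    "shortcut": "Trust me, because it just is. 12",
    "garbage": "lorem ipsum",
}
for name, c in sorted(strategies.items(), key=lambda kv: -reward(kv[1], "12")):
    score = reward(c, "12")
    print(f"{name:14s} {'█' * int(score * 20):20s} {score:.2f}")
```

Here the shortcut strategy ties with the perfect one at 1.00 — the signal that the model will learn to assert answers rather than compute them.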


What rewardprobe Is NOT

  • Not a training monitor. We run before training starts.
  • Not a formal prover. We find bugs empirically with concrete inputs.
  • Not a guarantee. A clean report means "we tested these patterns and found nothing." The nastiest reward hacks are novel and environment-specific.

Contributing

See CLAUDE.md for architecture, how to add attacks, and how the simulator works.

git clone https://github.com/rewardprobe/rewardprobe && cd rewardprobe
uv sync --extra dev && pytest tests/

Apache 2.0
