
Pre-training stress-testing for reward functions. Find bugs in minutes on CPU instead of days into a $10K training run.


rewardprobe

Know what your model will learn — before you train.



You write a reward function. You're about to spend $10K on a GRPO training run. rewardprobe tells you what the model will actually learn to do:

rewardprobe simulate — production_math_rlvr
  50 completions across 5 tasks

  2 critical found

  1.  critical
     'Shortcut' strategy scores 0.71
     A model using the shortcut strategy earns 103% of what a correct
     answer earns. It will learn to skip computation and take shortcuts
     because that's easier AND scores higher.

  2.  critical
     'Lazy correct' strategy scores only 0.07
     A correct answer without formatting scores near zero. Your reward
     function punishes correct-but-unformatted answers more than it
     punishes wrong-but-formatted ones.

  Strategy scoreboard:
    perfect              ████████████████████ 1.00
    correct_verbose      ████████████████████ 1.00
    shortcut             ██████████████░░░░░░ 0.71  ← problem
    near_miss            █████░░░░░░░░░░░░░░░ 0.29
    format_only          █████░░░░░░░░░░░░░░░ 0.29
    garbage              ███░░░░░░░░░░░░░░░░░ 0.18
    correct_lazy         █░░░░░░░░░░░░░░░░░░░ 0.07  ← problem

The strategy scoreboard shows exactly how your reward function scores different model behaviors. If a lazy or wrong strategy scores close to a correct one, the model will learn the lazy path. You see this in 30 seconds instead of discovering it 3 days into training.


The Problem

You write a reward function for RL training. It looks correct. You start training. Days later, the model is gaming the reward — outputting shortcuts, copying format without thinking, or guessing. OpenAI documented this happening with exit(0) and raise SkipTest. METR found frontier models monkey-patching their own graders.

The fix is to test reward functions before training, the same way you test code before deploying.
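As an illustration of how easy these bugs are to write, here is a hypothetical reward function (not from any real codebase) with a classic gameable flaw:

```python
# Hypothetical reward function with a gameable bug (illustrative only).
def reward(completion: str, answer: str) -> float:
    # BUG: substring matching — any completion that merely *mentions*
    # the answer scores full reward, even if its final answer is wrong.
    return 1.0 if answer in completion else 0.0

# A model can hedge by listing many candidates:
print(reward("The answer could be 17, 19, or 23.", "19"))   # 1.0 — hedging pays
print(reward("After careful work, the answer is 19.", "19"))  # 1.0 — same score
```

The function looks reasonable in review, yet it pays the hedging strategy exactly as much as honest work — precisely the kind of gap that only shows up when you score adversarial completions against it.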


Install

pip install rewardprobe

Three Modes

1. Quick Check (free, instant, no API key)

30 deterministic probes. Catches parser bugs, edge cases, format tricks. Runs in under a second on CPU.

rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl
rewardprobe — my_reward

  1 critical, 2 warning found

  1.  critical
     Correct answer in reasoning section scores 1.0 even when the
     answer field contains a wrong answer.

  2.  warning
     Different scores depending on answer tag order.

  28/30 checks passed.

2. Deep Analysis (needs API key)

Claude reads your source code, understands what each function does, and generates realistic adversarial completions. Finds bugs that static probes can't.

export ANTHROPIC_API_KEY=sk-...
rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --deep

This adds:

  • Code analysis — Claude identifies logic bugs by reading your Python code
  • Adversarial completions — wrong-but-plausible model outputs tested against your function
  • False positive filtering — classifies each function (correctness/format/auxiliary) so findings are precise

3. Simulate (needs API key)

The flagship feature. Generates diverse completions spanning the full range of what a model might produce during training — from perfect solutions to garbage — and maps the reward landscape.

rewardprobe simulate my_reward.py::my_fn --dataset tasks.jsonl

The strategy scoreboard shows you at a glance:

  • Green strategies (perfect, correct_lazy, correct_verbose) — what you WANT the model to learn
  • Red strategies (shortcut, format_only, hedge, garbage) — what you DON'T want

If a red strategy scores close to or higher than a green one, your reward function has a problem.


What We Found

We ran rewardprobe against reward functions from four major RL codebases, plus custom reward functions in non-math domains. Results:

  Codebase                            Domain   Key finding
  verifiers/gsm8k (Prime Intellect)   Math     Model can skip reasoning — correct_lazy scores 1.0
  Open-R1 (HuggingFace)               Math     first_match mode lets models hedge with multiple answers
  verl (ByteDance)                    Math     format_score parameter can reward wrong answers
  willccbb GRPO gist                  Math     Returns 2.0 (outside [0,1]); rejects "42.0" for "42"
  Custom code reward                  Code     Off-by-one bugs score 0.83 — substring matching misses logic errors
  Sentiment classifier                Text     Reasoned answers score 0.0, bare labels score 1.0
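The last two findings in the math rows are easy to reproduce with a toy function. This is a sketch of the pattern, not the actual code from that gist:

```python
def reward(completion: str, answer: str) -> float:
    # Toy reconstruction of two common failure patterns:
    # 1) the return value 2.0 falls outside the expected [0, 1] range,
    #    which silently rescales advantages during GRPO training;
    # 2) exact string comparison rejects numerically equal answers.
    extracted = completion.strip().split()[-1]
    return 2.0 if extracted == answer else 0.0

print(reward("The answer is 42", "42"))    # 2.0 — outside [0, 1]
print(reward("The answer is 42.0", "42"))  # 0.0 — "42.0" != "42"
```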

Works With Any Framework

Auto-detects your reward function's signature. No configuration.

# Any of these signatures just work:
def my_reward(completion, answer): ...                        # Raw Python
def accuracy_reward(completions, solution, **kwargs): ...     # TRL / GRPO
def correctness(prompts, completions, answer, **kwargs): ...  # TRL with prompts
async def correct_answer(completion, answer): ...             # verifiers
def compute_score(solution_str, ground_truth): ...            # ByteDance verl

rewardprobe test file.py::fn --dataset tasks.jsonl    # just works
rewardprobe test environments/gsm8k.py                # verifiers environments too
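The commands above all take --dataset tasks.jsonl. The exact schema is not documented here, so treat this as a hypothetical illustration — field names like prompt and answer are assumptions, not rewardprobe's confirmed format:

```
{"prompt": "What is 3 * 4?", "answer": "12"}
{"prompt": "What is 17 + 5?", "answer": "22"}
```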

GitHub Action

- run: pip install rewardprobe
- run: rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --ci

Exit code 1 on critical findings. Add --deep with ANTHROPIC_API_KEY secret for AI analysis in CI.
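The two steps slot into an ordinary workflow. A minimal sketch — the workflow name, trigger, and Python version are placeholders, not rewardprobe requirements:

```yaml
name: reward-check
on: [pull_request]
jobs:
  probe:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install rewardprobe
      # --ci makes the command exit nonzero on critical findings,
      # which fails the job and blocks the merge.
      - run: rewardprobe test my_reward.py::my_fn --dataset tasks.jsonl --ci
```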


Python API

from rewardprobe import Probe

# Quick check
report = Probe().test_fn(my_reward, tasks)
print(report.passed)  # True / False

# Deep analysis
report = Probe(deep=True).test_fn(my_reward, tasks)

# Simulate
from rewardprobe.simulator import simulate, print_simulation
from rewardprobe.tier2.client import get_client
from rewardprobe.adapters.auto import auto_adapt

env = auto_adapt(my_reward, tasks)
result = simulate(env, get_client("sonnet"), n_tasks=5)
print_simulation(result)

How It Works

Quick Check generates adversarial inputs (empty strings, format tricks, parser exploits, wrong-but-formatted answers) and tests your reward function against them. 30 probes across 6 families, all deterministic, all on CPU.
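The idea behind deterministic probing can be sketched in a few lines. This is a toy harness and a toy reward under test, not rewardprobe's internals:

```python
def reward(completion: str, answer: str) -> float:
    # Toy reward under test: extract text after the last "Answer:" tag.
    if "Answer:" not in completion:
        return 0.0
    tail = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if tail == answer else 0.0

# Deterministic adversarial probes, in the spirit of Quick Check
# (illustrative families, not rewardprobe's actual probe set):
probes = {
    "empty": "",
    "answer_only": "Answer: 7",             # no reasoning at all still scores 1.0
    "double_tag": "Answer: 3\nAnswer: 7",   # reveals that the last tag wins
    "wrong_but_formatted": "Answer: 3",
}
for name, completion in probes.items():
    print(f"{name:20s} -> {reward(completion, '7')}")
```

Running fixed inputs like these through the function is cheap and deterministic, which is why this mode needs no API key and finishes in under a second.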

Deep Analysis uses Claude to read your reward function's Python source code. It understands what the function checks, identifies logic bugs, and generates realistic wrong completions that a model might produce during training. Each completion is actually run against your function — only real exploits are reported.

Simulate uses Claude to generate 10 diverse completions per task, each representing a different strategy a model might learn (perfect, lazy, shortcut, hedging, garbage, etc). Scores them all against your reward function. The strategy scoreboard shows which behaviors your reward function actually incentivizes.
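A stripped-down scoreboard works like this — score one completion per strategy and sort. The reward function and completions below are hypothetical, and the toy reward deliberately has the shortcut problem the scoreboard is designed to surface:

```python
def reward(completion: str, answer: str) -> float:
    # Toy reward: 0.75 for ending with the right answer, 0.25 for
    # showing any reasoning ("because"). Trivially gameable.
    has_answer = completion.strip().endswith(answer)
    has_steps = "because" in completion
    return 0.75 * has_answer + 0.25 * has_steps

# One hypothetical completion per strategy:
strategies = {
    "perfect": "It is 12 because 3 * 4 = 12. 12",
    "correct_lazy": "12",
    "shortcut": "Trust me, because it just is. 12",
    "garbage": "lorem ipsum",
}
for name, c in sorted(strategies.items(), key=lambda kv: -reward(kv[1], "12")):
    score = reward(c, "12")
    print(f"{name:14s} {'█' * int(score * 20):20s} {score:.2f}")
```

Here the shortcut strategy ties with the perfect one at 1.00 — the signal that the model will learn to assert answers rather than compute them.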


What rewardprobe Is NOT

  • Not a training monitor. We run before training starts.
  • Not a formal prover. We find bugs empirically with concrete inputs.
  • Not a guarantee. A clean report means "we tested these patterns and found nothing." The nastiest reward hacks are novel and environment-specific.

Contributing

See CLAUDE.md for architecture, how to add attacks, and how the simulator works.

git clone https://github.com/rewardprobe/rewardprobe && cd rewardprobe
uv sync --extra dev && pytest tests/

Apache 2.0
