Catch reward traps before training. Named after Goodhart's Law.

Maintainers

audieleon

goodhart

Paper: Catching Goodhart's Law Before Training: Static Reward Analysis with Formal Guarantees (Sheridan, 2026)

"When a measure becomes a target, it ceases to be a good measure." -- Charles Goodhart (1975), generalized by Marilyn Strathern (1997)

Catch reward traps before training. Goodhart runs 40 composable analysis rules on your RL reward configuration and reports degenerate equilibria, perverse incentives, and exploitable reward structures before you spend compute. Of these, 17 rules are backed by machine-verified Lean 4 proofs (zero sorry, i.e. no unproven placeholders), including formalizations of Ng 1999 and Skalse 2022.

Installation

pip install goodhart

# Or install from source
pip install git+https://github.com/audieleon/goodhart.git

# Optional: visualization and Gymnasium auto-detection
pip install goodhart[all]

Quick Start

# Check a sparse reward config
goodhart --goal 1.0 --penalty -0.01 --steps 500
# -> CRITICAL: death beats survival by 9.6x

# Try a preset from a published paper
goodhart --preset coast-runners
# -> CRITICAL: loop EV (+800) beats goal (+100)

# List all available presets
goodhart --preset

# Interactive mode (asks questions)
goodhart
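
The first warning above comes down to simple arithmetic. The sketch below is not goodhart's implementation and does not try to reproduce its 9.6x figure. It only illustrates, assuming undiscounted returns, why a per-step penalty can make a short episode score higher than a long one. The function names and the `p_goal` parameter are assumptions introduced for illustration.

```python
def survive_return(goal: float, penalty: float, steps: int, p_goal: float) -> float:
    """Expected undiscounted return of playing the full episode.

    Assumes the agent pays `penalty` on every step and reaches the goal
    with probability `p_goal` (a hypothetical parameter).
    """
    return p_goal * goal + penalty * steps


def early_exit_return(penalty: float, steps_alive: int) -> float:
    """Undiscounted return of an episode that terminates after `steps_alive` steps."""
    return penalty * steps_alive


if __name__ == "__main__":
    # goal, penalty, and steps match the first Quick Start command;
    # p_goal and steps_alive are made-up values.
    full = survive_return(goal=1.0, penalty=-0.01, steps=500, p_goal=0.05)
    short = early_exit_return(penalty=-0.01, steps_alive=10)
    print(f"full episode: {full:+.2f}, early exit: {short:+.2f}")
    print("early exit preferred" if short > full else "survival preferred")
```

With a rare goal and a long horizon, the accumulated penalty dominates, so ending the episode early is the better strategy for the agent.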

Usage

CLI

# Quick check with training params
goodhart --goal 1.0 --penalty -0.001 --steps 400 --gamma 0.999 \
  --actors 64 --budget 10000000 --lr 1e-4 --specialists 3 --floor 0.10

# From a config file (YAML, JSON, or TOML)
goodhart --config my_experiment.yaml

# From an annotated Python reward function
goodhart --check my_env.py:compute_reward

# With educational explanations
goodhart --preset humanoid --verbose

# Deep-dive on a specific rule
goodhart --explain idle_exploit

# Diagnose and suggest fixes
goodhart --doctor --goal 1.0 --penalty -0.01 --steps 500

# CI integration (exit code 1 on critical issues)
goodhart --quiet --exit-on-critical --config experiment.yaml
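
If your CI pipeline is driven from Python rather than a shell step, the same gate can be wrapped in a small helper. This is a minimal sketch that assumes the `goodhart` executable is on PATH and relies only on the flags and exit-code behavior described above. The helper names `build_command` and `run_gate` are hypothetical.

```python
import subprocess
from typing import List


def build_command(config_path: str) -> List[str]:
    """Assemble the CI invocation shown in the CLI section."""
    return ["goodhart", "--quiet", "--exit-on-critical", "--config", config_path]


def run_gate(config_path: str) -> int:
    """Run goodhart and return its exit code (assumes goodhart is on PATH)."""
    completed = subprocess.run(build_command(config_path), check=False)
    return completed.returncode


# Usage in a CI script: propagate the exit code so the job fails on criticals.
#   import sys
#   sys.exit(run_gate("experiment.yaml"))
```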

Python API

# Quick check (prints report, returns bool)
from goodhart import check
passed = check(goal=1.0, penalty=-0.01, max_steps=500)  # False if criticals

# Programmatic analysis (no printing, returns typed Result)
from goodhart import analyze
result = analyze(goal=1.0, penalty=-0.01, max_steps=500, gamma=0.999)
print(result.passed)       # True/False
print(result.criticals)    # list of Verdict objects
print(result.to_dict())    # JSON-serializable dict
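
One way to use the programmatic result is to persist it as a JSON artifact next to your training logs. The sketch below assumes only the three members shown above (`passed`, `criticals`, `to_dict()`). `FakeResult` is a hypothetical stand-in so the example runs without goodhart installed, and its string entries are placeholders for the real Verdict objects.

```python
import json
from typing import Any, Dict


def report_as_json(result: Any) -> str:
    """Serialize an analysis result using the attributes shown in the README.

    Assumes `result` exposes `passed`, `criticals`, and `to_dict()`.
    """
    payload: Dict[str, Any] = {
        "passed": bool(result.passed),
        "n_criticals": len(result.criticals),
        "details": result.to_dict(),
    }
    return json.dumps(payload, indent=2, sort_keys=True)


class FakeResult:
    """Hypothetical stand-in; real results come from goodhart.analyze()."""
    passed = False
    criticals = ["idle_exploit"]  # placeholder strings, not real Verdict objects

    def to_dict(self) -> Dict[str, Any]:
        return {"criticals": list(self.criticals)}


if __name__ == "__main__":
    print(report_as_json(FakeResult()))
```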

Decorator (annotate a Python reward function)

from goodhart import reward_function, RewardSource, RewardType

ALIVE_BONUS = 1.0
VELOCITY_SCALE = 0.5
CTRL_COST = -0.001

@reward_function(
    max_steps=1000, gamma=0.99, n_actions=8, action_type="continuous",
    sources=[
        RewardSource("alive", RewardType.PER_STEP, ALIVE_BONUS,
                     requires_action=False, intentional=True),
        RewardSource("velocity", RewardType.PER_STEP, VELOCITY_SCALE,
                     intentional=True, state_dependent=True),
        RewardSource("ctrl", RewardType.PER_STEP, CTRL_COST,
                     requires_action=True),
    ],
)
def compute_reward(obs, action, info):
    return ALIVE_BONUS + obs["velocity"] * VELOCITY_SCALE + CTRL_COST * sum(a**2 for a in action)

# The function works normally AND carries analysis metadata
compute_reward(obs, action, info)        # returns reward
compute_reward.goodhart_check()          # prints full report
assert compute_reward.goodhart_passed()  # CI gate

Constants are defined once and shared between the decorator and the function body -- no duplication, no drift.
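
For readers unfamiliar with the pattern, the following sketch shows how a decorator can attach metadata to a function while leaving it callable as before. It is a generic Python illustration, not goodhart's implementation. The `with_metadata` decorator and the `metadata` attribute are hypothetical names.

```python
import functools
from typing import Any, Callable


def with_metadata(**metadata: Any) -> Callable:
    """Attach keyword metadata to a function without changing its behavior."""
    def decorate(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            return func(*args, **kwargs)
        wrapper.metadata = dict(metadata)  # hypothetical attribute name
        return wrapper
    return decorate


ALIVE_BONUS = 1.0


@with_metadata(max_steps=1000, gamma=0.99, alive_bonus=ALIVE_BONUS)
def reward(velocity: float) -> float:
    # The same constant feeds both the metadata and the computation.
    return ALIVE_BONUS + 0.5 * velocity


if __name__ == "__main__":
    print(reward(2.0))      # the function still computes the reward
    print(reward.metadata)  # and the declared structure travels with it
```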

AI Assistant (Claude Code, Cursor)

If you use an AI coding assistant, goodhart can run automatically when you discuss reward design. Add the following to your MCP config (one-time setup):

{
  "mcpServers": {
    "goodhart": {
      "command": "python",
      "args": ["-m", "goodhart.mcp_server"]
    }
  }
}

Claude Code: add to ~/.claude/settings.json
Cursor: add to .cursor/mcp.json

Then describe your reward in conversation. The assistant calls goodhart_check automatically and explains the findings. 8 tools are available, covering check, doctor, rule explanations, and browsing of presets and examples.
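
If the config file already lists other MCP servers, the goodhart entry has to be merged in rather than pasted over the file. The sketch below shows one way to do that merge on an in-memory dict. The helper name `add_goodhart_server` and the "other" server in the demo are hypothetical; reading and writing the actual file is left out.

```python
import json
from typing import Any, Dict

# Entry copied from the MCP config shown above.
GOODHART_SERVER = {"command": "python", "args": ["-m", "goodhart.mcp_server"]}


def add_goodhart_server(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with the goodhart MCP entry added.

    Existing servers are preserved; an existing "goodhart" entry is overwritten.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["goodhart"] = dict(GOODHART_SERVER)
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
    print(json.dumps(add_goodhart_server(existing), indent=2))
```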

YAML Configuration

# my_experiment.yaml
environment:
  name: "MiniHack-Navigation"
  max_steps: 500
  gamma: 0.999
  reward_sources:
    - name: goal
      type: terminal
      value: 1.0
      discovery_probability: 0.05
    - name: step penalty
      type: per_step
      value: -0.001

training:
  algorithm: APPO
  lr: 0.0002
  entropy_coeff: 0.0001
  num_envs: 256
  total_steps: 10000000
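
To get a feel for the numbers in this config, the sketch below computes the discounted step penalty accumulated over a full episode and compares it with an optimistic bound on the expected goal reward. It is a back-of-the-envelope calculation, not the analysis goodhart performs. Treating `discovery_probability` as a per-episode probability is an assumption.

```python
def discounted_per_step_return(value: float, gamma: float, steps: int) -> float:
    """Sum of value * gamma**t for t = 0 .. steps-1 (geometric series)."""
    if gamma == 1.0:
        return value * steps
    return value * (1.0 - gamma ** steps) / (1.0 - gamma)


# Numbers copied from my_experiment.yaml above.
MAX_STEPS = 500
GAMMA = 0.999
GOAL_VALUE = 1.0
DISCOVERY_PROBABILITY = 0.05
STEP_PENALTY = -0.001

if __name__ == "__main__":
    penalty_total = discounted_per_step_return(STEP_PENALTY, GAMMA, MAX_STEPS)
    # Assumption: the goal is found with DISCOVERY_PROBABILITY per episode,
    # and its reward is left undiscounted as an optimistic upper bound.
    expected_goal = DISCOVERY_PROBABILITY * GOAL_VALUE
    print(f"accumulated step penalty: {penalty_total:.4f}")
    print(f"expected goal reward (upper bound): {expected_goal:.4f}")
```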

Presets

23 presets from published papers, with hyperparameters sourced from the original publications:

goodhart --preset              # list all presets
goodhart --preset coast-runners  # run CoastRunners (loop exploit)
goodhart --preset humanoid       # run Humanoid (idle exploit)
goodhart --preset cartpole       # run CartPole (clean pass)
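
The loop exploit behind the CoastRunners preset can be understood with a two-line comparison: a reward that respawns can be collected many times per episode, while the goal pays once. The sketch below uses entirely hypothetical numbers and undiscounted returns; it does not reproduce the preset's actual parameters or goodhart's rule.

```python
def loop_return(reward_per_loop: float, loop_length: int, max_steps: int) -> float:
    """Undiscounted return from repeating a respawning reward until time runs out."""
    return reward_per_loop * (max_steps // loop_length)


def goal_return(goal_reward: float) -> float:
    """Undiscounted return from finishing the task once."""
    return goal_reward


if __name__ == "__main__":
    # All numbers below are hypothetical, chosen only to show the comparison.
    looping = loop_return(reward_per_loop=10.0, loop_length=25, max_steps=1000)
    finishing = goal_return(100.0)
    print(f"loop: {looping:+.0f}, goal: {finishing:+.0f}")
    print("loop exploit" if looping > finishing else "goal dominates")
```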

Rules

40 composable rules in four categories:

goodhart --rules      # list all with descriptions
goodhart --explain X  # deep-dive on rule X

15 reward rules: penalty dominance, death incentive, idle exploit, exploration threshold, respawning exploit, death reset, shaping loops, shaping safety (Ng 1999), proxy hackability (Skalse 2022), intrinsic sufficiency, budget sufficiency, compound traps, staged plateaus, reward dominance, exponential saturation
13 training rules: learning rate regime (all algorithms), critic LR ratio, entropy regime, clip fraction risk (PPO), expert collapse, batch size interaction, parallelism effect, memory capacity, replay buffer ratio (off-policy), target network update (DQN), epsilon schedule (DQN), soft update rate (SAC/DDPG/TD3), SAC alpha
4 architecture rules: embedding capacity, routing floor necessity, recurrence type, actor count effect
8 blind-spot advisories: pattern-based hints about failure modes static analysis cannot detect (physics exploits, goal misgeneralization, credit assignment depth, constrained RL, non-stationarity, learned rewards, missing constraints, aggregation traps)

The 15 reward structure rules are algorithm-agnostic: they analyze the MDP reward regardless of training algorithm. The 13 training rules cover PPO, APPO, DQN, SAC, DDPG, TD3, IMPALA, and A2C with algorithm-specific thresholds and checks.
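
"Composable" here can be pictured as rules that are independent functions over a shared config, combined by a small runner. The sketch below illustrates that general pattern only. It is not goodhart's internal design, and the `Finding` type, the severity levels, the `gamma_range` rule, and the threshold used in `penalty_dominance` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Finding:
    rule: str
    severity: str  # "critical" or "warning" (hypothetical levels)
    message: str


Config = Dict[str, float]
Rule = Callable[[Config], List[Finding]]


def penalty_dominance(cfg: Config) -> List[Finding]:
    """Flag configs where the total step penalty outweighs the goal reward."""
    total_penalty = abs(cfg["penalty"]) * cfg["max_steps"]
    if total_penalty > cfg["goal"]:
        message = f"total penalty {total_penalty:.2f} exceeds goal {cfg['goal']:.2f}"
        return [Finding("penalty_dominance", "critical", message)]
    return []


def gamma_range(cfg: Config) -> List[Finding]:
    """Flag discount factors outside (0, 1]."""
    if not 0.0 < cfg["gamma"] <= 1.0:
        return [Finding("gamma_range", "warning", "gamma should lie in (0, 1]")]
    return []


def run_rules(cfg: Config, rules: List[Rule]) -> List[Finding]:
    """Apply each rule independently and concatenate the findings."""
    findings: List[Finding] = []
    for rule in rules:
        findings.extend(rule(cfg))
    return findings


if __name__ == "__main__":
    cfg = {"goal": 1.0, "penalty": -0.01, "max_steps": 500, "gamma": 0.999}
    for f in run_rules(cfg, [penalty_dominance, gamma_range]):
        print(f"{f.severity.upper()}: {f.rule}: {f.message}")
```

Because each rule only reads the config and returns findings, rules can be added, removed, or reordered without touching the others.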

What it catches vs. what it can't

Catches (from configuration alone):

Degenerate equilibria (standing still, dying fast)
Respawning reward loops (CoastRunners, YouTube watch time)
Death-as-reset exploits (Road Runner level replay)
Shaping reward cycles vs. potential-based shaping (Ng 1999)
Reward deserts (no gradient signal, e.g., Mountain Car)
Proxy reward hackability (Skalse 2022)
Expert collapse, entropy issues, budget insufficiency
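
The difference between a shaping cycle and potential-based shaping in the list above can be checked numerically. In this sketch, the shaping reward is summed around a closed loop of states: a potential-based term telescopes to zero (with gamma = 1), while an ad hoc "progress bonus" leaves a positive sum that an agent could farm. The states, the potential values, and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

State = str


def cycle_return(cycle: List[State], shaping: Callable[[State, State], float]) -> float:
    """Undiscounted sum of shaping rewards along a closed loop of states."""
    transitions: List[Tuple[State, State]] = list(zip(cycle, cycle[1:] + cycle[:1]))
    return sum(shaping(s, s_next) for s, s_next in transitions)


# Hypothetical potential over three states.
PHI: Dict[State, float] = {"A": 0.0, "B": 2.0, "C": 5.0}


def potential_based(s: State, s_next: State, gamma: float = 1.0) -> float:
    """F(s, s') = gamma * Phi(s') - Phi(s), the form analyzed by Ng et al. (1999)."""
    return gamma * PHI[s_next] - PHI[s]


def progress_bonus(s: State, s_next: State) -> float:
    """Ad hoc shaping: reward increases in Phi, ignore decreases."""
    return max(0.0, PHI[s_next] - PHI[s])


if __name__ == "__main__":
    loop = ["A", "B", "C"]
    print(cycle_return(loop, potential_based))  # telescopes to zero
    print(cycle_return(loop, progress_bonus))   # positive: the loop can be farmed
```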

Cannot catch (emits advisory hints when config patterns match):

Physics engine exploits (box surfing, leg hooking)
Goal misgeneralization (CoinRun "go right")
Learned reward model gaming (RLHF overoptimization)
Missing reward terms (tokamak coil balance)
Non-stationarity in self-play
Episode-level aggregation traps (Sharpe ratio)

Examples

57 cookbook examples spanning 40+ published papers from 1983-2025:

goodhart --examples              # list all
goodhart --example coast_runners # run one

Examples include documented failures (CoastRunners, Humanoid, Mountain Car), positive design patterns (Pendulum, CartPole, Breakout), industrial applications (YouTube, data center cooling, tokamak plasma, sepsis treatment), and honest limitation cases showing what static analysis cannot detect.

Formal Proofs

17 rules link to machine-verified Lean 4 theorems (92 theorems, zero sorry). Each link has a strength level:

VERIFIED (9 rules): The Python check is a direct instance of the theorem.
GROUNDED (3 rules): The theorem proves the core. Python extends with discounting and thresholds.
MOTIVATED (5 rules): The theorem proves WHY the issue matters. Python checks a structural heuristic.

Key formalizations:

Ng 1999 Theorem 1: Potential-based reward shaping preserves V* (sufficiency, necessity, general policy version, undiscounted extension). Full MDP with Bellman contraction via Banach fixed point theorem.
Skalse 2022 Theorems 1-3: Hackability impossibility on open sets, existence of unhackable pairs, simplification characterization. Includes a machine-verified proof that Theorem 2's non-trivial witness construction requires |Pi| >= 3; for |Pi| = 2 only trivial witnesses exist (documented edge case, see proofs/GoodhartProofs/Skalse.lean).
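
For readers who have not seen the shaping result, its usual textbook form is sketched below. This is the standard statement from Ng et al. (1999), written in conventional notation (potential \Phi, discount \gamma, original MDP M, shaped MDP M'). It is not a transcription of the Lean theorems in this repository, whose exact statements may differ.

```latex
% Shaped MDP M' has reward R' = R + F, with a potential \Phi : S \to \mathbb{R}.
\[
  F(s, a, s') = \gamma\,\Phi(s') - \Phi(s)
\]
% Optimal action values shift by a state-only term,
% so the maximizing action in every state is unchanged:
\[
  Q^{*}_{M'}(s, a) = Q^{*}_{M}(s, a) - \Phi(s),
  \qquad
  \arg\max_{a} Q^{*}_{M'}(s, a) = \arg\max_{a} Q^{*}_{M}(s, a).
\]
```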

cd proofs
lake build  # requires Lean 4 + Mathlib
# Should complete with zero sorry, zero errors

Auto-Detection

Automatically detect reward structure from a Gymnasium environment:

pip install goodhart[detect]
goodhart --detect CartPole-v1
goodhart --detect MountainCar-v0

License

Apache 2.0

Release history

1.2.1 (May 4, 2026)
1.1.0 (May 2, 2026)
1.0.0 (May 2, 2026)
0.1.0 (Apr 27, 2026), this version

Download files

Source Distribution

goodhart-0.1.0.tar.gz (155.5 kB)

Uploaded Apr 27, 2026 Source

Built Distribution

goodhart-0.1.0-py3-none-any.whl (176.5 kB)

Uploaded Apr 27, 2026 Python 3

File details

Details for the file goodhart-0.1.0.tar.gz.

File metadata

Download URL: goodhart-0.1.0.tar.gz
Upload date: Apr 27, 2026
Size: 155.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for goodhart-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`4a623a3439e4a6c376a5cb898577eeebe2138b3eb31e45d7c7fbdb7a16ac5553`
MD5	`35d6dbc3c1b7e51cebd7ac687c2dce53`
BLAKE2b-256	`4c8c3cf2489fc65bf4e3f9442fe012a88385481042b840a0db606b543889a210`
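
A downloaded file can be checked against the SHA256 digest listed above using only the Python standard library. This is a minimal sketch; the helper names are hypothetical, and the usage line assumes the sdist has been saved to the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest against a published value, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Digest copied from the table above for goodhart-0.1.0.tar.gz.
EXPECTED_SDIST_SHA256 = "4a623a3439e4a6c376a5cb898577eeebe2138b3eb31e45d7c7fbdb7a16ac5553"

# Usage, assuming the sdist has been downloaded to the current directory:
#   matches_published(Path("goodhart-0.1.0.tar.gz"), EXPECTED_SDIST_SHA256)
```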

Provenance

The following attestation bundles were made for goodhart-0.1.0.tar.gz:

Publisher: publish.yml on audieleon/goodhart

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: goodhart-0.1.0.tar.gz
- Subject digest: 4a623a3439e4a6c376a5cb898577eeebe2138b3eb31e45d7c7fbdb7a16ac5553
- Sigstore transparency entry: 1395256688
- Sigstore integration time: Apr 27, 2026
Source repository:
- Permalink: audieleon/goodhart@b5b909574491aa86d37b2e7c2e5b75d923e47750
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/audieleon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b5b909574491aa86d37b2e7c2e5b75d923e47750
- Trigger Event: release

File details

Details for the file goodhart-0.1.0-py3-none-any.whl.

File metadata

Download URL: goodhart-0.1.0-py3-none-any.whl
Upload date: Apr 27, 2026
Size: 176.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for goodhart-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4ae645b9c8a40f50a2da8fafd67661b83f59f6f90888fbafda4ae203e859936e`
MD5	`8fdc1fa0a5f5ff7122a872a42ea8ad06`
BLAKE2b-256	`a81f7dd2f871ca437d4407c6964b924d22f7c17402e57b03dd9703f9de156794`

Provenance

The following attestation bundles were made for goodhart-0.1.0-py3-none-any.whl:

Publisher: publish.yml on audieleon/goodhart

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: goodhart-0.1.0-py3-none-any.whl
- Subject digest: 4ae645b9c8a40f50a2da8fafd67661b83f59f6f90888fbafda4ae203e859936e
- Sigstore transparency entry: 1395256692
- Sigstore integration time: Apr 27, 2026
Source repository:
- Permalink: audieleon/goodhart@b5b909574491aa86d37b2e7c2e5b75d923e47750
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/audieleon
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b5b909574491aa86d37b2e7c2e5b75d923e47750
- Trigger Event: release
