mechreward

Mechanistic interpretability as a reward signal for RL training of LLMs.
Most RL-for-reasoning methods reward the output: "did the final answer match?" (outcome reward), "did each step look correct?" (PRM), "did a judge like it?" (LLM-as-judge).
mechreward rewards the process inside the model. Using sparse-autoencoder (SAE) features from interpretability research, we ask a fundamentally different question:
Is the model actually doing the cognitive work we want it to do, at the circuit level?
A model trained against a feature reward like `+1 × fact_retrieval_active − 0.5 × hedging` can't trivially game the reward without activating those circuits, which requires actually doing retrieval rather than hedging. The gradient signal is grounded in the model's internal state, not just its text output.
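As a toy illustration of what a weighted feature reward computes (this is NOT the library's API; the feature indices and weights below are made up):

```python
import numpy as np

def feature_reward(feature_acts, weights):
    """Toy weighted feature reward: sum_i w_i * (mean activation of feature i).

    feature_acts: (num_tokens, num_features) SAE activations for one rollout.
    weights: dict mapping feature index -> reward weight.
    """
    mean_acts = feature_acts.mean(axis=0)  # average each feature over tokens
    return sum(w * mean_acts[i] for i, w in weights.items())

# Hypothetical feature indices: 101 = "fact retrieval", 202 = "hedging".
acts = np.zeros((4, 300))
acts[:, 101] = 2.0   # fact-retrieval feature fires on every token
acts[:, 202] = 1.0   # hedging feature also fires
r = feature_reward(acts, {101: 1.0, 202: -0.5})  # 1.0*2.0 - 0.5*1.0 = 1.5
```

The reward is maximized by rollouts whose internal activations, not whose surface text, match the desired pattern.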
Status
Alpha (v0.1.0). API is subject to change. Tested against Gemma-2-9B with Gemma Scope SAEs. Integrations with trl (GRPO), openrlhf, and verl available.
See RESEARCH.md for the scientific context, prior art (SARM, SparseRM, CRL, YaPO), and what makes mechreward different from existing work.
Install
```bash
# Core
pip install mechreward

# With SAE support (sae_lens integration)
pip install "mechreward[sae]"

# With TRL integration (GRPOTrainer hook)
pip install "mechreward[sae,trl]"

# Everything
pip install "mechreward[all]"
```
Quickstart — feature reward in 10 lines
```python
import mechreward as mr
from trl import GRPOConfig, GRPOTrainer

# 1. Load a pre-trained SAE (Gemma Scope from Google DeepMind)
sae = mr.load_sae(
    release="gemma-scope-9b-pt-res-canonical",
    sae_id="layer_22/width_16k/canonical",
)

# 2. Build a feature reward from a named pack
reward = mr.FeatureReward.from_pack(
    "gemma-2-9b/reasoning_pack",
    sae=sae,
    aggregation="mean_last_32_tokens",
)

# 3. Combine with an outcome reward (math verifier)
composite = mr.CompositeReward(
    rewards=[
        reward,
        mr.OutcomeReward(verifier=mr.verifiers.math_boxed),
    ],
    weights=[0.3, 1.0],
)

# 4. Plug into TRL GRPOTrainer (unchanged API)
trainer = GRPOTrainer(
    model="google/gemma-2-9b",
    args=GRPOConfig(output_dir="./out", num_generations=8),
    train_dataset=my_dataset,
    reward_funcs=composite,
)
trainer.train()
```
That's it. The feature reward runs alongside the outcome reward during each GRPO step, with anti-hacking detection and KL regularization enabled by default.
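The `aggregation="mean_last_32_tokens"` option in the quickstart suggests per-token feature activations are pooled before scoring. A plain-NumPy sketch of that pooling (assumed semantics, not the library's actual implementation):

```python
import numpy as np

def mean_last_n_tokens(feature_acts, n=32):
    """Pool per-token SAE feature activations by averaging the last n tokens.

    feature_acts: (seq_len, num_features); shorter sequences use all tokens.
    Returns a (num_features,) vector that the reward weights are applied to.
    """
    window = feature_acts[-n:] if feature_acts.shape[0] > n else feature_acts
    return window.mean(axis=0)

acts = np.arange(40, dtype=float).reshape(40, 1)  # 40 tokens, 1 feature
pooled = mean_last_n_tokens(acts, n=32)           # mean of tokens 8..39
```

Pooling over the tail of the sequence focuses the reward on the answer-producing part of the rollout rather than the prompt echo.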
Why this could work
Nearly every published post-training technique for reasoning today rewards a property of the output: the final answer, a labeled intermediate step, or a judge's verdict. All of these have brittle failure modes:
- Outcome reward gives sparse signal, doesn't distinguish lucky guesses from real reasoning.
- Process reward models get hacked within a few thousand steps — DeepSeek-R1 explicitly abandoned them.
- LLM-as-judge is adversarially fragile (arxiv:2507.08794 — "One Token to Fool LLM-as-a-Judge").
Meanwhile, mechanistic interpretability research has shown that specific SAE features reliably light up during specific cognitive operations:
- fact retrieval — arxiv:2408.05147 (Gemma Scope)
- confidence vs hedging — arxiv:2411.11296 (Microsoft refusal steering)
- chain-of-reasoning — well-documented in Anthropic's Claude 3 Sonnet interpretability work
If we reward the internal pattern instead of the output token, we're reaching a different layer of the stack — one that's harder to game at the surface, and that lines up more directly with what we actually want the model to learn.
What makes this different from SARM / CRL / SparseRM
There are several excellent papers using SAE features around reward modeling:
| Method | What it does | What mechreward adds |
|---|---|---|
| SARM (AAAI 26) | SAE features → linear head → reward model, used in offline RLHF | Online GRPO use; multi-objective; composability with outcome verifier |
| SparseRM | Preference modeling via frequency-diff features | Reward is trajectory-level, not pairwise |
| CRL | Token-level feature amplification via RL | Reward is feature activation, not action selection |
| YaPO | SAE-sparse steering vectors | We don't modify inference-time activations |
| Wilhelm et al. | SAE features detect reward hacking | We use the same probes during training to prevent it |
The novel contribution is the combination: online GRPO + SAE feature reward + anti-hacking dual verification + composability with standard outcome verifiers, shipped as a reusable library. To our knowledge, nobody has shipped this combination as a plug-in library before.
See RESEARCH.md for the full positioning and verified prior-art audit.
Anti-Goodhart is built in
The central risk of any reward signal is Goodhart's law: the model learns to maximize the measure without doing the underlying work. Feature reward is especially vulnerable because SAE features are effectively linear probes, and linear probes are trivially gameable in the limit.
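Concretely, a standard SAE encoder is a ReLU of an affine map, so each feature activation is a rectified linear probe on the residual stream. A minimal sketch of why that is gameable, with random weights standing in for a trained SAE:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = rng.normal(size=d_sae)

def sae_encode(h):
    """f = ReLU(h @ W_enc + b_enc): each feature is a rectified linear probe."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

h = rng.normal(size=d_model)
f = sae_encode(h)

# Pushing h along a feature's encoder direction inflates that feature's
# activation without any underlying "cognitive work" -- the Goodhart risk.
h_gamed = h + 10.0 * W_enc[:, 0]
f_gamed = sae_encode(h_gamed)
```

A policy under pure feature reward has an incentive to find such directions, which is exactly what the dual-verification machinery below is there to catch.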
mechreward addresses this with dual verification:
```python
from mechreward.hacking import DualVerifier, AdversarialSuite

# A second, independent signal (a linear probe trained on real examples
# of the behavior) checks whether the feature activation is "honest".
dual = DualVerifier(
    feature_reward=reward,
    independent_probe=mr.load_probe("gemma-2-9b/fact_retrieval_probe"),
    disagreement_threshold=0.3,  # if they disagree >30%, downweight
)

# And an adversarial red-team suite flags suspicious rollouts during training.
detector = AdversarialSuite.from_preset("standard")
```
Each GRPO step runs the detector in parallel with the main reward computation. If it fires, the affected rollouts are downweighted or dropped. See src/mechreward/hacking/ for the full framework.
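The downweighting described above can be sketched as a simple disagreement mask (assumed semantics; the actual `DualVerifier` internals may differ):

```python
import numpy as np

def downweight_on_disagreement(feature_scores, probe_scores,
                               threshold=0.3, factor=0.1):
    """Where the feature reward and the independent probe disagree by more
    than `threshold`, scale the feature reward down by `factor`."""
    feature_scores = np.asarray(feature_scores, dtype=float)
    probe_scores = np.asarray(probe_scores, dtype=float)
    disagree = np.abs(feature_scores - probe_scores) > threshold
    return np.where(disagree, feature_scores * factor, feature_scores)

# Rollout 0: feature and probe agree, reward kept.
# Rollout 1: feature fires but probe says no -- reward downweighted.
r = downweight_on_disagreement([0.9, 0.8], [0.85, 0.2])
```

Gaming the reward now requires fooling two independently trained detectors at once, which is strictly harder than fooling one.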
Supported models
| Model | SAE source | Status |
|---|---|---|
| Gemma-2-9B | Gemma Scope (Google DeepMind, all layers) | ✅ Primary target |
| Gemma-2-2B | Gemma Scope | ✅ Quick experiments |
| Gemma-2-27B | Gemma Scope (selected layers) | ✅ Supported |
| Llama-3.1-8B-Base | Llama Scope | ✅ Supported |
| Llama-3.1-8B-Instruct | Goodfire SAE L19 | ✅ Supported |
| Llama-3.3-70B-Instruct | Goodfire SAE L50 | ⚠️ Experimental (compute-heavy) |
| Qwen, Mistral, DeepSeek | no public SAE | ❌ Needs custom training |
To add a new model, see docs/training_new_sae.md for the sae_lens-based training recipe.
Repository layout
```
mechreward/
├── src/mechreward/
│   ├── sae/            # SAE loading, caching, batched encoding
│   ├── features/       # Feature catalogs, Neuronpedia client, auto-interp
│   ├── reward/         # FeatureReward core, aggregation, composition
│   ├── hacking/        # Dual verification, adversarial, regularization
│   ├── probes/         # Linear probe baseline + training utilities
│   ├── rollout/        # HF and vLLM integration with hidden-state capture
│   └── integrations/   # TRL, OpenRLHF, verl adapters
├── catalogs/           # Pre-validated feature packs (JSON)
├── experiments/        # The 7 reference experiments from the research plan
├── benchmarks/         # Evaluation harnesses
└── tests/              # Unit + integration tests
```
The 7 reference experiments
experiments/ contains a full research pipeline:
- 01_baseline_outcome_only.py — outcome-reward GRPO baseline on GSM8K+MATH
- 02_mechreward_only.py — the most speculative experiment: mechreward alone, no outcome reward
- 03_hybrid_outcome_plus_mech.py — the commercially relevant combination
- 04_sarm_reproduction.py — reproduces Liu et al. 2508.08746 for comparison
- 05_crl_reproduction.py — reproduces Cho/Wu/Koshiyama 2602.10437
- 06_adversarial_hacking_suite.py — red-team suite + detection
- 07_capability_preservation.py — MMLU/HellaSwag pre/post RL
Run any of them after install:
```bash
python experiments/03_hybrid_outcome_plus_mech.py --config configs/hybrid.yaml
```
How it talks to TRL
The tricky part of integrating feature rewards with a GRPO trainer is that the standard reward_funcs API only gets strings and token IDs — not hidden states. mechreward solves this by providing a TRL-compatible wrapper that registers a forward hook on the policy and extracts the residual stream at the target layer during the reward computation:
```python
from mechreward.integrations.trl_grpo import MechRewardGRPOTrainer

trainer = MechRewardGRPOTrainer(
    model="google/gemma-2-9b",
    reward_funcs=[feature_reward, outcome_reward],
    ...,
)
```
MechRewardGRPOTrainer wraps trl.GRPOTrainer and adds:
- Forward-hook registration on the SAE layer
- Residual-stream capture during rollout
- SAE encoding of hidden states
- Feature-reward computation from activations
- Hacking detection on the side
The rest of GRPO (policy gradient, KL, advantage computation) is unchanged.
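In plain PyTorch, the hook-based capture step looks roughly like this (a toy two-layer model stands in for the policy; the real integration targets a specific residual-stream layer):

```python
import torch
import torch.nn as nn

# Toy stand-in for the policy: two stacked layers.
model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))
captured = {}

def capture_hook(module, inputs, output):
    # Stash the layer's output (the "residual stream" in the real setup)
    # so the SAE can encode it after the rollout forward pass.
    captured["resid"] = output.detach()

handle = model[0].register_forward_hook(capture_hook)
with torch.no_grad():
    _ = model(torch.randn(2, 5, 16))  # (batch, seq, d_model)
handle.remove()  # always detach hooks so they don't leak into later steps
```

Because the hook fires during the forward pass the trainer already runs, the feature reward adds one SAE encode per rollout rather than a second full forward pass.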
Testing
```bash
pip install "mechreward[dev]"
pytest
```
Integration tests require small SAEs and use Gemma-2-2B by default to stay under laptop compute.
Research context
This library exists because of a specific empirical observation: fine-tuning Qwen3.5-9B on ProcessFlow v1.7 (108k synthetic reasoning samples) gave a 93% loss reduction but zero PFE-Eval improvement. The model learned the format, not the skill. Neither Full FT nor LoRA moved the Judge delta by more than 0.005.
The hypothesis: we need reward signals that point at the cognitive circuits we want to strengthen, not at the output distribution. Mech interp gives us a handle on those circuits. This library is the infrastructure to test that hypothesis.
If it doesn't work, that's also a publishable result — a systematic negative result on "feature rewards fail to transfer to reasoning" is a contribution.
Contributing
This is alpha software. Issues and PRs welcome, but expect rapid breakage. See CONTRIBUTING.md.
Citation
If you use mechreward in research, please cite:
```bibtex
@software{mechreward2026,
  author = {Vicentino, Caio},
  title  = {mechreward: Mechanistic interpretability as reward signal for RL},
  year   = {2026},
  url    = {https://github.com/caiovicentino/mechreward}
}
```
License
Apache 2.0. See LICENSE.
Related projects
- SAE Lens — SAE training and loading
- Gemma Scope — pre-trained SAEs for Gemma
- TransformerLens — interpretability primitives
- nnsight — model internals API
- TRL — HuggingFace RL library
- OpenRLHF — scalable RLHF
- verl — ByteDance reasoning RL
- Neuronpedia — interactive SAE feature explorer
- Delphi — automated interpretability