
mechreward

Mechanistic interpretability as a reward signal for RL training of LLMs.

Most RL-for-reasoning methods reward the output: "did the final answer match?" (outcome reward), "did each step look correct?" (PRM), "did a judge like it?" (LLM-as-judge).

mechreward rewards the process inside the model. Using sparse-autoencoder (SAE) features from interpretability research, we ask a fundamentally different question:

Is the model actually doing the cognitive work we want it to do, at the circuit level?

A model trained against a feature reward like +1 × fact_retrieval_active - 0.5 × hedging can't trivially game the reward: earning it requires activating those circuits, which in turn requires doing real retrieval and not hedging. The gradient signal is grounded in the model's internal state, not just its text output.
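As a toy illustration (not the library's API), a linear feature reward of this shape is just a weighted sum over aggregated SAE feature activations. The feature IDs and weights below are made up for the example:

```python
# Toy sketch of a linear feature reward over SAE activations.
# Feature indices and weights are illustrative, not real catalog entries.

def feature_reward(sae_acts, weights):
    """sae_acts: dict feature_id -> aggregated activation for the rollout.
    weights: dict feature_id -> weight (+ to encourage, - to penalize)."""
    return sum(w * sae_acts.get(fid, 0.0) for fid, w in weights.items())

weights = {
    12345: +1.0,   # hypothetical "fact retrieval" feature
    67890: -0.5,   # hypothetical "hedging" feature
}
acts = {12345: 0.8, 67890: 0.2}
print(round(feature_reward(acts, weights), 3))  # 0.7
```

Because the score is linear in activations, the only way to raise it is to shift the activations themselves, which is exactly the property the anti-hacking machinery below is meant to protect.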

Status

Alpha (v0.1.0). API is subject to change. Tested against Gemma-2-9B with Gemma Scope SAEs. Integrations with trl (GRPO), openrlhf, and verl available.

See RESEARCH.md for the scientific context, prior art (SARM, SparseRM, CRL, YaPO), and what makes mechreward different from existing work.

Install

# Core
pip install mechreward

# With SAE support (sae_lens integration)
pip install "mechreward[sae]"

# With TRL integration (GRPOTrainer hook)
pip install "mechreward[sae,trl]"

# Everything
pip install "mechreward[all]"

Quickstart — feature reward in 10 lines

import mechreward as mr
from trl import GRPOConfig, GRPOTrainer

# 1. Load a pre-trained SAE (Gemma Scope from Google DeepMind)
sae = mr.load_sae(
    release="gemma-scope-9b-pt-res-canonical",
    sae_id="layer_22/width_16k/canonical",
)

# 2. Build a feature reward from a named pack
reward = mr.FeatureReward.from_pack(
    "gemma-2-9b/reasoning_pack",
    sae=sae,
    aggregation="mean_last_32_tokens",
)

# 3. Combine with an outcome reward (math verifier)
composite = mr.CompositeReward(
    rewards=[
        reward,
        mr.OutcomeReward(verifier=mr.verifiers.math_boxed),
    ],
    weights=[0.3, 1.0],
)

# 4. Plug into TRL GRPOTrainer (unchanged API)
trainer = GRPOTrainer(
    model="google/gemma-2-9b",
    args=GRPOConfig(output_dir="./out", num_generations=8),
    train_dataset=my_dataset,
    reward_funcs=composite,
)
trainer.train()

That's it. The feature reward runs alongside the outcome reward during each GRPO step, with anti-hacking detection and KL regularization enabled by default.
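For intuition about the aggregation argument above: a name like "mean_last_32_tokens" plausibly averages a feature's per-token activation over the final 32 token positions before scoring. A minimal sketch of that semantics (the library's actual implementation may differ):

```python
# Sketch of a "mean over the last k tokens" aggregation for per-token
# SAE feature activations. Semantics assumed, not taken from the library.

def aggregate_mean_last_k(per_token_acts, k=32):
    """per_token_acts: list of floats, one activation per token position."""
    window = per_token_acts[-k:]          # last k tokens (or all, if shorter)
    return sum(window) / len(window)

acts = [0.0] * 100 + [1.0] * 32           # feature fires only on the last 32 tokens
print(aggregate_mean_last_k(acts))        # 1.0
```

Windowed aggregation like this focuses the reward on the end of the rollout, where a reasoning model typically commits to its answer.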

Why this could work

Mainstream post-training techniques for reasoning reward either (a) the final answer or (b) a human-labeled intermediate step. Both have brittle failure modes:

  • Outcome reward gives sparse signal, doesn't distinguish lucky guesses from real reasoning.
  • Process reward models get hacked within a few thousand steps — DeepSeek-R1 explicitly abandoned them.
  • LLM-as-judge is adversarially fragile (arxiv:2507.08794 — "One Token to Fool LLM-as-a-Judge").

Meanwhile, mechanistic interpretability research has shown that specific SAE features reliably light up during specific cognitive operations:

  • fact retrieval — arxiv:2408.05147 (Gemma Scope)
  • confidence vs hedging — arxiv:2411.11296 (Microsoft refusal steering)
  • chain-of-reasoning — well-documented in Anthropic's Claude 3 Sonnet interpretability work

If we reward the internal pattern instead of the output token, we're reaching a different layer of the stack — one that's harder to game at the surface, and that lines up more directly with what we actually want the model to learn.

What makes this different from SARM / CRL / SparseRM

There are several excellent papers using SAE features around reward modeling:

| Method | What it does | What mechreward adds |
|---|---|---|
| SARM (AAAI 26) | SAE features → linear head → reward model, used in offline RLHF | Online GRPO use; multi-objective; composability with outcome verifier |
| SparseRM | Preference modeling via frequency-diff features | Reward is trajectory-level, not pairwise |
| CRL | Token-level feature amplification via RL | Reward is feature activation, not action selection |
| YaPO | SAE-sparse steering vectors | We don't modify inference-time activations |
| Wilhelm et al. | SAE features detect reward hacking | We use the same probes during training to prevent it |

The novel contribution is the combination: online GRPO + SAE feature reward + anti-hacking dual verification + composability with standard outcome verifiers, shipped as a reusable plug-in library. To our knowledge, no prior work packages all of these together.

See RESEARCH.md for the full positioning and verified prior-art audit.

Anti-Goodhart is built in

The central risk of any reward signal is Goodhart's law: the model learns to maximize the measure without doing the underlying work. Feature reward is especially vulnerable because SAE features are effectively linear probes, and linear probes are trivially gameable in the limit.

mechreward addresses this with dual verification:

from mechreward.hacking import DualVerifier, AdversarialSuite

# A second, independent signal (a linear probe trained on real examples
# of the behavior) checks whether the feature activation is "honest".
dual = DualVerifier(
    feature_reward=reward,
    independent_probe=mr.load_probe("gemma-2-9b/fact_retrieval_probe"),
    disagreement_threshold=0.3,  # if they disagree >30%, downweight
)

# And an adversarial red-team suite flags suspicious rollouts during training.
detector = AdversarialSuite.from_preset("standard")

Each GRPO step runs the detector in parallel with the main reward computation. If it fires, the affected rollouts are downweighted or dropped. See src/mechreward/hacking/ for the full framework.
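The downweighting rule described above can be sketched as follows; the normalization, threshold semantics, and penalty factor are my assumptions for illustration, not the library's exact behavior:

```python
# Sketch of dual-verification downweighting: if the SAE feature signal and
# an independent probe disagree too much, the reward is treated as suspect.
# Scores are assumed normalized to [0, 1]; all constants are illustrative.

def downweight_on_disagreement(feature_score, probe_score, reward,
                               threshold=0.3, penalty=0.5):
    disagreement = abs(feature_score - probe_score)
    if disagreement > threshold:
        return reward * penalty   # suspicious rollout: scale the reward down
    return reward                 # signals agree: pass the reward through

print(downweight_on_disagreement(0.9, 0.2, reward=1.0))  # disagree by 0.7 -> 0.5
print(downweight_on_disagreement(0.9, 0.8, reward=1.0))  # agree          -> 1.0
```

The point of the second probe is that it is trained independently of the SAE, so a policy that games one signal is unlikely to game both the same way.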

Supported models

| Model | SAE source | Status |
|---|---|---|
| Gemma-2-9B | Gemma Scope (Google DeepMind, all layers) | ✅ Primary target |
| Gemma-2-2B | Gemma Scope | ✅ Quick experiments |
| Gemma-2-27B | Gemma Scope (selected layers) | ✅ Supported |
| Llama-3.1-8B-Base | Llama Scope | ✅ Supported |
| Llama-3.1-8B-Instruct | Goodfire SAE L19 | ✅ Supported |
| Llama-3.3-70B-Instruct | Goodfire SAE L50 | ⚠️ Experimental (compute-heavy) |
| Qwen, Mistral, DeepSeek | no public SAE | ❌ Needs custom training |

To add a new model, see docs/training_new_sae.md for the sae_lens-based training recipe.

Repository layout

mechreward/
├── src/mechreward/
│   ├── sae/            # SAE loading, caching, batched encoding
│   ├── features/       # Feature catalogs, Neuronpedia client, auto-interp
│   ├── reward/         # FeatureReward core, aggregation, composition
│   ├── hacking/        # Dual verification, adversarial, regularization
│   ├── probes/         # Linear probe baseline + training utilities
│   ├── rollout/        # HF and vLLM integration with hidden-state capture
│   └── integrations/   # TRL, OpenRLHF, verl adapters
├── catalogs/           # Pre-validated feature packs (JSON)
├── experiments/        # The 7 reference experiments from the research plan
├── benchmarks/         # Evaluation harnesses
└── tests/              # Unit + integration tests

The 7 reference experiments

experiments/ contains a full research pipeline:

  1. 01_baseline_outcome_only.py — outcome-reward GRPO baseline on GSM8K+MATH
  2. 02_mechreward_only.py — the high-risk experiment: mechreward alone, no outcome reward
  3. 03_hybrid_outcome_plus_mech.py — the commercially relevant combination
  4. 04_sarm_reproduction.py — reproduces Liu et al. 2508.08746 for comparison
  5. 05_crl_reproduction.py — reproduces Cho/Wu/Koshiyama 2602.10437
  6. 06_adversarial_hacking_suite.py — red-team suite + detection
  7. 07_capability_preservation.py — MMLU/HellaSwag pre/post RL

Run any of them after install:

python experiments/03_hybrid_outcome_plus_mech.py --config configs/hybrid.yaml

How it talks to TRL

The tricky part of integrating feature rewards with a GRPO trainer is that the standard reward_funcs API only gets strings and token IDs — not hidden states. mechreward solves this by providing a TRL-compatible wrapper that registers a forward hook on the policy and extracts the residual stream at the target layer during the reward computation:

from mechreward.integrations.trl_grpo import MechRewardGRPOTrainer

trainer = MechRewardGRPOTrainer(
    model="google/gemma-2-9b",
    reward_funcs=[feature_reward, outcome_reward],
    ...,
)

MechRewardGRPOTrainer wraps trl.GRPOTrainer and adds:

  • Forward-hook registration on the SAE layer
  • Residual-stream capture during rollout
  • SAE encoding of hidden states
  • Feature-reward computation from activations
  • Hacking detection on the side

The rest of GRPO (policy gradient, KL, advantage computation) is unchanged.
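The capture pattern described above is the standard PyTorch forward-hook idiom. A self-contained sketch with a stand-in model (the layer, model, and dict are placeholders, not mechreward's real integration):

```python
import torch
import torch.nn as nn

# Minimal sketch of hidden-state capture: register a forward hook on a
# target layer and stash its output during the forward pass. The tiny
# Sequential model here stands in for the policy's transformer stack.

captured = {}

def save_hidden(module, inputs, output):
    captured["resid"] = output.detach()   # stand-in for the residual stream

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
target_layer = model[0]                   # pretend this is the SAE's layer
handle = target_layer.register_forward_hook(save_hidden)

x = torch.randn(1, 8)
_ = model(x)                              # hook fires during this call
handle.remove()                           # always detach hooks when done

# `captured["resid"]` is what would then be SAE-encoded and scored.
print(captured["resid"].shape)            # torch.Size([1, 8])
```

Removing the hook after capture matters in a training loop: a leaked hook would keep references to activations and silently grow memory use across GRPO steps.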

Testing

pip install "mechreward[dev]"
pytest

Integration tests require small SAEs and default to Gemma-2-2B to stay within laptop-scale compute.

Research context

This library exists because of a specific empirical observation: fine-tuning Qwen3.5-9B on ProcessFlow v1.7 (108k synthetic reasoning samples) gave a 93% loss reduction but zero PFE-Eval improvement. The model learned the format, not the skill. Neither Full FT nor LoRA moved the Judge delta by more than 0.005.

The hypothesis: we need reward signals that point at the cognitive circuits we want to strengthen, not at the output distribution. Mech interp gives us a handle on those circuits. This library is the infrastructure to test that hypothesis.

If it doesn't work, that's also a publishable result — a systematic negative result on "feature rewards fail to transfer to reasoning" is a contribution.

Contributing

This is alpha software. Issues and PRs welcome, but expect rapid breakage. See CONTRIBUTING.md.

Citation

If you use mechreward in research, please cite:

@software{mechreward2026,
  author = {Vicentino, Caio},
  title = {mechreward: Mechanistic interpretability as reward signal for RL},
  year = {2026},
  url = {https://github.com/caiovicentino/mechreward}
}

License

Apache 2.0. See LICENSE.
