
mechreward

Mechanistic interpretability as a reward signal for RL training of LLMs.

Most RL-for-reasoning methods reward the output: "did the final answer match?" (outcome reward), "did each step look correct?" (PRM), "did a judge like it?" (LLM-as-judge).

mechreward rewards the process inside the model. Using sparse-autoencoder (SAE) features from interpretability research, we ask a fundamentally different question:

Is the model actually doing the cognitive work we want it to do, at the circuit level?

A model trained against a feature reward like +1 × fact_retrieval_active - 0.5 × hedging can't trivially game the reward: earning it requires activating those circuits, which in turn requires doing real retrieval and not hedging. The gradient signal is grounded in the model's internal state, not just its text output.
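As a toy illustration (not the library's API), a linear feature reward of this shape is just a weighted sum over aggregated SAE feature activations. The feature IDs and weights below are made up for the example:

```python
# Toy sketch of a linear feature reward over SAE activations.
# Feature indices and weights are illustrative, not real catalog entries.

def feature_reward(sae_acts, weights):
    """sae_acts: dict feature_id -> aggregated activation for the rollout.
    weights: dict feature_id -> weight (+ to encourage, - to penalize)."""
    return sum(w * sae_acts.get(fid, 0.0) for fid, w in weights.items())

weights = {
    12345: +1.0,   # hypothetical "fact retrieval" feature
    67890: -0.5,   # hypothetical "hedging" feature
}
acts = {12345: 0.8, 67890: 0.2}
print(round(feature_reward(acts, weights), 3))  # 0.7
```

Because the score is linear in activations, the only way to raise it is to shift the activations themselves, which is exactly the property the anti-hacking machinery below is meant to protect.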

Status

Alpha (v0.1.0). API is subject to change. Tested against Gemma-2-9B with Gemma Scope SAEs. Integrations with trl (GRPO), openrlhf, and verl available.

See RESEARCH.md for the scientific context, prior art (SARM, SparseRM, CRL, YaPO), and what makes mechreward different from existing work.

Install

# Core
pip install mechreward

# With SAE support (sae_lens integration)
pip install "mechreward[sae]"

# With TRL integration (GRPOTrainer hook)
pip install "mechreward[sae,trl]"

# Everything
pip install "mechreward[all]"

Quickstart — feature reward in 10 lines

import mechreward as mr
from trl import GRPOConfig, GRPOTrainer

# 1. Load a pre-trained SAE (Gemma Scope from Google DeepMind)
sae = mr.load_sae(
    release="gemma-scope-9b-pt-res-canonical",
    sae_id="layer_22/width_16k/canonical",
)

# 2. Build a feature reward from a named pack
reward = mr.FeatureReward.from_pack(
    "gemma-2-9b/reasoning_pack",
    sae=sae,
    aggregation="mean_last_32_tokens",
)

# 3. Combine with an outcome reward (math verifier)
composite = mr.CompositeReward(
    rewards=[
        reward,
        mr.OutcomeReward(verifier=mr.verifiers.math_boxed),
    ],
    weights=[0.3, 1.0],
)

# 4. Plug into TRL GRPOTrainer (unchanged API)
trainer = GRPOTrainer(
    model="google/gemma-2-9b",
    args=GRPOConfig(output_dir="./out", num_generations=8),
    train_dataset=my_dataset,
    reward_funcs=composite,
)
trainer.train()

That's it. The feature reward runs alongside the outcome reward during each GRPO step, with anti-hacking detection and KL regularization enabled by default.
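For intuition about the aggregation argument above: a name like "mean_last_32_tokens" plausibly averages a feature's per-token activation over the final 32 token positions before scoring. A minimal sketch of that semantics (the library's actual implementation may differ):

```python
# Sketch of a "mean over the last k tokens" aggregation for per-token
# SAE feature activations. Semantics assumed, not taken from the library.

def aggregate_mean_last_k(per_token_acts, k=32):
    """per_token_acts: list of floats, one activation per token position."""
    window = per_token_acts[-k:]          # last k tokens (or all, if shorter)
    return sum(window) / len(window)

acts = [0.0] * 100 + [1.0] * 32           # feature fires only on the last 32 tokens
print(aggregate_mean_last_k(acts))        # 1.0
```

Windowed aggregation like this focuses the reward on the end of the rollout, where a reasoning model typically commits to its answer.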

Why this could work

Mainstream post-training techniques for reasoning reward either (a) the final answer or (b) a human-labeled intermediate step. Both have brittle failure modes:

  • Outcome reward gives sparse signal, doesn't distinguish lucky guesses from real reasoning.
  • Process reward models get hacked within a few thousand steps — DeepSeek-R1 explicitly abandoned them.
  • LLM-as-judge is adversarially fragile (arxiv:2507.08794 — "One Token to Fool LLM-as-a-Judge").

Meanwhile, mechanistic interpretability research has shown that specific SAE features reliably light up during specific cognitive operations:

  • fact retrieval — arxiv:2408.05147 (Gemma Scope)
  • confidence vs hedging — arxiv:2411.11296 (Microsoft refusal steering)
  • chain-of-reasoning — well-documented in Anthropic's Claude 3 Sonnet interpretability work

If we reward the internal pattern instead of the output token, we're reaching a different layer of the stack — one that's harder to game at the surface, and that lines up more directly with what we actually want the model to learn.

What makes this different from SARM / CRL / SparseRM

There are several excellent papers using SAE features around reward modeling:

| Method | What it does | What mechreward adds |
|---|---|---|
| SARM (AAAI 26) | SAE features → linear head → reward model, used in offline RLHF | Online GRPO use; multi-objective; composability with outcome verifier |
| SparseRM | Preference modeling via frequency-diff features | Reward is trajectory-level, not pairwise |
| CRL | Token-level feature amplification via RL | Reward is feature activation, not action selection |
| YaPO | SAE-sparse steering vectors | We don't modify inference-time activations |
| Wilhelm et al. | SAE features detect reward hacking | We use the same probes during training to prevent it |

The novel contribution is the combination: online GRPO + SAE feature reward + anti-hacking dual verification + composability with standard outcome verifiers, shipped as a reusable plug-in library. To our knowledge, no prior work packages all of these together.

See RESEARCH.md for the full positioning and verified prior-art audit.

Anti-Goodhart is built in

The central risk of any reward signal is Goodhart's law: the model learns to maximize the measure without doing the underlying work. Feature reward is especially vulnerable because SAE features are effectively linear probes, and linear probes are trivially gameable in the limit.

mechreward addresses this with dual verification:

from mechreward.hacking import DualVerifier, AdversarialSuite

# A second, independent signal (a linear probe trained on real examples
# of the behavior) checks whether the feature activation is "honest".
dual = DualVerifier(
    feature_reward=reward,
    independent_probe=mr.load_probe("gemma-2-9b/fact_retrieval_probe"),
    disagreement_threshold=0.3,  # if they disagree >30%, downweight
)

# And an adversarial red-team suite flags suspicious rollouts during training.
detector = AdversarialSuite.from_preset("standard")

Each GRPO step runs the detector in parallel with the main reward computation. If it fires, the affected rollouts are downweighted or dropped. See src/mechreward/hacking/ for the full framework.
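The downweighting rule described above can be sketched as follows; the normalization, threshold semantics, and penalty factor are my assumptions for illustration, not the library's exact behavior:

```python
# Sketch of dual-verification downweighting: if the SAE feature signal and
# an independent probe disagree too much, the reward is treated as suspect.
# Scores are assumed normalized to [0, 1]; all constants are illustrative.

def downweight_on_disagreement(feature_score, probe_score, reward,
                               threshold=0.3, penalty=0.5):
    disagreement = abs(feature_score - probe_score)
    if disagreement > threshold:
        return reward * penalty   # suspicious rollout: scale the reward down
    return reward                 # signals agree: pass the reward through

print(downweight_on_disagreement(0.9, 0.2, reward=1.0))  # disagree by 0.7 -> 0.5
print(downweight_on_disagreement(0.9, 0.8, reward=1.0))  # agree          -> 1.0
```

The point of the second probe is that it is trained independently of the SAE, so a policy that games one signal is unlikely to game both the same way.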

Supported models

| Model | SAE source | Status |
|---|---|---|
| Gemma-2-9B | Gemma Scope (Google DeepMind, all layers) | ✅ Primary target |
| Gemma-2-2B | Gemma Scope | ✅ Quick experiments |
| Gemma-2-27B | Gemma Scope (selected layers) | ✅ Supported |
| Llama-3.1-8B-Base | Llama Scope | ✅ Supported |
| Llama-3.1-8B-Instruct | Goodfire SAE L19 | ✅ Supported |
| Llama-3.3-70B-Instruct | Goodfire SAE L50 | ⚠️ Experimental (compute-heavy) |
| Qwen, Mistral, DeepSeek | no public SAE | ❌ Needs custom training |

To add a new model, see docs/training_new_sae.md for the sae_lens-based training recipe.

Repository layout

mechreward/
├── src/mechreward/
│   ├── sae/            # SAE loading, caching, batched encoding
│   ├── features/       # Feature catalogs, Neuronpedia client, auto-interp
│   ├── reward/         # FeatureReward core, aggregation, composition
│   ├── hacking/        # Dual verification, adversarial, regularization
│   ├── probes/         # Linear probe baseline + training utilities
│   ├── rollout/        # HF and vLLM integration with hidden-state capture
│   └── integrations/   # TRL, OpenRLHF, verl adapters
├── catalogs/           # Pre-validated feature packs (JSON)
├── experiments/        # The 7 reference experiments from the research plan
├── benchmarks/         # Evaluation harnesses
└── tests/              # Unit + integration tests

The 7 reference experiments

experiments/ contains a full research pipeline:

  1. 01_baseline_outcome_only.py — outcome-reward GRPO baseline on GSM8K+MATH
  2. 02_mechreward_only.py — the high-risk experiment: mechreward alone, no outcome reward
  3. 03_hybrid_outcome_plus_mech.py — the commercially relevant combination
  4. 04_sarm_reproduction.py — reproduces Liu et al. 2508.08746 for comparison
  5. 05_crl_reproduction.py — reproduces Cho/Wu/Koshiyama 2602.10437
  6. 06_adversarial_hacking_suite.py — red-team suite + detection
  7. 07_capability_preservation.py — MMLU/HellaSwag pre/post RL

Run any of them after install:

python experiments/03_hybrid_outcome_plus_mech.py --config configs/hybrid.yaml

How it talks to TRL

The tricky part of integrating feature rewards with a GRPO trainer is that the standard reward_funcs API only gets strings and token IDs — not hidden states. mechreward solves this by providing a TRL-compatible wrapper that registers a forward hook on the policy and extracts the residual stream at the target layer during the reward computation:

from mechreward.integrations.trl_grpo import MechRewardGRPOTrainer

trainer = MechRewardGRPOTrainer(
    model="google/gemma-2-9b",
    reward_funcs=[feature_reward, outcome_reward],
    ...,
)

MechRewardGRPOTrainer wraps trl.GRPOTrainer and adds:

  • Forward-hook registration on the SAE layer
  • Residual-stream capture during rollout
  • SAE encoding of hidden states
  • Feature-reward computation from activations
  • Hacking detection on the side

The rest of GRPO (policy gradient, KL, advantage computation) is unchanged.
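The capture pattern described above is the standard PyTorch forward-hook idiom. A self-contained sketch with a stand-in model (the layer, model, and dict are placeholders, not mechreward's real integration):

```python
import torch
import torch.nn as nn

# Minimal sketch of hidden-state capture: register a forward hook on a
# target layer and stash its output during the forward pass. The tiny
# Sequential model here stands in for the policy's transformer stack.

captured = {}

def save_hidden(module, inputs, output):
    captured["resid"] = output.detach()   # stand-in for the residual stream

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
target_layer = model[0]                   # pretend this is the SAE's layer
handle = target_layer.register_forward_hook(save_hidden)

x = torch.randn(1, 8)
_ = model(x)                              # hook fires during this call
handle.remove()                           # always detach hooks when done

# `captured["resid"]` is what would then be SAE-encoded and scored.
print(captured["resid"].shape)            # torch.Size([1, 8])
```

Removing the hook after capture matters in a training loop: a leaked hook would keep references to activations and silently grow memory use across GRPO steps.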

Testing

pip install "mechreward[dev]"
pytest

Integration tests require small SAEs and default to Gemma-2-2B to stay within laptop-scale compute.

Research context

This library exists because of a specific empirical observation: fine-tuning Qwen3.5-9B on ProcessFlow v1.7 (108k synthetic reasoning samples) gave a 93% loss reduction but zero PFE-Eval improvement. The model learned the format, not the skill. Neither Full FT nor LoRA moved the Judge delta by more than 0.005.

The hypothesis: we need reward signals that point at the cognitive circuits we want to strengthen, not at the output distribution. Mech interp gives us a handle on those circuits. This library is the infrastructure to test that hypothesis.

If it doesn't work, that's also a publishable result — a systematic negative result on "feature rewards fail to transfer to reasoning" is a contribution.

Contributing

This is alpha software. Issues and PRs welcome, but expect rapid breakage. See CONTRIBUTING.md.

Citation

If you use mechreward in research, please cite:

@software{mechreward2026,
  author = {Vicentino, Caio},
  title = {mechreward: Mechanistic interpretability as reward signal for RL},
  year = {2026},
  url = {https://github.com/caiovicentino/mechreward}
}

License

Apache 2.0. See LICENSE.
