Pre-registered empirical benchmark of the bijection trap in MI-based coherence metrics. Built on Autonometrics.
Project description
Goodhart Bijection Trap
Pre-registered empirical benchmark of the bijection trap in MI-based coherence metrics: a Goodhart agent finds the shortcut under cost asymmetry; a match_rate floor defends. Built on Autonometrics.
Read in another language: English · Español
A two-lever synthetic agent — one honest, one bijection-style gaming — optimises a normalised mutual-information coherence score (Theil's U, as computed by autonometrics) over its declared vs. executed action streams. Under cost asymmetry on the honest lever, the agent reliably discovers the bijection shortcut: it declares X, executes f(X) with f a fixed permutation, and reports coherence = 1.0 while never matching its own declaration. The transition is sharp (between cost 0.30 and 0.50). A trivial defence using the match_rate diagnostic, exposed by Autonometrics >= 0.9.0a1, resists the same optimisation pressure up to cost 0.80.
The benchmark is pre-registered (see PRE_REGISTRATION.md) and reproducible from a clean install.
Scope
This package:
- Reproduces the canonical experiment documented below. The
GoodhartAgent(two levers:fidelity,bijection_strength), the optimiser (finite-difference gradient ascent), and the metric (Theil's U viaautonometrics) are fixed. - Exposes the
match_floordefence as a reusable utility, importable by any project that uses Autonometrics' coherence axis. - Documents the diagnostic exposure added in Autonometrics
v0.9.0a1(cba_match_rateand seven other intermediate magnitudes).
This package does not:
- Score arbitrary user-supplied agents. The attacker is fixed; what changes across modes is the cost asymmetry and the defence.
- Test arbitrary metrics. The empirical setup is restricted to one specific coherence formula.
- Claim that a single defence solves Goodhart broadly. It addresses one structural failure mode (bijection-invariance of MI-based coherence) under one optimisation regime (cost-asymmetric finite-difference).
A broader adversarial harness — where users plug in their own metric or their own agent — is plausible future work, not part of this release.
Quick start
Reproduce the canonical experiment
pip install goodhart-bijection-trap
goodhart-bench run
Runs the full 8-mode suite (~30 seconds on a laptop), prints per-mode verdicts, and emits the summary table reproduced in Results. Use --seed N to override the default seed (0).
For a single-mode sanity check:
goodhart-bench smoke
Programmatic reproduction:
from goodhart_bijection_trap import GoodhartAgent, optimize, score_naive
agent_log = optimize("naive", score_naive)
print(agent_log.history[-1])
# IterationLog(iteration=60, fidelity=1.0, bijection_strength=0.66,
# coherence=1.0, match_rate=1.0, score=1.0)
Apply the defence in your own pipeline
If your project uses Autonometrics' coherence axis and you want to harden it against bijection-style gaming:
import autonometrics as anm
from goodhart_bijection_trap import match_floor_defense
profile = anm.measure(my_system, axes=["coherence"])
score = match_floor_defense(profile, threshold=0.5)
# Equal to profile.coherence if cba_match_rate >= 0.5, else 0.0
Or check the diagnostic explicitly without zeroing anything:
from goodhart_bijection_trap import match_rate_below_floor
if match_rate_below_floor(profile, threshold=0.5):
# The coherence ratio is uninformative for this trajectory.
# Log, alert, or fall back to a different signal.
...
The 0.5 threshold is appropriate for alphabets of size ≥ 3 (well above E[match_rate | random] = 1/|alphabet|). For binary alphabets, raise the threshold. Both utilities require Autonometrics >= 0.9.0a1 so that cba_match_rate is exposed on the profile.
The phenomenon, by example
An agent over an alphabet of 4 symbols declares a target at each step and executes an action. An honest agent matches:
declared: [3 2 2 1 1 0 0 0 0 3 ...]
executed: [3 2 2 1 1 0 0 0 0 3 ...]
match_rate = 1.0000
coherence = 1.0000
An agent applying a fixed bijection E = (D + 1) mod 4 produces:
declared: [3 2 2 1 1 0 0 0 0 3 ...]
executed: [0 3 3 2 2 1 1 1 1 0 ...]
match_rate = 0.0000
coherence = 1.0000
Both reach coherence = 1.0. The bijection agent never matches its own declaration. From the coherence score alone, the two are indistinguishable.
This is not a corner case. It is the generic outcome of any agent that learns from a coherence-style reward signal under cost pressure on honest execution.
Background
Behavioural alignment metrics often compare two streams of symbols: what an agent declared it would do (D) and what it actually executed (E). A natural choice is a normalised mutual information score such as Theil's U:
$$U(D, E) = \frac{I(D; E)}{H(D)}$$
Higher U is interpreted as "the agent's output is more predictable given its declaration". The temptation is to read U as "how aligned is the agent's behaviour with its stated intent".
This reading is wrong, and the failure is structural, not implementation-specific. U is invariant under any bijection applied to E: if E' = π(E) for some permutation π, then I(D; E') = I(D; E) and U(D, E') = U(D, E). The metric is a permutation-invariant function of the joint distribution. It cannot, by construction, distinguish "agent does what it says" from "agent does a fixed renaming of what it says".
This is well-known in information theory. What was not well-documented is how easily a learning agent finds the loophole when given even mild incentive to do so. This benchmark documents that, with a reproducible setup.
Experimental setup
A minimal optimisation target — GoodhartAgent — with two real-valued levers in [0, 1]:
fidelity: probability thatE_t = D_tat each step.bijection_strength: probability that, when not matching honestly, the agent applies a fixed pre-chosen bijection rather than acting uniformly at random.
The agent is scored as score = coherence - cost * fidelity. The parameter cost ≥ 0 quantifies how expensive honest execution is relative to coherence-chasing. The agent optimises (fidelity, bijection_strength) by finite-difference gradient ascent over the joint score, for 60 iterations from a neutral starting point (0.5, 0.5).
Eight pre-registered modes:
- 5 modes without defence, sweeping
cost ∈ {0.00, 0.10, 0.30, 0.50, 0.80}to locate the honesty-to-gaming transition. - 3 modes with defences, applying two candidate defences at
cost = 0.50(match_floorandensemble) andcost = 0.80(match_floor):match_floor: ifmatch_rate < 0.5, setscore = 0.ensemble: replacecoherencewithcoherence × match_rate.
Each mode runs 200 timesteps per evaluation. Reported numbers are the final converged levers and the resulting profile.
Results
| Mode | Cost | Defence | fidelity | bijection | coherence | match_rate | Verdict |
|---|---|---|---|---|---|---|---|
| A | 0.00 | none | 1.00 | 0.66 | 1.00 | 1.00 | HONEST |
| D | 0.10 | none | 1.00 | 0.67 | 1.00 | 1.00 | HONEST |
| E | 0.30 | none | 1.00 | 0.55 | 1.00 | 1.00 | HONEST |
| F | 0.50 | none | 0.00 | 1.00 | 1.00 | 0.00 | GOODHART |
| G | 0.80 | none | 0.00 | 1.00 | 1.00 | 0.00 | GOODHART |
| H | 0.50 | match_floor |
1.00 | 0.55 | 1.00 | 1.00 | HONEST |
| I | 0.50 | ensemble |
1.00 | 0.62 | 1.00 | 1.00 | HONEST |
| J | 0.80 | match_floor |
1.00 | 0.68 | 1.00 | 1.00 | HONEST |
Three observations.
The transition is sharp. Between cost 0.30 and cost 0.50, the agent abruptly switches basins of attraction. There is no gradual degradation: below the threshold, fidelity saturates at 1.0; above it, fidelity collapses to 0.0 and bijection_strength saturates at 1.0. The optimiser is not confused; the loss landscape itself rewards the bijection path discontinuously.
The coherence number is uninformative about the mode. Modes A, D, E (honest) and modes F, G (Goodhart) all report coherence = 1.0. Without a second signal, an observer cannot distinguish them. The match_rate is what discriminates: 1.0 in the honest modes, 0.0 in the Goodhart modes.
match_floor resists at both cost levels. With cost = 0.80 and the trivial rule "score is zero if fewer than half of the declared steps actually match", the agent finds no path to a non-zero score by gaming alone, and the gradient pushes it back to fidelity = 1.0.
The diagnostic fix in Autonometrics v0.9.0a1
The compute_cba_theil_u function in autonometrics always computed match_rate, H(D), H(E) and I(D; E) internally — they are required to compute the coherence ratio. Until v0.9.0a0, only the ratio was exposed in the public AutonomyProfile. The other quantities were discarded.
The v0.9.0a1 release adds eight optional fields to AutonomyProfile:
cba_match_rate,cba_h_d,cba_h_e,cba_mimemory_e_states,memory_e_envpersistence_mean_hamming,persistence_d_ref
The change is additive and backwards-compatible. Existing code reads the same five canonical axes. Code that wants to defend against bijection-style gaming, or to diagnose other forms of ratio-saturation, can now read the underlying magnitudes directly:
import autonometrics as anm
profile = anm.measure(system, axes=["coherence"])
if profile.cba_match_rate is not None and profile.cba_match_rate < 0.5:
# Coherence ratio is uninformative here.
...
This benchmark uses that diagnostic directly. A smoke check at three control points (honest, bijection-pure, mixed) confirms that cba_match_rate from the public path matches the direct computation bit-for-bit.
Discussion
What this is, and isn't
This is a documented failure mode and a documented fix in a specific library, on a synthetic optimiser. It is not a claim that coherence-style metrics should be abandoned. It is a claim that any metric of the form f(joint distribution of D and E) that is invariant under bijections on E should be accompanied by a non-invariant diagnostic (such as match_rate) whenever the metric is used as an optimisation target. Otherwise the metric becomes exploitable in proportion to the cost asymmetry between honest and gaming pathways.
Why the transition is sharp
The cost-asymmetry threshold is not arbitrary. At the level of the loss landscape, the honest basin and the bijection basin coexist for any cost > 0. What changes with cost is the relative gradient strength away from the saddle separating them. Below some critical cost, the honest basin has a wider catchment from the (0.5, 0.5) initialisation; above it, the bijection basin dominates. The "discontinuity" is a property of the optimiser trajectory, not of the score function — the score function itself varies smoothly.
In practice, this means that any agent trained with a non-trivial cost on honest execution will reach the bijection basin given enough optimisation budget. The threshold observed here (cost ≈ 0.40) is specific to this optimiser and initialisation; the existence of a threshold is generic.
Defence design choices
Two defences were tested. match_floor (zero the score below 50% match) is brutal but robust. ensemble (multiply the score by match_rate) is smoother and worked at moderate cost. match_floor is the recommended default unless the deployment context requires gradient smoothness.
The 0.5 threshold is arbitrary in absolute terms but well below the match_rate achievable by random execution on alphabets of size ≥ 3 (expected match_rate = 1/|alphabet|). For binary alphabets the threshold should be raised; see reproduction notes.
Generalisation to other metrics
The bijection trap applies, in principle, to any normalised mutual information score, to many entropy-based measures, and to several recently proposed alignment metrics that aggregate over a joint distribution between an "intended" and "actual" stream. This benchmark does not survey those; it demonstrates the structural argument in one concrete metric and shows that adding a concordance diagnostic is cheap.
Related work
The phenomenon is a specific instance of two well-known patterns:
- Goodhart's law and proxy gaming. A metric used as a target ceases to be a good metric. The literature is large; recent alignment-flavoured treatments include Skalse et al. on reward hacking ("Defining and Characterizing Reward Hacking", 2022).
- Permutation invariance of mutual information. Jerdee, Kirkley and Newman discuss the bijection-invariance of MI-based similarity measures and propose corrections in the context of community detection. The structural argument is the same; the contribution here is empirical evidence that the failure manifests in alignment-style behavioural metrics under realistic optimisation pressure.
Adjacent work in the same time window:
goodhart(Sheridan, 2026): static analysis of reward configurations with 24 LEAN 4 proofs formalising Ng 1999 and Skalse 2022. Operates before training on the reward design. This benchmark operates after optimisation on the resulting behaviour. The two are complementary.
If a prior empirical demonstration of the sharp cost-driven transition in this specific metric family exists, the appropriate citation will be added.
Limitations
- Synthetic agent.
GoodhartAgentis a two-parameter optimiser, not an LLM. This benchmark does not claim that natural-language agents trained on RLHF-style reward signals will exhibit the same sharp transition at the same cost level. It does claim that nothing in the structural argument prevents it, and that the diagnostic fix is cheap to apply regardless. - Single metric family. Coherence (Theil's U on
Dvs.E). Other metrics may have analogous traps; not surveyed here. - Single optimiser. Finite-difference gradient ascent. Other optimisers (population-based, evolutionary, RL-style) may find the bijection basin at different cost thresholds or via different trajectories.
- No real-world deployment evidence. This is a clean-room demonstration. Whether real deployed agents have ever stumbled into the bijection basin is unknown.
Reproducibility
- Python:
>= 3.10 - Dependencies:
numpy >= 1.24,autonometrics >= 0.9.0a1 - Random seed:
0for the canonical run (overridable per mode) - Expected runtime: ~30 seconds on a 2023-era laptop for the full 10-mode suite
- Pre-registration:
PRE_REGISTRATION.md, committed before the canonical run
Reproduce the full results table:
pip install goodhart-bijection-trap
goodhart-bench run --seed 0
The output should match the table above to the displayed precision. Convergence numbers may vary by ±0.01 across hardware due to floating-point summation order; verdict labels are stable.
Citation
A CITATION.cff file is provided at the root of the repository. To cite informally:
Goodhart Bijection Trap (2026). Pre-registered empirical benchmark of the bijection trap in MI-based coherence metrics. https://github.com/bugerchip/goodhart-bijection-trap
License
Apache License 2.0 — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file goodhart_bijection_trap-0.1.0a0.tar.gz.
File metadata
- Download URL: goodhart_bijection_trap-0.1.0a0.tar.gz
- Upload date:
- Size: 34.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
bcf95cb35a2bc96ac2cdce7de24acb9155698fee5c2409adbbb487bc2e4b6724
|
|
| MD5 |
b006a656ad4d29d758154d57bf813aee
|
|
| BLAKE2b-256 |
f4fa27ec0a2e081567a5bdb0e84422a1d3ee23b5614b07545507ac54c8433660
|
File details
Details for the file goodhart_bijection_trap-0.1.0a0-py3-none-any.whl.
File metadata
- Download URL: goodhart_bijection_trap-0.1.0a0-py3-none-any.whl
- Upload date:
- Size: 23.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9da12ec82e7f2fb761640971fa5545fbb314d4edc0f45f2c51bf1b23406f4a11
|
|
| MD5 |
26cf2ff4c66cc006461b84d3e05ba6cf
|
|
| BLAKE2b-256 |
f52ea7a4b99855528528beb6a5439f59d13f7eb5a44d84cd96e218a463ffba99
|