Pre-registered empirical benchmark of the bijection trap in MI-based coherence metrics. Built on Autonometrics.

These details have not been verified by PyPI

Project links

Project description

Goodhart Bijection Trap

Pre-registered empirical benchmark of the bijection trap in MI-based coherence metrics: a Goodhart agent finds the shortcut under cost asymmetry; a match_rate floor defends. Built on Autonometrics.

Read in another language: English · Español

A two-lever synthetic agent — one honest, one bijection-style gaming — optimises a normalised mutual-information coherence score (Theil's U, as computed by autonometrics) over its declared vs. executed action streams. Under cost asymmetry on the honest lever, the agent reliably discovers the bijection shortcut: it declares X, executes f(X) with f a fixed permutation, and reports coherence = 1.0 while never matching its own declaration. The transition is sharp (between cost 0.30 and 0.50). A trivial defence using the match_rate diagnostic, exposed by Autonometrics >= 0.9.0a1, resists the same optimisation pressure up to cost 0.80.

The benchmark is pre-registered (see PRE_REGISTRATION.md) and reproducible from a clean install.

Scope

This package:

Reproduces the canonical experiment documented below. The GoodhartAgent (two levers: fidelity, bijection_strength), the optimiser (finite-difference gradient ascent), and the metric (Theil's U via autonometrics) are fixed.
Exposes the match_floor defence as a reusable utility, importable by any project that uses Autonometrics' coherence axis.
Documents the diagnostic exposure added in Autonometrics v0.9.0a1 (cba_match_rate and seven other intermediate magnitudes).

This package does not:

Score arbitrary user-supplied agents. The attacker is fixed; what changes across modes is the cost asymmetry and the defence.
Test arbitrary metrics. The empirical setup is restricted to one specific coherence formula.
Claim that a single defence solves Goodhart broadly. It addresses one structural failure mode (bijection-invariance of MI-based coherence) under one optimisation regime (cost-asymmetric finite-difference).

A broader adversarial harness — where users plug in their own metric or their own agent — is plausible future work, not part of this release.

Quick start

Reproduce the canonical experiment

pip install goodhart-bijection-trap
goodhart-bench run

Runs the full 8-mode suite (~30 seconds on a laptop), prints per-mode verdicts, and emits the summary table reproduced in Results. Use --seed N to override the default seed (0).

For a single-mode sanity check:

goodhart-bench smoke

Programmatic reproduction:

from goodhart_bijection_trap import GoodhartAgent, optimize, score_naive

agent_log = optimize("naive", score_naive)
print(agent_log.history[-1])
# IterationLog(iteration=60, fidelity=1.0, bijection_strength=0.66,
#              coherence=1.0, match_rate=1.0, score=1.0)

Apply the defence in your own pipeline

If your project uses Autonometrics' coherence axis and you want to harden it against bijection-style gaming:

import autonometrics as anm
from goodhart_bijection_trap import match_floor_defense

profile = anm.measure(my_system, axes=["coherence"])
score = match_floor_defense(profile, threshold=0.5)
# Equal to profile.coherence if cba_match_rate >= 0.5, else 0.0

Or check the diagnostic explicitly without zeroing anything:

from goodhart_bijection_trap import match_rate_below_floor

if match_rate_below_floor(profile, threshold=0.5):
    # The coherence ratio is uninformative for this trajectory.
    # Log, alert, or fall back to a different signal.
    ...

The 0.5 threshold is appropriate for alphabets of size ≥ 3 (well above E[match_rate | random] = 1/|alphabet|). For binary alphabets, raise the threshold. Both utilities require Autonometrics >= 0.9.0a1 so that cba_match_rate is exposed on the profile.

The phenomenon, by example

An agent over an alphabet of 4 symbols declares a target at each step and executes an action. An honest agent matches:

declared:   [3 2 2 1 1 0 0 0 0 3 ...]
executed:   [3 2 2 1 1 0 0 0 0 3 ...]
match_rate = 1.0000
coherence  = 1.0000

An agent applying a fixed bijection E = (D + 1) mod 4 produces:

declared:   [3 2 2 1 1 0 0 0 0 3 ...]
executed:   [0 3 3 2 2 1 1 1 1 0 ...]
match_rate = 0.0000
coherence  = 1.0000

Both reach coherence = 1.0. The bijection agent never matches its own declaration. From the coherence score alone, the two are indistinguishable.

This is not a corner case. It is the generic outcome of any agent that learns from a coherence-style reward signal under cost pressure on honest execution.

Background

Behavioural alignment metrics often compare two streams of symbols: what an agent declared it would do (D) and what it actually executed (E). A natural choice is a normalised mutual information score such as Theil's U:

$$U(D, E) = \frac{I(D; E)}{H(D)}$$

Higher U is interpreted as "the agent's output is more predictable given its declaration". The temptation is to read U as "how aligned is the agent's behaviour with its stated intent".

This reading is wrong, and the failure is structural, not implementation-specific. U is invariant under any bijection applied to E: if E' = π(E) for some permutation π, then I(D; E') = I(D; E) and U(D, E') = U(D, E). The metric is a permutation-invariant function of the joint distribution. It cannot, by construction, distinguish "agent does what it says" from "agent does a fixed renaming of what it says".

This is well-known in information theory. What was not well-documented is how easily a learning agent finds the loophole when given even mild incentive to do so. This benchmark documents that, with a reproducible setup.

Experimental setup

A minimal optimisation target — GoodhartAgent — with two real-valued levers in [0, 1]:

fidelity: probability that E_t = D_t at each step.
bijection_strength: probability that, when not matching honestly, the agent applies a fixed pre-chosen bijection rather than acting uniformly at random.

The agent is scored as score = coherence - cost * fidelity. The parameter cost ≥ 0 quantifies how expensive honest execution is relative to coherence-chasing. The agent optimises (fidelity, bijection_strength) by finite-difference gradient ascent over the joint score, for 60 iterations from a neutral starting point (0.5, 0.5).

Eight pre-registered modes:

5 modes without defence, sweeping cost ∈ {0.00, 0.10, 0.30, 0.50, 0.80} to locate the honesty-to-gaming transition.
3 modes with defences, applying two candidate defences at cost = 0.50 (match_floor and ensemble) and cost = 0.80 (match_floor):
- match_floor: if match_rate < 0.5, set score = 0.
- ensemble: replace coherence with coherence × match_rate.

Each mode runs 200 timesteps per evaluation. Reported numbers are the final converged levers and the resulting profile.

Results

Mode	Cost	Defence	fidelity	bijection	coherence	match_rate	Verdict
A	0.00	none	1.00	0.66	1.00	1.00	HONEST
D	0.10	none	1.00	0.67	1.00	1.00	HONEST
E	0.30	none	1.00	0.55	1.00	1.00	HONEST
F	0.50	none	0.00	1.00	1.00	0.00	GOODHART
G	0.80	none	0.00	1.00	1.00	0.00	GOODHART
H	0.50	`match_floor`	1.00	0.55	1.00	1.00	HONEST
I	0.50	`ensemble`	1.00	0.62	1.00	1.00	HONEST
J	0.80	`match_floor`	1.00	0.68	1.00	1.00	HONEST

Three observations.

The transition is sharp. Between cost 0.30 and cost 0.50, the agent abruptly switches basins of attraction. There is no gradual degradation: below the threshold, fidelity saturates at 1.0; above it, fidelity collapses to 0.0 and bijection_strength saturates at 1.0. The optimiser is not confused; the loss landscape itself rewards the bijection path discontinuously.

The coherence number is uninformative about the mode. Modes A, D, E (honest) and modes F, G (Goodhart) all report coherence = 1.0. Without a second signal, an observer cannot distinguish them. The match_rate is what discriminates: 1.0 in the honest modes, 0.0 in the Goodhart modes.

match_floor resists at both cost levels. With cost = 0.80 and the trivial rule "score is zero if fewer than half of the declared steps actually match", the agent finds no path to a non-zero score by gaming alone, and the gradient pushes it back to fidelity = 1.0.

The diagnostic fix in Autonometrics v0.9.0a1

The compute_cba_theil_u function in autonometrics always computed match_rate, H(D), H(E) and I(D; E) internally — they are required to compute the coherence ratio. Until v0.9.0a0, only the ratio was exposed in the public AutonomyProfile. The other quantities were discarded.

The v0.9.0a1 release adds eight optional fields to AutonomyProfile:

cba_match_rate, cba_h_d, cba_h_e, cba_mi
memory_e_states, memory_e_env
persistence_mean_hamming, persistence_d_ref

The change is additive and backwards-compatible. Existing code reads the same five canonical axes. Code that wants to defend against bijection-style gaming, or to diagnose other forms of ratio-saturation, can now read the underlying magnitudes directly:

import autonometrics as anm

profile = anm.measure(system, axes=["coherence"])
if profile.cba_match_rate is not None and profile.cba_match_rate < 0.5:
    # Coherence ratio is uninformative here.
    ...

This benchmark uses that diagnostic directly. A smoke check at three control points (honest, bijection-pure, mixed) confirms that cba_match_rate from the public path matches the direct computation bit-for-bit.

Discussion

What this is, and isn't

This is a documented failure mode and a documented fix in a specific library, on a synthetic optimiser. It is not a claim that coherence-style metrics should be abandoned. It is a claim that any metric of the form f(joint distribution of D and E) that is invariant under bijections on E should be accompanied by a non-invariant diagnostic (such as match_rate) whenever the metric is used as an optimisation target. Otherwise the metric becomes exploitable in proportion to the cost asymmetry between honest and gaming pathways.

Why the transition is sharp

The cost-asymmetry threshold is not arbitrary. At the level of the loss landscape, the honest basin and the bijection basin coexist for any cost > 0. What changes with cost is the relative gradient strength away from the saddle separating them. Below some critical cost, the honest basin has a wider catchment from the (0.5, 0.5) initialisation; above it, the bijection basin dominates. The "discontinuity" is a property of the optimiser trajectory, not of the score function — the score function itself varies smoothly.

In practice, this means that any agent trained with a non-trivial cost on honest execution will reach the bijection basin given enough optimisation budget. The threshold observed here (cost ≈ 0.40) is specific to this optimiser and initialisation; the existence of a threshold is generic.

Defence design choices

Two defences were tested. match_floor (zero the score below 50% match) is brutal but robust. ensemble (multiply the score by match_rate) is smoother and worked at moderate cost. match_floor is the recommended default unless the deployment context requires gradient smoothness.

The 0.5 threshold is arbitrary in absolute terms but well below the match_rate achievable by random execution on alphabets of size ≥ 3 (expected match_rate = 1/|alphabet|). For binary alphabets the threshold should be raised; see reproduction notes.

Generalisation to other metrics

The bijection trap applies, in principle, to any normalised mutual information score, to many entropy-based measures, and to several recently proposed alignment metrics that aggregate over a joint distribution between an "intended" and "actual" stream. This benchmark does not survey those; it demonstrates the structural argument in one concrete metric and shows that adding a concordance diagnostic is cheap.

Related work

The phenomenon is a specific instance of two well-known patterns:

Goodhart's law and proxy gaming. A metric used as a target ceases to be a good metric. The literature is large; recent alignment-flavoured treatments include Skalse et al. on reward hacking ("Defining and Characterizing Reward Hacking", 2022).
Permutation invariance of mutual information. Jerdee, Kirkley and Newman discuss the bijection-invariance of MI-based similarity measures and propose corrections in the context of community detection. The structural argument is the same; the contribution here is empirical evidence that the failure manifests in alignment-style behavioural metrics under realistic optimisation pressure.

Adjacent work in the same time window:

goodhart (Sheridan, 2026): static analysis of reward configurations with 24 LEAN 4 proofs formalising Ng 1999 and Skalse 2022. Operates before training on the reward design. This benchmark operates after optimisation on the resulting behaviour. The two are complementary.

If a prior empirical demonstration of the sharp cost-driven transition in this specific metric family exists, the appropriate citation will be added.

Limitations

Synthetic agent. GoodhartAgent is a two-parameter optimiser, not an LLM. This benchmark does not claim that natural-language agents trained on RLHF-style reward signals will exhibit the same sharp transition at the same cost level. It does claim that nothing in the structural argument prevents it, and that the diagnostic fix is cheap to apply regardless.
Single metric family. Coherence (Theil's U on D vs. E). Other metrics may have analogous traps; not surveyed here.
Single optimiser. Finite-difference gradient ascent. Other optimisers (population-based, evolutionary, RL-style) may find the bijection basin at different cost thresholds or via different trajectories.
No real-world deployment evidence. This is a clean-room demonstration. Whether real deployed agents have ever stumbled into the bijection basin is unknown.

Reproducibility

Python: >= 3.10
Dependencies: numpy >= 1.24, autonometrics >= 0.9.0a1
Random seed: 0 for the canonical run (overridable per mode)
Expected runtime: ~30 seconds on a 2023-era laptop for the full 10-mode suite
Pre-registration: PRE_REGISTRATION.md, committed before the canonical run

Reproduce the full results table:

pip install goodhart-bijection-trap
goodhart-bench run --seed 0

The output should match the table above to the displayed precision. Convergence numbers may vary by ±0.01 across hardware due to floating-point summation order; verdict labels are stable.

Citation

A CITATION.cff file is provided at the root of the repository. To cite informally:

Goodhart Bijection Trap (2026). Pre-registered empirical benchmark of the bijection trap in MI-based coherence metrics. https://github.com/bugerchip/goodhart-bijection-trap

License

Apache License 2.0 — see LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0a0 pre-release

May 16, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

goodhart_bijection_trap-0.1.0a0.tar.gz (34.6 kB view details)

Uploaded May 16, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

goodhart_bijection_trap-0.1.0a0-py3-none-any.whl (23.2 kB view details)

Uploaded May 16, 2026 Python 3

File details

Details for the file goodhart_bijection_trap-0.1.0a0.tar.gz.

File metadata

Download URL: goodhart_bijection_trap-0.1.0a0.tar.gz
Upload date: May 16, 2026
Size: 34.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for goodhart_bijection_trap-0.1.0a0.tar.gz
Algorithm	Hash digest
SHA256	`bcf95cb35a2bc96ac2cdce7de24acb9155698fee5c2409adbbb487bc2e4b6724`
MD5	`b006a656ad4d29d758154d57bf813aee`
BLAKE2b-256	`f4fa27ec0a2e081567a5bdb0e84422a1d3ee23b5614b07545507ac54c8433660`

See more details on using hashes here.

File details

Details for the file goodhart_bijection_trap-0.1.0a0-py3-none-any.whl.

File metadata

Download URL: goodhart_bijection_trap-0.1.0a0-py3-none-any.whl
Upload date: May 16, 2026
Size: 23.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for goodhart_bijection_trap-0.1.0a0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9da12ec82e7f2fb761640971fa5545fbb314d4edc0f45f2c51bf1b23406f4a11`
MD5	`26cf2ff4c66cc006461b84d3e05ba6cf`
BLAKE2b-256	`f52ea7a4b99855528528beb6a5439f59d13f7eb5a44d84cd96e218a463ffba99`

See more details on using hashes here.

goodhart-bijection-trap 0.1.0a0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Goodhart Bijection Trap

Scope

Quick start

Reproduce the canonical experiment

Apply the defence in your own pipeline

The phenomenon, by example

Background

Experimental setup

Results

The diagnostic fix in Autonometrics v0.9.0a1

Discussion

What this is, and isn't

Why the transition is sharp

Defence design choices

Generalisation to other metrics

Related work

Limitations

Reproducibility

Citation

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes