Quantify translation indeterminacy between sparse autoencoder feature dictionaries (Quine × Mechanistic Interpretability).

gavagai

“The very fact of the indeterminacy of translation is a finding about meaning, not a failure of method.” — paraphrased after W. V. O. Quine, Ontological Relativity (1968)

gavagai quantifies translation indeterminacy between two Sparse Autoencoder (SAE) feature dictionaries — how many empirically valid feature-to-feature alignments exist, not which single alignment is "the right one". It is a Mechanistic Interpretability tool grounded in Quine's philosophy of language.

🇯🇵 A Japanese version of this description is available in docs/README.ja.md.

Why this exists

Cross-SAE alignment tools (Universal SAE, SPARC, Sparse Crosscoders) ask: what is the correct mapping between two SAEs' features? Quine's gavagai thought experiment suggests this question is structurally underdetermined: the observational data fixes an equivalence class of translations, not a single one. gavagai does not solve that underdetermination — it measures it.

Concretely:

1. Train two SAEs (differing in seed, model checkpoint, or layer) on aligned activations.
2. Run gavagai_score(sae_a, sae_b).
3. Get a number in [0, 1]: 0 means a deterministic alignment exists; 1 means radical indeterminacy (many empirically valid alignments).

The score drops into CI as a regression gate: refuse model pushes whose indeterminacy with the baseline exceeds a threshold.
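To get a feel for the scale, the compression formula listed under How it works, 1 − 1 / (1 + log(n)), can be tabulated for a few matching counts. This is a minimal sketch, not the library's code: `compress` is a hypothetical helper, and the natural logarithm is an assumption, since the base of the log is not stated here.

```python
import math


def compress(n_matchings: int) -> float:
    """Map a matching count n >= 1 to a score via 1 - 1 / (1 + log(n)).

    Hypothetical helper for illustration; natural log is an assumption.
    """
    if n_matchings < 1:
        raise ValueError("expected at least one matching")
    return 1.0 - 1.0 / (1.0 + math.log(n_matchings))


# Tabulate the score for a unique alignment, a few candidates, and the cap.
for n in (1, 2, 10, 1000):
    print(f"n={n:>4}  score={compress(n):.4f}")
```

Under these assumptions a unique alignment scores 0, two candidate alignments already score about 0.41, and the cap of 1000 corresponds to about 0.87. A gate threshold of 0.3 would then pass only the single-alignment case.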

Phase 1.5: Cross-Layer Drift (`gavagai.cross_layer_drift`)

The gavagai.cross_layer_drift module extends the core indeterminacy score to measure how feature representations drift across consecutive transformer layers within the same SAE zoo. Introduced in v0.2.0 and reclassified as Phase 1.5 (architecture design 2026-05-20) because it depends on HookedSAEBundle from Phase 1 (from_pretrained).

from gavagai.cross_layer_drift import cross_layer_drift_report, pairwise_drift_matrix

# bundle is a HookedSAEBundle from gavagai.backends.saelens_adapter.from_pretrained
report = cross_layer_drift_report(bundle)
# report.rows: list of DriftRow(layer_a, layer_b, epsilon_net_distance, cosine_drift)

matrix = pairwise_drift_matrix(bundle)
# matrix.shape == (n_layers, n_layers)

Key functions:

epsilon_net_distance(bundle, layer_a, layer_b) — fraction of features in layer A with no ε-equivalent in layer B.
cosine_drift(bundle, layer_a, layer_b) — mean 1 - max_cosine_sim across features in A.
cross_layer_drift_report(bundle) — runs both metrics for every consecutive layer pair.
pairwise_drift_matrix(bundle) — n_layers × n_layers drift matrix.
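The two metrics can be sketched on raw decoder matrices with NumPy. This is an illustration under stated assumptions, not the library's implementation: the `_sketch` functions are hypothetical, they take matrices of shape (n_features, d_model) rather than a HookedSAEBundle, and "ε-equivalent" is read here as cosine similarity of at least 1 − ε.

```python
import numpy as np


def max_cosine_per_feature(w_a, w_b):
    """For each row of w_a, the best cosine similarity to any row of w_b."""
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)


def epsilon_net_distance_sketch(w_a, w_b, epsilon=0.1):
    """Fraction of A-features with no B-feature within cosine distance epsilon."""
    best = max_cosine_per_feature(w_a, w_b)
    return float(np.mean(best < 1.0 - epsilon))


def cosine_drift_sketch(w_a, w_b):
    """Mean of (1 - max cosine similarity) over the features of A."""
    return float(np.mean(1.0 - max_cosine_per_feature(w_a, w_b)))


rng = np.random.default_rng(0)
layer_a = rng.standard_normal((64, 32))
layer_b = layer_a.copy()
layer_b[:16] = rng.standard_normal((16, 32))  # replace a quarter of the features

print(epsilon_net_distance_sketch(layer_a, layer_a))  # identical layers: no drift
print(epsilon_net_distance_sketch(layer_a, layer_b))  # a quarter of A is unmatched
print(cosine_drift_sketch(layer_a, layer_b))
```

Both metrics are asymmetric: they ask how well layer B covers layer A, so swapping the arguments can change the result.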

Requires pip install "gavagai[saelens]".

Install

pip install gavagai

Optional extras:

pip install "gavagai[saelens]"    # SAELens SAE objects
pip install "gavagai[behavior]"   # downstream-KL behavior equivalence
pip install "gavagai[holism]"     # circuit-tracer integration (v0.2)

Quick start

import numpy as np
from gavagai import gavagai_score

# Decoder matrices: shape (n_features, d_model). gavagai also accepts
# SAELens SAE instances and {"W_dec": ndarray} dicts.
sae_a = np.random.default_rng(0).standard_normal((1024, 768))
sae_b = np.random.default_rng(1).standard_normal((1024, 768))

score = gavagai_score(sae_a, sae_b)
print(f"indeterminacy: {score:.4f}")

# With diagnostics
score, details = gavagai_score(sae_a, sae_b, return_details=True)
print(f"  candidates : {details.n_equivalent_translations}")
print(f"  95% CI     : [{details.ci_low:.4f}, {details.ci_high:.4f}]")
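The inputs above are independent Gaussian matrices, so it helps to know what to expect from them. The NumPy-only check below (a sketch that does not call gavagai; `max_cosine_similarity` is a hypothetical helper) measures the best cosine similarity between any feature of sae_a and any feature of sae_b. The ε of 0.1 is taken from the Equivalence relations example; the library's default is not stated here.

```python
import numpy as np


def max_cosine_similarity(w_a, w_b):
    """Largest cosine similarity between any row of w_a and any row of w_b."""
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    return float((a @ b.T).max())


sae_a = np.random.default_rng(0).standard_normal((1024, 768))
sae_b = np.random.default_rng(1).standard_normal((1024, 768))

best = max_cosine_similarity(sae_a, sae_b)
print(f"best cross-dictionary cosine: {best:.3f}")
print("any pair within cosine distance 0.1:", best >= 0.9)
```

Random directions in 768 dimensions are nearly orthogonal, so no pair comes close to a cosine of 0.9 and the thresholded adjacency is empty. By the rule given under How it works, an empty adjacency maps to the cap, so these toy inputs sit at the radical-indeterminacy end of the scale.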

CI gate (`gavagai-lint`)

The killer app. Drop it into your pre-push hook or GitHub Action:

gavagai-lint \
    --before sae_baseline.npz \
    --after  sae_after_abliteration.npz \
    --threshold 0.3

Exit 0 if the indeterminacy is below threshold (acceptable drift), 1 otherwise. Designed for gating Hugging Face uploads, abliteration patches, and post-train fine-tuning steps where the model's feature semantics may silently re-arrange.
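The exit-code rule can be sketched in a few lines of Python. This is not the gavagai-lint source: `lint_exit_code` is a hypothetical function, the `W_dec` key inside the .npz files is an assumption, and the scorer is passed in as `score_fn` so the sketch runs without gavagai installed.

```python
import os
import tempfile

import numpy as np


def lint_exit_code(before_path, after_path, threshold, score_fn, key="W_dec"):
    """Return 0 when score_fn(before, after) is below threshold, else 1.

    Assumes each .npz file stores its decoder matrix under `key`.
    """
    with np.load(before_path) as before, np.load(after_path) as after:
        score = score_fn(before[key], after[key])
    return 0 if score < threshold else 1


# Demo with a stub scorer: 0.0 for identical matrices, 1.0 otherwise.
def stub_score(a, b):
    return 0.0 if np.allclose(a, b) else 1.0


with tempfile.TemporaryDirectory() as tmp:
    before_file = os.path.join(tmp, "before.npz")
    after_file = os.path.join(tmp, "after.npz")
    w = np.random.default_rng(0).standard_normal((8, 4))
    np.savez(before_file, W_dec=w)
    np.savez(after_file, W_dec=w)
    print(lint_exit_code(before_file, after_file, 0.3, stub_score))
```

In a real hook one would presumably pass `gavagai_score` as `score_fn` and hand the result to `sys.exit`. Note the strict comparison: a score exactly equal to the threshold fails, matching "below threshold" in the description above.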

GitHub Action (composite):

- uses: hinanohart/gavagai/.github/actions/gavagai@v0.1.0
  with:
    before: artifacts/baseline.npz
    after:  artifacts/candidate.npz
    threshold: 0.3

Equivalence relations

The score is relative to a choice of equivalence relation — this is the Ontological Relativity commitment, made explicit:

`equivalence=`	What "two features are equivalent" means	Needs
`"cosine"`	decoder directions within ε cosine distance	decoder matrices
`"activation"`	overlapping token-firing patterns (Jaccard ≥ 1−ε)	`activations_*` arrays
`"behavior"`	similar downstream KL when ablated (`1/(1+kl) ≥ 1−ε`)	`ablation_kl` matrix

score = gavagai_score(sae_a, sae_b, equivalence="cosine", epsilon=0.1)

Different relations yield different scores. That is the point: there is no relation-independent "true" indeterminacy.

Caveat (v0.1): the same epsilon is applied across all three relations despite their differing scales (cosine ∈ [−1,1], Jaccard ∈ [0,1], KL-derived ∈ (0,1]). Compare scores across relations qualitatively; a given ε is not equally strict under each relation. v0.1.x will add per-relation epsilon normalization.
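The three thresholds in the table can be written as NumPy predicates. The sketch below is illustrative rather than the library's code: the function names are hypothetical, firing patterns are assumed to be boolean arrays of shape (n_features, n_tokens), and `ablation_kl` is assumed to hold one KL value per feature pair.

```python
import numpy as np


def cosine_equivalent(w_a, w_b, epsilon):
    """Pairs whose decoder directions have cosine similarity >= 1 - epsilon."""
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    return (a @ b.T) >= 1.0 - epsilon


def activation_equivalent(fires_a, fires_b, epsilon):
    """Pairs whose token-firing sets have Jaccard index >= 1 - epsilon."""
    fa = fires_a.astype(np.int64)
    fb = fires_b.astype(np.int64)
    inter = fa @ fb.T
    union = fa.sum(axis=1)[:, None] + fb.sum(axis=1)[None, :] - inter
    jaccard = np.divide(inter, union,
                        out=np.zeros(inter.shape, dtype=float),
                        where=union > 0)
    return jaccard >= 1.0 - epsilon


def behavior_equivalent(ablation_kl, epsilon):
    """Pairs whose KL-derived similarity 1 / (1 + kl) is >= 1 - epsilon."""
    return 1.0 / (1.0 + ablation_kl) >= 1.0 - epsilon


eps = 0.1
w_a = np.eye(2)
w_b = np.array([[1.0, 0.2], [1.0, 1.0]])
fires_a = np.array([[1, 1, 1, 1, 0]], dtype=bool)
fires_b = np.array([[1, 1, 1, 0, 0]], dtype=bool)
kl = np.array([[0.05, 0.5]])

print(cosine_equivalent(w_a, w_b, eps))
print(activation_equivalent(fires_a, fires_b, eps))
print(behavior_equivalent(kl, eps))
```

With ε = 0.1 the behavior relation admits only pairs with KL up to about 0.11, while the activation relation needs a Jaccard index of at least 0.9. The same ε therefore encodes quite different strictness under each relation.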

How it works

1. Extract decoder matrices W_A, W_B.
2. Compute similarity matrix S under the chosen relation.
3. Threshold by ε to get a candidate adjacency A_ε.
4. Count valid bipartite matchings of A_ε (DFS with backtracking, capped at cap=1000). Empty adjacency ⇒ cap (radical indeterminacy).
5. Compress the matching count n to [0, 1] via 1 − 1 / (1 + log(n)).
6. Bootstrap a 95% CI over feature-row resamples.

Step 4 is the Quinean heart: we never collapse the candidate space to a single bijection.
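The pipeline up to the compression step can be sketched end to end for the cosine relation. Several choices here are assumptions rather than the library's behavior: a "valid matching" is taken to assign every A-feature a distinct adjacent B-feature, a count of zero is treated like an empty adjacency, the log is natural, and the function names are hypothetical. The bootstrap step is left out.

```python
import math

import numpy as np


def count_matchings(adj, cap=1000):
    """Count assignments of each row to a distinct adjacent column (DFS, capped)."""
    n_a, n_b = adj.shape
    used = [False] * n_b
    count = 0

    def dfs(row):
        nonlocal count
        if count >= cap:
            return
        if row == n_a:  # every A-feature has a partner: one complete matching
            count += 1
            return
        for col in np.flatnonzero(adj[row]):
            if not used[col]:
                used[col] = True
                dfs(row + 1)
                used[col] = False  # backtrack
                if count >= cap:
                    return

    dfs(0)
    return count


def indeterminacy_sketch(w_a, w_b, epsilon=0.1, cap=1000):
    """Return (score, n) for decoder matrices under the cosine relation."""
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    adj = (a @ b.T) >= 1.0 - epsilon  # similarity matrix, thresholded by epsilon
    n = count_matchings(adj, cap) if adj.any() else cap  # empty adjacency => cap
    n = n or cap  # assumption: zero matchings handled like an empty adjacency
    return 1.0 - 1.0 / (1.0 + math.log(n)), n


eye = np.eye(3)
near = np.array([[1.0, 0.0, 0.01], [1.0, 0.01, 0.0], [1.0, 0.0, 0.0]])

print(indeterminacy_sketch(eye, eye[[2, 0, 1]]))  # permuted copy: one matching
print(indeterminacy_sketch(near, near))           # near-duplicates: 3! matchings
```

A permuted copy of an orthogonal dictionary admits exactly one matching and scores 0. Three nearly identical features admit all 3! = 6 assignments and score about 0.64, even though the two dictionaries are literally the same matrix; that is the kind of indeterminacy a single best-match alignment would hide.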

Roadmap

version	adds
v0.1.0	scalar `gavagai_score`, CLI gate, 3 equivalence relations
v0.1.x	per-relation ε normalization, `coverage` diagnostic for sparse adjacency
v0.2.0	holism propagator (Duhem-Quine) via `circuit-tracer`
v0.3.0	ontological commitment detector (AlignSAE binding)
v1.0.0	cross-paradigm translator (probe ↔ SAE ↔ patching)

Known limitations

gavagai is a research probe, not an alignment tool. Please read the following before drawing conclusions from its output:

Limitation	What it means
Feature absorption	A single human-interpretable concept may be split across multiple SAE features, or merged with an unrelated one. `gavagai_score` measures geometric indeterminacy between feature dictionaries; it cannot detect or correct absorption.
Faithfulness decay	Reconstruction error (reported by `from_pretrained`) grows with layer depth in GPT-2-small. The per-layer MSE is estimated on probe prompts and may differ substantially from the SAELens training distribution.
ε-sensitivity	Results depend on the choice of `epsilon`. No single value is "correct"; treat scores as relative, not absolute.
No semantic grounding	Cosine similarity between decoder directions is a geometric proxy for feature equivalence, not a semantic one. Two features with cosine 0.95 may represent entirely different concepts if the SAEs were trained on different distributions.
Research probe	Do not use `gavagai_score` as a safety or alignment guarantee. The library measures indeterminacy; it does not resolve it.

Anti-goals

Not an SAE trainer. Use SAELens.
Not a circuit visualizer. Use circuit-tracer.
Not a universal feature library. We measure indeterminacy; we do not pretend to eliminate it.
Not an alignment tool. See Known limitations above.

Reading

W. V. O. Quine, Word and Object (1960), ch. 2.
W. V. O. Quine, Ontological Relativity and Other Essays (1969).
Marks et al., Sparse Feature Circuits, ICLR 2025 (arXiv:2403.19647).
Bricken et al., Towards Monosemanticity, Anthropic 2023.
Arditi et al., Refusal in Language Models Is Mediated by a Single Direction, NeurIPS 2024 (arXiv:2406.11717).
Mechanistic Interpretability Needs Philosophy (arXiv:2506.18852).

License

MIT. See LICENSE.

The name draws on Quine's philosophical work as a scholarly reference. No endorsement or affiliation is claimed or implied.

Release history

version	released
0.2.1	May 20, 2026 (this version)
0.2.0	May 20, 2026
0.1.3	May 19, 2026
0.1.2	May 17, 2026
0.1.1	May 17, 2026
0.1.0	May 17, 2026

Download files

file	type	size	uploaded
gavagai-0.2.1.tar.gz	Source distribution	39.0 kB	May 20, 2026
gavagai-0.2.1-py3-none-any.whl	Built distribution (Python 3)	29.2 kB	May 20, 2026

Both files were uploaded via twine/6.1.0 CPython/3.13.12, without Trusted Publishing.

Hashes for gavagai-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`cf6c06faf00757c78cb68645e481db451fff811a5bee8e3c8a59057a5ad762d8`
MD5	`bffe90676a18fbc187b0c3fb219111fd`
BLAKE2b-256	`934a519a702c4742cd025f6cb67c18e929173254e820c59d6e55bc655ffb1dfd`

Hashes for gavagai-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e3d3c315c8838f88397a8b79d00604c048388b9235e2ec189d5f83e83bbbde0f`
MD5	`5c1fe330407cb6224d02d8801ec490ce`
BLAKE2b-256	`52a17ca706c4e82d754719cca48478357a13999a2fa69e8b6fa4a35bb5b2ab32`
