Quantify translation indeterminacy between sparse autoencoder feature dictionaries (Quine × Mechanistic Interpretability).
gavagai
“The very fact of the indeterminacy of translation is a finding about meaning, not a failure of method.” — paraphrased after W. V. O. Quine, Ontological Relativity (1968)
gavagai quantifies translation indeterminacy between two Sparse Autoencoder (SAE) feature dictionaries — how many empirically valid feature-to-feature alignments exist, not which single alignment is "the right one". It is a Mechanistic Interpretability tool grounded in Quine's philosophy of language.
🇯🇵 A Japanese version of this description is available at docs/README.ja.md.
Why this exists
Cross-SAE alignment tools (Universal SAE, SPARC, Sparse Crosscoders) ask:
what is the correct mapping between two SAEs' features? Quine's gavagai
thought experiment suggests this question is structurally
underdetermined: the observational data fixes an equivalence class of
translations, not a single one. gavagai does not solve that
underdetermination — it measures it.
Concretely:
- Train two SAEs (different seed, different model checkpoint, different layer) on aligned activations.
- Run gavagai_score(sae_a, sae_b).
- Get a number in [0, 1]: 0 = deterministic alignment exists; 1 = radical indeterminacy (many empirically valid alignments).
The score drops into CI as a regression gate: refuse model pushes whose indeterminacy with the baseline exceeds a threshold.
Install
pip install gavagai
Optional extras:
pip install "gavagai[saelens]" # SAELens SAE objects
pip install "gavagai[behavior]" # downstream-KL behavior equivalence
pip install "gavagai[holism]" # circuit-tracer integration (v0.2)
Quick start
import numpy as np
from gavagai import gavagai_score
# Decoder matrices: shape (n_features, d_model). gavagai also accepts
# SAELens SAE instances and {"W_dec": ndarray} dicts.
sae_a = np.random.default_rng(0).standard_normal((1024, 768))
sae_b = np.random.default_rng(1).standard_normal((1024, 768))
score = gavagai_score(sae_a, sae_b)
print(f"indeterminacy: {score:.4f}")
# With diagnostics
score, details = gavagai_score(sae_a, sae_b, return_details=True)
print(f" candidates : {details.n_equivalent_translations}")
print(f" 95% CI : [{details.ci_low:.4f}, {details.ci_high:.4f}]")
CI gate (gavagai-lint)
The killer app. Drop it into your pre-push hook or GitHub Action:
gavagai-lint \
--before sae_baseline.npz \
--after sae_after_abliteration.npz \
--threshold 0.3
Exits with 0 if the indeterminacy is below the threshold (acceptable drift) and 1
otherwise. Designed for gating Hugging Face uploads, abliteration patches,
and post-training fine-tuning steps where the model's feature semantics may
silently rearrange.
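The exit-code rule above can be sketched in a few lines of Python. This is an illustration only, not gavagai-lint's implementation: `gate_exit_code` is a hypothetical helper, and treating a score exactly equal to the threshold as a failure is an assumption read off the phrase "below threshold".

```python
def gate_exit_code(score: float, threshold: float) -> int:
    """Map an indeterminacy score to a process exit code.

    Hypothetical helper, not part of gavagai. Returns 0 (pass) when the
    score is strictly below the threshold and 1 (fail) otherwise.
    The strict comparison at the boundary is an assumption.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score}")
    return 0 if score < threshold else 1


# Illustrative scores; in practice the value would come from gavagai_score(...).
print(gate_exit_code(0.12, threshold=0.3))  # acceptable drift -> 0
print(gate_exit_code(0.45, threshold=0.3))  # too much indeterminacy -> 1
```

A CI step would pass the returned value to the shell as the process exit status, so any nonzero result blocks the push.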
GitHub Action (composite):
- uses: hinanohart/gavagai/.github/actions/gavagai@v0.1.0
  with:
    before: artifacts/baseline.npz
    after: artifacts/candidate.npz
    threshold: 0.3
Equivalence relations
The score is relative to a choice of equivalence relation — this is the Ontological Relativity commitment, made explicit:
| equivalence= | What "two features are equivalent" means | Needs |
|---|---|---|
| "cosine" | decoder directions within ε cosine distance | decoder matrices |
| "activation" | overlapping token-firing patterns (Jaccard ≥ 1−ε) | activations_* arrays |
| "behavior" | similar downstream KL when ablated (1/(1+kl) ≥ 1−ε) | ablation_kl matrix |
score = gavagai_score(sae_a, sae_b, equivalence="cosine", epsilon=0.1)
Different relations yield different scores. That is the point: there is no relation-independent "true" indeterminacy.
Caveat (v0.1): the same epsilon is applied across all three relations despite their differing scales (cosine ∈ [−1,1], Jaccard ∈ [0,1], KL-derived ∈ (0,1]). Comparison across relations should be qualitative, not threshold-equal. v0.1.x will add per-relation epsilon normalization.
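To make the three relations concrete, here is a minimal NumPy sketch of how each similarity could be computed and thresholded. It is not gavagai's code: all function names are hypothetical, the array shapes are assumptions, and "within ε cosine distance" is read as cosine similarity ≥ 1−ε so that all three relations share the same test.

```python
import numpy as np


def cosine_similarity(w_a: np.ndarray, w_b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of two decoder matrices.

    Assumes shape (n_features, d_model) and no all-zero rows.
    """
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    return a @ b.T


def jaccard_similarity(fires_a: np.ndarray, fires_b: np.ndarray) -> np.ndarray:
    """Pairwise Jaccard overlap of boolean firing patterns.

    Assumes shape (n_features, n_tokens), True where a feature fires.
    """
    fa = fires_a.astype(np.int64)
    fb = fires_b.astype(np.int64)
    intersection = (fa @ fb.T).astype(float)
    union = fa.sum(axis=1)[:, None] + fb.sum(axis=1)[None, :] - intersection
    # A pair of features that never fire has an empty union; score it 0 (assumption).
    return np.divide(intersection, union, out=np.zeros(union.shape), where=union > 0)


def behavior_similarity(ablation_kl: np.ndarray) -> np.ndarray:
    """Map non-negative KL values into (0, 1] via 1 / (1 + kl)."""
    return 1.0 / (1.0 + ablation_kl)


def candidate_adjacency(similarity: np.ndarray, epsilon: float) -> np.ndarray:
    """Boolean matrix of feature pairs that count as equivalent at this epsilon."""
    return similarity >= 1.0 - epsilon


w_a = np.array([[1.0, 0.0], [0.0, 1.0]])
w_b = np.array([[2.0, 0.0], [1.0, 1.0]])
print(candidate_adjacency(cosine_similarity(w_a, w_b), epsilon=0.1))
```

Because each similarity lives on a different scale, the same epsilon admits very different numbers of candidate pairs under each relation, which is the caveat described above.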
How it works
1. Extract decoder matrices W_A, W_B.
2. Compute similarity matrix S under the chosen relation.
3. Threshold by ε to get a candidate adjacency A_ε.
4. Count valid bipartite matchings of A_ε (DFS with backtracking, capped at cap=1000). Empty adjacency ⇒ cap (radical indeterminacy).
5. Compress matching count to [0, 1] via 1 − 1 / (1 + log(n)).
6. Bootstrap a 95% CI over feature-row resamples.
Step 4 is the Quinean heart: we never collapse the candidate space to a single bijection.
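The thresholding, counting, and compression steps can be sketched as follows. This is a simplified illustration under stated assumptions, not gavagai's implementation: the function names are hypothetical, a "valid matching" is taken to mean an assignment of every row of the adjacency to a distinct column, log is taken as the natural logarithm, and a non-empty adjacency that admits no such assignment is mapped to the cap by analogy with the empty case. The bootstrap step is omitted.

```python
import math

import numpy as np


def count_matchings(adjacency: np.ndarray, cap: int = 1000) -> int:
    """Count row-to-distinct-column assignments by iterative DFS with backtracking.

    Stops as soon as `cap` assignments have been found. An empty adjacency
    returns `cap`, mirroring the "empty adjacency => cap" rule in the text.
    """
    n_rows, n_cols = adjacency.shape
    if not adjacency.any():
        return cap
    # Visit the most constrained rows first so dead ends surface early.
    order = sorted(range(n_rows), key=lambda r: int(adjacency[r].sum()))
    candidates = [np.flatnonzero(adjacency[r]).tolist() for r in order]
    used = [False] * n_cols
    choice = [-1] * n_rows  # index into candidates[depth] currently held
    count, depth = 0, 0
    while depth >= 0:
        if depth == n_rows:  # every row is assigned: one complete matching
            count += 1
            if count >= cap:
                return cap
            depth -= 1
            continue
        if choice[depth] >= 0:  # release the column held at this depth
            used[candidates[depth][choice[depth]]] = False
        nxt = choice[depth] + 1
        while nxt < len(candidates[depth]) and used[candidates[depth][nxt]]:
            nxt += 1
        if nxt < len(candidates[depth]):
            choice[depth] = nxt
            used[candidates[depth][nxt]] = True
            depth += 1
        else:  # no free candidate left: backtrack
            choice[depth] = -1
            depth -= 1
    return count


def compress(n_matchings: int) -> float:
    """Map a matching count n >= 1 into [0, 1) via 1 - 1 / (1 + ln n)."""
    if n_matchings < 1:
        raise ValueError("n_matchings must be at least 1")
    return 1.0 - 1.0 / (1.0 + math.log(n_matchings))


def indeterminacy(similarity: np.ndarray, epsilon: float = 0.1, cap: int = 1000) -> float:
    """Threshold, count, and compress (steps 3 to 5)."""
    adjacency = similarity >= 1.0 - epsilon
    n = count_matchings(adjacency, cap)
    if n == 0:
        n = cap  # ASSUMPTION: no complete matching is treated like an empty adjacency
    return compress(n)


print(indeterminacy(np.eye(3)))        # exactly one matching
print(indeterminacy(np.ones((3, 3))))  # six matchings
```

Under these assumptions, a single valid matching gives a score of 0, and a count capped at 1000 gives roughly 0.874 rather than 1. The cap also bounds the search: exact counting of perfect matchings is #P-complete, so stopping early keeps the DFS tractable when there are many candidates.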
Roadmap
| version | adds |
|---|---|
| v0.1.0 | scalar gavagai_score, CLI gate, 3 equivalence relations |
| v0.1.x | per-relation ε normalization, coverage diagnostic for sparse adjacency |
| v0.2.0 | holism propagator (Duhem-Quine) via circuit-tracer |
| v0.3.0 | ontological commitment detector (AlignSAE binding) |
| v1.0.0 | cross-paradigm translator (probe ↔ SAE ↔ patching) |
Anti-goals
- Not an SAE trainer. Use SAELens.
- Not a circuit visualizer. Use circuit-tracer.
- Not a universal feature library. We measure indeterminacy; we do not pretend to eliminate it.
Reading
- W. V. O. Quine, Word and Object (1960), ch. 2.
- W. V. O. Quine, Ontological Relativity and Other Essays (1969).
- Marks et al., Sparse Feature Circuits, ICLR 2025 (arXiv:2403.19647).
- Bricken et al., Towards Monosemanticity, Anthropic 2023.
- Arditi et al., Refusal in Language Models Is Mediated by a Single Direction, NeurIPS 2024 (arXiv:2406.11717).
- Mechanistic Interpretability Needs Philosophy (arXiv:2506.18852).
License
MIT. See LICENSE.
The name draws on Quine's philosophical work as a scholarly reference. No endorsement or affiliation is claimed or implied.