Quantify translation indeterminacy between sparse autoencoder feature dictionaries (Quine × Mechanistic Interpretability).

gavagai

“The very fact of the indeterminacy of translation is a finding about meaning, not a failure of method.” — paraphrased after W. V. O. Quine, Ontological Relativity (1968)

gavagai quantifies translation indeterminacy between two Sparse Autoencoder (SAE) feature dictionaries — how many empirically valid feature-to-feature alignments exist, not which single alignment is "the right one". It is a Mechanistic Interpretability tool grounded in Quine's philosophy of language.

🇯🇵 A Japanese-language description is available at docs/README.ja.md.

Why this exists

Cross-SAE alignment tools (Universal SAE, SPARC, Sparse Crosscoders) ask: what is the correct mapping between two SAEs' features? Quine's gavagai thought experiment suggests this question is structurally underdetermined: the observational data fixes an equivalence class of translations, not a single one. gavagai does not solve that underdetermination — it measures it.

Concretely:

  • Train two SAEs (differing in seed, model checkpoint, or layer) on aligned activations.
  • Run gavagai_score(sae_a, sae_b).
  • Get a number in [0, 1]: 0 = exactly one valid alignment exists (deterministic); values approaching 1 = radical indeterminacy (many empirically valid alignments).

The score drops into CI as a regression gate: refuse model pushes whose indeterminacy with the baseline exceeds a threshold.

Install

pip install gavagai

Optional extras:

pip install "gavagai[saelens]"    # SAELens SAE objects
pip install "gavagai[behavior]"   # downstream-KL behavior equivalence
pip install "gavagai[holism]"     # circuit-tracer integration (v0.2)

Quick start

import numpy as np
from gavagai import gavagai_score

# Decoder matrices: shape (n_features, d_model). gavagai also accepts
# SAELens SAE instances and {"W_dec": ndarray} dicts.
sae_a = np.random.default_rng(0).standard_normal((1024, 768))
sae_b = np.random.default_rng(1).standard_normal((1024, 768))

score = gavagai_score(sae_a, sae_b)
print(f"indeterminacy: {score:.4f}")

# With diagnostics
score, details = gavagai_score(sae_a, sae_b, return_details=True)
print(f"  candidates : {details.n_equivalent_translations}")
print(f"  95% CI     : [{details.ci_low:.4f}, {details.ci_high:.4f}]")

CI gate (gavagai-lint)

The killer app. Drop it into your pre-push hook or GitHub Action:

gavagai-lint \
    --before sae_baseline.npz \
    --after  sae_after_abliteration.npz \
    --threshold 0.3

Exit 0 if the indeterminacy is below threshold (acceptable drift), 1 otherwise. Designed for gating Hugging Face uploads, abliteration patches, and post-train fine-tuning steps where the model's feature semantics may silently re-arrange.
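The exit-code rule above can be sketched in a few lines of Python. This is an illustration only, not the gavagai-lint source: the function name `gate_exit_code` is hypothetical, and treating a score exactly equal to the threshold as a failure is an assumption read off the phrase "below threshold".

```python
def gate_exit_code(score: float, threshold: float) -> int:
    """Map an indeterminacy score to a process exit code.

    Hypothetical helper: 0 means acceptable drift, 1 means the gate fails.
    Assumption: a score equal to the threshold counts as a failure.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score}")
    return 0 if score < threshold else 1


# A score of 0.12 passes a 0.3 threshold; 0.45 does not.
print(gate_exit_code(0.12, 0.3))  # 0
print(gate_exit_code(0.45, 0.3))  # 1
```

A pre-push hook would pass this value to `sys.exit`, so a failing gate blocks the push.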

GitHub Action (composite):

- uses: hinanohart/gavagai/.github/actions/gavagai@v0.1.0
  with:
    before: artifacts/baseline.npz
    after:  artifacts/candidate.npz
    threshold: 0.3

Equivalence relations

The score is relative to a choice of equivalence relation — this is the Ontological Relativity commitment, made explicit:

| equivalence= | What "two features are equivalent" means | Needs |
| --- | --- | --- |
| "cosine" | decoder directions within ε cosine distance | decoder matrices |
| "activation" | overlapping token-firing patterns (Jaccard ≥ 1−ε) | activations_* arrays |
| "behavior" | similar downstream KL when ablated (1/(1+kl) ≥ 1−ε) | ablation_kl matrix |

score = gavagai_score(sae_a, sae_b, equivalence="cosine", epsilon=0.1)

Different relations yield different scores. That is the point: there is no relation-independent "true" indeterminacy.

Caveat (v0.1): the same epsilon is applied across all three relations despite their differing scales (cosine ∈ [−1,1], Jaccard ∈ [0,1], KL-derived ∈ (0,1]). Comparison across relations should be qualitative, not threshold-equal. v0.1.x will add per-relation epsilon normalization.
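To make the three relations and the shared ε concrete, here is a minimal pure-Python sketch of the similarity measures named above. The function names are hypothetical and this is not the library's implementation; it only assumes that each relation reduces to a similarity compared against 1 − ε.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two decoder directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def jaccard_similarity(tokens_a, tokens_b):
    """Overlap of the token sets on which two features fire."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0  # assumption: two silent features are not matched
    return len(a & b) / len(a | b)


def behavior_similarity(kl):
    """Map a non-negative downstream KL divergence into (0, 1]."""
    if kl < 0.0:
        raise ValueError("KL divergence must be non-negative")
    return 1.0 / (1.0 + kl)


def is_equivalent(similarity, epsilon):
    """Shared thresholding rule: similarity >= 1 - epsilon."""
    return similarity >= 1.0 - epsilon


eps = 0.1
print(is_equivalent(cosine_similarity([1.0, 0.0], [0.99, 0.1]), eps))       # True
print(is_equivalent(jaccard_similarity({1, 2, 3, 4}, {2, 3, 4, 5}), eps))  # False
print(is_equivalent(behavior_similarity(0.05), eps))                       # True
```

With ε = 0.1 the same threshold demands a cosine of at least 0.9, a Jaccard overlap of at least 0.9, and a KL of at most about 0.11, which is why scores from different relations are not directly comparable.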

How it works

  1. Extract decoder matrices W_A, W_B.
  2. Compute similarity matrix S under the chosen relation.
  3. Threshold by ε to get a candidate adjacency A_ε.
  4. Count valid bipartite matchings of A_ε (DFS with backtracking, capped at cap=1000). An empty adjacency is scored as cap (radical indeterminacy).
  5. Compress the matching count n to [0, 1] via 1 − 1 / (1 + log(n)), so n = 1 maps to 0 and the score grows toward 1 as n increases.
  6. Bootstrap a 95% CI over feature-row resamples.
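Steps 4 and 5 can be sketched as follows. This is a hedged illustration rather than the package's code: it assumes a "valid matching" assigns every row feature to a distinct admissible column feature, that `log` is the natural logarithm, and that recursion depth is acceptable for small inputs. The names `count_matchings` and `compress` are hypothetical.

```python
import math


def count_matchings(adjacency, cap=1000):
    """Count assignments of every row to a distinct admissible column.

    adjacency is a list of rows of 0/1 flags. Counting stops at cap.
    An adjacency with no edges returns cap, as described in step 4.
    """
    n_rows = len(adjacency)
    if not any(any(row) for row in adjacency):
        return cap
    # Visit the most constrained rows first so dead ends are found early.
    order = sorted(range(n_rows), key=lambda i: sum(1 for x in adjacency[i] if x))
    used = set()
    count = 0

    def dfs(k):
        nonlocal count
        if count >= cap:
            return
        if k == n_rows:
            count += 1
            return
        for j, admissible in enumerate(adjacency[order[k]]):
            if admissible and j not in used:
                used.add(j)
                dfs(k + 1)
                used.discard(j)  # backtrack
                if count >= cap:
                    return

    dfs(0)
    return count


def compress(n):
    """Step 5: map a matching count n >= 1 into [0, 1)."""
    if n < 1:
        raise ValueError("compress expects a count of at least 1")
    return 1.0 - 1.0 / (1.0 + math.log(n))


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
dense = [[1] * 3 for _ in range(3)]
print(count_matchings(identity), compress(count_matchings(identity)))  # 1 0.0
print(count_matchings(dense))                                          # 6
```

Under the natural-log assumption, a count that hits cap=1000 compresses to roughly 0.87. The sketch leaves open how to score an adjacency that has edges but no complete assignment (count 0); the source does not specify that case.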

Step 4 is the Quinean heart: we never collapse the candidate space to a single bijection.
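The bootstrap in step 6 can be illustrated with a generic percentile bootstrap over resampled rows. This is a sketch under assumptions: `bootstrap_ci` and `coverage` are hypothetical names, the percentile method is assumed (the source does not say which bootstrap variant is used), and the statistic here is a cheap stand-in (fraction of rows with at least one candidate), not the actual indeterminacy score.

```python
import random


def bootstrap_ci(rows, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval over row resamples.

    rows      : list of feature rows (e.g. rows of the adjacency matrix)
    statistic : function mapping a list of rows to a float
    Returns (low, high) bounds of the (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_boot):
        # Resample rows with replacement, keeping the sample size fixed.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    estimates.sort()
    low = estimates[int(round((alpha / 2) * (n_boot - 1)))]
    high = estimates[int(round((1 - alpha / 2) * (n_boot - 1)))]
    return low, high


def coverage(rows):
    """Stand-in statistic: share of rows with at least one candidate."""
    return sum(1 for row in rows if any(row)) / len(rows)


adjacency = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 0]]
low, high = bootstrap_ci(adjacency, coverage, n_boot=500, seed=0)
print(f"95% CI for coverage: [{low:.2f}, {high:.2f}]")
```

Swapping `coverage` for a function that thresholds, counts matchings, and compresses would give an interval on the score itself, at the cost of one matching count per resample.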

Roadmap

| version | adds |
| --- | --- |
| v0.1.0 | scalar gavagai_score, CLI gate, 3 equivalence relations |
| v0.1.x | per-relation ε normalization, coverage diagnostic for sparse adjacency |
| v0.2.0 | holism propagator (Duhem-Quine) via circuit-tracer |
| v0.3.0 | ontological commitment detector (AlignSAE binding) |
| v1.0.0 | cross-paradigm translator (probe ↔ SAE ↔ patching) |

Anti-goals

  • Not an SAE trainer. Use SAELens.
  • Not a circuit visualizer. Use circuit-tracer.
  • Not a universal feature library. We measure indeterminacy; we do not pretend to eliminate it.

Reading

  • W. V. O. Quine, Word and Object (1960), ch. 2.
  • W. V. O. Quine, Ontological Relativity and Other Essays (1969).
  • Marks et al., Sparse Feature Circuits, ICLR 2025 (arXiv:2403.19647).
  • Bricken et al., Towards Monosemanticity, Anthropic 2023.
  • Arditi et al., Refusal in Language Models Is Mediated by a Single Direction, NeurIPS 2024 (arXiv:2406.11717).
  • Mechanistic Interpretability Needs Philosophy (arXiv:2506.18852).

License

MIT. See LICENSE.

The name draws on Quine's philosophical work as a scholarly reference. No endorsement or affiliation is claimed or implied.
