
Quantify translation indeterminacy between sparse autoencoder feature dictionaries (Quine × Mechanistic Interpretability).


gavagai

“The very fact of the indeterminacy of translation is a finding about meaning, not a failure of method.” — paraphrased after W. V. O. Quine, Ontological Relativity (1968)


gavagai quantifies translation indeterminacy between two Sparse Autoencoder (SAE) feature dictionaries — how many empirically valid feature-to-feature alignments exist, not which single alignment is "the right one". It is a Mechanistic Interpretability tool grounded in Quine's philosophy of language.

🇯🇵 A Japanese version of this description is available in docs/README.ja.md.

Why this exists

Cross-SAE alignment tools (Universal SAE, SPARC, Sparse Crosscoders) ask: what is the correct mapping between two SAEs' features? Quine's gavagai thought experiment suggests this question is structurally underdetermined: the observational data fixes an equivalence class of translations, not a single one. gavagai does not solve that underdetermination — it measures it.

Concretely:

  • Train two SAEs (different seed, different model checkpoint, different layer) on aligned activations.
  • Run gavagai_score(sae_a, sae_b).
  • Get a number in [0, 1]: 0 = exactly one valid alignment exists (determinate); 1 = radical indeterminacy (many empirically valid alignments).

The score drops into CI as a regression gate: refuse model pushes whose indeterminacy with the baseline exceeds a threshold.

Install

pip install gavagai

Optional extras:

pip install "gavagai[saelens]"    # SAELens SAE objects
pip install "gavagai[behavior]"   # downstream-KL behavior equivalence
pip install "gavagai[holism]"     # circuit-tracer integration (v0.2)

Quick start

import numpy as np
from gavagai import gavagai_score

# Decoder matrices: shape (n_features, d_model). gavagai also accepts
# SAELens SAE instances and {"W_dec": ndarray} dicts.
sae_a = np.random.default_rng(0).standard_normal((1024, 768))
sae_b = np.random.default_rng(1).standard_normal((1024, 768))

score = gavagai_score(sae_a, sae_b)
print(f"indeterminacy: {score:.4f}")

# With diagnostics
score, details = gavagai_score(sae_a, sae_b, return_details=True)
print(f"  candidates : {details.n_equivalent_translations}")
print(f"  95% CI     : [{details.ci_low:.4f}, {details.ci_high:.4f}]")

CI gate (gavagai-lint)

The killer app. Drop it into your pre-push hook or GitHub Action:

gavagai-lint \
    --before sae_baseline.npz \
    --after  sae_after_abliteration.npz \
    --threshold 0.3

Exits with status 0 if the indeterminacy is below the threshold (acceptable drift) and 1 otherwise. Designed for gating Hugging Face uploads, abliteration patches, and post-training fine-tuning steps where the model's feature semantics may silently rearrange.
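For a hook written in Python rather than shell, the same invocation can be wrapped with `subprocess`. This is a minimal sketch: the flags are the ones shown above, while `build_lint_command` and `run_gate` are hypothetical helper names, not part of gavagai.

```python
import subprocess


def build_lint_command(before: str, after: str, threshold: float) -> list[str]:
    """Assemble the gavagai-lint invocation shown above as an argv list."""
    return [
        "gavagai-lint",
        "--before", before,
        "--after", after,
        "--threshold", str(threshold),
    ]


def run_gate(before: str, after: str, threshold: float = 0.3) -> int:
    """Run the gate and relay its exit status (0 = pass, 1 = fail)."""
    cmd = build_lint_command(before, after, threshold)
    # check=False: a non-zero status is the gate's verdict, not an exception.
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


# In the hook script itself, propagate the verdict to git:
#   raise SystemExit(run_gate("sae_baseline.npz", "sae_after_abliteration.npz"))
```

Relaying the exit status unchanged keeps the hook's behavior identical to calling the CLI directly, so the threshold semantics stay defined in one place.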

GitHub Action (composite):

- uses: hinanohart/gavagai/.github/actions/gavagai@v0.1.0
  with:
    before: artifacts/baseline.npz
    after:  artifacts/candidate.npz
    threshold: 0.3

Equivalence relations

The score is relative to a choice of equivalence relation — this is the Ontological Relativity commitment, made explicit:

| equivalence= | What "two features are equivalent" means | Needs |
| --- | --- | --- |
| "cosine" | decoder directions within ε cosine distance | decoder matrices |
| "activation" | overlapping token-firing patterns (Jaccard ≥ 1−ε) | activations_* arrays |
| "behavior" | similar downstream KL when ablated (1/(1+kl) ≥ 1−ε) | ablation_kl matrix |

score = gavagai_score(sae_a, sae_b, equivalence="cosine", epsilon=0.1)

Different relations yield different scores. That is the point: there is no relation-independent "true" indeterminacy.

Caveat (v0.1): the same epsilon is applied across all three relations despite their differing scales (cosine ∈ [−1,1], Jaccard ∈ [0,1], KL-derived ∈ (0,1]). Comparison across relations should be qualitative, not threshold-equal. v0.1.x will add per-relation epsilon normalization.
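To make the caveat concrete, here is a standard-library sketch of the three similarity measures and the shared ε test. It is an illustration of the definitions in the table, not gavagai's implementation; all function names are hypothetical, and "within ε cosine distance" is read as cosine similarity ≥ 1−ε.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero decoder directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def jaccard(tokens_a, tokens_b):
    """Overlap of two sets of token indices on which a feature fires."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def behavior_similarity(kl):
    """Map a non-negative ablation KL into (0, 1]; kl = 0 gives 1."""
    return 1.0 / (1.0 + kl)


def is_equivalent(similarity, epsilon):
    """Shared test from the table: similarity >= 1 - epsilon."""
    return similarity >= 1.0 - epsilon


eps = 0.1
# Nearly parallel directions pass the cosine test.
print(is_equivalent(cosine_similarity([1.0, 0.0], [0.9, 0.1]), eps))
# Three of five tokens shared gives Jaccard 0.6, which fails at the same eps.
print(is_equivalent(jaccard([1, 2, 3, 4], [2, 3, 4, 5]), eps))
# A KL of 0.2 maps to about 0.83, which also fails.
print(is_equivalent(behavior_similarity(0.2), eps))
```

The same ε = 0.1 is lenient for one measure and strict for another, which is why cross-relation comparisons are best read qualitatively.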

How it works

  1. Extract decoder matrices W_A, W_B.
  2. Compute similarity matrix S under the chosen relation.
  3. Threshold by ε to get a candidate adjacency A_ε.
  4. Count valid bipartite matchings of A_ε (DFS with backtracking, capped at cap=1000). Empty adjacency ⇒ cap (radical indeterminacy).
  5. Compress matching count to [0, 1] via 1 − 1 / (1 + log(n)).
  6. Bootstrap a 95% CI over feature-row resamples.
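Steps 3 to 5 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: a "valid matching" is taken to mean assigning every A-feature to a distinct B-feature among its candidates, `log` is taken to be the natural logarithm, and the function names are hypothetical. The recursive search is only suitable for small dictionaries.

```python
import math


def threshold_adjacency(similarity, epsilon):
    """Step 3: candidates[i] lists the B-features j with S[i][j] >= 1 - epsilon."""
    return [
        [j for j, s in enumerate(row) if s >= 1.0 - epsilon]
        for row in similarity
    ]


def count_matchings(candidates, cap=1000):
    """Step 4: count assignments of every A-feature to a distinct B-feature.

    Depth-first search with backtracking, stopping once `cap` is reached.
    An adjacency with no edges at all returns `cap`, as described above.
    """
    if not any(candidates):
        return cap

    used = set()
    count = 0

    def dfs(i):
        nonlocal count
        if count >= cap:
            return
        if i == len(candidates):
            count += 1  # every A-feature has been assigned
            return
        for j in candidates[i]:
            if j not in used:
                used.add(j)
                dfs(i + 1)
                used.discard(j)  # backtrack
                if count >= cap:
                    return

    dfs(0)
    return count


def compress(n):
    """Step 5: map a matching count n >= 1 to a score via 1 - 1 / (1 + log n)."""
    if n < 1:
        raise ValueError("n must be >= 1; the n = 0 case is not specified above")
    return 1.0 - 1.0 / (1.0 + math.log(n))


identity = [[1.0, 0.0], [0.0, 1.0]]
ambiguous = [[1.0, 1.0], [1.0, 1.0]]
for s in (identity, ambiguous):
    n = count_matchings(threshold_adjacency(s, epsilon=0.1))
    print(n, round(compress(n), 4))
```

Under the natural-log assumption, a single matching gives a score of exactly 0, and the capped count of 1000 gives about 0.874, so with the default cap the score stays below 1.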

Step 4 is the Quinean heart: we never collapse the candidate space to a single bijection.
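Step 6 can be sketched as a percentile bootstrap. The resampling details (number of resamples, percentile method, seeding) are not specified here, so everything below is an assumption; `bootstrap_ci` and `score_fn` are hypothetical names, with `score_fn` standing in for whatever recomputes the score on a resampled set of feature rows.

```python
import random


def bootstrap_ci(rows, score_fn, n_resamples=200, alpha=0.05, seed=0):
    """Percentile bootstrap: resample rows with replacement, rescore, take quantiles."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        # Draw len(rows) rows with replacement.
        sample = [rows[rng.randrange(len(rows))] for _ in rows]
        scores.append(score_fn(sample))
    scores.sort()
    # Nearest-rank positions of the alpha/2 and 1 - alpha/2 quantiles.
    lo_idx = round((alpha / 2) * (n_resamples - 1))
    hi_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return scores[lo_idx], scores[hi_idx]


# Toy usage: the "score" is just the mean of four per-row values.
low, high = bootstrap_ci([0.1, 0.4, 0.5, 0.9], lambda s: sum(s) / len(s))
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

Fixing the seed makes the interval reproducible across CI runs, which matters if the interval rather than the point score is used for gating.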

Roadmap

| version | adds |
| --- | --- |
| v0.1.0 | scalar gavagai_score, CLI gate, 3 equivalence relations |
| v0.1.x | per-relation ε normalization, coverage diagnostic for sparse adjacency |
| v0.2.0 | holism propagator (Duhem-Quine) via circuit-tracer |
| v0.3.0 | ontological commitment detector (AlignSAE binding) |
| v1.0.0 | cross-paradigm translator (probe ↔ SAE ↔ patching) |

Anti-goals

  • Not a SAE trainer. Use SAELens.
  • Not a circuit visualizer. Use circuit-tracer.
  • Not a universal feature library. We measure indeterminacy; we do not pretend to eliminate it.

Reading

  • W. V. O. Quine, Word and Object (1960), ch. 2.
  • W. V. O. Quine, Ontological Relativity and Other Essays (1969).
  • Marks et al., Sparse Feature Circuits, ICLR 2025 (arXiv:2403.19647).
  • Bricken et al., Towards Monosemanticity, Anthropic 2023.
  • Arditi et al., Refusal in Language Models Is Mediated by a Single Direction, NeurIPS 2024 (arXiv:2406.11717).
  • Mechanistic Interpretability Needs Philosophy (arXiv:2506.18852).

License

Apache 2.0. See LICENSE and NOTICE.

The name draws on Quine's philosophical work as a scholarly reference. No endorsement or affiliation is claimed or implied.
