Quantify translation indeterminacy between sparse autoencoder feature dictionaries (Quine × Mechanistic Interpretability).

gavagai

“The very fact of the indeterminacy of translation is a finding about meaning, not a failure of method.” — paraphrased after W. V. O. Quine, Ontological Relativity (1968)

gavagai quantifies translation indeterminacy between two Sparse Autoencoder (SAE) feature dictionaries — how many empirically valid feature-to-feature alignments exist, not which single alignment is "the right one". It is a Mechanistic Interpretability tool grounded in Quine's philosophy of language.

🇯🇵 A Japanese version of this description is available in docs/README.ja.md.

Why this exists

Cross-SAE alignment tools (Universal SAE, SPARC, Sparse Crosscoders) ask: what is the correct mapping between two SAEs' features? Quine's gavagai thought experiment suggests this question is structurally underdetermined: the observational data fixes an equivalence class of translations, not a single one. gavagai does not solve that underdetermination — it measures it.

Concretely:

1. Train two SAEs (different seed, different model checkpoint, different layer) on aligned activations.
2. Run gavagai_score(sae_a, sae_b).
3. Get a number in [0, 1]: 0 means a single determinate alignment exists; values approaching 1 mean radical indeterminacy (many empirically valid alignments).

The score can be used in CI as a regression gate that rejects model pushes whose indeterminacy relative to the baseline exceeds a threshold.

Install

pip install gavagai

Optional extras:

pip install "gavagai[saelens]"    # SAELens SAE objects
pip install "gavagai[behavior]"   # downstream-KL behavior equivalence
pip install "gavagai[holism]"     # circuit-tracer integration (v0.2)

Quick start

import numpy as np
from gavagai import gavagai_score

# Decoder matrices: shape (n_features, d_model). gavagai also accepts
# SAELens SAE instances and {"W_dec": ndarray} dicts.
sae_a = np.random.default_rng(0).standard_normal((1024, 768))
sae_b = np.random.default_rng(1).standard_normal((1024, 768))

score = gavagai_score(sae_a, sae_b)
print(f"indeterminacy: {score:.4f}")

# With diagnostics
score, details = gavagai_score(sae_a, sae_b, return_details=True)
print(f"  candidates : {details.n_equivalent_translations}")
print(f"  95% CI     : [{details.ci_low:.4f}, {details.ci_high:.4f}]")

CI gate (`gavagai-lint`)

The killer app. Drop into your pre-push hook or GitHub Action:

gavagai-lint \
    --before sae_baseline.npz \
    --after  sae_after_abliteration.npz \
    --threshold 0.3

gavagai-lint exits with status 0 if the indeterminacy is below the threshold (acceptable drift) and 1 otherwise. It is designed for gating Hugging Face uploads, abliteration patches, and post-training fine-tuning steps where the model's feature semantics may silently rearrange.
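For a pre-push hook written in Python, the call can be wrapped in a few lines. This is a minimal sketch, not part of gavagai: `build_lint_command` and `run_gate` are hypothetical helper names, and the only flags used are the three shown above. The file names are the ones from the example.

```python
import subprocess


def build_lint_command(before, after, threshold):
    """Assemble the gavagai-lint invocation as an argument list."""
    return [
        "gavagai-lint",
        "--before", str(before),
        "--after", str(after),
        "--threshold", str(threshold),
    ]


def run_gate(before, after, threshold):
    """Run the gate and return its exit status.

    Requires gavagai to be installed so that gavagai-lint is on PATH.
    Per the description above, 0 means the score is below the threshold.
    """
    result = subprocess.run(build_lint_command(before, after, threshold))
    return result.returncode


# Build (but do not run) the command for the example files.
cmd = build_lint_command("sae_baseline.npz", "sae_after_abliteration.npz", 0.3)
print(" ".join(cmd))
```

A hook script would then pass the value returned by `run_gate(...)` to `sys.exit`, so that a non-zero status blocks the push.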

GitHub Action (composite):

- uses: hinanohart/gavagai/.github/actions/gavagai@v0.1.0
  with:
    before: artifacts/baseline.npz
    after:  artifacts/candidate.npz
    threshold: 0.3

Equivalence relations

The score is relative to a choice of equivalence relation — this is the Ontological Relativity commitment, made explicit:

| `equivalence=` | What "two features are equivalent" means | Needs |
| --- | --- | --- |
| `"cosine"` | decoder directions within ε cosine distance | decoder matrices |
| `"activation"` | overlapping token-firing patterns (Jaccard ≥ 1−ε) | `activations_*` arrays |
| `"behavior"` | similar downstream KL when ablated (`1/(1+kl) ≥ 1−ε`) | `ablation_kl` matrix |

score = gavagai_score(sae_a, sae_b, equivalence="cosine", epsilon=0.1)

Different relations yield different scores. That is the point: there is no relation-independent "true" indeterminacy.
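To make the three relations concrete, here is a minimal NumPy sketch of each predicate for a single pair of features. It follows the definitions in the table but is not gavagai's code: the function names are hypothetical, the inputs are simplified to one pair at a time, and treating two never-firing features as non-equivalent is an assumption.

```python
import numpy as np


def cosine_equivalent(u, v, epsilon):
    """True if the cosine distance (1 - cosine similarity) is at most epsilon."""
    cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return (1.0 - cos) <= epsilon


def activation_equivalent(fires_a, fires_b, epsilon):
    """True if the Jaccard overlap of two token-firing masks is at least 1 - epsilon."""
    a = np.asarray(fires_a, dtype=bool)
    b = np.asarray(fires_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return False  # assumption: features that never fire are not matched
    jaccard = np.logical_and(a, b).sum() / union
    return bool(jaccard >= 1.0 - epsilon)


def behavior_equivalent(kl, epsilon):
    """True if 1 / (1 + kl) is at least 1 - epsilon for one ablation KL value."""
    return 1.0 / (1.0 + kl) >= 1.0 - epsilon


# Nearly parallel decoder directions pass; orthogonal ones do not.
near = cosine_equivalent(np.array([1.0, 0.0]), np.array([1.0, 0.1]), 0.1)
far = cosine_equivalent(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.1)

# Feature A fires on tokens 0..8, feature B on tokens 0..9: Jaccard = 9/10.
mask_a = np.arange(10) < 9
mask_b = np.ones(10, dtype=bool)
loose = activation_equivalent(mask_a, mask_b, 0.2)
strict = activation_equivalent(mask_a, mask_b, 0.05)

small_kl = behavior_equivalent(0.05, 0.1)
large_kl = behavior_equivalent(0.5, 0.1)
print(near, far, loose, strict, small_kl, large_kl)
```

The same pair of features can pass one predicate and fail another, which is why the score has to be reported together with the relation it was computed under.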

Caveat (v0.1): the same epsilon is applied across all three relations despite their differing scales (cosine ∈ [−1,1], Jaccard ∈ [0,1], KL-derived ∈ (0,1]). Compare scores across relations qualitatively only: the same ε is not equally strict under each relation. v0.1.x will add per-relation epsilon normalization.
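One way to see the scale mismatch is to translate a single ε back into each relation's native units. The sketch below does that arithmetic; `native_thresholds` is a hypothetical helper, not a gavagai function. The KL bound follows from rearranging `1/(1+kl) ≥ 1−ε` into `kl ≤ ε/(1−ε)`, which holds for 0 ≤ ε < 1.

```python
def native_thresholds(epsilon):
    """Express one epsilon as a cutoff in each relation's own units."""
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    return {
        # cosine distance <= epsilon  <=>  cosine similarity >= 1 - epsilon
        "min_cosine_similarity": 1.0 - epsilon,
        # Jaccard >= 1 - epsilon, taken directly from the table
        "min_jaccard": 1.0 - epsilon,
        # 1 / (1 + kl) >= 1 - epsilon  <=>  kl <= epsilon / (1 - epsilon)
        "max_kl": epsilon / (1.0 - epsilon),
    }


thresholds = native_thresholds(0.1)
for name, value in thresholds.items():
    print(f"{name}: {value:.4f}")
```

With ε = 0.1 the cosine and Jaccard cutoffs are both 0.9, but 0.9 sits in the top 5% of the cosine range [−1, 1] and the top 10% of the Jaccard range [0, 1], while the KL cutoff is about 0.111 on an unbounded scale.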

How it works

1. Extract decoder matrices W_A, W_B.
2. Compute similarity matrix S under the chosen relation.
3. Threshold by ε to get a candidate adjacency A_ε.
4. Count valid bipartite matchings of A_ε (DFS with backtracking, capped at cap=1000). Empty adjacency ⇒ cap (radical indeterminacy).
5. Compress matching count to [0, 1] via 1 − 1 / (1 + log(n)).
6. Bootstrap a 95% CI over feature-row resamples.
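The final step, the bootstrap over feature rows, can be sketched on its own. This is an illustration under stated assumptions rather than gavagai's implementation: it uses a percentile bootstrap (the source does not say which variant is used), and `candidate_fraction` is a cheap stand-in statistic chosen so the example runs quickly. In practice the statistic would be the score itself.

```python
import numpy as np


def bootstrap_ci(rows, statistic, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for statistic(rows), resampling rows with replacement."""
    rng = np.random.default_rng(seed)
    n_rows = rows.shape[0]
    samples = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_rows, size=n_rows)  # one resample of feature rows
        samples[b] = statistic(rows[idx])
    low, high = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


def candidate_fraction(adjacency):
    """Stand-in statistic: share of rows with at least one candidate partner."""
    return float(adjacency.any(axis=1).mean())


# Hypothetical thresholded adjacency: 200 features in A, 50 in B, sparse candidates.
rng = np.random.default_rng(0)
adjacency = rng.random((200, 50)) < 0.02
ci_low, ci_high = bootstrap_ci(adjacency, candidate_fraction)
print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```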

Step 4 is the Quinean heart: we never collapse the candidate space to a single bijection.
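The counting and compression steps can be sketched in plain NumPy. This is a reading of the description above, not the library's code, and it makes several assumptions: the cosine relation is used, a "valid matching" is taken to mean an assignment of every row of A to a distinct allowed column of B, the log is natural, and a count of zero is clamped to 1 because the source does not say how that case is handled. The recursion is only suitable for small inputs.

```python
import math

import numpy as np


def cosine_similarity_matrix(w_a, w_b):
    """Pairwise cosine similarity between rows of w_a and rows of w_b."""
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    return a @ b.T


def count_matchings(adj, cap=1000):
    """Count row-to-distinct-column assignments allowed by adj, stopping at cap."""
    n_rows, n_cols = adj.shape
    if not adj.any():
        return cap  # empty adjacency is scored as radical indeterminacy
    used = [False] * n_cols
    count = 0

    def dfs(row):
        nonlocal count
        if count >= cap:
            return
        if row == n_rows:
            count += 1  # every row has been assigned a column
            return
        for col in np.flatnonzero(adj[row]):
            if not used[col]:
                used[col] = True
                dfs(row + 1)
                used[col] = False  # backtrack
                if count >= cap:
                    return

    dfs(0)
    return count


def compress(n):
    """Map a matching count to [0, 1) via 1 - 1 / (1 + log(n))."""
    n = max(n, 1)  # assumption: clamp so that log is defined
    return 1.0 - 1.0 / (1.0 + math.log(n))


# A permuted copy of the same dictionary admits exactly one matching.
w_a = np.eye(3)
w_b = np.eye(3)[[2, 0, 1]]
adj = cosine_similarity_matrix(w_a, w_b) >= 1 - 0.1
n_unique = count_matchings(adj)

# If every feature is a candidate for every other, all 3! assignments are valid.
n_ambiguous = count_matchings(np.ones((3, 3), dtype=bool))

print(n_unique, compress(n_unique))
print(n_ambiguous, round(compress(n_ambiguous), 4))
print(round(compress(1000), 4))
```

Under these assumptions a single matching maps to a score of 0, and the cap of 1000 maps to roughly 0.87, so the cap also bounds how close the score can get to 1.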

Roadmap

| Version | Adds |
| --- | --- |
| v0.1.0 | scalar `gavagai_score`, CLI gate, 3 equivalence relations |
| v0.1.x | per-relation ε normalization, `coverage` diagnostic for sparse adjacency |
| v0.2.0 | holism propagator (Duhem-Quine) via `circuit-tracer` |
| v0.3.0 | ontological commitment detector (AlignSAE binding) |
| v1.0.0 | cross-paradigm translator (probe ↔ SAE ↔ patching) |

Anti-goals

Not an SAE trainer. Use SAELens.
Not a circuit visualizer. Use circuit-tracer.
Not a universal feature library. We measure indeterminacy; we do not pretend to eliminate it.

Reading

W. V. O. Quine, Word and Object (1960), ch. 2.
W. V. O. Quine, Ontological Relativity and Other Essays (1969).
Marks et al., Sparse Feature Circuits, ICLR 2025 (arXiv:2403.19647).
Bricken et al., Towards Monosemanticity, Anthropic 2023.
Arditi et al., Refusal in Language Models Is Mediated by a Single Direction, NeurIPS 2024 (arXiv:2406.11717).
Mechanistic Interpretability Needs Philosophy (arXiv:2506.18852).

License

Apache 2.0. See LICENSE and NOTICE.

The name draws on Quine's philosophical work as a scholarly reference. No endorsement or affiliation is claimed or implied.

Release history

0.1.3: May 19, 2026
0.1.2: May 17, 2026
0.1.1: May 17, 2026 (this version)
0.1.0: May 17, 2026

Download files

Source Distribution

gavagai-0.1.1.tar.gz (28.8 kB)

Uploaded May 17, 2026 Source

Built Distribution

gavagai-0.1.1-py3-none-any.whl (25.0 kB)

Uploaded May 17, 2026 Python 3

File details

Details for the file gavagai-0.1.1.tar.gz.

File metadata

Download URL: gavagai-0.1.1.tar.gz
Upload date: May 17, 2026
Size: 28.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gavagai-0.1.1.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `4485d593db556fd65b4b79d9777f9fb27ca671a0801ed3887d9b00c936a3cc84` |
| MD5 | `85f0a73ec0b52c7948cb4989eb670e6a` |
| BLAKE2b-256 | `52b244eeab117fe5655984f99c9c79ad0368d74901561ed977b36299cbb4e1c9` |

File details

Details for the file gavagai-0.1.1-py3-none-any.whl.

File metadata

Download URL: gavagai-0.1.1-py3-none-any.whl
Upload date: May 17, 2026
Size: 25.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gavagai-0.1.1-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `728bb8d1dede4f9d79484b6ffefeb5a831828c63063c4a187fb2fda46b4c2bd0` |
| MD5 | `edc7c183212d880181d8c5814ac5c587` |
| BLAKE2b-256 | `eafb39d2ae9289c2cf4b958b7767e3957c2c28b992a3125fb2aa25279c4ef9f9` |
