
openinterp

Python SDK + CLI for openinterp.org

Search the feature Atlas, generate Traces from your own SAE, rank against the public InterpScore leaderboard.



Install

pip install openinterp              # lite: Atlas + CLI (no torch, ~2 MB total)
pip install "openinterp[full]"      # + torch/transformers/safetensors for trace generation

Requires Python ≥ 3.10.


Part of a 5-repo ecosystem

Repo What's in it
.github Org profile + shared CoC + SECURITY
web Next.js site behind openinterp.org
notebooks 23 training + interpretability notebooks
cli (you are here) pip install openinterp — Python SDK
mechreward SAE features as dense RL reward

🚀 Quick start

Search the Atlas (offline, zero GPU)

$ openinterp atlas "overconfidence"
                    Atlas results: 'overconfidence'
┏━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━
┃ ID      ┃ Name                    ┃ Model             ┃ AUROC ┃ Description
┡━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━
│ f2503   │ overconfidence_pattern  │ Qwen/Qwen3.6-27B  │  0.54 │ Definitive…
│ f1847   │ urgency_assessment      │ Qwen/Qwen3.6-27B  │  0.68 │ Time-critic…
└─────────┴─────────────────────────┴───────────────────┴───────┴────────────
>>> from openinterp import search_features
>>> features = search_features("overconfidence", model="Qwen/Qwen3.6-27B")
>>> features[0].id
'f2503'

Generate a Trace from your own SAE

pip install "openinterp[full]"

openinterp trace \
    --model google/gemma-2-2b \
    --sae-repo YOUR_HF_USER/gemma2-2b-sae-first \
    --prompt "The capital of France is" \
    --layer 12 \
    --d-model 2304 --d-sae 16384 --k 64 \
    --out my_trace.json

This:

  1. Loads the base model in bf16 with SDPA (no flash-attn)
  2. Loads your SAE from HuggingFace (sae_lens safetensors format)
  3. Generates tokens, captures residuals at layer 12
  4. Applies the SAE, picks top-10 active features
  5. Writes a Trace JSON matching openinterp.org/observatory/trace byte-for-byte

Python API

from openinterp import generate_trace

trace = generate_trace(
    model_id="google/gemma-2-2b",
    sae_repo="YOUR_HF_USER/gemma2-2b-sae-first",
    prompt="The capital of France is",
    layer=12,
    d_model=2304,
    d_sae=16384,
    k=64,
)

print(trace.model_dump_json(indent=2))   # Trace Theater schema

With feature labels from notebook 04

# After running 04_discover_features.ipynb (emits feature_catalog.json):
openinterp trace ... --catalog feature_catalog.json

Trace features inherit names from your catalog.
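The catalog is just an id-to-name mapping as far as the trace step is concerned. A minimal sketch of the lookup, assuming a flat `{id: name}` JSON shape (the real schema is whatever `04_discover_features.ipynb` emits):

```python
import json, os, tempfile

# Hypothetical catalog contents -- illustrative only
catalog = {"f2503": "overconfidence_pattern", "f1847": "urgency_assessment"}

path = os.path.join(tempfile.mkdtemp(), "feature_catalog.json")
with open(path, "w") as f:
    json.dump(catalog, f)

with open(path) as f:
    names = json.load(f)

# A trace feature without a catalog entry falls back to its raw id
feature_id = "f2503"
label = names.get(feature_id, feature_id)
print(label)  # overconfidence_pattern
```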


🛡️ FabricationGuard (v0.2.0+)

Production hallucination probe on Qwen3.6-27B. AUROC 0.88 cross-task on SimpleQA, −88% confident-wrong reduction in mitigation mode, ~1ms scoring latency.

from openinterp import FabricationGuard

guard = FabricationGuard.from_pretrained("Qwen/Qwen3.6-27B")
output = guard.generate("Who won the 2003 Nobel Prize in Aerodynamics?", mode="abstain")
# → "I don't have reliable information to answer this confidently."

Methodology lineage: extends Anthropic's persona-vectors approach (Aug 2025, tested on 7-8B) to Qwen3.6-27B (3-4× larger) with formal cross-task AUROC + bootstrap CIs + mitigation-rate evaluation. Apache-2.0 production-grade implementation, not a proprietary platform. Probe artifact: caiovicentino1/FabricationGuard-linearprobe-qwen36-27b. Live demo: openinterp.org/products/fabricationguard.

🧠 ReasonGuard v0.1 (in registry)

Reasoning-faithfulness probe at L55 / mid_think on Qwen3.6-27B in thinking mode. Detects wrong-answer trajectories during the <think> block. Honest narrow scope: AUROC 0.888 within math reasoning (GSM8K), 0.605 cross-domain to commonsense (StrategyQA) — domain-bound, not generalized.

Layer × position interaction (novel): shallow layers (L31) favor end_question; deep layers (L55) favor mid_think. Position-of-faithfulness migrates with depth.

ProbeBench

ProbeBench is the public registry + leaderboard of small classifiers — linear probes, SAE-feature combinations, attention-circuit probes — that turn an LLM's internal activations into a calibrated risk score for hallucination, deception, eval-awareness, reward-hacking, and more. Every probe ships with a SHA-256-pinned artifact, a reproducer notebook, calibrated test-set metrics, and an OSI-approved license. The probebench SDK is the same surface we use to ship FabricationGuard v2 (AUROC 0.88 cross-task on SimpleQA) and is open to community submissions.
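The SHA-256 pinning is ordinary content addressing; a minimal sketch (the artifact bytes and digest here are stand-ins, not a real registry entry):

```python
import hashlib

def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """Return True iff the downloaded bytes match the registry-pinned digest."""
    return hashlib.sha256(data).hexdigest() == pinned_sha256

artifact = b"probe weights go here"         # stand-in for the real safetensors bytes
pin = hashlib.sha256(artifact).hexdigest()  # in practice, read from the registry

assert verify_artifact(artifact, pin)
assert not verify_artifact(artifact + b"tampered", pin)
```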

from openinterp import probebench
probe = probebench.load("openinterp/reasonguard-qwen36-27b-l55-mid_think")
score = probe.score(activations)  # P(wrong-answer trajectory)

Both numbers (within + cross) registered honestly per ProbeBench's anti-Goodhart norms. Probe artifact: caiovicentino1/ReasoningGuard-linearprobe-qwen36-27b. Live on openinterp.org/probebench.
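Conceptually, `probe.score` is a linear map plus a sigmoid over the residual activations; a toy numpy sketch with random weights (the shipped probe's weights live in the pinned artifact, and the width here is a stand-in):

```python
import numpy as np

rng = np.random.default_rng(42)
d_model = 4096  # stand-in residual width, not the real model dimension

w = rng.normal(size=d_model) * 0.02  # toy probe weights
b = 0.0

def score(activations: np.ndarray) -> np.ndarray:
    """P(wrong-answer trajectory) for a batch of residual vectors."""
    logits = activations @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

acts = rng.normal(size=(4, d_model))  # four fake mid-<think> residuals
p = score(acts)
```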

🧬 ProbeBench (v0.2.0+)

The first categorical leaderboard for activation probes — 8 categories, 7-axis ProbeScore, anti-Goodhart by construction.

from openinterp import probebench

probes = probebench.list_probes(category="hallucination")
probe  = probebench.load("openinterp/fabricationguard-qwen36-27b-l31-v2")
score  = probe.score(activations)
openinterp probebench list                       # show all registered probes
openinterp probebench load <probe-id>            # download + verify SHA-256
openinterp probebench validate ./my-probe/       # check artifact spec
openinterp probebench reproduce <probe-id>       # download reproducer notebook

Browse the leaderboard: openinterp.org/probebench.


📦 What's in v0.2.0


Command Status What it does
openinterp atlas <query> ✅ live Feature search with offline fallback to curated demo features
openinterp trace ... ✅ live (needs [full]) Real SAE trace generation, sae_lens format, any HF model
openinterp guard ... ✅ live FabricationGuard scoring + abstain mode on Qwen3.6-27B
openinterp probebench {list,load,score,validate,reproduce,submit} ✅ live ProbeBench v0.0.1 SDK
openinterp info ✅ live Version + optional-dep status

Planned v0.3.0

  • openinterp upload-trace <trace.json> → shareable openinterp.org URL
  • openinterp score --sae-repo X → compute InterpScore (wraps notebook 18)
  • openinterp steer --sae-repo X --feature Y --alpha Z → intervention (wraps notebook 06)
  • openinterp circuit --sae-repo X --prompt Y → attribution graph JSON (wraps notebook 14/15)
  • openinterp publish <repo> → HuggingFace release with model card
  • ReasonGuard v0.2 — multi-bench training (math + commonsense) to fix cross-domain transfer

Open an issue on the tracker if you'd like to take one of these on.


🛠️ Development

git clone https://github.com/OpenInterpretability/cli openinterp-cli
cd openinterp-cli
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev,full]"          # dev = pytest + ruff + build; full = torch + transformers
pytest -xvs                            # 5 tests, ~1s

Package layout

openinterp-cli/
├── pyproject.toml              # name='openinterp', hatchling build
├── openinterp/
│   ├── __init__.py             # public exports + __version__
│   ├── models.py               # pydantic types: AtlasFeature, Trace, TraceFeature
│   ├── atlas.py                # search_features() — HF API + curated fallback
│   ├── trace.py                # generate_trace() — real transformers-based impl
│   └── cli.py                  # click-based CLI: atlas / trace / info
├── tests/
│   ├── test_atlas.py
│   └── test_trace.py
├── CHANGELOG.md
├── CONTRIBUTING.md
└── README.md

Contribution recipe — add a new command

Full rules: CONTRIBUTING.md.

  1. Decide which notebook it wraps (score → 18, steer → 06, circuit → 14/15, publish → generic)
  2. Add a function to the matching file (openinterp/score.py, etc.). Keep it small — actual compute lives in the notebook.
  3. Expose it in __init__.py
  4. Add a @main.command() in cli.py with click decorators
  5. Add a smoke test in tests/test_<name>.py
  6. Update CHANGELOG.md under [Unreleased]
  7. PR title: Add openinterp <command>
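Steps 2–5 in miniature, assuming a hypothetical `score` command (all names illustrative; the real compute wraps notebook 18 and the full rules are in CONTRIBUTING.md):

```python
import click
from click.testing import CliRunner

# Step 2: keep the function small -- actual compute lives in the notebook
def compute_interpscore(sae_repo: str) -> float:
    """Hypothetical stub; notebook 18 does the real work."""
    return 0.0

# Step 4: expose it on the CLI with click decorators
@click.command()
@click.option("--sae-repo", required=True, help="HF repo of the SAE to score")
def score(sae_repo: str) -> None:
    click.echo(f"InterpScore({sae_repo}) = {compute_interpscore(sae_repo)}")

# Step 5: the smoke test (would live in tests/test_score.py)
result = CliRunner().invoke(score, ["--sae-repo", "user/my-sae"])
assert result.exit_code == 0
assert "user/my-sae" in result.output
```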

Hard rules:

  • Python ≥ 3.10 syntax (PEP 604 unions OK)
  • dtype=torch.bfloat16, never torch_dtype= (transformers 5.x deprecated)
  • SDPA only, never flash-attn
  • New heavy deps (torch tier) → add to [full] extra, not base
  • Every new public function has type hints + docstring

🚢 Release process (maintainer)

# 1. Bump version in BOTH:
#    pyproject.toml          ([project] version = "X.Y.Z")
#    openinterp/__init__.py  (__version__ = "X.Y.Z")
# 2. Update CHANGELOG.md — move [Unreleased] → [X.Y.Z] — YYYY-MM-DD

source .venv/bin/activate
rm -rf dist/
python -m build
python -m twine check dist/*
python -m twine upload dist/*     # needs PyPI token in ~/.pypirc

git tag vX.Y.Z
git push --tags

CI

Every PR runs:

  • pytest -xvs across Python 3.10, 3.11, 3.12 (see .github/workflows/ci.yml)
  • ruff check . (warn-only for now)
  • python -m build + twine check

Green required to merge.


Community


Standing on the shoulders of


License

Apache-2.0. Built by Caio Vicentino + OpenInterpretability. 2026.

openinterp.org · github.com/OpenInterpretability · hi@openinterp.org
