InterpreLens: A Lens for Interpreting Large Language Models Based on the Transformer Architecture

inteprelens

inteprelens is the extracted core package for transformer architecture analysis. It keeps the Causal Head Gating (CHG) workflow and adds tracing utilities for named transformer stages and internal logits.

Overview

Use inteprelens when you want to:

  • train CHG masks over attention heads
  • inspect necessary, sufficient, and facilitating heads
  • trace intermediate transformer states such as attention output projection inputs, block outputs, final norm states, and per-layer logits
  • score final-token logits, log-probabilities, and probabilities for calibration-style analysis
  • run gradient-based attribution through facilitating head circuits
  • probe gradients and kwargs at any layer via context-manager hook utilities
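Per-layer logits are usually computed with a "logit lens": each layer's hidden state is passed through the model's final norm and unembedding (lm_head). We assume, without having checked the inteprelens source, that the per-layer logits site works this way. The sketch below uses tiny random tensors, a hand-written Llama-style RMSNorm, and a hypothetical logit_lens helper in place of a real checkpoint.

```python
import torch

torch.manual_seed(0)

# Toy dimensions standing in for a real model (Llama-3.2-1B uses hidden size 2048).
batch, seq, hidden, vocab, n_layers = 1, 5, 16, 50, 4


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Llama-style RMSNorm: divide by the root-mean-square over the last dim, then scale."""
    return weight * x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


# Hypothetical stand-ins for captured per-layer hidden states,
# the final norm weight, and the unembedding (lm_head) matrix.
hidden_states = [torch.randn(batch, seq, hidden) for _ in range(n_layers)]
norm_weight = torch.ones(hidden)
lm_head = torch.nn.Linear(hidden, vocab, bias=False)


def logit_lens(h: torch.Tensor) -> torch.Tensor:
    """Project an intermediate hidden state to vocabulary logits."""
    return lm_head(rms_norm(h, norm_weight))


with torch.no_grad():
    per_layer_logits = [logit_lens(h) for h in hidden_states]

# Top-1 token id predicted at the final position by each layer.
top1 = [logits[0, -1].argmax().item() for logits in per_layer_logits]
print([tuple(l.shape) for l in per_layer_logits])
print(top1)
```

Comparing top1 across layers shows at which depth the eventual prediction first becomes the most likely token.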

This repo is the reusable core package only. It does not include downstream task pipelines or visualization workflows.

Installation

Install the runtime package:

pip install inteprelens

For local development:

uv sync --group dev
uv run pytest -q

For build and publish checks:

uv sync --group publish
uv run --group publish python -m build
uv run --group publish twine check dist/*

If you use a gated Hugging Face model such as meta-llama/Llama-3.2-1B, make sure HF_TOKEN is available in your environment or .env.
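A quick pre-flight check can save a failed download. The sketch below looks for HF_TOKEN in the environment and falls back to a simple .env file. The resolve_hf_token helper is hypothetical; we don't know whether inteprelens reads .env itself (for example via python-dotenv), and the parser only handles plain KEY=VALUE lines.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_hf_token(env_file: str = ".env") -> Optional[str]:
    """Return HF_TOKEN from the environment, falling back to a simple .env file.

    Hypothetical helper: handles KEY=VALUE lines, optional quotes, and
    '#' comments only; it is not a full dotenv parser.
    """
    token = os.environ.get("HF_TOKEN")
    if token:
        return token
    path = Path(env_file)
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "HF_TOKEN":
            return value.strip().strip('"').strip("'") or None
    return None


if __name__ == "__main__":
    if resolve_hf_token() is None:
        print("HF_TOKEN not found; gated models such as meta-llama/Llama-3.2-1B will fail to download.")
```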

Quick Start

from inteprelens import LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

results = analyzer.fit(
    texts=["What is the capital of France?"],
    targets=["Paris"],
    num_masks=1,
    num_updates=1,
    num_reg_updates=1,
    batch_size=1,
    verbose=False,
)

print(results.summary())
print(results.necessary_heads().head())

Usage Examples

CHG analysis

from inteprelens import LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

results = analyzer.fit(
    texts=[
        "The capital of France is",
        "2 + 2 equals",
    ],
    targets=[
        "Paris",
        "4",
    ],
    num_masks=1,
    num_updates=1,
    num_reg_updates=1,
    batch_size=1,
    verbose=False,
)

necessary = results.necessary_heads()
taxonomy = results.head_taxonomy()

print(necessary.head())
print(taxonomy.head())
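CHG works by multiplying each attention head's output by a learned gate before the heads are merged, then reading head roles off the fitted gate values. The stdlib sketch below shows only that gating step on toy per-head vectors. It assumes sigmoid-parameterized gates and uses a hypothetical gate_heads helper; it illustrates the idea and is not inteprelens's training code.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate_heads(head_outputs: List[List[float]], gate_logits: List[float]) -> List[float]:
    """Scale each head's output by sigmoid(gate_logit), then concatenate.

    head_outputs[h] is head h's output vector at one token position; the
    concatenated result is what the attention output projection would receive.
    """
    if len(head_outputs) != len(gate_logits):
        raise ValueError("need exactly one gate logit per head")
    merged: List[float] = []
    for out, logit in zip(head_outputs, gate_logits):
        g = sigmoid(logit)
        merged.extend(g * v for v in out)
    return merged


# Three toy heads with head_dim = 2.
heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Large positive logit ~ keep, large negative ~ ablate, 0 ~ half strength.
logits = [20.0, -20.0, 0.0]
print([round(v, 3) for v in gate_heads(heads, logits)])  # [1.0, 2.0, 0.0, 0.0, 2.5, 3.0]
```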

Trace transformer stages and logits

from inteprelens import LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

trace = analyzer.trace(
    texts="Paris is the capital of",
    layers=[0],
    sites=["attn_o_proj_pre", "final_norm", "logits"],
)

print(trace.get("attn_o_proj_pre", 0).shape)
print(trace.get("logits", 0).shape)
print(trace.final_logits.shape)

trace.final_logits contains the model's final output logits for the traced batch.
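The attn_o_proj_pre site sits just before the attention output projection, where per-head outputs are normally concatenated along the hidden dimension. Assuming the traced tensor has shape [batch, seq, num_heads * head_dim] (32 heads of size 64 for Llama-3.2-1B; confirm against model.config for other models), it can be split back into heads with a reshape. The sketch uses a random tensor in place of trace.get("attn_o_proj_pre", 0).

```python
import torch

# Llama-3.2-1B attention geometry (check model.config for other models).
num_heads, head_dim = 32, 64

# Stand-in for trace.get("attn_o_proj_pre", 0): [batch, seq, num_heads * head_dim].
x = torch.randn(2, 7, num_heads * head_dim)

# Split the concatenated head outputs back into individual heads.
batch, seq, _ = x.shape
per_head = x.reshape(batch, seq, num_heads, head_dim)

# Example per-head statistic: L2 norm of each head's output at the final token.
final_token_norms = per_head[:, -1].norm(dim=-1)  # [batch, num_heads]

# Zeroing one head's slice here removes its contribution to o_proj's output.
ablated = per_head.clone()
ablated[..., 5, :] = 0.0
ablated_flat = ablated.reshape(batch, seq, num_heads * head_dim)

print(per_head.shape, final_token_norms.shape)
```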

Score final-token logits, log-probabilities, and probabilities

from inteprelens import LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

scores = analyzer.score(
    texts=[
        "The capital of France is",
        "2 + 2 equals",
    ],
    temperature=1.0,
)

print(scores.logits.shape)
print(scores.log_probs.shape)
print(scores.probs.shape)

final_logits = scores.final_token_logits()
final_log_probs = scores.final_token_log_probs()
final_probs = scores.final_token_probs()

print(final_logits.shape)
print(final_log_probs.shape)
print(final_probs.shape)

Use logits as the canonical calibration output, derive log_probs for stable token scoring, and use probs when you need confidence-style metrics.
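The three outputs are related by log_probs = log_softmax(logits / T) and probs = exp(log_probs). We assume score applies temperature this way (the usual convention, not confirmed here). The stdlib sketch computes a numerically stable log-softmax on toy final-token logits and adds expected calibration error (ECE), one standard confidence-style metric; ECE is our addition and is not part of the inteprelens API.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of logits / temperature (log-sum-exp trick)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence| per bin."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(ok for _, ok in b) / len(b)
        conf = sum(c for c, _ in b) / len(b)
        ece += len(b) / total * abs(acc - conf)
    return ece


# Toy final-token logits over a 4-token vocabulary.
final_logits = [2.0, 1.0, 0.5, -1.0]
log_probs = log_softmax(final_logits, temperature=1.0)
probs = [math.exp(lp) for lp in log_probs]
confidence = max(probs)  # top-1 probability, a common confidence score
print([round(p, 4) for p in probs], round(confidence, 4))
```

Subtracting the maximum before exponentiating keeps exp() from overflowing on large logits, which is why log_probs is the safer quantity for token scoring.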

Gradient-based attribution through facilitating heads

from inteprelens import CausalCircuitAttribution, LensAPI

analyzer = LensAPI.from_pretrained("meta-llama/Llama-3.2-1B")

results = analyzer.fit(
    texts=["The capital of France is"],
    targets=["Paris"],
    num_masks=1,
    num_updates=1,
    num_reg_updates=1,
    batch_size=1,
    verbose=False,
)

facilitating_mask = results.get_facilitating_mask()
attribution = CausalCircuitAttribution(analyzer.model, analyzer.tokenizer)

sentence_scores = attribution.compute_sentence_importance(
    document="Paris is the capital of France. It is one of Europe's largest cities.",
    sentences=[
        "Paris is the capital of France.",
        "It is one of Europe's largest cities.",
    ],
    summary="Paris is the capital of France.",
    facilitating_mask=facilitating_mask,
)

print(sentence_scores)
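The internals of compute_sentence_importance are not documented here. One common way to turn token-level gradient attributions into sentence scores is to sum them over each sentence's token span; the stdlib sketch below shows only that aggregation step, with made-up token offsets and scores. The sentence_scores_from_tokens name and the signed-sum choice are our assumptions. Real offsets could come from a Hugging Face fast tokenizer called with return_offsets_mapping=True.

```python
from typing import List, Sequence, Tuple


def sentence_scores_from_tokens(
    document: str,
    sentences: Sequence[str],
    offsets: Sequence[Tuple[int, int]],
    token_scores: Sequence[float],
) -> List[float]:
    """Sum token attribution scores over each sentence's character span.

    offsets[i] is the (start, end) character span of token i in document.
    Tokens with empty spans (e.g. BOS) are skipped.
    """
    # Locate each sentence in order so repeated sentences map to distinct spans.
    spans: List[Tuple[int, int]] = []
    cursor = 0
    for sent in sentences:
        start = document.find(sent, cursor)
        if start < 0:
            raise ValueError(f"sentence not found in document: {sent!r}")
        spans.append((start, start + len(sent)))
        cursor = start + len(sent)

    totals = [0.0] * len(sentences)
    for (tok_start, tok_end), score in zip(offsets, token_scores):
        if tok_end <= tok_start:
            continue
        for i, (s_start, s_end) in enumerate(spans):
            if s_start <= tok_start < s_end:
                totals[i] += score
                break
    return totals


document = "Paris is the capital of France. It is one of Europe's largest cities."
sentences = ["Paris is the capital of France.", "It is one of Europe's largest cities."]
offsets = [(0, 0), (0, 5), (5, 8), (32, 34), (61, 68)]  # toy spans; (0, 0) mimics BOS
token_scores = [9.0, 0.5, 0.25, 0.125, 0.5]             # toy attributions
print(sentence_scores_from_tokens(document, sentences, offsets, token_scores))  # [0.75, 0.625]
```

Summing signed scores lets positive and negative evidence cancel; summing absolute values instead measures how much a sentence matters regardless of direction.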

Hook utilities

All hooks are context managers — handles are removed automatically on exit.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from inteprelens.models.adapters import get_adapter
from inteprelens import capture_gradients, capture_kwargs_hook, tensor_grad_hook

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16
).cuda()
adapter = get_adapter(model)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
inputs = tokenizer("The Eiffel Tower is in", return_tensors="pt").to("cuda")
target_id = tokenizer(" Paris", add_special_tokens=False)["input_ids"][0]

# 1. capture_gradients — grad_output flowing through attention layer 0
with capture_gradients(adapter.get_attention_module(0), site="output") as grads:
    loss = -model(**inputs).logits[0, -1, target_id]
    loss.backward()
print(grads[0][0].shape)   # [batch, seq, hidden]

# 2. capture_kwargs_hook — inspect attention_mask kwargs at layer 0
with capture_kwargs_hook(adapter.get_layers()[0]) as kw:
    with torch.no_grad():
        model(**inputs)
print(kw[0].get("attention_mask"))

# 3. tensor_grad_hook — gradient probe on MLP activation
act_ref = []
h = adapter.get_mlp(0).register_forward_hook(lambda m, i, o: act_ref.append(o))
out = model(**inputs)
h.remove()
with tensor_grad_hook(lambda: act_ref[0]) as grads:
    loss = -out.logits[0, -1, target_id]
    loss.backward()
print(grads[0].shape)   # [batch, seq, hidden]
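The "handles are removed automatically on exit" behaviour follows a standard pattern: register a PyTorch hook, yield a results list, and remove the handle in a finally block. The sketch below applies that pattern to Module.register_full_backward_hook on a toy nn.Linear. The grad_output_recorder helper is hypothetical and is not the inteprelens implementation of capture_gradients.

```python
from contextlib import contextmanager
from typing import Iterator, List, Optional, Tuple

import torch
from torch import nn


@contextmanager
def grad_output_recorder(
    module: nn.Module,
) -> Iterator[List[Tuple[Optional[torch.Tensor], ...]]]:
    """Record grad_output for module during backward; the hook is removed on exit."""
    records: List[Tuple[Optional[torch.Tensor], ...]] = []

    def hook(mod, grad_input, grad_output):
        # Detach so the stored tensors don't keep the autograd graph alive.
        records.append(tuple(g.detach() if g is not None else None for g in grad_output))
        return None  # leave gradients unchanged

    handle = module.register_full_backward_hook(hook)
    try:
        yield records
    finally:
        handle.remove()


torch.manual_seed(0)
layer = nn.Linear(4, 3)
x = torch.randn(2, 4, requires_grad=True)

with grad_output_recorder(layer) as grads:
    layer(x).sum().backward()
print(len(grads), grads[0][0].shape)  # one record; grad_output matches the output's shape

# After the with block the hook is gone, so another backward pass records nothing.
layer(x).sum().backward()
print(len(grads))
```

The finally clause matters: if backward raises inside the with block, the hook is still removed and cannot fire on later, unrelated passes.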

Public API

Classes

Class | Description
--- | ---
LensAPI | High-level interface: CHG fitting, token scoring, named-site tracing.
TransformerTracer | Lower-level tracer for direct transformer-site activation capture.
CausalCircuitAttribution | Gradient attribution through facilitating CHG heads.
CHGDataset | Helper for building CHG-ready datasets from prompt/target pairs.

Hook utilities

All functions are context managers exported from inteprelens directly.

Function | PyTorch API | Use case
--- | --- | ---
capture_gradients | register_full_backward_hook | Capture grad_input / grad_output per module during backward
capture_kwargs_hook | register_forward_pre_hook(with_kwargs=True) | Inspect or modify kwargs (attention_mask, etc.) before forward (≥ PyTorch 2.0)
tensor_grad_hook | Tensor.register_hook | Gradient probe on any activation tensor
autograd_node_hook | autograd.graph.Node.register_hook | Low-level autograd graph traversal during backward (≥ PyTorch 2.0)
capture_backward_pre | register_module_full_backward_pre_hook | Capture grad_output entering a module before backward executes (≥ PyTorch 2.1)
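The lower-level entries in this table wrap standard PyTorch hook APIs. The sketch below calls those PyTorch APIs directly on a toy nn.Linear (not through inteprelens) to show what each one receives. With loss = sum(3 * y), all three hooks observe the same gradient with respect to the layer output, 3 everywhere, from different places in the graph. Node.register_hook and Module.register_full_backward_pre_hook require a PyTorch 2.x release.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(3, 2)
x = torch.randn(1, 3, requires_grad=True)
seen = {}


def module_pre_hook(module, grad_output):
    # Module level: grad_output arriving at the module, before grad_input is computed.
    seen["module_pre"] = grad_output[0].detach().clone()
    return None  # None keeps grad_output unchanged


def tensor_hook(grad):
    # Tensor level: gradient of the loss with respect to this specific activation.
    seen["tensor"] = grad.detach().clone()


def node_hook(grad_inputs, grad_outputs):
    # Autograd-node level: fires when the node that produced y runs in backward.
    seen["node_out"] = grad_outputs[0].detach().clone()


pre_handle = layer.register_full_backward_pre_hook(module_pre_hook)
y = layer(x)
tensor_handle = y.register_hook(tensor_hook)
node_handle = y.grad_fn.register_hook(node_hook)

(3.0 * y).sum().backward()

# Remove every handle once the probe is done.
for handle in (pre_handle, tensor_handle, node_handle):
    handle.remove()
print({name: grad.tolist() for name, grad in seen.items()})
```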

Acknowledgements

inteprelens builds on and adapts code and ideas from the Causal Head Gating project. The extracted core package keeps CHG support and extends it with transformer-stage tracing and internal-logit inspection for architecture analysis workflows.

License

This project is released under the MIT License. See LICENSE.

Download files

Source Distribution

inteprelens-0.2.1.tar.gz (59.6 kB)

Uploaded Source

Built Distribution

inteprelens-0.2.1-py3-none-any.whl (64.9 kB)

Uploaded Python 3

File details

Details for the file inteprelens-0.2.1.tar.gz.

File metadata

  • Download URL: inteprelens-0.2.1.tar.gz
  • Upload date:
  • Size: 59.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for inteprelens-0.2.1.tar.gz
Algorithm | Hash digest
--- | ---
SHA256 | c77be51fa4fca0033637f9d9f4ead1f22fccd1e994a6c4f14355e15fdcbf372f
MD5 | 5998a20c3fbf4f768fc3b28e2fcdf5d0
BLAKE2b-256 | 320fe86a3c79de65f85de205344305011ccd6b03ef0164ee56cd58c47a8ced28
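A downloaded file can be checked against the published SHA256 digest with the standard library alone. The sketch streams the file in chunks so large archives don't need to fit in memory; the sha256_of and verify_sha256 names are illustrative.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA-256 digest with a published one (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    # SHA256 listed for the inteprelens-0.2.1 source distribution.
    expected = "c77be51fa4fca0033637f9d9f4ead1f22fccd1e994a6c4f14355e15fdcbf372f"
    sdist = Path("inteprelens-0.2.1.tar.gz")
    if sdist.exists():
        print("OK" if verify_sha256(str(sdist), expected) else "MISMATCH")
```

pip can enforce the same check at install time through its hash-checking mode: add inteprelens==0.2.1 --hash=sha256:<digest> to a requirements file and install with --require-hashes.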

Provenance

The following attestation bundles were made for inteprelens-0.2.1.tar.gz:

Publisher: publish.yml on Aisuko/inteprelens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file inteprelens-0.2.1-py3-none-any.whl.

File metadata

  • Download URL: inteprelens-0.2.1-py3-none-any.whl
  • Upload date:
  • Size: 64.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for inteprelens-0.2.1-py3-none-any.whl
Algorithm | Hash digest
--- | ---
SHA256 | c9abb800e5f40aaa794c3b056a75eb52e8984a6cfe5b6bfa7f3748e795f10a35
MD5 | f931c7f2601663839918961830c5a260
BLAKE2b-256 | dcb0c1e8fe4f42ccc7501ff0309dcd3588c3b56222d59717795a858f90278eb8

Provenance

The following attestation bundles were made for inteprelens-0.2.1-py3-none-any.whl:

Publisher: publish.yml on Aisuko/inteprelens

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
