
Utils and mechanistic interpretability interventions using nnsight


Try the demo in Colab or read the documentation.

nnterp

A unified interface for all transformer models that puts best NNsight practices for LLMs in everyone's hands.

Built on top of NNsight, nnterp provides a standardized interface for mechanistic interpretability research across all transformer architectures. Unlike transformer_lens which reimplements transformers, nnterp preserves the original HuggingFace implementations while solving the naming convention chaos through intelligent renaming.

Why nnterp?

The Problem: Every transformer model uses different naming conventions - GPT-2 uses transformer.h, LLaMA uses model.layers, OPT uses something else entirely. This makes mechanistic interpretability research painful, as you can't simply swap the model name and expect the rest of your code to work.
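
To make the mismatch concrete, here is a stdlib-only sketch of the attribute paths where common HuggingFace architectures keep their list of decoder blocks (the paths are real HF attribute names; the lookup helper itself is illustrative, not part of nnterp):

```python
# Attribute path of the decoder-block list in common HF architectures.
LAYER_PATHS = {
    "gpt2": "transformer.h",
    "llama": "model.layers",
    "opt": "model.decoder.layers",
    "gpt_neox": "gpt_neox.layers",
}

def layer_path(architecture: str) -> str:
    """Return where this architecture stores its decoder blocks."""
    return LAYER_PATHS[architecture]

print(layer_path("gpt2"))   # transformer.h
print(layer_path("llama"))  # model.layers
```

Any code written against one of these paths silently breaks on the others; nnterp's standardized naming removes this class of bug.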

The Solution: nnterp standardizes all models to use something close to the LLaMA naming convention:

StandardizedTransformer
├── layers
│   ├── self_attn
│   └── mlp
├── ln_final
└── lm_head

and includes built-in properties like model.logits and model.next_token_probs.
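
Conceptually, model.next_token_probs is just a softmax over the logits at the final token position. A stdlib-only sketch of that computation (the logit values are invented; nnterp computes this from real model outputs):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits at the last token position, over a 4-token vocabulary.
last_token_logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(last_token_logits)
print(sum(probs))  # 1.0 (up to float rounding)
```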

Unlike other libraries that reimplement transformers, nnterp uses NNsight's renaming feature to work with the original HuggingFace implementations, ensuring perfect compatibility and preventing subtle bugs.

Installation

  • pip install nnterp - Basic installation
  • pip install nnterp[display] - Includes visualization dependencies

Key Features

1. StandardizedTransformer - One Interface, All Models

The StandardizedTransformer class provides a unified interface across all transformer architectures, with automatic device mapping and standardized naming.

from nnterp import StandardizedTransformer

# Load any model with standardized naming (with device_map="auto" by default)
model = StandardizedTransformer("gpt2")  # or "meta-llama/Llama-2-7b-hf", "google/gemma-2b", etc.

# Clean, intuitive access to model internals
with model.trace("Hello world"):
    # Direct attention/MLP access
    attn_output = model.attentions_output[3]
    mlp_output = model.mlps_output[3]

    # Easy layer I/O access
    layer_5_output = model.layers_output[5]
model.layers_output[10] = layer_5_output  # Overwrite layer 10's output to skip layers 6-10

    # Built-in utilities
    logits = model.project_on_vocab(layer_5_output)  # Project to vocabulary
    model.steer(layers=[1, 3], steering_vector=vector, factor=0.5)  # Activation steering

2. Advanced Attention Access

Access and modify attention probabilities directly:

# Access attention probabilities during forward pass
with model.trace("The cat sat on the mat"):
    attn_probs = model.attention_probabilities[5].save()  # Layer 5 attention
    
    # Modify attention patterns
    attn_probs[:, :, :, 0] = 0  # Zero out attention to first token
    attn_probs /= attn_probs.sum(dim=-1, keepdim=True)  # Renormalize
    
    modified_logits = model.logits.save()

# Check what transformations are applied
model.attention_probabilities.print_source()
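
The zero-out-and-renormalize pattern above is plain probability arithmetic. Here is the same operation on a single attention row using stdlib lists (nnterp applies it to full attention tensors):

```python
def zero_and_renormalize(row, position=0):
    """Zero out attention to one key position, then renormalize to sum to 1."""
    row = list(row)
    row[position] = 0.0
    total = sum(row)
    return [p / total for p in row]

# One query's attention distribution over 4 key positions.
row = [0.4, 0.3, 0.2, 0.1]
new_row = zero_and_renormalize(row)
print(new_row)  # first entry is 0, the rest rescaled to sum to 1
```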

3. Efficient Activation Collection

Collect activations at a specific token position (to collect activations for all tokens, use NNsight's cache functionality):

from nnterp import get_token_activations, collect_token_activations_batched

# Single batch - automatically uses tracer.stop() for efficiency
prompts = ["The capital of France is", "The weather today is"]
activations = get_token_activations(model, prompts, idx=-1)  # Last token
# Shape: (num_layers, batch_size, hidden_size)

# Large-scale batched collection with memory optimization
large_prompts = ["Sample text " + str(i) for i in range(1000)]
batch_activations = collect_token_activations_batched(
    model, large_prompts, batch_size=16, layers=[3, 9, 11]  # Specific layers only
)
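
The batched collection above conceptually splits the prompt list into fixed-size chunks and processes each one in a separate forward pass. A minimal sketch of just the chunking logic (not nnterp's implementation):

```python
def chunked(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

large_prompts = ["Sample text " + str(i) for i in range(1000)]
batches = list(chunked(large_prompts, 16))
print(len(batches))      # 63 batches: 62 full ones plus a final batch of 8
print(len(batches[-1]))  # 8
```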

4. Prompt and Completion Tracking

Track probabilities for specific target tokens in the model's answer to a prompt. Both the first token of "target string" and of " target string" (with a leading space) are tracked automatically.

from nnterp import Prompt, run_prompts

# Create prompts with target tokens to track
# Automatically handles both "word" and " word" tokenization
prompts = [
    Prompt.from_strings(
        "The capital of France (not England or Spain) is",
        {
            "target": "Paris",
            "traps": ["London", "Madrid"],  # Track competing options
            "concept": ["France", "country"]  # Track related concepts
        },
        model.tokenizer
    )
]

# Get probabilities for all target categories
results = run_prompts(model, prompts, batch_size=2)
for category, probs in results.items():
    print(f"{category}: {probs.shape}")  # Shape: (batch_size,)
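
The dual "word" / " word" tracking matters because BPE-style tokenizers typically assign different first tokens to the bare and space-prefixed forms. A sketch with a hypothetical toy vocabulary (the ids and TOY_VOCAB are made up, not real tokenizer output):

```python
# Hypothetical toy vocabulary: real BPE tokenizers usually give
# "Paris" and " Paris" (leading space) different token ids.
TOY_VOCAB = {"Paris": 101, " Paris": 202, "London": 103, " London": 204}

def first_tokens_to_track(word):
    """Collect first-token ids for both the bare and space-prefixed form."""
    return {TOY_VOCAB[word], TOY_VOCAB[" " + word]}

print(first_tokens_to_track("Paris"))  # {101, 202}
```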

5. Mechanistic Interpretability Interventions

All the classic interventions with best practices built-in:

Logit Lens

from nnterp import logit_lens

prompts = ["The capital of France is", "The sun rises in the"]
layer_probs = logit_lens(model, prompts)  # (batch, layers, vocab_size)

# Combine with target tracking for precise analysis
results = run_prompts(model, prompts, get_probs_func=logit_lens)
# Returns probabilities for each target category across all layers
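
Under the hood, a logit lens projects an intermediate hidden state through the model's unembedding (ln_final followed by lm_head) and softmaxes the result. A stdlib sketch with a 2-dim hidden state and a 3-token vocabulary (all numbers invented, layer norm omitted for brevity):

```python
import math

def matvec(matrix, vec):
    """Multiply a (vocab x hidden) matrix by a hidden-state vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Invented unembedding matrix (3 vocab tokens, hidden size 2).
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_state = [0.5, 1.5]  # an intermediate layer's hidden state

layer_probs = softmax(matvec(unembed, hidden_state))
print(len(layer_probs))  # one probability per vocab token
```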

Patchscope

from nnterp import patchscope_lens, TargetPrompt, repeat_prompt

# Custom target prompt
target_prompt = TargetPrompt("The capital of France is", index_to_patch=-1)

# Or use the repeat prompt from the patchscopes paper
target_prompt = repeat_prompt(words=["king", "hello", "world"])

source_prompts = ["The capital of France is", "The capital of England is"]
patchscope_probs = patchscope_lens(
    model, source_prompts=source_prompts, target_patch_prompts=target_prompt
)
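
Mechanically, a patchscope copies a hidden state from the source run into the target prompt at index_to_patch before the target run continues. A list-based sketch of just that substitution (the states are string placeholders standing in for activation vectors):

```python
def patch(target_states, source_state, index_to_patch=-1):
    """Replace one position's hidden state in the target run with a
    hidden state taken from the source run."""
    patched = list(target_states)
    patched[index_to_patch] = source_state
    return patched

# Placeholder per-position hidden states (real ones are tensors).
target_states = ["t0", "t1", "t2"]
source_state = "s_last"

print(patch(target_states, source_state))  # ['t0', 't1', 's_last']
```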

Activation Steering

from nnterp import steer
import torch

# Create steering vector (e.g., from activation differences between concepts)
steering_vector = torch.randn(model.config.hidden_size)

with model.trace("The weather today is"):
    # Apply steering at multiple layers
    steer(model, layers=[5, 10], steering_vector=steering_vector, factor=1.5)
    steered_output = model.logits.save()

# Or use the built-in method
with model.trace("Hello, how are you?"):
    model.steer(layers=[1, 3], steering_vector=steering_vector, factor=0.5)
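
Steering itself is simple arithmetic: add factor times the steering vector to the residual stream at the chosen layers. The underlying vector addition in stdlib form (the hidden states here are toy 3-dim lists, not real activations):

```python
def apply_steering(hidden, steering_vector, factor):
    """Add factor * steering_vector to a hidden-state vector."""
    return [h + factor * s for h, s in zip(hidden, steering_vector)]

hidden = [1.0, 0.0, -1.0]
steering_vector = [0.5, 0.5, 0.5]
print(apply_steering(hidden, steering_vector, factor=2.0))  # [2.0, 1.0, 0.0]
```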

6. Visualization Tools

Built-in plotting for immediate insights:

from nnterp.display import plot_topk_tokens, prompts_to_df

# Visualize top-k token probabilities across layers
probs = logit_lens(model, "The capital of France is")
fig = plot_topk_tokens(
    probs[0],  # First prompt
    model.tokenizer,
    k=5,
    title="Layer-wise Token Predictions",
    width=1000,
    height=600
)
fig.show()

# Convert prompts to DataFrame for analysis
df = prompts_to_df(prompts, model.tokenizer)
print(df.head())

Complete Research Example

from nnterp import StandardizedTransformer, Prompt, run_prompts, logit_lens
from nnterp.display import plot_topk_tokens
import torch
import plotly.graph_objects as go

# Load any model with unified interface
model = StandardizedTransformer("google/gemma-2-2b")

# Create research prompts with target tracking
prompts = [
    Prompt.from_strings(
        "The translation of 'car' in French is",
        {"target": ["voiture", "bagnole"], "english": "car"},
        model.tokenizer
    ),
    Prompt.from_strings(
        "The translation of 'cat' in Spanish is", 
        {"target": ["gato", "minino"], "english": "cat"},
        model.tokenizer
    )
]

# Run analysis with logit lens intervention
results = run_prompts(model, prompts, batch_size=2, get_probs_func=logit_lens)

# Create publication-ready visualization
mean_probs = {category: probs.mean(dim=0) for category, probs in results.items()}
fig = go.Figure()
for category, probs in mean_probs.items():
    fig.add_trace(go.Scatter(
        x=list(range(len(probs))), y=probs.tolist(),
        mode="lines+markers", name=category
    ))
fig.update_layout(
    title="Target Token Emergence Across Layers",
    xaxis_title="Layer", yaxis_title="Mean Probability"
)
fig.show()

# Advanced interventions
with model.trace("The weather today is"):
    # Skip early layers
    model.skip_layers(0, 5)
    
    # Apply steering
    steering_vector = torch.randn(model.config.hidden_size)
    model.steer(layers=[10, 15], steering_vector=steering_vector, factor=1.2)
    
    # Project intermediate states
    layer_20_logits = model.project_on_vocab(model.layers_output[20])
    final_logits = model.logits.save()

Automatic Testing & Validation

nnterp includes comprehensive automatic testing to prevent silent failures. When you load a model, fast tests automatically run to ensure:

  • Model renaming correctness: All modules are properly renamed to the standardized interface
  • Module output shapes: Layer outputs and attention probabilities have the expected shapes
  • Attention probabilities: Attention at a token position sums to 1
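
The attention-probability check amounts to verifying that each query position's attention distribution sums to 1. A sketch of such a validation (the rows are toy values, not nnterp's test code):

```python
def rows_sum_to_one(attn_rows, tol=1e-6):
    """Check that every query position's attention distribution sums to 1."""
    return all(abs(sum(row) - 1.0) <= tol for row in attn_rows)

valid = [[1.0], [0.7, 0.3], [0.2, 0.5, 0.3]]  # causal-style rows
broken = [[0.9], [0.7, 0.2]]

print(rows_sum_to_one(valid))   # True
print(rows_sum_to_one(broken))  # False
```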

Also, when a new version of nnterp is released, tests are run on most supported model architectures. These test results ship with the package, and nnterp will warn you if the architecture you're using was not tested, or if any of its tests failed. If you use a different version of NNsight or transformers, or another model architecture, you can run the tests manually:

# Test specific models
python -m nnterp run_tests --model-names "gpt2" "meta-llama/Llama-2-7b-hf"

# Test using toy models (if available)
python -m nnterp run_tests --class-names "LlamaForCausalLM" "GPT2LMHeadModel"

Dependencies

The transformers version is pinned to 4.53.x, as it is the only version tested so far. When a model is loaded in StandardizedTransformer, it goes through a series of checks to verify that the model is still compatible.

  • Core: nnsight>=0.5.0, transformers==4.53.x
  • Visualization: plotly, pandas (install with [display] extra)

I found a bug!

Before opening an issue, make sure you have an MWE (minimal working example) that reproduces the issue and, if possible, the equivalent code using nnsight.LanguageModel. If the NNsight MWE also fails, please open an issue on the NNsight repository. Also make sure that you can load the model with AutoModelForCausalLM from transformers.

Known issues:

  • You might encounter cryptic errors of the form With block not found at line xyz. In this case, restart your notebook; if that doesn't fix it, delete your Python cache, e.g. using
find . -type d -name "__pycache__" -exec rm -rf {} + ; find . -name "*.pyc" -delete ; find . -name "*.pyo" -delete

Contributing

Contributions are welcome! If a functionality is missing and you implemented it for your research, please open a PR so that people in the community can benefit from it. That includes adding support for new models with custom renamings!

Next steps

Here are some nice features that could be cool to have, and for which I'd be happy to accept PRs (ordered by most to least useful imo):

  • Add helpers for getting gradients
  • Add support for vllm when NNsight supports it
  • Add helpers for NNsight's cache, as it returns raw tuple outputs instead of clean tensors
  • Add access to k/q/v
  • Fix typing

Development

  • Install the development environment with make dev or uv sync --all-extras. Add uv pip install flash-attn --no-build-isolation to support models like Phi that require flash-attn. You might encounter the with block not found at line xyz error when running the tests; in this case, run make clean to remove the Python cache and try again.
  • Create a git tag with the version number git tag vx.y.z; git push origin vx.y.z
  • Build with python -m build
  • Publish with e.g. twine upload dist/*x.y.z*
  • Test with pytest --cache-clear. --cache-clear is mandatory for now; otherwise NNsight's source handling can break. It might not be sufficient, in which case you can run make clean to remove the Python cache.

Citation

If you use nnterp in your research, you can cite it as:

@inproceedings{
dumas2025nnterp,
title={nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers},
author={Cl{\'e}ment Dumas},
booktitle={Mechanistic Interpretability Workshop at NeurIPS 2025},
year={2025},
url={https://openreview.net/forum?id=ACic3VDIHp}
}
