Train probes on language model activations for AI safety monitoring

lmprobe: Language Model Probe Library

Requires Python 3.10+. License: MIT.

This library supports the use of language model "activations" or "latents" to build text classifiers. The intent is to help detect and reduce misuse of AI - for example, chemical, biological, radiological and nuclear (CBRN) weapons development, social engineering at scale, and the development of novel cybersecurity attack vectors.

Linear and Simple Models for LLMs

"Linear Probes" have emerged as an effective and practical way to monitor large language model activity.

Background

First introduced by Alain & Bengio (2016) as "thermometers" for measuring what neural networks learn at each layer, linear probes have since been refined through work on probe design and selectivity and validated by evidence supporting the linear representation hypothesis. The Representation Engineering framework (Zou et al., 2023) demonstrated that probes can monitor safety-relevant properties like honesty and power-seeking. Recent AI safety research has shown promising results: Anthropic's work on detecting sleeper agents achieved >99% AUROC using simple linear classifiers, and Apollo Research's strategic deception detection work demonstrates that probes trained on simple contrast pairs can generalize to realistic scenarios like insider trading concealment and sandbagging on safety evaluations.

lmprobe Use Cases

The goal of lmprobe is to make text classifiers for language models easy to build, experiment with, and deploy at inference time. While much of the research has focused on complex, emergent risky behavior, this library targets simpler use cases, such as detecting human misuse of an AI chatbot.

Compatibility

By default, lmprobe uses HuggingFace Transformers to manage models and extract latents during inference. The library also supports nnsight for remote execution on NDIF (National Deep Inference Fabric), allowing you to probe large models without local GPU resources.

Installation

pip install lmprobe

Optional extras:

pip install lmprobe[hub]         # HuggingFace Hub (activation datasets)
pip install lmprobe[s3]          # S3 cache backend
pip install lmprobe[nnsight]     # nnsight/NDIF remote execution
pip install lmprobe[embeddings]  # Sentence-transformers baselines
pip install lmprobe[auto]        # Automatic layer selection (Group Lasso)

Environment Setup

For remote execution (large models via nnsight/NDIF):

export NDIF_API_KEY="your-api-key-here"

Example Usage


from lmprobe import Probe

positive_prompts = [  # positive class: "dog" without saying "dog"
    "Who wants to go for a walk?",
    "My tail is wagging with delight.",
    "Fetch the ball!",
    "Good boy!",
    "Slobbering, chewing, growling, barking.",
]

negative_prompts = [  # negative class: "cat" without saying "cat"
    "Enjoys lounging in the sun beam all day.",
    "Purring, stalking, pouncing, scratching.",
    "Uses a litterbox, throws sand all over the room.",
    "Tail raised, back arched, eyes alert, whiskers forward.",
]

# Configure the probe
probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,                              # int, list[int], or "all"
    pooling="last_token",                   # applies to both train and inference
    classifier="logistic_regression",       # or pass sklearn estimator
    device="auto",
    remote=False,                           # True for nnsight remote execution
    random_state=42,                        # for reproducibility
)

# Fit using contrastive prompts
probe.fit(positive_prompts, negative_prompts)

# Predict on new examples
test_prompts = [
    "Arf! Arf! Let's go outside!",
    "Knocking things off the counter for sport.",
]
predictions = probe.predict(test_prompts)          # [1, 0]
probabilities = probe.predict_proba(test_prompts)  # [[0.12, 0.88], [0.91, 0.09]]

# Evaluate
accuracy = probe.score(test_prompts, [1, 0])

# Save/load for deployment
probe.save("dog_vs_cat_probe.pkl")
loaded_probe = Probe.load("dog_vs_cat_probe.pkl")

Note: LinearProbe still works as an alias for Probe.


Remote Execution for Large Models

Use remote=True with backend="nnsight" to run inference on large models via nnsight's remote servers:

probe = Probe(
    model="meta-llama/Llama-3.1-70B-Instruct",
    layers="middle",
    backend="nnsight",
    remote=True,  # Requires NDIF_API_KEY
)

probe.fit(positive_prompts, negative_prompts)

# Override remote per-call (e.g., train remote, predict local)
predictions = probe.predict(test_prompts, remote=False)

Multi-Layer Probing

When selecting multiple layers, activations are concatenated along the hidden dimension:

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=[14, 15, 16],  # 3 layers x 4096 dims = 12,288-dim input to classifier
)
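
The concatenation described above can be pictured with plain Python lists. This is a minimal sketch, not lmprobe's implementation: `concat_layers` is a hypothetical helper, and ordering the blocks by ascending layer index is an assumption.

```python
def concat_layers(per_layer):
    """Concatenate one activation vector per layer into a single feature vector.

    `per_layer` maps layer index -> list of floats. Blocks are joined in
    ascending layer order (an assumption made for this sketch).
    """
    features = []
    for layer in sorted(per_layer):
        features.extend(per_layer[layer])
    return features


hidden_dim = 4096
acts = {layer: [0.0] * hidden_dim for layer in (14, 15, 16)}
features = concat_layers(acts)
print(len(features))  # 12288
```

The classifier therefore sees a single vector whose length grows linearly with the number of selected layers.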

Layer Sweep

Train an independent probe for each layer to find the most informative layers, without loading all layers into memory at once:

result = Probe.sweep_layers(
    model="meta-llama/Llama-3.1-8B-Instruct",
    positive_prompts=positive_prompts,
    negative_prompts=negative_prompts,
    layers="all",            # or a list of specific layers
    classifier="ridge",
)

# Score all layers (test_labels are the true labels for test_prompts)
test_labels = [1, 0]
scores = result.score(test_prompts, test_labels)
# {0: 0.52, 1: 0.55, ..., 31: 0.78}

# Find the best layer
best = result.best_layer(test_prompts, test_labels)
print(f"Best layer: {best}")

# Predict with any single layer's probe
preds = result.probes[best].predict(test_prompts)

You can also use sweep as a layer spec string:

model = "meta-llama/Llama-3.1-70B-Instruct"        # 80 layers, so 55-65 is in range

probe = Probe(model=model, layers="sweep")        # sweep all layers
probe = Probe(model=model, layers="sweep:10")      # sweep every 10th layer
probe = Probe(model=model, layers="sweep:55-65")   # sweep a specific range
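
To make the spec strings concrete, here is a rough sketch of how integer, list, "middle", "last", "all", and sweep specs could be turned into layer indices. `resolve_layers` is a hypothetical helper, not part of lmprobe. The exact boundaries of the "middle third" and the choice to start "every k-th layer" at layer 0 are assumptions. Data-dependent specs ("auto", "fast_auto") are deliberately not handled.

```python
def _index(i, n_layers):
    """Normalize a possibly negative layer index, as Python list indexing does."""
    if not -n_layers <= i < n_layers:
        raise IndexError(f"layer {i} out of range for {n_layers} layers")
    return i % n_layers


def resolve_layers(spec, n_layers):
    """Map a layer spec to concrete layer indices (illustrative rules only)."""
    if isinstance(spec, int):
        return [_index(spec, n_layers)]
    if isinstance(spec, (list, tuple)):
        return [_index(i, n_layers) for i in spec]
    if spec in ("all", "sweep"):
        return list(range(n_layers))
    if spec == "last":
        return [n_layers - 1]
    if spec == "middle":
        # Assumed boundaries of the "middle third".
        return list(range(n_layers // 3, 2 * n_layers // 3))
    if spec.startswith("sweep:"):
        arg = spec.split(":", 1)[1]
        if "-" in arg:
            start, stop = (int(part) for part in arg.split("-"))
            return list(range(start, stop + 1))  # inclusive range
        return list(range(0, n_layers, int(arg)))  # every k-th layer from 0
    raise ValueError(f"cannot resolve layer spec without data: {spec!r}")


print(resolve_layers(-1, 32))          # [31]
print(resolve_layers("sweep:10", 80))  # [0, 10, 20, 30, 40, 50, 60, 70]
```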

Advanced: Different Train vs Inference Pooling

For real-time monitoring, train on a stable representation but score every token:

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    pooling="last_token",          # base strategy
    inference_pooling="all",       # override: return per-token scores
)

probe.fit(positive_prompts, negative_prompts)

# Returns (batch, seq_len) - one score per token
token_scores = probe.predict_proba(["Wagging my tail happily!"])

For "flag if ANY token triggers" detection:

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    pooling="last_token",          # base strategy
    inference_pooling="max",       # override: max score across tokens
)
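
For intuition, "flag if ANY token triggers" amounts to thresholding the maximum per-token score. The sketch below works on a plain list of per-token probabilities. `flag_sequence` and the 0.5 threshold are illustrative assumptions, not lmprobe API.

```python
def flag_sequence(token_scores, threshold=0.5):
    """Return (flagged, index of the first token at or above threshold, or None)."""
    for i, score in enumerate(token_scores):
        if score >= threshold:
            return True, i
    return False, None


scores = [0.1, 0.7, 0.9]
print(flag_sequence(scores))             # (True, 1)
print(max(scores) >= 0.5)                # True, same decision as max pooling
print(flag_sequence([0.1, 0.2]))         # (False, None)
```

Scanning token by token gives the same decision as max pooling, and in a streaming setting it can stop at the first offending token.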

Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | HuggingFace model ID or local path |
| layers | int \| list[int] \| str | "middle" | Which residual stream layers to probe |
| pooling | str \| callable | "last_token" | Token aggregation (train & inference) |
| train_pooling | str \| callable | | Override pooling for fit() only |
| inference_pooling | str \| callable | | Override pooling for predict() only |
| classifier | str \| sklearn estimator | "logistic_regression" | Classification model |
| task | str | "classification" | "classification" or "regression" |
| device | str | "auto" | "auto", "cuda:0", "cpu" |
| remote | bool | False | Use nnsight remote execution (requires NDIF_API_KEY) |
| random_state | int \| None | None | Random seed for reproducibility (propagates to classifier) |
| batch_size | int | 8 | Prompts per forward pass during extraction |
| backend | str | "local" | "local" (HuggingFace) or "nnsight" |
| dtype | str \| None | None | Model dtype: "float32", "float16", "bfloat16" |
| normalize_layers | bool \| str | True | Per-layer normalization for multi-layer probes |
| preprocessing | str \| None | None | Pipeline before classifier: "standard", "pca", "standard+pca" |
| pca_components | int \| None | None | Number of PCA components |
| classifier_kwargs | dict \| None | None | Extra kwargs for classifier constructor |
| auto_candidates | list[int] \| list[float] \| None | None | Candidate layers for layers="auto" (fractional = relative position) |
| auto_alpha | float | 0.01 | Group Lasso regularization strength for layers="auto" |
| fast_auto_top_k | int \| None | None | Number of layers to select with layers="fast_auto" |
| mass_mean_augment | bool | False | Augment features with projection onto mass-mean direction |
| max_retries | int \| None | None | Retry attempts with exponential backoff for transient failures |

Layer Specifications

| Spec | Description |
|---|---|
| 16 | Single layer (negative indexing: -1 = last) |
| [14, 15, 16] | Multiple layers (concatenated) |
| "middle" | Middle third of layers |
| "last" | Last layer |
| "all" | All layers |
| "auto" | Automatic selection via Group Lasso (requires pip install lmprobe[auto]) |
| "fast_auto" | Fast selection via coefficient importance |
| "sweep" | Train independent probe per layer |
| "sweep:10" | Sweep every 10th layer |
| "sweep:55-65" | Sweep layers 55 through 65 |

Pooling Strategies

| Strategy | Training | Inference | Description |
|---|---|---|---|
| "last_token" | Y | Y | Final token activation (default, matches RepE literature) |
| "mean" | Y | Y | Mean across all tokens |
| "first_token" | Y | Y | First token (e.g., [CLS]) |
| "all" | Y | Y | Each token independently |
| "max" | | Y | Max score across tokens (post-probe) |
| "min" | | Y | Min score across tokens (post-probe) |

Pooling Stage Prefixes

Strategies can be prefixed with score: (post-probe) or activation: (pre-probe) to control when pooling happens:

  • Activation pooling (pre-probe): Reduces activations before classification, so the classifier sees one vector per sequence.
  • Score pooling (post-probe): Classifies every token independently, then reduces the per-token scores.

# Post-probe: classify each token, then average probabilities
probe = Probe(inference_pooling="score:mean")

# Pre-probe: take max activation per dimension, then classify once
probe = Probe(inference_pooling="activation:max")

# Bare names use sensible defaults (backward compatible):
# "mean" → activation:mean, "max" → score:max

All base strategies (last_token, first_token, mean, max, min) can be used with either prefix.
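
The difference between the two stages is easiest to see on a toy example. The sketch below uses a hand-written two-dimensional linear probe (`w`, `b`) and pure-Python helpers. All names here are hypothetical, and this is not how lmprobe implements pooling internally.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe_score(vec, w, b):
    """Probability from a linear probe: sigmoid(w . vec + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, vec)) + b)


def activation_mean(tokens, w, b):
    """Pre-probe pooling: average the token vectors, then classify once."""
    dim = len(tokens[0])
    mean_vec = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return probe_score(mean_vec, w, b)


def score_pool(tokens, w, b, reduce=max):
    """Post-probe pooling: classify every token, then reduce the scores."""
    return reduce([probe_score(t, w, b) for t in tokens])


tokens = [[-2.0, 0.0], [-2.0, 0.0], [4.0, 0.0]]  # one "hot" token at the end
w, b = [1.0, 0.0], 0.0

print(activation_mean(tokens, w, b))         # 0.5 (the hot token is averaged away)
print(round(score_pool(tokens, w, b), 3))    # 0.982 (the hot token dominates)
```

Activation pooling can wash out a single strongly activating token, while score pooling with max preserves it. That is why the stage matters for "any token" detection.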

Pooling Collision Rules

Explicit parameters override the base pooling value:

# pooling="mean", train_pooling="last_token" -> train=last_token, inference=mean
# pooling="mean", inference_pooling="max"    -> train=mean, inference=max
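
The override rule can be written as a small function. `resolve_pooling` is a hypothetical helper that only mirrors the two cases listed above. It is not taken from the library's source.

```python
def resolve_pooling(pooling="last_token", train_pooling=None, inference_pooling=None):
    """Return (train, inference) strategies; explicit overrides win over `pooling`."""
    train = train_pooling if train_pooling is not None else pooling
    inference = inference_pooling if inference_pooling is not None else pooling
    return train, inference


print(resolve_pooling("mean", train_pooling="last_token"))  # ('last_token', 'mean')
print(resolve_pooling("mean", inference_pooling="max"))     # ('mean', 'max')
```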

Classifier Options

lmprobe supports several built-in classifiers:

| Classifier | Description |
|---|---|
| "logistic_regression" | Standard logistic regression (default) |
| "logistic_regression_cv" | Logistic regression with cross-validated regularization |
| "ridge" | Ridge classifier - fast, no predict_proba |
| "svm" | Support Vector Machine with probability calibration |
| "lda" | Linear Discriminant Analysis |
| "mass_mean" | Mass-Mean Probing - uses direction between class centroids |
| "sgd" | Stochastic Gradient Descent classifier |
| "ensemble" | Ensemble of LogisticRegression with different regularization strengths |

# Use Mass-Mean Probing (simple but effective)
probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
    classifier="mass_mean",
)

# Pass extra kwargs to the classifier
probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
    classifier="logistic_regression",
    classifier_kwargs={"C": 0.01, "solver": "liblinear", "max_iter": 5000},
)
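
As background on the "mass_mean" option, the idea of using the direction between class centroids can be sketched in a few lines. This is a generic illustration under stated assumptions, not lmprobe's implementation. Placing the decision boundary at the midpoint of the two centroids is an assumption, and the helper names are hypothetical.

```python
def mean_vec(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(r[d] for r in rows) / n for d in range(len(rows[0]))]


def fit_mass_mean(pos, neg):
    """Return (direction, bias) with direction = mean(pos) - mean(neg)."""
    mu_pos, mu_neg = mean_vec(pos), mean_vec(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    # Bias places the decision boundary at the midpoint between centroids.
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def predict_mass_mean(x, direction, bias):
    """1 if x falls on the positive side of the midpoint, else 0."""
    return int(sum(d * xi for d, xi in zip(direction, x)) + bias > 0)


pos = [[2.0, 1.0], [3.0, 1.5]]
neg = [[-2.0, 0.5], [-3.0, 1.0]]
direction, bias = fit_mass_mean(pos, neg)
print(direction)                                        # [5.0, 0.5]
print(predict_mass_mean([2.5, 1.0], direction, bias))   # 1
print(predict_mass_mean([-1.0, 1.0], direction, bias))  # 0
```

No iterative optimization is involved, which is part of why this kind of probe is cheap to fit.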

Layer Importance Analysis

Identify which layers are most informative for your task:

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers="all",  # Extract all layers
    classifier="ridge",
)

probe.fit(positive_prompts, negative_prompts)

# Compute per-layer importance scores
# Returns np.ndarray of shape (n_layers,), normalized to sum to 1.0
importances = probe.compute_layer_importance(metric="l2")
best_idx = importances.argmax()
print(f"Most important layer: {probe.candidate_layers_[best_idx]}")
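
One plausible reading of an "l2" importance metric is the L2 norm of each layer's block of classifier coefficients, normalized to sum to 1. The sketch below assumes the coefficient vector is laid out as contiguous, equal-size blocks, one per layer. Both that layout and the helper name are assumptions.

```python
import math


def layer_importance_l2(coef, n_layers):
    """Per-layer L2 norm of coefficient blocks, normalized to sum to 1."""
    if len(coef) % n_layers:
        raise ValueError("coefficient length must be divisible by n_layers")
    block = len(coef) // n_layers
    norms = [
        math.sqrt(sum(c * c for c in coef[g * block:(g + 1) * block]))
        for g in range(n_layers)
    ]
    total = sum(norms)
    if total == 0:
        return [1.0 / n_layers] * n_layers  # no signal: fall back to uniform
    return [n / total for n in norms]


coef = [3.0, 4.0, 0.0, 0.0, 0.6, 0.8]  # 3 layers x 2 dims
importances = layer_importance_l2(coef, n_layers=3)
best_idx = importances.index(max(importances))
print(best_idx)  # 0
```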

Fast Auto Layer Selection

Automatically select the most important layers using fast importance analysis:

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers="fast_auto",      # Auto-select best layers
    fast_auto_top_k=3,       # Use top 3 most important layers
    normalize_layers=True,   # Normalize before combining
)

probe.fit(positive_prompts, negative_prompts)
print(f"Selected layers: {probe.selected_layers_}")

Automatic Layer Selection via Group Lasso

Use structured sparsity to let the model choose which layers matter:

# Requires: pip install lmprobe[auto]
probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers="auto",
    auto_candidates=[0.25, 0.5, 0.75],  # Fractional positions or explicit indices
    auto_alpha=0.01,                     # Regularization strength
)

probe.fit(positive_prompts, negative_prompts)
print(f"Selected layers: {probe.selected_layers_}")
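
To see why Group Lasso selects whole layers, consider its proximal (group soft-thresholding) step. Each layer's coefficient block is shrunk as a unit, and a block with a small enough norm becomes exactly zero. The sketch below is a generic illustration of that step, not lmprobe's solver. The equal-size block layout and helper names are assumptions.

```python
import math


def group_soft_threshold(coef, n_layers, threshold):
    """Scale each per-layer block w_g by max(0, 1 - threshold / ||w_g||_2)."""
    block = len(coef) // n_layers
    out = []
    for g in range(n_layers):
        w = coef[g * block:(g + 1) * block]
        norm = math.sqrt(sum(c * c for c in w))
        scale = max(0.0, 1.0 - threshold / norm) if norm > 0 else 0.0
        out.extend(c * scale for c in w)
    return out


def selected_layers(coef, n_layers):
    """Indices of layers whose coefficient block is not entirely zero."""
    block = len(coef) // n_layers
    return [
        g for g in range(n_layers)
        if any(c != 0.0 for c in coef[g * block:(g + 1) * block])
    ]


coef = [3.0, 4.0, 0.1, -0.1, 0.0, 2.0]  # 3 layers x 2 dims
shrunk = group_soft_threshold(coef, n_layers=3, threshold=1.0)
print(selected_layers(shrunk, n_layers=3))  # [0, 2]
```

A larger regularization strength corresponds to a larger threshold, so fewer layers survive.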

Evaluation

Beyond score(), the evaluate() method computes multiple metrics at once:

probe.fit(positive_prompts, negative_prompts)

metrics = probe.evaluate(test_prompts, test_labels)
# {"accuracy": 0.85, "f1": 0.85, "precision": 0.88, "recall": 0.82, "auroc": 0.91, ...}
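
For reference, the standard definitions behind these metric names can be written out directly. This sketch assumes binary 0/1 labels and is independent of lmprobe. The function names are hypothetical, and how evaluate() computes its values internally is not shown in this README.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


metrics = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
print(metrics["accuracy"], metrics["precision"], metrics["recall"])  # 0.75 1.0 0.5
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))                     # 0.75
```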

Caching

Activation extraction is expensive, so lmprobe caches activations automatically. The cache is stored at ~/.cache/lmprobe/ by default (or set LMPROBE_CACHE_DIR).

Cache configuration

from lmprobe import cache_info, set_cache_backend, set_cache_dtype, set_cache_limit

# Inspect cache
info = cache_info()
print(info)

# Reduce disk usage with float16 caching
set_cache_dtype("float16")

# Set a max cache size (LRU eviction)
set_cache_limit(50)  # GB

# Use S3 for cross-machine cache sharing (requires: pip install lmprobe[s3])
set_cache_backend("s3://my-bucket/lmprobe-cache")

Warmup

Pre-cache activations for a set of prompts before running predictions:

probe.warmup(test_prompts, batch_size=16)

# Subsequent predict/score calls hit the cache
predictions = probe.predict(test_prompts)

Activation Datasets

Extract activations once from a large model, share them as a HuggingFace dataset, and let others train probes without ever loading the model locally. Requires pip install lmprobe[hub].

Push cached activations to HuggingFace

After extracting activations (via probe.fit(), probe.warmup(), or any extraction call), push the local cache to a HuggingFace Dataset repo:

from lmprobe import push_dataset

# Activations must already be cached locally for these prompts + model
url = push_dataset(
    repo_id="username/llama-safety-activations",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    prompts=all_prompts,
    labels=all_labels,           # optional, stored in the Parquet index
    description="Safety probe activations for Llama-3.1-8B",
    private=False,
)
print(url)  # https://huggingface.co/datasets/username/llama-safety-activations

Train a probe from a dataset (no model required)

Once activations are on HuggingFace, anyone can train probes without loading the LLM:

from lmprobe import load_activations, Probe

# Only downloads the shards for layer 16 — fast and selective
acts, labels = load_activations(
    "username/llama-safety-activations",
    layers=[16],
    return_labels=True,
)

probe = Probe(classifier="logistic_regression", random_state=42)
probe.fit_from_activations(acts[16], labels)
# test_acts: held-out activations, loaded the same way and indexed by layer like acts
predictions = probe.predict_from_activations(test_acts[16])

Pull a full dataset to local cache

Pre-download all shards before running experiments:

from lmprobe import pull_dataset

n = pull_dataset(
    repo_id="username/llama-safety-activations",
    layers=[16],          # only fetch the layers you need
)
print(f"Pulled {n} prompts")

Load raw tensors directly

For custom workflows that need the raw activation tensors:

from lmprobe import load_activation_dataset

tensors, info = load_activation_dataset(
    repo_id="username/llama-safety-activations",
    layers=[16],
)
# tensors["hidden.layer_16"]: shape (n_prompts, hidden_dim)

Preprocessing

Apply feature transformations between activation extraction and classification:

# StandardScaler before classification
probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=[14, 15, 16],
    preprocessing="standard",
)

# PCA dimensionality reduction
probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers="all",
    preprocessing="pca",
    pca_components=50,
)

# Chained: standardize then PCA
probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers="all",
    preprocessing="standard+pca",
    pca_components=100,
)

Regression

Train probes for continuous targets instead of binary classification:

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=16,
    task="regression",  # Uses Ridge regression by default
)

# fit() accepts labels as second argument (not negative_prompts)
probe.fit(prompts, labels)  # labels: list[float]

predictions = probe.predict(test_prompts)  # continuous values
r_squared = probe.score(test_prompts, test_labels)
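
Assuming the regression score is the usual coefficient of determination, as the variable name `r_squared` suggests, it can be computed as follows. This is the textbook formula, shown with a hypothetical helper name.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 (perfect fit)
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0 (no better than the mean)
```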

Working with Pre-Computed Activations

Bypass the extraction pipeline and work directly with activation matrices:

import numpy as np

probe = Probe(classifier="logistic_regression", random_state=42)

# X_train, X_test: np.ndarray of shape (n_samples, hidden_dim)
# y_train, y_test: np.ndarray of shape (n_samples,)
probe.fit_from_activations(X_train, y_train)
predictions = probe.predict_from_activations(X_test)
accuracy = probe.score_from_activations(X_test, y_test)

Baseline Comparisons

Use baselines to validate that your probe is learning something beyond surface features.

Text-Only Baselines

from lmprobe import BaselineProbe

# Bag-of-words baseline
bow_baseline = BaselineProbe(method="bow", classifier="logistic_regression")
bow_baseline.fit(positive_prompts, negative_prompts)
bow_accuracy = bow_baseline.score(test_prompts, test_labels)

# TF-IDF baseline
tfidf_baseline = BaselineProbe(method="tfidf")
tfidf_baseline.fit(positive_prompts, negative_prompts)

# Sentence length baseline (surprisingly predictive for some tasks)
length_baseline = BaselineProbe(method="sentence_length")
length_baseline.fit(positive_prompts, negative_prompts)

# Sentence-transformers embeddings (requires: pip install lmprobe[embeddings])
st_baseline = BaselineProbe(method="sentence_transformers")
st_baseline.fit(positive_prompts, negative_prompts)

# Random baseline (sanity check - should be ~50%)
random_baseline = BaselineProbe(method="random")

# Majority class baseline
majority_baseline = BaselineProbe(method="majority")

Activation-Based Baselines

Test whether the learned probe direction is special compared to simpler approaches:

from lmprobe import ActivationBaseline

# Random direction baseline - project onto random unit vector
random_dir = ActivationBaseline(
    method="random_direction",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
)
random_dir.fit(positive_prompts, negative_prompts)
random_accuracy = random_dir.score(test_prompts, test_labels)

# PCA baseline - classify using top principal components
pca_baseline = ActivationBaseline(
    method="pca",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
)

# Layer 0 baseline - use input embeddings instead of deep layers
layer0_baseline = ActivationBaseline(
    method="layer_0",
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,  # Compare layer 0 to this layer
)
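
As a sketch of what a random-direction baseline does conceptually: project each activation vector onto a random unit vector and classify from that single number. The helpers below are hypothetical. The midpoint threshold and sign-fitting rule are assumptions, not lmprobe's documented behavior.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(x, direction):
    return sum(a * b for a, b in zip(x, direction))


def fit_random_direction(pos, neg, dim, seed=0):
    """Return (direction, threshold, sign) fitted on the 1-D projections."""
    direction = random_unit_vector(dim, random.Random(seed))
    mean_pos = sum(project(x, direction) for x in pos) / len(pos)
    mean_neg = sum(project(x, direction) for x in neg) / len(neg)
    sign = 1.0 if mean_pos >= mean_neg else -1.0
    return direction, (mean_pos + mean_neg) / 2.0, sign


def predict(x, direction, threshold, sign):
    return int(sign * (project(x, direction) - threshold) > 0)


pos, neg = [[5.0, 5.0, 5.0]], [[-5.0, -5.0, -5.0]]
direction, threshold, sign = fit_random_direction(pos, neg, dim=3)
print(predict(pos[0], direction, threshold, sign))
print(predict(neg[0], direction, threshold, sign))
```

If a trained probe barely beats this baseline, the classes may be separable along almost any direction, and the probe direction is not especially meaningful.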

Baseline Battery

Run all applicable baselines at once and compare to your probe:

from lmprobe import BaselineBattery

# Text-only baselines (no model required)
battery = BaselineBattery(model=None, random_state=42)
results = battery.fit(positive_prompts, negative_prompts, test_prompts, test_labels)

print(results.summary())
# Baseline Results:
# ------------------------------------------------------------
#   sentence_transformers          0.7925  (fit: 1.23s, predict: 0.05s)
#   tfidf                          0.7547  (fit: 0.01s, predict: 0.00s)
#   bow                            0.6792  (fit: 0.01s, predict: 0.00s)
#   ...

# Get best baseline
best = results.get_best()[0]
print(f"Best baseline: {best.name} with {best.score:.2%} accuracy")

# With activation baselines (requires model)
battery = BaselineBattery(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
    include=["bow", "tfidf", "random_direction", "pca"],  # Select specific baselines
)
results = battery.fit(positive_prompts, negative_prompts, test_prompts, test_labels)

Available Baseline Methods

| Method | Type | Description |
|---|---|---|
| bow | Text | Bag-of-words + classifier |
| tfidf | Text | TF-IDF + classifier |
| random | Text | Random predictions (sanity check) |
| majority | Text | Always predict majority class |
| sentence_length | Text | Classify by text length |
| sentence_transformers | Text | Pretrained embeddings + classifier |
| shuffled_labels | Text | Train on permuted labels (overfitting check) |
| random_direction | Activation | Project onto random unit vector |
| pca | Activation | Top principal components |
| layer_0 | Activation | Input embeddings only |
| perplexity | Activation | Model's own token probabilities |

Per-Layer Normalization

When combining multiple layers, normalize each layer's activations independently to prevent high-magnitude layers from dominating:

probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=[14, 15, 16],
    normalize_layers=True,          # Default: per-neuron standardization
    # normalize_layers="per_layer", # Alternative: one mean/std per layer
    # normalize_layers=False,       # Disable normalization
)
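
The two normalization modes named in the comments can be contrasted on a single layer's activation matrix (rows are samples, columns are neurons). This is a minimal sketch with hypothetical helper names. Using the population standard deviation and guarding against zero variance are assumptions.

```python
import statistics


def normalize_per_neuron(X):
    """Standardize each column (neuron) independently across samples."""
    columns = []
    for col in zip(*X):
        mean = statistics.fmean(col)
        std = statistics.pstdev(col) or 1.0  # avoid division by zero
        columns.append([(v - mean) / std for v in col])
    return [list(row) for row in zip(*columns)]


def normalize_per_layer(X):
    """Standardize with a single mean/std computed over the whole layer."""
    flat = [v for row in X for v in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat) or 1.0
    return [[(v - mean) / std for v in row] for row in X]


X = [[1.0, 100.0], [3.0, 300.0]]  # 2 samples x 2 neurons, very different scales
print(normalize_per_neuron(X))  # [[-1.0, -1.0], [1.0, 1.0]]
per_layer = normalize_per_layer(X)
```

Per-neuron standardization puts every neuron on the same scale, while the per-layer variant preserves relative magnitudes between neurons within a layer.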

Probe Ensembles

Combine multiple probes into an ensemble for more robust predictions and uncertainty estimation.

Basic ensemble

from lmprobe import Probe, ProbeEnsemble

# Combine probes with different classifiers
p1 = Probe(model="meta-llama/Llama-3.1-8B-Instruct", layers=-1, classifier="logistic_regression")
p2 = Probe(model="meta-llama/Llama-3.1-8B-Instruct", layers=-1, classifier="svm")
p3 = Probe(model="meta-llama/Llama-3.1-8B-Instruct", layers=16, classifier="logistic_regression")

ensemble = ProbeEnsemble([p1, p2, p3], voting="soft")
ensemble.fit(positive_prompts, negative_prompts)

predictions = ensemble.predict(test_prompts)           # (n_samples,)
probabilities = ensemble.predict_proba(test_prompts)   # (n_samples, n_classes)
accuracy = ensemble.score(test_prompts, test_labels)

Factory construction

Create ensembles from config dicts sharing a common model:

ensemble = ProbeEnsemble.from_configs(
    model="meta-llama/Llama-3.1-8B-Instruct",
    configs=[
        {"layers": -1, "classifier": "logistic_regression"},
        {"layers": -1, "classifier": "svm"},
        {"layers": 16, "classifier": "ridge"},
    ],
    voting="hard",    # majority vote (required when using Ridge)
    device="auto",    # shared kwargs
)
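
The difference between the two voting modes can be shown with plain lists of positive-class probabilities, one list per ensemble member. The helpers below are hypothetical and only illustrate the general technique. Tie handling in particular is an arbitrary choice here.

```python
def soft_vote(member_probs, threshold=0.5):
    """Average positive-class probabilities across members, then threshold."""
    n = len(member_probs)
    averaged = [sum(col) / n for col in zip(*member_probs)]
    return [int(p >= threshold) for p in averaged]


def hard_vote(member_preds):
    """Majority vote over 0/1 predictions (ties resolve to 0 in this sketch)."""
    n = len(member_preds)
    return [int(2 * sum(col) > n) for col in zip(*member_preds)]


# Three members, two samples; one member is very confident on each sample.
probs = [[0.9, 0.4], [0.45, 0.4], [0.45, 0.9]]
preds = [[int(p >= 0.5) for p in member] for member in probs]

print(soft_vote(probs))  # [1, 1]
print(hard_vote(preds))  # [0, 0]
```

Soft voting lets a confident member outweigh two lukewarm ones. Hard voting only counts labels, which is why it also works for members that expose no probabilities.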

Bootstrap stability analysis

Clone a single probe into N bootstrap resamples to measure prediction stability:

base_probe = Probe(
    model="meta-llama/Llama-3.1-8B-Instruct",
    layers=-1,
    classifier="logistic_regression",
)

ensemble = ProbeEnsemble.bootstrap(base_probe, n_resamples=10, random_state=42)
ensemble.fit(positive_prompts, negative_prompts)

# Per-sample uncertainty: high std = ensemble members disagree
uncertainty = ensemble.prediction_std(test_prompts)  # (n_samples,)
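
Conceptually, bootstrap stability combines two steps: resampling the training set with replacement for each ensemble member, and measuring how much the members' predictions spread for each test sample. The sketch below illustrates both with hypothetical helpers. Whether lmprobe uses the population or sample standard deviation is not stated here, so the population variant is an assumption.

```python
import random
import statistics


def bootstrap_indices(n, rng):
    """Draw n training indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def prediction_std(member_probs):
    """Per-sample standard deviation of probabilities across ensemble members."""
    return [statistics.pstdev(col) for col in zip(*member_probs)]


rng = random.Random(42)
resample = bootstrap_indices(5, rng)  # indices into a 5-example training set

# Three members, two test samples: members agree on the first, not the second.
member_probs = [[0.9, 0.2], [0.9, 0.8], [0.9, 0.5]]
stds = prediction_std(member_probs)
print(stds[0] < stds[1])  # True
```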

Bootstrap mode supports sample_weight and groups for group-balanced resampling:

ensemble.fit(
    positive_prompts, negative_prompts,
    sample_weight=weights,    # per-sample importance weights
    groups=group_labels,      # group-balanced bootstrap resampling
)

Save and load

ensemble.save("my_ensemble.pkl")
loaded = ProbeEnsemble.load("my_ensemble.pkl")
