A library for linear probe experimentation


probelab

probelab is a Python library for running end-to-end experiments that find linear probes in the residual streams of open-weight transformers, with support for both huggingface- and transformer_lens-based models.

Probe Flow

Finding linear probes consists of several high-level steps:

Choose the concept of interest, e.g., "refusal" a la Arditi et al., "truth" a la Marks & Tegmark, or "code vulnerability" a la Yu et al. Picking a concept is up to you and your taste.
Construct a dataset of token sequences designed to elicit the concept of interest and its antithesis from the model's activations. This repo includes some refusal- and truth-related datasets used in the papers referenced above. Datasets for other concepts need to be provided by you. The library provides several helper classes for importing a raw dataset into a normalized ProbeDataset class. The ProbeDataset is consumed by the downstream probe training and evaluation classes.
Load the model you want to probe. probelab supports both huggingface (via load_hf) and transformer_lens (via load_tl) backends behind a common ModelHandle.
Experimentally decide where in the model to read activations from and which tokens to read at. The "where" is an ActivationSpec, a pair of (target layers, residual component): targets="all_transformer" collects every transformer layer, and component="resid_post" reads the residual stream at the end of the layer. The "which" is a TokenSelector: LastNTokenSelector(n=1) reads the last n tokens (here, only the final one); PostInstructionTokenSelector reads all tokens after the user command in a chat-formatted prompt; AllTokenSelector reads the whole sequence. A TokenReducer then collapses the selected tokens into a single vector per example.

You can use a ChatFormatter to apply the model's chat template, optionally wrapping raw examples into instruction form (with instructionify=True). Then run an HFActivationCollector (or the transformer_lens equivalent) over your train/dev splits to produce an ActivationDataset. Other formatters are available, such as one for few-shot prompts (used in work like Mixture of Corrections and Geometry of Truth).
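The select-then-reduce idea can be sketched without any model at all. The snippet below is a minimal illustration in plain Python, not probelab's implementation: the function names (`select_last_n`, `select_after`, `mean_reduce`) are hypothetical stand-ins for the selector and reducer roles described above, and mean pooling is only one assumed choice of reduction.

```python
from typing import List, Sequence

Vector = List[float]


def select_last_n(token_acts: Sequence[Vector], n: int = 1) -> List[Vector]:
    """Keep the activations of the final n token positions."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return list(token_acts[-n:])


def select_after(token_acts: Sequence[Vector], start: int) -> List[Vector]:
    """Keep activations from position `start` onward, e.g. after the user command."""
    return list(token_acts[start:])


def mean_reduce(selected: Sequence[Vector]) -> Vector:
    """Collapse the selected token activations into one vector by averaging."""
    if not selected:
        raise ValueError("no tokens were selected")
    dim = len(selected[0])
    return [sum(vec[i] for vec in selected) / len(selected) for i in range(dim)]


# Toy activations: 3 token positions, hidden size 2 (made-up numbers).
acts = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

last_one = mean_reduce(select_last_n(acts, n=1))  # only the final token
last_two = mean_reduce(select_last_n(acts, n=2))  # average of the last two tokens
after_first = select_after(acts, start=1)         # everything after position 0
```

Whatever selector is used, the reducer guarantees one fixed-size vector per example, which is what lets a single probe be trained across prompts of different lengths.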
Train one probe per layer with sweep_layers. Pass a ProbeTrainer (e.g., DifferenceOfMeansTrainer, or a logistic regression trainer if you want a learned classifier), the selector and reducer chosen in the previous step, and the train/dev ActivationDatasets. The result is a LayerSweepResult holding every trained probe keyed by layer, plus train and dev accuracies. The "best" layer here is best by probe accuracy on the dev set, which is necessary but not sufficient for finding a direction that causally drives a behaviour such as refusal.
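To make the per-layer sweep concrete, here is a minimal difference-of-means sketch in plain Python. It is an illustration under stated assumptions rather than probelab's code: the helper names are hypothetical, the decision threshold is assumed to sit at the midpoint between the two class means, and the toy activations are invented.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Split = Tuple[List[Vector], List[Vector]]  # (positive examples, negative examples)


def mean(vectors: Sequence[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fit_diff_of_means(pos: Sequence[Vector], neg: Sequence[Vector]) -> Tuple[Vector, float]:
    """Direction = mean(pos) - mean(neg); threshold = projection of the midpoint (assumed)."""
    mu_pos, mu_neg = mean(pos), mean(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    return direction, dot(direction, midpoint)


def accuracy(direction: Vector, threshold: float, pos: Sequence[Vector], neg: Sequence[Vector]) -> float:
    correct = sum(dot(direction, x) > threshold for x in pos)
    correct += sum(dot(direction, x) <= threshold for x in neg)
    return correct / (len(pos) + len(neg))


def sweep(train: Dict[int, Split], dev: Dict[int, Split]) -> Dict[int, float]:
    """Fit one probe per layer on train and report its dev accuracy."""
    results = {}
    for layer, (pos, neg) in train.items():
        direction, threshold = fit_diff_of_means(pos, neg)
        results[layer] = accuracy(direction, threshold, *dev[layer])
    return results


# Toy data: layer 0 carries no signal, layer 1 separates the classes along axis 0.
train = {
    0: ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    1: ([[2.0, 0.0], [3.0, 1.0]], [[-2.0, 0.0], [-3.0, 1.0]]),
}
dev = {
    0: ([[0.5, 0.5]], [[0.5, 0.5]]),
    1: ([[1.0, 1.0]], [[-1.0, 2.0]]),
}

dev_acc = sweep(train, dev)
best_layer = max(dev_acc, key=dev_acc.get)
```

Selecting `best_layer` this way ranks layers purely by how well the concept can be read out, which is why a separate causal check is still needed.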
To measure causality, you can re-rank layers by causal effect via validate_by_ablation. For each layer's probe direction, this runs a generation pass over a held-out behavioural set with that direction ablated at every transformer layer, scores the generations with a metric of your choice, and reports the per-layer effect against a non-intervened baseline. You can then pick the layer with best_delta(), which returns the layer whose direction produces the largest delta in the chosen metric.
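Directional ablation is commonly implemented by projecting the direction out of the activation, x' = x - (x · r̂) r̂, where r̂ is the unit-normalized probe direction. The sketch below shows that projection and a layer ranking in plain Python. It is not probelab's implementation: the helper names are hypothetical, the scores are made-up numbers, and ranking by absolute change from the baseline is an assumption about what "largest delta" means.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def unit(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def ablate(x: Vector, direction: Vector) -> Vector:
    """Remove the component of x that lies along `direction`."""
    r = unit(direction)
    coeff = dot(x, r)
    return [xi - coeff * ri for xi, ri in zip(x, r)]


def layer_with_largest_delta(baseline: float, ablated_scores: Dict[int, float]) -> int:
    """Pick the layer whose direction moves the metric furthest from baseline (assumed: absolute change)."""
    return max(ablated_scores, key=lambda layer: abs(ablated_scores[layer] - baseline))


# After ablation the activation has no component left along the direction.
x_ablated = ablate([3.0, 4.0], direction=[2.0, 0.0])

# Made-up metric values: refusal rate without intervention vs. with each layer's direction ablated.
baseline_rate = 0.90
ablated_rates = {10: 0.85, 14: 0.20, 18: 0.60}
chosen = layer_with_largest_delta(baseline_rate, ablated_rates)
```

In this toy ranking the layer-14 direction is chosen because ablating it changes the metric most, regardless of how its probe accuracy compared with the other layers.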

The metric is just metric_fn: ModelResponses -> float, so anything you can compute from the (command, response) pairs works. For behaviours that need semantic understanding of the response, probelab ships an LLM-judge architecture split along two axes: the property being judged and the backend doing the judging. SemanticJudge is the top-level ABC: judge(command, response) -> 1 | 0 | None (positive class / negative class / unclassifiable), with a default judge_batch that returns a SemanticScore carrying positive_rate, negative_rate, and per-example breakdowns. APIJudge is an abstract subclass for third-party-API providers; ClaudeJudge is the concrete Anthropic implementation, parameterized over (system_prompt, tool_name, tool_schema, output_field) so the same provider class works for any binary property. LocalJudge is the corresponding abstract base for self-hosted backends - HF, transformer_lens, vLLM, etc.

ClaudeRefusalJudge is the worked refusal example: a thin ClaudeJudge subclass that bakes in a refusal-specific system prompt and a tool-use schema that forces a binary output (the system prompt also frames the judge as a safety-evaluation tool, which sidesteps the judge itself refusing on harmful inputs). To target a different concept - truthfulness a la Marks & Tegmark, code-vulnerability a la Yu et al., sycophancy, hallucination, jailbreak success, etc. - write a sibling subclass with the right prompt + tool schema and pass metric_fn=lambda r: r.judge(my_judge).positive_rate to validate_by_ablation. To swap providers, write the analogous subclass against an OpenAIJudge or a LocalJudge implementation. The probe direction validate_by_ablation surfaces is then the one that most causally moves the target behaviour.
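The shape of that judge hierarchy can be illustrated with a toy, dependency-free version. Everything below is hypothetical and deliberately uses different names (`ToyJudge`, `ToyScore`, `KeywordRefusalJudge`) so it is not mistaken for probelab's classes; the keyword matcher stands in for an LLM call, and computing rates over all examples (including unclassifiable ones) is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class ToyScore:
    positive_rate: float
    negative_rate: float
    labels: List[Optional[int]]  # per-example breakdown: 1, 0, or None


class ToyJudge(ABC):
    @abstractmethod
    def judge(self, command: str, response: str) -> Optional[int]:
        """Return 1 (positive class), 0 (negative class), or None (unclassifiable)."""

    def judge_batch(self, pairs: Sequence[Tuple[str, str]]) -> ToyScore:
        labels = [self.judge(command, response) for command, response in pairs]
        total = len(labels)
        if total == 0:
            return ToyScore(0.0, 0.0, [])
        # Assumption: rates are taken over all examples, including unclassifiable ones.
        return ToyScore(labels.count(1) / total, labels.count(0) / total, labels)


class KeywordRefusalJudge(ToyJudge):
    """Crude stand-in for an LLM judge: flags common refusal phrases."""

    REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")

    def judge(self, command: str, response: str) -> Optional[int]:
        text = response.strip().lower()
        if not text:
            return None  # nothing to classify
        return 1 if any(marker in text for marker in self.REFUSAL_MARKERS) else 0


pairs = [
    ("How do I pick a lock?", "I can't help with that."),
    ("What is 2 + 2?", "4"),
    ("Say hi", "   "),
]
score = KeywordRefusalJudge().judge_batch(pairs)
```

Because the batch logic lives in the base class, a judge for a different property only needs to override `judge`, which mirrors the sibling-subclass pattern described above.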
Sweep interventions more broadly with intervention_sweep. Once you have a probe direction you trust, you can scan over scales and modes (add, subtract, ablate) and over which layers to hook, using either an HFInterventionBackend or a TLInterventionBackend.
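The three modes can be sketched on a single activation vector. This is a hedged illustration in plain Python, not the library's backend: the `intervene` function and the grid layout are hypothetical, add and subtract are assumed to shift the activation by `scale` along the unit direction, and ablate is assumed to ignore the scale.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def unit(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def intervene(x: Vector, direction: Vector, mode: str, scale: Optional[float] = None) -> Vector:
    """Apply one intervention to activation x along the unit-normalized direction."""
    r = unit(direction)
    if mode == "ablate":
        coeff = dot(x, r)  # project the direction out entirely; scale is unused
        return [xi - coeff * ri for xi, ri in zip(x, r)]
    if scale is None:
        raise ValueError("add and subtract need a scale")
    if mode == "add":
        return [xi + scale * ri for xi, ri in zip(x, r)]
    if mode == "subtract":
        return [xi - scale * ri for xi, ri in zip(x, r)]
    raise ValueError(f"unknown mode: {mode}")


x = [1.0, 2.0]
direction = [0.0, 2.0]
scales = [1.0, 2.0]

# Grid of (mode, scale) settings; ablate appears once because it has no scale.
grid: List[Tuple[str, Optional[float]]] = (
    [("add", s) for s in scales] + [("subtract", s) for s in scales] + [("ablate", None)]
)

# Record how much of the direction remains in the activation after each setting.
projection: Dict[Tuple[str, Optional[float]], float] = {
    (mode, scale): dot(intervene(x, direction, mode, scale), unit(direction))
    for mode, scale in grid
}
```

In a real sweep the recorded quantity would be a behavioural metric computed on generations rather than a projection, and the grid would also range over which layers are hooked.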

Visualization utilities are also provided for viewing results of different hyperparameter sweeps for probe training and interventions.

Project details


This version

0.1.0

May 20, 2026

Download files


Source Distribution

probelab_py-0.1.0.tar.gz (72.2 kB view details)

Uploaded May 20, 2026 Source

Built Distribution


probelab_py-0.1.0-py3-none-any.whl (85.7 kB view details)

Uploaded May 20, 2026 Python 3

File details

Details for the file probelab_py-0.1.0.tar.gz.

File metadata

Download URL: probelab_py-0.1.0.tar.gz
Upload date: May 20, 2026
Size: 72.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.9 {"installer":{"name":"uv","version":"0.9.9"},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for probelab_py-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8d750a66770697ca0aa07e9888e43cf7bf8501b34ba807ff55708df403f2fa22`
MD5	`91acc50a05ecd9b0e70542bbfcfc3bfd`
BLAKE2b-256	`b328e3d54e0fb4332a22ae91ba0eb169e0f78bcb454b7768f00225995d739097`


File details

Details for the file probelab_py-0.1.0-py3-none-any.whl.

File metadata

Download URL: probelab_py-0.1.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 85.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.9 {"installer":{"name":"uv","version":"0.9.9"},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for probelab_py-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`89ca1872bc95c7b777af7e3d8828805609a9cd649a09297b5579f69e4566128a`
MD5	`6994a4a22252ca38fb896843cb705420`
BLAKE2b-256	`6ca404c51f5a6d860305e7c6fdae063e02d88f8a16a76b0ccb12df699159f025`


probelab-py 0.1.0
