A library for linear probe experimentation
probelab
probelab is a Python library for running end-to-end experiments that find linear probes in the residual streams of open-weight transformers, with support for both Hugging Face and transformer_lens models.
Probe Flow
Finding linear probes consists of several high-level steps:
1. Choose the concept of interest, e.g., "refusal" a la Arditi et al., "truth" a la Marks & Tegmark, or "code vulnerability" a la Yu et al. Picking a concept is up to you and your taste.

2. Construct a dataset of token sequences designed to elicit the concept of interest and its antithesis in the model's activations. This repo includes some refusal- and truth-related datasets used in the papers referenced above; datasets for other concepts must be provided by you. The library provides several helper classes for importing a raw dataset into a normalized `ProbeDataset`, which is consumed by the downstream probe training and evaluation classes.

3. Load the model you want to probe. probelab supports both Hugging Face (via `load_hf`) and transformer_lens (via `load_tl`) backends behind a common `ModelHandle`.

4. Experimentally decide where in the model to read activations from, and which tokens to read at.

   The "where" is an `ActivationSpec`: a pair of (target layers, residual component). `targets="all_transformer"` collects every transformer layer; `component="resid_post"` reads the residual stream at the end of the layer.

   The "which" is a `TokenSelector`: `LastNTokenSelector(n=1)` reads the last `n` tokens (here, only the final one); `PostInstructionTokenSelector` reads all tokens after the user command in a chat-formatted prompt; `AllTokenSelector` reads the whole sequence. A `TokenReducer` then collapses the selected tokens into a single vector per example.

   You can use a `ChatFormatter` to apply the model's chat template, optionally wrapping raw `activity` examples into instruction form (with `instructionify=True`). Other formatters are available, such as for few-shot prompts (used in work like Mixture of Corrections and Geometry of Truth). Then run an `HFActivationCollector` (or the transformer_lens equivalent) over your train/dev splits to produce an `ActivationDataset`.

5. Train one probe per layer with `sweep_layers`. Pass a `ProbeTrainer` (e.g., `DifferenceOfMeansTrainer`, or a logistic regression trainer if you want a learned classifier), the selector and reducer from step 4, and the train/dev `ActivationDataset`s. The result is a `LayerSweepResult` holding every trained probe keyed by layer, plus train and dev accuracies. The "best" layer here is best by probe accuracy on the dev set. High dev accuracy is necessary but not sufficient for a probe direction that causally drives a behaviour such as refusal.

6. To measure causality, re-rank layers by causal effect via `validate_by_ablation`. For each layer's probe direction, this runs a generation pass over a held-out behavioural set with that direction ablated at every transformer layer, scores the generations with a metric of your choice, and reports the per-layer effect against a non-intervened baseline. You can pick the optimal layer with `best_delta()`, which gives the layer resulting in the largest delta of the chosen metric.

   The metric is just `metric_fn: ModelResponses -> float`, so anything you can compute from the (command, response) pairs works. For behaviours that need semantic understanding of the response, probelab ships an LLM-judge architecture split along two axes: the property being judged and the backend doing the judging.

   `SemanticJudge` is the top-level ABC: `judge(command, response) -> 1 | 0 | None` (positive class / negative class / unclassifiable), with a default `judge_batch` that returns a `SemanticScore` carrying `positive_rate`, `negative_rate`, and per-example breakdowns. `APIJudge` is an abstract subclass for third-party-API providers; `ClaudeJudge` is the concrete Anthropic implementation, parameterized over (`system_prompt`, `tool_name`, `tool_schema`, `output_field`) so the same provider class works for any binary property. `LocalJudge` is the corresponding abstract base for self-hosted backends (HF, transformer_lens, vLLM, etc.).

   `ClaudeRefusalJudge` is the worked refusal example: a thin `ClaudeJudge` subclass that bakes in a refusal-specific system prompt and a tool-use schema that forces a binary output (the system prompt also frames the judge as a safety-evaluation tool, which sidesteps the judge itself refusing on harmful inputs). To target a different concept (truthfulness a la Marks & Tegmark, code vulnerability a la Yu et al., sycophancy, hallucination, jailbreak success, etc.), write a sibling subclass with the right prompt and tool schema and pass `metric_fn=lambda r: r.judge(my_judge).positive_rate` to `validate_by_ablation`. To swap providers, write the analogous subclass against an `OpenAIJudge` or a `LocalJudge` implementation. The probe direction that `validate_by_ablation` surfaces is then the one that most causally moves the target behaviour.

7. Sweep interventions more broadly with `intervention_sweep`. Once you have a probe direction you trust, you can scan over scales and modes (`add`, `subtract`, `ablate`) and over which layers to hook, using either an `HFInterventionBackend` or a `TLInterventionBackend`.
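To make the probe-training step concrete, here is a minimal NumPy sketch of a difference-of-means probe fitted per layer and ranked by dev accuracy. It does not use probelab's API: the helpers `fit_difference_of_means`, `accuracy`, and `make_synthetic_layer`, the midpoint threshold, and the synthetic "activations" are all assumptions made for illustration, and probelab's `DifferenceOfMeansTrainer` and `sweep_layers` may differ in their details.

```python
import numpy as np


def fit_difference_of_means(x, y):
    """Return a unit direction and bias separating y == 1 from y == 0."""
    mu_pos = x[y == 1].mean(axis=0)
    mu_neg = x[y == 0].mean(axis=0)
    direction = mu_pos - mu_neg
    direction = direction / np.linalg.norm(direction)
    # Assumed decision rule: threshold at the midpoint between the class means.
    bias = -0.5 * (mu_pos + mu_neg) @ direction
    return direction, bias


def accuracy(direction, bias, x, y):
    """Fraction of examples whose projected score lands on the correct side."""
    preds = (x @ direction + bias > 0).astype(int)
    return float((preds == y).mean())


def make_synthetic_layer(rng, n, d, separation):
    """Hypothetical stand-in for one layer's reduced activations (one vector per example)."""
    y = np.repeat([0, 1], n // 2)
    concept = np.zeros(d)
    concept[0] = 1.0  # the "concept" lives along the first axis
    x = rng.normal(size=(n, d)) + separation * np.outer(y, concept)
    return x, y


rng = np.random.default_rng(0)
# Pretend the concept becomes more linearly separable in later layers.
separations = {0: 0.0, 1: 1.0, 2: 6.0}

results = {}
for layer, sep in separations.items():
    x_train, y_train = make_synthetic_layer(rng, 400, 16, sep)
    x_dev, y_dev = make_synthetic_layer(rng, 200, 16, sep)
    direction, bias = fit_difference_of_means(x_train, y_train)
    results[layer] = {
        "direction": direction,
        "dev_acc": accuracy(direction, bias, x_dev, y_dev),
    }

# "Best" here means best by dev accuracy only; it says nothing about causal effect.
best_layer = max(results, key=lambda layer: results[layer]["dev_acc"])
print(best_layer, results[best_layer]["dev_acc"])
```

In this toy setup the layer with the widest class separation wins the sweep, which mirrors why a dev-accuracy ranking alone cannot tell you whether the direction matters causally.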
Visualization utilities are also provided for viewing results of different hyperparameter sweeps for probe training and interventions.
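The three intervention modes named in the flow above (`add`, `subtract`, `ablate`) can be illustrated on plain arrays. The sketch below is not probelab's implementation: the function `intervene`, its signature, and the meaning of `scale` are assumptions, and `ablate` is written as the usual directional-ablation projection (removing the component of each residual vector along the unit probe direction), in the spirit of Arditi et al.

```python
import numpy as np


def intervene(resid, direction, mode, scale=1.0):
    """Apply a hypothetical intervention to residual vectors of shape (..., d_model)."""
    unit = direction / np.linalg.norm(direction)
    if mode == "add":
        # Push every position along the probe direction.
        return resid + scale * unit
    if mode == "subtract":
        # Push every position against the probe direction.
        return resid - scale * unit
    if mode == "ablate":
        # Remove the component along the direction: x - (x . r) r
        coeff = (resid * unit).sum(axis=-1, keepdims=True)
        return resid - coeff * unit
    raise ValueError(f"unknown mode: {mode!r}")


rng = np.random.default_rng(0)
resid = rng.normal(size=(2, 3, 8))  # (batch, tokens, d_model), synthetic
direction = rng.normal(size=8)      # stand-in for a trained probe direction
unit = direction / np.linalg.norm(direction)

# Sweep over modes and scales, tracking the mean projection onto the direction
# as a stand-in for a behavioural metric.
baseline = float((resid @ unit).mean())
sweep = {}
for mode in ("add", "subtract", "ablate"):
    for scale in (1.0, 2.0):
        out = intervene(resid, direction, mode, scale)
        sweep[(mode, scale)] = float((out @ unit).mean()) - baseline

for key, delta in sweep.items():
    print(key, round(delta, 3))
```

In a real run the hook would apply this transformation to the residual stream at the chosen layers during generation, and the delta would come from a behavioural metric rather than from the projection itself.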
File details
Details for the file probelab_py-0.1.0.tar.gz.
File metadata
- Download URL: probelab_py-0.1.0.tar.gz
- Upload date:
- Size: 72.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.9 (installer: uv 0.9.9; distro: Ubuntu 24.04 "noble")
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8d750a66770697ca0aa07e9888e43cf7bf8501b34ba807ff55708df403f2fa22 |
| MD5 | 91acc50a05ecd9b0e70542bbfcfc3bfd |
| BLAKE2b-256 | b328e3d54e0fb4332a22ae91ba0eb169e0f78bcb454b7768f00225995d739097 |
File details
Details for the file probelab_py-0.1.0-py3-none-any.whl.
File metadata
- Download URL: probelab_py-0.1.0-py3-none-any.whl
- Upload date:
- Size: 85.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.9 (installer: uv 0.9.9; distro: Ubuntu 24.04 "noble")
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 89ca1872bc95c7b777af7e3d8828805609a9cd649a09297b5579f69e4566128a |
| MD5 | 6994a4a22252ca38fb896843cb705420 |
| BLAKE2b-256 | 6ca404c51f5a6d860305e7c6fdae063e02d88f8a16a76b0ccb12df699159f025 |