A library for linear probe experimentation

Project description

probelab

probelab is a Python library for running end-to-end experiments that find linear probes in the residual streams of open-weight transformers, with support for both huggingface- and transformer_lens-based models.

Probe Flow

Finding linear probes consists of several high-level steps:

  1. Choose the concept of interest, e.g., "refusal" a la Arditi et al., "truth" a la Marks & Tegmark, or "code vulnerability" a la Yu et al. Which concept to pick is up to you and your taste.

  2. Construct a dataset of token sequences designed to elicit both the concept of interest and its antithesis in the model's activations. This repo includes some refusal- and truth-related datasets used in the papers referenced above. Datasets for other concepts must be provided by you. The library provides several helper classes for importing a raw dataset into a normalized ProbeDataset class. The ProbeDataset is consumed by the downstream probe training and evaluation classes.

  3. Load the model you want to probe. probelab supports both huggingface (via load_hf) and transformer_lens (via load_tl) backends behind a common ModelHandle.

  4. Decide experimentally where in the model to read activations from, and which tokens to read at. The "where" is an ActivationSpec - a pair of (target layers, residual component). targets="all_transformer" collects every transformer layer; component="resid_post" reads the residual stream at the end of the layer. The "which" is a TokenSelector - LastNTokenSelector reads the last N tokens (n=1 reads only the final token); PostInstructionTokenSelector reads all tokens after the user instruction in a chat-formatted prompt; AllTokenSelector reads the whole sequence. A TokenReducer then collapses the selected tokens into a single vector per example.

    You can use a ChatFormatter to apply the model's chat template, optionally wrapping raw activity examples into instruction form (with instructionify=True). Then run an HFActivationCollector (or the transformer_lens equivalent) over your train/dev splits to produce an ActivationDataset. Other formatters are also available, for example for few-shot prompts (as used in work like Mixture of Corrections and Geometry of Truth).

  5. Train one probe per layer with sweep_layers. Pass a ProbeTrainer (e.g., DifferenceOfMeansTrainer, or a logistic regression trainer if you want a learned classifier), the selector and reducer from step 4, and the train/dev ActivationDatasets. The result is a LayerSweepResult holding every trained probe keyed by layer, plus train and dev accuracies. The "best" layer here is the one with the highest probe accuracy on the dev set. High accuracy is necessary but not sufficient for a probe direction to causally matter for a behaviour like refusal.

  6. To measure causality, you can re-rank layers by causal effect via validate_by_ablation. For each layer's probe direction, this runs a generation pass over a held-out behavioural set with that direction ablated at every transformer layer, scores the generations with a metric of your choice, and reports the per-layer effect against an un-intervened baseline. You can pick the optimal layer with best_delta(), which returns the layer whose ablation produces the largest delta in the chosen metric.

    The metric is just metric_fn: ModelResponses -> float, so anything you can compute from the (command, response) pairs works. For behaviours that need semantic understanding of the response, probelab ships an LLM-judge architecture split along two axes: the property being judged and the backend doing the judging. SemanticJudge is the top-level ABC: judge(command, response) -> 1 | 0 | None (positive class / negative class / unclassifiable), with a default judge_batch that returns a SemanticScore carrying positive_rate, negative_rate, and per-example breakdowns. APIJudge is an abstract subclass for third-party-API providers; ClaudeJudge is the concrete Anthropic implementation, parameterized over (system_prompt, tool_name, tool_schema, output_field) so the same provider class works for any binary property. LocalJudge is the corresponding abstract base for self-hosted backends - HF, transformer_lens, vLLM, etc.

    ClaudeRefusalJudge is the worked refusal example: a thin ClaudeJudge subclass that bakes in a refusal-specific system prompt and a tool-use schema that forces a binary output (the system prompt also frames the judge as a safety-evaluation tool, which sidesteps the judge itself refusing on harmful inputs). To target a different concept - truthfulness a la Marks & Tegmark, code-vulnerability a la Yu et al., sycophancy, hallucination, jailbreak success, etc. - write a sibling subclass with the right prompt + tool schema and pass metric_fn=lambda r: r.judge(my_judge).positive_rate to validate_by_ablation. To swap providers, write the analogous subclass against an OpenAIJudge or a LocalJudge implementation. The probe direction validate_by_ablation surfaces is then the one that most causally moves the target behaviour.

  7. Sweep interventions more broadly with intervention_sweep. Once you have a probe direction you trust, you can scan over scales and modes (add, subtract, ablate) and over which layers to hook, using either an HFInterventionBackend or a TLInterventionBackend.
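Steps 4 and 5 above (select tokens, reduce them to one vector per example, fit a difference-of-means probe per layer, and compare layers by dev accuracy) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not probelab's implementation: every function name below (select_last_n, mean_reduce, fit_difference_of_means, sweep_layers_sketch) is hypothetical, the activations are hand-written toy values, and the midpoint threshold is one common choice rather than something the library is known to use.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Example = Tuple[List[Vector], int]  # (per-token activations, label 0 or 1)


def select_last_n(token_acts: Sequence[Vector], n: int = 1) -> List[Vector]:
    """Token selection: keep the activations of the last n tokens."""
    return list(token_acts[-n:])


def mean_reduce(vectors: Sequence[Vector]) -> Vector:
    """Token reduction: average a set of vectors into a single vector."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fit_difference_of_means(pos: Sequence[Vector], neg: Sequence[Vector]) -> Tuple[Vector, float]:
    """Probe direction = mean(pos) - mean(neg); threshold at the midpoint (an assumption)."""
    mu_pos, mu_neg = mean_reduce(pos), mean_reduce(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    return direction, dot(direction, midpoint)


def featurize(examples: Sequence[Example], n: int) -> Tuple[List[Vector], List[int]]:
    """Apply the selector and reducer to every example."""
    feats = [mean_reduce(select_last_n(acts, n)) for acts, _ in examples]
    return feats, [label for _, label in examples]


def accuracy(direction: Vector, threshold: float, feats: Sequence[Vector], labels: Sequence[int]) -> float:
    hits = sum((dot(direction, x) > threshold) == bool(y) for x, y in zip(feats, labels))
    return hits / len(labels)


def sweep_layers_sketch(train: Dict[int, List[Example]], dev: Dict[int, List[Example]], n: int = 1):
    """Fit one probe per layer; return {layer: (direction, dev accuracy)}."""
    results = {}
    for layer, train_examples in train.items():
        x_train, y_train = featurize(train_examples, n)
        pos = [x for x, y in zip(x_train, y_train) if y == 1]
        neg = [x for x, y in zip(x_train, y_train) if y == 0]
        direction, threshold = fit_difference_of_means(pos, neg)
        x_dev, y_dev = featurize(dev[layer], n)
        results[layer] = (direction, accuracy(direction, threshold, x_dev, y_dev))
    return results


def ex(last_token: Vector, label: int) -> Example:
    # Toy two-token example: an uninformative first token, then the token we probe.
    return ([[9.0, 9.0], last_token], label)


# Toy data: the concept is not linearly readable at layer 0 but is at layer 1.
train_by_layer = {
    0: [ex([1.0, 0.0], 1), ex([-1.0, 0.0], 1), ex([0.0, 1.0], 0), ex([0.0, -1.0], 0)],
    1: [ex([2.0, 0.0], 1), ex([3.0, 1.0], 1), ex([-2.0, 0.0], 0), ex([-3.0, 1.0], 0)],
}
dev_by_layer = {
    0: [ex([1.0, 0.0], 1), ex([0.0, 1.0], 0)],
    1: [ex([2.5, -1.0], 1), ex([-1.5, 2.0], 0)],
}

results = sweep_layers_sketch(train_by_layer, dev_by_layer)
best_layer = max(results, key=lambda layer: results[layer][1])
print(best_layer, results[best_layer])
```

In this toy setup the class means coincide at layer 0, so the probe direction there is the zero vector and dev accuracy is at chance, while layer 1 separates the classes cleanly. As step 5 notes, winning on dev accuracy does not by itself show that the direction is causally relevant.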

Visualization utilities are also provided for viewing results of different hyperparameter sweeps for probe training and interventions.
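The causal re-ranking described in step 6 of the probe flow can also be illustrated with a self-contained toy. The sketch below assumes the standard directional-ablation formula x' = x - (x . r)r for a unit vector r, and replaces the model with a hypothetical two-dimensional stand-in whose "refusal" depends on a single hidden direction. None of these names (ablate, toy_generate, keyword_judge, rank_by_ablation) are probelab APIs, and picking the layer by largest absolute change is an assumption about what "largest delta" means.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Pair = Tuple[str, str]  # (command, response)


def ablate(resid: Vector, direction: Vector) -> Vector:
    """Remove the component of resid along direction: x - (x . r_hat) * r_hat."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(a * u for a, u in zip(resid, unit))
    return [a - coeff * u for a, u in zip(resid, unit)]


# Hypothetical stand-in for a model: it "refuses" when the residual
# has a positive component along a hidden true direction.
TRUE_DIRECTION = [1.0, 0.0]


def toy_generate(resid: Vector) -> str:
    score = sum(a * b for a, b in zip(resid, TRUE_DIRECTION))
    return "I can't help with that." if score > 0 else "Sure, here you go."


def keyword_judge(command: str, response: str) -> Optional[int]:
    """Toy judge: 1 = refusal, 0 = compliance, None = unclassifiable."""
    if not response:
        return None
    return 1 if response.startswith("I can't") else 0


def refusal_rate(pairs: List[Pair]) -> float:
    """Metric: fraction of classifiable responses judged as refusals."""
    verdicts = [keyword_judge(c, r) for c, r in pairs]
    scored = [v for v in verdicts if v is not None]
    return sum(scored) / len(scored)


def run(prompts: List[Tuple[str, Vector]], direction: Optional[Vector] = None) -> List[Pair]:
    """Generate a response per prompt, optionally with a direction ablated."""
    pairs = []
    for command, resid in prompts:
        if direction is not None:
            resid = ablate(resid, direction)
        pairs.append((command, toy_generate(resid)))
    return pairs


def rank_by_ablation(
    prompts: List[Tuple[str, Vector]],
    directions: Dict[int, Vector],
    metric_fn: Callable[[List[Pair]], float],
) -> Dict[int, float]:
    """Return {layer: metric with that layer's direction ablated - baseline metric}."""
    baseline = metric_fn(run(prompts))
    return {layer: metric_fn(run(prompts, d)) - baseline for layer, d in directions.items()}


# Toy held-out set: each prompt comes with a made-up residual vector.
prompts = [("p1", [2.0, 1.0]), ("p2", [1.0, -1.0]), ("p3", [3.0, 0.5])]
# Candidate probe directions keyed by the layer they were trained on.
directions = {3: [0.0, 1.0], 5: [1.0, 0.0]}

deltas = rank_by_ablation(prompts, directions, refusal_rate)
best_layer = max(deltas, key=lambda layer: abs(deltas[layer]))
print(deltas, best_layer)
```

Here ablating the layer-3 direction leaves the refusal rate unchanged, while ablating the layer-5 direction removes every refusal, so layer 5 is the one selected. A real run would replace the keyword check with a semantic judge and the toy generator with actual generation under hooks.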

Download files

Source Distribution

probelab_py-0.1.0.tar.gz (72.2 kB view details)

Uploaded Source

Built Distribution

probelab_py-0.1.0-py3-none-any.whl (85.7 kB view details)

Uploaded Python 3

File details

Details for the file probelab_py-0.1.0.tar.gz.

File metadata

  • Download URL: probelab_py-0.1.0.tar.gz
  • Upload date:
  • Size: 72.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.9 {"installer":{"name":"uv","version":"0.9.9"},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for probelab_py-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8d750a66770697ca0aa07e9888e43cf7bf8501b34ba807ff55708df403f2fa22
MD5 91acc50a05ecd9b0e70542bbfcfc3bfd
BLAKE2b-256 b328e3d54e0fb4332a22ae91ba0eb169e0f78bcb454b7768f00225995d739097

File details

Details for the file probelab_py-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: probelab_py-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 85.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.9 {"installer":{"name":"uv","version":"0.9.9"},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for probelab_py-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 89ca1872bc95c7b777af7e3d8828805609a9cd649a09297b5579f69e4566128a
MD5 6994a4a22252ca38fb896843cb705420
BLAKE2b-256 6ca404c51f5a6d860305e7c6fdae063e02d88f8a16a76b0ccb12df699159f025
