
A framework for evaluating sparse autoencoders


SAE Bench


CURRENT REPO STATUS: SAE Bench is currently in beta. This repo is still under development as we clean up some of the rough edges left over from the research process, but it is usable in its current state for both SAE Lens SAEs and custom SAEs.

Overview

SAE Bench is a comprehensive suite of eight evaluations for Sparse Autoencoder (SAE) models.

For more information, refer to our blog post.

Supported Models and SAEs

  • SAE Lens Pretrained SAEs: Supports evaluations on any SAE Lens SAE.
  • dictionary_learning SAEs: We support evaluations on any SAE trained with the dictionary_learning repo (see Custom SAE Usage).
  • Custom SAEs: Supports any general SAE object with encode() and decode() methods (see Custom SAE Usage).

Installation

Set up a virtual environment with python >= 3.10.

git clone https://github.com/adamkarvonen/SAEBench.git
cd SAEBench
pip install -e .

Alternatively, you can install from PyPI:

pip install sae-bench

All evals can be run with the current batch sizes on Gemma-2-2B on a 24 GB VRAM GPU (e.g., an RTX 3090). By default, some evals cache LLM activations, which can require up to 100 GB of disk space; this caching can be disabled.

Autointerp requires creating an openai_api_key.txt file. Unlearning requires requesting access to the WMDP bio dataset (refer to unlearning/README.md).
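A minimal way to create the key file, assuming Autointerp reads it from your working directory (replace the placeholder with your actual key):

```shell
# Write your OpenAI API key (placeholder shown) to the file Autointerp reads.
echo "sk-REPLACE-WITH-YOUR-KEY" > openai_api_key.txt
```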

Getting Started

We recommend getting started by working through the sae_bench_demo.ipynb notebook. In this notebook, we load both a custom SAE and an SAE Lens SAE, run both on multiple evaluations, and plot graphs of the results.

Running Evaluations

Each evaluation has an example command located in its respective main.py file. To run all evaluations on a selection of SAE Lens SAEs, refer to shell_scripts/README.md. Here's an example of how to run a sparse probing evaluation on a single SAE Bench Pythia-70M SAE:

python -m sae_bench.evals.sparse_probing.main \
    --sae_regex_pattern "sae_bench_pythia70m_sweep_standard_ctx128_0712" \
    --sae_block_pattern "blocks.4.hook_resid_post__trainer_10" \
    --model_name pythia-70m-deduped

The results will be saved to the eval_results/sparse_probing directory.

We use regex patterns to select SAE Lens SAEs. For more examples of regex patterns, refer to sae_regex_selection.ipynb.
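To illustrate how the two patterns select SAEs, here is a plain-Python sketch using `re`. The release and block names below are illustrative, modeled on the sparse probing example above; the actual matching logic lives inside SAE Bench:

```python
import re

# Illustrative SAE Lens "release/block" names, modeled on the example above.
sae_names = [
    "sae_bench_pythia70m_sweep_standard_ctx128_0712/blocks.3.hook_resid_post__trainer_10",
    "sae_bench_pythia70m_sweep_standard_ctx128_0712/blocks.4.hook_resid_post__trainer_10",
    "sae_bench_pythia70m_sweep_topk_ctx128_0730/blocks.4.hook_resid_post__trainer_2",
]

sae_regex_pattern = r"sae_bench_pythia70m_sweep_standard_ctx128_0712"
sae_block_pattern = r"blocks\.4\.hook_resid_post__trainer_10"

# Keep only SAEs whose release matches the first pattern and whose
# block/trainer name matches the second.
selected = [
    name
    for name in sae_names
    if re.fullmatch(sae_regex_pattern, name.split("/")[0])
    and re.fullmatch(sae_block_pattern, name.split("/")[1])
]
print(selected)
# Only the standard-sweep SAE at block 4, trainer 10 matches both patterns.
```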

Every eval folder contains an eval_config.py, which holds all relevant hyperparameters for that evaluation. The values are currently set to the recommended defaults.
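As a rough sketch of the shape such a config takes, here is a hypothetical dataclass; the field names and defaults below are illustrative only, and the real ones live in each eval's eval_config.py:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an eval config; actual fields and defaults
# are defined in each eval's eval_config.py.
@dataclass
class SparseProbingEvalConfig:
    random_seed: int = 42
    batch_size: int = 32  # reduce if you hit GPU memory limits
    context_length: int = 128
    dataset_names: list[str] = field(default_factory=lambda: ["bias_in_bios"])

# Override a default when constructing the config.
config = SparseProbingEvalConfig(batch_size=16)
print(config.batch_size)
```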

For a tutorial on using SAE Lens SAEs, including calculating L0 and Loss Recovered and getting a set of tokens from The Pile, refer to this notebook: https://github.com/jbloomAus/SAELens/blob/main/tutorials/basic_loading_and_analysing.ipynb
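L0 here is the average number of active (nonzero) SAE latents per token. A minimal numpy sketch of the computation, on toy activations rather than real SAE outputs:

```python
import numpy as np

def mean_l0(feature_acts: np.ndarray) -> float:
    """Average number of nonzero SAE latents per token.

    feature_acts: array of shape (n_tokens, n_latents) holding the
    SAE's encoded feature activations.
    """
    return float((feature_acts != 0).sum(axis=-1).mean())

# Toy activations: 3 tokens, 5 latents, with 2, 1, and 3 active latents.
acts = np.array([
    [0.0, 1.2, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.9, 0.0, 0.0],
    [2.0, 0.3, 0.0, 0.0, 0.7],
])
print(mean_l0(acts))  # (2 + 1 + 3) / 3 = 2.0
```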

Custom SAE Usage

Our goal is to have first-class support for custom SAEs, as the field is rapidly evolving. Our evaluations can run on any SAE object with encode(), decode(), and a few config values. We recommend referring to sae_bench_demo.ipynb. In this notebook, we load a custom SAE and an SAE Bench baseline SAE, run them on two evals, and graph the results. There is additional information about custom SAE usage in custom_saes/README.md.
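To show the shape of that interface, here is a bare-bones ReLU SAE sketch. The config attributes shown (d_in, d_sae) are assumptions for illustration; check custom_saes/README.md for the exact fields the evals require:

```python
import torch
import torch.nn as nn

class MinimalCustomSAE(nn.Module):
    """Bare-bones ReLU SAE exposing the encode()/decode() interface
    the evals call. Attribute names here are illustrative; see
    custom_saes/README.md for the required config values.
    """

    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_in, d_sae))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.zeros(d_sae, d_in))
        self.b_dec = nn.Parameter(torch.zeros(d_in))
        self.d_in = d_in
        self.d_sae = d_sae

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Map model activations to sparse feature activations.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, feature_acts: torch.Tensor) -> torch.Tensor:
        # Reconstruct model activations from feature activations.
        return feature_acts @ self.W_dec + self.b_dec

sae = MinimalCustomSAE(d_in=512, d_sae=4096)
x = torch.randn(8, 512)
recon = sae.decode(sae.encode(x))
print(recon.shape)  # torch.Size([8, 512])
```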

If your SAEs are trained with the dictionary_learning repo, you can evaluate your SAEs by passing in the name of the HuggingFace repo containing your SAEs. Refer to SAEBench/custom_saes/run_all_evals_dictionary_learning_saes.py.

There are two ways to evaluate custom SAEs:

  1. Using Evaluation Templates:

    • Use the secondary if __name__ == "__main__" block in each main.py
    • Results are saved in SAE Bench format for easy visualization
    • Compatible with provided plotting tools
  2. Direct Function Calls:

    • Use run_eval_single_sae() in each main.py
    • Simpler interface requiring only model, SAE, and config values
    • Graphing will require manual formatting

We currently have a suite of SAE Bench SAEs on layers 3 and 4 of Pythia-70M and layers 5, 12, and 19 of Gemma-2-2B, each trained on 200M tokens with checkpoints at various points. These SAEs can serve as baselines for any new custom SAEs. We also have baseline eval results, saved at TODO.

Training Your Own SAEs

You can deterministically replicate the training of our SAEs using the scripts provided here, implement your own SAE, or modify one of our SAE implementations. Once you train your new version, you can benchmark it against our existing SAEs for a true apples-to-apples comparison.

Graphing Results

If evaluating your own SAEs, we recommend using the graphing cells in sae_bench_demo.ipynb. To replicate all SAE Bench plots, refer to graphing.ipynb. In this notebook, we download all SAE Bench data and create a variety of plots.

Development

This project uses Poetry for dependency management and packaging.

To install the development dependencies, run:

poetry install

Unit tests can be run with:

poetry run pytest tests/unit

These tests run automatically on every PR in CI.

There are also acceptance tests that can be run with:

poetry run pytest tests/acceptance

These tests are expensive and will not be run automatically in CI, but are worth running manually before large changes.
