Sparse AutoEncoder to decode Mistral LLM

Sparse Autoencoder for Steering Mistral 7B

This repository contains a Sparse Autoencoder (SAE) for interpreting and steering the Mistral 7B language model. The SAE is trained on Mistral 7B's residual stream activations, with the aim of understanding the model's internal representations and manipulating its outputs in a controlled way.

Overview

Large Language Models (LLMs) like Mistral 7B have complex internal mechanisms that are not easily interpretable. This project leverages a Sparse Autoencoder to:

  • Decode internal activations: Transforming high-dimensional activations into sparse, interpretable features.
  • Steer model behavior: Manipulating specific features to influence the model's output.

This approach rests on the superposition hypothesis: the model's activations encode more features than they have dimensions, as overlapping directions, and a sparse representation can disentangle them.
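
The decoding idea can be illustrated with a dependency-free toy sketch. It assumes a common SAE form, a ReLU encoder followed by a linear decoder, trained with a reconstruction error plus an L1 sparsity penalty. The weights, the sizes, and the names `encode`, `decode`, `sae_loss`, and `l1_coeff` are illustrative assumptions, not the repository's actual implementation.

```python
# Toy sparse autoencoder in pure Python (illustrative only).
# Assumed form: f = ReLU(W_enc x + b_enc), x_hat = W_dec f + b_dec.

def encode(x, w_enc, b_enc):
    """Map an activation vector to non-negative feature activations."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]

def decode(f, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    return [
        sum(w * fi for w, fi in zip(row, f)) + b
        for row, b in zip(w_dec, b_dec)
    ]

def sae_loss(x, x_hat, features, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coeff * l1

# Hypothetical sizes: a 2-dimensional activation and 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per feature
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]    # one column per feature
B_DEC = [0.0, 0.0]

x = [2.0, 0.0]
features = encode(x, W_ENC, B_ENC)  # only the first feature is active
x_hat = decode(features, W_DEC, B_DEC)
loss = sae_loss(x, x_hat, features, l1_coeff=0.5)
```

In this toy case a single active feature reconstructs the input exactly, so the loss consists only of the sparsity penalty.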

Personal Work

I have written articles that provide foundational insights guiding the development of this project.

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/mistral-sae.git
    cd mistral-sae
    
  2. Install dependencies:

    pip install -r requirements.txt
    

    Ensure you have the appropriate version of PyTorch installed, preferably with CUDA support for GPU acceleration.
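
A quick way to confirm the PyTorch setup before training is sketched below. The helper name `torch_status` is hypothetical; it relies only on `importlib` from the standard library plus `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util

def torch_status():
    """Report whether PyTorch is importable and whether it sees a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch  # imported lazily so the check works without PyTorch
    if torch.cuda.is_available():
        return f"PyTorch {torch.__version__} with CUDA"
    return f"PyTorch {torch.__version__} (CPU only)"

print(torch_status())
```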

Usage

Training the Sparse Autoencoder

The train.py script trains the SAE on activations from a specified layer of the Mistral 7B model.

    python train.py

  • Adjust hyperparameters like D_MODEL, D_HIDDEN, BATCH_SIZE, and lr within the script.
  • Set the MISTRAL_MODEL_PATH and target_layer to specify which model and layer to use.
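
The settings above could be gathered in one place, as in the sketch below. The class name `TrainConfig`, the placeholder path, and every default except `D_MODEL` are assumptions for illustration. The hidden size of 4096 and the 32 transformer layers are properties of Mistral 7B; how train.py actually stores these values is not shown here.

```python
from dataclasses import dataclass

MISTRAL_7B_NUM_LAYERS = 32  # number of transformer blocks in Mistral 7B

@dataclass(frozen=True)
class TrainConfig:
    MISTRAL_MODEL_PATH: str = "path/to/mistral-7b"  # placeholder path
    target_layer: int = 16     # hypothetical choice of layer
    D_MODEL: int = 4096        # hidden size of Mistral 7B
    D_HIDDEN: int = 4096 * 8   # hypothetical SAE width
    BATCH_SIZE: int = 32       # hypothetical
    lr: float = 1e-4           # hypothetical

    def __post_init__(self):
        # Reject layer indices that do not exist in the model.
        if not 0 <= self.target_layer < MISTRAL_7B_NUM_LAYERS:
            raise ValueError(
                f"target_layer must be in [0, {MISTRAL_7B_NUM_LAYERS})"
            )

config = TrainConfig(target_layer=20, lr=3e-4)
```

Validating `target_layer` up front surfaces a bad index before the model is loaded.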

Generating Feature Explanations

Use explain.py to generate natural language explanations for the features learned by the SAE.

    python explain.py

  • Ensure you have access to the required datasets (e.g., The Pile) and APIs.
  • Configure parameters such as batch_size, data_path, and target_layer.
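
How explain.py builds its explanations is not described here. One common pattern, assumed in this sketch, is to collect the texts on which a feature activates most strongly and ask a language model to summarize them. The function names, the prompt wording, and the activation values below are all hypothetical, and no API call is made.

```python
import heapq

def top_activating(examples, k=3):
    """Return the k (text, activation) pairs with the highest activation."""
    return heapq.nlargest(k, examples, key=lambda pair: pair[1])

def build_prompt(feature_index, examples):
    """Format top examples into a prompt asking what the feature detects."""
    lines = [f"Feature {feature_index} activates most strongly on these texts:"]
    for text, activation in examples:
        lines.append(f"- ({activation:.2f}) {text}")
    lines.append("In one sentence, what concept does this feature detect?")
    return "\n".join(lines)

# Made-up activations for illustration; real values would come from the SAE.
examples = [
    ("The cat sat on the mat.", 0.10),
    ("Paris is the capital of France.", 4.20),
    ("Berlin is the capital of Germany.", 3.90),
    ("I like pancakes.", 0.00),
]
prompt = build_prompt(7, top_activating(examples, k=2))
print(prompt)
```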

Steering the Model Output

The demo.py script demonstrates how to steer the Mistral 7B model by manipulating specific features.

    python demo.py

  • Set FEATURE_INDEX to the index of the feature you wish to manipulate.
  • Toggle STEERING_ON to True to enable steering.
  • Adjust the coeff variable to control the strength of the manipulation.
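
The exact operation demo.py performs is not shown here. One common approach, assumed in this sketch, adds a scaled copy of the chosen feature's decoder direction to the residual activation. The names `FEATURE_INDEX`, `STEERING_ON`, and `coeff` come from the script; `steer`, `DECODER_DIRECTIONS`, and all values are hypothetical.

```python
FEATURE_INDEX = 2    # index of the SAE feature to manipulate
STEERING_ON = True   # set to False to leave activations untouched
coeff = 4.0          # strength of the manipulation

# Hypothetical decoder directions: one row per feature, each of length d_model.
DECODER_DIRECTIONS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

def steer(activation, directions, feature_index, coeff, enabled=True):
    """Add coeff times a feature's decoder direction to the activation."""
    if not enabled:
        return list(activation)
    direction = directions[feature_index]
    return [a + coeff * d for a, d in zip(activation, direction)]

residual = [0.5, -1.0, 0.25]
steered = steer(residual, DECODER_DIRECTIONS, FEATURE_INDEX, coeff, STEERING_ON)
```

A larger `coeff` pushes the activation further along the feature direction, and a negative value would suppress the feature instead.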

Project Structure

  • config.py: Contains model configurations and helper functions.
  • train.py: Script for training the Sparse Autoencoder.
  • explain.py: Generates explanations for the features identified by the SAE.
  • demo.py: Demonstrates how to steer the Mistral 7B model using the SAE.
  • mistral_sae/: Directory containing the SAE implementation and related utilities.
  • requirements.txt: Lists the Python dependencies required for the project.

Background

Understanding the internal workings of LLMs is crucial for both interpretability and control. By applying a Sparse Autoencoder to the activations of Mistral 7B, we can:

  • Identify monosemantic features, each corresponding to a specific concept, where individual neurons often respond to several unrelated ones.
  • Test the superposition hypothesis by examining how multiple features are represented within the same neurons.
  • Enhance our ability to steer the model's outputs towards desired behaviors by manipulating these features.
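
Superposition can be illustrated with a toy example that packs three features into two dimensions. The directions below are hypothetical and chosen only to make the interference visible; this is an illustration of the hypothesis, not a test performed on Mistral 7B.

```python
import math

# Three hypothetical feature directions in a 2-dimensional space,
# spaced 120 degrees apart so that no two are orthogonal.
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Activate only feature 0, then read every feature back by projection.
activation = directions[0]
readout = [dot(activation, d) for d in directions]
# readout[0] is 1.0; the other two are close to -0.5 (interference).

# A ReLU readout, as in an SAE encoder, removes the negative interference.
sparse_readout = [max(0.0, r) for r in readout]
```

Because the directions overlap, a plain linear readout reports nonzero values for features that were never active, while the rectified readout recovers the single active feature.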

Acknowledgments

This project is inspired by and builds upon several key works:

Resources

Download files

Source Distribution

mistral_sae-0.1.3.tar.gz (9.5 kB view details)

Uploaded Source

Built Distribution

mistral_sae-0.1.3-py3-none-any.whl (11.3 kB view details)

Uploaded Python 3

File details

Details for the file mistral_sae-0.1.3.tar.gz.

File metadata

  • Download URL: mistral_sae-0.1.3.tar.gz
  • Upload date:
  • Size: 9.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Windows/11

File hashes

Hashes for mistral_sae-0.1.3.tar.gz
Algorithm Hash digest
SHA256 ff49741ed67cff6c4de1e8ba8724baec1dcc991c00e0a5930ba3f832d4164fdb
MD5 c59f903e4e62b752507c6a675c50c545
BLAKE2b-256 5fc070cec3c135b1dd86dc674dbb8382a2b77943a1611b4eba0dec3c3522d191
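
A downloaded archive can be checked against the SHA256 digest listed above with the standard library's `hashlib`. This is a minimal sketch: the helper names `sha256_of` and `matches` are hypothetical, and the file is assumed to sit in the current directory.

```python
import hashlib

# SHA256 digest listed above for mistral_sae-0.1.3.tar.gz.
EXPECTED_SHA256 = "ff49741ed67cff6c4de1e8ba8724baec1dcc991c00e0a5930ba3f832d4164fdb"

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches(path, expected_hex):
    """Return True if the file's SHA256 digest equals the expected value."""
    return sha256_of(path) == expected_hex.lower()

# Usage, once the archive has been downloaded:
# matches("mistral_sae-0.1.3.tar.gz", EXPECTED_SHA256)
```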

File details

Details for the file mistral_sae-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: mistral_sae-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 11.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.12.4 Windows/11

File hashes

Hashes for mistral_sae-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 4adb7e72ae7d0d7c5e63e962b3ed8da685695bf21f325fc10bbf5c6370b59d71
MD5 adc594831d7b678d8174f27080cd48e4
BLAKE2b-256 7ae8c850bfa2b575b30b937dded46df1e2ddc4e2c6dc571f3cf95d638a8dacf0
