
Steering Vectors


Steering vectors for transformer language models in Pytorch / Huggingface

Full docs: https://steering-vectors.github.io/steering-vectors

Colab demo: Open In Colab

About

This library provides utilities for training and applying steering vectors to language models (LMs) from Huggingface, like GPT2, Llama2, GptNeoX, etc.

Steering vectors identify a direction in a model's hidden activations which can be used to control how the model behaves. For example, we can make an LM more or less honest in its responses, more or less happy, more or less confrontational, and so on. This works by providing paired positive and negative training examples for the characteristic you're trying to elicit. To train a steering vector for truthfulness, you might use prompts like the following:

Positive prompt (truthful):

Question: What is the correct answer? 2 + 2 =
(A): 4
(B): 7
Answer: A

Negative prompt (not truthful):

Question: What is the correct answer? 2 + 2 =
(A): 4
(B): 7
Answer: B

Then, we can find a steering vector by observing the hidden activations in a language model as it processes the positive and negative statements above, and subtracting the "negative" activations from the "positive" activations. We can then use this vector to "steer" the model to be more or less truthful. Neat!
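
To make this concrete, here is a minimal sketch of that mean-difference idea using plain Huggingface APIs (this is not the library's internal implementation; the layer index, the use of the final token's activation, and the single training pair are illustrative assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def last_token_activation(prompt: str, layer: int = 6) -> torch.Tensor:
    # Run the model and grab the hidden state of the final prompt token
    # at the chosen layer (shape: hidden_dim).
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[layer][0, -1, :]

# Subtract the "negative" activation from the "positive" activation;
# in practice this difference is averaged over many training pairs.
steering_direction = last_token_activation("2 + 2 = 4") - last_token_activation("2 + 2 = 7")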

For more info on steering vectors, check out the following work:

Installation

pip install steering-vectors

Basic usage

This library assumes you're using PyTorch with a decoder-only generative language model (e.g. GPT, LLaMa) and a tokenizer from Huggingface.

To begin, collect tuples of positive and negative training prompts in a list, and run train_steering_vector():

from transformers import AutoModelForCausalLM, AutoTokenizer
from steering_vectors import train_steering_vector

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# training samples are tuples of (positive_prompt, negative_prompt)
training_samples = [
    (
        "2 + 2 = 4",
        "2 + 2 = 7"
    ),
    (
        "The capital of France is Paris",
        "The capital of France is Berlin"
    )
    # ...
]


steering_vector = train_steering_vector(
    model,
    tokenizer,
    training_samples,
    show_progress=True,
)

Then, you can use the steering vector to "steer" the model's behavior:

with steering_vector.apply(model):
    prompt = "Is it true that crystals have magic healing properties?"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs)
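
The generated tokens can then be decoded back to text with the tokenizer as usual:

print(tokenizer.decode(outputs[0], skip_special_tokens=True))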

Check out the full documentation for more info.

Contributing

Any contributions to improve this project are welcome! Please open an issue or pull request in this repo with any bugfixes / changes / improvements you have!

This project uses Black for code formatting, Flake8 for linting, and Pytest for tests. Make sure any changes you submit pass these checks in your PR. If you have trouble getting them to run, feel free to open a pull request regardless and we can discuss further in the PR.

License

This code is released under an MIT license.
