An implementation of transformers tailored for mechanistic interpretability.

TransformerLens

This library is maintained by Joseph Bloom and was created by Neel Nanda

Read the Docs Here

Installation

Install: pip install transformer_lens

import transformer_lens

# Load a model (eg GPT-2 Small)
model = transformer_lens.HookedTransformer.from_pretrained("gpt2-small")

# Run the model and get logits and activations
logits, activations = model.run_with_cache("Hello World")

Key Tutorials

Introduction to the Library and Mech Interp

Demo of Main TransformerLens Features

A Library for Mechanistic Interpretability of Generative Language Models

This is a library for doing mechanistic interpretability of GPT-2 style language models. The goal of mechanistic interpretability is to take a trained model and reverse engineer the algorithms the model learned during training from its weights. It is a fact about the world today that we have computer programs that can essentially speak English at a human level (GPT-3, PaLM, etc.), yet we have no idea how they work nor how to write one ourselves. This offends me greatly, and I would like to solve this!

TransformerLens lets you load in an open source language model, like GPT-2, and exposes the internal activations of the model to you. You can cache any internal activation in the model, and add in functions to edit, remove or replace these activations as the model runs. The core design principle I've followed is to enable exploratory analysis. One of the most fun parts of mechanistic interpretability compared to normal ML is the extremely short feedback loops! The point of this library is to keep the gap between having an experiment idea and seeing the results as small as possible, to make it easy for research to feel like play and to enter a flow state. Part of what I aimed for is to make my experience of doing research easier and more fun; hopefully this transfers to you!
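To make the idea of caching and editing activations concrete, here is a minimal pure-Python sketch of the hook-point pattern. It is a toy illustration, not the TransformerLens implementation: the names HookPoint, ToyModel and zero_ablate are hypothetical, and the "model" is just two arithmetic steps on a list of floats. Its run_with_cache only loosely mirrors the call in the usage snippet above.

```python
# Toy illustration of hook points. NOT the TransformerLens implementation:
# every class and function name below is hypothetical.
from typing import Callable, Dict, List, Optional, Tuple

Activation = List[float]
# A hook receives (activation, hook name) and may return a replacement.
HookFn = Callable[[Activation, str], Optional[Activation]]


class HookPoint:
    """A named identity step where callers can read or replace an activation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.hooks: List[HookFn] = []

    def __call__(self, value: Activation) -> Activation:
        for fn in self.hooks:
            result = fn(value, self.name)
            if result is not None:  # returning None leaves the value unchanged
                value = result
        return value


class ToyModel:
    """Two tiny 'layers', each followed by a hook point."""

    def __init__(self) -> None:
        self.hook_double = HookPoint("hook_double")
        self.hook_shift = HookPoint("hook_shift")
        self.points: Dict[str, HookPoint] = {
            p.name: p for p in (self.hook_double, self.hook_shift)
        }

    def forward(self, xs: Activation) -> Activation:
        h = self.hook_double([2.0 * x for x in xs])
        return self.hook_shift([x + 1.0 for x in h])

    def run_with_hooks(
        self, xs: Activation, fwd_hooks: List[Tuple[str, HookFn]]
    ) -> Activation:
        """Attach hooks for a single forward pass, then remove them."""
        try:
            for name, fn in fwd_hooks:
                self.points[name].hooks.append(fn)
            return self.forward(xs)
        finally:
            for point in self.points.values():
                point.hooks.clear()

    def run_with_cache(
        self, xs: Activation
    ) -> Tuple[Activation, Dict[str, Activation]]:
        """Run once while recording the activation at every hook point."""
        cache: Dict[str, Activation] = {}

        def save(value: Activation, name: str) -> None:
            cache[name] = list(value)  # store a copy, change nothing

        out = self.run_with_hooks(xs, [(name, save) for name in self.points])
        return out, cache


def zero_ablate(value: Activation, name: str) -> Activation:
    """Replace the activation with zeros (a simple ablation)."""
    return [0.0 for _ in value]


model = ToyModel()
clean_out, cache = model.run_with_cache([1.0, 2.0])
ablated_out = model.run_with_hooks([1.0, 2.0], [("hook_double", zero_ablate)])
print(clean_out, cache["hook_double"], ablated_out)
# prints: [3.0, 5.0] [2.0, 4.0] [1.0, 1.0]
```

Hooks are attached for one forward pass and then cleared, so an experiment cannot leak state into the next run. That choice is this sketch's own, made to keep the loop of "try an intervention, look at the result" short.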

Gallery

Research done involving TransformerLens:

Progress Measures for Grokking via Mechanistic Interpretability (ICLR Spotlight, 2023) by Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt
Finding Neurons in a Haystack: Case Studies with Sparse Probing by Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, Dimitris Bertsimas
Towards Automated Circuit Discovery for Mechanistic Interpretability by Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, Adrià Garriga-Alonso
Actually, Othello-GPT Has A Linear Emergent World Representation by Neel Nanda
A circuit for Python docstrings in a 4-layer attention-only transformer by Stefan Heimersheim and Jett Janiak
A Toy Model of Universality (ICML, 2023) by Bilal Chughtai, Lawrence Chan, Neel Nanda
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models (2023, ICLR Workshop RTML) by Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Fazl Barez
Eliciting Latent Predictions from Transformers with the Tuned Lens by Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, Jacob Steinhardt

User contributed examples of the library being used in action:

Induction Heads Phase Change Replication: A partial replication of In-Context Learning and Induction Heads from Connor Kissane
Decision Transformer Interpretability: A set of scripts for training decision transformers that uses TransformerLens to view intermediate activations and perform attribution and ablations. A write-up of the initial work can be found here.

Check out our demos folder for more examples of TransformerLens in practice

Getting Started in Mechanistic Interpretability

Mechanistic interpretability is a very young and small field, and there are a lot of open problems. This means there's both a lot of low-hanging fruit, and that the bar for entry is low - if you would like to help, please try working on one! The standard answer to "why has no one done this yet" is just that there aren't enough people! Key resources:

A Guide to Getting Started in Mechanistic Interpretability
ARENA Mechanistic Interpretability Tutorials from Callum McDougall. A comprehensive practical introduction to mech interp, written in TransformerLens and full of snippets to copy, with exercises and solutions! Notable tutorials:
- Coding GPT-2 from scratch, with accompanying video tutorial from me (1 2) - a good introduction to transformers
- Introduction to Mech Interp and TransformerLens: An introduction to TransformerLens and mech interp via studying induction heads. Covers the foundational concepts of the library
- Indirect Object Identification: a replication of Interpretability in the Wild that covers standard techniques in mech interp such as direct logit attribution, activation patching and path patching
Mech Interp Paper Reading List
200 Concrete Open Problems in Mechanistic Interpretability
A Comprehensive Mechanistic Interpretability Explainer: To look up all the jargon and unfamiliar terms you're going to come across!
Neel Nanda's YouTube channel: A range of mech interp video content, including paper walkthroughs, and walkthroughs of doing research

Support & Community

If you have issues, questions, feature requests or bug reports, please search the issues to check if it's already been answered, and if not please raise an issue!

You're also welcome to join the open source mech interp community on Slack! Please use issues for concrete discussions about the package, and Slack for higher bandwidth discussions about e.g. supporting important new use cases, or if you want to make substantial contributions to the library and want a maintainer's opinion. We'd also love for you to come and share your projects on the Slack!

We're particularly excited to support grad students and professional researchers using TransformerLens for their work. Please have a low bar for reaching out if there are ways we could better support your use case!

Background

I (Neel Nanda) used to work for the Anthropic interpretability team, and I wrote this library because after I left and tried doing independent research, I got extremely frustrated by the state of open source tooling. There's a lot of excellent infrastructure like HuggingFace and DeepSpeed to use or train models, but very little to dig into their internals and reverse engineer how they work. This library tries to solve that, and to make it easy to get into the field even if you don't work at an industry org with real infrastructure! One of the great things about mechanistic interpretability is that you don't need large models or tons of compute. There are lots of important open problems that can be solved with a small model in a Colab notebook!

The core features were heavily inspired by the interface to Anthropic's excellent Garcon tool. Credit to Nelson Elhage and Chris Olah for building Garcon and showing me the value of good infrastructure for enabling exploratory research!

Contributing

See https://neelnanda-io.github.io/TransformerLens/content/contributing.html

Citation

Please cite this library as:

@misc{nanda2022transformerlens,
    title = {TransformerLens},
    author = {Neel Nanda and Joseph Bloom},
    year = {2022},
    howpublished = {\url{https://github.com/neelnanda-io/TransformerLens}},
}
