
Use Activation Intervention to Interpret Causal Mechanism of Model


pyvene supports customizable interventions on different neural architectures (e.g., RNN or Transformers). It supports complex intervention schemas (e.g., parallel or serialized interventions) and a wide range of intervention modes (e.g., static or trained interventions) at scale to gain interpretability insights.

Getting Started: [pyvene 101]

Installation

pip install pyvene

Wrap, Intervene and Share

You can intervene on supported models as follows,

import pyvene
from pyvene import IntervenableRepresentationConfig, IntervenableConfig, IntervenableModel

# provided wrapper for huggingface gpt2 model
_, tokenizer, gpt2 = pyvene.create_gpt2()

# turn gpt2 into intervenable_gpt2
intervenable_gpt2 = IntervenableModel(
    intervenable_config = IntervenableConfig(
        intervenable_representations=[
            IntervenableRepresentationConfig(
                0,            # intervening layer 0
                "mlp_output", # intervening mlp output
                "pos",        # intervening based on positional indices of tokens
                1             # maximally intervening one token
            ),
        ],
    ), 
    gpt2
)

# swap the source representation into the base at position 4 (the last token)
original_outputs, intervened_outputs = intervenable_gpt2(
    tokenizer("The capital of Spain is", return_tensors="pt"),
    [tokenizer("The capital of Italy is", return_tensors="pt")],
    {"sources->base": ([[[4]]], [[[4]]])}
)
original_outputs.last_hidden_state - intervened_outputs.last_hidden_state

which returns,

tensor([[[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0008, -0.0078, -0.0066,  ...,  0.0007, -0.0018,  0.0060]]])

showing that, as expected, the intervention has a causal effect only on the last token. You can share your interventions with others through the HuggingFace Hub with a single call,

intervenable_gpt2.save(
    save_directory="./your_gpt2_mounting_point/",
    save_to_hf_hub=True,
    hf_repo_name="your_gpt2_mounting_point",
)

Interventions are knobs that can be mounted on models, and people can share their knobs with others to spread knowledge about how to steer models. You can try this at [Intervention Sharing]

Selected Tutorials

  • Getting Started (Beginner): introduces basic static interventions on factual recall examples
  • Intervened Model Generation (Beginner): shows how to intervene on a model during generation
  • Intervene Your Local Models (Intermediate): illustrates how to run this library with your own models
  • Trainable Interventions for Causal Abstraction (Advanced): illustrates how to train an intervention to discover causal mechanisms of a neural model

Causal Abstraction: From Interventions to Interpretability Insights

Basic interventions are fun, but they do not support systematic causal claims. To gain actual interpretability insights, we want to measure the counterfactual behaviors of a model in a data-driven fashion. In other words, if the model responds systematically to your interventions, you can start to associate certain regions of the network with a high-level concept. We call this process alignment search over model internals.

Understanding Causal Mechanisms with Static Interventions

Here is a more concrete example,

def add_three_numbers(a, b, c):
    var_x = a + b
    return var_x + c

The function solves a three-number addition problem. Suppose we trained a neural network to solve this task perfectly. Can we find the representation of (a + b) inside the neural network? We can use this library to answer that question. Specifically, we can do the following,

  • Step 1: Form an Interpretability (Alignment) Hypothesis: we hypothesize that a set of neurons N aligns with (a + b).
  • Step 2: Counterfactual Testing: if our hypothesis is correct, then swapping the values of N between examples should yield the expected counterfactual behaviors. For instance, swapping N for (1+2)+3 with N for (2+3)+4 should produce the output of (2+3)+3 or (1+2)+4, depending on the direction of the swap.
  • Step 3: Rejection Sampling of Hypotheses: run the tests many times, aggregate statistics on how often the counterfactual behavior matches, and propose a new hypothesis based on the results.
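The counterfactual logic in Step 2 can be sketched on the high-level causal model itself. Here `run_with_swap` is a hypothetical helper (not part of pyvene) that runs the base input but overwrites the intermediate variable var_x with the value it takes on a source input:

```python
def add_three_numbers(a, b, c):
    var_x = a + b  # hypothesized intermediate variable
    return var_x + c

def run_with_swap(base, source):
    """Run the base input, but take var_x from the source input."""
    a, b, c = base
    sa, sb, _ = source
    var_x = sa + sb   # intervened value of (a + b), computed on the source
    return var_x + c  # the rest of the computation still uses base's c

# (1+2)+3 with var_x taken from (2+3)+4 should give (2+3)+3 = 8
print(run_with_swap((1, 2, 3), (2, 3, 4)))  # -> 8
```

If a set of neurons N in the trained network reproduces exactly this behavior under swapping, that is evidence N aligns with (a + b).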

Translated into API calls with the library, the above steps amount to a single call,

intervenable.evaluate(
    train_dataloader=test_dataloader,
    compute_metrics=compute_metrics,
    inputs_collator=inputs_collator
)

where you provide testing data (basically interventional data and the counterfactual behavior you are looking for) along with your metrics functions. The library will try to evaluate the alignment with the intervention you specified in the config.
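The metrics function is user-supplied. As a minimal sketch (the argument names `eval_preds` and `eval_labels` are illustrative assumptions, not part of the pyvene API), it might just compute interchange-intervention accuracy, i.e. how often the intervened output matches the expected counterfactual label:

```python
def compute_metrics(eval_preds, eval_labels):
    # fraction of examples whose intervened prediction matches
    # the expected counterfactual label
    correct = sum(int(p == l) for p, l in zip(eval_preds, eval_labels))
    return {"accuracy": correct / len(eval_labels)}

print(compute_metrics([8, 7, 9, 5], [8, 7, 5, 5]))  # -> {'accuracy': 0.75}
```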


Understanding Causal Mechanisms with Trainable Interventions

The alignment search process outlined above can be tedious when your neural network is large. For a single hypothesized alignment, you need to set up different intervention configs targeting different layers and positions to verify your hypothesis. Instead of this brute-force search, you can turn it into an optimization problem, which also brings other benefits such as distributed alignments.

At its crux, we want to train an intervention to produce our desired counterfactual behaviors. If we can indeed train such an intervention, we can claim that causally informative content lives in the intervened representations! Below, we show one type of trainable intervention, models.interventions.RotatedSpaceIntervention, as

class RotatedSpaceIntervention(TrainableIntervention):
    """Intervention in the rotated space."""

    def forward(self, base, source):
        rotated_base = self.rotate_layer(base)
        rotated_source = self.rotate_layer(source)
        # interchange the first interchange_dim dimensions in the rotated space
        rotated_base[:self.interchange_dim] = rotated_source[:self.interchange_dim]
        # rotate back (the rotation is orthogonal, so its inverse is its transpose)
        output = torch.matmul(rotated_base, self.rotate_layer.weight.T)
        return output

Instead of swapping activations in the original representation space, we first rotate them, perform the swap, and then un-rotate the intervened representation. We use SGD to learn a rotation that produces the expected counterfactual behavior; if we can find such a rotation, we claim there is an alignment. A tutorial notebook covers this with an advanced version of distributed alignment search, Boundless DAS. There are also recent works outlining potential limitations of distributed alignment search.
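The rotate-swap-unrotate recipe can be sketched in plain Python with a 2-D rotation (a toy stand-in for the torch-based class above; `rotated_swap` and `rotate` are illustrative helpers, not pyvene API):

```python
import math

def rotate(v, theta):
    # apply a 2-D rotation matrix to the vector v
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

def rotated_swap(base, source, theta, interchange_dim=1):
    rb, rs = rotate(base, theta), rotate(source, theta)
    rb[:interchange_dim] = rs[:interchange_dim]  # interchange in rotated space
    return rotate(rb, -theta)                    # un-rotate (inverse rotation)

# with theta = 0 the rotation is the identity, so this reduces to a
# plain activation swap on the first coordinate
print(rotated_swap([1.0, 2.0], [3.0, 4.0], theta=0.0))  # -> [3.0, 2.0]
```

Learning theta (in the real case, a full orthogonal matrix) lets the swap target a direction in activation space rather than a fixed basis dimension, which is the "distributed" part of distributed alignment search.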

You can now also make a single API call to train your intervention,

intervenable.train(
    train_dataloader=train_dataloader,
    compute_loss=compute_loss,
    compute_metrics=compute_metrics,
    inputs_collator=inputs_collator
)

where you pass in a training dataset along with your customized loss and metrics functions. Trained interventions can later be saved to disk. You can also use intervenable.evaluate() to evaluate your interventions against customized objectives.
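The loss function is likewise user-defined. A hypothetical sketch for a regression-style setting (the name and signature are illustrative assumptions, not pyvene API) could penalize the distance between the intervened outputs and the counterfactual targets we want the intervention to produce:

```python
def compute_loss(intervened_outputs, counterfactual_targets):
    # mean squared error between the intervened outputs and the
    # counterfactual targets the trained intervention should produce
    diffs = [(o - t) ** 2 for o, t in zip(intervened_outputs, counterfactual_targets)]
    return sum(diffs) / len(diffs)

print(compute_loss([1.0, 2.0], [1.0, 4.0]))  # -> 2.0
```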

Contributing to This Library

Please see our guidelines about how to contribute to this repository.

Pull requests, bug reports, and all other forms of contribution are welcomed and highly encouraged!

Other Ways of Installation

Method 2: Install from the Repo

pip install git+https://github.com/frankaging/pyvene.git

Method 3: Clone and Import

git clone https://github.com/frankaging/pyvene.git

and, from a parallel folder, import it into your project as,

from pyvene import pyvene
_, tokenizer, gpt2 = pyvene.create_gpt2()

Related Works in Discovering Causal Mechanism of LLMs

If you would like to read more work in this area, here is a list of papers that try to align or discover the causal mechanisms of LLMs.

Citation

If you use this repository, please consider citing the relevant papers:

  @article{geiger-etal-2023-DAS,
        title={Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations},
        author={Geiger, Atticus and Wu, Zhengxuan and Potts, Christopher and Icard, Thomas and Goodman, Noah},
        year={2023},
        journal={arXiv}
  }

  @inproceedings{wu-etal-2023-Boundless-DAS,
        title={Interpretability at Scale: Identifying Causal Mechanisms in Alpaca},
        author={Wu, Zhengxuan and Geiger, Atticus and Icard, Thomas and Potts, Christopher and Goodman, Noah},
        year={2023},
        booktitle={NeurIPS}
  }
