A library for doing research on developmental interpretability

DevInterp

A Python Library for Developmental Interpretability Research

DevInterp is a Python library for conducting research on developmental interpretability, a novel AI safety research agenda rooted in Singular Learning Theory (SLT). DevInterp proposes tools for detecting, locating, and ultimately controlling the development of structure over training.

Read more about developmental interpretability.

:warning: This library is still in early development. Don't expect things to work on a first attempt. We are actively working on improving the library and adding new features.

Installation

To install devinterp, simply run pip install devinterp.

Minimal Example

from devinterp.slt import sample, LLCEstimator
from devinterp.optim import SGLD
from devinterp.utils import optimal_temperature

# Assuming `model` is a PyTorch nn.Module and `trainloader` is a DataLoader
llc_estimator = LLCEstimator(..., temperature=optimal_temperature(trainloader))
sample(model, trainloader, ..., callbacks=[llc_estimator])

# Read the mean LLC estimate from the callback's results
llc_mean = llc_estimator.sample()["llc/mean"]
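
To give a feel for the quantity the example above reports, here is a from-scratch sketch of a local learning coefficient (LLC) estimate on a toy one-parameter loss. It is not DevInterp's implementation and uses none of its API: the function name `estimate_llc`, its arguments, the full-gradient Langevin update (no minibatching), and the localization strength are all assumptions made for illustration. It uses the estimator from the LLC literature, n·β·(E[L(w)] − L(w*)) with β = 1/log n; this README does not state whether `optimal_temperature` corresponds to that choice. For a regular (non-degenerate) loss with d parameters the LLC is d/2, so the quadratic below should give roughly 0.5.

```python
import math
import random


def quadratic_loss(w: float) -> float:
    """Toy regular loss with its minimum at w = 0."""
    return 0.5 * w * w


def quadratic_grad(w: float) -> float:
    return w


def estimate_llc(loss, grad, w_star, n, num_steps=200_000, burn_in=20_000,
                 step_size=4e-4, localization=1.0, seed=0):
    """Hypothetical helper: LLC estimate n*beta*(E[L(w)] - L(w_star)).

    Samples w with a Langevin chain targeting the localized, tempered
    posterior exp(-n*beta*L(w) - localization/2 * (w - w_star)^2).
    """
    rng = random.Random(seed)
    beta = 1.0 / math.log(n)  # assumed inverse temperature, 1 / log n
    n_beta = n * beta
    w = w_star
    total, kept = 0.0, 0
    for step in range(num_steps):
        # Gradient of the negative log-density of the localized posterior
        drift = n_beta * grad(w) + localization * (w - w_star)
        noise = math.sqrt(step_size) * rng.gauss(0.0, 1.0)
        w = w - 0.5 * step_size * drift + noise
        if step >= burn_in:
            total += loss(w)
            kept += 1
    mean_loss = total / kept
    return n_beta * (mean_loss - loss(w_star))


llc_hat = estimate_llc(quadratic_loss, quadratic_grad, w_star=0.0, n=1000)
print(f"estimated LLC: {llc_hat:.3f} (theory for a regular 1-parameter loss: 0.5)")
```

The step size and the localization strength both bias the estimate slightly, which is one reason hyperparameter calibration matters in practice.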

Advanced Usage

To see DevInterp in action, check out our example notebooks:

For more advanced usage, see the Diagnostics notebook; for a quick guide on picking hyperparameters, see the Calibration notebook. Documentation can be found here.

For papers that either inspired or used the DevInterp package, click here.

Known Issues

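  • The current implementation does not work with transformers out of the box. This can be fixed by adding a wrapper to your model, for example by passing Unpack(model) to sample(), where Unpack is defined by:
from typing import Tuple
import torch
from torch import nn

class Unpack(nn.Module):
    def __init__(self, model: nn.Module):
        super().__init__()  # required so that nn.Module registers the wrapped model
        self.model = model

    def forward(self, data: Tuple[torch.Tensor, torch.Tensor]):
        # Spread the batch tuple into positional arguments for the wrapped model
        return self.model(*data)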
  • LLC estimation is currently more of an art than a science. It will take some time and pain to get it to work reliably.
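
The wrapper from the first known issue can be checked in isolation before handing it to sample(). The sketch below assumes PyTorch is installed; `TwoInputModel` is a hypothetical stand-in for a transformer whose forward takes two tensors, and nothing here calls DevInterp itself.

```python
from typing import Tuple

import torch
from torch import nn


class Unpack(nn.Module):
    """Forwards a tuple batch to the wrapped model as positional arguments."""

    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model

    def forward(self, data: Tuple[torch.Tensor, torch.Tensor]):
        return self.model(*data)


class TwoInputModel(nn.Module):
    """Hypothetical model whose forward expects (tokens, mask) separately."""

    def __init__(self, vocab_size: int = 16, dim: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Zero out masked positions, then project back to vocabulary logits
        hidden = self.embed(tokens) * mask.unsqueeze(-1)
        return self.head(hidden)


torch.manual_seed(0)
model = TwoInputModel()
wrapped = Unpack(model)

tokens = torch.randint(0, 16, (4, 5))  # batch of 4 sequences, length 5
mask = torch.ones(4, 5)

direct = model(tokens, mask)           # the model's native calling convention
via_wrapper = wrapped((tokens, mask))  # single tuple argument, as a batch would arrive

print(torch.equal(direct, via_wrapper))
print(len(list(wrapped.parameters())) == len(list(model.parameters())))
```

Because the wrapper calls super().__init__() and stores the model as an attribute, the wrapped model's parameters remain visible through wrapped.parameters(), which any sampler that updates the weights would need.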

If you run into issues not mentioned here, please first check the GitHub issues, then ask in the DevInterp Discord, and only then open a new GitHub issue.

Contributing

See CONTRIBUTING.md for guidelines on how to contribute.

Credits & Citations

This package was created by Timaeus. The main contributors to this package are Jesse Hoogland, Stan van Wingerden, and George Wang. Zach Furman, Matthew Farrugia-Roberts and Edmund Lau also made valuable contributions or provided useful advice.

If this package was useful in your work, please cite it as:

   @misc{devinterp2024,
      title = {DevInterp},
      author = {Jesse Hoogland and Stan van Wingerden and George Wang},
      year = {2024},
      howpublished = {\url{https://github.com/timaeus-research/devinterp}},
   }

Download files

Source distribution: devinterp-0.2.0.tar.gz (26.3 kB)

Built distribution: devinterp-0.2.0-py3-none-any.whl (32.3 kB, Python 3)
