
Interpretability toolbox for LLMs

Project description

⚠️ Warning

This library is currently in beta, and some functions may not work. If you use it anyway, we welcome your feedback; please open an issue!

The API might change and the documentation is not up to date.

In particular, it is not yet possible to obtain interpretable concept-based explanations.

🚀 Quick Start

The library is available on PyPI: install it with pip install interpreto.

Otherwise, you can clone the repository and install it locally with pip install -e ..

In any case, check out the attribution walkthrough to get started!

📦 What's Included

Interpreto 🪄 provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.

Attribution Methods

Interpreto includes both inference-based and gradient-based attribution methods. The following methods are currently available:

Inference-based Methods:

Gradient-based methods:

We will be adding more gradient-based methods soon:
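To make the two families concrete, here is a minimal, self-contained sketch on a toy linear model (plain NumPy; the function names below are illustrative, not interpreto's API). For a linear model with zero bias, gradient-times-input and single-feature occlusion both recover each feature's exact contribution to the logit.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])  # weights of a toy linear model: logit = w @ x

def model(x):
    return float(w @ x)

def gradient_x_input(x):
    # Gradient-based: for a linear model the gradient is exactly w,
    # so the attribution of feature i is w_i * x_i.
    return w * x

def occlusion(x, baseline=0.0):
    # Inference-based: drop in the logit when feature i is replaced
    # by a baseline value.
    scores = []
    for i in range(len(x)):
        x_masked = x.copy()
        x_masked[i] = baseline
        scores.append(model(x) - model(x_masked))
    return np.array(scores)

x = np.array([1.0, 3.0, 2.0])
# On a linear model both methods agree: [2.0, -3.0, 1.0]
```

Real attribution methods differ in how they estimate these scores for nonlinear models, but the interface is the same: an input in, one relevance score per feature out.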

Concept-Based Methods

Concept-based explanations aim to provide high-level interpretations of latent model representations.

Interpreto generalizes these methods through three core steps:

  1. Concept Discovery (e.g., from latent embeddings)
  2. Concept Interpretation (mapping discovered concepts to human-understandable elements)
  3. Concept-to-Output Attribution (assessing concept relevance to model outputs)
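As an illustration of step 1 (concept discovery), here is a minimal sketch using a truncated SVD, one of the discovery techniques listed below, on a stand-in activation matrix. This is plain NumPy under assumed shapes, not interpreto's API.

```python
import numpy as np

# Concept discovery: a truncated SVD of latent activations
# A (n_samples x d_model) gives A ~= Z @ D, where D holds n_concepts
# directions ("concepts") in the latent space and Z the per-sample
# concept activations.
rng = np.random.default_rng(0)
A = rng.random((100, 16))   # stand-in for real model activations

n_concepts = 4
U, S, Vt = np.linalg.svd(A, full_matrices=False)
D = Vt[:n_concepts]         # (4, 16) concept dictionary, orthonormal rows
Z = A @ D.T                 # (100, 4) concept activations
```

NMF variants constrain Z and D to be non-negative, and SAEs learn the encoder/decoder pair with a neural network, but the Z/D factorization view is the same.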

Concept Discovery Techniques (via Overcomplete):

  • NMF, Semi-NMF, ConvexNMF
  • ICA, SVD, PCA
  • SAE variants (Vanilla SAE, TopK SAE, JumpReLU SAE, BatchTopK SAE)

Available Concept Interpretation Techniques:

  • Top-k tokens from tokenizer vocabulary
  • Top-k tokens/words/sentences/samples from specific datasets
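A minimal sketch of the top-k-tokens idea, assuming a toy vocabulary and a stand-in token-embedding matrix (plain NumPy; not interpreto's API): every token is scored against a concept direction, and the highest-scoring tokens serve as the concept's human-readable description.

```python
import numpy as np

vocab = ["cat", "dog", "car", "road", "tree"]
# Stand-in token-embedding matrix E (vocab_size x d_model).
E = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
    [0.5, 0.5],
])
concept = np.array([1.0, 0.0])   # one concept direction from discovery

scores = E @ concept             # similarity of each token to the concept
top_k = [vocab[i] for i in np.argsort(scores)[::-1][:2]]
# top_k == ["cat", "dog"]: the tokens most aligned with the concept
```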

Concept Interpretation Techniques Coming Soon:

  • Input-to-concept attribution from dataset examples (Jourdan et al. 2023)
  • Theme prediction via LLMs from top-k tokens/sentences

Concept Interpretation Techniques Coming Later:

Concept-to-Output Attribution:

This part is not yet implemented; once it is, all of the attribution methods listed above will also be usable for concept-to-output attribution.

Note that only concept-extraction methods that provide both an encoder (input to concepts) and a decoder (concepts to output) can support this step.
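A small linear sketch of why both pieces are needed (hypothetical shapes, plain NumPy): with a decoder that maps concept activations back to the latent space and a linear output head, the output decomposes exactly into one contribution per concept.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((4, 16))   # concept dictionary acting as the decoder
w = rng.random(16)        # stand-in linear output head
z = rng.random(4)         # concept activations for one sample (encoder output)

logit = (z @ D) @ w       # decode concepts to the latent space, then project
contrib = z * (D @ w)     # concept-to-output attribution, one score per concept
# contrib.sum() equals logit: the attributions tile the output exactly
```

Without a decoder there is no way to propagate a concept's activation forward to the output, which is why encoder-only extractions cannot use this step.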

Specific methods:

[Available later, once all parts are implemented] Thanks to this generalization encompassing all concept-based methods, and to our highly flexible architecture, a large number of concept-based methods can be obtained easily:

Evaluation Metrics

Evaluation Metrics for Attribution

We don't yet have metrics implemented for attribution methods, but that's coming soon!

Evaluation Metrics for Concepts

Several properties of the concept-space are desirable. The concept-space should (1) be faithful to the latent space data distribution; (2) have a low complexity to push toward interpretability; (3) be stable across different training regimes.
  • Concept-space faithfulness: in Interpreto, you can use ReconstructionError to define a custom metric by specifying a reconstruction_space and a distance_function. The MSE and FID metrics are also available.
  • Concept-space complexity: the Sparsity and SparsityRatio metrics are available.
  • Concept-space stability: you can use the Stability metric to compare concept-model dictionaries.
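The three properties can be sketched with generic formulas (plain NumPy; these are illustrative computations, not interpreto's ReconstructionError, Sparsity, or Stability classes).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((50, 8))                            # latent activations
D = np.linalg.svd(A, full_matrices=False)[2][:3]   # 3 concept directions
Z = A @ D.T                                        # concept activations

# (1) Faithfulness: reconstruction error in the latent space, lower is better.
mse = np.mean((A - Z @ D) ** 2)

# (2) Complexity: fraction of nonzero concept activations, lower is sparser.
sparsity_ratio = np.mean(Z != 0)

# (3) Stability: average best cosine match between two concept dictionaries
# learned under different regimes (here D2 = D for illustration, so the
# score is trivially 1).
D2 = D
Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
D2n = D2 / np.linalg.norm(D2, axis=1, keepdims=True)
stability = (Dn @ D2n.T).max(axis=1).mean()
```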

👍 Contributing

Feel free to propose your ideas or contribute to the Interpreto 🪄 toolbox! We have a dedicated document that describes, step by step, how to make your first pull request.

👀 See Also

More from the DEEL project:

  • Xplique, a Python library dedicated to explaining neural networks (images, time series, tabular data) on TensorFlow.
  • Puncc, a Python library for predictive uncertainty quantification using conformal prediction.
  • oodeel, a Python library that performs post-hoc deep Out-of-Distribution (OOD) detection on already-trained neural network image classifiers.
  • deel-lip, a Python library for training k-Lipschitz neural networks on TensorFlow.
  • deel-torchlip, a Python library for training k-Lipschitz neural networks on PyTorch.
  • Influenciae, a Python library dedicated to computing influence values for discovering potentially problematic samples in a dataset.
  • The DEEL White Paper, a summary by the DEEL team on the challenges of certifiable AI and the role of data quality, representativity, and explainability.

🙏 Acknowledgments

This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and FOR projects.

👨‍🎓 Creators

Interpreto 🪄 is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.

🗞️ Citation

If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper (coming soon):

BibTeX entry coming soon

📝 License

The package is released under MIT license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

interpreto-0.3.2.tar.gz (117.2 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

interpreto-0.3.2-py3-none-any.whl (167.3 kB)

Uploaded Python 3

File details

Details for the file interpreto-0.3.2.tar.gz.

File metadata

  • Download URL: interpreto-0.3.2.tar.gz
  • Upload date:
  • Size: 117.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for interpreto-0.3.2.tar.gz
  • SHA256: cde1cd630afdae4f6d2bce338e83e7ead7f451b789525c12340756124e434b58
  • MD5: a9d73d129ba9c0b5fc38ac7814579861
  • BLAKE2b-256: 01a60d062106f69526835cf59941b5032448888e1107323f57f6d82602363593

See more details on using hashes here.

Provenance

The following attestation bundles were made for interpreto-0.3.2.tar.gz:

Publisher: release.yml on FOR-sight-ai/interpreto

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file interpreto-0.3.2-py3-none-any.whl.

File metadata

  • Download URL: interpreto-0.3.2-py3-none-any.whl
  • Upload date:
  • Size: 167.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for interpreto-0.3.2-py3-none-any.whl
  • SHA256: 7708f2116dcf2f39249ef78e64d09e79ea34b6ca41f5ac5cbccc4ed0d9fce091
  • MD5: 9effa39218ad01ea36d6ed944f9b668f
  • BLAKE2b-256: b7a8a8c5e5ce2ceae05d1ac8c06d3d361b250d038eb5ca0030be943604fbb070

See more details on using hashes here.

Provenance

The following attestation bundles were made for interpreto-0.3.2-py3-none-any.whl:

Publisher: release.yml on FOR-sight-ai/interpreto

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
