Interpretability toolbox for LLMs
📚 Table of contents
- 📚 Table of contents
- 🚀 Quick Start
- 📦 What's Included
- 👍 Contributing
- 👀 See Also
- 🙏 Acknowledgments
- 👨‍🎓 Creators
- 🗞️ Citation
- 📝 License
🚀 Quick Start
The library is available on PyPI: install it with pip install interpreto.
Alternatively, clone the repository and install it locally with pip install -e ..
In either case, check out the attribution walkthrough to get started!
📦 What's Included
Interpreto 🪄 provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.
Attribution Methods
Interpreto includes both inference-based and gradient-based attribution methods. The following methods are currently available:
Inference-based Methods:
- Kernel SHAP: Lundberg and Lee, 2017, A Unified Approach to Interpreting Model Predictions.
- LIME: Ribeiro et al. 2016, "Why Should I Trust You?": Explaining the Predictions of Any Classifier.
- Occlusion: Zeiler and Fergus, 2014. Visualizing and understanding convolutional networks.
- Sobol Attribution: Fel et al. 2021, Look at the variance! efficient black-box explanations with sobol-based sensitivity analysis.
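To make the inference-based family concrete, here is a minimal occlusion sketch in plain NumPy. This is an illustration of the perturb-and-measure idea, not Interpreto's API: the toy linear model and the function names are made up for the example.

```python
import numpy as np

# Toy "model": a linear scorer over a 4-feature input.
# In practice this would be an LLM classifier; the linear map stands in for it.
weights = np.array([2.0, -1.0, 0.0, 3.0])

def model(x: np.ndarray) -> float:
    return float(x @ weights)

def occlusion_attribution(x: np.ndarray, baseline: float = 0.0) -> np.ndarray:
    """Score each feature by how much the output drops when it is occluded."""
    original = model(x)
    scores = np.empty_like(x)
    for i in range(len(x)):
        occluded = x.copy()
        occluded[i] = baseline          # replace the feature with a baseline value
        scores[i] = original - model(occluded)
    return scores

x = np.array([1.0, 1.0, 1.0, 1.0])
print(occlusion_attribution(x))  # for a linear model, each score equals w_i * x_i
```

For text, "occluding" a feature typically means replacing a token with a mask or padding token rather than zeroing a number.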
Gradient-based Methods:
- Gradient Shap: Lundberg and Lee, 2017, A Unified Approach to Interpreting Model Predictions.
- InputxGradient: Simonyan et al. 2013, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.
- Integrated Gradient: Sundararajan et al. 2017, Axiomatic Attribution for Deep Networks.
- Saliency: Simonyan et al. 2013, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps.
- SmoothGrad: Smilkov et al. 2017, SmoothGrad: removing noise by adding noise.
- SquareGrad: Hooker et al. 2019, A Benchmark for Interpretability Methods in Deep Neural Networks.
- VarGrad: Richter et al. 2020, VarGrad: A Low-Variance Gradient Estimator for Variational Inference
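The gradient-based family can likewise be sketched with a small, standalone Integrated Gradients example. This is not Interpreto code: the toy function `f` and its analytic gradient stand in for a model and an autograd engine.

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0])

def f(x: np.ndarray) -> float:
    """Toy differentiable model: f(x) = sum(w * x**2)."""
    return float(np.sum(w * x**2))

def grad_f(x: np.ndarray) -> np.ndarray:
    """Analytic gradient; a real setup would use backprop instead."""
    return 2.0 * w * x

def integrated_gradients(x, baseline=None, steps=256):
    """Riemann-sum (midpoint) approximation of Integrated Gradients."""
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * grads

x = np.array([1.0, 1.0, 2.0])
attr = integrated_gradients(x)
# completeness axiom: attributions sum to f(x) - f(baseline)
print(attr, np.isclose(attr.sum(), f(x) - f(np.zeros_like(x))))
```

SmoothGrad, SquareGrad, and VarGrad follow the same pattern but average (or take the square / variance of) gradients over noisy copies of the input instead of along a path.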
Concept-Based Methods
Concept-based explanations aim to provide high-level interpretations of latent model representations.
Interpreto generalizes these methods through three core steps:
- Concept Discovery (e.g., from latent embeddings)
- Concept Interpretation (mapping discovered concepts to human-understandable elements)
- Concept-to-Output Attribution (assessing concept relevance to model outputs)
Concept Discovery Techniques (via Overcomplete):
- NMF, Semi-NMF, ConvexNMF
- ICA, SVD, PCA, KMeans
- SAE variants (Vanilla SAE, TopK SAE, JumpReLU SAE, BatchTopK SAE)
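As a sketch of the concept discovery step, the truncated-SVD variant fits in a few lines of NumPy. This illustrates the general idea only; it is not Overcomplete's or Interpreto's API, and the activation matrix and helper name are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for latent activations: n_samples x d_model.
activations = rng.normal(size=(100, 16))

def svd_concepts(A: np.ndarray, n_concepts: int):
    """Discover a concept dictionary via truncated SVD: A ~ Z @ D."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    D = Vt[:n_concepts]     # (n_concepts, d_model) dictionary of concept directions
    Z = A @ D.T             # (n_samples, n_concepts) concept activations
    return Z, D

Z, D = svd_concepts(activations, n_concepts=4)
print(Z.shape, D.shape)  # (100, 4) (4, 16)
```

NMF and SAE variants fill the same roles (a dictionary `D` and activations `Z`) but impose non-negativity or sparsity constraints instead of orthogonality.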
Available Concept Interpretation Techniques:
- Top-k tokens from tokenizer vocabulary
- Top-k tokens/words/sentences/samples from specific datasets
- LLM Labeling (Bills et al. 2023)
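The top-k-tokens interpretation can be sketched by projecting a concept direction onto an embedding table. Everything here is hypothetical: a real run would use the model's tokenizer embedding matrix, while this example uses a toy one-hot table so the result is deterministic.

```python
import numpy as np

vocab = ["cat", "dog", "car", "road", "tree"]
# Toy one-hot "embedding table" for determinism; a real setup would use
# the model's tokenizer embedding matrix.
token_embeddings = np.eye(len(vocab))
concept = 2.0 * token_embeddings[0] + 1.0 * token_embeddings[1]  # a "pets" direction

def top_k_tokens(concept_vec, embeddings, tokens, k=2):
    """Rank vocabulary tokens by their similarity to a concept direction."""
    scores = embeddings @ concept_vec
    order = np.argsort(scores)[::-1][:k]
    return [tokens[i] for i in order]

print(top_k_tokens(concept, token_embeddings, vocab))  # ['cat', 'dog']
```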
Concept Interpretation Techniques Added Soon:
- Input-to-concept attribution from dataset examples (Jourdan et al. 2023)
- Theme prediction via LLMs from top-k tokens/sentences
Concept Interpretation Techniques Added Later:
- Aligning concepts with human labels (Sajjad et al. 2022)
- Word cloud visualizations of concepts (Dalvi et al. 2022)
- VocabProj & TokenChange (Gur-Arieh et al. 2025)
Concept-to-Output Attribution:
Concept-to-output attribution will be implemented later; all the attribution methods presented above will then be usable here.
Note that only concept extraction methods providing both an encoder (input to concepts) AND a decoder (concepts to output) can use this step.
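For the linear encoder/decoder case, the requirement above can be sketched directly: each concept's contribution to a scalar output is its activation times the effect of its decoder direction. All names here are hypothetical stand-ins, not Interpreto's API.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_concepts = 8, 3
encoder = rng.normal(size=(d_model, n_concepts))   # latent -> concept activations
decoder = rng.normal(size=(n_concepts, d_model))   # concepts -> reconstructed latent
head = rng.normal(size=d_model)                    # latent -> scalar model output

def concept_output_attribution(h: np.ndarray):
    """Per-concept contribution to the output (linear decoder + linear head)."""
    z = h @ encoder                        # concept activations
    contributions = z * (decoder @ head)   # z_j times concept j's effect on the output
    return z, contributions

h = rng.normal(size=d_model)
z, contrib = concept_output_attribution(h)
# the contributions sum exactly to the output of the reconstructed latent
print(np.isclose(contrib.sum(), (z @ decoder) @ head))  # True
```

With a nonlinear head, this exact decomposition no longer holds, which is where the gradient- and inference-based attribution methods above come in.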
Specific methods:
[Available later, once all parts are implemented.] Thanks to this generalization encompassing all concept-based methods, and to our highly flexible architecture, a large number of concept-based methods can be obtained easily:
- CAV and TCAV: Kim et al. 2018, Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
- ConceptSHAP: Yeh et al. 2020, On Completeness-aware Concept-Based Explanations in Deep Neural Networks
- COCKATIEL: Jourdan et al. 2023, COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP
- Yun et al. 2021, Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
- FFN values interpretation: Geva et al. 2022, Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
- SparseCoding: Cunningham et al. 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models
- Parameter Interpretation: Dar et al. 2023, Analyzing Transformers in Embedding Space
Evaluation Metrics
Evaluation Metrics for Attribution
We don't yet have metrics implemented for attribution methods, but that's coming soon!
Evaluation Metrics for Concepts
Several properties of the concept-space are desirable. The concept-space should (1) be faithful to the latent space data distribution; (2) have a low complexity to push toward interpretability; (3) be stable across different training regimes.
- Concept-space faithfulness: In Interpreto, you can use the ReconstructionError to define a custom metric by specifying a reconstruction_space and a distance_function. The MSE or FID metrics are also available.
- Concept-space complexity: the Sparsity and SparsityRatio metrics are available.
- Concept-space stability: the Stability metric compares concept-model dictionaries.
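The faithfulness and complexity metrics can be sketched in NumPy. These are simplified stand-ins for Interpreto's ReconstructionError and SparsityRatio, assuming a concept decomposition of latent activations A into activations Z and dictionary D (A ≈ Z @ D).

```python
import numpy as np

def reconstruction_error(A: np.ndarray, Z: np.ndarray, D: np.ndarray) -> float:
    """Faithfulness: MSE between latent activations and their reconstruction Z @ D."""
    return float(np.mean((A - Z @ D) ** 2))

def sparsity_ratio(Z: np.ndarray, eps: float = 1e-8) -> float:
    """Complexity: fraction of (near-)zero entries in the concept activations."""
    return float(np.mean(np.abs(Z) <= eps))

A = np.array([[1.0, 0.0], [0.0, 2.0]])
D = np.eye(2)                # an identity dictionary reconstructs A perfectly
Z = A @ D.T
print(reconstruction_error(A, Z, D), sparsity_ratio(Z))  # 0.0 0.5
```

Stability metrics compare two dictionaries D1 and D2 (e.g. from different training runs) after matching their concepts, for instance via cosine similarity.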
👍 Contributing
Feel free to propose your ideas or contribute to the Interpreto 🪄 toolbox! We have a dedicated document that describes, step by step, how to make your first pull request.
👀 See Also
More from the DEEL project:
- Xplique a Python library dedicated to explaining neural networks (Images, Time Series, Tabular data) on TensorFlow.
- Puncc a Python library for predictive uncertainty quantification using conformal prediction.
- oodeel a Python library that performs post-hoc deep Out-of-Distribution (OOD) detection on already trained neural network image classifiers.
- deel-lip a Python library for training k-Lipschitz neural networks on TensorFlow.
- deel-torchlip a Python library for training k-Lipschitz neural networks on PyTorch.
- Influenciae a Python library dedicated to computing influence values for the discovery of potentially problematic samples in a dataset.
- DEEL White paper a summary of the DEEL team on the challenges of certifiable AI and the role of data quality, representativity and explainability for this purpose.
🙏 Acknowledgments
This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and FOR projects.
👨‍🎓 Creators
Interpreto 🪄 is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.
🗞️ Citation
If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper (coming soon):
BibTeX entry coming soon
📝 License
The package is released under the MIT license.