Interpretability toolbox for LLMs

Interpreto: Interpretability Toolkit for LLMs

📚 Explore Interpreto docs >>
🖼️ Check out our explanation gallery >>

🚀 Quick Start

The library is available on PyPI. Run pip install interpreto to install it.
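To confirm that the installation succeeded, the installed version can be queried from Python. The sketch below uses only the standard library; the helper name installed_version is hypothetical and not part of Interpreto.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if the package is installed, otherwise None.
print(installed_version("interpreto"))
```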

Check out the tutorials to get started:

📦 What's Included

Interpreto 🪄 provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.

🔥 Attribution Methods

Interpreto includes both inference-based and gradient-based attribution methods.

They all work seamlessly for both classification (...ForSequenceClassification) and generation (...ForCausalLM) models.
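To make the two families concrete, here is a minimal sketch on a toy linear scorer, written in plain Python. It does not use Interpreto's API: toy_score, occlusion_attribution, and gradient_x_input are hypothetical names, and the gradient is approximated by finite differences where a real gradient-based method would rely on automatic differentiation.

```python
from typing import Callable, List

Vector = List[float]

# Toy stand-in for a model: a linear scorer over three input features.
WEIGHTS = [0.5, -1.0, 2.0]


def toy_score(x: Vector) -> float:
    return sum(w * v for w, v in zip(WEIGHTS, x))


def occlusion_attribution(score: Callable[[Vector], float], x: Vector,
                          baseline: float = 0.0) -> Vector:
    """Inference-based: replace one feature at a time by a baseline
    and record how much the score drops."""
    full = score(x)
    attributions = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline
        attributions.append(full - score(occluded))
    return attributions


def gradient_x_input(score: Callable[[Vector], float], x: Vector,
                     eps: float = 1e-6) -> Vector:
    """Gradient-based: gradient times input, with the gradient
    approximated by central finite differences."""
    attributions = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad_i = (score(up) - score(down)) / (2 * eps)
        attributions.append(grad_i * x[i])
    return attributions


x = [1.0, 2.0, 3.0]
occ = occlusion_attribution(toy_score, x)
gxi = gradient_x_input(toy_score, x)
print(occ)
print(gxi)
```

For a linear scorer with a zero baseline, both approaches agree up to numerical error, since each feature contributes its weight times its value. On non-linear models they generally differ.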

Inference-based Methods:

Gradient-based methods:

💡 Concept-Based Methods or Mechanistic Interpretability

Concept-based explanations aim to provide high-level interpretations of latent model representations.

Interpreto generalizes these methods through four core steps:

  1. Split a model in two and obtain a dataset of activations
  2. Concept Discovery (e.g., from latent embeddings)
  3. Concept Interpretation (mapping discovered concepts to human-understandable elements)
  4. Concept-to-Output Attribution (assessing concept relevance to model outputs)

1. Split a model in two and obtain a dataset of activations (mainly via nnsight):

Choose any layer in any HuggingFace language model with our ModelWithSplitPoints based on nnsight. Then pass a dataset through it to obtain a dataset of activations.
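The idea of splitting a model can be illustrated with a toy example in plain Python. This is only a conceptual sketch and not the ModelWithSplitPoints API: first_half, second_half, and full_model are hypothetical stand-ins for the layers before and after the chosen split point.

```python
from typing import List

Vector = List[float]


def first_half(x: Vector) -> Vector:
    """Layers up to the split point: a tiny ReLU layer."""
    return [max(0.0, x[0] + x[1]), max(0.0, x[0] - x[1])]


def second_half(h: Vector) -> float:
    """Layers after the split point: a linear readout."""
    return 2.0 * h[0] - h[1]


def full_model(x: Vector) -> float:
    """The original model is the composition of both halves."""
    return second_half(first_half(x))


# Passing a dataset through the first half yields a dataset of activations.
dataset = [[1.0, 2.0], [3.0, 1.0], [0.0, -1.0]]
activations = [first_half(x) for x in dataset]
print(activations)
```

Concept discovery then operates on the activations, while the second half is kept to relate concepts back to the model output.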

2. Dictionary Learning for Concept Discovery (mainly via overcomplete):
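As a rough illustration of what dictionary learning does, the sketch below factorizes a small non-negative activation matrix A into concept activations U and a concept dictionary V, so that A ≈ UV. It uses non-negative matrix factorization with multiplicative updates, one classic dictionary learning method, written in plain Python. It does not call overcomplete or Interpreto, and the toy matrix and all function names are assumptions made for the example.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def frobenius_error(a: Matrix, u: Matrix, v: Matrix) -> float:
    approx = matmul(u, v)
    return sum((a[i][j] - approx[i][j]) ** 2
               for i in range(len(a)) for j in range(len(a[0]))) ** 0.5


def nmf(a: Matrix, n_concepts: int, n_iter: int = 200,
        seed: int = 0, eps: float = 1e-9) -> Tuple[Matrix, Matrix]:
    """Multiplicative-update NMF: A (n x d) ~ U (n x c) @ V (c x d)."""
    rng = random.Random(seed)
    n, d = len(a), len(a[0])
    u = [[rng.random() + 0.1 for _ in range(n_concepts)] for _ in range(n)]
    v = [[rng.random() + 0.1 for _ in range(d)] for _ in range(n_concepts)]
    for _ in range(n_iter):
        # V <- V * (U^T A) / (U^T U V)
        ut = transpose(u)
        num, den = matmul(ut, a), matmul(matmul(ut, u), v)
        v = [[v[k][j] * num[k][j] / (den[k][j] + eps) for j in range(d)]
             for k in range(n_concepts)]
        # U <- U * (A V^T) / (U V V^T)
        vt = transpose(v)
        num, den = matmul(a, vt), matmul(u, matmul(v, vt))
        u = [[u[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(n_concepts)] for i in range(n)]
    return u, v


# Toy activations: each row mixes two underlying patterns.
A = [[2.0, 0.0, 2.0, 0.0],
     [0.0, 3.0, 0.0, 3.0],
     [1.0, 1.0, 1.0, 1.0],
     [4.0, 2.0, 4.0, 2.0]]

u0, v0 = nmf(A, n_concepts=2, n_iter=0)   # random initialization only
u, v = nmf(A, n_concepts=2, n_iter=200)   # after optimization
print(frobenius_error(A, u0, v0), frobenius_error(A, u, v))
```

Each row of V plays the role of a concept direction in activation space, and each row of U gives the strength of every concept for one sample.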

3. Available Concept Interpretation Techniques:

Concept Interpretation Techniques to Be Added in the Future:

4. Concept-to-Output Attribution:

Estimate the contribution of each concept to the model output.

It can be obtained with any concept-based explainer via MethodConcepts.concept_output_gradient().
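Conceptually, a concept-to-output gradient measures how the model output changes when a concept activation changes. The sketch below shows this on a toy setup in plain Python, using finite differences. It is not the implementation behind concept_output_gradient(): DICTIONARY, HEAD_WEIGHTS, and all function names are assumptions made for the example.

```python
from typing import List

Vector = List[float]

# Hypothetical concept dictionary: 2 concepts living in a 3-dimensional latent space.
DICTIONARY = [[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]]
# Hypothetical second half of the model: a linear head on the latent space.
HEAD_WEIGHTS = [0.5, -1.0, 1.5]


def decode(z: Vector) -> Vector:
    """Map concept activations back to latent activations."""
    return [sum(z[k] * DICTIONARY[k][j] for k in range(len(z)))
            for j in range(len(DICTIONARY[0]))]


def head(h: Vector) -> float:
    """Map latent activations to the model output."""
    return sum(w * v for w, v in zip(HEAD_WEIGHTS, h))


def finite_difference_concept_gradient(z: Vector, eps: float = 1e-6) -> Vector:
    """Approximate d(output)/d(concept) with central finite differences."""
    grads = []
    for k in range(len(z)):
        up, down = list(z), list(z)
        up[k] += eps
        down[k] -= eps
        grads.append((head(decode(up)) - head(decode(down))) / (2 * eps))
    return grads


concept_activations = [1.0, 0.5]
grads = finite_difference_concept_gradient(concept_activations)
print(grads)
```

In this linear toy case, the gradient for each concept is the dot product between its dictionary row and the head weights, so the first concept pushes the output up and the second pushes it down.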

Papers to be made available in the future:

Thanks to this generalization encompassing all concept-based methods and our highly flexible architecture, we can easily obtain a large number of concept-based methods:

📊 Evaluation Metrics

Evaluation Metrics for Attribution

The Insertion and Deletion metrics evaluate the faithfulness of attribution methods.
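A minimal sketch of the Deletion metric in plain Python follows, assuming a toy linear scorer and a zero baseline. Features are removed from most to least important according to the attributions, the score is recorded after each removal, and the area under the resulting curve is computed. The names toy_score, deletion_curve, and area_under_curve are hypothetical and are not Interpreto's API.

```python
from typing import Callable, List

Vector = List[float]

WEIGHTS = [0.5, -1.0, 2.0]


def toy_score(x: Vector) -> float:
    """Toy stand-in for a model: a linear scorer."""
    return sum(w * v for w, v in zip(WEIGHTS, x))


def deletion_curve(score: Callable[[Vector], float], x: Vector,
                   attributions: Vector, baseline: float = 0.0) -> List[float]:
    """Remove features in decreasing order of attribution, tracking the score."""
    order = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)
    current = list(x)
    curve = [score(current)]
    for i in order:
        current[i] = baseline
        curve.append(score(current))
    return curve


def area_under_curve(curve: List[float]) -> float:
    """Trapezoidal rule with the x axis normalized to [0, 1]."""
    n = len(curve) - 1
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(n)) / n


x = [1.0, 2.0, 3.0]
faithful = [0.5, -2.0, 6.0]            # matches each feature's true contribution
unfaithful = [-a for a in faithful]    # ranks the features in the opposite order

auc_faithful = area_under_curve(deletion_curve(toy_score, x, faithful))
auc_unfaithful = area_under_curve(deletion_curve(toy_score, x, unfaithful))
print(auc_faithful, auc_unfaithful)
```

For Deletion, a lower area indicates more faithful attributions, because removing the truly important features first makes the score drop quickly. Insertion mirrors this: it starts from the baseline input and adds features back in the same order, so a higher area is better.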

Evaluation Metrics for Concepts

Concept-based methods have several steps that can be evaluated together via ConSim.

Or independently:

👍 Contributing

Feel free to propose your ideas or contribute to the Interpreto 🪄 toolbox! A dedicated document explains in simple terms how to make your first pull request.

👀 See Also

More from the DEEL project:

  • Xplique: a Python library dedicated to explaining neural networks (Images, Time Series, Tabular data) on TensorFlow.
  • Puncc: a Python library for predictive uncertainty quantification using conformal prediction.
  • oodeel: a Python library that performs post-hoc deep Out-of-Distribution (OOD) detection on already trained neural network image classifiers.
  • deel-lip: a Python library for training k-Lipschitz neural networks on TensorFlow.
  • deel-torchlip: a Python library for training k-Lipschitz neural networks on PyTorch.
  • Influenciae: a Python library dedicated to computing influence values for the discovery of potentially problematic samples in a dataset.
  • DEEL White paper: a summary by the DEEL team of the challenges of certifiable AI and the role of data quality, representativity and explainability for this purpose.

🙏 Acknowledgments

This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and the FOR projects.

👨‍🎓 Creators

Interpreto 🪄 is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.

🗞️ Citation

If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper:

@article{poche2025interpreto,
    title       = {Interpreto: An Explainability Library for Transformers},
    author      = {Poch{\'e}, Antonin and Mullor, Thomas and Sarti, Gabriele and Boisnard, Fr{\'e}d{\'e}ric and Friedrich, Corentin and Claye, Charlotte and Hoofd, Fran{\c{c}}ois and Bernas, Raphael and Hudelot, C{\'e}line and Jourdan, Fanny},
    journal     = {arXiv preprint arXiv:2512.09730},
    year        = {2025}
}

📝 License

The package is released under the MIT license.
