interpreto

Interpretability toolbox for LLMs

Interpreto: Interpretability Toolkit for LLMs

📚 Explore Interpreto docs >>
🖼️ Check out our explanation gallery >>
📜 Read our paper >>

🚀 Quick Start

The library is available on PyPI. Install it with pip install interpreto.

Check out the tutorials to get started:

Attributions walkthrough (both classification and generation)
Classification concept-based explanations
Generation concept-based explanations

📦 What's Included

Interpreto 🪄 provides a modular framework encompassing Attribution Methods, Concept-Based Methods, and Evaluation Metrics.

🔥 Attribution Methods

Interpreto includes both inference-based and gradient-based attribution methods.

They all work seamlessly with both classification (...ForSequenceClassification) and generation (...ForCausalLM) models.

Inference-based methods:

Gradient-based methods:

GradientShap — Lundberg and Lee, 2017
InputxGradient — Simonyan et al., 2013
Integrated Gradient — Sundararajan et al., 2017
Saliency — Simonyan et al., 2013
SmoothGrad — Smilkov et al., 2017
SquareGrad — Hooker et al., 2019
VarGrad — Richter et al., 2020
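To make the gradient-based family concrete, here is a minimal pure-Python sketch of four of the ideas listed above (saliency, input times gradient, integrated gradients, and SmoothGrad) on a toy function. It does not use Interpreto's API: the toy model, its weights, and all function names are assumptions made for illustration.

```python
import random

# Toy "model" (an assumption for illustration): f(x) = sum_i w_i * x_i^2
WEIGHTS = [0.5, -1.0, 2.0]


def model(x):
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def gradient(x):
    # Analytic gradient of the toy model: df/dx_i = 2 * w_i * x_i
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


def saliency(x):
    # Absolute value of the gradient at the input.
    return [abs(g) for g in gradient(x)]


def input_x_gradient(x):
    # Element-wise product of the input and its gradient.
    return [xi * g for xi, g in zip(x, gradient(x))]


def integrated_gradients(x, baseline, steps=50):
    # Midpoint Riemann sum of the gradient along the straight path
    # from the baseline to the input, scaled by (x - baseline).
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g / steps
    return [(xi - b) * t for xi, b, t in zip(x, baseline, total)]


def smoothgrad(x, noise_std=0.1, samples=200, seed=0):
    # Average of the gradients at Gaussian-perturbed copies of the input.
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        noisy = [xi + rng.gauss(0.0, noise_std) for xi in x]
        for i, g in enumerate(gradient(noisy)):
            total[i] += g / samples
    return total
```

With a zero baseline, the integrated-gradients attributions of this toy model sum to model(x) - model(baseline), which is the completeness property the method is known for.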

💡 Concept-Based Methods or Mechanistic Interpretability

Concept-based explanations aim to provide high-level interpretations of latent model representations.

We propose both supervised (probes and CAVs) and unsupervised (dictionary learning) approaches.

Interpreto generalizes these methods through four core steps; the first two are common to both approaches:

Split a model in two and obtain a dataset of activations
Learn concepts (e.g., from latent embeddings)
Interpret concepts (mapping discovered concepts to human-understandable elements)
Estimate concept importance (assessing concept relevance to model outputs)

1. Split a model in two and obtain a dataset of activations (mainly via nnsight):

Choose any layer in any HuggingFace language model with our ModelWithSplitPoints based on nnsight. Then pass a dataset through it to obtain a dataset of activations.
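The idea of splitting a model can be sketched without any library: treat the network as the composition of a first half (input to latent activations) and a second half (latent activations to output), then run a dataset through the first half only. The toy network, its weights, and the function names below are assumptions for illustration; they are not ModelWithSplitPoints or nnsight calls.

```python
import math

# Hypothetical two-layer toy network; the weights are made up for illustration.
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]  # 3 latent units, 2 inputs
W2 = [1.0, -1.0, 0.5]                        # 1 output, 3 latent units


def first_half(x):
    """Everything before the split point: input -> latent activations."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]


def second_half(activations):
    """Everything after the split point: latent activations -> output."""
    return sum(w * a for w, a in zip(W2, activations))


def full_model(x):
    return second_half(first_half(x))


# Passing a dataset through the first half yields a dataset of activations.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
activation_dataset = [first_half(x) for x in inputs]
```

Concepts are then learned on activation_dataset, and the second half is what later links concepts back to the model output.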

2. (supervised) Train probes with the ProbeExplainer

We distinguish two families of probes:

Linear probes: LinearRegressionProbe, LogisticRegressionProbe, LinearSVMProbe, MeansDiffProbe
Centroid-based probes: CosineCentroidProbe, DotProductCentroidProbe, SqL2CentroidProbe, SVDDCentroidProbe, DiagonalMahalanobisCentroidProbe

Both can be tuned with bias_calibrator and normalization parameters.
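As a rough illustration of the two families, the sketch below builds a difference-of-means direction (a linear probe) and a cosine similarity to a class centroid (a centroid-based probe) from labelled activations. It is a simplified stand-in, not the implementation behind MeansDiffProbe or CosineCentroidProbe; the helper names and the toy data are assumptions.

```python
import math


def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def means_diff_direction(positives, negatives):
    # Linear probe: the concept direction is mean(positives) - mean(negatives).
    mu_pos, mu_neg = mean_vector(positives), mean_vector(negatives)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def cosine_centroid_score(activation, positives):
    # Centroid-based probe: similarity between an activation and the class centroid.
    return cosine(activation, mean_vector(positives))


# Toy activations with and without the concept (made-up values).
positives = [[1.0, 0.2], [0.8, 0.0], [1.2, -0.2]]
negatives = [[-1.0, 0.1], [-0.9, -0.1], [-1.1, 0.0]]
direction = means_diff_direction(positives, negatives)
```

Projecting a new activation onto direction with dot gives a positive score when it lies on the concept side and a negative score otherwise.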

2. (unsupervised) Dictionary Learning for Concept Discovery (mainly via overcomplete):

Interpret neurons directly via NeuronsAsConcepts
NMF, Semi-NMF, ConvexNMF
ICA, SVD, PCA, KMeans
SAE variants: Vanilla SAE, TopK SAE, JumpReLU SAE, BatchTopK SAE
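For readers unfamiliar with the SAE variants, the following is a minimal sketch of a TopK sparse autoencoder forward pass: encode, keep only the k largest codes, decode. The weights are hand-picked rather than trained, and the function names are assumptions; this is not the overcomplete or Interpreto implementation.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def top_k(values, k):
    # Keep the k largest entries and zero out the rest.
    keep = set(sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def sae_forward(activation, w_enc, b_enc, w_dec, b_dec, k):
    # Encode: codes = TopK(ReLU(W_enc (a - b_dec) + b_enc))
    centered = [a - b for a, b in zip(activation, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    codes = top_k(relu(pre), k)
    # Decode: reconstruction = W_dec codes + b_dec
    reconstruction = [r + b for r, b in zip(matvec(w_dec, codes), b_dec)]
    return codes, reconstruction


# Hand-picked, untrained weights: 4 concepts for a 2-dimensional latent space.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
B_DEC = [0.0, 0.0]

codes, reconstruction = sae_forward([0.9, -0.4], W_ENC, B_ENC, W_DEC, B_DEC, k=2)
```

Lowering k makes the codes sparser at the cost of a less exact reconstruction, which is the trade-off the concept-space metrics further down are meant to quantify.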

3. (unsupervised) Available Concept Interpretation Techniques:

Top-k tokens from tokenizer vocabulary via TopKInputs and use_vocab=True
Top-k tokens/words/sentences/samples from specific datasets via TopKInputs
Label concepts via LLMs with LLMLabels (Bills et al. 2023)
Input-to-concept attribution from dataset examples (Concept Attributions) (Jourdan et al. 2023)
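The top-k idea behind the first two techniques can be sketched in a few lines: for a given concept, rank the inputs by how strongly they activate it and keep the best k. The function name, the deduplication rule, and the toy data are assumptions for illustration and do not reproduce the TopKInputs API.

```python
import heapq


def top_k_inputs(tokens, concept_activations, concept_index, k=3):
    """Return the k tokens that most activate one concept, with their scores."""
    best = {}
    for token, activations in zip(tokens, concept_activations):
        score = activations[concept_index]
        # A token may occur several times; keep its strongest activation.
        if score > best.get(token, float("-inf")):
            best[token] = score
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])


# Toy data: one row of concept activations per token occurrence (made-up values).
tokens = ["cat", "dog", "car", "cat", "tree"]
concept_activations = [
    [0.9, 0.1],
    [0.8, 0.0],
    [0.1, 0.7],
    [0.5, 0.2],
    [0.0, 0.3],
]

animal_like = top_k_inputs(tokens, concept_activations, concept_index=0, k=2)
```

The same ranking applies to words, sentences, or whole samples once activations are aggregated at that granularity.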

Concept interpretation techniques to be added in the future:

Aligning concepts with human labels (Sajjad et al. 2022)
Word cloud visualizations of concepts (Dalvi et al. 2022)
VocabProj & TokenChange (Gur-Arieh et al. 2025)

4. (unsupervised) Concept-to-Output Attribution:

Estimate the contribution of each concept to the model output.

Can be obtained with any concept-based explainer via MethodConcepts.concept_output_gradient().
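Conceptually, a concept-to-output gradient measures how the model output changes when one concept activation is nudged. The sketch below approximates it with central finite differences on a toy pipeline (concepts decoded to latent activations, then a linear readout). The dictionary, readout, and function names are assumptions; Interpreto's own method presumably relies on automatic differentiation rather than this approximation.

```python
# Toy decoder: each row is one concept's direction in a 2-dimensional latent space.
DICTIONARY = [[1.0, 0.0], [0.5, 0.5]]
# Toy second half of the model: a linear readout of the latent activations.
READOUT = [2.0, -1.0]


def output_from_concepts(concepts):
    latent = [
        sum(c * row[j] for c, row in zip(concepts, DICTIONARY))
        for j in range(len(READOUT))
    ]
    return sum(r * a for r, a in zip(READOUT, latent))


def finite_difference_concept_gradient(output_fn, concepts, eps=1e-5):
    """Approximate d(output)/d(concept_i) for every concept i."""
    gradients = []
    for i in range(len(concepts)):
        up, down = list(concepts), list(concepts)
        up[i] += eps
        down[i] -= eps
        gradients.append((output_fn(up) - output_fn(down)) / (2 * eps))
    return gradients


concept_gradients = finite_difference_concept_gradient(output_from_concepts, [0.3, 0.7])
```

For this linear toy pipeline the gradient of concept i is simply the dot product of its dictionary row with the readout.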

Paper-specific methods available in the future:

Because this generalization covers all concept-based methods and the architecture is highly flexible, a large number of published concept-based methods can be obtained by combining the steps above:

ConceptSHAP: Yeh et al. 2020, On Completeness-aware Concept-Based Explanations in Deep Neural Networks
COCKATIEL: Jourdan et al. 2023, COCKATIEL: COntinuous Concept ranKed ATtribution with Interpretable ELements for explaining neural net classifiers on NLP
Yun et al. 2021, Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors
FFN values interpretation: Geva et al. 2022, Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
SparseCoding: Cunningham et al. 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models
Parameter Interpretation: Dar et al. 2023, Analyzing Transformers in Embedding Space

📊 Evaluation Metrics

Evaluation Metrics for Attribution

To evaluate the faithfulness of attribution methods, Interpreto provides the Insertion and Deletion metrics.
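The principle behind both metrics can be sketched as follows: features are removed (deletion) or added (insertion) in decreasing order of attribution, the model output is recorded after each step, and the area under the resulting curve summarizes faithfulness. The function names, the zero baseline, and the toy linear model are assumptions; Interpreto's Insertion and Deletion classes may differ in granularity and normalization.

```python
def deletion_curve(model, x, attributions, baseline=0.0):
    # Replace features by the baseline, most important first.
    order = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)
    current = list(x)
    scores = [model(current)]
    for i in order:
        current[i] = baseline
        scores.append(model(current))
    return scores


def insertion_curve(model, x, attributions, baseline=0.0):
    # Start from the baseline and restore features, most important first.
    order = sorted(range(len(x)), key=lambda i: attributions[i], reverse=True)
    current = [baseline] * len(x)
    scores = [model(current)]
    for i in order:
        current[i] = x[i]
        scores.append(model(current))
    return scores


def area_under_curve(scores):
    # Trapezoidal rule with the x-axis normalized to [0, 1].
    steps = len(scores) - 1
    return sum((scores[i] + scores[i + 1]) / 2 for i in range(steps)) / steps


# Toy linear model and two candidate attributions (made-up values).
WEIGHTS = [3.0, 1.0, 2.0]


def toy_model(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x))


x = [1.0, 1.0, 1.0]
faithful = [3.0, 1.0, 2.0]    # matches the true contributions
unfaithful = [1.0, 3.0, 2.0]  # ranks the least important feature first

good_deletion = area_under_curve(deletion_curve(toy_model, x, faithful))
bad_deletion = area_under_curve(deletion_curve(toy_model, x, unfaithful))
```

A faithful attribution makes the output drop quickly under deletion (lower area) and rise quickly under insertion (higher area).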

Evaluation Metrics for Concepts

Concept-based methods have several steps that can be evaluated together via ConSim.

Or independently:

Concept-space (dictionary learning evaluation)
- faithfulness: MSE, FID, and ReconstructionError
- complexity: Sparsity and SparsityRatio
- stability: Stability
Concepts interpretations
- No metric yet, will be included soon.
Concept-to-Output attribution
- No metric yet, will be included soon.
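As a rough illustration of the concept-space metrics, the sketch below computes a mean squared reconstruction error (faithfulness) and the fraction of active concept codes (complexity). These are common textbook definitions used here as assumptions; the exact formulas behind Interpreto's MSE, Sparsity, and SparsityRatio may differ, and the function names are hypothetical.

```python
def mean_squared_error(originals, reconstructions):
    """Average squared difference between activations and their reconstructions."""
    total, count = 0.0, 0
    for original, reconstruction in zip(originals, reconstructions):
        for a, r in zip(original, reconstruction):
            total += (a - r) ** 2
            count += 1
    return total / count


def active_fraction(codes, tolerance=1e-12):
    """Fraction of concept codes that are non-zero; lower means sparser."""
    flat = [c for row in codes for c in row]
    return sum(1 for c in flat if abs(c) > tolerance) / len(flat)


# Toy activations, reconstructions, and concept codes (made-up values).
activations = [[1.0, 2.0]]
reconstructions = [[1.0, 4.0]]
concept_codes = [[0.9, 0.0, 0.0, 0.4], [0.0, 0.0, 0.0, 0.0]]

reconstruction_mse = mean_squared_error(activations, reconstructions)
code_density = active_fraction(concept_codes)
```

Reading the two numbers together shows the usual trade-off: sparser codes are easier to interpret but tend to reconstruct the activations less exactly.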

👍 Contributing

Feel free to propose your ideas or to contribute to the Interpreto 🪄 toolbox! A dedicated document explains, step by step, how to make your first pull request.

👀 See Also

🙏 Acknowledgments

This project received funding from the French “Investing for the Future – PIA3” program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL and the FOR projects.

👨‍🎓 Creators

Interpreto 🪄 is a project of the FOR and the DEEL teams at the IRT Saint-Exupéry in Toulouse, France.

🗞️ Citation

If you use Interpreto 🪄 as part of your workflow in a scientific publication, please consider citing 🗞️ our paper:

@article{poche2025interpreto,
    title       = {Interpreto: An Explainability Library for Transformers},
    author      = {Poch{\'e}, Antonin and Mullor, Thomas and Sarti, Gabriele and Boisnard, Fr{\'e}d{\'e}ric and Friedrich, Corentin and Claye, Charlotte and Hoofd, Fran{\c{c}}ois and Bernas, Raphael and Hudelot, C{\'e}line and Jourdan, Fanny},
    journal     = {arXiv preprint arXiv:2512.09730},
    year        = {2025}
}

📝 License

The package is released under the MIT license.

Release history

Version	Release date
0.5.0 (this version)	Jun 22, 2026
0.5.0.dev1 (pre-release)	Jun 10, 2026
0.5.0.dev0 (pre-release)	May 25, 2026
0.4.20	Mar 20, 2026
0.4.19	Mar 16, 2026
0.4.18	Mar 13, 2026
0.4.17	Mar 3, 2026
0.4.16	Feb 16, 2026
0.4.15	Jan 20, 2026
0.4.14	Jan 13, 2026
0.4.13	Jan 13, 2026
0.4.12	Jan 9, 2026
0.4.11	Dec 3, 2025
0.4.10	Nov 20, 2025
0.4.9	Nov 13, 2025
0.4.8	Oct 24, 2025
0.4.7	Oct 13, 2025
0.4.6	Oct 10, 2025
0.4.5	Sep 30, 2025
0.4.4	Sep 26, 2025
0.4.3	Sep 26, 2025
0.4.2	Sep 25, 2025
0.4.1	Sep 23, 2025
0.4.0	Sep 11, 2025
0.3.4	Aug 11, 2025
0.3.3	Jul 22, 2025
0.3.2	Jul 1, 2025
0.3.1	Jun 26, 2025
0.3.0	Jun 22, 2025
0.2.4	Jun 13, 2025
0.2.3	Jun 11, 2025
0.2.2	Jun 5, 2025
0.1.0	Feb 26, 2025

Download files

Source Distribution

interpreto-0.5.0.tar.gz (230.7 kB view details)

Uploaded Jun 22, 2026 Source

Built Distribution

interpreto-0.5.0-py3-none-any.whl (338.7 kB view details)

Uploaded Jun 22, 2026 Python 3

File details

Details for the file interpreto-0.5.0.tar.gz.

File metadata

Download URL: interpreto-0.5.0.tar.gz
Upload date: Jun 22, 2026
Size: 230.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for interpreto-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`f1f2002b3d9fc88e1c2dde0d6956b1100acf0e12bad932425ccfb90b0a2e861e`
MD5	`fde2b3d79dfe3f143004aa639c8f766e`
BLAKE2b-256	`5c3111b41d28747c13ad05d18575eb40e56f149184e8ee9e734407ea5357251e`
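A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. The sketch below is one way to do it; the helper names are hypothetical and the file path in the usage note assumes the archive sits in the current directory.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Return True when the file's digest equals the published one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("interpreto-0.5.0.tar.gz", "f1f2002b3d9fc88e1c2dde0d6956b1100acf0e12bad932425ccfb90b0a2e861e") should return True for an intact copy of the source distribution.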

Provenance

The following attestation bundles were made for interpreto-0.5.0.tar.gz:

Publisher: release.yml on FOR-sight-ai/interpreto

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: interpreto-0.5.0.tar.gz
- Subject digest: f1f2002b3d9fc88e1c2dde0d6956b1100acf0e12bad932425ccfb90b0a2e861e
- Sigstore transparency entry: 1917739470
- Sigstore integration time: Jun 22, 2026
Source repository:
- Permalink: FOR-sight-ai/interpreto@b8228f94fc9b305da126c41a6bec44b8890d28eb
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/FOR-sight-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b8228f94fc9b305da126c41a6bec44b8890d28eb
- Trigger Event: push

File details

Details for the file interpreto-0.5.0-py3-none-any.whl.

File metadata

Download URL: interpreto-0.5.0-py3-none-any.whl
Upload date: Jun 22, 2026
Size: 338.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for interpreto-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0c1b7cd31a7e44eb535289d7dbde38eeb1b2319b75cf610f4aa029ec34445d56`
MD5	`b5dcee1ba8ddd7954abcb220ca409a4d`
BLAKE2b-256	`381746baff1ea55fccf452c1a4387986fa16d6691741cf7932092525ab65970d`

Provenance

The following attestation bundles were made for interpreto-0.5.0-py3-none-any.whl:

Publisher: release.yml on FOR-sight-ai/interpreto

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: interpreto-0.5.0-py3-none-any.whl
- Subject digest: 0c1b7cd31a7e44eb535289d7dbde38eeb1b2319b75cf610f4aa029ec34445d56
- Sigstore transparency entry: 1917739657
- Sigstore integration time: Jun 22, 2026
Source repository:
- Permalink: FOR-sight-ai/interpreto@b8228f94fc9b305da126c41a6bec44b8890d28eb
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/FOR-sight-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b8228f94fc9b305da126c41a6bec44b8890d28eb
- Trigger Event: push
