Library and command-line tool to list features lowering your evaluation score.

These details have not been verified by PyPI

Project links

Project description

TruLens Hotspots

TruLens Hotspots is a tool for listing features in your evaluation data that correlate with worse results, according to your evaluation metric.

TruLens Hotspots:

works with any evaluation metric, be they human scores, formula-based scores (F1, BLEU, ROUGE), learnable evaluation metrics (COMET) or LLM-as-a-judge scores
is a totally black-box method, any model can be diagnosed
all that is required is just a data frame with per-sample evaluation scores
is easily pluggable into your Python code (at the end of the day, it's a function that takes a dataframe) or can be used as a stand-alone command-line tool
can be run stand-alone or as a part of the whole TruLens framework

How to run

As a command-line tool

python -m trulens.hotspots file.csv \
   --score_column COLUMN_WITH_EVALUATION_SCORES \
   [--skip_columns OPTIONAL_SPACE_SEPARATED_LIST_COLUMNS_TO_BE_DISREGARDED] \
   [--more_skipped_columns OPTIONAL_SPACE_SEPARATED_LIST_COLUMNS_TO_BE_DISREGARDED]

Skipping columns

Some columns might correlate with the evaluation metric in a trivial manner, let's say your mean metric is F-score, but you also have columns for precision and recall, it does not make any sense to look at them for hotspots. The list of columns to be skipped is determined in the following manner:

Some columns are skipped automatically using a simple heuristic (a regular expression for things like "score", "precision", "false negative", etc.)
You can override them with the --skip_columns argument.
... or you can add more with the --more_skipped_columns argument.

As a stand-alone library

from trulens.hotspots import hotspots_as_df, HotspotsConfig

# A Pandas data frame with per-item evaluation scores, the output of whatever evaluation process
# you're using
evaluation_results_df = ...

# The name of the column with the evaluation score
column_with_evaluation_scores = ...

# The names of the columns to be skipped (e.g. auxiliary evaluation scores)
columns_to_be_skipped = ...

hotspots_config=HotspotsConfig(
   score_column=column_with_evaluation_score,
   skip_columns=columns_to_be_skipped)


hotspots_df = hotspots_as_df(hotspots_config, evaluation_results_df)
# hotspots_df is a data frame with hotspots

As a part of the whole TruLens framework

from trulens.core.session import TruSession
from trulens.tru_hotspots import get_hotspots

session = TruSession()

# ... running the whole experiment

feedback_name = "Comprehensiveness" # one of the feedbacks
hotspots_df = get_hotspots(session, feedback="Comprehensiveness")
# hotspots_df is a data frame with hotspots

As a part of a stand-alone Streamlit app

It's easy to run TruLens Hotspots as a part of a Streamlit app, see trulens/streamlit.py for helper functions and an example. You can run it as follows:

pip install streamlit
python -m trulens.hotspots.hotspots_streamlit

Screenshot for a sample Streamlit app

Similarly, you can use TruLens Hotspots for a Streamlit-in-Snowflake app.

TruLens Hotspots output

TruLens Hotspots returns a table with the following columns:

Feature - one of the following types:
- FOO=bar - field FOO is equal to bar
- FOO:bar - field FOO contains the word bar
- len(FOO)>=10 - the length of field FOO is greater or equal to 10
- FOO<20 - the value in field FOO is less than 20
Number of occurrences - how many samples the feature occurred in
Average score - average score for samples containing the feature (it will be significantly worse than the total score!)
Deterioration - the delta between the average score for samples containing the feature (i.e. column 3) and the average score for the rest of samples (almost always negative)
Opportunity - how much the average total score would improve if we had somehow fixed the problem with the feature (i.e. move it to the level of the average of negative examples)
p-Value - assuming that the feature has no real effect on worsening scores, how likely is to get the such a distribution of scores (usually, a tiny value).

Example

Here's the output for a Hotspots run for some LLM evaluated on MMLU:

len(QUESTION)>=117              7027	0.60772753	-0.16549267	+0.08281705	0.00000000000000000000
QUESTION:1	                1711	0.46492572	-0.25676377	+0.03128634	0.00000000000000000000
DOC_ID:mathematics	        748	0.46494490	-0.23814388	+0.01268563	0.00000000000000000000
TAGS=["professional_law"]	1534	0.49343116	-0.22112893	+0.02415694	0.00000000000000000000
DOC_ID:physics	                488	0.56084207	-0.13422581	+0.00466473	0.00000000000000000000
CHOICES:2	                574	0.47180662	-0.22791301	+0.00931648	0.00000000000000000000
TARGET=Wrong, Not wrong	        217	0.32424306	-0.37190742	+0.00574732	0.00000000000000000000
TAGS=["virology"]	        166	0.52613443	-0.16623389	+0.00196516	0.00000000000060876997
CHOICES:0	                372	0.43726682	-0.26002489	+0.00688857	0.00000000000000000000
TAGS=["global_facts"]	        100	0.45071932	-0.24140297	+0.00171915	0.00000000000000904794
TAGS=["professional_accounting"]282	0.51067434	-0.18341221	+0.00368340	0.00000000000000000000
DOC_ID:chemistry	        303	0.56378787	-0.12940765	+0.00279237	0.00000000006836024711
TARGET=Wrong, Wrong	        213	0.37777224	-0.31744618	+0.00481527	0.00000000000000000000
CHOICES=["True, True", ...]	103	0.33951457	-0.35348142	+0.00259283	0.00000000000000000000
CHOICES:10	                317	0.45494130	-0.24090021	+0.00543835	0.00000000000000000000
TAGS=["formal_logic"]	        126	0.51026687	-0.18176729	+0.00163101	0.00000000000034218325
TARGET:only	                245	0.54455000	-0.14844314	+0.00258998	0.00000000000003856501
TAGS=["high_school_mathematics"]270	0.38323714	-0.31318800	+0.00602199	0.00000000000000000000
CHOICES:admissible	        111	0.40249648	-0.29020067	+0.00229399	0.00000000000000000000
QUESTION:Z	                48	0.35212494	-0.33943852	+0.00116031	0.00000000000110015792

As you can see, the hardest examples were the ones with long questions (with at least 117 characters), there were 7027 samples like this, their average score (0.6077) was worse than the average score for other questions (shorter than 117 characters), the difference is 0.1655, which is quite high. The last column indicates that it is virtually impossible to get it by chance. If we somehow fixed the issue with long questions (not saying it would be easy!), we would improve the score by 8 points, which would be huge for MMLU. The second most "troublesome" feature was having "1" in the QUESTION field, i.e. questions about numbers or containing an enumeration. The difference is even bigger (0.2568) than for long questions, but there are fewer of them, so the potential gain is big (3 points), but smaller than for longer questions.

Then follow the features basically representing the hardest domains: mathematics, law, physics (for mathematics and physics even when subtracting the bad effect of long questions and questions containing 1!). Law looks particularly interesting, making the model on par with other domains would improve the score by 2 points.

And there are more interesting observations: questions with numbers in possible answers (CHOICES), with the expected answer being Wrong, Not wrong or containing the word "only" (TARGET) or with letter "Z" pose a special difficulty to our model.

What's the use?

Basically, with TruLens Hotspots, you can easily find features of input, expected output, actual output and metadata that make your evaluation scores go down. You can then diagnose and attack the core issue: it might be a simple problem with pre/post-processing, or maybe you need more or better annotated training (or evaluation) data sets, or maybe you even need to change your model in a significant manner.

What is cool about TruLens Hotspots is that you don't have to pre-define categories of samples to consider ("I will list the average score for domains to check whether any of them got significantly worse score"), TruLens Hotspots will just look at all features and will list the most problematic things for you. In this way, you can find biases of the model you might not even be aware of (like your model being much worse for people from San Mateo than for people from San Jose).

Frequently Asked Questions

Q: So this is just yet another method for explainability, something like Shapley values?
A: No, this is not about how much each feature contributes to the generated output. TruLens Hotspots is about finding "troublemakers", features that make your evaluation scores worse in a systematic manner.

Q: Why are you using p-values? Weren't they discredited? Shouldn't you use the Bonferroni correction at least? A: We're using p-values just for sorting hotspots, there is no claim of statistical significance, we don't use any artificial significance level.

History

TruLens Hotspots is based on the ideas implemented in GEval. An important improvement is that TruLens Hotspots discards features highly correlating with already listed features.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

2.7.2

Apr 9, 2026

2.7.1

Mar 10, 2026

2.7.0

Feb 19, 2026

2.6.0

Feb 3, 2026

2.5.3

Jan 15, 2026

2.5.2

Dec 11, 2025

2.5.1

Nov 21, 2025

2.5.0

Nov 14, 2025

2.4.2

Oct 21, 2025

2.4.1

Oct 8, 2025

2.4.0

Sep 23, 2025

2.3.1

Sep 4, 2025

2.3.0

Aug 28, 2025

2.2.4

Aug 20, 2025

2.2.3

Aug 19, 2025

This version

2.2.2

Aug 18, 2025

2.2.1

Aug 13, 2025

2.2.0

Aug 11, 2025

2.1.4

Aug 6, 2025

2.1.2

Jul 31, 2025

2.1.2a0 pre-release

Jul 29, 2025

2.1.1

Jul 25, 2025

2.1.0

Jul 22, 2025

2.0.2a0 pre-release

Jul 17, 2025

1.5.3

Jul 1, 2025

1.5.2

Jun 18, 2025

1.5.1

Jun 5, 2025

1.5.0

Jun 2, 2025

1.4.9

Apr 10, 2025

1.4.8

Apr 3, 2025

1.4.7

Mar 20, 2025

1.4.6

Mar 13, 2025

1.4.5

Mar 6, 2025

1.4.4

Feb 27, 2025

1.4.3

Feb 26, 2025

1.4.2

Feb 20, 2025

1.4.1

Feb 20, 2025

1.4.1a1 pre-release

Feb 18, 2025

1.4.1a0 pre-release

Feb 18, 2025

1.4.0

Feb 13, 2025

1.4.0a0 pre-release

Feb 12, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

trulens_hotspots-2.2.2.tar.gz (16.2 kB view details)

Uploaded Aug 18, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

trulens_hotspots-2.2.2-py3-none-any.whl (14.2 kB view details)

Uploaded Aug 18, 2025 Python 3

File details

Details for the file trulens_hotspots-2.2.2.tar.gz.

File metadata

Download URL: trulens_hotspots-2.2.2.tar.gz
Upload date: Aug 18, 2025
Size: 16.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.11

File hashes

Hashes for trulens_hotspots-2.2.2.tar.gz
Algorithm	Hash digest
SHA256	`f8231ce40e3600939fbe91afc7546f57d50a705e013f95a427aaebe224959d9e`
MD5	`333dcc31d7b5d72b3eca604e13f558f2`
BLAKE2b-256	`ada526b4b976969ddfaff7217a6db80ef590df96c344e18a98930a427db0bce3`

See more details on using hashes here.

File details

Details for the file trulens_hotspots-2.2.2-py3-none-any.whl.

File metadata

Download URL: trulens_hotspots-2.2.2-py3-none-any.whl
Upload date: Aug 18, 2025
Size: 14.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.11.11

File hashes

Hashes for trulens_hotspots-2.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ba7c1bda7323019d666b22783f0edde49c43141a5b27aa92c65af873c7df88c9`
MD5	`f057fcace76b9885b453c1876ebe389e`
BLAKE2b-256	`d3d2bf1e474a6e60c55c67993b5482f57663f065302efc2d13af932e72faeedc`

See more details on using hashes here.

trulens-hotspots 2.2.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

TruLens Hotspots

How to run

As a command-line tool

Skipping columns

As a stand-alone library

As a part of the whole TruLens framework

As a part of a stand-alone Streamlit app

TruLens Hotspots output

Example

What's the use?

Frequently Asked Questions

History

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes