Automatically discover where and why your LLM is failing

These details have not been verified by PyPI

Project description

faultmap

Automatically discover where and why your LLM is failing.

The Problem

Your eval says 85% accuracy. Users are complaining. Where are the failures?

Aggregate metrics hide critical patterns. A model can score well overall while catastrophically failing on specific input types — legal questions, billing disputes, technical setup — that matter most to your users.

faultmap is the diagnostic layer that answers three questions:

"Where exactly is my model failing?" — Automated discovery of input slices where failure rate is statistically significantly higher than baseline.
"Can I trust my test suite?" — Embedding-space coverage analysis that finds semantic blind spots your test suite has never touched.
"Which model is better, and where?" — Paired statistical comparison of two models on the same prompt set, with per-slice winners and effect sizes.

See It In Action

faultmap 3D embedding visualization

▶ Interactive visualization demo →

Each red cluster is a Failure Slice — a group of semantically similar prompts where your LLM fails disproportionately often. UMAP reduces 1 536-D prompt embeddings to 3D so you can see exactly where reliability breaks down, then faultmap runs statistical hypothesis testing to tell you which clusters are significant and by how much.

How It Works

Prompts → Embed → Cluster → Statistical Test → BH Correction → Name → Report

faultmap embeds your prompts into a semantic space, clusters them to find coherent input groups, runs hypothesis tests to identify clusters with elevated failure rates, applies Benjamini-Hochberg correction to control false discovery rate, and uses an LLM to generate human-readable names for each failure slice.

Failure Slice Detection

Embed — your prompts are embedded into a high-dimensional vector space using a local or API-based embedding model
Cluster — HDBSCAN (default) or agglomerative clustering groups semantically similar prompts
Test — chi-squared or Fisher exact test compares each cluster's failure rate against the baseline
Correct — Benjamini-Hochberg FDR correction filters out false positives across all clusters
Name — an LLM automatically generates a human-readable name and description for each significant cluster

Coverage Auditing

Embed — both your test prompts and production prompts are embedded
Distance — nearest-neighbor search finds production prompts that have no nearby test prompt
Cluster — uncovered production prompts are grouped into semantically coherent gap clusters
Name — the LLM names each gap so you know exactly what your test suite is missing

Installation

pip install faultmap                # Core (uses API embeddings via litellm)
pip install faultmap[local]         # + sentence-transformers for local embeddings
pip install faultmap[rich]          # + rich for pretty terminal output
pip install faultmap[all]           # Everything

Notebooks

Notebook	Description
`notebooks/tutorial.ipynb`	All four usage modes with a mock path (no API key needed)
`notebooks/visualization_demo.ipynb`	Interactive 3D UMAP scatter of failure slices — no API key needed

Quick Start

Mode 1: Bring Your Own Scores

Use scores from DeepEval, Ragas, human reviewers, or any evaluation framework.

from faultmap import SliceAnalyzer

analyzer = SliceAnalyzer(model="gpt-4o-mini")

report = analyzer.analyze(
    prompts=prompts,
    responses=responses,
    scores=scores,  # list of floats in [0, 1], lower = worse
)
print(report)

Mode 2: Reference-Based Scoring

Provide ground-truth answers. faultmap scores responses via cosine similarity between response and reference embeddings.

report = analyzer.analyze(
    prompts=prompts,
    responses=responses,
    references=ground_truth_answers,
)

Mode 3: Reference-Free (Fully Autonomous)

No ground truth needed. faultmap estimates reliability via semantic entropy and self-consistency by sampling multiple responses from the LLM and measuring how consistent they are.

report = analyzer.analyze(
    prompts=prompts,
    responses=responses,
    # No scores, no references — faultmap scores autonomously
)

Coverage Auditing

Find what your test suite is missing by comparing it against production traffic.

coverage = analyzer.audit_coverage(
    test_prompts=my_test_set,
    production_prompts=my_production_logs,
)
print(coverage)

Model Comparison

Compare two models on the same prompt set to find where each model wins.

comparison = analyzer.compare_models(
    prompts=prompts,
    responses_a=gpt4_responses,
    responses_b=gpt4mini_responses,
    scores_a=gpt4_scores,        # Mode 1: pre-computed scores
    scores_b=gpt4mini_scores,
    model_a_name="GPT-4o",
    model_b_name="GPT-4o-mini",
)
print(comparison)

Reading Results

Analysis Report

# Top-level stats
print(report.total_prompts)          # 500
print(report.total_failures)         # 87
print(report.baseline_failure_rate)  # 0.174
print(report.num_significant)        # 2

# Inspect each failure slice
for s in report.slices:
    print(s.name)             # "Legal compliance questions"
    print(s.description)      # "Prompts asking about regulatory requirements..."
    print(s.size)             # 45
    print(s.failure_rate)     # 0.72
    print(s.baseline_rate)    # 0.17
    print(s.effect_size)      # 4.2   (how many times worse than baseline)
    print(s.adjusted_p_value) # 0.0003
    print(s.test_used)        # "chi2"

    # Actionable insights (why it fails + how to fix it)
    print(s.root_cause)             # "Model lacks context for regulatory terminology..."
    print(s.suggested_remediation)  # "Add: 'You are a compliance expert...'"

    # Recover the original data
    print(s.sample_indices)   # [12, 45, 67, ...]  — indices into your prompts list
    print(s.examples)         # [{"prompt": "...", "response": "...", "score": 0.1}, ...]
    print(s.representative_prompts)  # ["How do I comply with GDPR?", ...]

# Export to dict/JSON
import json
json.dumps(report.to_dict())

Coverage Report

print(coverage.overall_coverage_score)  # 0.82 (82% of prod prompts are covered)
print(coverage.num_gaps)                # 3
print(coverage.metadata["num_uncovered_total"])  # uncovered prompts, clustered or not
print(coverage.metadata["unclustered_prompt_indices"])  # uncovered prompts below gap threshold

for gap in coverage.gaps:
    print(gap.name)                    # "Two-factor authentication setup"
    print(gap.description)             # "Users asking about 2FA configuration..."
    print(gap.size)                    # 28 (production prompts with no test coverage)
    print(gap.mean_distance)           # 1.73 (avg distance to nearest test prompt)
    print(gap.representative_prompts)  # ["How do I enable 2FA?", ...]
    print(gap.prompt_indices)          # indices into your production_prompts list

API Reference

`SliceAnalyzer`

analyzer = SliceAnalyzer(
    model="gpt-4o-mini",           # litellm model string for LLM calls (naming + Mode 3)
    embedding_model="text-embedding-3-small",  # default API embedding model
    significance_level=0.05,       # alpha for Benjamini-Hochberg FDR correction
    min_slice_size=10,             # minimum cluster size (smaller clusters ignored)
    failure_threshold=0.5,         # score below this is a failure (for binary classification)
    n_samples=8,                   # Mode 3 only: responses sampled per prompt
    clustering_method="hdbscan",   # "hdbscan" (auto k) or "agglomerative" (grid search)
    max_concurrent_requests=50,    # max parallel LLM API calls
    temperature=1.0,               # Mode 3 only: sampling temperature
    consistency_threshold=0.8,     # Mode 3 only: cosine sim threshold for self-consistency
)

Parameter	Default	Description
`model`	`"gpt-4o-mini"`	litellm model string. Supports 100+ providers: `"anthropic/claude-3-haiku"`, `"ollama/mistral"`, etc.
`embedding_model`	`"text-embedding-3-small"`	API model string by default, so `pip install faultmap` works without extras. Local sentence-transformers models require `pip install faultmap[local]`.
`significance_level`	`0.05`	FDR alpha. Slices with `adjusted_p_value < significance_level` are reported.
`min_slice_size`	`10`	Minimum prompts per cluster. Smaller clusters are discarded before testing.
`failure_threshold`	`0.5`	Score cutoff. A prompt with `score < failure_threshold` is counted as a failure.
`n_samples`	`8`	Mode 3 only: number of additional responses sampled per prompt. Higher = more accurate entropy estimate.
`clustering_method`	`"hdbscan"`	`"hdbscan"` auto-discovers the number of clusters. `"agglomerative"` performs grid search over `[5, 10, 15, 20, 25, 30]` clusters.
`max_concurrent_requests`	`50`	Semaphore-limited concurrency for LLM calls. Reduce if hitting rate limits.
`temperature`	`1.0`	Mode 3 only: higher temperature increases response diversity for entropy estimation.
`consistency_threshold`	`0.8`	Mode 3 only: cosine similarity cutoff for self-consistency scoring.

`analyze()`

report: AnalysisReport = analyzer.analyze(
    prompts: list[str],
    responses: list[str],
    scores: list[float] | None = None,      # Mode 1: pre-computed scores in [0, 1]
    references: list[str] | None = None,    # Mode 2: ground-truth answers
    # Neither → Mode 3 (entropy, autonomous)
    # Both → Mode 1 wins, Mode 2 is ignored (with warning)
)

`audit_coverage()`

report: CoverageReport = analyzer.audit_coverage(
    test_prompts: list[str],
    production_prompts: list[str],
    distance_threshold: float | None = None,  # auto-computed if None
    min_gap_size: int = 5,                    # minimum prompts to form a gap
)

`compare_models()`

report: ComparisonReport = analyzer.compare_models(
    prompts: list[str],
    responses_a: list[str],
    responses_b: list[str],
    scores_a: list[float] | None = None,     # Mode 1: pre-computed scores for A
    scores_b: list[float] | None = None,     # Mode 1: pre-computed scores for B
    references: list[str] | None = None,     # Mode 2: shared ground-truth answers
    # Neither scores nor references → Mode 3 (entropy scoring for each model)
    model_a_name: str = "Model A",
    model_b_name: str = "Model B",
)

Same mode detection as analyze():

scores_a + scores_b provided → Mode 1 (must both be present or neither)
references provided (no scores) → Mode 2
Neither → Mode 3 (2× API calls — each model scored independently)

Reading `ComparisonReport`

# Global headline
print(comparison.global_winner)         # "a", "b", or "tie"
print(comparison.global_advantage_rate) # 0.78 → "78% of disagreements favor A"
print(comparison.global_p_value)        # 0.0001

# Per-slice breakdown
for s in comparison.slices:
    print(s.name)               # "Legal Compliance"
    print(s.winner)             # "a" — GPT-4o wins this slice
    print(s.failure_rate_a)     # 0.044
    print(s.failure_rate_b)     # 0.422
    print(s.advantage_rate)     # 0.947 — 94.7% of disagreements favor A here
    print(s.discordant_a_wins)  # 18 — A passes, B fails
    print(s.discordant_b_wins)  # 1  — A fails, B passes
    print(s.adjusted_p_value)   # 0.0002

# Export
import json
json.dumps(comparison.to_dict())

Scoring Modes In Depth

Mode 1 — Precomputed Scores

Pass any float list where higher = better quality. The failure_threshold parameter (default 0.5) decides the binary split.

Works with scores from any eval framework:

DeepEval: metric.score
Ragas: result['answer_relevancy']
Human annotation: 1.0 (good) / 0.0 (bad)
LLM-as-judge: normalized score in [0, 1]

Mode 2 — Reference-Based

Provide references (ground-truth answers). faultmap:

Embeds both responses and references using the configured embedding model
Computes cosine similarity between each response and its reference
Uses similarity as the quality score

Best for: RAG evaluation, Q&A benchmarks, translation quality.

Mode 3 — Entropy / Autonomous

No external labels needed. faultmap:

Samples n_samples additional responses per prompt at temperature=1.0
Embeds all samples + original response
Measures semantic entropy: how spread out are the responses in embedding space? (high entropy = inconsistent = uncertain)
Measures self-consistency: what fraction of samples are cosine-similar to the original?
Combines: score = 0.5 * (1 - normalized_entropy) + 0.5 * self_consistency

High entropy clusters reveal where the model is uncertain or hallucinating. Best for: discovering unknown unknowns when you have no ground truth.

Embedding Models

Local (no API calls, requires `pip install faultmap[local]`)

# Sentence-transformers models — auto-detected by prefix
SliceAnalyzer(embedding_model="all-MiniLM-L6-v2")    # fast, 384-dim
SliceAnalyzer(embedding_model="all-mpnet-base-v2")   # accurate, 768-dim
SliceAnalyzer(embedding_model="paraphrase-multilingual-MiniLM-L12-v2")  # multilingual

API (via litellm)

SliceAnalyzer()                                          # defaults to text-embedding-3-small
SliceAnalyzer(embedding_model="text-embedding-3-small")   # OpenAI
SliceAnalyzer(embedding_model="text-embedding-3-large")   # OpenAI
SliceAnalyzer(embedding_model="voyage/voyage-2")          # Voyage AI

For asymmetric embedding APIs, you can pass role-specific request options:

SliceAnalyzer(
    embedding_model="nvidia_nim/nvidia/nv-embedqa-e5-v5",
    embedding_usage_kwargs={"query": {"input_type": "query"}},
)

API embeddings also truncate long texts to 2000 characters by default to avoid strict provider token limits. Set embedding_max_text_chars=None to disable it.

LLM Providers

faultmap uses litellm internally, so any provider litellm supports works:

SliceAnalyzer(model="gpt-4o-mini")                      # OpenAI
SliceAnalyzer(model="anthropic/claude-3-haiku-20240307") # Anthropic
SliceAnalyzer(model="ollama/mistral")                    # Local via Ollama
SliceAnalyzer(model="groq/llama3-8b-8192")              # Groq
SliceAnalyzer(model="azure/gpt-4o-mini")                # Azure OpenAI

Statistical Details

faultmap uses one-sided hypothesis testing — it only flags clusters that fail more than the baseline, never less.

Test selection (per cluster):

If expected cell count < 5: Fisher exact test (exact, handles small samples)
Otherwise: chi-squared with Yates continuity correction

Multiple comparison correction: Benjamini-Hochberg FDR at the configured significance_level. This controls the false discovery rate — the fraction of reported slices that are false positives — which is more appropriate than Bonferroni for exploration.

No external statistics libraries: chi-squared p-values are computed via math.erfc(sqrt(chi2/2)) (exact for df=1), Fisher exact via math.lgamma to avoid overflow. Zero scipy/statsmodels dependency.

Model Comparison Statistics

compare_models() uses McNemar's test — the standard test for paired binary data.

Each prompt produces two binary outcomes (A pass/fail, B pass/fail). Only discordant pairs carry information:

b = A passes, B fails ("A wins")
c = A fails, B passes ("B wins")

Test selection:

b + c >= 25 → chi-squared with continuity correction: (|b-c|-1)² / (b+c)
b + c < 25 → exact binomial two-sided test via math.lgamma
b + c == 0 → p=1.0, winner="tie" (models agree everywhere)

Effect size: advantage_rate = b / (b+c) — the proportion of disagreements where A wins. Intuitive and actionable: "78% of the time the models disagree, GPT-4o gives the better answer."

Winner determination: adjusted_p < significance_level AND advantage_rate > 0.5 → A wins; advantage_rate < 0.5 → B wins; otherwise tie.

Multiple comparisons corrected with Benjamini-Hochberg FDR across all per-slice tests.

Logging

import logging
logging.getLogger("faultmap").setLevel(logging.DEBUG)

faultmap logs progress at each pipeline step: scoring mode, failure counts, cluster sizes, significance results.

Contributing

git clone https://github.com/gabonavarroo/faultmap.git
cd faultmap
pip install -e ".[dev]"
pytest tests/ -v
ruff check .

All tests use mocks — no API keys needed to run the test suite.

Changelog

See CHANGELOG.md for release history.

Examples

See the examples/ directory:

example_mode1_custom_scores.py — Mode 1 with pre-computed scores
example_mode2_reference_based.py — Mode 2 with ground-truth references
example_mode3_reference_free.py — Mode 3 autonomous entropy scoring
example_coverage_audit.py — Coverage gap detection
example_model_comparison.py — Model comparison with per-slice winners

License

Apache 2.0

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.5.0

Apr 5, 2026

0.4.0

Apr 4, 2026

0.3.0

Apr 3, 2026

0.2.0

Apr 2, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

faultmap-0.5.0.tar.gz (1.9 MB view details)

Uploaded Apr 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

faultmap-0.5.0-py3-none-any.whl (53.1 kB view details)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file faultmap-0.5.0.tar.gz.

File metadata

Download URL: faultmap-0.5.0.tar.gz
Upload date: Apr 5, 2026
Size: 1.9 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for faultmap-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`2bb18a2a6ff4e0e1aa593ac244a9ce5d0a0f0abe6ccba646c1d76365ea3ea588`
MD5	`16ca524bddd3028831610daaa58cb9a0`
BLAKE2b-256	`41867cd25e7d83cbea6831690e6b26d0cffe9ac7dcfc84d07762b5aba3012d71`

See more details on using hashes here.

Provenance

The following attestation bundles were made for faultmap-0.5.0.tar.gz:

Publisher: publish.yml on gabonavarroo/faultmap

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: faultmap-0.5.0.tar.gz
- Subject digest: 2bb18a2a6ff4e0e1aa593ac244a9ce5d0a0f0abe6ccba646c1d76365ea3ea588
- Sigstore transparency entry: 1237865673
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: gabonavarroo/faultmap@591dcc520ae6b31f5aa88d60df3be4b04f96157b
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/gabonavarroo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@591dcc520ae6b31f5aa88d60df3be4b04f96157b
- Trigger Event: push

File details

Details for the file faultmap-0.5.0-py3-none-any.whl.

File metadata

Download URL: faultmap-0.5.0-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 53.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for faultmap-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6452b1bc1e7a2dcbb632d6cf5ef1ad56bfbc7298532b5e221248e39f6b99d489`
MD5	`1a9c712f4c778eae89dd72fd605b2b25`
BLAKE2b-256	`ad406eeefc92dc33e4d5a4f6d9866f713a8fb8a1e702bd1ec38db408a7e1e7ac`

See more details on using hashes here.

Provenance

The following attestation bundles were made for faultmap-0.5.0-py3-none-any.whl:

Publisher: publish.yml on gabonavarroo/faultmap

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: faultmap-0.5.0-py3-none-any.whl
- Subject digest: 6452b1bc1e7a2dcbb632d6cf5ef1ad56bfbc7298532b5e221248e39f6b99d489
- Sigstore transparency entry: 1237865687
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: gabonavarroo/faultmap@591dcc520ae6b31f5aa88d60df3be4b04f96157b
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/gabonavarroo
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@591dcc520ae6b31f5aa88d60df3be4b04f96157b
- Trigger Event: push

faultmap 0.5.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

faultmap

The Problem

See It In Action

How It Works

Failure Slice Detection

Coverage Auditing

Installation

Notebooks

Quick Start

Mode 1: Bring Your Own Scores

Mode 2: Reference-Based Scoring

Mode 3: Reference-Free (Fully Autonomous)

Coverage Auditing

Model Comparison

Reading Results

Analysis Report

Coverage Report

API Reference

SliceAnalyzer

analyze()

audit_coverage()

compare_models()

Reading ComparisonReport

Scoring Modes In Depth

Mode 1 — Precomputed Scores

Mode 2 — Reference-Based

Mode 3 — Entropy / Autonomous

Embedding Models

Local (no API calls, requires pip install faultmap[local])

API (via litellm)

LLM Providers

Statistical Details

Model Comparison Statistics

Logging

Contributing

Changelog

Examples

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

`SliceAnalyzer`

`analyze()`

`audit_coverage()`

`compare_models()`

Reading `ComparisonReport`

Local (no API calls, requires `pip install faultmap[local]`)