Detect flaky LLM eval cases across repeated runs. Pass-rate + standard-deviation per case, with per-case severity. Python port of @mukundakatta/eval-flake-detector.
eval-flake-detector
Detect flaky LLM eval cases across repeated runs. Computes per-case pass rate, standard deviation, and a flakiness severity ("stable" / "low" / "medium" / "high"). Pure stdlib, zero runtime dependencies.
Python port of @mukundakatta/eval-flake-detector. The JS-compatible shape is exposed as detect_eval_flakes so existing dashboards keep working.
Install
pip install eval-flake-detector
Usage
from eval_flake_detector import detect

# One inner list per case; entries are repeat results.
runs = [
    [True, True, True, True, True],           # case 0: stable pass
    [False, False, False, False, False],      # case 1: stable fail
    [True, False, True, False, True, False],  # case 2: flaky 50/50
]

report = detect(runs, flaky_low=0.1, flaky_high=0.9, min_runs=3)
report.stable  # False -- case 2 is flaky

for case in report.flakes:
    print(case.case_id, case.pass_rate, case.std_dev, case.severity)
# case-2 0.5 0.5 high
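For intuition, the 0.5 / 0.5 figures in the last comment are consistent with a population standard deviation over pass/fail results coded as 1 and 0. The sketch below is a standalone stdlib illustration, not the package's internals; `pass_rate_and_std` is a hypothetical helper name.

```python
from statistics import fmean, pstdev


def pass_rate_and_std(results):
    """Return (pass rate, population std dev) for one case's repeat results."""
    # Hypothetical helper for illustration only; True/False are coded as 1.0/0.0.
    values = [float(r) for r in results]
    return fmean(values), pstdev(values)


rate, std = pass_rate_and_std([True, False, True, False, True, False])
print(rate, std)  # 0.5 0.5
```

A case that always passes or always fails has a standard deviation of 0 under this definition, which is why it never shows up as flaky.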
Already have a flat list of result dicts (one row per repeat)? Pass that
directly -- it's auto-grouped by id:
runs = [
    {"id": "math-1", "passed": True},
    {"id": "math-1", "passed": False},
    {"id": "math-1", "passed": True},
    {"id": "lookup-2", "score": 0.95},
    {"id": "lookup-2", "score": 1.00},
]

detect(runs)
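The auto-grouping step can be pictured roughly as follows. This is a minimal sketch under assumptions, not the package's code: `group_by_id` is a hypothetical name, and it assumes each row carries either a boolean `passed` or a numeric `score`, as in the rows above.

```python
from collections import defaultdict


def group_by_id(rows):
    """Group flat result rows into {case id: [repeat results]}, keeping row order."""
    grouped = defaultdict(list)
    for row in rows:
        # Assumption: a row has either "passed" (bool) or "score" (number).
        value = row["passed"] if "passed" in row else row["score"]
        grouped[row["id"]].append(value)
    return dict(grouped)


rows = [
    {"id": "math-1", "passed": True},
    {"id": "math-1", "passed": False},
    {"id": "math-1", "passed": True},
    {"id": "lookup-2", "score": 0.95},
    {"id": "lookup-2", "score": 1.00},
]
print(group_by_id(rows))
# {'math-1': [True, False, True], 'lookup-2': [0.95, 1.0]}
```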
Severity bands
| Band | Meaning |
|---|---|
| stable | Pass rate is outside the ambiguous band -- consistently passes or consistently fails. |
| low | In band, std_dev < 0.30. Probably noise; revisit prompt. |
| medium | In band, 0.30 <= std_dev < 0.45. Real flakiness; investigate. |
| high | In band, std_dev >= 0.45. Behaves randomly; consider redesigning the eval. |
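Read as code, the bands amount to a small decision function. The sketch below is one possible reading, not the package's implementation: `severity` is a hypothetical name, the 0.1 and 0.9 defaults are borrowed from the usage example, and treating the band as inclusive at both ends is an assumption.

```python
def severity(pass_rate, std_dev, flaky_low=0.1, flaky_high=0.9):
    """Map a case's pass rate and std dev to a severity band."""
    # Assumption: the ambiguous band includes both edges; the package may differ there.
    if not (flaky_low <= pass_rate <= flaky_high):
        return "stable"
    if std_dev < 0.30:
        return "low"
    if std_dev < 0.45:
        return "medium"
    return "high"


print(severity(1.0, 0.0))  # stable
print(severity(0.5, 0.5))  # high
```

If the standard deviation is the population form, pure pass/fail results give sqrt(p * (1 - p)), which is about 0.3 or more for any pass rate p between 0.1 and 0.9. Under those assumptions, the low band would mostly show up for graded scores rather than boolean results.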
JS-compat output
from eval_flake_detector import detect_eval_flakes
# Same shape as the JS sibling: {"stable": bool, "flakes": [{"id", "count", "stddev"}]}
detect_eval_flakes(runs, max_stddev=0.2)
See the JS sibling's README for the original design notes.
Download files
File details
Details for the file eval_flake_detector-0.1.0.tar.gz.
File metadata
- Download URL: eval_flake_detector-0.1.0.tar.gz
- Upload date:
- Size: 7.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | be89c2e4e8e2ad7fb13983350ce88676c9c3fc6c0fa70605332a80f172d23aea |
| MD5 | c6fc658d5d0f274c4e3ce9296240173d |
| BLAKE2b-256 | 5f1c2d5dee57b4f6b5880fb4764f77932ced94d6a8ad2ee9c5c7f0af2c2a4ad3 |
File details
Details for the file eval_flake_detector-0.1.0-py3-none-any.whl.
File metadata
- Download URL: eval_flake_detector-0.1.0-py3-none-any.whl
- Upload date:
- Size: 7.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 41868159e4cfb2bcba474726b7b24f958e656d38be9330d00e64245c779b844a |
| MD5 | 58953b7dd86b047039682c4f2471647f |
| BLAKE2b-256 | 355e036450eaaac30603382a7703c0f5a584bf951bf689c84ebaf0f062d7810b |