eval-flake-detector

License: MIT

Detect flaky LLM eval cases across repeated runs. Computes per-case pass rate, standard deviation, and a flakiness severity ("stable" / "low" / "medium" / "high"). Pure stdlib, zero runtime dependencies.

Python port of @mukundakatta/eval-flake-detector. The JS-compatible shape is exposed as detect_eval_flakes so existing dashboards keep working.

Install

pip install eval-flake-detector

Usage

from eval_flake_detector import detect

# One inner list per case; entries are repeat results.
runs = [
    [True, True, True, True, True],            # case 0: stable pass
    [False, False, False, False, False],       # case 1: stable fail
    [True, False, True, False, True, False],   # case 2: flaky 50/50
]

report = detect(runs, flaky_low=0.1, flaky_high=0.9, min_runs=3)

report.stable          # False -- case 2 is flaky
for case in report.flakes:
    print(case.case_id, case.pass_rate, case.std_dev, case.severity)
# case-2  0.5  0.5  high
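
The figures printed for case 2 can be reproduced with the standard library alone. The sketch below is not the package's implementation: it assumes booleans are scored as 1.0 / 0.0 and that std_dev is the population standard deviation, which is the reading that matches the 0.5 shown above. The helper name case_stats is hypothetical.

```python
from statistics import fmean, pstdev


def case_stats(results):
    """Return (pass_rate, std_dev) for one case's repeat results.

    Assumption: True counts as 1.0, False as 0.0, and the spread is the
    population standard deviation (not the sample one).
    """
    scores = [1.0 if r else 0.0 for r in results]
    return fmean(scores), pstdev(scores)


pass_rate, std_dev = case_stats([True, False, True, False, True, False])
print(pass_rate, std_dev)  # 0.5 0.5
```

With the sample standard deviation (statistics.stdev) the same inputs would give roughly 0.548 rather than 0.5, so the choice of estimator matters when comparing against the severity thresholds.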

If you already have a flat list of result dicts (one row per repeat), pass it directly; rows are grouped by id automatically. As the example shows, a row can carry either a boolean passed or a numeric score:

runs = [
    {"id": "math-1",   "passed": True},
    {"id": "math-1",   "passed": False},
    {"id": "math-1",   "passed": True},
    {"id": "lookup-2", "score":  0.95},
    {"id": "lookup-2", "score":  1.00},
]
detect(runs)
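
For readers who want to see what grouping by id amounts to, here is a minimal sketch. It is an illustration only, not the library's code: the helper name group_by_id is hypothetical, and the rule "use passed when present, otherwise score" is an assumption inferred from the example rows.

```python
from collections import defaultdict


def group_by_id(rows):
    """Group flat result rows into {id: [value, ...]}, keeping input order."""
    grouped = defaultdict(list)
    for row in rows:
        # Assumption: prefer a boolean "passed" when present, else a numeric "score".
        value = row["passed"] if "passed" in row else row["score"]
        grouped[row["id"]].append(value)
    return dict(grouped)


rows = [
    {"id": "math-1", "passed": True},
    {"id": "math-1", "passed": False},
    {"id": "lookup-2", "score": 0.95},
]
print(group_by_id(rows))
# {'math-1': [True, False], 'lookup-2': [0.95]}
```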

Severity bands

| Band | Meaning |
| --- | --- |
| stable | Pass rate is outside the ambiguous band -- consistently passes or consistently fails. |
| low | In band, std_dev < 0.30. Probably noise; revisit prompt. |
| medium | In band, 0.30 <= std_dev < 0.45. Real flakiness; investigate. |
| high | In band, std_dev >= 0.45. Behaves randomly; consider redesigning the eval. |
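
The bands can be read as a small decision function. The sketch below mirrors the table under stated assumptions: the function name classify_severity is hypothetical, the 0.1 / 0.9 values are taken from the usage example rather than from confirmed defaults, and the band is treated as inclusive at both ends.

```python
def classify_severity(pass_rate, std_dev, flaky_low=0.1, flaky_high=0.9):
    """Map a case's pass rate and std dev to a severity band."""
    # Assumption: the ambiguous band includes both endpoints.
    if not (flaky_low <= pass_rate <= flaky_high):
        return "stable"
    if std_dev < 0.30:
        return "low"
    if std_dev < 0.45:
        return "medium"
    return "high"


print(classify_severity(1.0, 0.0))   # stable
print(classify_severity(0.5, 0.5))   # high
print(classify_severity(0.5, 0.35))  # medium
```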

JS-compat output

from eval_flake_detector import detect_eval_flakes

# Same shape as the JS sibling: {"stable": bool, "flakes": [{"id", "count", "stddev"}]}
detect_eval_flakes(runs, max_stddev=0.2)
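
To make the JS-compatible shape concrete, the sketch below builds a dict with the same keys from already-grouped results. It is a guess at the semantics, not the package's implementation: the helper name js_shape is hypothetical, and it assumes a case is reported as a flake when its population standard deviation exceeds max_stddev and that count is the number of repeats.

```python
from statistics import pstdev


def js_shape(grouped, max_stddev=0.2):
    """Build {"stable": bool, "flakes": [...]} from {id: [results, ...]}."""
    flakes = []
    for case_id, values in grouped.items():
        # Booleans become 1.0 / 0.0; numeric scores pass through unchanged.
        sd = pstdev([float(v) for v in values])
        # Assumption: a case is a flake when its spread exceeds max_stddev.
        if sd > max_stddev:
            flakes.append({"id": case_id, "count": len(values), "stddev": sd})
    return {"stable": not flakes, "flakes": flakes}


result = js_shape({"math-1": [True, False, True], "lookup-2": [0.95, 1.0]})
print(result["stable"])                   # False
print([f["id"] for f in result["flakes"]])  # ['math-1']
```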

See the JS sibling's README for the original design notes.
