Pytest plugin for running non-deterministic LLM tests with automatic retry and beautiful reports

llm-flaky

Pytest plugin for running non-deterministic LLM tests.

LLM tests are inherently non-deterministic due to the probabilistic nature of language models. This plugin handles flakiness by automatically retrying tests and requiring an 80% pass rate (4/5 by default).
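
To get a feel for what the 4-of-5 rule means in practice, the following sketch computes how likely a test is to pass overall, given the pass rate of a single run. It assumes that runs are independent and share the same pass probability, which is a simplification, and the function name retry_pass_probability is made up for this illustration; it is not part of the plugin.

```python
from math import comb


def retry_pass_probability(p: float, max_runs: int = 5, min_passes: int = 4) -> float:
    """Probability that at least min_passes of max_runs runs pass.

    Assumes each run passes independently with probability p
    (an illustrative simplification, not something the plugin measures).
    """
    return sum(
        comb(max_runs, k) * p**k * (1 - p) ** (max_runs - k)
        for k in range(min_passes, max_runs + 1)
    )


if __name__ == "__main__":
    for p in (0.7, 0.8, 0.9, 0.95):
        overall = retry_pass_probability(p)
        print(f"per-run pass rate {p:.2f} -> overall pass probability {overall:.3f}")
```

Under these assumptions, a test whose single runs pass 90% of the time clears the default threshold about 92% of the time.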

Features

  • Auto-marking: Automatically applies @pytest.mark.flaky to tests with @pytest.mark.llm
  • 80% pass rate by default: A test passes if 4 out of 5 runs succeed (configurable)
  • Beautiful reports: Replaces standard flaky output with a formatted table
  • Environment variable support: Set FLAKY_MAX_RUNS to control the maximum number of runs per test
  • pytest-xdist compatible: Works correctly with parallel test execution

Installation

pip install llm-flaky

Usage

Mark your LLM tests with @pytest.mark.llm:

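import pytest

# call_llm is a placeholder for your own async LLM client helper.
# Async tests also need an async plugin such as pytest-asyncio.
@pytest.mark.llm
async def test_llm_response():
    response = await call_llm("What is 2+2?")
    assert "4" in response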

The plugin automatically applies flaky retry logic. No additional code needed!

Example output

══════════════════════════════════════════════════════════════════════════════
 LLM TESTS SUMMARY
══════════════════════════════════════════════════════════════════════════════

 Test                                                     Passed       Result
 ────────────────────────────────────────────────────────────────────────────
 test_llm_response_quality                                 4 / 4     ✓ PASSED
 test_llm_context_handling[short]                          4 / 4     ✓ PASSED
 test_llm_context_handling[long]                           3 / 4     ✓ PASSED

 ✗ FAILED TESTS:
 ────────────────────────────────────────────────────────────────────────────
 test_llm_edge_case                                        2 / 4     ✗ FAILED
 ────────────────────────────────────────────────────────────────────────────
 ⚠ Total                                                   3 / 4       75.0%
══════════════════════════════════════════════════════════════════════════════
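
The numbers in the table can be reproduced with a few lines of arithmetic. The sketch below assumes the example was run with max_runs=4 and min_passes=3, and that the Total row counts tests that met the threshold out of all tests; both are inferred from the output above rather than confirmed by the plugin's source. The function name summarize is hypothetical.

```python
def summarize(pass_counts, min_passes):
    """Split tests into passed/failed and compute the overall pass percentage.

    pass_counts maps a test name to the number of runs that succeeded.
    """
    passed = [name for name, count in pass_counts.items() if count >= min_passes]
    failed = [name for name in pass_counts if name not in passed]
    total = len(pass_counts)
    rate = 100.0 * len(passed) / total if total else 0.0
    return passed, failed, rate


if __name__ == "__main__":
    # Pass counts taken from the example report above.
    counts = {
        "test_llm_response_quality": 4,
        "test_llm_context_handling[short]": 4,
        "test_llm_context_handling[long]": 3,
        "test_llm_edge_case": 2,
    }
    passed, failed, rate = summarize(counts, min_passes=3)
    print(f"{len(passed)} / {len(counts)} passed ({rate:.1f}%), failed: {failed}")
```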

Configuration

Environment variables

FLAKY_MAX_RUNS=3 pytest  # Run each test up to 3 times (min_passes=2)

Command line options

pytest --llm-flaky-max-runs=5           # Max runs for LLM tests (default: 5)
pytest --llm-flaky-min-passes=4         # Min passes required (default: max_runs - 1)
pytest --llm-flaky-exclude-marker=skip  # Tests with this marker get no flaky retries
pytest --llm-flaky-title="My Title"     # Custom report title
pytest --no-llm-flaky-report            # Disable beautiful report

pytest.ini options

[pytest]
llm_flaky_max_runs = 5
llm_flaky_min_passes = 4
llm_flaky_exclude_marker = langsmith_dataset
llm_flaky_title = LLM TESTS SUMMARY

Priority

Configuration is read in this order (highest priority first):

  1. FLAKY_MAX_RUNS environment variable
  2. Command line options (--llm-flaky-*)
  3. pytest.ini options (llm_flaky_*)
  4. Defaults (max_runs=5, min_passes=4)
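
The lookup order above can be pictured as a simple fallback chain. This is a minimal sketch of that idea, not the plugin's actual code: the function names, the parameters cli_value, ini_value and explicit, and the clamp to at least one pass are all assumptions made for illustration.

```python
import os

DEFAULT_MAX_RUNS = 5


def resolve_max_runs(env=None, cli_value=None, ini_value=None):
    """Pick max_runs from the highest-priority source that is set."""
    env = os.environ if env is None else env
    env_value = env.get("FLAKY_MAX_RUNS")
    if env_value:                      # 1. environment variable
        return int(env_value)
    if cli_value is not None:          # 2. command line option
        return cli_value
    if ini_value is not None:          # 3. pytest.ini option
        return ini_value
    return DEFAULT_MAX_RUNS            # 4. default


def resolve_min_passes(max_runs, explicit=None):
    """Use the explicit value if given, otherwise max_runs - 1."""
    if explicit is not None:
        return explicit
    # Clamping to 1 is an assumption so that max_runs=1 still needs one pass.
    return max(1, max_runs - 1)


if __name__ == "__main__":
    runs = resolve_max_runs(env={"FLAKY_MAX_RUNS": "3"}, cli_value=5)
    print(runs, resolve_min_passes(runs))
```

With FLAKY_MAX_RUNS=3 and no explicit min_passes, this yields max_runs=3 and min_passes=2, matching the environment variable example earlier in this README.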

How it works

  1. Collection phase: The plugin finds all tests marked with @pytest.mark.llm
  2. Auto-marking: It applies @pytest.mark.flaky with the configured values (by default max_runs=5, min_passes=4)
  3. Execution: The flaky plugin handles the retry logic
  4. Reporting: A formatted summary table replaces the standard flaky output
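
The collection and auto-marking steps could look roughly like the conftest.py-style hook below. It is a sketch of the general technique using pytest's standard pytest_collection_modifyitems hook, not the plugin's real implementation. The constants are hard-coded here for brevity, and skipping tests that already carry a flaky marker is an assumption.

```python
import pytest

# Hard-coded for illustration; the real plugin reads these from its configuration.
MAX_RUNS = 5
MIN_PASSES = 4
EXCLUDE_MARKER = "langsmith_dataset"


def pytest_collection_modifyitems(config, items):
    """Add a flaky marker to every collected test that is marked with llm."""
    for item in items:
        if item.get_closest_marker("llm") is None:
            continue  # not an LLM test
        if item.get_closest_marker(EXCLUDE_MARKER) is not None:
            continue  # excluded from flaky retries
        if item.get_closest_marker("flaky") is not None:
            continue  # assumption: keep an explicit flaky marker untouched
        item.add_marker(pytest.mark.flaky(max_runs=MAX_RUNS, min_passes=MIN_PASSES))
```

Because the marker is attached at collection time, the test functions themselves need nothing beyond @pytest.mark.llm.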

Excluding tests

Tests with the @langsmith_dataset marker are excluded by default, because they use LangSmith's built-in evaluation:

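import pytest

# langsmith_dataset is not defined in this snippet; import it from
# wherever your project defines it.
@pytest.mark.llm
@langsmith_dataset("my_dataset.yaml")
async def test_with_langsmith():
    # This test gets no flaky retries: LangSmith handles evaluation
    pass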

Requirements

  • Python >= 3.9
  • pytest >= 7.0.0
  • flaky >= 3.7.0

License

MIT
