A pytest plugin to evaluate/benchmark LLM prompts

pytest-llmeval

Features

  • Simple interface: Just mark which tests are LLM evals and store the results
  • Evaluation metrics: Get comprehensive classification metrics including precision, recall, and F1 scores
  • Grouped evaluations: Compare how different prompts or models perform across your test cases
  • File export: Save evaluation reports to file for monitoring performance changes over time
  • Custom analysis function: Write your own analysis function if you prefer
  • Pytest integration: Evaluations fit right in with your project's other tests

Usage

See full usage examples in examples/.

The main interface for this plugin is the @pytest.mark.llmeval() decorator, which injects an llmeval_result parameter into your test function.

Basic Usage

You can run the same code across multiple test cases by using pytest's parametrize functionality.

import pytest

TEST_CASES = [
    {"input": "I need to debug this Python code", "expected": True},
    {"input": "The cat jumped over the lazy dog", "expected": False},
    {"input": "My monitor keeps flickering", "expected": True},
]

@pytest.mark.llmeval()
@pytest.mark.parametrize("test_case", TEST_CASES)
def test_computer_related(llmeval_result, test_case):

    # Run your llm code that returns a result for this test case
    result = llm_is_computer_related(test_case["input"])

    # Store the details on `llmeval_result`
    llmeval_result.set_result(
        input_data=test_case["input"],
        expected=test_case["expected"],
        actual=result,
    )

    # `assert` whether the actual result was the expected result
    assert llmeval_result.is_correct()

Run the tests as usual (with uv run pytest or similar). When they complete, a classification report is printed to stdout in a format like:

# LLM Eval: test_computer_related

## Group: overall
              precision    recall  f1-score   support

       False       0.00      0.00      0.00         1
        True       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.33      0.50      0.40         3
weighted avg       0.44      0.67      0.53         3
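
The examples in this README call llm_is_computer_related(), which is left for you to supply. To try the plugin without network access or API keys, a deterministic stand-in along these lines could take its place. The keyword list and the optional prompt_template argument are assumptions made for illustration; neither is part of pytest-llmeval.

```python
# Hypothetical stand-in for the LLM call used in the examples.
# It is NOT part of pytest-llmeval; swap in your real model call.

COMPUTER_KEYWORDS = ("python", "code", "debug", "monitor", "keyboard", "software")


def llm_is_computer_related(text, prompt_template=None):
    """Return True if `text` looks computer related.

    `prompt_template` is accepted so the same stub can be used when
    comparing prompts, but this simple keyword matcher ignores it.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in COMPUTER_KEYWORDS)
```

Because the stub is deterministic, it is only useful for checking that the test wiring works; the metrics become meaningful once a real model call replaces it.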

Comparing across variables like different prompts or models

You can compare different prompts or other variables by specifying the group= parameter of llmeval_result.set_result():

PROMPT_TEMPLATES = [
    "Is this computer related? Say True or False",
    "Say True or False: Is this computer related?",
]

TEST_CASES = [
    {"input": "I need to debug this Python code", "expected": True},
    {"input": "The cat jumped over the lazy dog", "expected": False},
    {"input": "My monitor keeps flickering", "expected": True},
]

@pytest.mark.llmeval()
@pytest.mark.parametrize("prompt_template", PROMPT_TEMPLATES)
@pytest.mark.parametrize("test_case", TEST_CASES)
def test_prompts(llmeval_result, prompt_template, test_case):
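    # Pass the prompt under test to your own LLM helper so each group
    # actually exercises a different prompt
    result = llm_is_computer_related(test_case["input"], prompt_template)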

    llmeval_result.set_result(
        input_data=test_case["input"],
        expected=test_case["expected"],
        actual=result,
        group=prompt_template,
    )
    assert llmeval_result.is_correct()

The report then contains one section per group:

# LLM Eval: test_prompts

## Group: Is this computer related? Say True or False
              precision    recall  f1-score   support

       False       0.00      0.00      0.00         1
        True       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.33      0.50      0.40         3
weighted avg       0.44      0.67      0.53         3


## Group: Say True or False: Is this computer related?
              precision    recall  f1-score   support

       False       0.33      1.00      0.50         1
        True       0.00      0.00      0.00         2

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3
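
The columns in these reports follow the usual definitions of precision, recall, and F1. The sketch below recomputes them in plain Python for the first group above, assuming the model answered True for all three test cases (an assumption inferred from the numbers, not something the report states). The helper name per_label_metrics is hypothetical, and the plugin's own implementation may differ.

```python
def per_label_metrics(expected, actual):
    """Compute precision, recall, F1, and support for each label."""
    metrics = {}
    for label in sorted(set(expected) | set(actual)):
        true_pos = sum(1 for e, a in zip(expected, actual) if e == label and a == label)
        predicted = sum(1 for a in actual if a == label)
        support = sum(1 for e in expected if e == label)
        # Treat undefined ratios (division by zero) as 0.0.
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1, "support": support}
    return metrics


expected = [True, False, True]
actual = [True, True, True]  # assumed model answers, see the note above

for label, m in per_label_metrics(expected, actual).items():
    print(f"{label!s:>5}  {m['precision']:.2f}  {m['recall']:.2f}  {m['f1']:.2f}  {m['support']}")
# False  0.00  0.00  0.00  1
#  True  0.67  1.00  0.80  2
```

Precision for True is 2/3 because two of the three True predictions were right, while recall is 1.00 because both expected True cases were found. False scores zero on both because it was never predicted.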

Saving reports

You can save evaluation results to a file by passing the file_path parameter to @pytest.mark.llmeval():

@pytest.mark.llmeval(file_path="results/test_prompts.txt")
def test_prompts(llmeval_result, prompt_template, test_case):
    # Your test code here
    pass

The report is then saved to "results/test_prompts.txt".

Custom analysis functions

If you prefer to do a different analysis across the results, pass a function with the analysis_func parameter:

def my_analysis(test_id, results):
    print(f"My custom analysis function processed {len(results)} results")

@pytest.mark.llmeval(analysis_func=my_analysis)
def test_prompts(llmeval_result, prompt_template, test_case):
    # Your test code here
    pass
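
As a slightly fuller illustration, an analysis function might compute accuracy per group and return the lines to print. This sketch assumes each result exposes expected, actual, and group as attributes, as the ClassificationResult description in the API section suggests. The function name accuracy_by_group and the SimpleNamespace stand-ins are hypothetical and exist only so the snippet runs on its own.

```python
from collections import defaultdict
from types import SimpleNamespace


def accuracy_by_group(test_id, results):
    """Return one report line per group with its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for result in results:
        # Results without a group are pooled under "overall".
        group = result.group or "overall"
        total[group] += 1
        correct[group] += int(result.expected == result.actual)

    lines = [f"# Accuracy by group: {test_id}"]
    for group in sorted(total):
        ratio = correct[group] / total[group]
        lines.append(f"{group}: {correct[group]}/{total[group]} = {ratio:.2f}")
    return lines


# Stand-in results for trying the function outside of pytest.
sample = [
    SimpleNamespace(expected=True, actual=True, group="prompt A"),
    SimpleNamespace(expected=False, actual=True, group="prompt A"),
    SimpleNamespace(expected=True, actual=True, group="prompt B"),
]
print("\n".join(accuracy_by_group("test_prompts", sample)))
```

Under the same assumptions, it would be attached with @pytest.mark.llmeval(analysis_func=accuracy_by_group).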

API

@pytest.mark.llmeval()

Marks a test function for evaluation. The test function will be passed the parameter llmeval_result.

Parameters:

  • file_path (str, optional): Path where the evaluation report will be saved. If not provided, the report will only be displayed in the test output.

  • analysis_func (function, optional): A custom analysis function, called as analysis_func(test_id: str, results: list[ClassificationResult]), to run across all results. Do whatever calculations you want in here. It may optionally return a list of strings to be printed to stdout.

Injected parameters:

  • llmeval_result: An object to track test evaluation results with the following methods:

    • set_result(expected: str, actual: str, input_data: str | dict, group?: str): Record the details of this test result
    • is_correct() -> bool: Returns whether the expected result equals the actual result

ClassificationResult

  • expected: The expected result
  • actual: The actual result
  • input: Input data used for this test case
  • group (optional): An optional variable to group by before running analyses. E.g. pass a prompt to group results by prompt

Installation

You can install "pytest-llmeval" from PyPI with pip, into the same environment as pytest:

pip install pytest-llmeval

Contributing

Contributions are very welcome. Tests can be run with uv run pytest. Please ensure the coverage at least stays the same before you submit a pull request.

This pytest plugin was generated with Cookiecutter along with @hackebrot's cookiecutter-pytest-plugin template.

License

Distributed under the terms of the MIT license, "pytest-llmeval" is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.
