A pytest plugin to evaluate/benchmark LLM prompts

These details have been verified by PyPI

Project links

Repository

GitHub Statistics

Maintainers

kevin.schaul

These details have not been verified by PyPI

Project description

pytest-llmeval

A pytest plugin to evaluate/benchmark LLM prompts

Features

Simple interface: Just mark which tests are LLM evals and store the results
Evaluation metrics: Get comprehensive classification metrics including precision, recall, and F1 scores
Grouped evaluations: Compare how different prompts or models perform across your test cases
File export: Save evaluation reports to file for monitoring performance changes over time
Custom analysis function: Write your own analysis function if you prefer
Pytest integration: Evaluations fit right in with your project's other tests

Usage

See full usage examples in examples/.

The main interface for this plugin is the @pytest.mark.llmeval() decorator, which injects an llmeval_result parameter into your test function.

Basic Usage

You can run the same code across multiple test cases by using pytest's parametrize functionality.

import pytest

TEST_CASES = [
    {"input": "I need to debug this Python code", "expected": True},
    {"input": "The cat jumped over the lazy dog", "expected": False},
    {"input": "My monitor keeps flickering", "expected": True},
]

@pytest.mark.llmeval()
@pytest.mark.parametrize("test_case", TEST_CASES)
def test_computer_related(llmeval_result, test_case):

    # Run your llm code that returns a result for this test case
    result = llm_is_computer_related(test_case["input"])

    # Store the details on `llmeval_result`
    llmeval_result.set_result(
        input_data=test_case["input"],
        expected=test_case["expected"],
        actual=result,
    )

    # `assert` whether the actual result was the expected result
    assert llmeval_result.is_correct()
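
The example above calls llm_is_computer_related, which is your own code and not part of the plugin. To try the test offline, a minimal stand-in might look like the sketch below. It uses keyword matching instead of a real LLM call; the keyword list and the optional prompt argument are assumptions made for illustration.

```python
# Hypothetical stand-in for the helper used in the examples.
# It is NOT part of pytest-llmeval and does not call any LLM.
COMPUTER_KEYWORDS = {"python", "code", "debug", "monitor", "keyboard", "laptop"}


def llm_is_computer_related(text: str, prompt: str = "") -> bool:
    """Return True if the text mentions any computer-related keyword.

    `prompt` is accepted but ignored here; a real implementation would
    send it to the model together with `text`.
    """
    words = {word.strip(".,!?").lower() for word in text.split()}
    return bool(words & COMPUTER_KEYWORDS)
```

Swap the body for a call to your model of choice once the test wiring works.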

Run the tests as usual (with uv run pytest or similar). When the tests complete, a classification report is printed to stdout, in a format like:

# LLM Eval: test_computer_related

## Group: overall
              precision    recall  f1-score   support

        True       0.00      0.00      0.00         1
       False       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.33      0.50      0.40         3
weighted avg       0.44      0.67      0.53         3
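
To read the report, it can help to recompute one row by hand. The sketch below assumes the columns follow the standard definitions of precision, recall and F1; it is not the plugin's own code. The expected and actual lists are illustrative values chosen to be consistent with the sample table above.

```python
def per_label_metrics(expected, actual, label):
    """Compute precision, recall, F1 and support for a single label."""
    pairs = list(zip(expected, actual))
    tp = sum(1 for e, a in pairs if e == label and a == label)
    fp = sum(1 for e, a in pairs if e != label and a == label)
    fn = sum(1 for e, a in pairs if e == label and a != label)
    # Report 0.0 when a ratio is undefined (nothing predicted or nothing expected)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Illustrative data: the model answered False for every input
expected = [True, False, False]
actual = [False, False, False]

false_metrics = per_label_metrics(expected, actual, False)
true_metrics = per_label_metrics(expected, actual, True)
accuracy = sum(e == a for e, a in zip(expected, actual)) / len(expected)
```

With these values the False row has precision 2/3, recall 1.0 and support 2, while the True row is all zeros with support 1.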

Comparing across variables like different prompts or models

You can compare different prompts or other variables by specifying the group= parameter of llmeval_result.set_result():

import pytest

PROMPT_TEMPLATES = [
    "Is this computer related? Say True or False",
    "Say True or False: Is this computer related?",
]

TEST_CASES = [
    {"input": "I need to debug this Python code", "expected": True},
    {"input": "The cat jumped over the lazy dog", "expected": False},
    {"input": "My monitor keeps flickering", "expected": True},
]

@pytest.mark.llmeval()
@pytest.mark.parametrize("prompt_template", PROMPT_TEMPLATES)
@pytest.mark.parametrize("test_case", TEST_CASES)
def test_prompts(llmeval_result, prompt_template, test_case):
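    # Pass the prompt to your own LLM helper so each group really uses a
    # different prompt (the prompt= keyword is illustrative)
    result = llm_is_computer_related(test_case["input"], prompt=prompt_template)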

    llmeval_result.set_result(
        input_data=test_case["input"],
        expected=test_case["expected"],
        actual=result,
        group=prompt_template,
    )
    assert llmeval_result.is_correct()

The report then contains one section per group:

# LLM Eval: test_prompts

## Group: Is this computer related? Say True or False
              precision    recall  f1-score   support

       False       0.00      0.00      0.00         1
        True       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.33      0.50      0.40         3
weighted avg       0.44      0.67      0.53         3


## Group: Say True or False: Is this computer related?
              precision    recall  f1-score   support

       False       0.33      1.00      0.50         1
        True       0.00      0.00      0.00         2

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3

Saving reports

You can save evaluation results to a file by passing the file_path parameter to @pytest.mark.llmeval():

@pytest.mark.llmeval(file_path="results/test_prompts.txt")
def test_prompts(llmeval_result, prompt_template, test_case):
    # Your test code here
    pass

The test report would be saved to "results/test_prompts.txt".

Custom analysis functions

If you prefer to do a different analysis across the results, pass a function with the analysis_func parameter:

def my_analysis(test_id, results):
    print(f"My custom analysis function processed {len(results)} results")

@pytest.mark.llmeval(analysis_func=my_analysis)
def test_prompts(llmeval_result, prompt_template, test_case):
    # Your test code here
    pass
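
As a slightly fuller sketch, an analysis function could report accuracy per group and return the lines to print. It relies only on the expected, actual and group fields described under ClassificationResult in the API section. FakeResult is a hypothetical stand-in used here to try the function outside pytest, and treating a missing group as "overall" is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


def accuracy_by_group(test_id, results):
    """Return one summary line per group for printing to stdout."""
    counts = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for result in results:
        # Assumption: results without a group are reported as "overall"
        group = getattr(result, "group", None) or "overall"
        counts[group][0] += int(result.expected == result.actual)
        counts[group][1] += 1
    return [
        f"{test_id} [{group}]: {correct}/{total} correct"
        for group, (correct, total) in counts.items()
    ]


# Hypothetical stand-in for ClassificationResult, only for trying this out
@dataclass
class FakeResult:
    expected: bool
    actual: bool
    input: str
    group: Optional[str] = None


lines = accuracy_by_group(
    "test_prompts",
    [
        FakeResult(True, True, "a", "prompt A"),
        FakeResult(False, True, "b", "prompt A"),
        FakeResult(True, True, "a", "prompt B"),
    ],
)
```

It would then be passed in the same way as above, with @pytest.mark.llmeval(analysis_func=accuracy_by_group).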

API

`@pytest.mark.llmeval()`

Marks a test function for evaluation. The test function will be passed the parameter llmeval_result.

Parameters:

file_path (str, optional): Path where the evaluation report will be saved. If not provided, the report will only be displayed in the test output.
analysis_func(test_id: str, results: ClassificationResult[]) -> str[] (function, optional): A custom analysis function to run across all results. Do whatever calculations you want in here. Optionally return a list of strings to be printed to stdout.

Injected parameters:

llmeval_result: An object to track test evaluation results with the following methods:
- set_result(expected: str, actual: str, input_data: str | dict, group?: str): Record the details of this test result
- is_correct() -> bool: Returns whether the expected result equals the actual result

`ClassificationResult`

expected: The expected result
actual: The actual result
input: Input data used for this test case
group (optional): An optional variable to group by before running analyses. E.g. pass a prompt to group results by prompt

Installation

You can install "pytest-llmeval" via pip, into the same environment as pytest:

pip install pytest-llmeval

Contributing

Contributions are very welcome. Tests can be run with uv run pytest. Please ensure that coverage at least stays the same before you submit a pull request.

This pytest plugin was generated with Cookiecutter along with @hackebrot's cookiecutter-pytest-plugin template.

License

Distributed under the terms of the MIT license, "pytest-llmeval" is free and open source software.

Issues

If you encounter any problems, please file an issue along with a detailed description.

Release history

This version

0.0.0

Mar 19, 2025

Download files

Source Distribution

pytest_llmeval-0.0.0.tar.gz (9.1 kB view details)

Uploaded Mar 19, 2025 Source

Built Distribution

pytest_llmeval-0.0.0-py3-none-any.whl (7.9 kB view details)

Uploaded Mar 19, 2025 Python 3

File details

Details for the file pytest_llmeval-0.0.0.tar.gz.

File metadata

Download URL: pytest_llmeval-0.0.0.tar.gz
Upload date: Mar 19, 2025
Size: 9.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pytest_llmeval-0.0.0.tar.gz
Algorithm	Hash digest
SHA256	`32e4091fd0821f51484e1751cf1de0d435f7796d989cc0daa3cb5874cf9f0073`
MD5	`6de845e72b7e4992674209d9d7ff02b0`
BLAKE2b-256	`3b2bcc25b00733279f74a7cb85482e8320a7af4e533196ff88a4e0f92d6c2f70`

Provenance

The following attestation bundles were made for pytest_llmeval-0.0.0.tar.gz:

Publisher: release.yml on kevinschaul/pytest-llmeval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_llmeval-0.0.0.tar.gz
- Subject digest: 32e4091fd0821f51484e1751cf1de0d435f7796d989cc0daa3cb5874cf9f0073
- Sigstore transparency entry: 185121360
- Sigstore integration time: Mar 19, 2025
Source repository:
- Permalink: kevinschaul/pytest-llmeval@be3b4b223ef9380feb70309eb6ffd861f4fbe455
- Branch / Tag: refs/tags/v0.0.0
- Owner: https://github.com/kevinschaul
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@be3b4b223ef9380feb70309eb6ffd861f4fbe455
- Trigger Event: release

File details

Details for the file pytest_llmeval-0.0.0-py3-none-any.whl.

File metadata

Download URL: pytest_llmeval-0.0.0-py3-none-any.whl
Upload date: Mar 19, 2025
Size: 7.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pytest_llmeval-0.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`812ab0de969b287c2a5ff68a1e74fe3d1ec117da544c089190a7a46f2f083cf8`
MD5	`4a3a6533edeca98edc34fde731b9fe01`
BLAKE2b-256	`f14c68552f24943437251be004c43bfe2888c7427f1e8869d229007860932fdf`

Provenance

The following attestation bundles were made for pytest_llmeval-0.0.0-py3-none-any.whl:

Publisher: release.yml on kevinschaul/pytest-llmeval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_llmeval-0.0.0-py3-none-any.whl
- Subject digest: 812ab0de969b287c2a5ff68a1e74fe3d1ec117da544c089190a7a46f2f083cf8
- Sigstore transparency entry: 185121361
- Sigstore integration time: Mar 19, 2025
Source repository:
- Permalink: kevinschaul/pytest-llmeval@be3b4b223ef9380feb70309eb6ffd861f4fbe455
- Branch / Tag: refs/tags/v0.0.0
- Owner: https://github.com/kevinschaul
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@be3b4b223ef9380feb70309eb6ffd861f4fbe455
- Trigger Event: release
