A pytest plugin for LLM evaluation tests with threshold-based pass/fail

pytest-agent-eval


LLM evaluation tests that actually mean something. A pytest plugin for testing LLM agents with threshold-based pass/fail scoring, multi-turn transcripts, and LLM-as-judge rubrics — without blowing up your CI bill.

Highlights

  • 🎯 Threshold-based pass/fail — run each test N times, pass when ≥ threshold% succeed
  • 📝 YAML or Python transcripts — pick the authoring style your team prefers
  • 🔍 YAML auto-discovery — drop *.yaml files in any configured directory and they become pytest tests automatically
  • 🛡 CI-safe by default — eval tests skip unless --agent-eval-live or EVAL_LIVE=1
  • ⚡ Parallel-ready — pytest -n auto (via pytest-xdist) just works
  • 📄 Markdown reports — full per-run trace with --agent-eval-report=eval.md
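
The threshold gate in the first highlight amounts to a simple ratio check. A sketch of the semantics (not the plugin's internals — the function name here is illustrative only):

```python
def meets_threshold(passed: int, runs: int, threshold: float) -> bool:
    """Pass when the fraction of successful runs reaches the threshold."""
    return runs > 0 and passed / runs >= threshold

# With runs=3 and threshold=0.8, 2/3 ≈ 0.67 falls short,
# so all three runs must succeed for the test to pass.
print(meets_threshold(2, 3, 0.8))  # False
print(meets_threshold(3, 3, 0.8))  # True
```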

Installation

# pip
pip install pytest-agent-eval

# uv
uv add pytest-agent-eval

Supported frameworks

pytest-agent-eval ships first-class adapters for the major Python agent frameworks. Each is an optional extra so you only install what you use.

| Framework | Extra | Adapter |
| --- | --- | --- |
| pydantic-ai | (default) | pytest_agent_eval.adapters.pydantic_ai.PydanticAIAdapter |
| LangChain / LangGraph | langchain | pytest_agent_eval.adapters.langchain.LangChainAdapter |
| OpenAI SDK | openai | pytest_agent_eval.adapters.openai.OpenAIAdapter |
| smolagents | smolagents | pytest_agent_eval.adapters.smolagents.SmolagentsAdapter |

pip install "pytest-agent-eval[langchain]"
pip install "pytest-agent-eval[openai]"
pip install "pytest-agent-eval[smolagents]"
# or with uv:
uv add "pytest-agent-eval[langchain]"
uv add "pytest-agent-eval[openai]"
uv add "pytest-agent-eval[smolagents]"

Bringing your own framework? Any async def agent(messages) -> (reply, tool_calls) callable works directly — no base class needed.
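
For example, a bring-your-own-framework agent is just an async callable mapping messages to a (reply, tool_calls) pair. The message shape (role/content dicts) and the tool-call structure below are assumptions for illustration; adapt them to what your stack produces:

```python
import asyncio

async def echo_agent(messages):
    """A toy agent matching the async (messages) -> (reply, tool_calls) contract."""
    # Pull the most recent user message out of the transcript.
    last_user = next(m["content"] for m in reversed(messages) if m["role"] == "user")
    reply = f"Booking confirmed for: {last_user}"
    # Report which tools were "called" so ToolCallEvaluator-style checks can run.
    tool_calls = [{"name": "create_booking", "args": {"request": last_user}}]
    return reply, tool_calls

# Quick sanity check outside pytest:
reply, calls = asyncio.run(echo_agent([{"role": "user", "content": "10am tomorrow"}]))
print(reply)  # Booking confirmed for: 10am tomorrow
```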

What you can test

pytest-agent-eval separates the kinds of checks you might want into composable evaluators:

  • Deterministic checks — ContainsEvaluator(any_of=["confirmed", "booked"]) for substring/regex assertions over the agent reply.
  • Tool-call assertions — ToolCallEvaluator(must_include=["create_booking"], ordered=True) to verify that the agent called the right tools, in the right order.
  • LLM-as-judge — JudgeEvaluator(rubric="Reply must be friendly, include a date, and confirm the booking.") for open-ended quality checks the agent under test should meet.

Mix and match per turn — every evaluator participates in the threshold score.
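
To make the deterministic layer concrete, a ContainsEvaluator-style substring/regex check reduces to roughly the following. This is a sketch under assumed semantics (case-insensitive matching, any-of logic), not the plugin's actual code:

```python
import re

def contains_any(reply: str, any_of: list[str], use_regex: bool = False) -> bool:
    """True if the reply matches at least one of the given patterns."""
    if use_regex:
        return any(re.search(p, reply, re.IGNORECASE) for p in any_of)
    return any(p.lower() in reply.lower() for p in any_of)

print(contains_any("Your slot is Booked!", ["confirmed", "booked"]))  # True
print(contains_any("ref #42", [r"#\d+"], use_regex=True))             # True
```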

Quick start

import pytest
from pytest_agent_eval import Turn, Expect, ContainsEvaluator, ToolCallEvaluator, JudgeEvaluator

@pytest.mark.agent_eval(threshold=0.8, runs=3)
async def test_booking(agent_eval):
    result = await agent_eval.run(
        agent=my_agent,
        turns=[
            Turn(
                user="Book me a slot tomorrow at 10am",
                expect=Expect(evaluators=[
                    ContainsEvaluator(any_of=["confirmed", "booked"]),
                    ToolCallEvaluator(must_include=["create_booking"]),
                    JudgeEvaluator(rubric="Reply must include a reference number."),
                ]),
            )
        ],
    )
    result.assert_threshold()

By default the test is skipped in CI; run it live with:

pytest --agent-eval-live

See the full documentation for the YAML authoring style, configuration, and reporting options.

License

MIT — see LICENSE.
