A pytest plugin for LLM evaluation tests with threshold-based pass/fail
Project description
pytest-agent-eval
LLM evaluation tests that actually mean something. A pytest plugin for testing LLM agents with threshold-based pass/fail scoring, multi-turn transcripts, and LLM-as-judge rubrics — without breaking your CI bill.
Highlights
- 🎯 Threshold-based pass/fail — run each test N times, pass when ≥ threshold% succeed
- 📝 YAML or Python transcripts — pick the authoring style your team prefers
- 🔍 YAML auto-discovery — drop
*.yamlfiles in any configured directory and they become pytest tests automatically - 🛡 CI-safe by default — eval tests skip unless
--agent-eval-liveorEVAL_LIVE=1 - ⚡ Parallel-ready —
pytest -n auto(viapytest-xdist) just works - 📄 Markdown reports — full per-run trace with
--agent-eval-report=eval.md
Installation
# pip
pip install pytest-agent-eval
# uv
uv add pytest-agent-eval
Supported frameworks
pytest-agent-eval ships first-class adapters for the major Python agent frameworks. Each is an optional extra so you only install what you use.
| Framework | Extra | Adapter |
|---|---|---|
| pydantic-ai | (default) | pytest_agent_eval.adapters.pydantic_ai.PydanticAIAdapter |
| LangChain / LangGraph | langchain |
pytest_agent_eval.adapters.langchain.LangChainAdapter |
| OpenAI SDK | openai |
pytest_agent_eval.adapters.openai.OpenAIAdapter |
| smolagents | smolagents |
pytest_agent_eval.adapters.smolagents.SmolagentsAdapter |
pip install "pytest-agent-eval[langchain]"
pip install "pytest-agent-eval[openai]"
pip install "pytest-agent-eval[smolagents]"
# or with uv:
uv add "pytest-agent-eval[langchain]"
uv add "pytest-agent-eval[openai]"
uv add "pytest-agent-eval[smolagents]"
Bringing your own framework? Any async def agent(messages) -> (reply, tool_calls) callable works directly — no base class needed.
What you can test
pytest-agent-eval separates the kinds of checks you might want into composable evaluators:
- Deterministic checks —
ContainsEvaluator(any_of=["confirmed", "booked"])for substring/regex assertions over the agent reply. - Tool-call assertions —
ToolCallEvaluator(must_include=["create_booking"], ordered=True)to verify that the agent called the right tools, in the right order. - LLM-as-judge —
JudgeEvaluator(rubric="Reply must be friendly, include a date, and confirm the booking.")for open-ended quality checks the agent under test should meet.
Mix and match per turn — every evaluator participates in the threshold score.
Quick start
import pytest
from pytest_agent_eval import Turn, Expect, ContainsEvaluator, ToolCallEvaluator, JudgeEvaluator
@pytest.mark.agent_eval(threshold=0.8, runs=3)
async def test_booking(agent_eval):
result = await agent_eval.run(
agent=my_agent,
turns=[
Turn(
user="Book me a slot tomorrow at 10am",
expect=Expect(evaluators=[
ContainsEvaluator(any_of=["confirmed", "booked"]),
ToolCallEvaluator(must_include=["create_booking"]),
JudgeEvaluator(rubric="Reply must include a reference number."),
]),
)
],
)
result.assert_threshold()
pytest --agent-eval-live
See the full documentation for the YAML authoring style, configuration, and reporting options.
License
MIT — see LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pytest_agent_eval-0.1.0.tar.gz.
File metadata
- Download URL: pytest_agent_eval-0.1.0.tar.gz
- Upload date:
- Size: 273.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
7287c54bb7005c2e71fabc09523a641487ab2a2555e5651a0963f1feae745828
|
|
| MD5 |
570a95583f29e1db41c4ac871bc26473
|
|
| BLAKE2b-256 |
ad784e48726efb65bcb7d14060adb72a06a33b02e4636c0238a7935ec6509fc4
|
Provenance
The following attestation bundles were made for pytest_agent_eval-0.1.0.tar.gz:
Publisher:
release.yml on datarootsio/pytest-agent-eval
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
pytest_agent_eval-0.1.0.tar.gz -
Subject digest:
7287c54bb7005c2e71fabc09523a641487ab2a2555e5651a0963f1feae745828 - Sigstore transparency entry: 1409356527
- Sigstore integration time:
-
Permalink:
datarootsio/pytest-agent-eval@1da9e4fb236ebba494b34f3b4c9188a820e0b1b8 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/datarootsio
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@1da9e4fb236ebba494b34f3b4c9188a820e0b1b8 -
Trigger Event:
push
-
Statement type:
File details
Details for the file pytest_agent_eval-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pytest_agent_eval-0.1.0-py3-none-any.whl
- Upload date:
- Size: 22.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d3ea8bab73bc2cbf993c3ac80c64361f21570a45edd67535a2d0dc1637265cb4
|
|
| MD5 |
93dd4e06f64102a04013d5804d7d408d
|
|
| BLAKE2b-256 |
a18f6b02e9470af69ab2585b27b34788e1a9c196649582f47bde45bc7438c31a
|
Provenance
The following attestation bundles were made for pytest_agent_eval-0.1.0-py3-none-any.whl:
Publisher:
release.yml on datarootsio/pytest-agent-eval
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
pytest_agent_eval-0.1.0-py3-none-any.whl -
Subject digest:
d3ea8bab73bc2cbf993c3ac80c64361f21570a45edd67535a2d0dc1637265cb4 - Sigstore transparency entry: 1409356530
- Sigstore integration time:
-
Permalink:
datarootsio/pytest-agent-eval@1da9e4fb236ebba494b34f3b4c9188a820e0b1b8 -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/datarootsio
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
release.yml@1da9e4fb236ebba494b34f3b4c9188a820e0b1b8 -
Trigger Event:
push
-
Statement type: