
A pytest plugin for LLM evaluation tests with threshold-based pass/fail


pytest-agent-eval


LLM evaluation tests that actually mean something. A pytest plugin for testing LLM agents with threshold-based pass/fail scoring, multi-turn transcripts, and LLM-as-judge rubrics — without blowing up your CI bill.

Highlights

  • 🎯 Threshold-based pass/fail — run each test N times, pass when ≥ threshold% succeed
  • 📝 YAML or Python transcripts — pick the authoring style your team prefers
  • 🔍 YAML auto-discovery — drop *.yaml files in any configured directory and they become pytest tests automatically
  • 🛡 CI-safe by default — eval tests skip unless --agent-eval-live or EVAL_LIVE=1
  • Parallel-ready — pytest -n auto (via pytest-xdist) just works
  • 📄 Markdown reports — full per-run trace with --agent-eval-report=eval.md

Installation

# pip
pip install pytest-agent-eval

# uv
uv add pytest-agent-eval

Supported frameworks

pytest-agent-eval ships first-class adapters for the major Python agent frameworks. Each is an optional extra so you only install what you use.

| Framework             | Extra      | Adapter                                                   |
|-----------------------|------------|-----------------------------------------------------------|
| pydantic-ai           | (default)  | pytest_agent_eval.adapters.pydantic_ai.PydanticAIAdapter  |
| LangChain / LangGraph | langchain  | pytest_agent_eval.adapters.langchain.LangChainAdapter     |
| OpenAI SDK            | openai     | pytest_agent_eval.adapters.openai.OpenAIAdapter           |
| smolagents            | smolagents | pytest_agent_eval.adapters.smolagents.SmolagentsAdapter   |
| LiveKit (voice)       | livekit    | pytest_agent_eval.adapters.livekit.LiveKitAdapter         |

pip install "pytest-agent-eval[langchain]"
pip install "pytest-agent-eval[openai]"
pip install "pytest-agent-eval[smolagents]"
pip install "pytest-agent-eval[livekit]"
# or with uv:
uv add "pytest-agent-eval[langchain]"
uv add "pytest-agent-eval[openai]"
uv add "pytest-agent-eval[smolagents]"
uv add "pytest-agent-eval[livekit]"

Bringing your own framework? Any async def agent(messages) -> (reply, tool_calls) callable works directly — no base class needed.
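
For instance, a hand-rolled agent can be wired up as a bare coroutine. The snippet below is a minimal sketch of that contract; the toy logic and the OpenAI-style role/content message dicts are illustrative assumptions, not plugin requirements:

# tests/conftest.py -- a hypothetical bare-callable agent, no adapter class
import pytest

async def toy_agent(messages):
    """Takes the conversation so far, returns (reply, tool_calls)."""
    last_user = messages[-1]["content"]  # assumes OpenAI-style role/content dicts
    if "book" in last_user.lower():
        return "Booked! Your slot is confirmed.", ["create_booking"]
    return "How can I help?", []

@pytest.fixture
def llm_eval_agent():
    return toy_agent  # any async (messages) -> (reply, tool_calls) callable works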

What you can test

pytest-agent-eval separates the kinds of checks you might want into composable evaluators:

  • Deterministic checks — ContainsEvaluator(any_of=["confirmed", "booked"]) for substring/regex assertions over the agent reply.
  • Tool-call assertions — ToolCallEvaluator(must_include=["create_booking"], ordered=True) to verify that the agent called the right tools, in the right order.
  • LLM-as-judge — JudgeEvaluator(rubric="Reply must be friendly, include a date, and confirm the booking.") for open-ended quality criteria the agent under test should meet.

Mix and match per turn — every evaluator participates in the threshold score.
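
Concretely, the threshold model described here boils down to: a run counts as passed only if every evaluator on every turn passes, and the transcript passes when the fraction of passing runs clears the threshold. A minimal illustration of that aggregation (not the plugin's internal code):

# Illustrative only: the scoring model described above
def transcript_passes(run_results, threshold):
    # run_results[i] holds every evaluator verdict (across all turns) for run i
    passed_runs = sum(all(verdicts) for verdicts in run_results)
    return passed_runs / len(run_results) >= threshold

# e.g. 2 of 3 runs pass -> score ~0.67, which clears a 0.66 threshold
assert transcript_passes([[True, True], [True, False], [True, True]], 0.66)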

Quick start

You can author evals two ways. YAML is the recommended starting point — it lets non-Python contributors (PMs, QA, domain experts) write tests, keeps eval data readable in code review, and turns each .yaml file into a pytest test automatically.

1 — Configure where YAMLs live

# pyproject.toml
[tool.agent_eval]
model     = "openai:gpt-4o"   # used by the LLM-as-judge
threshold = 0.8               # default pass fraction
runs      = 3                 # default reps per transcript
yaml_dirs = ["tests/evals"]

2 — Wire up your agent once

# tests/conftest.py
import pytest
from pytest_agent_eval.adapters.pydantic_ai import PydanticAIAdapter
from my_app import build_agent

@pytest.fixture
def llm_eval_agent():
    return PydanticAIAdapter(build_agent())
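
The same fixture shape works for the other adapters. With the langchain extra installed, for example, the wiring might look like the sketch below; it assumes the adapter wraps your LangChain/LangGraph agent object the same way PydanticAIAdapter does above, and build_graph is a stand-in for your own factory:

# tests/conftest.py -- LangChain variant (sketch)
import pytest
from pytest_agent_eval.adapters.langchain import LangChainAdapter
from my_app import build_graph  # hypothetical factory for your LangGraph agent

@pytest.fixture
def llm_eval_agent():
    return LangChainAdapter(build_graph())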

3 — Write transcripts as YAML

# tests/evals/booking_single_turn.yaml
id: booking_single_turn
threshold: 0.8
runs: 3

turns:
  - user: "Book me a slot tomorrow at 10am."
    expect:
      reply_contains_any: ["confirmed", "booked"]
      tool_calls_include: ["create_booking"]
      judge:
        rubric: "Reply must confirm the booking and include a reference number."

4 — Run it

pytest --agent-eval-live
============================ test session starts =============================
plugins: agent-eval-0.1.0, asyncio-1.0.0
collected 2 items

tests/evals/booking_single_turn.yaml::booking_single_turn PASSED       [ 50%]
tests/evals/booking_multi_turn.yaml::booking_multi_turn PASSED         [100%]

============================== 2 passed in 14.03s ============================

By default, eval tests are skipped outside explicit live runs (so a stray pytest . doesn't burn API credits). Enable them by passing --agent-eval-live, setting EVAL_LIVE=1, or setting live = true in [tool.agent_eval] to auto-enable live runs locally. See Configuration for the full precedence rules.
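
In CI, for example, a single invocation can gate on the environment variable, fan out across workers, and keep the report as an artifact; every option here appears elsewhere in this README, and -n auto needs pytest-xdist installed:

# CI step (sketch)
EVAL_LIVE=1 pytest -n auto --agent-eval-report=eval.md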

Multi-turn conversations

Real agents fail on context, not single-shot replies. A model that nails turn 1 might forget the user's name by turn 3, or call the wrong tool once the user changes their mind. Multi-turn YAML transcripts test the whole conversation arc, with each turn asserting against the agent's state at that point:

# tests/evals/booking_multi_turn.yaml
id: booking_multi_turn
threshold: 0.66          # tolerate 1/3 flaky runs
runs: 3
tags: [gate:booking, smoke]

turns:
  # Turn 1 — initial booking
  - user: "Book me a slot tomorrow at 10am."
    expect:
      reply_contains_any: ["confirmed", "booked"]
      tool_calls_include: ["create_booking"]

  # Turn 2 — agent must remember the booking from turn 1
  - user: "Actually, can you move it to 11am instead?"
    expect:
      tool_calls_include: ["update_booking"]
      tool_calls_exclude: ["create_booking"]   # must update, not double-book
      judge:
        rubric: "Confirms the new time AND references the original 10am booking."

  # Turn 3 — context propagates further
  - user: "Email me the confirmation."
    expect:
      tool_calls_include: ["send_email"]
      reply_contains_any: ["sent", "email"]

Each turn's full conversation history is built up as the test runs — your agent receives all prior (user, assistant) pairs as context, the same way it would in production. Failures point at the exact turn that broke, not just "the test failed."
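
By turn 3 of the transcript above, the agent therefore receives a history roughly like the sketch below; the assistant replies and the role/content dict shape are illustrative, since the exact message format is the adapter's concern:

# Illustrative context at turn 3 (message shape assumed, see note above)
history = [
    {"role": "user",      "content": "Book me a slot tomorrow at 10am."},
    {"role": "assistant", "content": "Booked you for 10am tomorrow."},
    {"role": "user",      "content": "Actually, can you move it to 11am instead?"},
    {"role": "assistant", "content": "Done, your 10am booking is now at 11am."},
    {"role": "user",      "content": "Email me the confirmation."},
]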

Voice agents are supported via the [livekit] extra — see Voice testing. Each turn declares an audio: turn.wav path; the adapter streams it through a real LiveKit AgentSession and asserts on the same surface (tool calls, reply, judge rubric).
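
A voice transcript might then look like the sketch below. Only the audio: key is described above; the expect: surface is carried over from text turns, and the file name and rubric are placeholders:

# tests/evals/booking_voice.yaml (sketch)
id: booking_voice
turns:
  - audio: turn_01_book_slot.wav
    expect:
      tool_calls_include: ["create_booking"]
      judge:
        rubric: "Spoken reply confirms the booking time."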

Sample report

Add --agent-eval-report=eval.md to get a human-readable trail of every run, every turn, and every evaluator's reasoning. Useful for CI artifacts and PR diffs:

pytest --agent-eval-live --agent-eval-report=eval.md
# LLM Eval Report — 2026-04-30

## Summary

| Transcript            | Runs | Passed | Score | Threshold | Status  |
|-----------------------|------|--------|-------|-----------|---------|
| booking_single_turn   | 3    | 3      | 1.00  | 0.80      | ✅ PASS |
| booking_multi_turn    | 3    | 2      | 0.67  | 0.66      | ✅ PASS |

## Details

### booking_multi_turn
**Run 1**
- Turn 1: PASS
- Turn 2: PASS
  - Judge: Reply confirmed move to 11am and acknowledged original 10am slot.
- Turn 3: PASS

**Run 2**
- Turn 1: PASS
- Turn 2: FAIL
  - Tool calls expected to include 'update_booking', got ['create_booking']
- Turn 3: PASS

**Run 3**
- ...

The two-out-of-three pass still clears the 0.66 threshold, so the suite passes — that's the point of running each transcript multiple times instead of treating LLM tests as binary.

Python API

YAML covers most cases; drop into Python when you need parametrization, programmatic test generation, or per-test fixtures:

import pytest
from pytest_agent_eval import Turn, Expect, ContainsEvaluator, ToolCallEvaluator, JudgeEvaluator

@pytest.mark.agent_eval(threshold=0.8, runs=3)
async def test_booking(agent_eval):
    result = await agent_eval.run(
        agent=my_agent,
        turns=[
            Turn(
                user="Book me a slot tomorrow at 10am",
                expect=Expect(evaluators=[
                    ContainsEvaluator(any_of=["confirmed", "booked"]),
                    ToolCallEvaluator(must_include=["create_booking"]),
                    JudgeEvaluator(rubric="Reply must include a reference number."),
                ]),
            )
        ],
    )
    result.assert_threshold()
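
Because these are ordinary pytest tests, they compose with the usual pytest machinery. A sketch of a parametrized variant, reusing only the names shown above (my_agent and the prompt variants are stand-ins):

@pytest.mark.agent_eval(threshold=0.8, runs=3)
@pytest.mark.parametrize("prompt", [
    "Book me a slot tomorrow at 10am",
    "Reserve any free slot tomorrow morning",  # illustrative variant
])
async def test_booking_variants(agent_eval, prompt):
    result = await agent_eval.run(
        agent=my_agent,
        turns=[Turn(user=prompt, expect=Expect(evaluators=[
            ContainsEvaluator(any_of=["confirmed", "booked"]),
        ]))],
    )
    result.assert_threshold()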

See the full documentation for the complete YAML reference, configuration options, parallel execution, and CI patterns. Contributions are welcome — see CONTRIBUTING.md.

License

MIT — see LICENSE.

Download files

Source Distribution: pytest_agent_eval-0.2.0.tar.gz (305.6 kB)

Built Distribution: pytest_agent_eval-0.2.0-py3-none-any.whl (32.1 kB)

File details

Details for the file pytest_agent_eval-0.2.0.tar.gz.

File metadata

  • Download URL: pytest_agent_eval-0.2.0.tar.gz
  • Upload date:
  • Size: 305.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pytest_agent_eval-0.2.0.tar.gz
Algorithm Hash digest
SHA256 b95dc85eb88702db3ac52b9a50ea4df94ab1e45a6ee54f74f0ae9d3a2a810c7f
MD5 564046e6e3d8773322f96f2afbff0042
BLAKE2b-256 c95bba34442fbcfab2e4ffd988f639f5b65299d190b6128714bd4c32d7b743a2

See more details on using hashes here.

Provenance

The following attestation bundles were made for pytest_agent_eval-0.2.0.tar.gz:

Publisher: release.yml on datarootsio/pytest-agent-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pytest_agent_eval-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for pytest_agent_eval-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 32a5c400a8d38507546953774a8338c68e1eca285b9be3ef41c125f3c5e8edfe
MD5 3f39c76ab8be7561971b6d06fa720daa
BLAKE2b-256 5a7b0d82ce2fa1dace5dc00975aec077164c89e361bb76ddb44e5487fc5a0896

See more details on using hashes here.

Provenance

The following attestation bundles were made for pytest_agent_eval-0.2.0-py3-none-any.whl:

Publisher: release.yml on datarootsio/pytest-agent-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
