Pytest plugin for evaluating AI Agents
Pytest Agent Evals
A pytest plugin for evaluating AI Agents, seamlessly integrated with VS Code Test Explorer and AI Toolkit.
Installation
pip install pytest-agent-evals --pre
Note: This package is currently in beta. The --pre flag is required to install pre-release versions.
Features
- Data Loading: Parametrizing tests from dataset files (JSONL) or inline data.
- Agent Execution: Running agents (ChatAgent or Foundry agent) and caching responses to disk to avoid redundant API calls.
- Evaluation: Running built-in or custom evaluators (LLM-based or code-based) on the agent's response.
- Reporting: Aggregating evaluation results into a JSON report and a terminal summary.
Usage
This plugin enables you to evaluate agent responses against datasets using built-in or custom evaluators.
1. Define Your Agent
You can test both local agents (running in your process) and remote agents (hosted in Microsoft Foundry).
Local Agent (ChatAgent)
Test local agent instances that utilize the agent_framework.ChatAgent class. Use ChatAgentConfig to reference a pytest fixture that provides the initialized agent.
import pytest
from pytest_agent_evals import evals, ChatAgentConfig
from my_app.agents import create_my_agent  # Your source code

@pytest.fixture
def my_agent():
    # Return your initialized agent instance
    return create_my_agent()

@evals.agent(ChatAgentConfig(agent_fixture=my_agent))
class TestMyAgent:
    ...
Remote Agent (Foundry)
Connect to an agent hosted in Foundry using FoundryAgentConfig.
from pytest_agent_evals import evals, FoundryAgentConfig

@evals.agent(FoundryAgentConfig(
    agent_name="my-agent",
    project_endpoint="https://<resource>.services.ai.azure.com/api/projects/<project>"
))
class TestFoundryAgent:
    ...
2. Configure Dataset
Use @evals.dataset to parametrize your test class with data from a JSONL file or an inline list.
@evals.dataset("data.jsonl")
class TestMyAgent:
    ...
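A JSONL file holds one JSON object per line. The sketch below writes and reads back a small data.jsonl using only the standard library. The "query" field name is an assumption made for illustration; the schema the plugin actually expects is not specified in this section.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows; the "query" key is an assumed field name, not a documented schema.
rows = [
    {"query": "What is the capital of France?"},
    {"query": "Summarize the plot of Hamlet in one sentence."},
]

# Write one JSON object per line (the JSONL convention).
dataset_path = Path(tempfile.mkdtemp()) / "data.jsonl"
with dataset_path.open("w", encoding="utf-8") as handle:
    for row in rows:
        handle.write(json.dumps(row) + "\n")

# Read the file back, skipping any blank lines.
with dataset_path.open(encoding="utf-8") as handle:
    loaded = [json.loads(line) for line in handle if line.strip()]

print(f"{len(loaded)} rows written to {dataset_path}")
```

Each line becomes one parametrized case, so keeping rows small and independent makes failures easier to trace back to a single input.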
3. Configure Judge Model
Use @evals.judge_model to configure the LLM used for AI-assisted evaluation (e.g., Azure OpenAI).
from pytest_agent_evals import AzureOpenAIModelConfig

@evals.judge_model(AzureOpenAIModelConfig(
    deployment_name="gpt-4.1",
    endpoint="https://<resource>.openai.azure.com/",
))
class TestEvaluation:
    ...
4. Define Evaluators
Use @evals.evaluator on your test function to register evaluators that run against the agent's response.
Built-in Evaluators
Use BuiltInEvaluatorConfig to configure built-in evaluators (e.g., coherence, relevance).
from pytest_agent_evals import BuiltInEvaluatorConfig

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestEvaluation:
    @evals.evaluator(BuiltInEvaluatorConfig(name="coherence"))
    def test_quality(self, evaluator_results):
        assert evaluator_results.coherence.result == "pass"
Custom Prompt Evaluators
Use CustomPromptEvaluatorConfig to define your own LLM-based evaluation logic using a Jinja2 template.
from pytest_agent_evals import CustomPromptEvaluatorConfig

friendliness_prompt = """
You are an AI assistant that evaluates the tone of a response.
Score the friendliness of the response on a scale of 1 to 5, where 1 is hostile or rude, and 5 is very friendly and warm.
Provide a brief reason for your score.
### Input:
Response:
{{response}}
You must output your result in the following JSON format:
{
    "result": <integer from 1 to 5>,
    "reason": "<brief explanation>"
}
"""

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestMyCustomPrompts:
    @evals.evaluator(CustomPromptEvaluatorConfig(
        name="friendliness",
        prompt=friendliness_prompt,
        threshold=3
    ))
    def test_friendliness(self, evaluator_results):
        assert evaluator_results.friendliness.result == "pass"
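To make the relationship between the judge's JSON output, the threshold, and the "pass" result concrete, here is a minimal offline sketch. It is not the plugin's implementation: to_verdict is a hypothetical helper, and the rule that a score at or above the threshold counts as a pass is an assumption, since the comparison the plugin applies is not stated in this section.

```python
import json


def to_verdict(raw_judge_output: str, threshold: int) -> dict:
    """Hypothetical helper: parse judge JSON and apply an ASSUMED score >= threshold rule."""
    parsed = json.loads(raw_judge_output)
    score = int(parsed["result"])
    # Reject scores outside the 1-5 scale requested in the prompt.
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} is outside the 1-5 scale")
    return {
        "score": score,
        "reason": parsed["reason"],
        "result": "pass" if score >= threshold else "fail",
    }


# Example judge outputs in the JSON shape the prompt asks for.
friendly = to_verdict('{"result": 4, "reason": "Warm and polite."}', threshold=3)
curt = to_verdict('{"result": 2, "reason": "Abrupt tone."}', threshold=3)

print(friendly["result"], curt["result"])
```

Validating the score range before comparing it guards against a judge model that ignores the requested scale.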
Custom Code Evaluators
Use CustomCodeEvaluatorConfig to execute a Python function for deterministic or rule-based grading.
from pytest_agent_evals import CustomCodeEvaluatorConfig

def length_check(sample, item):
    # Return 1.0 (pass) or 0.0 (fail)
    return 1.0 if len(sample["output_text"]) < 100 else 0.0

@evals.agent(...)
@evals.dataset(...)
class TestMyCodeEvals:
    @evals.evaluator(CustomCodeEvaluatorConfig(
        name="conciseness",
        grader=length_check,
        threshold=0.9
    ))
    def test_conciseness(self, evaluator_results):
        assert evaluator_results.conciseness.result == "pass"
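Because a code grader is an ordinary Python function, it can be exercised directly before it is wired into a test class. The sketch below calls length_check with hand-built inputs. The exact shape of sample and item that the plugin passes is an assumption here: only the output_text key appears in the example above, and the empty item dictionary is a placeholder.

```python
def length_check(sample, item):
    # Return 1.0 (pass) or 0.0 (fail)
    return 1.0 if len(sample["output_text"]) < 100 else 0.0


# Hand-built inputs; the dictionaries the plugin supplies may carry more keys (assumption).
short_sample = {"output_text": "Paris is the capital of France."}
long_sample = {"output_text": "word " * 40}  # 200 characters, over the 100-character limit

short_score = length_check(short_sample, item={})
long_score = length_check(long_sample, item={})

print(short_score, long_score)
```

Checking a grader in isolation like this separates bugs in the grading rule from problems in agent execution or dataset loading.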
CLI Options
List Evaluations
Preview the evaluations that will be run, grouped by unique combinations of Agent, Dataset, and Evaluators.
pytest --collect-evals
Cache Management
Control how agent responses are cached during test execution.
# 'session' (default): Clears cache at startup.
# Ensures consistency by sharing the same fresh response across all evaluators for a query.
pytest --cache-mode session
# 'persistence': Preserves cache across sessions.
# Avoids redundant agent execution, enabling rapid evaluator tuning without agent changes.
pytest --cache-mode persistence
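The difference between the two modes can be illustrated with a toy on-disk cache. This is a sketch of the general idea only, not the plugin's implementation: the ResponseCache class, the hashing scheme, and the file layout are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ResponseCache:
    """Toy on-disk cache keyed by a hash of the query (hypothetical, not the plugin's code)."""

    def __init__(self, directory: Path, mode: str = "session") -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)
        if mode == "session":
            # 'session': start every run from an empty cache.
            for entry in self.directory.glob("*.json"):
                entry.unlink()

    def _path(self, query: str) -> Path:
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_run(self, query: str, run_agent) -> str:
        path = self._path(query)
        if path.exists():
            # Cache hit: reuse the stored response instead of calling the agent.
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = run_agent(query)
        path.write_text(json.dumps({"response": response}), encoding="utf-8")
        return response


calls = []


def fake_agent(query: str) -> str:
    # Stand-in for a real agent; records each invocation.
    calls.append(query)
    return f"echo: {query}"


cache_dir = Path(tempfile.mkdtemp())

first_run = ResponseCache(cache_dir, mode="session")
first_run.get_or_run("hello", fake_agent)
first_run.get_or_run("hello", fake_agent)  # second evaluator reuses the same response
calls_after_first_run = len(calls)

second_run = ResponseCache(cache_dir, mode="persistence")
second_run.get_or_run("hello", fake_agent)  # cache survives, agent is not called
calls_after_persistence = len(calls)

third_run = ResponseCache(cache_dir, mode="session")
third_run.get_or_run("hello", fake_agent)  # cache was cleared, agent runs again
calls_after_session = len(calls)
```

In this toy model, persistence trades freshness for speed: a changed agent would keep returning stale responses until the cache is cleared, which is why a session-style default is the safer choice.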
Requirements
- Python 3.10+
- Visual Studio Code (recommended for running tests from Test Explorer)
- VS Code AI Toolkit (recommended for visualizing and analyzing evaluation results, submitting evaluations to run in Foundry, etc.)
License
This project is licensed under the Microsoft AI Toolkit – Pytest Agent Evals License Terms.
File details
Details for the file pytest_agent_evals-0.0.1b260130.tar.gz.
File metadata
- Download URL: pytest_agent_evals-0.0.1b260130.tar.gz
- Upload date:
- Size: 32.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2fa5a605cc6c84ad97cd088e08e529d8b2f1c0de2c66a75e832d6abc41b87a57 |
| MD5 | a6bc3395fa901f96034d5c3c3c70181f |
| BLAKE2b-256 | a1cddff231a4be3bce0f8412f60699604cf10745e24b6b6d52b573b564fd9775 |
File details
Details for the file pytest_agent_evals-0.0.1b260130-py3-none-any.whl.
File metadata
- Download URL: pytest_agent_evals-0.0.1b260130-py3-none-any.whl
- Upload date:
- Size: 34.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ff8c3cb6148d78780386204380f9da30e6f29a36796715804c0933486e586043 |
| MD5 | e399601f4b7a958e2272b5cd1ce42030 |
| BLAKE2b-256 | 452962165a985b7ca2729ca86c04f6ad5f9354ea99589981682c38285b908672 |