
Pytest plugin for evaluating AI Agents

Pytest Agent Evals

A pytest plugin for evaluating AI Agents, seamlessly integrated with VS Code Test Explorer and AI Toolkit.

Installation

pip install pytest-agent-evals --pre

Note: This package is currently in beta. The --pre flag is required to install pre-release versions.

Features

  • Data Loading: Parametrizing tests from dataset files (JSONL) or inline data.
  • Agent Execution: Running agents (ChatAgent or Foundry agent) and caching responses to disk to avoid redundant API calls.
  • Evaluation: Running built-in or custom evaluators (LLM-based or code-based) on the agent's response.
  • Reporting: Aggregating evaluation results into a JSON report and a terminal summary.

Usage

This plugin enables you to evaluate agent responses against datasets using built-in or custom evaluators.

1. Define Your Agent

You can test both local agents (running in your process) and remote agents (hosted in Microsoft Foundry).

Local Agent (ChatAgent)

Test local agent instances built on the agent_framework.ChatAgent class. Use ChatAgentConfig to reference the pytest fixture that provides the initialized agent.

import pytest
from pytest_agent_evals import evals, ChatAgentConfig
from my_app.agents import create_my_agent  # Your source code

@pytest.fixture
def my_agent():
    # Return your initialized agent instance
    return create_my_agent()

@evals.agent(ChatAgentConfig(agent_fixture=my_agent))
class TestMyAgent:
    ...

Remote Agent (Foundry)

Connect to an agent hosted in Foundry using FoundryAgentConfig.

from pytest_agent_evals import evals, FoundryAgentConfig

@evals.agent(FoundryAgentConfig(
    agent_name="my-agent",
    project_endpoint="https://<resource>.services.ai.azure.com/api/projects/<project>"
))
class TestFoundryAgent:
    ...

2. Configure Dataset

Use @evals.dataset to parametrize your test class with data from a JSONL file or an inline list.

@evals.dataset("data.jsonl")
class TestMyAgent:
    ...
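
A JSONL file holds one JSON object per line. The sketch below shows one way to generate such a file with the standard library. The "query" field name and the sample rows are assumptions made for illustration; check the plugin's documentation for the fields it actually expects.

```python
import json
from pathlib import Path

# Hypothetical rows: the "query" field name is an assumption, not a documented schema.
rows = [
    {"query": "What is the capital of France?"},
    {"query": "Summarize the plot of Hamlet in one sentence."},
]


def write_jsonl(path, records):
    """Write one JSON object per line, which is the JSONL layout."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back as a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    write_jsonl("data.jsonl", rows)
    print(read_jsonl("data.jsonl"))
```

Reading the file back before running the tests is a cheap way to catch malformed lines early.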

3. Configure Judge Model

Use @evals.judge_model to configure the LLM used for AI-assisted evaluation (e.g., Azure OpenAI).

from pytest_agent_evals import AzureOpenAIModelConfig

@evals.judge_model(AzureOpenAIModelConfig(
    deployment_name="gpt-4.1",
    endpoint="https://<resource>.openai.azure.com/",
))
class TestEvaluation:
    ...

4. Define Evaluators

Use @evals.evaluator on your test function to register evaluators that run against the agent's response.

Built-in Evaluators

Use BuiltInEvaluatorConfig to configure built-in evaluators (e.g., coherence, relevance).

from pytest_agent_evals import BuiltInEvaluatorConfig

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestEvaluation:

    @evals.evaluator(BuiltInEvaluatorConfig(name="coherence"))
    def test_quality(self, evaluator_results):
        assert evaluator_results.coherence.result == "pass"

Custom Prompt Evaluators

Use CustomPromptEvaluatorConfig to define your own LLM-based evaluation logic using a Jinja2 template.

from pytest_agent_evals import CustomPromptEvaluatorConfig

friendliness_prompt = """
You are an AI assistant that evaluates the tone of a response.
Score the friendliness of the response on a scale of 1 to 5, where 1 is hostile or rude, and 5 is very friendly and warm.
Provide a brief reason for your score.

### Input:
Response:
{{response}}

You must output your result in the following JSON format:
{
    "result": <integer from 1 to 5>,
    "reason": "<brief explanation>"
}
"""

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestMyCustomPrompts:

    @evals.evaluator(CustomPromptEvaluatorConfig(
        name="friendliness",
        prompt=friendliness_prompt,
        threshold=3
    ))
    def test_friendliness(self, evaluator_results):
        assert evaluator_results.friendliness.result == "pass"
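
The prompt above asks the judge for a JSON object with an integer "result" and a "reason". The following sketch is not the plugin's implementation; it only illustrates how such output could be validated and turned into a pass/fail verdict. The helper names parse_judge_output and to_verdict are hypothetical, and the rule that a score at or above the threshold passes is an assumption.

```python
import json


def parse_judge_output(raw):
    """Validate the JSON shape requested in the prompt: integer 1-5 plus a reason."""
    data = json.loads(raw)
    score = data["result"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"result must be an integer from 1 to 5, got {score!r}")
    return score, str(data.get("reason", ""))


def to_verdict(score, threshold=3):
    # Assumption: a score at or above the threshold counts as "pass".
    return "pass" if score >= threshold else "fail"


score, reason = parse_judge_output('{"result": 4, "reason": "Warm and polite."}')
print(to_verdict(score), "-", reason)
```

Running a few hand-written judge outputs through a check like this can help when tuning the prompt wording and the threshold.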

Custom Code Evaluators

Use CustomCodeEvaluatorConfig to execute a Python function for deterministic or rule-based grading.

from pytest_agent_evals import CustomCodeEvaluatorConfig

def length_check(sample, item):
    # Return 1.0 (pass) or 0.0 (fail)
    return 1.0 if len(sample["output_text"]) < 100 else 0.0

@evals.agent(...)
@evals.dataset(...)
class TestMyCodeEvals:

    @evals.evaluator(CustomCodeEvaluatorConfig(
        name="conciseness",
        grader=length_check,
        threshold=0.9
    ))
    def test_conciseness(self, evaluator_results):
        assert evaluator_results.conciseness.result == "pass"
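
Because a grader is a plain Python function, it can be checked on its own before it is wired into a test class. The sketch below calls length_check directly with hand-built dictionaries. Only the "output_text" key comes from the example above; the empty item dictionary is a placeholder assumption.

```python
def length_check(sample, item):
    # Same grader as above: 1.0 when the response is under 100 characters.
    return 1.0 if len(sample["output_text"]) < 100 else 0.0


# Hand-built inputs; only the "output_text" key is taken from the example above.
short_sample = {"output_text": "Paris."}
long_sample = {"output_text": "x" * 250}

assert length_check(short_sample, item={}) == 1.0
assert length_check(long_sample, item={}) == 0.0
# Boundary: exactly 100 characters fails because the comparison is strict.
assert length_check({"output_text": "x" * 100}, item={}) == 0.0
```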

CLI Options

List Evaluations

Preview the evaluations that will be run, grouped by unique combinations of Agent, Dataset, and Evaluators.

pytest --collect-evals

Cache Management

Control how agent responses are cached during test execution.

# 'session' (default): Clears cache at startup. 
# Ensures consistency by sharing the same fresh response across all evaluators for a query.
pytest --cache-mode session

# 'persistence': Preserves cache across sessions. 
# Avoids redundant agent execution, enabling rapid evaluator tuning without agent changes.
pytest --cache-mode persistence
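
The difference between the two modes can be pictured with a toy disk cache. This is a conceptual sketch only, not the plugin's code: the ResponseCache class, its file layout, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


class ResponseCache:
    """Toy disk cache keyed by a hash of the query text (hypothetical design)."""

    def __init__(self, root, mode="session"):
        self.root = Path(root)
        if mode == "session" and self.root.exists():
            shutil.rmtree(self.root)  # start each session with an empty cache
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, query):
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.json"

    def get_or_run(self, query, run_agent):
        path = self._path(query)
        if path.exists():
            # Cache hit: reuse the stored response instead of calling the agent.
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = run_agent(query)  # only called on a cache miss
        path.write_text(
            json.dumps({"query": query, "response": response}), encoding="utf-8"
        )
        return response
```

In this sketch, "session" mode wipes the directory on construction, so every evaluator in a run sees the same fresh response, while "persistence" mode keeps earlier files and skips the agent call entirely.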

Requirements

  • Python 3.10+
  • Visual Studio Code (recommended for running tests from Test Explorer)
  • VS Code AI Toolkit (recommended for visualizing and analyzing evaluation results, submitting evaluations to run in Foundry, etc.)

License

This project is licensed under the Microsoft AI Toolkit – Pytest Agent Evals License Terms.
