Pytest plugin for evaluating AI Agents

Pytest Agent Evals

A pytest plugin for evaluating AI Agents, integrated with the VS Code Test Explorer and AI Toolkit.

Installation

pip install pytest-agent-evals --pre

Note: This package is currently in beta. The --pre flag is required to install pre-release versions.

Features

  • Data Loading: Parametrizing tests from dataset files (JSONL) or inline data.
  • Agent Execution: Running agents (ChatAgent or Foundry agent) and caching responses to disk to avoid redundant API calls.
  • Evaluation: Running built-in or custom evaluators (LLM-based or code-based) on the agent's response.
  • Reporting: Aggregating evaluation results into a JSON report and a terminal summary.

Usage

This plugin enables you to evaluate agent responses against datasets using built-in or custom evaluators.

1. Define Your Agent

You can test both local agents (running in your process) and remote agents (hosted in Microsoft Foundry).

Local Agent (ChatAgent)

Test local agent instances built on the agent_framework.ChatAgent class. Use ChatAgentConfig to reference a pytest fixture that provides the initialized agent.

import pytest
from pytest_agent_evals import evals, ChatAgentConfig
from my_app.agents import create_my_agent  # Your source code

@pytest.fixture
def my_agent():
    # Return your initialized agent instance
    return create_my_agent()

@evals.agent(ChatAgentConfig(agent_fixture=my_agent))
class TestMyAgent:
    ...

Remote Agent (Foundry)

Connect to an agent hosted in Foundry using FoundryAgentConfig.

from pytest_agent_evals import evals, FoundryAgentConfig

@evals.agent(FoundryAgentConfig(
    agent_name="my-agent",
    project_endpoint="https://<resource>.services.ai.azure.com/api/projects/<project>"
))
class TestFoundryAgent:
    ...

2. Configure Dataset

Use @evals.dataset to parametrize your test class with data from a JSONL file or an inline list.

@evals.dataset("data.jsonl")
class TestMyAgent:
    ...
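
A JSONL dataset holds one JSON object per line. The sketch below writes such a file and reads it back using only the standard library. The field name `query` and the sample rows are assumptions for illustration; the schema the plugin expects is not specified here, so check the plugin's own documentation before relying on these names.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: the "query" field name is an assumption,
# not the plugin's documented schema.
rows = [
    {"query": "What is the capital of France?"},
    {"query": "Summarize the plot of Hamlet in one sentence."},
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary directory so the sketch leaves the project untouched.
dataset_path = Path(tempfile.mkdtemp()) / "data.jsonl"
write_jsonl(dataset_path, rows)
loaded = read_jsonl(dataset_path)
```

Reading the file back before running the suite is a cheap way to catch a malformed line, which would otherwise surface as a parametrization error.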

3. Configure Judge Model

Use @evals.judge_model to configure the LLM used for AI-assisted evaluation (e.g., Azure OpenAI).

from pytest_agent_evals import AzureOpenAIModelConfig

@evals.judge_model(AzureOpenAIModelConfig(
    deployment_name="gpt-4.1",
    endpoint="https://<resource>.openai.azure.com/",
))
class TestEvaluation:
    ...

4. Define Evaluators

Use @evals.evaluator on your test function to register evaluators that run against the agent's response.

Built-in Evaluators

Use BuiltInEvaluatorConfig to configure built-in evaluators (e.g., coherence, relevance).

from pytest_agent_evals import BuiltInEvaluatorConfig

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestEvaluation:

    @evals.evaluator(BuiltInEvaluatorConfig(name="coherence"))
    def test_quality(self, evaluator_results):
        assert evaluator_results.coherence.result == "pass"

Custom Prompt Evaluators

Use CustomPromptEvaluatorConfig to define your own LLM-based evaluation logic using a Jinja2 template.

from pytest_agent_evals import CustomPromptEvaluatorConfig

friendliness_prompt = """
You are an AI assistant that evaluates the tone of a response.
Score the friendliness of the response on a scale of 1 to 5, where 1 is hostile or rude, and 5 is very friendly and warm.
Provide a brief reason for your score.

### Input:
Response:
{{response}}

You must output your result in the following JSON format:
{
    "result": <integer from 1 to 5>,
    "reason": "<brief explanation>"
}
"""

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestMyCustomPrompts:

    @evals.evaluator(CustomPromptEvaluatorConfig(
        name="friendliness",
        prompt=friendliness_prompt,
        threshold=3
    ))
    def test_friendliness(self, evaluator_results):
        assert evaluator_results.friendliness.result == "pass"
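
To illustrate how a numeric judge score could become the "pass" or "fail" result asserted above, here is a minimal sketch that parses a reply in the JSON format the prompt requests and compares it to a threshold. The helper `grade_judge_output` is hypothetical, and the rule that a score at or above the threshold passes is an assumption; the plugin's actual comparison is not documented here.

```python
import json


def grade_judge_output(raw_output: str, threshold: int) -> dict:
    """Parse the judge's JSON reply and apply a pass threshold.

    Assumption: a score greater than or equal to the threshold passes.
    """
    parsed = json.loads(raw_output)
    score = int(parsed["result"])
    return {
        "score": score,
        "reason": parsed.get("reason", ""),
        "result": "pass" if score >= threshold else "fail",
    }


# Hypothetical judge reply in the format the prompt asks for.
reply = '{"result": 4, "reason": "Warm greeting and polite tone."}'
graded = grade_judge_output(reply, threshold=3)
```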

Custom Code Evaluators

Use CustomCodeEvaluatorConfig to execute a Python function for deterministic or rule-based grading.

from pytest_agent_evals import CustomCodeEvaluatorConfig

def length_check(sample, item):
    # Return 1.0 (pass) or 0.0 (fail)
    return 1.0 if len(sample["output_text"]) < 100 else 0.0

@evals.agent(...)
@evals.dataset(...)
class TestMyCodeEvals:

    @evals.evaluator(CustomCodeEvaluatorConfig(
        name="conciseness",
        grader=length_check,
        threshold=0.9
    ))
    def test_conciseness(self, evaluator_results):
        assert evaluator_results.conciseness.result == "pass"
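
Because a grader is a plain Python function, it can be checked on its own before it is wired into an evaluator. The sketch below calls the grader directly with hand-built inputs. Only the `output_text` key comes from the example above; the sample values and the empty `item` dict are assumptions for illustration.

```python
def length_check(sample, item):
    # Same grader as above: 1.0 (pass) when the response is shorter
    # than 100 characters, otherwise 0.0 (fail).
    return 1.0 if len(sample["output_text"]) < 100 else 0.0


# Hypothetical inputs: only the "output_text" key is taken from the grader above.
short_sample = {"output_text": "Paris is the capital of France."}
long_sample = {"output_text": "x" * 100}  # exactly 100 characters, so not under the limit

short_score = length_check(short_sample, item={})
long_score = length_check(long_sample, item={})
```

Since this grader only returns 0.0 or 1.0, a threshold of 0.9 would separate the two outcomes cleanly, assuming the plugin compares the score against the threshold.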

CLI Options

List Evaluations

Preview the evaluations that will be run, grouped by unique combinations of Agent, Dataset, and Evaluators.

pytest --collect-evals

Cache Management

Control how agent responses are cached during test execution.

# 'session' (default): Clears cache at startup. 
# Ensures consistency by sharing the same fresh response across all evaluators for a query.
pytest --cache-mode session

# 'persistence': Preserves cache across sessions. 
# Avoids redundant agent execution, enabling rapid evaluator tuning without agent changes.
pytest --cache-mode persistence
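
The difference between the two modes can be pictured with a toy disk cache. This is an illustrative sketch only, not the plugin's implementation: the `ResponseCache` class, the hash-based file naming, and the `fake_agent` stand-in are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


class ResponseCache:
    """Toy disk cache keyed by a hash of the query (illustrative only)."""

    def __init__(self, cache_dir, mode="session"):
        self.cache_dir = Path(cache_dir)
        if mode == "session" and self.cache_dir.exists():
            # 'session': discard earlier responses so every run starts fresh.
            shutil.rmtree(self.cache_dir)
        # 'persistence': keep whatever is already on disk.
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, query):
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get_or_run(self, query, run_agent):
        """Return the cached response, or run the agent once and store it."""
        path = self._path(query)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = run_agent(query)
        path.write_text(json.dumps({"response": response}), encoding="utf-8")
        return response


calls = []


def fake_agent(query):
    # Stand-in for a real agent; records each invocation.
    calls.append(query)
    return f"answer to {query}"


cache_dir = Path(tempfile.mkdtemp()) / "agent_cache"

# First run: two evaluators ask for the same query, the agent runs once.
first = ResponseCache(cache_dir, mode="session")
first.get_or_run("hello", fake_agent)
first.get_or_run("hello", fake_agent)
calls_after_first = len(calls)

# 'persistence' run: the stored response is reused, the agent is not called.
second = ResponseCache(cache_dir, mode="persistence")
second.get_or_run("hello", fake_agent)
calls_after_persistence = len(calls)

# New 'session' run: the cache is cleared, so the agent runs again.
third = ResponseCache(cache_dir, mode="session")
third.get_or_run("hello", fake_agent)
calls_after_session = len(calls)
```

In this model, persistence trades freshness for speed: it suits iterating on evaluators, while session mode suits runs where the agent itself has changed.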

Requirements

  • Python 3.10+
  • Visual Studio Code (recommended for running tests from Test Explorer)
  • VS Code AI Toolkit (recommended for visualizing and analyzing evaluation results, submitting evaluations to run in Foundry, etc.)

License

This project is licensed under the Microsoft AI Toolkit – Pytest Agent Evals License Terms.

