Pytest plugin for evaluating AI Agents

Pytest Agent Evals

A pytest plugin for evaluating AI Agents, seamlessly integrated with VS Code Test Explorer and AI Toolkit.

Installation

pip install pytest-agent-evals --pre

Note: This package is currently in beta. The --pre flag is required to install pre-release versions.

Features

  • Data Loading: Parametrizing tests from dataset files (JSONL) or inline data.
  • Agent Execution: Running agents (ChatAgent or Foundry agent) and caching responses to disk to avoid redundant API calls.
  • Evaluation: Running built-in or custom evaluators (LLM-based or code-based) on the agent's response.
  • Reporting: Aggregating evaluation results into a JSON report and a terminal summary.

Usage

This plugin enables you to evaluate agent responses against datasets using built-in or custom evaluators.

1. Define Your Agent

You can test both local agents (running in your process) and remote agents (hosted in Microsoft Foundry).

Local Agent (ChatAgent)

Test local agent instances built on the agent_framework.ChatAgent class. Use ChatAgentConfig to reference a pytest fixture that provides the initialized agent.

import pytest
from pytest_agent_evals import evals, ChatAgentConfig
from my_app.agents import create_my_agent  # Your source code

@pytest.fixture
def my_agent():
    # Return your initialized agent instance
    return create_my_agent()

@evals.agent(ChatAgentConfig(agent_fixture=my_agent))
class TestMyAgent:
    ...

Remote Agent (Foundry)

Connect to an agent hosted in Foundry using FoundryAgentConfig.

from pytest_agent_evals import evals, FoundryAgentConfig

@evals.agent(FoundryAgentConfig(
    agent_name="my-agent",
    project_endpoint="https://<resource>.services.ai.azure.com/api/projects/<project>"
))
class TestFoundryAgent:
    ...

2. Configure Dataset

Use @evals.dataset to parametrize your test class with data from a JSONL file or an inline list.

from pytest_agent_evals import evals

@evals.dataset("data.jsonl")
class TestMyAgent:
    ...
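
For illustration, a JSONL dataset holds one JSON object per line. The sketch below writes and reads back a small data.jsonl using only the standard library. The field name "query" is an assumption made for this example, not a schema documented here, so check the plugin's documentation for the keys it actually expects.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: the "query" key is an assumption for this sketch,
# not a field name confirmed by the plugin.
rows = [
    {"query": "What is the capital of France?"},
    {"query": "Summarize the plot of Hamlet in one sentence."},
]

def write_jsonl(path, records):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Written to a temporary directory so the sketch leaves no files behind.
dataset_path = Path(tempfile.mkdtemp()) / "data.jsonl"
write_jsonl(dataset_path, rows)
loaded = read_jsonl(dataset_path)
```

Reading the file back is a quick way to confirm that every line is valid JSON before pointing @evals.dataset at it.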

3. Configure Judge Model

Use @evals.judge_model to configure the LLM used for AI-assisted evaluation (e.g., Azure OpenAI).

from pytest_agent_evals import evals, AzureOpenAIModelConfig

@evals.judge_model(AzureOpenAIModelConfig(
    deployment_name="gpt-4.1",
    endpoint="https://<resource>.openai.azure.com/",
))
class TestEvaluation:
    ...

4. Define Evaluators

Use @evals.evaluator on your test function to register evaluators that run against the agent's response.

Built-in Evaluators

Use BuiltInEvaluatorConfig to configure built-in evaluators (e.g., coherence, relevance).

from pytest_agent_evals import evals, BuiltInEvaluatorConfig

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestEvaluation:

    @evals.evaluator(BuiltInEvaluatorConfig(name="coherence"))
    def test_quality(self, evaluator_results):
        assert evaluator_results.coherence.result == "pass"

Custom Prompt Evaluators

Use CustomPromptEvaluatorConfig to define your own LLM-based evaluation logic using a Jinja2 template.

from pytest_agent_evals import evals, CustomPromptEvaluatorConfig

friendliness_prompt = """
You are an AI assistant that evaluates the tone of a response.
Score the friendliness of the response on a scale of 1 to 5, where 1 is hostile or rude, and 5 is very friendly and warm.
Provide a brief reason for your score.

### Input:
Response:
{{response}}

You must output your result in the following JSON format:
{
    "result": <integer from 1 to 5>,
    "reason": "<brief explanation>"
}
"""

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestMyCustomPrompts:

    @evals.evaluator(CustomPromptEvaluatorConfig(
        name="friendliness",
        prompt=friendliness_prompt,
        threshold=3
    ))
    def test_friendliness(self, evaluator_results):
        assert evaluator_results.friendliness.result == "pass"
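
The prompt above asks the judge for a JSON object with an integer result, while the test asserts on "pass". A minimal sketch of how such a score could be turned into a verdict follows. It assumes that a score at or above the threshold passes, which this README does not state, and parse_judge_output and to_verdict are hypothetical helpers written for illustration, not part of the plugin.

```python
import json

def parse_judge_output(raw):
    """Parse the JSON object the prompt asks the judge model to emit."""
    data = json.loads(raw)
    return int(data["result"]), str(data.get("reason", ""))

def to_verdict(score, threshold):
    """Assumption: scores at or above the threshold count as a pass."""
    return "pass" if score >= threshold else "fail"

# Example judge output in the format requested by the prompt.
raw_output = '{"result": 4, "reason": "Warm and polite tone."}'
score, reason = parse_judge_output(raw_output)
verdict = to_verdict(score, threshold=3)
```

If the plugin compares scores differently (for example, strictly greater than the threshold), only to_verdict would need to change.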

Custom Code Evaluators

Use CustomCodeEvaluatorConfig to execute a Python function for deterministic or rule-based grading.

from pytest_agent_evals import evals, CustomCodeEvaluatorConfig

def length_check(sample, item):
    # Return 1.0 (pass) or 0.0 (fail)
    return 1.0 if len(sample["output_text"]) < 100 else 0.0

@evals.agent(...)
@evals.dataset(...)
class TestMyCodeEvals:

    @evals.evaluator(CustomCodeEvaluatorConfig(
        name="conciseness",
        grader=length_check,
        threshold=0.9
    ))
    def test_conciseness(self, evaluator_results):
        assert evaluator_results.conciseness.result == "pass"
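
Because the grader is a plain Python function, it can be checked on its own before it is wired into a test class. The sketch below calls length_check with hand-built dictionaries. Only the output_text key comes from the example above; passing None for item is an assumption that works here because this grader ignores that argument.

```python
def length_check(sample, item):
    # Return 1.0 (pass) or 0.0 (fail)
    return 1.0 if len(sample["output_text"]) < 100 else 0.0

# Hand-built samples; the real plugin may pass additional keys.
short_sample = {"output_text": "Paris is the capital of France."}
long_sample = {"output_text": "x" * 150}

short_score = length_check(short_sample, None)  # under 100 characters
long_score = length_check(long_sample, None)    # 150 characters
```

Note that the comparison is strict, so a response of exactly 100 characters scores 0.0.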

CLI Options

List Evaluations

Preview the evaluations that will be run, grouped by unique combinations of Agent, Dataset, and Evaluators.

pytest --collect-evals

Cache Management

Control how agent responses are cached during test execution.

# 'session' (default): Clears cache at startup. 
# Ensures consistency by sharing the same fresh response across all evaluators for a query.
pytest --cache-mode session

# 'persistence': Preserves cache across sessions. 
# Avoids redundant agent execution, enabling rapid evaluator tuning without agent changes.
pytest --cache-mode persistence
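
To illustrate the difference between the two modes, here is a conceptual sketch of a disk cache keyed by query. It is not the plugin's implementation: the ResponseCache class, its file layout, and the hashing scheme are all assumptions made for this example.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

class ResponseCache:
    """Hypothetical disk cache; not the plugin's actual implementation."""

    def __init__(self, cache_dir, mode="session"):
        self.cache_dir = Path(cache_dir)
        if mode == "session" and self.cache_dir.exists():
            # 'session': start from an empty cache on every run.
            shutil.rmtree(self.cache_dir)
        # 'persistence': keep whatever earlier runs stored.
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, query):
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get_or_run(self, query, run_agent):
        """Return the cached response, calling run_agent only on a miss."""
        path = self._path(query)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = run_agent(query)
        path.write_text(json.dumps({"response": response}), encoding="utf-8")
        return response

calls = []

def fake_agent(query):
    # Stand-in for a real agent call; records each invocation.
    calls.append(query)
    return f"answer to: {query}"

cache_dir = Path(tempfile.mkdtemp()) / "agent_cache"

first = ResponseCache(cache_dir, mode="session")
first.get_or_run("hello", fake_agent)
first.get_or_run("hello", fake_agent)   # served from disk, no second call
calls_after_first = len(calls)

second = ResponseCache(cache_dir, mode="persistence")
second.get_or_run("hello", fake_agent)  # still cached from the first run
calls_after_persistence = len(calls)

third = ResponseCache(cache_dir, mode="session")
third.get_or_run("hello", fake_agent)   # cache was cleared, agent runs again
calls_after_session = len(calls)
```

In this sketch, the agent is invoked once in the first session, not at all in the persistence run, and once more when a new session clears the cache.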

Requirements

  • Python 3.10+
  • Visual Studio Code (recommended for running tests from Test Explorer)
  • VS Code AI Toolkit (recommended for visualizing and analyzing evaluation results, submitting evaluations to run in Foundry, etc.)

License

This project is licensed under the Microsoft AI Toolkit – Pytest Agent Evals License Terms.
