
Pytest plugin for evaluating AI Agents


Pytest Agent Evals

A pytest plugin for evaluating AI Agents, seamlessly integrated with VS Code Test Explorer and AI Toolkit.

Installation

pip install pytest-agent-evals --pre

Note: This package is currently in beta. The --pre flag is required to install pre-release versions.

Features

  • Data Loading: Parametrizing tests from dataset files (JSONL) or inline data.
  • Agent Execution: Running agents (ChatAgent or Foundry agent) and caching responses to disk to avoid redundant API calls.
  • Evaluation: Running built-in or custom evaluators (LLM-based or code-based) on the agent's response.
  • Reporting: Aggregating evaluation results into a JSON report and a terminal summary.

Usage

This plugin enables you to evaluate agent responses against datasets using built-in or custom evaluators.

1. Define Your Agent

You can test both local agents (running in your process) and remote agents (hosted in Microsoft Foundry).

Local Agent (ChatAgent)

Test local agent instances built on the agent_framework.ChatAgent class. Use ChatAgentConfig to reference the pytest fixture that provides the initialized agent.

import pytest
from pytest_agent_evals import evals, ChatAgentConfig
from my_app.agents import create_my_agent  # Your source code

@pytest.fixture
def my_agent():
    # Return your initialized agent instance
    return create_my_agent()

@evals.agent(ChatAgentConfig(agent_fixture=my_agent))
class TestMyAgent:
    ...

Remote Agent (Foundry)

Connect to an agent hosted in Foundry using FoundryAgentConfig.

from pytest_agent_evals import evals, FoundryAgentConfig

@evals.agent(FoundryAgentConfig(
    agent_name="my-agent",
    project_endpoint="https://<resource>.services.ai.azure.com/api/projects/<project>"
))
class TestFoundryAgent:
    ...

2. Configure Dataset

Use @evals.dataset to parametrize your test class with data from a JSONL file or an inline list.

@evals.dataset("data.jsonl")
class TestMyAgent:
    ...
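
A JSONL file holds one JSON object per line. The sketch below writes and reads back a small dataset using only the standard library. The field name `query` and the sample rows are assumptions for illustration, since the schema the plugin expects is not spelled out here.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows; the "query" key is an assumed field name, not a documented schema.
rows = [
    {"query": "What is the capital of France?"},
    {"query": "Summarize the refund policy in one sentence."},
]

def write_jsonl(path, records):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Write to a temporary directory so the sketch does not touch your project files.
dataset_path = Path(tempfile.mkdtemp()) / "data.jsonl"
write_jsonl(dataset_path, rows)
loaded = read_jsonl(dataset_path)
```

Reading the file back before running the suite is a cheap way to catch malformed lines early.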

3. Configure Judge Model

Use @evals.judge_model to configure the LLM used for AI-assisted evaluation (e.g., Azure OpenAI).

from pytest_agent_evals import AzureOpenAIModelConfig

@evals.judge_model(AzureOpenAIModelConfig(
    deployment_name="gpt-4.1", 
    endpoint="https://<resource>.openai.azure.com/", 
))
class TestEvaluation:
    ...

4. Define Evaluators

Use @evals.evaluator on your test function to register evaluators that run against the agent's response.

Built-in Evaluators

Use BuiltInEvaluatorConfig to configure built-in evaluators (e.g., coherence, relevance).

from pytest_agent_evals import BuiltInEvaluatorConfig

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestEvaluation:

    @evals.evaluator(BuiltInEvaluatorConfig(name="coherence"))
    def test_quality(self, evaluator_results):
        assert evaluator_results.coherence.result == "pass"

Custom Prompt Evaluators

Use CustomPromptEvaluatorConfig to define your own LLM-based evaluation logic using a Jinja2 template.

from pytest_agent_evals import CustomPromptEvaluatorConfig

friendliness_prompt = """
You are an AI assistant that evaluates the tone of a response.
Score the friendliness of the response on a scale of 1 to 5, where 1 is hostile or rude, and 5 is very friendly and warm.
Provide a brief reason for your score.

### Input:
Response:
{{response}}

You must output your result in the following JSON format:
{
    "result": <integer from 1 to 5>,
    "reason": "<brief explanation>"
}
"""

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestMyCustomPrompts:

    @evals.evaluator(CustomPromptEvaluatorConfig(
        name="friendliness",
        prompt=friendliness_prompt,
        threshold=3
    ))
    def test_friendliness(self, evaluator_results):
        assert evaluator_results.friendliness.result == "pass"
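
To make the role of `threshold` concrete, here is a rough sketch of how a judge's JSON reply could be turned into a pass/fail result. The helper `grade_judge_output` and the rule that a score at or above the threshold passes are assumptions for illustration; the plugin's actual parsing and comparison logic may differ.

```python
import json

def grade_judge_output(raw_output, threshold):
    """Parse the judge's JSON reply and compare its score to the threshold.

    Assumption: a score greater than or equal to the threshold counts as a pass.
    """
    parsed = json.loads(raw_output)
    score = int(parsed["result"])
    return {
        "score": score,
        "reason": parsed.get("reason", ""),
        "result": "pass" if score >= threshold else "fail",
    }

# Hypothetical judge replies in the JSON format requested by the prompt above.
friendly = grade_judge_output('{"result": 4, "reason": "Warm and polite."}', threshold=3)
hostile = grade_judge_output('{"result": 2, "reason": "Curt and dismissive."}', threshold=3)
```

Under this assumed rule, the first reply passes with threshold=3 and the second fails.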

Custom Code Evaluators

Use CustomCodeEvaluatorConfig to execute a Python function for deterministic or rule-based grading.

from pytest_agent_evals import CustomCodeEvaluatorConfig

def length_check(sample, item):
    # Return 1.0 (pass) or 0.0 (fail)
    return 1.0 if len(sample["output_text"]) < 100 else 0.0

@evals.agent(...)
@evals.dataset(...)
class TestMyCodeEvals:

    @evals.evaluator(CustomCodeEvaluatorConfig(
        name="conciseness",
        grader=length_check,
        threshold=0.9
    ))
    def test_conciseness(self, evaluator_results):
        assert evaluator_results.conciseness.result == "pass"
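
Because a grader is a plain Python function, it can be exercised directly before it is wired into a test class. The sketch below calls `length_check` with hand-built inputs. Only the `output_text` key comes from the example above; the rest of the `sample` shape and the contents of `item` are assumptions.

```python
def length_check(sample, item):
    # Return 1.0 (pass) or 0.0 (fail)
    return 1.0 if len(sample["output_text"]) < 100 else 0.0

# Hypothetical inputs: only "output_text" is taken from the example above.
short_sample = {"output_text": "Paris is the capital of France."}
long_sample = {"output_text": "x" * 150}
boundary_sample = {"output_text": "x" * 100}  # exactly 100 characters
item = {"query": "What is the capital of France?"}  # assumed dataset row

short_score = length_check(short_sample, item)
long_score = length_check(long_sample, item)
boundary_score = length_check(boundary_sample, item)
```

The boundary case matters here: the comparison is strict, so a response of exactly 100 characters scores 0.0.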

CLI Options

List Evaluations

Preview the evaluations that will be run, grouped by unique combinations of Agent, Dataset, and Evaluators.

pytest --collect-evals

Cache Management

Control how agent responses are cached during test execution.

# 'session' (default): clears the cache at startup.
# Within a run, every evaluator for a given query shares the same fresh response.
pytest --cache-mode session

# 'persistence': preserves the cache across sessions.
# Skips redundant agent runs, so you can tune evaluators quickly when the agent has not changed.
pytest --cache-mode persistence
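
The difference between the two modes can be pictured with a small disk cache. This is a conceptual sketch, not the plugin's implementation: the `ResponseCache` class, the hashing scheme, the file layout, and `fake_agent` are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path

class ResponseCache:
    """Toy disk cache keyed by a hash of the query (hypothetical design)."""

    def __init__(self, cache_dir, mode="session"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        if mode == "session":
            # Start every run from a clean slate.
            for cached in self.cache_dir.glob("*.json"):
                cached.unlink()

    def _path(self, query):
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get_or_run(self, query, run_agent):
        """Return the cached response, calling run_agent only on a miss."""
        path = self._path(query)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = run_agent(query)
        path.write_text(json.dumps({"response": response}), encoding="utf-8")
        return response

calls = []

def fake_agent(query):
    # Stand-in for a real agent call; records each invocation.
    calls.append(query)
    return f"answer to: {query}"

cache_dir = tempfile.mkdtemp()

# First run: the agent is called once, then a second lookup reuses the response.
first = ResponseCache(cache_dir, mode="session")
first.get_or_run("hello", fake_agent)
first.get_or_run("hello", fake_agent)
calls_after_first_run = len(calls)

# 'persistence' keeps the earlier response, so the agent is not called again.
second = ResponseCache(cache_dir, mode="persistence")
second.get_or_run("hello", fake_agent)
calls_after_persistence_run = len(calls)

# 'session' clears the cache at startup, so the agent runs again.
third = ResponseCache(cache_dir, mode="session")
third.get_or_run("hello", fake_agent)
calls_after_session_run = len(calls)
```

In this toy model, only a new run in session mode triggers another agent call, which is the trade-off the two flags describe.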

Requirements

  • Python 3.10+
  • Visual Studio Code (recommended for running tests from Test Explorer)
  • VS Code AI Toolkit (recommended for visualizing and analyzing evaluation results, submitting evaluations to run in Foundry, etc.)

License

This project is licensed under the Microsoft AI Toolkit – Pytest Agent Evals License Terms.
