Pytest plugin for evaluating AI Agents

Pytest Agent Evals

A pytest plugin for evaluating AI Agents, seamlessly integrated with VS Code Test Explorer and AI Toolkit.

Installation

pip install pytest-agent-evals --pre

Note: This package is currently in beta. The --pre flag is required to install pre-release versions.

Features

Data Loading: Parametrizing tests from dataset files (JSONL) or inline data.
Agent Execution: Running agents (ChatAgent or Foundry agent) and caching responses to disk to avoid redundant API calls.
Evaluation: Running built-in or custom evaluators (LLM-based or code-based) on the agent's response.
Reporting: Aggregating evaluation results into a JSON report and a terminal summary.

Usage

This plugin enables you to evaluate agent responses against datasets using built-in or custom evaluators.

1. Define Your Agent

You can test both local agents (running in your process) and remote agents (hosted in Microsoft Foundry).

Local Agent (ChatAgent)

Test local agent instances built on the agent_framework.ChatAgent class. Use ChatAgentConfig to reference the pytest fixture that provides the initialized agent.

import pytest
from pytest_agent_evals import evals, ChatAgentConfig
from my_app.agents import create_my_agent  # Your source code

@pytest.fixture
def my_agent():
    # Return your initialized agent instance
    return create_my_agent()

@evals.agent(ChatAgentConfig(agent_fixture=my_agent))
class TestMyAgent:
    ...

Remote Agent (Foundry)

Connect to an agent hosted in Foundry using FoundryAgentConfig.

from pytest_agent_evals import evals, FoundryAgentConfig

@evals.agent(FoundryAgentConfig(
    agent_name="my-agent",
    project_endpoint="https://<resource>.services.ai.azure.com/api/projects/<project>"
))
class TestFoundryAgent:
    ...

2. Configure Dataset

Use @evals.dataset to parametrize your test class with data from a JSONL file or an inline list.

@evals.dataset("data.jsonl")
class TestMyAgent:
    ...
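A JSONL dataset holds one JSON object per line. The sketch below writes and reads such a file with the standard library only. The "query" key and the helper names write_jsonl and read_jsonl are illustrative assumptions; this README does not document the field names the plugin expects, so check the plugin's documentation for the actual schema.

```python
import json
from pathlib import Path

# Hypothetical rows: the "query" key is an assumption, not a documented schema.
rows = [
    {"query": "What is the capital of France?"},
    {"query": "Summarize the plot of Hamlet in one sentence."},
]


def write_jsonl(path, records):
    """Write one JSON object per line (the JSONL convention)."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    write_jsonl("data.jsonl", rows)
```

Generating the file from Python rather than by hand avoids quoting mistakes, since json.dumps escapes embedded quotes and newlines in each row.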

3. Configure Judge Model

Use @evals.judge_model to configure the LLM used for AI-assisted evaluation (e.g., Azure OpenAI).

from pytest_agent_evals import AzureOpenAIModelConfig

@evals.judge_model(AzureOpenAIModelConfig(
    deployment_name="gpt-4.1", 
    endpoint="https://<resource>.openai.azure.com/", 
))
class TestEvaluation:
    ...

4. Define Evaluators

Use @evals.evaluator on your test function to register evaluators that run against the agent's response.

Built-in Evaluators

Use BuiltInEvaluatorConfig to configure built-in evaluators (e.g., coherence, relevance).

from pytest_agent_evals import BuiltInEvaluatorConfig

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestEvaluation:

    @evals.evaluator(BuiltInEvaluatorConfig(name="coherence"))
    def test_quality(self, evaluator_results):
        assert evaluator_results.coherence.result == "pass"

Custom Prompt Evaluators

Use CustomPromptEvaluatorConfig to define your own LLM-based evaluation logic using a Jinja2 template.

from pytest_agent_evals import CustomPromptEvaluatorConfig

friendliness_prompt = """
You are an AI assistant that evaluates the tone of a response.
Score the friendliness of the response on a scale of 1 to 5, where 1 is hostile or rude, and 5 is very friendly and warm.
Provide a brief reason for your score.

### Input:
Response:
{{response}}

You must output your result in the following JSON format:
{
    "result": <integer from 1 to 5>,
    "reason": "<brief explanation>"
}
"""

@evals.agent(...)
@evals.dataset(...)
@evals.judge_model(...)
class TestMyCustomPrompts:

    @evals.evaluator(CustomPromptEvaluatorConfig(
        name="friendliness",
        prompt=friendliness_prompt,
        threshold=3
    ))
    def test_friendliness(self, evaluator_results):
        assert evaluator_results.friendliness.result == "pass"
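To make the relationship between the judge's JSON reply and the threshold concrete, here is a minimal standalone sketch. It is not the plugin's implementation: the helper names parse_judge_output and to_verdict are hypothetical, and the rule that a score at or above the threshold counts as a pass is an assumption that this README does not state.

```python
import json


def parse_judge_output(raw):
    """Extract the integer score and the reason from the judge's JSON reply."""
    data = json.loads(raw)
    score = int(data["result"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score, str(data.get("reason", ""))


def to_verdict(score, threshold):
    # Assumption: a score at or above the threshold is treated as a pass.
    return "pass" if score >= threshold else "fail"


raw_reply = '{"result": 4, "reason": "Warm greeting and polite tone."}'
score, reason = parse_judge_output(raw_reply)
print(to_verdict(score, threshold=3), "-", reason)
# prints: pass - Warm greeting and polite tone.
```

Asking the judge for a fixed JSON shape, as the prompt above does, is what makes this kind of parsing and range checking possible.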

Custom Code Evaluators

Use CustomCodeEvaluatorConfig to execute a Python function for deterministic or rule-based grading.

from pytest_agent_evals import CustomCodeEvaluatorConfig

def length_check(sample, item):
    # Return 1.0 (pass) or 0.0 (fail)
    return 1.0 if len(sample["output_text"]) < 100 else 0.0

@evals.agent(...)
@evals.dataset(...)
class TestMyCodeEvals:

    @evals.evaluator(CustomCodeEvaluatorConfig(
        name="conciseness",
        grader=length_check,
        threshold=0.9
    ))
    def test_conciseness(self, evaluator_results):
        assert evaluator_results.conciseness.result == "pass"
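Because a grader is an ordinary Python function, it can be exercised directly before it is wired into a test class. The sketch below shows a second grader that returns a fractional score. It assumes that item is the dataset row as a dict and that rows carry a hypothetical "required_terms" field; neither is documented in this README. Only sample["output_text"] is taken from the example above.

```python
def mentions_required_terms(sample, item):
    """Return the fraction of required terms found in the response text."""
    # "required_terms" is a hypothetical dataset field used for illustration.
    required = item.get("required_terms", [])
    if not required:
        return 1.0  # nothing to check, so treat the row as fully satisfied
    text = sample["output_text"].lower()
    hits = sum(1 for term in required if term.lower() in text)
    return hits / len(required)


# Call the grader directly with hand-built inputs.
sample = {"output_text": "Paris is the capital of France."}
item = {"required_terms": ["Paris", "France"]}
full_score = mentions_required_terms(sample, item)
half_score = mentions_required_terms(sample, {"required_terms": ["Paris", "Berlin"]})
```

A fractional score pairs naturally with a threshold such as 0.9, whereas the 1.0 or 0.0 grader above makes any threshold between 0 and 1 behave identically.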

CLI Options

List Evaluations

Preview the evaluations that will be run, grouped by unique combinations of Agent, Dataset, and Evaluators.

pytest --collect-evals

Cache Management

Control how agent responses are cached during test execution.

# 'session' (default): Clears cache at startup. 
# Ensures consistency by sharing the same fresh response across all evaluators for a query.
pytest --cache-mode session

# 'persistence': Preserves cache across sessions. 
# Avoids redundant agent execution, enabling rapid evaluator tuning without agent changes.
pytest --cache-mode persistence
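The difference between the two modes can be illustrated with a toy on-disk cache. This is a conceptual sketch only: the ResponseCache class, its file layout, and its hashing scheme are hypothetical and are not the plugin's actual cache implementation.

```python
import hashlib
import json
import shutil
from pathlib import Path


class ResponseCache:
    """Toy on-disk cache keyed by a SHA-256 hash of the query text."""

    def __init__(self, root, mode="session"):
        self.root = Path(root)
        if mode == "session" and self.root.exists():
            shutil.rmtree(self.root)  # 'session': start from an empty cache
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, query):
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.json"

    def get_or_run(self, query, run_agent):
        """Return the cached response, calling run_agent only on a miss."""
        path = self._path(query)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        response = run_agent(query)
        path.write_text(json.dumps({"response": response}), encoding="utf-8")
        return response
```

In this sketch, constructing the cache with mode="session" wipes earlier entries, so the agent runs once per query per session, while mode="persistence" reuses whatever is already on disk.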

Requirements

Python 3.10+
Visual Studio Code (recommended for running tests from Test Explorer)
VS Code AI Toolkit (recommended for visualizing and analyzing evaluation results and for submitting evaluations to run in Foundry)

License

This project is licensed under the Microsoft AI Toolkit – Pytest Agent Evals License Terms.

Release history

0.0.1b260313 (pre-release): Mar 13, 2026
0.0.1b260306 (pre-release): Mar 6, 2026
0.0.1b260305 (pre-release): Mar 5, 2026
0.0.1b260210 (pre-release): Feb 10, 2026
0.0.1b260130 (pre-release, this version): Jan 30, 2026
0.0.1b260129 (pre-release): Jan 29, 2026
0.0.1b260128 (pre-release): Jan 28, 2026

Download files

Source Distribution

pytest_agent_evals-0.0.1b260130.tar.gz (32.8 kB)

Uploaded Jan 30, 2026 Source

Built Distribution

pytest_agent_evals-0.0.1b260130-py3-none-any.whl (34.2 kB)

Uploaded Jan 30, 2026 Python 3

File details

Details for the file pytest_agent_evals-0.0.1b260130.tar.gz.

File metadata

Download URL: pytest_agent_evals-0.0.1b260130.tar.gz
Upload date: Jan 30, 2026
Size: 32.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for pytest_agent_evals-0.0.1b260130.tar.gz
Algorithm	Hash digest
SHA256	`2fa5a605cc6c84ad97cd088e08e529d8b2f1c0de2c66a75e832d6abc41b87a57`
MD5	`a6bc3395fa901f96034d5c3c3c70181f`
BLAKE2b-256	`a1cddff231a4be3bce0f8412f60699604cf10745e24b6b6d52b573b564fd9775`

File details

Details for the file pytest_agent_evals-0.0.1b260130-py3-none-any.whl.

File metadata

Download URL: pytest_agent_evals-0.0.1b260130-py3-none-any.whl
Upload date: Jan 30, 2026
Size: 34.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.3

File hashes

Hashes for pytest_agent_evals-0.0.1b260130-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ff8c3cb6148d78780386204380f9da30e6f29a36796715804c0933486e586043`
MD5	`e399601f4b7a958e2272b5cd1ce42030`
BLAKE2b-256	`452962165a985b7ca2729ca86c04f6ad5f9354ea99589981682c38285b908672`
