ACP Evals
Production-grade evaluation framework for Agent Communication Protocol (ACP) agents.
ACP Evals is an open framework for evaluating AI agents across accuracy, performance, and reliability dimensions.
Modern AI agents need comprehensive testing before deployment. ACP Evals provides production-grade evaluation using LLM-as-judge methodology, designed to integrate seamlessly with the BeeAI ecosystem and any ACP-compliant agent.
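Conceptually, an LLM-as-judge evaluation sends the agent's answer, the expected answer, and a rubric to a separate grading model and parses a score from its reply. The sketch below illustrates the idea only; it is not the acp-evals implementation, and the judge model and prompt wording are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str, expected: str) -> float:
    """Ask a grading model for a 0.0-1.0 accuracy score (illustrative only)."""
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Agent answer: {answer}\n"
        "Score the agent answer's factual accuracy from 0.0 to 1.0. "
        "Reply with the number only."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return float(reply.choices[0].message.content.strip())
```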
ACP Evals enables you to:
- Measure response accuracy using configurable LLM judges
- Track performance metrics including latency and memory usage
- Validate tool usage patterns and error handling
- Run batch evaluations for comprehensive test coverage
- Generate detailed reports for continuous improvement
Core Concepts
| Concept | Description |
|---|---|
| Accuracy | Evaluates response quality against expected outputs using LLM-as-judge methodology. Supports custom rubrics for domain-specific evaluation. |
| Performance | Measures latency, memory usage, and token efficiency. Essential for production deployments where speed and resource constraints matter. |
| Reliability | Validates tool usage patterns, error handling, and consistency across runs. Critical for agents that interact with external systems. |
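The Accuracy row mentions custom rubrics for domain-specific evaluation. A minimal sketch, assuming a dict-based rubric with named, weighted criteria; the exact schema shown here is an assumption, so consult the API reference for the confirmed format:

```python
from acp_evals import AccuracyEval

# Hypothetical rubric structure: the "criteria"/"weight" keys are an
# illustrative assumption, not a confirmed acp-evals schema.
support_rubric = {
    "criteria": [
        {"name": "correctness", "description": "Facts match the knowledge base", "weight": 0.7},
        {"name": "tone", "description": "Polite, concise, on-brand wording", "weight": 0.3},
    ],
}

evaluation = AccuracyEval(
    agent="http://localhost:8001/agents/my-agent",
    rubric=support_rubric,
)
```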
Quick Example
Evaluate agent accuracy with just a few lines:

```python
from acp_evals import AccuracyEval

evaluation = AccuracyEval(
    agent="http://localhost:8001/agents/my-agent",
    rubric="factual",
)

result = await evaluation.run(
    input="What is 10*5, then to the power of 2? Do it step by step.",
    expected="2500",
    print_results=True,
)
assert result.score >= 0.7
```
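The example uses top-level await, which works in notebooks and the asyncio REPL. In a standalone script, wrap the call with asyncio.run; a minimal sketch reusing the same API:

```python
import asyncio

from acp_evals import AccuracyEval

async def main() -> None:
    evaluation = AccuracyEval(
        agent="http://localhost:8001/agents/my-agent",
        rubric="factual",
    )
    result = await evaluation.run(
        input="What is 2+2?",
        expected="4",
        print_results=True,
    )
    assert result.score >= 0.7

asyncio.run(main())
```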
Core Features
- Comprehensive Evaluation - Run all three evaluation dimensions in a single command
- Rich TUI Display - Interactive terminal UI with detailed metrics and LLM judge explanations
- Batch Testing - Evaluate multiple test cases with parallel execution
- Multiple Provider Support - Works with OpenAI, Anthropic, Ollama, and more
- Export Capabilities - Generate JSON reports for CI/CD integration
Installation

```bash
pip install acp-evals
```
Quickstart
1. Configure your LLM provider

```bash
echo "OPENAI_API_KEY=your-key-here" > .env
acp-evals check
```

2. Run your first evaluation

```bash
# Test accuracy
acp-evals run accuracy http://localhost:8001/agents/my-agent \
  -i "What is 2+2?" -e "4"
```

3. Run a comprehensive evaluation

```bash
acp-evals comprehensive http://localhost:8001/agents/my-agent \
  -i "Calculate compound interest" -e "Detailed calculation"
```
Examples
Performance Evaluation

```python
from acp_evals import PerformanceEval

evaluation = PerformanceEval(
    agent="http://localhost:8001/agents/my-agent",
    num_iterations=5,
    track_memory=True,
)

result = await evaluation.run(
    input_text="What is the capital of France?",
    print_results=True,
)
```
Reliability Evaluation

```python
from acp_evals import ReliabilityEval

evaluation = ReliabilityEval(
    agent="http://localhost:8001/agents/my-agent",
    tool_definitions=["search", "calculator"],
)

result = await evaluation.run(
    input="Search for AAPL price and calculate P/E ratio",
    expected_tools=["search", "calculator"],
    print_results=True,
)
assert result.passed
```
Agent Formats
ACP Evals works with any agent implementation:
- ACP-compliant agents: http://localhost:8001/agents/my-agent
- Python functions: agent.py:function_name (sketch below)
- Python modules: mymodule.agent_function
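For the function and module formats, the target can be a small Python callable. A minimal sketch, assuming acp-evals invokes the function with the input text and treats the returned string as the agent's response (the exact expected signature is an assumption):

```python
# agent.py -- minimal function-style agent for local smoke tests.
async def echo_agent(input_text: str) -> str:
    # Assumption: acp-evals passes the evaluation input as a string
    # and uses the returned string as the agent's answer.
    return f"You said: {input_text}"
```

This could then be referenced from the CLI as agent.py:echo_agent.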
CLI Reference

```bash
# Check setup
acp-evals check

# Run evaluations
acp-evals run accuracy <agent> -i <input> -e <expected>
acp-evals run performance <agent> -i <input>
acp-evals run reliability <agent> -i <input> --expected-tools <tool>

# Comprehensive testing
acp-evals comprehensive <agent> -i <input> -e <expected>

# Batch testing (see the tests.jsonl example below)
acp-evals run accuracy <agent> --test-file tests.jsonl
```
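For batch testing, each line of the test file holds one case. A plausible tests.jsonl, assuming the field names mirror the -i and -e flags (the input/expected keys are an assumption):

```jsonl
{"input": "What is 2+2?", "expected": "4"}
{"input": "What is the capital of France?", "expected": "Paris"}
{"input": "What is the square root of 144?", "expected": "12"}
```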
Resources
- Documentation - API reference and guides
- Examples - Ready-to-run code samples
- Issues - Report bugs or request features
License
Apache 2.0 - see LICENSE
Developed by contributors to the BeeAI project, this initiative is part of the Linux Foundation AI & Data program. Its development follows open, collaborative, and community-driven practices.
Project details
Download files
Source Distribution
File details
Details for the file acp_evals-1.0.0.tar.gz.
File metadata
- Download URL: acp_evals-1.0.0.tar.gz
- Size: 98.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0eb16f134c485f53639eabe359d609b4c36a11b3a9f5e89b72180de0c90a40dc |
| MD5 | 7ef5cc6da1281ed82f62fe32deeae155 |
| BLAKE2b-256 | 2a5c327507d3f77b0bd0360551edfd6251fa824f6bb8ab7b119e70a07cf98433 |
Built Distribution
File details
Details for the file acp_evals-1.0.0-py3-none-any.whl.
File metadata
- Download URL: acp_evals-1.0.0-py3-none-any.whl
- Size: 82.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dca995be483c72fb13c3e8bb27d47043b0bc0ab14609a04d0f338b283c0470e1 |
| MD5 | 03a5b9d2e8774c3ba3fc65ce678cd7cb |
| BLAKE2b-256 | 3fedf7108cb6644a976a717f449900bf08db82b7b17624c5bbe85997925f7258 |