ACP Evals
Production-grade evaluation framework for Agent Communication Protocol (ACP) agents.
ACP Evals is an open framework for evaluating AI agents across accuracy, performance, and reliability dimensions.
Modern AI agents need comprehensive testing before deployment. ACP Evals provides production-grade evaluation using LLM-as-judge methodology, designed to integrate seamlessly with the BeeAI ecosystem and any ACP-compliant agent.
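Conceptually, an LLM-as-judge evaluation sends the agent's answer, the expected answer, and a rubric to a separate grading model and parses a score from its reply. The sketch below illustrates the idea only; it is not the acp-evals implementation, and the judge model and prompt wording are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, answer: str, expected: str) -> float:
    """Ask a grading model for a 0.0-1.0 accuracy score (illustrative only)."""
    prompt = (
        f"Question: {question}\n"
        f"Expected answer: {expected}\n"
        f"Agent answer: {answer}\n"
        "Score the agent answer's factual accuracy from 0.0 to 1.0. "
        "Reply with the number only."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        messages=[{"role": "user", "content": prompt}],
    )
    return float(reply.choices[0].message.content.strip())
```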
ACP Evals enables you to:
- Measure response accuracy using configurable LLM judges
- Track performance metrics including latency and memory usage
- Validate tool usage patterns and error handling
- Run batch evaluations for comprehensive test coverage
- Generate detailed reports for continuous improvement
Core Concepts
| Concept | Description |
|---|---|
| Accuracy | Evaluates response quality against expected outputs using LLM-as-judge methodology. Supports custom rubrics for domain-specific evaluation. |
| Performance | Measures latency, memory usage, and token efficiency. Essential for production deployments where speed and resource constraints matter. |
| Reliability | Validates tool usage patterns, error handling, and consistency across runs. Critical for agents that interact with external systems. |
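The Accuracy row mentions custom rubrics for domain-specific evaluation. A minimal sketch, assuming a dict-based rubric with named, weighted criteria; the exact schema shown here is an assumption, so consult the API reference for the confirmed format:

```python
from acp_evals import AccuracyEval

# Hypothetical rubric structure: the "criteria"/"weight" keys are an
# illustrative assumption, not a confirmed acp-evals schema.
support_rubric = {
    "criteria": [
        {"name": "correctness", "description": "Facts match the knowledge base", "weight": 0.7},
        {"name": "tone", "description": "Polite, concise, on-brand wording", "weight": 0.3},
    ],
}

evaluation = AccuracyEval(
    agent="http://localhost:8001/agents/my-agent",
    rubric=support_rubric,
)
```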
Quick Example
Evaluate agent accuracy with just a few lines:

```python
from acp_evals import AccuracyEval

evaluation = AccuracyEval(
    agent="http://localhost:8001/agents/my-agent",
    rubric="factual",
)

result = await evaluation.run(
    input="What is 10*5, then to the power of 2? Do it step by step.",
    expected="2500",
    print_results=True,
)
assert result.score >= 0.7
```
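The example uses top-level await, which works in notebooks and the asyncio REPL. In a standalone script, wrap the call with asyncio.run; a minimal sketch reusing the same API:

```python
import asyncio

from acp_evals import AccuracyEval

async def main() -> None:
    evaluation = AccuracyEval(
        agent="http://localhost:8001/agents/my-agent",
        rubric="factual",
    )
    result = await evaluation.run(
        input="What is 2+2?",
        expected="4",
        print_results=True,
    )
    assert result.score >= 0.7

asyncio.run(main())
```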
Core Features
- Comprehensive Evaluation - Run all three evaluation dimensions in a single command
- Rich TUI Display - Interactive terminal UI with detailed metrics and LLM judge explanations
- Batch Testing - Evaluate multiple test cases with parallel execution
- Multiple Provider Support - Works with OpenAI, Anthropic, Ollama, and more
- Export Capabilities - Generate JSON reports for CI/CD integration
Installation

```bash
pip install acp-evals
```
Quickstart
1. Configure your LLM provider

```bash
echo "OPENAI_API_KEY=your-key-here" > .env
acp-evals check
```

2. Run your first evaluation

```bash
# Test accuracy
acp-evals run accuracy http://localhost:8001/agents/my-agent \
  -i "What is 2+2?" -e "4"
```

3. Run a comprehensive evaluation

```bash
acp-evals comprehensive http://localhost:8001/agents/my-agent \
  -i "Calculate compound interest" -e "Detailed calculation"
```
Examples
Performance Evaluation

```python
from acp_evals import PerformanceEval

evaluation = PerformanceEval(
    agent="http://localhost:8001/agents/my-agent",
    num_iterations=5,
    track_memory=True,
)

result = await evaluation.run(
    input_text="What is the capital of France?",
    print_results=True,
)
```
Reliability Evaluation

```python
from acp_evals import ReliabilityEval

evaluation = ReliabilityEval(
    agent="http://localhost:8001/agents/my-agent",
    tool_definitions=["search", "calculator"],
)

result = await evaluation.run(
    input="Search for AAPL price and calculate P/E ratio",
    expected_tools=["search", "calculator"],
    print_results=True,
)
assert result.passed
```
Agent Formats
ACP Evals works with any agent implementation:
- ACP-compliant agents: http://localhost:8001/agents/my-agent
- Python functions: agent.py:function_name (sketch below)
- Python modules: mymodule.agent_function
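For the function and module formats, the target can be a small Python callable. A minimal sketch, assuming acp-evals invokes the function with the input text and treats the returned string as the agent's response (the exact expected signature is an assumption):

```python
# agent.py -- minimal function-style agent for local smoke tests.
async def echo_agent(input_text: str) -> str:
    # Assumption: acp-evals passes the evaluation input as a string
    # and uses the returned string as the agent's answer.
    return f"You said: {input_text}"
```

This could then be referenced from the CLI as agent.py:echo_agent.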
CLI Reference

```bash
# Check setup
acp-evals check

# Run evaluations
acp-evals run accuracy <agent> -i <input> -e <expected>
acp-evals run performance <agent> -i <input>
acp-evals run reliability <agent> -i <input> --expected-tools <tool>

# Comprehensive testing
acp-evals comprehensive <agent> -i <input> -e <expected>

# Batch testing (see the tests.jsonl example below)
acp-evals run accuracy <agent> --test-file tests.jsonl
```
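For batch testing, each line of the test file holds one case. A plausible tests.jsonl, assuming the field names mirror the -i and -e flags (the input/expected keys are an assumption):

```jsonl
{"input": "What is 2+2?", "expected": "4"}
{"input": "What is the capital of France?", "expected": "Paris"}
{"input": "What is the square root of 144?", "expected": "12"}
```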
Resources
- Documentation - API reference and guides
- Examples - Ready-to-run code samples
- Issues - Report bugs or request features
License
Apache 2.0 - see LICENSE
Developed by contributors to the BeeAI project, this initiative is part of the Linux Foundation AI & Data program. Its development follows open, collaborative, and community-driven practices.
Project details
Download files
Source Distribution
File details
Details for the file acp_evals-1.0.0.tar.gz.
File metadata
- Download URL: acp_evals-1.0.0.tar.gz
- Size: 98.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0eb16f134c485f53639eabe359d609b4c36a11b3a9f5e89b72180de0c90a40dc |
| MD5 | 7ef5cc6da1281ed82f62fe32deeae155 |
| BLAKE2b-256 | 2a5c327507d3f77b0bd0360551edfd6251fa824f6bb8ab7b119e70a07cf98433 |
Built Distribution
File details
Details for the file acp_evals-1.0.0-py3-none-any.whl.
File metadata
- Download URL: acp_evals-1.0.0-py3-none-any.whl
- Size: 82.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dca995be483c72fb13c3e8bb27d47043b0bc0ab14609a04d0f338b283c0470e1 |
| MD5 | 03a5b9d2e8774c3ba3fc65ce678cd7cb |
| BLAKE2b-256 | 3fedf7108cb6644a976a717f449900bf08db82b7b17624c5bbe85997925f7258 |