Comprehensive evaluation framework for Agent Communication Protocol (ACP) agents
ACP Evals
Production-ready evaluation framework for multi-agent systems in the ACP/BeeAI ecosystem
Overview
ACP Evals is an evaluation framework for multi-agent systems built on the Agent Communication Protocol. Evaluation frameworks measure the quality, performance, and safety of AI agent outputs through automated scoring methods. In production agent systems, these measurements become critical for ensuring reliability, detecting regressions, and optimizing performance at scale.
ACP Evals specializes in the unique challenges of coordinated agent systems. The framework measures how well agents collaborate, preserve information across handoffs, and maintain workflow coherence under production conditions.
Getting Started
The quickest way to understand ACP Evals is through the basic evaluation workflow. Install the framework, configure your LLM provider, and run your first evaluation to establish the fundamental pattern.
Installation
```bash
# Basic installation
pip install acp-evals

# Development installation with all providers
cd python/
pip install -e ".[dev,all-providers]"
```
Provider Configuration
Create a .env file in your project root:
```bash
# Copy the example configuration
cp python/.env.example python/.env
```

Then add your API keys to `.env`:

```ini
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
OLLAMA_BASE_URL=http://localhost:11434
```
Basic Evaluation
```python
from acp_evals import evaluate, AccuracyEval, HandoffEval

# Evaluate any ACP agent with three lines
result = evaluate(
    AccuracyEval(agent="http://localhost:8000/agents/my-agent"),
    input="What is the capital of France?",
    expected="Paris",
)
print(f"Score: {result.score:.2f}, Cost: ${result.cost:.4f}")

# Works with any provider out of the box
evaluator = AccuracyEval(agent=my_agent, provider="anthropic")  # or "openai", "ollama"

# Multi-agent coordination (unique to ACP Evals)
result = HandoffEval(agents={"researcher": url1, "writer": url2}).run(task)
```
This pattern extends to all evaluation types. Replace AccuracyEval with PerformanceEval, SafetyEval, or ReliabilityEval to measure different aspects of agent behavior.
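The evaluate, compare, and score loop that all of these evaluators share can be sketched without the framework. The toy scorer below is illustrative only (none of these names come from acp-evals); real evaluators substitute an LLM judge for the string comparison:

```python
from dataclasses import dataclass


@dataclass
class ToyResult:
    score: float   # 0.0-1.0, mirroring the framework's result.score
    passed: bool


def toy_evaluate(agent_fn, input_text, expected, threshold=0.5):
    """Run the agent, compare output to the expectation, emit a score.

    Real evaluators replace the exact-match check with an LLM-as-judge
    rubric; this sketch only checks for substring containment.
    """
    output = agent_fn(input_text)
    score = 1.0 if expected.lower() in output.lower() else 0.0
    return ToyResult(score=score, passed=score >= threshold)


# A stand-in "agent" that always answers correctly
result = toy_evaluate(lambda q: "The capital of France is Paris.",
                      "What is the capital of France?", "Paris")
print(result.score, result.passed)  # → 1.0 True
```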
System Architecture
```mermaid
graph TB
    A[ACP Agent] --> B[ACP Evals Framework]
    B --> C[Developer API<br/>Accuracy and Performance Evaluation]
    B --> D[Multi-Agent Evaluators<br/>Communication Patterns and Framework Integrity]
    B --> E[Production Features<br/>Trace Recycling and Continuous Evaluation]
    F[LLM Providers<br/>OpenAI, Anthropic, Ollama] --> B
    B --> G[Results<br/>Scores, Costs, and Analytics]
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
    style F fill:#f1f8e9
    style G fill:#e0f2f1
```
Core Evaluation Capabilities
Quick Start: Begin with 3-line evaluations, then scale to enterprise multi-agent benchmarks
Quality & Performance Evaluators
- AccuracyEval: LLM-as-judge with customizable rubrics (factual, research, code quality)
- GroundednessEvaluator: Context-grounded response validation
- RetrievalEvaluator: Information retrieval quality assessment
- DocumentRetrievalEvaluator: Full IR metrics (precision, recall, NDCG, MAP, MRR)
- PerformanceEval: Token usage, latency, and cost tracking across providers
Multi-Agent Specialized Metrics
Industry First: An evaluation framework built specifically for multi-agent coordination
- Handoff Quality: Information preservation across agent transitions
- Coordination Patterns: LinearPattern, SupervisorPattern, SwarmPattern evaluation
- Context Maintenance: Cross-agent context analysis and noise detection
- Decision Preservation: Agent-to-agent decision quality tracking
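The intuition behind handoff quality is measurable as the fraction of key information that survives an agent transition. A minimal sketch using content-word overlap (this is an illustrative stand-in, not the framework's actual metric, which is LLM-based):

```python
def handoff_preservation(before: str, after: str) -> float:
    """Fraction of the upstream agent's content words that appear in
    the downstream agent's message. Illustrative only: the framework's
    handoff metrics use richer, LLM-based analysis, not set overlap."""
    stop = {"the", "a", "an", "in", "is", "of", "to", "and", "via"}

    def content_words(text):
        return {w.strip(".,!?") for w in text.lower().split()} - stop

    src, dst = content_words(before), content_words(after)
    return len(src & dst) / len(src) if src else 1.0


researcher_msg = "Quantum error correction improved qubit coherence in 2024"
writer_msg = "In 2024, qubit coherence improved via quantum error correction."
print(f"{handoff_preservation(researcher_msg, writer_msg):.2f}")  # → 1.00
```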
Risk & Safety Evaluators
- SafetyEval: Composite safety and bias detection
- Adversarial Testing: Real-world attack pattern resistance (prompt injection, jailbreaks)
- ReliabilityEval: Tool usage validation and error handling assessment
Quick Start
⚡ Zero to Evaluation: Get comprehensive agent metrics in under 60 seconds
```python
from acp_evals import evaluate, AccuracyEval

# Evaluate any ACP agent in 3 lines
result = evaluate(
    AccuracyEval(agent="http://localhost:8000/agents/research-agent"),
    input="What are the latest developments in quantum computing?",
    expected="Recent quantum computing advances include...",
)
print(f"Score: {result.score}, Cost: ${result.cost}")
```
Multi-Agent Evaluation
Coordination Testing: Measure how well agents work together, not just individually
```python
from acp_evals.benchmarks import HandoffBenchmark
from acp_evals.patterns import LinearPattern

# Evaluate agent coordination
benchmark = HandoffBenchmark(
    pattern=LinearPattern(["researcher", "analyzer", "synthesizer"]),
    tasks="research_quality",
    endpoint="http://localhost:8000",
)

# run_batch is a coroutine; await it from an async context
results = await benchmark.run_batch(
    test_data="multi_agent_tasks.jsonl",
    parallel=True,
    export="coordination_results.json",
)
```
Advanced Features
Production Integration
Built for real-world deployment monitoring
- Trace Recycling: Convert production telemetry to evaluation datasets (example)
- Continuous Evaluation: Automated regression detection and baseline tracking (docs)
- OpenTelemetry Export: Real-time metrics to Jaeger, Phoenix, and observability platforms
- Cost Optimization: Multi-provider cost comparison and budget alerts
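The core idea of trace recycling is reshaping interactions you already log into evaluation cases. A minimal sketch, assuming telemetry records are plain dicts with `prompt`, `response`, and `user_rating` fields (these names are assumptions for illustration; the framework's trace_recycler works on real OpenTelemetry spans):

```python
import json


def recycle_traces(traces, min_rating=0.8):
    """Keep only well-rated production interactions and reshape them
    into {input, expected} evaluation cases. Field names here are
    hypothetical; adapt them to your telemetry schema."""
    cases = []
    for t in traces:
        if t.get("user_rating", 0.0) >= min_rating:
            cases.append({"input": t["prompt"], "expected": t["response"]})
    return cases


traces = [
    {"prompt": "Define RAG", "response": "Retrieval-augmented generation...", "user_rating": 0.9},
    {"prompt": "Bad interaction", "response": "...", "user_rating": 0.2},
]
dataset = recycle_traces(traces)
print(json.dumps(dataset[0]))
```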
Adversarial & Robustness Testing
Test against real-world attack patterns, not academic examples
- Real-World Attack Patterns: Prompt injection, context manipulation, data extraction (example)
- Edge Case Generation: Synthetic adversarial scenario creation
Dataset & Benchmarking
Gold standard datasets to custom synthetic data
- Gold Standard Datasets: Production-realistic multi-step agent tasks
- External Integration: TRAIL, GAIA, SWE-Bench benchmark support
- Custom Dataset Loaders: Flexible evaluation data management
- Synthetic Data Generation: Automated test case creation
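At its simplest, synthetic test-case creation is templating over a seed fact table. A toy sketch of the idea (the framework's simulator is far richer; the names below are illustrative):

```python
# Seed facts to expand into question/answer evaluation cases
FACTS = [
    ("France", "Paris"),
    ("Japan", "Tokyo"),
    ("Kenya", "Nairobi"),
]


def generate_cases(facts):
    """Expand each fact through multiple question templates,
    yielding {input, expected} cases ready for batch evaluation."""
    templates = [
        "What is the capital of {country}?",
        "Name the capital city of {country}.",
    ]
    for country, capital in facts:
        for template in templates:
            yield {"input": template.format(country=country), "expected": capital}


cases = list(generate_cases(FACTS))
print(len(cases))  # → 6
```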
Supported Providers & Models
**Provider Flexibility**: Test locally with Ollama or scale with cloud providers
| Provider | Models | Cost Tracking |
|---|---|---|
| OpenAI | GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, o4-mini | ✅ |
| Anthropic | Claude-4-Opus, Claude-4-Sonnet | ✅ |
| Ollama | granite3.3:8b, qwen3:30b-a3b, custom | ✅ |
| Mock Mode | Simulated responses | ✅ |
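Per-provider cost tracking reduces to multiplying token counts by per-token prices. A sketch of the arithmetic; the prices below are placeholders, not current vendor rates:

```python
# Hypothetical per-1K-token prices in USD. These are placeholders for
# illustration only; check your provider's current pricing.
PRICES = {
    "openai": {"input": 0.002, "output": 0.008},
    "anthropic": {"input": 0.003, "output": 0.015},
    "ollama": {"input": 0.0, "output": 0.0},  # local models cost nothing per token
}


def estimate_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one call, given token counts and per-1K prices."""
    p = PRICES[provider]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1000


cost = estimate_cost("openai", input_tokens=1200, output_tokens=400)
print(f"${cost:.4f}")  # → $0.0056
```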
Native ACP/BeeAI Integration
🔗 Ecosystem Native: Purpose-built for the ACP/BeeAI stack
- ACP Message Handling: Native support for ACP communication patterns (example)
- BeeAI Agent Instances: Direct integration with BeeAI Framework agents
- Workflow Evaluation: Built-in support for BeeAI multi-agent workflows
- Event Stream Analysis: Real-time evaluation of agent interactions
Documentation & Examples
| Resource | Description |
|---|---|
| 📚 Architecture Guide | Framework design and components |
| 🚀 Setup Guide | Installation and configuration |
| 🔌 Provider Setup | LLM provider configuration |
| 💡 Examples | 13 comprehensive usage examples |
Quick Start Examples
Essential (Start Here):
- 00_minimal_example.py: 3-line agent evaluation
- 01_quickstart_accuracy.py: Basic accuracy assessment
- 02_multi_agent_evaluation.py: Agent coordination testing
Production Integration:
- 04_continuous_evaluation.py: CI/CD monitoring pipeline
- 12_end_to_end_trace_pipeline.py: Production trace recycling
- 09_real_acp_agents.py: Live ACP agent integration
Advanced:
- 07_adversarial_testing.py: Security robustness evaluation
- 13_synthetic_data_generation.py: Custom dataset creation
Batch Evaluation and Automation
Production agent systems require automated evaluation workflows that can process multiple test cases, generate comprehensive reports, and integrate with continuous integration systems.
Batch Processing
```python
# Evaluate multiple test cases from a dataset
results = AccuracyEval(agent=my_agent).run_batch(
    test_data="test_cases.jsonl",
    parallel=True,
    progress=True,
    export="results.json",
)
print(f"Pass rate: {results.pass_rate}%, Average score: {results.avg_score:.2f}")
```
The JSONL format expects each line to contain a JSON object with `input` and `expected` fields:

```jsonl
{"input": "What is machine learning?", "expected": "Machine learning is a method of data analysis..."}
{"input": "Explain neural networks", "expected": "Neural networks are computing systems inspired by..."}
```
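Files in this format can be generated and validated with the standard library alone, which is handy for building datasets programmatically before a batch run:

```python
import json
import os
import tempfile

cases = [
    {"input": "What is machine learning?",
     "expected": "Machine learning is a method of data analysis..."},
    {"input": "Explain neural networks",
     "expected": "Neural networks are computing systems inspired by..."},
]

# Write one JSON object per line (the JSONL convention)
path = os.path.join(tempfile.mkdtemp(), "test_cases.jsonl")
with open(path, "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

# Validate: every line must parse and carry both required fields
with open(path) as f:
    loaded = [json.loads(line) for line in f]
assert all({"input", "expected"} <= case.keys() for case in loaded)
print(f"{len(loaded)} valid cases")  # → 2 valid cases
```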
CI/CD Integration
```python
# Integrate with pytest or other testing frameworks
def test_agent_accuracy():
    evaluator = AccuracyEval(agent=my_agent, mock_mode=CI_ENV)
    result = evaluator.run(
        input="Test question for CI",
        expected="Expected answer",
    )
    assert result.score > 0.8, f"Agent scored {result.score}, below threshold"
```
Understanding Results
Evaluation results follow a consistent structure across all evaluator types. Understanding result interpretation enables effective debugging and optimization workflows.
Result Structure
```python
# All evaluators return results with this structure
result = evaluator.run(input="test", expected="expected")
print(f"Score: {result.score}")           # float, 0.0-1.0
print(f"Passed: {result.passed}")         # boolean pass/fail
print(f"Cost: ${result.cost:.4f}")        # USD cost
print(f"Tokens: {result.tokens}")         # token usage breakdown
print(f"Latency: {result.latency_ms}ms")  # response time
print(f"Details: {result.details}")       # evaluator-specific metrics
```
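Given a list of such results, batch statistics like pass rate and average score are straightforward aggregates. A sketch using a stand-in result type whose field names mirror the structure above (the real `run_batch` computes these for you):

```python
from dataclasses import dataclass


@dataclass
class Result:  # stand-in mirroring the fields shown above
    score: float
    passed: bool
    cost: float


def summarize(results):
    """Compute the batch-level numbers a run_batch report contains."""
    n = len(results)
    return {
        "pass_rate": 100.0 * sum(r.passed for r in results) / n,
        "avg_score": sum(r.score for r in results) / n,
        "total_cost": sum(r.cost for r in results),
    }


batch = [Result(0.9, True, 0.01), Result(0.6, False, 0.02), Result(0.8, True, 0.01)]
stats = summarize(batch)
print(f"Pass rate: {stats['pass_rate']:.0f}%, Average score: {stats['avg_score']:.2f}")
# → Pass rate: 67%, Average score: 0.77
```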
Debugging Failed Evaluations
When evaluations fail or score lower than expected, the `details` field provides specific feedback:

```python
result = AccuracyEval(agent=my_agent).run(
    input="Complex technical question",
    expected="Technical answer",
)
if result.score < 0.7:
    print("Evaluation feedback:")
    print(result.details.get("judge_reasoning", "No reasoning provided"))
    print(f"Specific issues: {result.details.get('issues', [])}")
```
Troubleshooting
Common issues and solutions for evaluation setup and execution.
Provider Configuration Issues
If evaluations fail with authentication errors, verify your provider configuration:
```python
# Test provider connectivity
from acp_evals.providers.factory import ProviderFactory

provider = ProviderFactory.get_provider("openai")  # or "anthropic", "ollama"
print(f"Provider status: {provider.health_check()}")
```
Agent Connection Problems
For ACP agent connectivity issues:
```python
import httpx

# Test agent health before evaluation
async def test_agent_health(agent_url):
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{agent_url}/health")
        return response.status_code == 200

# Use in evaluations
agent_url = "http://localhost:8000/agents/my-agent"
if await test_agent_health(agent_url):
    result = evaluate(AccuracyEval(agent=agent_url), input, expected)
```
Performance Optimization
For large-scale evaluations:
```python
# Use batch processing with parallelization
results = AccuracyEval(agent=my_agent).run_batch(
    test_data="large_dataset.jsonl",
    parallel=True,
    batch_size=10,  # process 10 at a time
    max_workers=4,  # limit concurrent evaluations
)
```
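The chunking and concurrency strategy behind `batch_size` and `max_workers` can be sketched with the standard library. The `evaluate_one` stub below stands in for a real, network-bound evaluator call:

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_one(case):
    # Stub standing in for a real (network-bound) evaluator call
    return {"input": case["input"], "score": 1.0}


def run_batch(cases, batch_size=10, max_workers=4):
    """Process cases in chunks of batch_size, with at most
    max_workers evaluations in flight at once. Results come back
    in input order because pool.map preserves ordering."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(cases), batch_size):
            chunk = cases[start:start + batch_size]
            results.extend(pool.map(evaluate_one, chunk))
    return results


cases = [{"input": f"q{i}"} for i in range(25)]
results = run_batch(cases)
print(len(results))  # → 25
```

Threads suit this workload because evaluator calls are I/O-bound; a process pool would only add serialization overhead.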
Project Structure
```
acp-evals/
├── python/                      # Core Python implementation
│   ├── src/acp_evals/
│   │   ├── api.py               # Simple developer API
│   │   ├── evaluators/          # Built-in evaluators
│   │   │   ├── accuracy.py      # LLM-as-judge evaluation
│   │   │   ├── groundedness.py  # Context grounding assessment
│   │   │   ├── retrieval.py     # Information retrieval metrics
│   │   │   └── safety.py        # Safety and bias detection
│   │   ├── benchmarks/          # Multi-agent benchmarking
│   │   │   ├── datasets/        # Gold standard & adversarial data
│   │   │   │   ├── gold_standard_datasets.py
│   │   │   │   ├── adversarial_datasets.py
│   │   │   │   └── trace_recycler.py
│   │   │   └── multi_agent/     # Agent coordination benchmarks
│   │   ├── patterns/            # Agent architecture patterns
│   │   │   ├── linear.py        # Sequential execution
│   │   │   ├── supervisor.py    # Centralized coordination
│   │   │   └── swarm.py         # Distributed collaboration
│   │   ├── providers/           # LLM provider abstractions
│   │   │   ├── openai.py        # OpenAI integration
│   │   │   ├── anthropic.py     # Anthropic integration
│   │   │   └── ollama.py        # Local model support
│   │   ├── evaluation/          # Advanced evaluation features
│   │   │   ├── continuous.py    # Continuous eval pipeline
│   │   │   └── simulator.py     # Synthetic data generation
│   │   ├── telemetry/           # Observability integration
│   │   │   └── otel_exporter.py # OpenTelemetry export
│   │   └── cli.py               # Command-line interface
│   ├── tests/                   # Comprehensive test suite
│   ├── examples/                # Usage examples (13 files)
│   └── docs/                    # Architecture & setup guides
```
Contributing
The framework is designed for extensibility:
- New Evaluators: Add custom evaluation logic in `evaluators/`
- Provider Support: Extend `providers/` for new LLM providers
- Coordination Patterns: Implement new multi-agent patterns in `patterns/`
- Dataset Integration: Add external benchmarks in `benchmarks/datasets/`
See our contribution guide for detailed guidance.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Part of the BeeAI project, an initiative of the Linux Foundation AI & Data.
File details
Details for the file acp_evals-0.1.1.tar.gz.
File metadata
- Download URL: acp_evals-0.1.1.tar.gz
- Upload date:
- Size: 206.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `34ea055f999b5abaf07442e6bbf3333c07ab922377bd1a3ba0a00d66fbe3fcdc` |
| MD5 | `08816db7854082a64fa2396b1df8701a` |
| BLAKE2b-256 | `c8ce33ba285843e771e70037ad612f7b0224ad9d14cad3fc8d8dc8b3b94f82f0` |
File details
Details for the file acp_evals-0.1.1-py3-none-any.whl.
File metadata
- Download URL: acp_evals-0.1.1-py3-none-any.whl
- Upload date:
- Size: 165.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `1e43ea734e5d81358dd565df3fd7626f68825541aa354a6f7563721b092b7349` |
| MD5 | `05753379bad046a3c08bb7d9c472735a` |
| BLAKE2b-256 | `fd81efbf2498956a90f4453e506292ac136d1096043e444802fdc5a45cfcd698` |