
Comprehensive evaluation framework for Agent Communication Protocol (ACP) agents


ACP Evals

Production-ready evaluation framework for multi-agent systems in the ACP/BeeAI ecosystem


Overview

ACP Evals is an evaluation framework for multi-agent systems built on the Agent Communication Protocol. Evaluation frameworks measure the quality, performance, and safety of AI agent outputs through automated scoring methods. In production agent systems, these measurements become critical for ensuring reliability, detecting regressions, and optimizing performance at scale.

ACP Evals specializes in the unique challenges of coordinated agent systems. The framework measures how well agents collaborate, preserve information across handoffs, and maintain workflow coherence under production conditions.

Getting Started

The quickest way to understand ACP Evals is through the basic evaluation workflow. Install the framework, configure your LLM provider, and run your first evaluation to establish the fundamental pattern.

Installation

# Basic installation
pip install acp-evals

# Development installation with all providers
cd python/
pip install -e ".[dev,all-providers]"

Provider Configuration

Create a .env file in your project root:

# Copy the example configuration
cp python/.env.example python/.env

# Add your API keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
OLLAMA_BASE_URL=http://localhost:11434
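
Before running evaluations, it can help to confirm the credentials are actually visible to Python. This is a stdlib-only sanity check and assumes the values from .env have been exported or loaded into the environment (for example with python-dotenv):

import os

# Sanity check: confirm provider credentials from .env are visible to Python.
# Assumes the .env values have been exported or loaded (e.g. via python-dotenv).
for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "OLLAMA_BASE_URL"):
    status = "set" if os.getenv(var) else "missing"
    print(f"{var}: {status}")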

Basic Evaluation

from acp_evals import evaluate, AccuracyEval

# Evaluate any ACP agent with three lines
result = evaluate(
    AccuracyEval(agent="http://localhost:8000/agents/my-agent"),
    input="What is the capital of France?",
    expected="Paris"
)
print(f"Score: {result.score:.2f}, Cost: ${result.cost:.4f}")

# Works with any provider out of the box
# (my_agent is a placeholder for an agent URL or agent instance)
accuracy_eval = AccuracyEval(agent=my_agent, provider="anthropic")  # or "openai", "ollama"

# Multi-agent coordination (unique to ACP Evals)
from acp_evals import HandoffEval

result = HandoffEval(agents={
    "researcher": "http://localhost:8000/agents/researcher",
    "writer": "http://localhost:8000/agents/writer",
}).run(task="Research a topic, then hand off to the writer for a summary")

This pattern extends to all evaluation types. Replace AccuracyEval with PerformanceEval, SafetyEval, or ReliabilityEval to measure different aspects of agent behavior.
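
For instance, swapping in a different evaluator keeps the call shape identical. The following is a minimal sketch that assumes PerformanceEval exposes the same constructor and evaluate() interface as AccuracyEval:

from acp_evals import evaluate, PerformanceEval

agent_url = "http://localhost:8000/agents/my-agent"

# Swap the evaluator class; the call shape stays the same
# (assumes PerformanceEval shares AccuracyEval's interface).
perf_result = evaluate(
    PerformanceEval(agent=agent_url),
    input="Summarize the Agent Communication Protocol in one sentence.",
    expected="A concise, accurate one-sentence summary of ACP."
)
print(f"Score: {perf_result.score:.2f}, Latency: {perf_result.latency_ms}ms")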

System Architecture

graph TB
    A[ACP Agent] --> B[ACP Evals Framework]
    
    B --> C[Developer API<br/>Accuracy and Performance Evaluation]
    B --> D[Multi-Agent Evaluators<br/>Communication Patterns and Framework Integrity]
    B --> E[Production Features<br/>Trace Recycling and Continuous Evaluation]

    F[LLM Providers<br/>OpenAI, Anthropic, Ollama] --> B

    B --> G[Results<br/>Scores, Costs, and Performance Analytics]
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
    style F fill:#f1f8e9
    style G fill:#e0f2f1

Core Evaluation Capabilities

Quick Start: Start with 3-line evaluations, scale to enterprise multi-agent benchmarks

Quality & Performance Evaluators

Multi-Agent Specialized Metrics

Industry First: An evaluation framework built specifically for multi-agent coordination

Risk & Safety Evaluators

Quick Start

⚡ Zero to Evaluation: Get comprehensive agent metrics in under 60 seconds

from acp_evals import evaluate, AccuracyEval

# Evaluate any ACP agent in 3 lines
result = evaluate(
    AccuracyEval(agent="http://localhost:8000/agents/research-agent"),
    input="What are the latest developments in quantum computing?",
    expected="Recent quantum computing advances include..."
)
print(f"Score: {result.score}, Cost: ${result.cost}")

Multi-Agent Evaluation

Coordination Testing: Measure how well agents work together, not just individually

from acp_evals.benchmarks import HandoffBenchmark
from acp_evals.patterns import LinearPattern

# Evaluate agent coordination
benchmark = HandoffBenchmark(
    pattern=LinearPattern(["researcher", "analyzer", "synthesizer"]),
    tasks="research_quality",
    endpoint="http://localhost:8000"
)

# run_batch is async; call it from within an async function or via asyncio.run(...)
results = await benchmark.run_batch(
    test_data="multi_agent_tasks.jsonl",
    parallel=True,
    export="coordination_results.json"
)

Advanced Features

Production Integration

Built for real-world deployment monitoring

Adversarial & Robustness Testing

Test against real-world attack patterns, not academic examples

Dataset & Benchmarking

From gold-standard datasets to custom synthetic data

Supported Providers & Models

Provider Flexibility: Test locally with Ollama or scale with cloud providers

Provider   | Models                                        | Cost Tracking
OpenAI     | GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, o4-mini  | ✓
Anthropic  | Claude-4-Opus, Claude-4-Sonnet                | ✓
Ollama     | granite3.3:8b, qwen3:30b-a3b, custom          | ✓
Mock Mode  | Simulated responses                           | ✓
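
For fully local testing, an evaluator can be pointed at Ollama. The provider argument is shown earlier on this page; the model keyword below is an assumption about how a specific local model would be selected:

from acp_evals import AccuracyEval

# Local, offline judging via Ollama (the model keyword is an assumption;
# OLLAMA_BASE_URL comes from the .env configuration above)
local_eval = AccuracyEval(
    agent="http://localhost:8000/agents/my-agent",
    provider="ollama",
    model="granite3.3:8b",
)
result = local_eval.run(
    input="What does ACP stand for?",
    expected="Agent Communication Protocol"
)
print(result.score)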

Native ACP/BeeAI Integration

🔗 Ecosystem Native: Purpose-built for the ACP/BeeAI stack

  • ACP Message Handling: Native support for ACP communication patterns
  • BeeAI Agent Instances: Direct integration with BeeAI Framework agents (see the sketch below)
  • Workflow Evaluation: Built-in support for BeeAI multi-agent workflows
  • Event Stream Analysis: Real-time evaluation of agent interactions
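
A minimal sketch of the BeeAI agent-instance integration, assuming an evaluator accepts an in-process agent object wherever it accepts an agent URL (my_beeai_agent below is a hypothetical, already-constructed BeeAI Framework agent):

from acp_evals import AccuracyEval

# my_beeai_agent is a hypothetical, already-built BeeAI Framework agent instance.
# Per the integration notes above, evaluators accept agent instances as well as ACP URLs.
result = AccuracyEval(agent=my_beeai_agent).run(
    input="List two benefits of multi-agent handoffs.",
    expected="Information preservation and task specialization."
)
print(f"Score: {result.score:.2f}")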

Documentation & Examples

Resource              | Description
📚 Architecture Guide | Framework design and components
🚀 Setup Guide        | Installation and configuration
🔌 Provider Setup     | LLM provider configuration
💡 Examples           | 13 comprehensive usage examples

Quick Start Examples

The 13 usage examples in examples/ are grouped into three tiers: Essential (start here), Production Integration, and Advanced.

Batch Evaluation and Automation

Production agent systems require automated evaluation workflows that can process multiple test cases, generate comprehensive reports, and integrate with continuous integration systems.

Batch Processing

# Evaluate multiple test cases from a dataset
results = AccuracyEval(agent=my_agent).run_batch(
    test_data="test_cases.jsonl",
    parallel=True,
    progress=True,
    export="results.json"
)

print(f"Pass rate: {results.pass_rate}%, Average score: {results.avg_score:.2f}")

The JSONL format expects each line to contain a JSON object with input and expected fields:

{"input": "What is machine learning?", "expected": "Machine learning is a method of data analysis..."}
{"input": "Explain neural networks", "expected": "Neural networks are computing systems inspired by..."}

CI/CD Integration

# Integrate with pytest or other testing frameworks
import os

from acp_evals import AccuracyEval

def test_agent_accuracy():
    # Run against a mock provider in CI so the test needs no live API keys
    in_ci = os.getenv("CI") == "true"
    accuracy_eval = AccuracyEval(agent=my_agent, mock_mode=in_ci)
    result = accuracy_eval.run(
        input="Test question for CI",
        expected="Expected answer"
    )
    assert result.score > 0.8, f"Agent scored {result.score}, below threshold"

Understanding Results

Evaluation results follow a consistent structure across all evaluator types. Understanding result interpretation enables effective debugging and optimization workflows.

Result Structure

# All evaluators return results with this structure
# (evaluator is any configured evaluator instance)
result = evaluator.run(input="test", expected="expected")

print(f"Score: {result.score}")           # Float 0.0-1.0
print(f"Passed: {result.passed}")         # Boolean pass/fail
print(f"Cost: ${result.cost:.4f}")        # USD cost
print(f"Tokens: {result.tokens}")         # Token usage breakdown
print(f"Latency: {result.latency_ms}ms")  # Response time
print(f"Details: {result.details}")       # Evaluator-specific metrics

Debugging Failed Evaluations

When evaluations fail or score lower than expected, the details field provides specific feedback:

result = AccuracyEval(agent=my_agent).run(
    input="Complex technical question",
    expected="Technical answer"
)

if result.score < 0.7:
    print("Evaluation feedback:")
    print(result.details.get("judge_reasoning", "No reasoning provided"))
    print(f"Specific issues: {result.details.get('issues', [])}")

Troubleshooting

Common issues and solutions for evaluation setup and execution.

Provider Configuration Issues

If evaluations fail with authentication errors, verify your provider configuration:

# Test provider connectivity
from acp_evals.providers.factory import ProviderFactory

provider = ProviderFactory.get_provider("openai")  # or "anthropic", "ollama"
print(f"Provider status: {provider.health_check()}")

Agent Connection Problems

For ACP agent connectivity issues:

# Test agent health before evaluation
import httpx

async def test_agent_health(agent_url: str) -> bool:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{agent_url}/health")
        return response.status_code == 200

# Use in evaluations (inside an async context)
agent_url = "http://localhost:8000/agents/my-agent"
if await test_agent_health(agent_url):
    result = evaluate(
        AccuracyEval(agent=agent_url),
        input="What is the capital of France?",
        expected="Paris"
    )

Performance Optimization

For large-scale evaluations:

# Use batch processing with parallelization
results = AccuracyEval(agent=my_agent).run_batch(
    test_data="large_dataset.jsonl",
    parallel=True,
    batch_size=10,  # Process 10 at a time
    max_workers=4   # Limit concurrent evaluations
)

Project Structure

acp-evals/
├── python/                          # Core Python implementation
│   ├── src/acp_evals/
│   │   ├── api.py                   # Simple developer API
│   │   ├── evaluators/              # Built-in evaluators
│   │   │   ├── accuracy.py          # LLM-as-judge evaluation
│   │   │   ├── groundedness.py      # Context grounding assessment
│   │   │   ├── retrieval.py         # Information retrieval metrics
│   │   │   └── safety.py            # Safety and bias detection
│   │   ├── benchmarks/              # Multi-agent benchmarking
│   │   │   ├── datasets/            # Gold standard & adversarial data
│   │   │   │   ├── gold_standard_datasets.py
│   │   │   │   ├── adversarial_datasets.py
│   │   │   │   └── trace_recycler.py
│   │   │   └── multi_agent/         # Agent coordination benchmarks
│   │   ├── patterns/                # Agent architecture patterns
│   │   │   ├── linear.py            # Sequential execution
│   │   │   ├── supervisor.py        # Centralized coordination
│   │   │   └── swarm.py             # Distributed collaboration
│   │   ├── providers/               # LLM provider abstractions
│   │   │   ├── openai.py            # OpenAI integration
│   │   │   ├── anthropic.py         # Anthropic integration
│   │   │   └── ollama.py            # Local model support
│   │   ├── evaluation/              # Advanced evaluation features
│   │   │   ├── continuous.py        # Continuous eval pipeline
│   │   │   └── simulator.py         # Synthetic data generation
│   │   ├── telemetry/               # Observability integration
│   │   │   └── otel_exporter.py     # OpenTelemetry export
│   │   └── cli.py                   # Command-line interface
│   ├── tests/                       # Comprehensive test suite
│   ├── examples/                    # Usage examples (13 files)
│   └── docs/                        # Architecture & setup guides

Contributing

The framework is designed for extensibility:

  • New Evaluators: Add custom evaluation logic in evaluators/ (a rough sketch follows this list)
  • Provider Support: Extend providers/ for new LLM providers
  • Coordination Patterns: Implement new multi-agent patterns in patterns/
  • Dataset Integration: Add external benchmarks in benchmarks/datasets/
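
For the first point, here is a rough sketch of what a custom evaluator might look like. The real base interfaces live in evaluators/, so the result dataclass and the call_agent helper below are hypothetical stand-ins rather than the framework's actual types:

# Hypothetical sketch of a custom evaluator; the real base class and result
# type in acp_evals.evaluators may differ.
from dataclasses import dataclass

@dataclass
class EvalResult:          # stand-in for the framework's result object
    score: float
    passed: bool
    details: dict

class KeywordCoverageEval:
    """Scores a response by how many expected keywords it contains."""

    def __init__(self, agent, keywords):
        self.agent = agent
        self.keywords = [k.lower() for k in keywords]

    def run(self, input, expected=None):
        # call_agent is a hypothetical helper that invokes the ACP agent and returns its text
        response = call_agent(self.agent, input)
        hits = [k for k in self.keywords if k in response.lower()]
        score = len(hits) / len(self.keywords)
        return EvalResult(score=score, passed=score >= 0.8,
                          details={"matched_keywords": hits})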

See our contribution guide for detailed guidance.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Part of the BeeAI project, an initiative of the Linux Foundation AI & Data
