Comprehensive evaluation framework for Agent Communication Protocol (ACP) agents
ACP Evals
Production-ready evaluation framework for multi-agent systems in the ACP/BeeAI ecosystem
Overview
ACP Evals is an evaluation framework for multi-agent systems built on the Agent Communication Protocol. Evaluation frameworks measure the quality, performance, and safety of AI agent outputs through automated scoring methods. In production agent systems, these measurements become critical for ensuring reliability, detecting regressions, and optimizing performance at scale.
ACP Evals specializes in the unique challenges of coordinated agent systems. The framework measures how well agents collaborate, preserve information across handoffs, and maintain workflow coherence under production conditions.
Getting Started
The quickest way to understand ACP Evals is through the basic evaluation workflow. Install the framework, configure your LLM provider, and run your first evaluation to establish the fundamental pattern.
Installation
```bash
# Basic installation
pip install acp-evals

# Development installation with all providers
cd python/
pip install -e ".[dev,all-providers]"
```
Provider Configuration
Create a .env file in your project root:
```bash
# Copy the example configuration
cp python/.env.example python/.env
```

Then add your API keys to `.env`:

```ini
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
OLLAMA_BASE_URL=http://localhost:11434
```
Basic Evaluation
```python
from acp_evals import evaluate, AccuracyEval, HandoffEval

# Evaluate any ACP agent with three lines
result = evaluate(
    AccuracyEval(agent="http://localhost:8000/agents/my-agent"),
    input="What is the capital of France?",
    expected="Paris",
)
print(f"Score: {result.score:.2f}, Cost: ${result.cost:.4f}")

# Works with any provider out of the box
evaluator = AccuracyEval(agent=my_agent, provider="anthropic")  # or "openai", "ollama"

# Multi-agent coordination (unique to ACP Evals)
result = HandoffEval(agents={"researcher": url1, "writer": url2}).run(task)
```
This pattern extends to all evaluation types. Replace AccuracyEval with PerformanceEval, SafetyEval, or ReliabilityEval to measure different aspects of agent behavior.
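The evaluate, compare, and score loop that all of these evaluators share can be sketched without the framework. The toy scorer below is illustrative only (none of these names come from acp-evals); real evaluators substitute an LLM judge for the string comparison:

```python
from dataclasses import dataclass


@dataclass
class ToyResult:
    score: float   # 0.0-1.0, mirroring the framework's result.score
    passed: bool


def toy_evaluate(agent_fn, input_text, expected, threshold=0.5):
    """Run the agent, compare output to the expectation, emit a score.

    Real evaluators replace the exact-match check with an LLM-as-judge
    rubric; this sketch only checks for substring containment.
    """
    output = agent_fn(input_text)
    score = 1.0 if expected.lower() in output.lower() else 0.0
    return ToyResult(score=score, passed=score >= threshold)


# A stand-in "agent" that always answers correctly
result = toy_evaluate(lambda q: "The capital of France is Paris.",
                      "What is the capital of France?", "Paris")
print(result.score, result.passed)  # → 1.0 True
```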
System Architecture
```mermaid
graph TB
    A[ACP Agent] --> B[ACP Evals Framework]
    B --> C[Developer API<br/>Accuracy and Performance Evaluation]
    B --> D[Multi-Agent Evaluators<br/>Communication Patterns and Framework Integrity]
    B --> E[Production Features<br/>Trace Recycling and Continuous Evaluation]
    F[LLM Providers<br/>OpenAI, Anthropic, Ollama] --> B
    B --> G[Results<br/>Scores, Costs, and Analytics]
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
    style F fill:#f1f8e9
    style G fill:#e0f2f1
```
Core Evaluation Capabilities
Quick Start: Begin with 3-line evaluations, then scale to enterprise multi-agent benchmarks
Quality & Performance Evaluators
- AccuracyEval: LLM-as-judge with customizable rubrics (factual, research, code quality)
- GroundednessEvaluator: Context-grounded response validation
- RetrievalEvaluator: Information retrieval quality assessment
- DocumentRetrievalEvaluator: Full IR metrics (precision, recall, NDCG, MAP, MRR)
- PerformanceEval: Token usage, latency, and cost tracking across providers
Multi-Agent Specialized Metrics
Industry First: An evaluation framework built specifically for multi-agent coordination
- Handoff Quality: Information preservation across agent transitions
- Coordination Patterns: LinearPattern, SupervisorPattern, SwarmPattern evaluation
- Context Maintenance: Cross-agent context analysis and noise detection
- Decision Preservation: Agent-to-agent decision quality tracking
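The intuition behind handoff quality is measurable as the fraction of key information that survives an agent transition. A minimal sketch using content-word overlap (this is an illustrative stand-in, not the framework's actual metric, which is LLM-based):

```python
def handoff_preservation(before: str, after: str) -> float:
    """Fraction of the upstream agent's content words that appear in
    the downstream agent's message. Illustrative only: the framework's
    handoff metrics use richer, LLM-based analysis, not set overlap."""
    stop = {"the", "a", "an", "in", "is", "of", "to", "and", "via"}

    def content_words(text):
        return {w.strip(".,!?") for w in text.lower().split()} - stop

    src, dst = content_words(before), content_words(after)
    return len(src & dst) / len(src) if src else 1.0


researcher_msg = "Quantum error correction improved qubit coherence in 2024"
writer_msg = "In 2024, qubit coherence improved via quantum error correction."
print(f"{handoff_preservation(researcher_msg, writer_msg):.2f}")  # → 1.00
```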
Risk & Safety Evaluators
- SafetyEval: Composite safety and bias detection
- Adversarial Testing: Real-world attack pattern resistance (prompt injection, jailbreaks)
- ReliabilityEval: Tool usage validation and error handling assessment
Quick Start
⚡ Zero to Evaluation: Get comprehensive agent metrics in under 60 seconds
```python
from acp_evals import evaluate, AccuracyEval

# Evaluate any ACP agent in 3 lines
result = evaluate(
    AccuracyEval(agent="http://localhost:8000/agents/research-agent"),
    input="What are the latest developments in quantum computing?",
    expected="Recent quantum computing advances include...",
)
print(f"Score: {result.score}, Cost: ${result.cost}")
```
Multi-Agent Evaluation
Coordination Testing: Measure how well agents work together, not just individually
```python
from acp_evals.benchmarks import HandoffBenchmark
from acp_evals.patterns import LinearPattern

# Evaluate agent coordination
benchmark = HandoffBenchmark(
    pattern=LinearPattern(["researcher", "analyzer", "synthesizer"]),
    tasks="research_quality",
    endpoint="http://localhost:8000",
)

# run_batch is a coroutine; await it from an async context
results = await benchmark.run_batch(
    test_data="multi_agent_tasks.jsonl",
    parallel=True,
    export="coordination_results.json",
)
```
Advanced Features
Production Integration
Built for real-world deployment monitoring
- Trace Recycling: Convert production telemetry to evaluation datasets (example)
- Continuous Evaluation: Automated regression detection and baseline tracking (docs)
- OpenTelemetry Export: Real-time metrics to Jaeger, Phoenix, and observability platforms
- Cost Optimization: Multi-provider cost comparison and budget alerts
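The core idea of trace recycling is reshaping interactions you already log into evaluation cases. A minimal sketch, assuming telemetry records are plain dicts with `prompt`, `response`, and `user_rating` fields (these names are assumptions for illustration; the framework's trace_recycler works on real OpenTelemetry spans):

```python
import json


def recycle_traces(traces, min_rating=0.8):
    """Keep only well-rated production interactions and reshape them
    into {input, expected} evaluation cases. Field names here are
    hypothetical; adapt them to your telemetry schema."""
    cases = []
    for t in traces:
        if t.get("user_rating", 0.0) >= min_rating:
            cases.append({"input": t["prompt"], "expected": t["response"]})
    return cases


traces = [
    {"prompt": "Define RAG", "response": "Retrieval-augmented generation...", "user_rating": 0.9},
    {"prompt": "Bad interaction", "response": "...", "user_rating": 0.2},
]
dataset = recycle_traces(traces)
print(json.dumps(dataset[0]))
```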
Adversarial & Robustness Testing
Test against real-world attack patterns, not academic examples
- Real-World Attack Patterns: Prompt injection, context manipulation, data extraction (example)
- Edge Case Generation: Synthetic adversarial scenario creation
Dataset & Benchmarking
Gold standard datasets to custom synthetic data
- Gold Standard Datasets: Production-realistic multi-step agent tasks
- External Integration: TRAIL, GAIA, SWE-Bench benchmark support
- Custom Dataset Loaders: Flexible evaluation data management
- Synthetic Data Generation: Automated test case creation
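At its simplest, synthetic test-case creation is templating over a seed fact table. A toy sketch of the idea (the framework's simulator is far richer; the names below are illustrative):

```python
# Seed facts to expand into question/answer evaluation cases
FACTS = [
    ("France", "Paris"),
    ("Japan", "Tokyo"),
    ("Kenya", "Nairobi"),
]


def generate_cases(facts):
    """Expand each fact through multiple question templates,
    yielding {input, expected} cases ready for batch evaluation."""
    templates = [
        "What is the capital of {country}?",
        "Name the capital city of {country}.",
    ]
    for country, capital in facts:
        for template in templates:
            yield {"input": template.format(country=country), "expected": capital}


cases = list(generate_cases(FACTS))
print(len(cases))  # → 6
```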
Supported Providers & Models
**Provider Flexibility**: Test locally with Ollama or scale with cloud providers
| Provider | Models | Cost Tracking |
|---|---|---|
| OpenAI | GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, o4-mini | ✅ |
| Anthropic | Claude-4-Opus, Claude-4-Sonnet | ✅ |
| Ollama | granite3.3:8b, qwen3:30b-a3b, custom | ✅ |
| Mock Mode | Simulated responses | ✅ |
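Per-provider cost tracking reduces to multiplying token counts by per-token prices. A sketch of the arithmetic; the prices below are placeholders, not current vendor rates:

```python
# Hypothetical per-1K-token prices in USD. These are placeholders for
# illustration only; check your provider's current pricing.
PRICES = {
    "openai": {"input": 0.002, "output": 0.008},
    "anthropic": {"input": 0.003, "output": 0.015},
    "ollama": {"input": 0.0, "output": 0.0},  # local models cost nothing per token
}


def estimate_cost(provider: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one call, given token counts and per-1K prices."""
    p = PRICES[provider]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1000


cost = estimate_cost("openai", input_tokens=1200, output_tokens=400)
print(f"${cost:.4f}")  # → $0.0056
```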
Native ACP/BeeAI Integration
🔗 Ecosystem Native: Purpose-built for the ACP/BeeAI stack
- ACP Message Handling: Native support for ACP communication patterns (example)
- BeeAI Agent Instances: Direct integration with BeeAI Framework agents
- Workflow Evaluation: Built-in support for BeeAI multi-agent workflows
- Event Stream Analysis: Real-time evaluation of agent interactions
Documentation & Examples
| Resource | Description |
|---|---|
| 📚 Architecture Guide | Framework design and components |
| 🚀 Setup Guide | Installation and configuration |
| 🔌 Provider Setup | LLM provider configuration |
| 💡 Examples | 13 comprehensive usage examples |
Quick Start Examples
Essential (Start Here):
- 00_minimal_example.py: 3-line agent evaluation
- 01_quickstart_accuracy.py: Basic accuracy assessment
- 02_multi_agent_evaluation.py: Agent coordination testing
Production Integration:
- 04_continuous_evaluation.py: CI/CD monitoring pipeline
- 12_end_to_end_trace_pipeline.py: Production trace recycling
- 09_real_acp_agents.py: Live ACP agent integration
Advanced:
- 07_adversarial_testing.py: Security robustness evaluation
- 13_synthetic_data_generation.py: Custom dataset creation
Batch Evaluation and Automation
Production agent systems require automated evaluation workflows that can process multiple test cases, generate comprehensive reports, and integrate with continuous integration systems.
Batch Processing
```python
# Evaluate multiple test cases from a dataset
results = AccuracyEval(agent=my_agent).run_batch(
    test_data="test_cases.jsonl",
    parallel=True,
    progress=True,
    export="results.json",
)
print(f"Pass rate: {results.pass_rate}%, Average score: {results.avg_score:.2f}")
```
The JSONL format expects each line to contain a JSON object with `input` and `expected` fields:

```jsonl
{"input": "What is machine learning?", "expected": "Machine learning is a method of data analysis..."}
{"input": "Explain neural networks", "expected": "Neural networks are computing systems inspired by..."}
```
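Files in this format can be generated and validated with the standard library alone, which is handy for building datasets programmatically before a batch run:

```python
import json
import os
import tempfile

cases = [
    {"input": "What is machine learning?",
     "expected": "Machine learning is a method of data analysis..."},
    {"input": "Explain neural networks",
     "expected": "Neural networks are computing systems inspired by..."},
]

# Write one JSON object per line (the JSONL convention)
path = os.path.join(tempfile.mkdtemp(), "test_cases.jsonl")
with open(path, "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

# Validate: every line must parse and carry both required fields
with open(path) as f:
    loaded = [json.loads(line) for line in f]
assert all({"input", "expected"} <= case.keys() for case in loaded)
print(f"{len(loaded)} valid cases")  # → 2 valid cases
```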
CI/CD Integration
```python
# Integrate with pytest or other testing frameworks
def test_agent_accuracy():
    evaluator = AccuracyEval(agent=my_agent, mock_mode=CI_ENV)
    result = evaluator.run(
        input="Test question for CI",
        expected="Expected answer",
    )
    assert result.score > 0.8, f"Agent scored {result.score}, below threshold"
```
Understanding Results
Evaluation results follow a consistent structure across all evaluator types. Understanding result interpretation enables effective debugging and optimization workflows.
Result Structure
```python
# All evaluators return results with this structure
result = evaluator.run(input="test", expected="expected")
print(f"Score: {result.score}")           # float, 0.0-1.0
print(f"Passed: {result.passed}")         # boolean pass/fail
print(f"Cost: ${result.cost:.4f}")        # USD cost
print(f"Tokens: {result.tokens}")         # token usage breakdown
print(f"Latency: {result.latency_ms}ms")  # response time
print(f"Details: {result.details}")       # evaluator-specific metrics
```
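Given a list of such results, batch statistics like pass rate and average score are straightforward aggregates. A sketch using a stand-in result type whose field names mirror the structure above (the real `run_batch` computes these for you):

```python
from dataclasses import dataclass


@dataclass
class Result:  # stand-in mirroring the fields shown above
    score: float
    passed: bool
    cost: float


def summarize(results):
    """Compute the batch-level numbers a run_batch report contains."""
    n = len(results)
    return {
        "pass_rate": 100.0 * sum(r.passed for r in results) / n,
        "avg_score": sum(r.score for r in results) / n,
        "total_cost": sum(r.cost for r in results),
    }


batch = [Result(0.9, True, 0.01), Result(0.6, False, 0.02), Result(0.8, True, 0.01)]
stats = summarize(batch)
print(f"Pass rate: {stats['pass_rate']:.0f}%, Average score: {stats['avg_score']:.2f}")
# → Pass rate: 67%, Average score: 0.77
```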
Debugging Failed Evaluations
When evaluations fail or score lower than expected, the `details` field provides specific feedback:

```python
result = AccuracyEval(agent=my_agent).run(
    input="Complex technical question",
    expected="Technical answer",
)
if result.score < 0.7:
    print("Evaluation feedback:")
    print(result.details.get("judge_reasoning", "No reasoning provided"))
    print(f"Specific issues: {result.details.get('issues', [])}")
```
Troubleshooting
Common issues and solutions for evaluation setup and execution.
Provider Configuration Issues
If evaluations fail with authentication errors, verify your provider configuration:
```python
# Test provider connectivity
from acp_evals.providers.factory import ProviderFactory

provider = ProviderFactory.get_provider("openai")  # or "anthropic", "ollama"
print(f"Provider status: {provider.health_check()}")
```
Agent Connection Problems
For ACP agent connectivity issues:
```python
import httpx

# Test agent health before evaluation
async def test_agent_health(agent_url):
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{agent_url}/health")
        return response.status_code == 200

# Use in evaluations
agent_url = "http://localhost:8000/agents/my-agent"
if await test_agent_health(agent_url):
    result = evaluate(AccuracyEval(agent=agent_url), input, expected)
```
Performance Optimization
For large-scale evaluations:
```python
# Use batch processing with parallelization
results = AccuracyEval(agent=my_agent).run_batch(
    test_data="large_dataset.jsonl",
    parallel=True,
    batch_size=10,  # process 10 at a time
    max_workers=4,  # limit concurrent evaluations
)
```
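The chunking and concurrency strategy behind `batch_size` and `max_workers` can be sketched with the standard library. The `evaluate_one` stub below stands in for a real, network-bound evaluator call:

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_one(case):
    # Stub standing in for a real (network-bound) evaluator call
    return {"input": case["input"], "score": 1.0}


def run_batch(cases, batch_size=10, max_workers=4):
    """Process cases in chunks of batch_size, with at most
    max_workers evaluations in flight at once. Results come back
    in input order because pool.map preserves ordering."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for start in range(0, len(cases), batch_size):
            chunk = cases[start:start + batch_size]
            results.extend(pool.map(evaluate_one, chunk))
    return results


cases = [{"input": f"q{i}"} for i in range(25)]
results = run_batch(cases)
print(len(results))  # → 25
```

Threads suit this workload because evaluator calls are I/O-bound; a process pool would only add serialization overhead.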
Project Structure
```
acp-evals/
├── python/                      # Core Python implementation
│   ├── src/acp_evals/
│   │   ├── api.py               # Simple developer API
│   │   ├── evaluators/          # Built-in evaluators
│   │   │   ├── accuracy.py      # LLM-as-judge evaluation
│   │   │   ├── groundedness.py  # Context grounding assessment
│   │   │   ├── retrieval.py     # Information retrieval metrics
│   │   │   └── safety.py        # Safety and bias detection
│   │   ├── benchmarks/          # Multi-agent benchmarking
│   │   │   ├── datasets/        # Gold standard & adversarial data
│   │   │   │   ├── gold_standard_datasets.py
│   │   │   │   ├── adversarial_datasets.py
│   │   │   │   └── trace_recycler.py
│   │   │   └── multi_agent/     # Agent coordination benchmarks
│   │   ├── patterns/            # Agent architecture patterns
│   │   │   ├── linear.py        # Sequential execution
│   │   │   ├── supervisor.py    # Centralized coordination
│   │   │   └── swarm.py         # Distributed collaboration
│   │   ├── providers/           # LLM provider abstractions
│   │   │   ├── openai.py        # OpenAI integration
│   │   │   ├── anthropic.py     # Anthropic integration
│   │   │   └── ollama.py        # Local model support
│   │   ├── evaluation/          # Advanced evaluation features
│   │   │   ├── continuous.py    # Continuous eval pipeline
│   │   │   └── simulator.py     # Synthetic data generation
│   │   ├── telemetry/           # Observability integration
│   │   │   └── otel_exporter.py # OpenTelemetry export
│   │   └── cli.py               # Command-line interface
│   ├── tests/                   # Comprehensive test suite
│   ├── examples/                # Usage examples (13 files)
│   └── docs/                    # Architecture & setup guides
```
Contributing
The framework is designed for extensibility:
- New Evaluators: Add custom evaluation logic in `evaluators/`
- Provider Support: Extend `providers/` for new LLM providers
- Coordination Patterns: Implement new multi-agent patterns in `patterns/`
- Dataset Integration: Add external benchmarks in `benchmarks/datasets/`
See our contribution guide for detailed guidance.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Part of the BeeAI project, an initiative of the Linux Foundation AI & Data.
File details
Details for the file acp_evals-0.1.1.tar.gz.
File metadata
- Download URL: acp_evals-0.1.1.tar.gz
- Upload date:
- Size: 206.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `34ea055f999b5abaf07442e6bbf3333c07ab922377bd1a3ba0a00d66fbe3fcdc` |
| MD5 | `08816db7854082a64fa2396b1df8701a` |
| BLAKE2b-256 | `c8ce33ba285843e771e70037ad612f7b0224ad9d14cad3fc8d8dc8b3b94f82f0` |
File details
Details for the file acp_evals-0.1.1-py3-none-any.whl.
File metadata
- Download URL: acp_evals-0.1.1-py3-none-any.whl
- Upload date:
- Size: 165.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `1e43ea734e5d81358dd565df3fd7626f68825541aa354a6f7563721b092b7349` |
| MD5 | `05753379bad046a3c08bb7d9c472735a` |
| BLAKE2b-256 | `fd81efbf2498956a90f4453e506292ac136d1096043e444802fdc5a45cfcd698` |