
Production-grade evaluation framework for Agent Communication Protocol (ACP) agents

Project description

ACP Evals

Python 3.11+ | Apache 2.0 License | ACP Compatible

ACP Evals is an open framework for evaluating AI agents across accuracy, performance, and reliability dimensions.

Modern AI agents need comprehensive testing before deployment. ACP Evals provides production-grade evaluation using LLM-as-judge methodology, designed to integrate seamlessly with the BeeAI ecosystem and any ACP-compliant agent.

ACP Evals enables you to:

  • Measure response accuracy using configurable LLM judges
  • Track performance metrics including latency and memory usage
  • Validate tool usage patterns and error handling
  • Run batch evaluations for comprehensive test coverage
  • Generate detailed reports for continuous improvement

Core Concepts

  • Accuracy: Evaluates response quality against expected outputs using LLM-as-judge methodology. Supports custom rubrics for domain-specific evaluation.
  • Performance: Measures latency, memory usage, and token efficiency. Essential for production deployments where speed and resource constraints matter.
  • Reliability: Validates tool usage patterns, error handling, and consistency across runs. Critical for agents that interact with external systems.

Quick Example

Evaluate agent accuracy with just a few lines:

from acp_evals import AccuracyEval

evaluation = AccuracyEval(
    agent="http://localhost:8001/agents/my-agent",
    rubric="factual"
)

result = await evaluation.run(
    input="What is 10*5 then to the power of 2? do it step by step",
    expected="2500",
    print_results=True
)
assert result is not None and result.score >= 0.7
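
The example above uses the built-in factual rubric. The Core Concepts list also mentions custom rubrics for domain-specific evaluation; the sketch below assumes a rubric can additionally be supplied as a dictionary of weighted criteria. The exact field names (criteria, weight, descriptions) are illustrative assumptions, so check the package documentation for the supported schema.

from acp_evals import AccuracyEval

# Hypothetical custom rubric: the dictionary shape (criteria names, weights,
# descriptions) is an illustrative assumption, not a documented schema.
support_rubric = {
    "criteria": {
        "factual_accuracy": {
            "weight": 0.6,
            "description": "Claims match the product documentation",
        },
        "tone": {
            "weight": 0.4,
            "description": "Response is concise and professional",
        },
    }
}

evaluation = AccuracyEval(
    agent="http://localhost:8001/agents/my-agent",
    rubric=support_rubric,  # assumed: accepts a dict as well as a preset name like "factual"
)

result = await evaluation.run(
    input="How do I reset my password?",
    expected="Describes the password reset flow",
    print_results=True,
)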


Installation

pip install acp-evals

Quickstart

1. Configure your LLM provider

echo "OPENAI_API_KEY=your-key-here" > .env
acp-evals check

2. Run your first evaluation

# Test accuracy
acp-evals run accuracy http://localhost:8001/agents/my-agent \
  -i "What is 2+2?" -e "4"

3. Run comprehensive evaluation

acp-evals comprehensive http://localhost:8001/agents/my-agent \
  -i "Calculate compound interest" -e "Detailed calculation"

Examples

Performance Evaluation

from acp_evals import PerformanceEval

evaluation = PerformanceEval(
    agent="http://localhost:8001/agents/my-agent",
    num_iterations=5,
    track_memory=True
)

result = await evaluation.run(
    input_text="What is the capital of France?",
    print_results=True
)
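
Performance checks are usually run over a small batch of representative prompts rather than a single question. The following sketch relies only on the constructor arguments and run() signature shown above; because the fields exposed by the returned result object are not documented here, it simply collects the results for later inspection.

import asyncio

from acp_evals import PerformanceEval

# A minimal sketch: profile the same agent over several prompts.
PROMPTS = [
    "What is the capital of France?",
    "Summarize the water cycle in one sentence.",
    "Convert 100 degrees Fahrenheit to Celsius.",
]

async def profile_agent(agent_url: str):
    evaluation = PerformanceEval(
        agent=agent_url,
        num_iterations=3,
        track_memory=True,
    )
    results = []
    for prompt in PROMPTS:
        # print_results=True keeps the per-run report visible while collecting results
        result = await evaluation.run(input_text=prompt, print_results=True)
        results.append(result)
    return results

if __name__ == "__main__":
    asyncio.run(profile_agent("http://localhost:8001/agents/my-agent"))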

Reliability Evaluation

from acp_evals import ReliabilityEval

evaluation = ReliabilityEval(
    agent="http://localhost:8001/agents/my-agent",
    tool_definitions=["search", "calculator"]
)

result = await evaluation.run(
    input="Search for AAPL price and calculate P/E ratio",
    expected_tools=["search", "calculator"],
    print_results=True
)
assert result.passed

Agent Formats

ACP Evals works with any agent implementation (see the sketch after this list):

  • ACP-compliant agents: http://localhost:8001/agents/my-agent
  • Python functions: agent.py:function_name
  • Python modules: mymodule.agent_function
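
For quick local iteration you can evaluate a plain Python function before standing up an ACP server. The function below is a throwaway stand-in, and the exact signature acp-evals expects for function agents (sync vs. async, argument names) is an assumption here.

# agent.py -- a minimal stand-in agent: a plain function that takes the
# input string and returns a response string.
def echo_agent(input_text: str) -> str:
    # Stand-in logic only: answer the one question we expect, echo otherwise.
    if "2+2" in input_text:
        return "4"
    return f"You asked: {input_text}"

Assuming the agent.py:function_name reference shown above works wherever an agent URL does, it can then be targeted from the CLI like a served agent, e.g. acp-evals run accuracy agent.py:echo_agent -i "What is 2+2?" -e "4".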

CLI Reference

# Check setup
acp-evals check

# Run evaluations
acp-evals run accuracy <agent> -i <input> -e <expected>
acp-evals run performance <agent> -i <input>
acp-evals run reliability <agent> -i <input> --expected-tools <tool>

# Comprehensive testing
acp-evals comprehensive <agent> -i <input> -e <expected>

# Batch testing
acp-evals run accuracy <agent> --test-file tests.jsonl
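
Batch evaluation reads test cases from a JSONL file, one JSON object per line. The input/expected field names below mirror the -i/-e flags but are an assumption, so confirm the expected schema in the package documentation.

import json

# A minimal sketch of generating tests.jsonl for batch accuracy runs.
# The "input"/"expected" keys mirror the -i/-e CLI flags and are assumed,
# not a documented schema.
cases = [
    {"input": "What is 2+2?", "expected": "4"},
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 10*5 then to the power of 2?", "expected": "2500"},
]

with open("tests.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")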


License

Apache 2.0 - see LICENSE


Developed by contributors to the BeeAI project, this initiative is part of the Linux Foundation AI & Data program. Its development follows open, collaborative, and community-driven practices.

Download files

Download the file for your platform.

Source Distribution

acp_evals-1.0.0.tar.gz (98.6 kB)

Built Distribution

acp_evals-1.0.0-py3-none-any.whl (82.5 kB)

File details

Details for the file acp_evals-1.0.0.tar.gz.

File metadata

  • Download URL: acp_evals-1.0.0.tar.gz
  • Size: 98.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for acp_evals-1.0.0.tar.gz:

  • SHA256: 0eb16f134c485f53639eabe359d609b4c36a11b3a9f5e89b72180de0c90a40dc
  • MD5: 7ef5cc6da1281ed82f62fe32deeae155
  • BLAKE2b-256: 2a5c327507d3f77b0bd0360551edfd6251fa824f6bb8ab7b119e70a07cf98433


File details

Details for the file acp_evals-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: acp_evals-1.0.0-py3-none-any.whl
  • Size: 82.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for acp_evals-1.0.0-py3-none-any.whl:

  • SHA256: dca995be483c72fb13c3e8bb27d47043b0bc0ab14609a04d0f338b283c0470e1
  • MD5: 03a5b9d2e8774c3ba3fc65ce678cd7cb
  • BLAKE2b-256: 3fedf7108cb6644a976a717f449900bf08db82b7b17624c5bbe85997925f7258

