Standalone Agent Evaluation Framework (AEF)

AEF - Agent Evaluation Framework

AEF is a framework to generate tests, run/evaluate trajectories, collect feedback, and self-evolve agent behavior.

The workflow is intentionally minimal and framework-agnostic:

  • aef generate calls the generation component/tool
  • aef evaluate calls the evaluation component/tool
  • aef feedback calls the feedback component/tool
  • aef evolve calls the evolution component/tool

Internally, these are routed through an A2A bus so the same flow works for sub-agents implemented with different frameworks.


Installation

From PyPI

Install via pip or uv:

pip install aef-framework

or with uv:

uv pip install aef-framework

Local Development Install with uv

AEF uses uv for fast, reliable Python package management.

1. Install uv (if not already installed)

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Create a virtual environment

cd AEF
uv venv --python=3.11

This creates a .venv directory with Python 3.11 (pass 3.10 or 3.12 instead if needed).

3. Activate the virtual environment

source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate     # Windows

4. Install AEF in editable mode

uv pip install -e .

This installs AEF and all dependencies, making the aef command available.

5. Verify installation

aef --help

Traditional pip install (local)

If you prefer using pip:

python -m venv .venv
source .venv/bin/activate
pip install -e .

Core Principles

  • Universal sub-agent support via adapter contract (python, cli, http)
  • Single essential loop: Generate → Evaluate → Feedback → Evolve
  • Composable A2A components instead of tightly-coupled command logic
  • Versioned evolution profiles with before/after evaluation comparison

Basic Workflow

1) Generate trajectories

aef generate --config configs/fleet_ccc_run.json --n 10

2) Evaluate against a golden run

aef evaluate --config configs/fleet_ccc_run.json --golden run_YYYYMMDD_xxxxxx

3) Submit feedback

aef feedback --agent fleet_ccc --text "Agent should ask confirmation before delete operations"

4) Evolve (auto-apply + compare)

aef evolve --config configs/fleet_ccc_run.json --n 10

aef evolve performs the following steps in order:

  1. baseline evaluate
  2. classify feedback into amendments
  3. apply evolution profile
  4. re-evaluate and report before/after score delta
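
The four steps above can be sketched as a single function. This is a minimal illustration, not AEF's implementation: evolve_once, the evaluate and classify callables, and the amendment category names are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Profile = Dict[str, List[str]]


def evolve_once(
    evaluate: Callable[[Profile], float],
    classify: Callable[[str], str],
    feedback: List[str],
) -> Dict[str, float]:
    """Run one baseline -> amend -> re-evaluate cycle (hypothetical helper)."""
    baseline = evaluate({})  # 1. baseline evaluate with an empty profile
    profile: Profile = {}
    for text in feedback:  # 2. classify each feedback item into an amendment bucket
        profile.setdefault(classify(text), []).append(text)
    after = evaluate(profile)  # 3 + 4. apply the profile and re-evaluate
    return {"before": baseline, "after": after, "delta": after - baseline}


# Toy stand-ins: the score rises by 0.25 per amendment in the profile.
report = evolve_once(
    evaluate=lambda p: 0.5 + 0.25 * sum(len(v) for v in p.values()),
    classify=lambda t: "tool_policies" if "delete" in t else "prompt_addenda",
    feedback=["Agent should ask confirmation before delete operations"],
)
print(report)
```

Passing the evaluator and classifier in as callables keeps the loop independent of how scores are produced, which matches the framework-agnostic intent described earlier.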

Use AEF With Any Sub-Agent

Set agent.adapter_type in your config:

  • python: ADK/Python agent entrypoint module_or_file.py:agent_var
  • cli: shell command template using {step} / {goal} placeholders
  • http: endpoint that accepts { goal, step, session_id? }

See detailed usage in docs/USING_ANY_SUBAGENT.md.
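
As an illustration of the cli adapter's {step} / {goal} placeholders, the sketch below fills a command template and shell-quotes both values. The function name render_cli_command and the my-agent command are hypothetical; how AEF itself expands the template is not specified here.

```python
import shlex


def render_cli_command(template: str, goal: str, step: str) -> str:
    """Fill {goal} and {step} in a command template (hypothetical helper).

    Both values are quoted with shlex.quote so that spaces or shell
    metacharacters in the text cannot alter the command (POSIX shells only).
    """
    return template.format(goal=shlex.quote(goal), step=shlex.quote(step))


cmd = render_cli_command(
    "my-agent --goal {goal} --input {step}",  # hypothetical command template
    goal="reboot server",
    step="list servers; then pick one",
)
print(cmd)
```

Because this sketch uses str.format, a template containing other literal braces would need them doubled ({{ and }}).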

Runtime endpoint mode (--agent_endpoint)

In addition to config-defined adapters, you can override execution at runtime:

  • If --agent_endpoint is provided, AEF routes agent-under-test (AUT) calls through HTTP endpoint mode.
  • If --agent_endpoint is not provided, AEF keeps its existing behavior (for example, a local --sub_agent or the config-defined adapter).

Endpoint-mode guarantees:

  • AEF uses the hosted endpoint runner path for AUT execution.
  • Local Python entrypoint loading is not required in this mode.
  • Local entrypoint import/path issues do not block endpoint-mode execution.
  • The endpoint runner follows the ADK server contract: create or reuse a session, then call POST /run.

This is useful when the same agent can run locally in development and remotely in a hosted ADK/A2A service.

Examples:

# Generate through hosted endpoint
aef generate --config configs/fleet_ilo_run.json \
	--agent_endpoint http://localhost:8086/docs/ --n 2

# Evaluate through hosted endpoint
aef evaluate --config configs/fleet_ilo_run.json \
	--agent_endpoint http://localhost:8086/docs/ --golden run_YYYYMMDD_xxxxxx

# Evolve through hosted endpoint
aef evolve --config configs/fleet_ilo_run.json \
	--agent_endpoint http://localhost:8086/docs/ --n 5

Notes:

  • /docs/ URLs are supported (AEF resolves them to the API base).
  • Endpoint mode is available on generate, evaluate, run, and evolve.
  • Endpoint mode is intended for ADK/A2A-hosted AUTs that are reachable over an API.
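
One plausible way to resolve a /docs/ URL to its API base is to strip a trailing /docs path segment. The sketch below is an assumption about that behavior, not AEF's actual resolver; resolve_api_base is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_api_base(endpoint: str) -> str:
    """Drop a trailing /docs segment, query, and fragment (hypothetical helper)."""
    parts = urlsplit(endpoint)
    path = parts.path.rstrip("/")
    if path.endswith("/docs"):
        path = path[: -len("/docs")]
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(resolve_api_base("http://localhost:8086/docs/"))
print(resolve_api_base("http://localhost:8086"))
```

Both calls yield the same base URL, so callers can pass either form of the endpoint.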

ADK endpoint contract used by --agent_endpoint:

  • Session bootstrap: POST /apps/{app_name}/users/{user_id}/sessions/{session_id}
  • Inference: POST /run
  • Request payload: {"appName","userId","sessionId","newMessage":{"role":"user","parts":[{"text":...}]},"streaming":false}
  • A 409 Conflict during session creation is treated as expected session reuse.
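
The contract above can be exercised with the standard library alone. The following is a minimal client sketch under stated assumptions: the helper names are hypothetical, the session bootstrap is assumed to accept an empty JSON body, and the response of POST /run is assumed to be JSON. Only the paths, payload keys, and 409 handling come from the list above.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Dict


def session_url(base: str, app_name: str, user_id: str, session_id: str) -> str:
    return f"{base.rstrip('/')}/apps/{app_name}/users/{user_id}/sessions/{session_id}"


def run_payload(app_name: str, user_id: str, session_id: str, text: str) -> Dict[str, Any]:
    return {
        "appName": app_name,
        "userId": user_id,
        "sessionId": session_id,
        "newMessage": {"role": "user", "parts": [{"text": text}]},
        "streaming": False,
    }


def post_json(url: str, body: Dict[str, Any]) -> bytes:
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def ensure_session(base: str, app_name: str, user_id: str, session_id: str) -> None:
    try:
        # Assumption: an empty JSON object is an acceptable bootstrap body.
        post_json(session_url(base, app_name, user_id, session_id), {})
    except urllib.error.HTTPError as err:
        if err.code != 409:  # 409 Conflict means the session already exists
            raise


def run_turn(base: str, app_name: str, user_id: str, session_id: str, text: str) -> Any:
    ensure_session(base, app_name, user_id, session_id)
    raw = post_json(f"{base.rstrip('/')}/run", run_payload(app_name, user_id, session_id, text))
    return json.loads(raw)  # assumption: the server replies with JSON
```

Calling run_turn("http://localhost:8086", "my_app", "user_1", "session_1", "list servers") would bootstrap or reuse the session and then send one user message; the app, user, and session names here are placeholders.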

Trajectory logging in endpoint mode:

  • steps[].content stores the assistant's text response.
  • steps[].tool_calls, steps[].tool_responses, and steps[].tools_used store tool trace data when present.
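
A logged step with these fields might be modeled as follows. The field names come from the list above; the Step class, the to_record method, and the shape of each tool-call entry are hypothetical and only illustrate the "when present" behavior.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    content: str  # assistant text response
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)
    tool_responses: List[Dict[str, Any]] = field(default_factory=list)
    tools_used: List[str] = field(default_factory=list)

    def to_record(self) -> Dict[str, Any]:
        # Always keep content; keep tool trace fields only when non-empty.
        return {k: v for k, v in asdict(self).items() if k == "content" or v}


plain = Step("Done.").to_record()
traced = Step(
    "Deleted server 7.",
    tool_calls=[{"name": "delete_server", "args": {"id": 7}}],  # hypothetical shape
    tools_used=["delete_server"],
).to_record()
print(plain)
print(traced)
```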

Full prerequisites and onboarding checklist:


A2A Components

AEF components exposed through the internal bus:

  • generation.generate
  • evaluation.evaluate
  • feedback.submit_text
  • feedback.submit_annotations
  • evolution.evolve

See docs/A2A_COMPONENTS.md.
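
To illustrate the idea of addressing components by a component.tool name, here is a toy in-process bus. It is a sketch only: the Bus class and the handler are hypothetical and say nothing about how AEF's internal bus is actually built.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]


class Bus:
    """Toy bus that maps "component.tool" names to handler functions."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, component: str, tool: str, handler: Handler) -> None:
        self._handlers[f"{component}.{tool}"] = handler

    def call(self, component: str, tool: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        key = f"{component}.{tool}"
        if key not in self._handlers:
            raise KeyError(f"no handler registered for {key}")
        return self._handlers[key](payload)


bus = Bus()
# Hypothetical handler standing in for the generation component.
bus.register("generation", "generate", lambda payload: {"requested": payload.get("n", 1)})
result = bus.call("generation", "generate", {"n": 2})
print(result)
```

Because callers depend only on the name and a JSON-like payload, a handler can wrap an agent built with any framework.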


Evolution Outputs

Evolution applies and versions runtime amendments per agent under:

  • prompts/evolution_profiles/<agent>/latest.json
  • prompts/evolution_profiles/<agent>/profile_<timestamp>.json

These profiles contain:

  • prompt addenda
  • tool policies
  • generator hints
  • agent hints
  • rubric updates

See docs/SELF_EVOLUTION.md.
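
The versioned layout above can be sketched as a writer that saves a timestamped profile and refreshes latest.json alongside it. This is an illustration under assumptions: save_profile is a hypothetical helper, the timestamp format is a guess, and the prompt_addenda key is an assumed spelling of the "prompt addenda" field.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict


def save_profile(root: Path, agent: str, profile: Dict[str, Any]) -> Path:
    """Write profile_<timestamp>.json and mirror it to latest.json."""
    agent_dir = root / "prompts" / "evolution_profiles" / agent
    agent_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the real timestamp format may differ.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    versioned = agent_dir / f"profile_{stamp}.json"
    text = json.dumps(profile, indent=2)
    versioned.write_text(text, encoding="utf-8")
    (agent_dir / "latest.json").write_text(text, encoding="utf-8")
    return versioned


# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = save_profile(
        Path(tmp),
        "fleet_ccc",
        {"prompt_addenda": ["Ask for confirmation before delete operations."]},
    )
    latest = json.loads((path.parent / "latest.json").read_text(encoding="utf-8"))
print(path.name, latest)
```

Keeping every timestamped file while overwriting latest.json gives a simple history to compare against without needing a separate index.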


Web UI

AEF includes a Next.js web interface for managing agents, running benchmarks, reviewing trajectories, and tracking evolution.

Option A — Docker (recommended)

Run both the backend API and the web UI with a single command:

docker compose up -d

To rebuild after code changes:

docker compose up -d --build

To stop:

docker compose down

Your run database (aef_runs.db), configs, outputs, and annotated data are bind-mounted from the repo root and persist across restarts.

Option B — Local development

Start the backend:

uvicorn aef.api.main:app --reload --port 8001

Start the UI:

cd aef-ui
npm install
npm run dev

Open http://localhost:3010. See aef-ui/README.md for full details.

UI Pages

Page          Description
Dashboard     Score trends, model cost breakdown, and runs over time; shows only COMPLETE runs
Agent Config  Register local Python agents or HTTP endpoints
Generate      Run trajectory generation with live progress; view past runs with expandable step-by-step conversations
Feedback      Review GENERATED trajectories with full multi-step detail; submit per-trajectory quality ratings
Evaluate      Select a golden trajectory run, re-execute it, and score against it with dimension radar charts
Evolve        Run self-improvement cycles on evaluated runs; manage memory deltas
Query         Browse runs, trajectories, and evaluations; execute raw SQL

Pipeline Flow

Generate → (GENERATED) → Feedback → (COMPLETE) → Evaluate → (COMPLETE/REGRESSION) → Evolve

Each UI page filters to the appropriate pipeline stage so you only see runs relevant to that step.
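
Per-page filtering of this kind can be sketched as a lookup from page to the run statuses it accepts. The mapping below is one reading of the pipeline flow above, not the UI's actual logic, and the run records are hypothetical. It also filters on status alone, so it cannot tell a COMPLETE run that has only received feedback from one that has been evaluated.

```python
from typing import Dict, List, Set

# Assumed mapping: each page shows runs in the status produced by the previous stage.
PAGE_INPUT_STATUSES: Dict[str, Set[str]] = {
    "Feedback": {"GENERATED"},
    "Evaluate": {"COMPLETE"},
    "Evolve": {"COMPLETE", "REGRESSION"},
}


def runs_for_page(page: str, runs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return only the runs whose status is relevant to the given page."""
    allowed = PAGE_INPUT_STATUSES[page]
    return [run for run in runs if run["status"] in allowed]


runs = [
    {"id": "run_a", "status": "GENERATED"},
    {"id": "run_b", "status": "COMPLETE"},
    {"id": "run_c", "status": "REGRESSION"},
]
print([r["id"] for r in runs_for_page("Feedback", runs)])
print([r["id"] for r in runs_for_page("Evolve", runs)])
```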


Minimal Command Reference

# Generate
aef generate --config <config.json> --n 10

# Generate via hosted endpoint override
aef generate --config <config.json> --agent_endpoint <http://host:port/docs/> --n 10

# Direct A2A tool call
aef a2a --config <config.json> --component generation --tool generate --payload '{"n": 2}'

# Evaluate golden by run id
aef evaluate --config <config.json> --golden <run_id>

# Evaluate via hosted endpoint override
aef evaluate --config <config.json> --agent_endpoint <http://host:port/docs/> --golden <run_id>

# Feedback
aef feedback --agent <agent_name> --text "..."

# Evolve
aef evolve --config <config.json> --n 10

# Evolve via hosted endpoint override
aef evolve --config <config.json> --agent_endpoint <http://host:port/docs/> --n 10

# Compare two eval runs
aef compare --run <run_a> --vs <run_b>

# Query runs / memory
aef query runs --agent <agent_name>
aef query memory --agent <agent_name> --all-memory
aef query memory --agent <agent_name> --history

Documentation


Contributing

Contributions are welcome! See CONTRIBUTING.md for development setup and guidelines.


License

AEF is released under the Apache License 2.0. See LICENSE for details.
