Standalone Agent Evaluation Framework (AEF)
AEF - Agent Evaluation Framework
AEF is a framework to generate tests, run/evaluate trajectories, collect feedback, and self-evolve agent behavior.
The workflow is intentionally minimal and framework-agnostic:
- aef generate calls the generation component/tool
- aef evaluate calls the evaluation component/tool
- aef feedback calls the feedback component/tool
- aef evolve calls the evolution component/tool
Internally, these are routed through an A2A bus so the same flow works for sub-agents implemented with different frameworks.
Installation
From PyPI
Install via pip or uv:
pip install aef-framework
or with uv:
uv pip install aef-framework
Local Development Install with uv
AEF uses uv for fast, reliable Python package management.
1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
2. Create a virtual environment
cd AEF
uv venv --python=3.11
This creates a .venv directory with Python 3.11. Pass --python=3.10 or --python=3.12 instead if you need a different version.
3. Activate the virtual environment
source .venv/bin/activate # Linux/macOS
# or
.venv\Scripts\activate # Windows
4. Install AEF in editable mode
uv pip install -e .
This installs AEF and all dependencies, making the aef command available.
5. Verify installation
aef --help
Traditional pip install (local)
If you prefer using pip:
python -m venv .venv
source .venv/bin/activate
pip install -e .
Core Principles
- Universal sub-agent support via adapter contract (python, cli, http)
- Single essential loop: Generate → Evaluate → Feedback → Evolve
- Composable A2A components instead of tightly-coupled command logic
- Versioned evolution profiles with before/after evaluation comparison
Basic Workflow
1) Generate trajectories
aef generate --config configs/fleet_ccc_run.json --n 10
2) Evaluate against a golden run
aef evaluate --config configs/fleet_ccc_run.json --golden run_YYYYMMDD_xxxxxx
3) Submit feedback
aef feedback --agent fleet_ccc --text "Agent should ask confirmation before delete operations"
4) Evolve (auto-apply + compare)
aef evolve --config configs/fleet_ccc_run.json --n 10
aef evolve now performs:
- baseline evaluate
- classify feedback into amendments
- apply evolution profile
- re-evaluate and report before/after score delta
Use AEF With Any Sub-Agent
Set agent.adapter_type in your config:
- python: ADK/Python agent entrypoint module_or_file.py:agent_var
- cli: shell command template using {step} / {goal} placeholders
- http: endpoint that accepts { goal, step, session_id? }
See detailed usage in docs/USING_ANY_SUBAGENT.md.
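As a rough illustration of the cli and http adapter inputs described above, the sketch below fills a command template and builds a request body. The function names, the sample template, and the choice to treat step as a string are assumptions made for this example; AEF's own adapter code may differ.

```python
import shlex
from typing import Optional


def render_cli_command(template: str, goal: str, step: str) -> str:
    """Fill {goal} and {step} in a command template, shell-quoting each value."""
    return template.format(goal=shlex.quote(goal), step=shlex.quote(step))


def build_http_body(goal: str, step: str, session_id: Optional[str] = None) -> dict:
    """Build a JSON-serializable body with goal, step, and an optional session_id."""
    body = {"goal": goal, "step": step}
    if session_id is not None:
        body["session_id"] = session_id
    return body


# Hypothetical template; a real one would invoke your own agent's CLI.
command = render_cli_command(
    "my-agent --goal {goal} --step {step}", "list all servers", "1"
)
print(command)
print(build_http_body("list all servers", "1"))
```

Quoting each value before substitution keeps goals that contain spaces or shell metacharacters from being split or interpreted by the shell.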
Runtime endpoint mode (--agent_endpoint)
In addition to config-defined adapters, you can override execution at runtime:
- If --agent_endpoint is provided, AEF routes calls to the agent under test (AUT) through HTTP endpoint mode.
- If --agent_endpoint is not provided, AEF keeps existing behavior (for example local --sub_agent / config adapter).
Endpoint-mode guarantee:
- AEF uses the hosted endpoint runner path for AUT execution.
- Local Python entrypoint loading is not required in this mode.
- Local entrypoint import/path issues do not block endpoint-mode execution.
- Endpoint runner uses ADK server contract: create/reuse session, then call POST /run.
This is useful when the same agent can run locally in development and remotely in a hosted ADK/A2A service.
Examples:
# Generate through hosted endpoint
aef generate --config configs/fleet_ilo_run.json \
--agent_endpoint http://localhost:8086/docs/ --n 2
# Evaluate through hosted endpoint
aef evaluate --config configs/fleet_ilo_run.json \
--agent_endpoint http://localhost:8086/docs/ --golden run_YYYYMMDD_xxxxxx
# Evolve through hosted endpoint
aef evolve --config configs/fleet_ilo_run.json \
--agent_endpoint http://localhost:8086/docs/ --n 5
Notes:
- /docs/ URLs are supported (AEF resolves them to API base).
- Endpoint mode is available on generate, evaluate, run, and evolve.
- Endpoint mode is intended for ADK/A2A-hosted AUTs where the AUT is reachable over API.
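The first note says that a /docs/ URL is resolved to the API base. One plausible way to do that is to drop a trailing /docs path segment, sketched below. The function name resolve_api_base and the exact rule are assumptions; AEF's actual resolution logic is not shown in this document.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_api_base(endpoint: str) -> str:
    """Drop a trailing /docs segment so the URL points at the API root."""
    parts = urlsplit(endpoint)
    path = parts.path.rstrip("/")
    if path.endswith("/docs"):
        path = path[: -len("/docs")]
    # Query string and fragment are discarded; only scheme, host, and path remain.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


print(resolve_api_base("http://localhost:8086/docs/"))
```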
ADK endpoint contract used by --agent_endpoint:
- Session bootstrap: POST /apps/{app_name}/users/{user_id}/sessions/{session_id}
- Inference: POST /run
- Request payload: {"appName","userId","sessionId","newMessage":{"role":"user","parts":[{"text":...}]},"streaming":false}
- A 409 Conflict during session creation is treated as expected session reuse.
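The contract above can be sketched as three small helpers: one for the session URL, one for the POST /run body, and one for the session-creation status check. The helper names and sample values are hypothetical, and treating every 2xx status as success is an assumption; only the paths, payload keys, and the 409 rule come from the list above.

```python
import json


def session_url(base: str, app_name: str, user_id: str, session_id: str) -> str:
    """Build the session bootstrap URL from the path template in the contract."""
    return f"{base}/apps/{app_name}/users/{user_id}/sessions/{session_id}"


def run_payload(app_name: str, user_id: str, session_id: str, text: str) -> dict:
    """Build the POST /run body with the keys listed in the contract."""
    return {
        "appName": app_name,
        "userId": user_id,
        "sessionId": session_id,
        "newMessage": {"role": "user", "parts": [{"text": text}]},
        "streaming": False,
    }


def session_ready(status_code: int) -> bool:
    """Accept any 2xx status, or 409 Conflict, which signals an existing session."""
    return 200 <= status_code < 300 or status_code == 409


# Hypothetical identifiers, used only to show the resulting shapes.
print(session_url("http://localhost:8086", "fleet_ilo", "aef", "s1"))
print(json.dumps(run_payload("fleet_ilo", "aef", "s1", "List all servers")))
```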
Trajectory logging in endpoint mode:
- steps[].content stores assistant text response.
- steps[].tool_calls, steps[].tool_responses, and steps[].tools_used store tool trace data when present.
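To show how these fields might be consumed, the sketch below finds the steps that carry tool trace data. It assumes a trajectory is a dict with a steps list of dicts; the sample trajectory, the tool name, and the function name are made up for illustration, and the real log format may hold additional fields.

```python
def steps_with_tool_trace(trajectory: dict) -> list:
    """Return the indices of steps that carry any non-empty tool trace field."""
    trace_keys = ("tool_calls", "tool_responses", "tools_used")
    return [
        index
        for index, step in enumerate(trajectory.get("steps", []))
        if any(step.get(key) for key in trace_keys)
    ]


# Hypothetical trajectory: the first step is text only, the second used a tool.
trajectory = {
    "steps": [
        {"content": "Which server should I check?"},
        {
            "content": "The server is healthy.",
            "tool_calls": [{"name": "get_status"}],
            "tools_used": ["get_status"],
        },
    ]
}
print(steps_with_tool_trace(trajectory))
```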
Full prerequisites and onboarding checklist:
A2A Components
AEF components exposed through the internal bus:
- generation.generate
- evaluation.evaluate
- feedback.submit_text
- feedback.submit_annotations
- evolution.evolve
Evolution Outputs
Evolution applies and versions runtime amendments per agent under:
- prompts/evolution_profiles/<agent>/latest.json
- prompts/evolution_profiles/<agent>/profile_<timestamp>.json
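The path layout above can be expressed as a small helper, for example when a script needs to locate an agent's current profile. The function name is hypothetical, and the timestamp is passed in as a plain string because its format is not documented here.

```python
from pathlib import PurePosixPath


def profile_paths(agent: str, timestamp: str) -> tuple:
    """Return (latest, versioned) profile paths for one agent."""
    root = PurePosixPath("prompts/evolution_profiles") / agent
    return root / "latest.json", root / f"profile_{timestamp}.json"


# The timestamp value here is a placeholder, not AEF's real format.
latest, versioned = profile_paths("fleet_ccc", "TIMESTAMP")
print(latest)
print(versioned)
```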
These profiles contain:
- prompt addenda
- tool policies
- generator hints
- agent hints
- rubric updates
Web UI
AEF includes a Next.js web interface for managing agents, running benchmarks, reviewing trajectories, and tracking evolution.
Option A — Docker (recommended)
Run both the backend API and the web UI with a single command:
docker compose up -d
- Frontend: http://localhost:3010
- Backend API: http://localhost:8001
To rebuild after code changes:
docker compose up -d --build
To stop:
docker compose down
Your run database (aef_runs.db), configs, outputs, and annotated data are bind-mounted from the repo root and persist across restarts.
Option B — Local development
Start the backend:
uvicorn aef.api.main:app --reload --port 8001
Start the UI:
cd aef-ui
npm install
npm run dev
Open http://localhost:3010. See aef-ui/README.md for full details.
UI Pages
| Page | Description |
|---|---|
| Dashboard | Score trends, model cost breakdown, runs-over-time — shows only COMPLETE runs |
| Agent Config | Register local Python agents or HTTP endpoints |
| Generate | Run trajectory generation with live progress, view past runs with expandable step-by-step conversations |
| Feedback | Review GENERATED trajectories with full multi-step detail, submit per-trajectory quality ratings |
| Evaluate | Select a golden trajectory run, re-execute and score against it with dimension radar charts |
| Evolve | Run self-improvement cycles on evaluated runs, manage memory deltas |
| Query | Browse runs, trajectories, evaluations and execute raw SQL |
Pipeline Flow
Generate → (GENERATED) → Feedback → (COMPLETE) → Evaluate → (COMPLETE/REGRESSION) → Evolve
Each UI page filters to the appropriate pipeline stage so you only see runs relevant to that step.
Minimal Command Reference
# Generate
aef generate --config <config.json> --n 10
# Generate via hosted endpoint override
aef generate --config <config.json> --agent_endpoint <http://host:port/docs/> --n 10
# Direct A2A tool call
aef a2a --config <config.json> --component generation --tool generate --payload '{"n": 2}'
# Evaluate golden by run id
aef evaluate --config <config.json> --golden <run_id>
# Evaluate via hosted endpoint override
aef evaluate --config <config.json> --agent_endpoint <http://host:port/docs/> --golden <run_id>
# Feedback
aef feedback --agent <agent_name> --text "..."
# Evolve
aef evolve --config <config.json> --n 10
# Evolve via hosted endpoint override
aef evolve --config <config.json> --agent_endpoint <http://host:port/docs/> --n 10
# Compare two eval runs
aef compare --run <run_a> --vs <run_b>
# Query runs / memory
aef query runs --agent <agent_name>
aef query memory --agent <agent_name> --all-memory
aef query memory --agent <agent_name> --history
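When driving the direct A2A tool call from a script, building the argument list in Python avoids hand-quoting the JSON payload. This is a minimal sketch that uses only the flags shown in the reference above; the helper name a2a_command is hypothetical.

```python
import json
import shlex


def a2a_command(config: str, component: str, tool: str, payload: dict) -> list:
    """Build the argv list for an 'aef a2a' call, serializing the payload as JSON."""
    return [
        "aef", "a2a",
        "--config", config,
        "--component", component,
        "--tool", tool,
        "--payload", json.dumps(payload),
    ]


argv = a2a_command("configs/fleet_ccc_run.json", "generation", "generate", {"n": 2})
# shlex.join renders the list as a copy-pasteable shell command.
print(shlex.join(argv))
```

The list can be passed to subprocess.run(argv, check=True) once aef is installed and on the PATH.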
Documentation
- docs/AEF_WORKFLOW.md
- docs/A2A_COMPONENTS.md
- docs/USING_ANY_SUBAGENT.md
- docs/SELF_EVOLUTION.md
- docs/PUBLISHING.md - PyPI package publishing guide
Contributing
Contributions are welcome! See CONTRIBUTING.md for development setup and guidelines.
License
AEF is released under the Apache License 2.0. See LICENSE for details.