Agent-Eval-CLI
Local-first Agent batch evaluation and failure analysis CLI
Agent-Eval-CLI is a local-first command-line tool for batch testing Agent/LLM applications, evaluating results, clustering failures, and generating local reports for later tuning work.
The Python package is published as agent-deepeval; the installed console command is agent-eval. This repository currently implements the MVP described in docs/prd.md.
MVP capabilities
- agent-eval init — create a local evaluation project with eval.yaml, sample cases, runs/, reports/, and a sample script target.
- agent-eval run — execute cases against a script or HTTP target, evaluate assertions, cluster failures, and write local artifacts.
- agent-eval inspect — inspect a run, case, or cluster from local files.
- agent-eval compare — compare two local runs and report pass-rate, case-transition, and cluster-transition deltas.
- agent-eval export — locate or print repair_input.json for downstream tuning tools.
Default generated projects run fully offline. LLM judging defaults to a stub/disabled configuration; no API key, DeepEval install, or live LLM call is required for the sample workflow or tests. V1 also supports opt-in DeepEval judging for answer_relevancy when installed and explicitly enabled.
Install
python3 -m pip install agent-deepeval
agent-eval --help
Install for development
python3 -m pip install -e '.[dev,release]'
Quick start
mkdir /tmp/agent-eval-demo
cd /tmp/agent-eval-demo
agent-eval init
agent-eval run
agent-eval inspect --run latest
agent-eval compare --base latest --target latest
agent-eval export --run latest
A run writes:
- runs/<run_id>/manifest.json
- runs/<run_id>/raw_results.jsonl
- runs/<run_id>/eval_results.jsonl
- runs/<run_id>/failures.jsonl
- runs/<run_id>/clusters.json
- runs/<run_id>/summary.md
- runs/<run_id>/repair_input.json
- reports/latest.md
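For scripting around a finished run, a minimal sketch like the following can confirm the artifacts are present. It assumes only the directory layout listed above; latest_run is a hypothetical helper, not part of the tool:

from pathlib import Path

EXPECTED = [
    "manifest.json", "raw_results.jsonl", "eval_results.jsonl",
    "failures.jsonl", "clusters.json", "summary.md", "repair_input.json",
]

def latest_run(runs_dir: Path = Path("runs")) -> Path:
    # Pick the most recently modified run directory.
    return max((p for p in runs_dir.iterdir() if p.is_dir()),
               key=lambda p: p.stat().st_mtime)

run = latest_run()
missing = [name for name in EXPECTED if not (run / name).exists()]
status = "complete" if not missing else f"missing {missing}"
print(f"run {run.name}: {status}")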
Run comparison
V1.5 adds deterministic local run comparison without changing the normal run artifact contract:
agent-eval compare --base baseline --target target
agent-eval compare --base baseline --target target --show
agent-eval compare --base baseline --target target --output comparison.json
The default output is a concise human-readable summary; --show prints machine-readable JSON, and --output writes that JSON to the requested path. Comparison output includes cluster_key_version: "v1", pass-rate deltas, per-case transitions, and added/removed/persisted cluster IDs. A normal agent-eval run does not write comparison artifacts.
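As a sketch of downstream consumption, the JSON written by --output can be loaded directly. Only cluster_key_version is named above, so the other key names here are hypothetical:

import json
from pathlib import Path

data = json.loads(Path("comparison.json").read_text())
assert data["cluster_key_version"] == "v1"  # documented version tag

# Hypothetical key names for the deltas described above.
print("pass-rate delta:", data.get("pass_rate_delta"))
print("added clusters:", data.get("added_clusters"))
print("removed clusters:", data.get("removed_clusters"))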
Failure signatures and repair input are enriched with optional, namespaced analysis fields while preserving existing fields for downstream compatibility. Cluster IDs keep the V1 grouping identity; richer signature fields improve titles and summaries but do not change the hash key.
Target modes
Script mode
eval.yaml can configure a command template:
project:
  mode: script
target:
  script:
    command: "{python} sample_agent.py --input-file {input_file}"
The script receives the path to a temporary case JSON file via the {input_file} placeholder. It should print JSON to stdout, either:
{"response": {"answer": "..."}, "debug_meta": {"route": "knowledge_qa"}}
or any plain JSON object, which is treated as the response.
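For reference, a minimal script target could be sketched like this. The response/debug_meta shape comes from the example above, while the inputs.query handling is an assumption about the generated sample cases:

import argparse
import json

# Parse the --input-file flag from the command template.
parser = argparse.ArgumentParser()
parser.add_argument("--input-file", required=True)
args = parser.parse_args()

with open(args.input_file, encoding="utf-8") as f:
    case = json.load(f)  # the temporary case JSON written by agent-eval

# A real agent would do actual work here.
answer = f"echo: {case.get('inputs', {}).get('query', '')}"

# Print the documented response shape to stdout.
print(json.dumps({
    "response": {"answer": answer},
    "debug_meta": {"route": "knowledge_qa"},
}))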
HTTP mode
HTTP mode supports URL, method, headers, timeout/retry settings, and a minimal payload mapping subset such as $.inputs.query.
If a JSON response contains debug_meta, Agent-Eval uses it for execution-semantic checks; otherwise the run behaves as a black-box evaluation.
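An HTTP target configuration might look roughly like the sketch below. Apart from the mode and the $.inputs.query mapping mentioned above, the key names are assumptions rather than the documented schema:

project:
  mode: http
target:
  http:
    url: "http://localhost:8000/agent"
    method: POST
    headers:
      Content-Type: application/json
    timeout_s: 30   # assumed key name for the timeout setting
    retries: 1      # assumed key name for the retry setting
    payload:
      query: "$.inputs.query"   # minimal payload mapping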
Assertions
MVP deterministic assertions include:
- field_exists / jsonpath_exists
- json_schema_match / schema_keys for the documented object-key/type subset
- contains
- exact_match
- http_status
- numeric_threshold with gt/gte/lt/lte/eq
- execution checks from expected_execution such as expected_route, must_call_tools, forbid_tools, max_tool_calls, and min_retrieval_docs
- llm_judge as an offline stub by default, plus opt-in provider: deepeval for answer_relevancy
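For illustration only, a case might combine several of these checks along the following lines. The assertion type names come from the list above, but the surrounding case-file keys are assumptions (see docs/prd.md for the real schema):

- id: sample-qa
  inputs:
    query: "What does Agent-Eval do?"
  assertions:
    - type: contains
      path: "$.response.answer"
      value: "local-first"
    - type: numeric_threshold
      path: "$.response.confidence"
      gte: 0.5
  expected_execution:
    expected_route: knowledge_qa
    max_tool_calls: 3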
Opt-in DeepEval judging
DeepEval support is optional and lazy-loaded. Install the optional extra and enable it explicitly:
python3 -m pip install -e '.[deepeval]'
evaluation:
  llm_judge:
    enabled: true
    provider: deepeval
    model: gpt-4.1
    threshold: 0.7
V1 supports llm_judge metric answer_relevancy. The evaluator maps inputs.query to DeepEval input and response.answer to actual_output; if those fields are absent it falls back to redacted stable JSON.
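Conceptually, the mapping resembles this sketch built on DeepEval's public LLMTestCase and AnswerRelevancyMetric API; the fallback serialization is simplified and omits redaction:

import json
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def judge(case: dict, response: dict, threshold: float = 0.7) -> bool:
    # Documented field mapping, with a stable-JSON fallback.
    query = case.get("inputs", {}).get("query") or json.dumps(case, sort_keys=True)
    answer = response.get("answer") or json.dumps(response, sort_keys=True)
    metric = AnswerRelevancyMetric(threshold=threshold, model="gpt-4.1")
    metric.measure(LLMTestCase(input=query, actual_output=answer))
    return metric.score >= threshold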
Privacy defaults
Artifacts are local files. Before writing request/response/debug/error/report data, Agent-Eval redacts common sensitive keys such as Authorization, Cookie, API keys, tokens, passwords, secrets, full prompts, and full intermediate context fields.
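As a behavioral illustration (not the tool's actual implementation), key-based redaction amounts to something like:

SENSITIVE = {"authorization", "cookie", "api_key", "token",
             "password", "secret", "prompt", "context"}

def redact(value):
    # Recursively replace values whose key names look sensitive.
    if isinstance(value, dict):
        return {k: "[REDACTED]" if any(s in k.lower() for s in SENSITIVE)
                else redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value

print(redact({"Authorization": "Bearer abc",
              "debug_meta": {"route": "knowledge_qa"}}))
# {'Authorization': '[REDACTED]', 'debug_meta': {'route': 'knowledge_qa'}}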
Current non-goals
- No SaaS service or Web Dashboard
- No online observability/Langfuse dependency
- No automatic code patching, prompt modification, or PR creation
- No mandatory Claude Code runtime dependency
- No complete local adapter mode in this MVP
- No mandatory live LLM or DeepEval call in default workflows
- No LLM cluster naming/summary or full deep report in V1
Release and verification
Local release readiness is checked with:
python scripts/check-release.py
That gate runs tests, compileall, a wheel/sdist build, twine check, an installed-wheel smoke test, and a fresh-project agent-eval init/run/inspect/export/compare flow. See docs/release-checklist.md for the TestPyPI and PyPI publishing gates; the TestPyPI gate must pass before uploading to PyPI.
Development-only verification remains:
python3 -m pip install -e '.[dev,release]'
pytest