Agent-Eval
Local-first CLI for batch-evaluating AI Agents, clustering failures, and tracking regressions — all without leaving your machine.
The package is published as agent-deepeval on PyPI; the installed command is agent-eval.
Highlights
- Fully offline by default — no API keys, no SaaS, no live LLM calls required. Every evaluation runs on local files with deterministic rule assertions.
- Three target modes — test local scripts (`mode: script`), HTTP APIs (`mode: http`), or in-process Python functions (`mode: adapter`) without changing your test cases.
- Automatic failure clustering — failed cases are grouped by failure signature (error code, route, tool, assertion type, tags, etc.) so you can fix batches of problems at once.
- Run comparison & regression tracking — `agent-eval compare` shows pass-rate deltas, per-case transitions (passed↔failed), and cluster evolution between any two runs.
- Repair export — `agent-eval export` produces a structured `repair_input.json` with clustered evidence for downstream tuning pipelines.
- Privacy-first — artifacts are local files; sensitive keys (Authorization, Cookie, API keys, tokens, passwords, prompts) are redacted before writing.
- Opt-in LLM judging — swap the stub judge for DeepEval's `answer_relevancy` metric when you need semantic scoring.
Install
pip install agent-deepeval
agent-eval --help
For development:
pip install -e '.[dev,release]'
Core workflow
init → write test cases → run (execute + evaluate + cluster) → inspect / compare → export
10-minute local onboarding smoke
Use this path when you want to prove the package works locally with no API keys, SaaS account, live LLM call, or publish step. It runs the generated sample Agent and exercises the core artifacts a new user needs first.
mkdir my-agent-eval && cd my-agent-eval
agent-eval init
agent-eval run
agent-eval inspect --run latest
agent-eval export --run latest
agent-eval compare --base latest --target latest   # self-comparison, just to smoke-test the command
The default generated project is intentionally small: one sample case passes and one sample case fails. That expected failure is useful because it proves failure clustering, summary.md, and repair_input.json are populated during the first local run. After the smoke completes, inspect these landmarks:
- `runs/latest.txt` — points at the latest run directory.
- `runs/<run_id>/summary.md` — human-readable local failure analysis.
- `runs/<run_id>/repair_input.json` — machine-readable repair/tuning input.
- `reports/latest.md` — shortcut copy of the latest Markdown report.
1. Initialize a project
mkdir my-agent-eval && cd my-agent-eval
agent-eval init
This scaffolds:
| File/Dir | Purpose |
|---|---|
| `eval.yaml` | Project configuration (target mode, evaluation rules, clustering settings) |
| `cases/sample.jsonl` | Example test cases |
| `sample_agent.py` | Example Agent script |
| `runs/` | Artifact output directory |
| `reports/` | Human-readable summary reports |
2. Write test cases
Each line in cases/*.jsonl is a test case:
{
"id": "pricing-query",
"tags": ["smoke", "rag"],
"priority": "p1",
"inputs": { "query": "What is the pricing for product X?" },
"assertions": [
{ "type": "contains", "target": "$.answer", "expected": "pricing" }
],
"expected_execution": {
"expected_route": "knowledge_qa",
"must_call_tools": ["retriever.search"],
"min_retrieval_docs": 1
}
}
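To make the assertion semantics concrete, here is a minimal sketch of how a `contains` assertion against a `$.answer` target could be checked. The `resolve` helper is a hypothetical simplification of JSONPath resolution, not the CLI's actual implementation:

```python
def resolve(path: str, doc: dict):
    """Resolve a minimal '$.a.b' style path (a simplification of JSONPath)."""
    node = doc
    for key in path.lstrip("$.").split("."):
        node = node[key]
    return node

def check_contains(assertion: dict, response: dict) -> bool:
    # 'contains' passes when the expected substring appears in the target value.
    value = resolve(assertion["target"], response)
    return assertion["expected"] in str(value)

response = {"answer": "Product X pricing starts at $10/month."}
assertion = {"type": "contains", "target": "$.answer", "expected": "pricing"}
print(check_contains(assertion, response))  # True
```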
Optional evaluation_policy controls deterministic aggregation:
{
"evaluation_policy": {
"reruns": 2,
"pass_rule": "majority"
}
}
- `runner.retry_times` retries a failed script/HTTP/adapter call inside one evaluation attempt.
- `evaluation_policy.reruns` runs additional independent evaluation attempts for the case (`reruns: 2` means 3 attempts total).
- `pass_rule` is applied in two layers: first to assertions inside each attempt, then to pass/fail outcomes across rerun attempts (`all`, `any`, or strict `majority`).
- Rerun-enabled runs keep the normal case-level artifacts compatible and add `attempts.jsonl` with per-attempt raw/eval details.
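The two-layer pass rule can be sketched as follows. This is our reading of the documented semantics (strict majority means more than half), with hypothetical function names:

```python
def aggregate(outcomes: list, rule: str) -> bool:
    # Apply a pass rule to a list of boolean outcomes.
    if rule == "all":
        return all(outcomes)
    if rule == "any":
        return any(outcomes)
    if rule == "majority":
        # Strict majority: more than half must pass.
        return sum(outcomes) * 2 > len(outcomes)
    raise ValueError(f"unknown pass_rule: {rule}")

def case_passes(attempts: list, rule: str) -> bool:
    # Layer 1: aggregate assertion results within each attempt.
    # Layer 2: aggregate attempt outcomes across reruns.
    per_attempt = [aggregate(assertions, rule) for assertions in attempts]
    return aggregate(per_attempt, rule)

# reruns: 2 → three attempts; two of three attempts pass, so "majority" passes.
attempts = [[True, True], [True, False], [True, True]]
print(case_passes(attempts, "majority"))  # True
```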
Assertions supported out of the box:
- `contains` / `exact_match` — string and value matching
- `field_exists` / `jsonpath_exists` — presence checks
- `json_schema_match` / `schema_keys` — object shape validation
- `numeric_threshold` — numeric comparisons (`gt`, `gte`, `lt`, `lte`, `eq`)
- `http_status` — HTTP response code checks
- `expected_execution` — semantic checks (route, tool calls, retrieval doc count, fallback behavior)
- `llm_judge` — LLM-as-judge (stub by default; opt-in DeepEval)
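A case can mix several assertion types. The sketch below follows the `type`/`target`/`expected` shape shown earlier; the `op` field name for `numeric_threshold` is an assumption, so check the scaffolded sample cases for the exact schema:

```json
{
  "id": "confidence-check",
  "inputs": { "query": "What is the refund policy?" },
  "assertions": [
    { "type": "contains", "target": "$.answer", "expected": "refund" },
    { "type": "jsonpath_exists", "target": "$.sources" },
    { "type": "numeric_threshold", "target": "$.confidence", "op": "gte", "expected": 0.8 }
  ]
}
```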
3. Run evaluation
agent-eval run
The pipeline executes in sequence:
- Run Agent — sends each test case to your Agent (script, HTTP, or Python adapter)
- Evaluate — checks every assertion against the Agent's response
- Cluster failures — groups failed cases by their failure signature
- Write artifacts — saves all results to `runs/<run_id>/`
runs/<run_id>/
├── manifest.json # run metadata and config snapshot
├── raw_results.jsonl # raw Agent responses
├── eval_results.jsonl # assertion pass/fail details
├── attempts.jsonl # per-rerun attempt details, only when reruns are enabled
├── failures.jsonl # failed cases with failure signatures
├── clusters.json # grouped failure clusters
├── summary.md # human-readable run summary and failure analysis
└── repair_input.json # export-ready repair analysis with matching structured evidence
Options:
agent-eval run --config eval.yaml # custom config path
agent-eval run --dataset cases/extra.jsonl # override dataset
agent-eval run --run-name baseline # name this run
agent-eval run --concurrency 4 # parallel execution
4. Inspect & compare
# Inspect a specific run, case, or cluster
agent-eval inspect --run latest
agent-eval inspect --run latest --case pricing-query
agent-eval inspect --run latest --cluster c1
# Compare two runs (e.g. before/after a prompt change)
agent-eval compare --base baseline --target improved
agent-eval compare --base baseline --target improved --output comparison.json
agent-eval compare --base baseline --target improved --show
Comparison output includes:
- Pass-rate delta between runs
- Per-case transitions: `passed→failed`, `failed→passed`, unchanged
- Cluster transitions: added, removed, persisted failure groups
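The per-case transition logic amounts to diffing two pass/fail maps. A minimal sketch (hypothetical helper, not the CLI's internals), assuming each run yields a `case id → passed?` mapping:

```python
def transitions(base: dict, target: dict) -> dict:
    """Classify per-case pass/fail transitions between two runs.

    base/target map case id -> passed?, for cases present in both runs.
    """
    out = {"passed->failed": [], "failed->passed": [], "unchanged": []}
    for case_id in sorted(base.keys() & target.keys()):
        before, after = base[case_id], target[case_id]
        if before and not after:
            out["passed->failed"].append(case_id)
        elif after and not before:
            out["failed->passed"].append(case_id)
        else:
            out["unchanged"].append(case_id)
    return out

base = {"pricing-query": True, "refund-query": False}
target = {"pricing-query": True, "refund-query": True}
print(transitions(base, target))
```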
5. Export for tuning
agent-eval export --run latest
Produces repair_input.json with clustered failure evidence, representative cases, signature explanations, affected areas, and suggested investigation steps — ready for downstream prompt-tuning or code-fix pipelines.
`summary.md` and `repair_input.json` are generated from the same local analysis layer. The Markdown report is optimized for human triage, while the JSON keeps additive structured fields under `analysis` for automation. Legacy fields such as `clusters[].cases`, `common_signature`, `evidence`, and `suspected_modules` remain stable. When local evidence cannot identify a suspected module, the CLI leaves `suspected_modules` empty instead of guessing.
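A downstream pipeline can consume the legacy fields directly. A small sketch using the documented `clusters[].cases`, `common_signature`, and `suspected_modules` keys (the one-line-per-cluster format here is our own choice, not part of the file):

```python
import json

def summarize_clusters(repair: dict) -> list:
    # One triage line per failure cluster, using the documented legacy fields.
    lines = []
    for cluster in repair.get("clusters", []):
        sig = cluster.get("common_signature", "?")
        cases = cluster.get("cases", [])
        modules = cluster.get("suspected_modules", []) or ["(unknown)"]
        lines.append(f"{sig}: {len(cases)} case(s), modules: {', '.join(modules)}")
    return lines

# In a real project: repair = json.load(open("runs/<run_id>/repair_input.json"))
repair = {"clusters": [
    {"common_signature": "assertion:contains", "cases": ["pricing-query"], "suspected_modules": []},
]}
print(summarize_clusters(repair))  # ['assertion:contains: 1 case(s), modules: (unknown)']
```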
Target modes
Script mode (default)
project:
mode: script
target:
script:
command: "{python} sample_agent.py --input-file {input_file}"
Your script receives a temp JSON file, prints JSON to stdout:
{"response": {"answer": "..."}, "debug_meta": {"route": "knowledge_qa"}}
HTTP mode
project:
mode: http
target:
http:
url: "http://localhost:8000/chat"
method: "POST"
headers:
Content-Type: "application/json"
payload_mapping:
query: "$.inputs.query"
If the JSON response contains debug_meta, Agent-Eval uses it for execution-semantic checks; otherwise evaluation runs in black-box mode.
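The `payload_mapping` step can be pictured as extracting values from the case by path and placing them under the mapped field names. A sketch with a hypothetical helper and minimal `$.a.b` path support (real JSONPath handling may be richer):

```python
def build_payload(mapping: dict, case: dict) -> dict:
    """Build the HTTP POST body from a payload_mapping of field -> '$.path'."""
    def resolve(path: str):
        node = case
        for key in path.lstrip("$.").split("."):
            node = node[key]
        return node
    return {field: resolve(path) for field, path in mapping.items()}

case = {"inputs": {"query": "What is the pricing for product X?"}}
payload = build_payload({"query": "$.inputs.query"}, case)
print(payload)  # {'query': 'What is the pricing for product X?'}
```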
Python adapter mode
project:
mode: adapter
target:
adapter:
module: my_agent_adapter
function: run
Your adapter function is imported from the project root and receives the full case as a dictionary:
def run(case: dict) -> dict:
query = case["inputs"]["query"]
return {
"response": {"answer": f"answer for {query}"},
"debug_meta": {"route": "knowledge_qa"},
}
You may also return a raw response object directly, for black-box evaluation without debug_meta. Adapter mode is synchronous and in-process: a raised TimeoutError is recorded as a timeout result, but runner.timeout_seconds is not a hard cancellation mechanism for hung Python code in this MVP. If you run adapter cases concurrently, the adapter function must be thread-safe.
Opt-in DeepEval judging
pip install -e '.[deepeval]'
evaluation:
llm_judge:
enabled: true
provider: deepeval
model: gpt-4.1
threshold: 0.7
Release & verification
python scripts/check-release.py # full local gate: tests, build, twine, wheel smoke, e2e; no upload
python scripts/publish-release.py # dry-run PyPI checks for a separately authorized release
python scripts/publish-release.py --publish # upload only when a real release is explicitly authorized
For adoption-polish work, stop at local and dry-run gates. Do not run the --publish command unless a separate release decision grants publish authority.
Non-goals
- No SaaS service or web dashboard
- No online observability / Langfuse dependency
- No automatic code patching or prompt modification
- No mandatory live LLM or DeepEval dependency