
Local-first Agent batch evaluation and failure analysis CLI


Agent-Eval-CLI

Agent-Eval-CLI is a local-first command-line tool for batch testing Agent/LLM applications, evaluating results, clustering failures, and generating local reports for later tuning work.

The Python package is published as agent-deepeval; the installed console command is agent-eval. This repository currently implements the MVP described in docs/prd.md.

MVP capabilities

  • agent-eval init — create a local evaluation project with eval.yaml, sample cases, runs/, reports/, and a sample script target.
  • agent-eval run — execute cases against a script or HTTP target, evaluate assertions, cluster failures, and write local artifacts.
  • agent-eval inspect — inspect a run, case, or cluster from local files.
  • agent-eval compare — compare two local runs and report pass-rate, case-transition, and cluster-transition deltas.
  • agent-eval export — locate or print repair_input.json for downstream tuning tools.

Default generated projects run fully offline. LLM judging defaults to a stub/disabled configuration, so no API key, DeepEval install, or live LLM call is required for the sample workflow or tests. V1 also supports opt-in DeepEval judging for answer_relevancy when DeepEval is installed and explicitly enabled.

Install

python3 -m pip install agent-deepeval
agent-eval --help

Install for development

python3 -m pip install -e '.[dev,release]'

Quick start

mkdir /tmp/agent-eval-demo
cd /tmp/agent-eval-demo
agent-eval init
agent-eval run
agent-eval inspect --run latest
agent-eval compare --base latest --target latest
agent-eval export --run latest

A run writes the following artifacts (a post-processing sketch follows the list):

  • runs/<run_id>/manifest.json
  • runs/<run_id>/raw_results.jsonl
  • runs/<run_id>/eval_results.jsonl
  • runs/<run_id>/failures.jsonl
  • runs/<run_id>/clusters.json
  • runs/<run_id>/summary.md
  • runs/<run_id>/repair_input.json
  • reports/latest.md
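
All of these artifacts are plain JSON, JSONL, or Markdown files, so they can be post-processed without the CLI. A minimal Python sketch that recomputes the pass rate for the newest run, assuming each eval_results.jsonl line is a JSON object with a boolean passed field (that field name is an assumption, not taken from this README; check a real line for the actual schema):

import json
from pathlib import Path

# Pick the most recently modified run directory under runs/.
run_dir = max(Path("runs").iterdir(), key=lambda p: p.stat().st_mtime)

# Each JSONL line is one evaluated case; "passed" is an assumed field name.
with (run_dir / "eval_results.jsonl").open() as fh:
    results = [json.loads(line) for line in fh if line.strip()]

passed = sum(1 for r in results if r.get("passed"))
print(f"{run_dir.name}: {passed}/{len(results)} cases passed")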

Run comparison

V1.5 adds deterministic local run comparison without changing the normal run artifact contract:

agent-eval compare --base baseline --target target
agent-eval compare --base baseline --target target --show
agent-eval compare --base baseline --target target --output comparison.json

Default output is a concise human-readable summary. --show prints machine-readable JSON instead, and --output writes that JSON to the given path. Comparison output includes cluster_key_version: "v1", pass-rate deltas, per-case transitions, and added/removed/persisted cluster IDs. A normal agent-eval run does not write comparison artifacts.
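
For scripting around that JSON, a hypothetical consumer might look like this; apart from cluster_key_version, the field names below are guesses at the shape implied by the summary above and should be verified against a real comparison.json:

import json

with open("comparison.json") as fh:
    cmp_data = json.load(fh)

# cluster_key_version is documented above; the remaining keys are assumed.
print("cluster key version:", cmp_data.get("cluster_key_version"))
print("pass-rate delta:", cmp_data.get("pass_rate_delta"))
for cluster_id in cmp_data.get("added_clusters", []):
    print("new failure cluster:", cluster_id)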

Failure signatures and repair input are enriched with optional, namespaced analysis fields while preserving existing fields for downstream compatibility. Cluster IDs keep the V1 grouping identity; richer signature fields improve titles and summaries but do not change the hash key.

Target modes

Script mode

eval.yaml can configure a command template:

project:
  mode: script
target:
  script:
    command: "{python} sample_agent.py --input-file {input_file}"

The script receives a temporary case JSON file. It should print JSON to stdout, either:

{"response": {"answer": "..."}, "debug_meta": {"route": "knowledge_qa"}}

or any plain JSON object, which is treated as the response.
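
A minimal script target matching that contract might look like the following; the case-file schema (inputs.query) is an assumption based on the payload-mapping and DeepEval examples later in this README:

import argparse
import json

# Hypothetical sample_agent.py: read the temporary case JSON passed via
# --input-file and print a response JSON (with optional debug_meta) to stdout.
parser = argparse.ArgumentParser()
parser.add_argument("--input-file", required=True)
args = parser.parse_args()

with open(args.input_file) as fh:
    case = json.load(fh)

query = case.get("inputs", {}).get("query", "")  # assumed case schema
print(json.dumps({
    "response": {"answer": f"echo: {query}"},
    "debug_meta": {"route": "knowledge_qa"},
}))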

HTTP mode

HTTP mode supports URL, method, headers, timeout/retry settings, and a minimal payload mapping subset such as $.inputs.query.

If a JSON response contains debug_meta, Agent-Eval uses it for execution-semantic checks; otherwise the run behaves as a black-box evaluation.
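
For local experiments, any JSON-over-HTTP endpoint works as a target. A standard-library sketch, where the request payload shape mirrors the $.inputs.query mapping above and everything else (port, route name, debug_meta contents) is illustrative:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        query = payload.get("inputs", {}).get("query", "")
        body = json.dumps({
            "response": {"answer": f"echo: {query}"},
            # Returning debug_meta opts this target into execution-semantic checks.
            "debug_meta": {"route": "knowledge_qa", "tool_calls": []},
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8808), AgentHandler).serve_forever()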

Assertions

MVP deterministic assertions include the following (a semantics sketch follows the list):

  • field_exists / jsonpath_exists
  • json_schema_match / schema_keys for the documented object-key/type subset
  • contains
  • exact_match
  • http_status
  • numeric_threshold with gt / gte / lt / lte / eq
  • execution checks from expected_execution such as expected_route, must_call_tools, forbid_tools, max_tool_calls, and min_retrieval_docs
  • llm_judge as an offline stub by default, plus opt-in provider: deepeval for answer_relevancy
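
As a rough mental model, the deterministic checks reduce to plain predicates over the response JSON. A hypothetical sketch of a few of them, illustrating the semantics rather than the tool's actual implementation:

import operator

OPS = {"gt": operator.gt, "gte": operator.ge, "lt": operator.lt,
       "lte": operator.le, "eq": operator.eq}

response = {"answer": "Paris is the capital of France.", "confidence": 0.92}

# contains: substring membership in a response field
assert "Paris" in response["answer"]
# exact_match: strict equality against an expected value
assert response["answer"] == "Paris is the capital of France."
# numeric_threshold with gte: compare an extracted number to a bound
assert OPS["gte"](response["confidence"], 0.7)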

Opt-in DeepEval judging

DeepEval support is optional and lazy-loaded. Install the optional extra and enable it explicitly:

python3 -m pip install -e '.[deepeval]'

Then enable it in eval.yaml:

evaluation:
  llm_judge:
    enabled: true
    provider: deepeval
    model: gpt-4.1
    threshold: 0.7

V1 supports the llm_judge metric answer_relevancy. The evaluator maps inputs.query to the DeepEval input and response.answer to actual_output; if those fields are absent, it falls back to a redacted, stable JSON serialization.
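
In DeepEval terms, that mapping corresponds roughly to the following; Agent-Eval performs this internally, so the exact calls here are an illustration of the contract rather than its source (running it requires a configured LLM provider):

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

case = {"inputs": {"query": "What is the capital of France?"}}
response = {"answer": "Paris."}

metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4.1")
test_case = LLMTestCase(
    input=case["inputs"]["query"],      # inputs.query -> DeepEval input
    actual_output=response["answer"],   # response.answer -> actual_output
)
metric.measure(test_case)
print(metric.score, metric.is_successful())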

Privacy defaults

Artifacts are local files. Before writing request/response/debug/error/report data, Agent-Eval redacts common sensitive keys such as Authorization, Cookie, API keys, tokens, passwords, secrets, full prompts, and full intermediate context fields.
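
Conceptually, the redaction is key-based. A hypothetical illustration (the tool's actual key list and masking format are not specified in this README):

SENSITIVE = {"authorization", "cookie", "api_key", "token", "password", "secret"}

def redact(obj):
    # Recursively mask values whose keys look sensitive; pass everything else through.
    if isinstance(obj, dict):
        return {k: "[REDACTED]" if k.lower() in SENSITIVE else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj

print(redact({"Authorization": "Bearer abc123", "inputs": {"query": "hi"}}))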

Current non-goals

  • No SaaS service or Web Dashboard
  • No online observability/Langfuse dependency
  • No automatic code patching, prompt modification, or PR creation
  • No mandatory Claude Code runtime dependency
  • No complete local adapter mode in this MVP
  • No mandatory live LLM or DeepEval call in default workflows
  • No LLM cluster naming/summary or full deep report in V1

Release and verification

Local release readiness is checked with:

python scripts/check-release.py

That gate runs the test suite, compileall, wheel/sdist builds, twine check, an installed-wheel smoke test, and a fresh-project agent-eval init/run/inspect/export/compare flow. See docs/release-checklist.md for the TestPyPI and PyPI publishing gates; TestPyPI must pass before uploading to PyPI.

Development verification is unchanged:

python3 -m pip install -e '.[dev,release]'
pytest
