
Local-first Agent batch evaluation and failure analysis CLI


Agent-Eval-CLI

Agent-Eval-CLI is a local-first command-line tool for batch testing Agent/LLM applications, evaluating results, clustering failures, and generating local reports for later tuning work.

The Python package is published as agent-deepeval; the installed console command is agent-eval. This repository currently implements the MVP described in docs/prd.md.

MVP capabilities

  • agent-eval init — create a local evaluation project with eval.yaml, sample cases, runs/, reports/, and a sample script target.
  • agent-eval run — execute cases against a script or HTTP target, evaluate assertions, cluster failures, and write local artifacts.
  • agent-eval inspect — inspect a run, case, or cluster from local files.
  • agent-eval compare — compare two local runs and report pass-rate, case-transition, and cluster-transition deltas.
  • agent-eval export — locate or print repair_input.json for downstream tuning tools.

Default generated projects run fully offline. LLM judging defaults to a stub/disabled configuration, so no API key, DeepEval install, or live LLM call is required for the sample workflow or tests. V1 also supports opt-in DeepEval judging for answer_relevancy when DeepEval is installed and explicitly enabled.

Install

python3 -m pip install agent-deepeval
agent-eval --help

Install for development

python3 -m pip install -e '.[dev,release]'

Quick start

mkdir /tmp/agent-eval-demo
cd /tmp/agent-eval-demo
agent-eval init
agent-eval run
agent-eval inspect --run latest
agent-eval compare --base latest --target latest
agent-eval export --run latest

A run writes the following artifacts (a post-processing sketch follows the list):

  • runs/<run_id>/manifest.json
  • runs/<run_id>/raw_results.jsonl
  • runs/<run_id>/eval_results.jsonl
  • runs/<run_id>/failures.jsonl
  • runs/<run_id>/clusters.json
  • runs/<run_id>/summary.md
  • runs/<run_id>/repair_input.json
  • reports/latest.md
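
All of these artifacts are plain JSON, JSONL, or Markdown files, so they can be post-processed without the CLI. A minimal Python sketch that recomputes the pass rate for the newest run, assuming each eval_results.jsonl line is a JSON object with a boolean passed field (that field name is an assumption, not taken from this README; check a real line for the actual schema):

import json
from pathlib import Path

# Pick the most recently modified run directory under runs/.
run_dir = max(Path("runs").iterdir(), key=lambda p: p.stat().st_mtime)

# Each JSONL line is one evaluated case; "passed" is an assumed field name.
with (run_dir / "eval_results.jsonl").open() as fh:
    results = [json.loads(line) for line in fh if line.strip()]

passed = sum(1 for r in results if r.get("passed"))
print(f"{run_dir.name}: {passed}/{len(results)} cases passed")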

Run comparison

V1.5 adds deterministic local run comparison without changing the normal run artifact contract:

agent-eval compare --base baseline --target target
agent-eval compare --base baseline --target target --show
agent-eval compare --base baseline --target target --output comparison.json

Default output is a concise human-readable summary. --show prints machine-readable JSON instead, and --output writes that JSON to the given path. Comparison output includes cluster_key_version: "v1", pass-rate deltas, per-case transitions, and added/removed/persisted cluster IDs. A normal agent-eval run does not write comparison artifacts.
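
For scripting around that JSON, a hypothetical consumer might look like this; apart from cluster_key_version, the field names below are guesses at the shape implied by the summary above and should be verified against a real comparison.json:

import json

with open("comparison.json") as fh:
    cmp_data = json.load(fh)

# cluster_key_version is documented above; the remaining keys are assumed.
print("cluster key version:", cmp_data.get("cluster_key_version"))
print("pass-rate delta:", cmp_data.get("pass_rate_delta"))
for cluster_id in cmp_data.get("added_clusters", []):
    print("new failure cluster:", cluster_id)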

Failure signatures and repair input are enriched with optional, namespaced analysis fields while preserving existing fields for downstream compatibility. Cluster IDs keep the V1 grouping identity; richer signature fields improve titles and summaries but do not change the hash key.

Target modes

Script mode

eval.yaml can configure a command template:

project:
  mode: script
target:
  script:
    command: "{python} sample_agent.py --input-file {input_file}"

The script receives a temporary case JSON file. It should print JSON to stdout, either:

{"response": {"answer": "..."}, "debug_meta": {"route": "knowledge_qa"}}

or any plain JSON object, which is treated as the response.
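
A minimal script target matching that contract might look like the following; the case-file schema (inputs.query) is an assumption based on the payload-mapping and DeepEval examples later in this README:

import argparse
import json

# Hypothetical sample_agent.py: read the temporary case JSON passed via
# --input-file and print a response JSON (with optional debug_meta) to stdout.
parser = argparse.ArgumentParser()
parser.add_argument("--input-file", required=True)
args = parser.parse_args()

with open(args.input_file) as fh:
    case = json.load(fh)

query = case.get("inputs", {}).get("query", "")  # assumed case schema
print(json.dumps({
    "response": {"answer": f"echo: {query}"},
    "debug_meta": {"route": "knowledge_qa"},
}))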

HTTP mode

HTTP mode supports URL, method, headers, timeout/retry settings, and a minimal payload mapping subset such as $.inputs.query.

If a JSON response contains debug_meta, Agent-Eval uses it for execution-semantic checks; otherwise the run behaves as a black-box evaluation.
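
For local experiments, any JSON-over-HTTP endpoint works as a target. A standard-library sketch, where the request payload shape mirrors the $.inputs.query mapping above and everything else (port, route name, debug_meta contents) is illustrative:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AgentHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        query = payload.get("inputs", {}).get("query", "")
        body = json.dumps({
            "response": {"answer": f"echo: {query}"},
            # Returning debug_meta opts this target into execution-semantic checks.
            "debug_meta": {"route": "knowledge_qa", "tool_calls": []},
        }).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8808), AgentHandler).serve_forever()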

Assertions

MVP deterministic assertions include the following (a semantics sketch follows the list):

  • field_exists / jsonpath_exists
  • json_schema_match / schema_keys for the documented object-key/type subset
  • contains
  • exact_match
  • http_status
  • numeric_threshold with gt / gte / lt / lte / eq
  • execution checks from expected_execution such as expected_route, must_call_tools, forbid_tools, max_tool_calls, and min_retrieval_docs
  • llm_judge as an offline stub by default, plus opt-in provider: deepeval for answer_relevancy
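
As a rough mental model, the deterministic checks reduce to plain predicates over the response JSON. A hypothetical sketch of a few of them, illustrating the semantics rather than the tool's actual implementation:

import operator

OPS = {"gt": operator.gt, "gte": operator.ge, "lt": operator.lt,
       "lte": operator.le, "eq": operator.eq}

response = {"answer": "Paris is the capital of France.", "confidence": 0.92}

# contains: substring membership in a response field
assert "Paris" in response["answer"]
# exact_match: strict equality against an expected value
assert response["answer"] == "Paris is the capital of France."
# numeric_threshold with gte: compare an extracted number to a bound
assert OPS["gte"](response["confidence"], 0.7)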

Opt-in DeepEval judging

DeepEval support is optional and lazy-loaded. Install the optional extra and enable it explicitly:

python3 -m pip install -e '.[deepeval]'

Then enable it in eval.yaml:

evaluation:
  llm_judge:
    enabled: true
    provider: deepeval
    model: gpt-4.1
    threshold: 0.7

V1 supports the llm_judge metric answer_relevancy. The evaluator maps inputs.query to the DeepEval input and response.answer to actual_output; if those fields are absent, it falls back to a redacted, stable JSON serialization.
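
In DeepEval terms, that mapping corresponds roughly to the following; Agent-Eval performs this internally, so the exact calls here are an illustration of the contract rather than its source (running it requires a configured LLM provider):

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

case = {"inputs": {"query": "What is the capital of France?"}}
response = {"answer": "Paris."}

metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4.1")
test_case = LLMTestCase(
    input=case["inputs"]["query"],      # inputs.query -> DeepEval input
    actual_output=response["answer"],   # response.answer -> actual_output
)
metric.measure(test_case)
print(metric.score, metric.is_successful())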

Privacy defaults

Artifacts are local files. Before writing request/response/debug/error/report data, Agent-Eval redacts common sensitive keys such as Authorization, Cookie, API keys, tokens, passwords, secrets, full prompts, and full intermediate context fields.
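
Conceptually, the redaction is key-based. A hypothetical illustration (the tool's actual key list and masking format are not specified in this README):

SENSITIVE = {"authorization", "cookie", "api_key", "token", "password", "secret"}

def redact(obj):
    # Recursively mask values whose keys look sensitive; pass everything else through.
    if isinstance(obj, dict):
        return {k: "[REDACTED]" if k.lower() in SENSITIVE else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj

print(redact({"Authorization": "Bearer abc123", "inputs": {"query": "hi"}}))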

Current non-goals

  • No SaaS service or Web Dashboard
  • No online observability/Langfuse dependency
  • No automatic code patching, prompt modification, or PR creation
  • No mandatory Claude Code runtime dependency
  • No complete local adapter mode in this MVP
  • No mandatory live LLM or DeepEval call in default workflows
  • No LLM cluster naming/summary or full deep report in V1

Release and verification

Local release readiness is checked with:

python scripts/check-release.py

That gate runs the test suite, compileall, wheel/sdist builds, twine check, an installed-wheel smoke test, and a fresh-project agent-eval init/run/inspect/export/compare flow. See docs/release-checklist.md for the TestPyPI and PyPI publishing gates; TestPyPI must pass before uploading to PyPI.

Development verification is unchanged:

python3 -m pip install -e '.[dev,release]'
pytest
