Run the AI Agent QA Eval Pack against your tool-using LLM agent and grade it with a shareable scorecard badge. Deterministic, aligned with the OWASP Agentic Top 10, no LLM-as-judge.
agent-eval-runner
Run the AI Agent QA Eval Pack (vendor-agnostic YAML eval cases) against your tool-using LLM agent. Deterministic checks — no LLM-as-judge — so the report is defensible for production sign-off. Cross-platform (pure Python: Win / macOS / Linux).
Install
pip install agent-eval-runner # core
pip install "agent-eval-runner[openai]" # + built-in OpenAI demo adapter
pip install "agent-eval-runner[anthropic]"
30-second demo (no code)
Run text-based cases straight against a model:
export OPENAI_API_KEY=sk-...
agent-eval run --cases ./cases --adapter openai:gpt-4o --dimension safety
Real usage (all 5 case types, incl. tool-trace assertions)
Write a 10-line adapter wrapping your agent:
# my_adapter.py
from agent_eval_runner import AgentResult, ToolCall


def agent(case: dict) -> AgentResult:
    inp = case["input"]
    final_text, tool_calls = run_my_agent(  # <- your agent
        system=inp.get("system_message"),
        user=inp["user_message"],
        context=inp.get("context"),
    )
    return AgentResult(
        output_text=final_text,
        trace=[ToolCall(name=c.name, args=c.args) for c in tool_calls],
    )
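Before pointing the runner at the adapter, it can help to smoke-test the adapter in isolation. The sketch below is illustrative only and rests on assumptions: `ToolCall` and `AgentResult` are local dataclass stand-ins that model just the fields used above (swap in the real imports from `agent_eval_runner`), `run_my_agent` is a hypothetical fake that echoes the user message and records a single `search` tool call, and the case dict contains only the `input` keys the adapter reads.

```python
from dataclasses import dataclass, field
from typing import Any


# Stand-ins for agent_eval_runner.ToolCall / AgentResult (assumption: only the
# fields used by the adapter are modeled; the real classes may carry more).
@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


@dataclass
class AgentResult:
    output_text: str
    trace: list[ToolCall] = field(default_factory=list)


def run_my_agent(system, user, context):
    """Hypothetical fake agent: records one tool call and echoes the user."""
    calls = [ToolCall(name="search", args={"query": user})]
    return f"echo: {user}", calls


def agent(case: dict) -> AgentResult:
    """Same adapter shape as above, wired to the fake agent."""
    inp = case["input"]
    final_text, tool_calls = run_my_agent(
        system=inp.get("system_message"),
        user=inp["user_message"],
        context=inp.get("context"),
    )
    return AgentResult(
        output_text=final_text,
        trace=[ToolCall(name=c.name, args=c.args) for c in tool_calls],
    )


# Minimal case: only the required user_message is supplied.
result = agent({"input": {"user_message": "What is the refund policy?"}})
print(result.output_text)
print([call.name for call in result.trace])
```

If the adapter returns the expected text and trace for a hand-written case like this, any later failure in the runner is more likely to come from the agent than from the glue code.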
agent-eval run --cases ./cases --adapter my_adapter:agent --report signoff.md
Exit code is non-zero if any high-severity case fails — drop it straight into CI.
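Most CI systems fail a step automatically on a non-zero exit code, so the command can usually be run as-is. Where a Python wrapper is preferred (for example, to add extra steps around the run), a minimal sketch might look like the following. The `gate` and `main` helpers are hypothetical names, and the command simply mirrors the invocation shown above.

```python
import subprocess
import sys


def gate(cmd: list[str]) -> bool:
    """Run cmd and return True only when it exits with status 0."""
    completed = subprocess.run(cmd)
    return completed.returncode == 0


def main() -> int:
    """Run the eval pack and translate the result into a process exit code."""
    # Assumption: paths and adapter name match the example invocation above.
    cmd = [
        "agent-eval", "run",
        "--cases", "./cases",
        "--adapter", "my_adapter:agent",
        "--report", "signoff.md",
    ]
    return 0 if gate(cmd) else 1


# Demonstration with a stand-in command that always exits with status 1.
print(gate([sys.executable, "-c", "raise SystemExit(1)"]))
```

In a real pipeline script, `sys.exit(main())` would propagate the failure to the CI system.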
What it checks
| method | what it asserts |
|---|---|
| keyword_match | required / forbidden substrings in the answer |
| regex_match | answer matches a regex |
| refusal_detection | agent refused (or complied) as expected |
| trace_count | tool-call counts, required/forbidden tools, args seen |
| trace_invariant | structural: no error-loops, parallel-when-possible, step caps |
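To make "deterministic" concrete, here is a rough sketch of what checks in the spirit of `keyword_match`, `regex_match`, and `trace_count` could look like. This is not the package's implementation: the function signatures, the case-insensitive substring comparison, and the `max_calls` parameter are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def keyword_match(answer: str, required=(), forbidden=()) -> bool:
    """Pass when every required substring is present and no forbidden one is.

    Assumption: comparison is case-insensitive.
    """
    text = answer.lower()
    has_required = all(k.lower() in text for k in required)
    has_forbidden = any(k.lower() in text for k in forbidden)
    return has_required and not has_forbidden


def regex_match(answer: str, pattern: str) -> bool:
    """Pass when the pattern is found anywhere in the answer."""
    return re.search(pattern, answer) is not None


def trace_count(tool_names: list, required=(), forbidden=(), max_calls=None) -> bool:
    """Pass when required tools were called, forbidden ones were not, and the
    total number of calls stays within max_calls (when a cap is given)."""
    counts = Counter(tool_names)
    if any(counts[t] == 0 for t in required):
        return False
    if any(counts[t] > 0 for t in forbidden):
        return False
    return max_calls is None or len(tool_names) <= max_calls


answer = "I can't share that password, but here is the reset link."
print(keyword_match(answer, required=["reset link"], forbidden=["hunter2"]))
print(regex_match(answer, r"can(?:'t|not) share"))
print(trace_count(["lookup_user", "send_reset"],
                  required=["send_reset"], forbidden=["delete_user"], max_calls=3))
```

Because each check is a pure function of the answer text or the tool trace, the same agent output always yields the same verdict, which is what makes the resulting report reproducible.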
Output
- Terminal pass/fail summary grouped by dimension, with production-blockers flagged
- --report out.md → a Markdown sign-off report (share with your team / customer)
Full 23-case pack: https://weiseer.gumroad.com/l/dcipxt · Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter