Run the AI Agent QA Eval Pack against your tool-using LLM agent and grade it with a shareable scorecard badge. Deterministic, aligned with the OWASP Agentic Top 10, no LLM-as-judge.

Project description

agent-eval-runner

Run the AI Agent QA Eval Pack (vendor-agnostic YAML eval cases) against your tool-using LLM agent. The checks are deterministic (no LLM-as-judge), so the report is defensible for production sign-off. Pure Python, so it runs on Windows, macOS, and Linux.

Install

pip install agent-eval-runner            # core
pip install "agent-eval-runner[openai]"  # + built-in OpenAI demo adapter
pip install "agent-eval-runner[anthropic]"

30-second demo (no code)

Run text-based cases straight against a model:

export OPENAI_API_KEY=sk-...
agent-eval run --cases ./cases --adapter openai:gpt-4o --dimension safety

Real usage (all 5 case types, incl. tool-trace assertions)

Write a 10-line adapter wrapping your agent:

# my_adapter.py
from agent_eval_runner import AgentResult, ToolCall

def agent(case: dict) -> AgentResult:
    inp = case["input"]
    final_text, tool_calls = run_my_agent(           # <- your agent
        system=inp.get("system_message"),
        user=inp["user_message"],
        context=inp.get("context"),
    )
    return AgentResult(
        output_text=final_text,
        trace=[ToolCall(name=c.name, args=c.args) for c in tool_calls],
    )

Then point the runner at the adapter:

agent-eval run --cases ./cases --adapter my_adapter:agent --report signoff.md
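
The adapter above leaves run_my_agent undefined, since it stands for your own agent. To try the wiring before plugging in a real agent, a stand-in like the following may help. It is only a sketch: StubToolCall and the fake "search" tool are hypothetical names, and the sample case contains only the input keys that the adapter reads (the full YAML case schema is not shown here).

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StubToolCall:
    """Hypothetical record of one tool invocation.

    The adapter only reads .name and .args, so that is all this carries.
    """
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def run_my_agent(system: Optional[str], user: str, context: Optional[Any] = None):
    """Stand-in agent: 'calls' a fake search tool once, then answers."""
    calls = [StubToolCall(name="search", args={"query": user})]
    final_text = f"Searched for: {user}"
    return final_text, calls


# A case shaped the way the adapter reads it (assumed minimal shape).
sample_case = {"input": {"user_message": "refund policy"}}
inp = sample_case["input"]

text, calls = run_my_agent(
    system=inp.get("system_message"),  # absent here, so None
    user=inp["user_message"],
    context=inp.get("context"),        # absent here, so None
)
print(text)
print([(c.name, c.args) for c in calls])
```

Placing run_my_agent in my_adapter.py next to the agent function should be enough for the adapter to import and run, assuming agent-eval-runner itself is installed.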

The exit code is non-zero if any high-severity case fails, so the command can go straight into CI.
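
If your CI is driven from Python rather than a shell step, the exit code can be propagated with a small wrapper. This is a sketch that uses only the flags shown above (--cases, --adapter, --report); build_command and run_gate are hypothetical helper names, not part of the package.

```python
import subprocess


def build_command(cases_dir: str, adapter: str, report: str) -> list[str]:
    """Assemble the CLI invocation shown above as an argument list."""
    return [
        "agent-eval", "run",
        "--cases", cases_dir,
        "--adapter", adapter,
        "--report", report,
    ]


def run_gate(cases_dir: str = "./cases",
             adapter: str = "my_adapter:agent",
             report: str = "signoff.md") -> int:
    """Run the eval and return its exit code (non-zero means the gate failed)."""
    completed = subprocess.run(build_command(cases_dir, adapter, report))
    return completed.returncode


# In a CI script, finish with:
#     raise SystemExit(run_gate())
```

Passing the command as a list avoids shell quoting issues with paths that contain spaces.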

What it checks

method              what it asserts
keyword_match       required / forbidden substrings in the answer
regex_match         answer matches a regex
refusal_detection   agent refused (or complied) as expected
trace_count         tool-call counts, required/forbidden tools, args seen
trace_invariant     structural: no error-loops, parallel-when-possible, step caps
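
To make "deterministic" concrete, here is a rough illustration of what checks of this kind can look like. It is not the package's implementation: the function names mirror the method names in the table, but the signatures, the case-insensitive matching, and the error-loop threshold are all assumptions made for this example.

```python
import re
from collections import Counter


def keyword_match(answer, required=(), forbidden=()):
    """Pass if every required substring is present and no forbidden one is.

    Case-insensitive matching is an assumption of this sketch.
    """
    text = answer.lower()
    has_required = all(k.lower() in text for k in required)
    has_forbidden = any(k.lower() in text for k in forbidden)
    return has_required and not has_forbidden


def regex_match(answer, pattern):
    """Pass if the pattern occurs anywhere in the answer."""
    return re.search(pattern, answer) is not None


def trace_count(tool_names, required=(), forbidden=(), max_calls=None):
    """Pass if required tools were called, forbidden ones were not,
    and the total number of calls stays within max_calls (if given)."""
    counts = Counter(tool_names)
    if any(counts[t] == 0 for t in required):
        return False
    if any(counts[t] > 0 for t in forbidden):
        return False
    if max_calls is not None and len(tool_names) > max_calls:
        return False
    return True


def has_error_loop(steps, limit=3):
    """steps is a list of (tool_name, errored) pairs.

    Returns True if the same tool fails `limit` times in a row.
    """
    run, prev = 0, None
    for name, errored in steps:
        if errored and name == prev:
            run += 1
        elif errored:
            run = 1
        else:
            run = 0
        prev = name if errored else None
        if run >= limit:
            return True
    return False
```

Each function is a pure predicate over the answer text or the tool trace, so the same input always yields the same verdict, which is the property that makes a report reproducible.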

Output

  • Terminal pass/fail summary grouped by dimension, with production-blockers flagged
  • --report out.md → a Markdown sign-off report (share with your team / customer)

Full 23-case pack: https://weiseer.gumroad.com/l/dcipxt · Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter


Download files

Source Distribution

agent_eval_runner-0.2.0.tar.gz (9.4 kB)

Built Distribution

agent_eval_runner-0.2.0-py3-none-any.whl (12.7 kB)

File details

Details for the file agent_eval_runner-0.2.0.tar.gz.

File metadata

  • Download URL: agent_eval_runner-0.2.0.tar.gz
  • Size: 9.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for agent_eval_runner-0.2.0.tar.gz

Algorithm     Hash digest
SHA256        bf7a6b211cb975e33246ea83bcd1a80f206a68a62fc4760eb068c01ac1168fa7
MD5           3550239b355b3687797204cff93f0e1c
BLAKE2b-256   354a295a93cf3f8d1fee8295611e8711537c8f8c86e3f1aaccd86aa314f716b5

File details

Details for the file agent_eval_runner-0.2.0-py3-none-any.whl.

File hashes

Hashes for agent_eval_runner-0.2.0-py3-none-any.whl

Algorithm     Hash digest
SHA256        45e3e09a4132d7c7a631248a4e6cced8feaca8487513533ad3ba02db36f3d70e
MD5           7bc5d35bb9ae7d99e3ed4a179a8a9762
BLAKE2b-256   61068eb6fd7077b4cf8f2c78084c005b8ec1f02e6230020eb7d858b11ffa4cf7
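
A downloaded file can be checked against the SHA256 digests listed above using only the standard library. The sketch below assumes the file has already been downloaded to a local path; sha256_of_file and matches are hypothetical helper names.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Hash the file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the file's SHA256 with the published hex digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())


# Example (local path is an assumption):
#     matches("agent_eval_runner-0.2.0-py3-none-any.whl",
#             "45e3e09a4132d7c7a631248a4e6cced8feaca8487513533ad3ba02db36f3d70e")
```

pip can enforce the same check at install time through hash-pinned requirements files with its --require-hashes option.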
