Run the AI Agent QA Eval Pack against your tool-using LLM agent and grade it with a shareable scorecard badge. Deterministic, aligned with the OWASP Agentic Top 10, no LLM judge.

agent-eval-runner

Run the AI Agent QA Eval Pack (vendor-agnostic YAML eval cases) against your tool-using LLM agent. All checks are deterministic, with no LLM-as-judge, so the report is defensible for production sign-off. Cross-platform (pure Python: Windows / macOS / Linux).

Install

pip install agent-eval-runner            # core
pip install "agent-eval-runner[openai]"  # + built-in OpenAI demo adapter
pip install "agent-eval-runner[anthropic]"

30-second demo (no code)

Run text-based cases straight against a model:

export OPENAI_API_KEY=sk-...
agent-eval run --cases ./cases --adapter openai:gpt-4o --dimension safety

Real usage (all 5 case types, incl. tool-trace assertions)

Write a 10-line adapter wrapping your agent:

# my_adapter.py
from agent_eval_runner import AgentResult, ToolCall

def agent(case: dict) -> AgentResult:
    inp = case["input"]
    final_text, tool_calls = run_my_agent(           # <- your agent
        system=inp.get("system_message"),
        user=inp["user_message"],
        context=inp.get("context"),
    )
    return AgentResult(
        output_text=final_text,
        trace=[ToolCall(name=c.name, args=c.args) for c in tool_calls],
    )

Then point the runner at the adapter:

agent-eval run --cases ./cases --adapter my_adapter:agent --report signoff.md

The exit code is non-zero if any high-severity case fails, so the command can be dropped straight into CI.
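
As a rough illustration of the CI hook, the sketch below wraps the command in a small Python gate script. It uses only the flags shown above (--cases, --adapter, --report); the helper names build_command and run_gate are hypothetical and not part of the package, and the script assumes agent-eval is on the PATH.

```python
import subprocess
import sys


def build_command(cases_dir, adapter, report_path=None):
    """Assemble the agent-eval invocation from the flags shown in this README."""
    cmd = ["agent-eval", "run", "--cases", cases_dir, "--adapter", adapter]
    if report_path is not None:
        cmd += ["--report", report_path]
    return cmd


def run_gate(cases_dir="./cases", adapter="my_adapter:agent", report_path="signoff.md"):
    """Run the eval pack and return the process exit code.

    Hypothetical helper: a non-zero code is treated as "block the pipeline".
    It may also be non-zero for other reasons (for example a bad path).
    """
    completed = subprocess.run(build_command(cases_dir, adapter, report_path))
    if completed.returncode != 0:
        print("agent-eval exited non-zero; blocking the pipeline", file=sys.stderr)
    return completed.returncode
```

A CI step could then end with sys.exit(run_gate()) so that the job status mirrors the runner's exit code.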

What it checks

method             what it asserts
keyword_match      required / forbidden substrings in the answer
regex_match        answer matches a regex
refusal_detection  agent refused (or complied) as expected
trace_count        tool-call counts, required/forbidden tools, args seen
trace_invariant    structural: no error-loops, parallel-when-possible, step caps
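
To make "deterministic" concrete, here is a minimal sketch of how checks in the spirit of keyword_match, regex_match and trace_count could be written. It is not the package's implementation: the function signatures, the case-insensitive matching and the max_calls parameter are assumptions for illustration, and ToolCall is a local stand-in that mirrors only the name and args fields used in the adapter example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    # Local stand-in, not the package's class: just the two fields used above.
    name: str
    args: dict = field(default_factory=dict)


def keyword_match(text, required=(), forbidden=()):
    """Pass if every required substring is present and no forbidden one is.

    Case-insensitive comparison is an assumption of this sketch.
    """
    lowered = text.lower()
    has_required = all(k.lower() in lowered for k in required)
    has_forbidden = any(k.lower() in lowered for k in forbidden)
    return has_required and not has_forbidden


def regex_match(text, pattern):
    """Pass if the pattern matches anywhere in the answer."""
    return re.search(pattern, text) is not None


def trace_count(trace, required_tools=(), forbidden_tools=(), max_calls=None):
    """Pass if required tools were called, forbidden ones were not,
    and the total number of calls stays within max_calls (if given)."""
    names = [call.name for call in trace]
    if any(tool not in names for tool in required_tools):
        return False
    if any(tool in names for tool in forbidden_tools):
        return False
    if max_calls is not None and len(names) > max_calls:
        return False
    return True


trace = [
    ToolCall("search", {"q": "refund policy"}),
    ToolCall("send_email", {"to": "a@example.com"}),
]
answer_ok = keyword_match(
    "I cannot share the password.", required=["cannot"], forbidden=["hunter2"]
)
trace_ok = trace_count(
    trace, required_tools=["search"], forbidden_tools=["delete_user"], max_calls=3
)
```

Because each check is a pure function of the answer text and the recorded trace, the same inputs always give the same verdict, which is what allows a report to be re-run and audited.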

Output

  • Terminal pass/fail summary grouped by dimension, with production-blockers flagged
  • --report out.md → a Markdown sign-off report (share with your team / customer)

Full 23-case pack: https://weiseer.gumroad.com/l/dcipxt · Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter

Download files

Source Distribution

agent_eval_runner-0.2.1.tar.gz (9.8 kB)

Built Distribution

agent_eval_runner-0.2.1-py3-none-any.whl (13.1 kB)
