agent-eval-runner

Run the AI Agent QA Eval Pack (vendor-agnostic YAML eval cases) against your tool-using LLM agent. All checks are deterministic, with no LLM-as-judge, so the report is defensible for production sign-off. Cross-platform: pure Python on Windows, macOS, and Linux.

Install

pip install agent-eval-runner            # core
pip install "agent-eval-runner[openai]"  # + built-in OpenAI demo adapter
pip install "agent-eval-runner[anthropic]"

30-second demo (no code)

Run text-based cases straight against a model:

export OPENAI_API_KEY=sk-...
agent-eval run --cases ./cases --adapter openai:gpt-4o --dimension safety

Real usage (all 5 case types, incl. tool-trace assertions)

Write a 10-line adapter wrapping your agent:

# my_adapter.py
from agent_eval_runner import AgentResult, ToolCall

def agent(case: dict) -> AgentResult:
    inp = case["input"]
    # run_my_agent is a placeholder for your own agent. It should return
    # the final answer text and the tool calls made along the way.
    final_text, tool_calls = run_my_agent(
        system=inp.get("system_message"),
        user=inp["user_message"],
        context=inp.get("context"),
    )
    return AgentResult(
        output_text=final_text,
        trace=[ToolCall(name=c.name, args=c.args) for c in tool_calls],
    )

Then point the runner at it:

agent-eval run --cases ./cases --adapter my_adapter:agent --report signoff.md
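
To try the adapter shape before wiring in a real agent, a self-contained sketch like the one below may help. It is not the package's code: `ToolCall` and `AgentResult` here are local stand-ins that mirror only the fields used in the adapter above (the real classes may carry more), `run_my_agent` is a hypothetical toy agent, and `sample_case` is a hand-built dict containing only the keys the adapter reads.

```python
from dataclasses import dataclass, field
from typing import Any


# Local stand-ins so the sketch runs without agent-eval-runner installed.
# They copy only the fields the README's adapter uses.
@dataclass
class ToolCall:
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class AgentResult:
    output_text: str
    trace: list[ToolCall] = field(default_factory=list)


def run_my_agent(system, user, context):
    """Hypothetical toy agent: records one tool call and echoes the user."""
    calls = [ToolCall(name="get_weather", args={"city": "Paris"})]
    return f"Echo: {user}", calls


def agent(case: dict) -> AgentResult:
    inp = case["input"]
    final_text, tool_calls = run_my_agent(
        system=inp.get("system_message"),  # optional key
        user=inp["user_message"],          # required key
        context=inp.get("context"),        # optional key
    )
    return AgentResult(
        output_text=final_text,
        trace=[ToolCall(name=c.name, args=c.args) for c in tool_calls],
    )


# Hand-built case with only the keys the adapter reads (assumed shape).
sample_case = {"input": {"user_message": "What is the weather in Paris?"}}
result = agent(sample_case)
print(result.output_text)
print([call.name for call in result.trace])
```

Once the adapter behaves as expected, swapping the stand-ins for the real `from agent_eval_runner import AgentResult, ToolCall` import should be the only change needed.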

The command exits with a non-zero code if any high-severity case fails, so it can be dropped straight into CI.
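
In most CI systems a plain shell step is enough, since a non-zero exit fails the job. Where the pipeline is driven from Python, the same contract can be wrapped as in the sketch below. `build_command` and `run_gate` are hypothetical helper names; only the flags shown in this README are used, and the `runner` parameter exists purely so the gate can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable, Sequence


def build_command(cases_dir: str, adapter: str, report: str) -> list[str]:
    """Assemble the CLI call using only flags that appear in the README."""
    return [
        "agent-eval", "run",
        "--cases", cases_dir,
        "--adapter", adapter,
        "--report", report,
    ]


def run_gate(
    cmd: Sequence[str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Return True when the run exited with code 0, False otherwise."""
    completed = runner(cmd)
    return completed.returncode == 0


cmd = build_command("./cases", "my_adapter:agent", "signoff.md")

# Dry run with a fake runner that simulates a failing eval run.
fake_failure = lambda c: subprocess.CompletedProcess(c, 1)
print(run_gate(cmd, runner=fake_failure))
```

In a real pipeline, `run_gate(cmd)` with the default runner would invoke the CLI, and the calling script could pass the result to `sys.exit` to fail the build.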

What it checks

method              what it asserts
keyword_match       required / forbidden substrings in the answer
regex_match         answer matches a regex
refusal_detection   agent refused (or complied) as expected
trace_count         tool-call counts, required/forbidden tools, args seen
trace_invariant     structural: no error-loops, parallel-when-possible, step caps
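
To make the table concrete, here is a rough sketch of what deterministic checks of this kind can look like. It is not the package's implementation: the function signatures, the case-insensitive matching, and the representation of a trace as a list of tool names are all assumptions made for illustration. `refusal_detection` and the parallel-call invariant are left out.

```python
import re
from collections import Counter


def keyword_match(answer, required=(), forbidden=()):
    """All required substrings present, no forbidden ones (case-insensitive here)."""
    text = answer.lower()
    has_required = all(k.lower() in text for k in required)
    has_forbidden = any(k.lower() in text for k in forbidden)
    return has_required and not has_forbidden


def regex_match(answer, pattern):
    """The answer contains at least one match for the pattern."""
    return re.search(pattern, answer) is not None


def trace_count(trace, required_tools=(), forbidden_tools=(), max_calls=None):
    """Tool-call counts: required tools seen, forbidden tools absent, total capped."""
    counts = Counter(trace)
    if any(counts[t] == 0 for t in required_tools):
        return False
    if any(counts[t] > 0 for t in forbidden_tools):
        return False
    return max_calls is None or len(trace) <= max_calls


def trace_invariant(trace, max_steps, max_repeats=2):
    """Step cap plus a crude loop check: no tool called more than max_repeats times in a row."""
    if len(trace) > max_steps:
        return False
    run = 1
    for prev, cur in zip(trace, trace[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return False
    return True


print(keyword_match("I can't help with that request.", required=["can't help"], forbidden=["password"]))
print(regex_match("Order #12345 shipped", r"#\d{5}"))
print(trace_count(["search", "fetch"], required_tools=["search"], forbidden_tools=["delete_user"], max_calls=3))
print(trace_invariant(["search", "search", "search"], max_steps=5))
```

Because every check is a pure function of the answer text or the tool trace, the same agent output always yields the same verdict, which is the property the runner relies on when it avoids an LLM judge.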

Output

  • Terminal pass/fail summary grouped by dimension, with production-blockers flagged
  • --report out.md → a Markdown sign-off report (share with your team / customer)

Full 23-case pack: https://weiseer.gumroad.com/l/dcipxt · Free 5-case starter: https://github.com/weiseer/ai-agent-qa-eval-pack-starter
