Robot Framework library for testing Agent Skills, Hooks, SubAgents, and MCP Servers: provider-agnostic, BFCL-grade tool-call matching, behavioural metrics from anthropic/claude-code#42796, statistical non-determinism handling.

robotframework-agentguard

A Robot Framework library for testing MCP servers, Agent Skills, Hooks, SubAgents, and coding-agent CLIs — provider-agnostic via LiteLLM, BFCL-grade tool-call matching, statistical N≥10 by default.

AgentGuard turns the moving parts of an agent stack into Robot Framework keywords. Connect to an MCP server, grade a SKILL.md, drive Claude Code / Codex / Aider, replay an A2A subagent task, run BFCL tool-call comparisons, and gate the result with scipy-backed statistics — all without leaving a .robot file.

What it tests

  • MCP servers — stdio / SSE / streamable-HTTP / in-memory, full FastMCP client surface
  • Agent Skills — SKILL.md discovery, frontmatter validation, Inspect-AI grading
  • Hooks — synthesise the 12 Claude Code hook events and assert handler decisions
  • SubAgents — A2A 1.0 task lifecycle + LangGraph / CrewAI / AutoGen / OpenAI Agents bridges
  • Coding agents — drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; emit the 12 #42796 behavioural metrics
  • Public benchmarks — SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench
  • Security — default-deny skill scanner, redactor, sandboxed execution, AIDefence integration
  • Statistics — Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N
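The effect-size measures in the Statistics bullet can be illustrated with a few lines of plain Python. This is a minimal sketch built from the textbook definitions, not AgentGuard's scipy-backed implementation; the function names and the sample hit rates are made up for illustration.

```python
from itertools import product


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-sample pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


def vargha_delaney_a(xs, ys):
    """Vargha-Delaney A: P(x > y) + 0.5 * P(x == y), i.e. (delta + 1) / 2."""
    return (cliffs_delta(xs, ys) + 1) / 2


# Hypothetical tool hit rates from 10 runs each (illustrative numbers only).
baseline = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.60, 0.57, 0.62, 0.61]
candidate = [0.70, 0.72, 0.69, 0.71, 0.68, 0.73, 0.70, 0.74, 0.69, 0.71]

delta = cliffs_delta(candidate, baseline)
a_stat = vargha_delaney_a(candidate, baseline)
print(f"Cliff's delta = {delta:.2f}, Vargha-Delaney A = {a_stat:.2f}")
```

Both measures are rank-based, so they stay meaningful for the small, non-normal samples that repeated agent runs tend to produce.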

Installation

Note: version 0.2.0 is published on PyPI; installing from the GitHub source remains an alternative.

From PyPI:

pip install robotframework-agentguard
# or
uv add robotframework-agentguard

From source (today):

pip install git+https://github.com/manykarim/robotframework-agentguard.git
# or, with uv
uv add git+https://github.com/manykarim/robotframework-agentguard.git

Optional extras:

pip install 'robotframework-agentguard[bridges]'      # LangGraph / CrewAI / AutoGen / OpenAI Agents
pip install 'robotframework-agentguard[benchmarks]'   # SWE-bench, Aider, HumanEval, MBPP, LiveCodeBench

Configure the LLM provider with a .env file in your project root:

OPENROUTER_API_KEY=sk-or-...
# optional overrides
AGENTGUARD_DEFAULT_MODEL=openrouter/anthropic/claude-sonnet-4-5
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
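To sanity-check a .env file before a run, something like the following can help. It is only a sketch: the parser is deliberately minimal, parse_dotenv is a hypothetical helper, and how AgentGuard itself loads the file is not shown in this README.

```python
def parse_dotenv(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Placeholder content mirroring the variables above (no real key).
sample = """\
OPENROUTER_API_KEY=sk-or-placeholder
# optional overrides
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
"""

config = parse_dotenv(sample)
missing = [key for key in ("OPENROUTER_API_KEY",) if key not in config]
overrides = sorted(key for key in config if key.startswith("AGENTGUARD_"))
print("missing:", missing, "overrides:", overrides)
```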

Verify the install:

agentguard doctor    # provider, env, MCP reachability
agentguard version

Quickstart

*** Settings ***
Library    AgentGuard    provider=litellm    model=openrouter/anthropic/claude-sonnet-4-5

*** Test Cases ***
AgentGuard Should Be Loaded
    ${info}=    Get AgentGuard Info
    Log    ${info}

The kitchen-sink Library AgentGuard exposes every keyword. Prefer narrow imports for bigger suites:

*** Settings ***
Library    AgentGuard.MCP
Library    AgentGuard.Skill
Library    AgentGuard.Stats

Sub-library imports

| Import line | Purpose |
| --- | --- |
| Library AgentGuard | Kitchen-sink — every keyword reachable |
| Library AgentGuard.MCP | Test MCP servers (stdio / SSE / streamable-HTTP / in-memory) |
| Library AgentGuard.Skill | Discover, parse, validate, grade Agent Skills |
| Library AgentGuard.Tool | BFCL-style tool-call AST + trajectory matching |
| Library AgentGuard.Stats | Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N |
| Library AgentGuard.Judge | Classification-based LLM-as-Judge with Cohen's κ calibration |
| Library AgentGuard.Security | Default-deny skill scanner, redactor, sandbox, AIDefence |
| Library AgentGuard.Hook | Claude Code hook lifecycle (12 events × 4 handler types) |
| Library AgentGuard.SubAgent | A2A 1.0 task lifecycle, framework bridges |
| Library AgentGuard.Coding | Drive Claude Code / Codex / Aider / OpenCode + #42796 metric pack |
| Library AgentGuard.Benchmark | SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench |
| Library AgentGuard.Scenario | Unified scenario harness — drop-in for manykarim/rf-mcp tests/e2e/ |
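The idea behind AST-based tool-call matching (the AgentGuard.Tool entry) can be sketched with Python's standard ast module: parse each call, then compare function name and keyword arguments structurally rather than as strings. This is a simplified illustration with hypothetical helper names; BFCL's own checker handles more cases, and AgentGuard's matcher may differ.

```python
import ast


def parse_tool_call(source):
    """Parse 'name(key=value, ...)' into (name, {key: literal_value})."""
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple tool call: {source!r}")
    if node.args or any(kw.arg is None for kw in node.keywords):
        raise ValueError("only explicit keyword arguments are supported here")
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs


def calls_match(actual, expected):
    """Exact match on function name and keyword arguments, order-insensitive."""
    return parse_tool_call(actual) == parse_tool_call(expected)


def trajectory_matches(actual_calls, expected_calls):
    """Same number of calls, each matching its counterpart in order."""
    return len(actual_calls) == len(expected_calls) and all(
        calls_match(a, e) for a, e in zip(actual_calls, expected_calls)
    )


print(calls_match("get_weather(city='Paris', unit='C')",
                  'get_weather(unit="C", city="Paris")'))
```

Comparing parsed structures means differences in quoting, whitespace, or argument order do not cause false failures, while a wrong argument value still does.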

Keyword documentation is available as browsable HTML, generated by Robot Framework libdoc.

Operator-driven assertions

Every collapsible Get-style keyword accepts the standard (assertion_operator, assertion_expected, message) parameters from robotframework-assertion-engine, so a single keyword does both retrieval and assertion:

Tool Hit Rate              ${result}    >=    ${0.7}
Failed Tool Call Count     ${result}    ==    ${0}
Read Edit Ratio            ${session}   >=    ${0.5}

Without operator arguments the same keywords just return the value. See examples/13_assertion_engine_idiom.robot and ADR-022.
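The retrieve-or-assert behaviour can be pictured in plain Python as follows. This is a sketch of the idiom only: get_or_assert is a hypothetical name, the operator table is a small subset, and the code does not call robotframework-assertion-engine.

```python
import operator

# A small subset of comparison operators, for illustration.
_OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def get_or_assert(value, assertion_operator=None, assertion_expected=None, message=None):
    """Return value; if an operator is given, assert against expected first."""
    if assertion_operator is None:
        return value  # plain getter behaviour
    try:
        check = _OPERATORS[assertion_operator]
    except KeyError:
        raise ValueError(f"unsupported operator: {assertion_operator!r}") from None
    if not check(value, assertion_expected):
        raise AssertionError(
            message or f"{value!r} should be {assertion_operator} {assertion_expected!r}"
        )
    return value


hit_rate = get_or_assert(0.82)             # retrieval only
checked = get_or_assert(0.82, ">=", 0.7)   # retrieval plus assertion
print(hit_rate, checked)
```

Folding the assertion into the getter removes the separate Should-style line while keeping the value available for later steps.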

Examples

Runnable Robot suites under examples/:

| File | Topic |
| --- | --- |
| 01_mcp_server_basics.robot | Connect to an MCP server, list and call tools |
| 02_skill_grading.robot | Grade a SKILL.md against an LLM with Cohen's κ calibration |
| 03_hook_block_destructive.robot | Synthesise hook events, assert blocking decisions |
| 04_subagent_a2a.robot | A2A subagent task lifecycle + trajectory matching |
| 05_coding_agent_metrics.robot | Compute the 12 #42796 behavioural metrics from a session |
| 06_bfcl_tool_selection.robot | BFCL AST equality + trajectory comparison |
| 07_sandbox_run.robot | Run untrusted code under default-deny Docker sandbox |
| 08_swe_bench.robot | SWE-bench Verified loader + pass@k gate |
| 09_humaneval_live.robot | HumanEval live grading |
| 10_rf_mcp_integration.robot | Drop-in replacement for manykarim/rf-mcp e2e patterns |
| 11_agentskills_grading.robot | Grade manykarim/robotframework-agentskills SKILL.md files |
| 12_mcp_scenario_replacement.robot | YAML-driven scenarios + live LLM driver |
| 13_assertion_engine_idiom.robot | Side-by-side: operator form vs old Should-pair form |
| 14_facade_imports.robot | Side-by-side import variants |

Run any example:

PYTHONPATH=. robot --outputdir _out examples/01_mcp_server_basics.robot
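For readers unfamiliar with the pass@k gate listed for 08_swe_bench.robot, the commonly used unbiased estimator is 1 - C(n-c, k) / C(n, k), where n is the number of samples per task and c the number that passed. The sketch below implements that formula with the standard library; whether AgentGuard uses exactly this form is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n samples with c passes, is a pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures: every draw of k contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples for one task, 3 of them passed.
print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```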

Architecture (12 bounded contexts)

| Context | Purpose |
| --- | --- |
| Provider | LiteLLM-backed LLMProviderAdapter + thin vendor adapters |
| MCP | FastMCP client wrapper for stdio / SSE / streamable-http / in-memory |
| Skills | SKILL.md discovery, frontmatter validation, Inspect-AI grading |
| Hooks | Synthesise the 12 Claude Code hook events; assert handler decisions |
| SubAgents | A2A 1.0 task lifecycle + delegation-chain assertions |
| CodingAgent | Drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; normalise JSONL |
| Statistics | scipy-backed Mann-Whitney, Cliff's δ, bootstrap CI, pass@k, TARr@N |
| Judge | Classification-based LLM-as-Judge with calibration gating (Cohen's κ ≥ 0.7) |
| Security | Default-deny skill scanner, redactor, sandbox policy, AIDefence integration |
| Telemetry | OTel spans + Robot Framework listener embedding scorecards in log.html |
| BehavioralMetrics | The 12 calculators from anthropic/claude-code#42796 |
| ToolCallCorrectness | BFCL AST/trajectory matcher used by MCP, Skills, SubAgents |
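The calibration gate in the Judge context (Cohen's κ ≥ 0.7) compares judge labels with reference labels while correcting for chance agreement. The following sketch computes κ from its standard definition; the labels are invented, and the gating logic is an assumption about how such a threshold might be applied.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical human reference labels vs. LLM judge labels.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass", "fail", "pass"]

kappa = cohens_kappa(human, judge)
calibrated = kappa >= 0.7
print(f"kappa = {kappa:.2f}, calibrated = {calibrated}")
```

Here the raters agree on 9 of 10 items, but because half of that agreement is expected by chance, κ comes out lower than the raw 90% agreement.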

Aggregates, value objects, and ACLs are in docs/ddd/.

Performance

Hard ceilings — CI fails if any benchmark exceeds the matching budget by more than 20%.

| Surface | Budget |
| --- | --- |
| MCP in-memory roundtrip (p50 / p95) | ≤ 5 / 10 ms |
| MCP stdio roundtrip (p50) | ≤ 50 ms |
| BFCL AST match (mean per call) | ≤ 1 ms |
| mannwhitneyu n=30/30 | ≤ 5 ms |
| bootstrap n=30 / 1000 resamples | ≤ 100 ms |
| Library import + Suite Setup (cold) | ≤ 2 s |

Run the suite locally:

uv run pytest benchmarks/ --benchmark-only

Full budget table and cost model: docs/performance/.
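The "budget plus 20%" rule can be expressed as a small check. The sketch below is illustrative only: the dictionary keys, the measured values, and the over_budget helper are hypothetical, and the project's actual CI gate lives in its benchmark suite.

```python
TOLERANCE = 0.20  # CI fails when a measurement exceeds budget by more than 20%

# A few budgets from the table above, in milliseconds (key names are made up).
BUDGETS_MS = {
    "mcp_in_memory_p50": 5.0,
    "mcp_stdio_p50": 50.0,
    "bfcl_ast_match_mean": 1.0,
}


def over_budget(measured_ms, budgets_ms=BUDGETS_MS, tolerance=TOLERANCE):
    """Return {name: (measured, ceiling)} for every surface above its ceiling."""
    failures = {}
    for name, budget in budgets_ms.items():
        ceiling = budget * (1 + tolerance)
        value = measured_ms.get(name)
        if value is not None and value > ceiling:
            failures[name] = (value, ceiling)
    return failures


# Invented measurements: the stdio roundtrip is above its 60 ms ceiling.
measured = {"mcp_in_memory_p50": 5.8, "mcp_stdio_p50": 61.0, "bfcl_ast_match_mean": 0.4}
failures = over_budget(measured)
print(failures)
```

A measurement between the budget and the ceiling, such as the 5.8 ms in-memory roundtrip here, does not fail the gate.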

Documentation

License

Apache-2.0 — see LICENSE.
