
Robot Framework library for testing Agent Skills, Hooks, SubAgents, and MCP Servers — provider-agnostic, BFCL-grade tool-call matching, #42796 behavioral metrics, statistical non-determinism handling.


robotframework-agentguard

License: Apache-2.0

A Robot Framework library for testing MCP servers, Agent Skills, Hooks, SubAgents, and coding-agent CLIs — provider-agnostic via LiteLLM, BFCL-grade tool-call matching, statistical N≥10 by default.

AgentGuard turns the moving parts of an agent stack into Robot Framework keywords. Connect to an MCP server, grade a SKILL.md, drive Claude Code / Codex / Aider, replay an A2A subagent task, run BFCL tool-call comparisons, and gate the result with scipy-backed statistics — all without leaving a .robot file.

What it tests

  • MCP servers — stdio / SSE / streamable-HTTP / in-memory, full FastMCP client surface
  • Agent Skills — SKILL.md discovery, frontmatter validation, Inspect-AI grading
  • Hooks — synthesise the 12 Claude Code hook events and assert handler decisions
  • SubAgents — A2A 1.0 task lifecycle + LangGraph / CrewAI / AutoGen / OpenAI Agents bridges
  • Coding agents — drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; emit the 12 #42796 behavioural metrics
  • Public benchmarks — SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench
  • Security — default-deny skill scanner, redactor, sandboxed execution, AIDefence integration
  • Statistics — Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N
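For readers unfamiliar with the effect-size and sampling statistics named in the last bullet, here is a minimal pure-Python sketch of three of them. It is not AgentGuard's implementation (the library is described as scipy-backed); the function names are hypothetical and the pass@k formula is the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k).

```python
from itertools import product
from math import comb


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


def vargha_delaney_a(xs, ys):
    """Vargha-Delaney A: P(x > y) + 0.5 * P(x == y)."""
    wins = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x, y in product(xs, ys)
    )
    return wins / (len(xs) * len(ys))


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which passed."""
    if n - c < k:
        # Fewer than k failures exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative scores only, not measured data.
candidate = [3, 4, 5]
baseline = [1, 2, 3]
print(cliffs_delta(candidate, baseline))
print(vargha_delaney_a(candidate, baseline))
print(pass_at_k(n=10, c=3, k=1))
```

The two effect sizes carry the same information: A equals (δ + 1) / 2, so A = 0.5 and δ = 0 both mean "no difference between the groups".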

Installation

Note: the package is pending publication to PyPI. Until then, install from the GitHub source.

From PyPI (once published):

pip install robotframework-agentguard
# or
uv add robotframework-agentguard

From source (today):

pip install git+https://github.com/manykarim/robotframework-agentguard.git
# or, with uv
uv add git+https://github.com/manykarim/robotframework-agentguard.git

Optional extras:

pip install 'robotframework-agentguard[bridges]'      # LangGraph / CrewAI / AutoGen / OpenAI Agents
pip install 'robotframework-agentguard[benchmarks]'   # SWE-bench, Aider, HumanEval, MBPP, LiveCodeBench

Configure the LLM provider with a .env file in your project root:

OPENROUTER_API_KEY=sk-or-...
# optional overrides
AGENTGUARD_DEFAULT_MODEL=openrouter/anthropic/claude-sonnet-4-5
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
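To sanity-check a .env file before running a suite, a small stdlib-only script along these lines may help. It only illustrates the KEY=VALUE format shown above; how AgentGuard itself loads the file is not described here, and treating OPENROUTER_API_KEY as the only required key is an assumption based on the "optional overrides" comment. Both function names are hypothetical.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required=("OPENROUTER_API_KEY",)):
    """Return the required keys that are absent or empty (assumed list)."""
    return [key for key in required if not values.get(key)]


# Placeholder content; in practice, read the text of your own .env file.
sample = """OPENROUTER_API_KEY=sk-or-example
# optional overrides
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
"""
config = parse_env(sample)
print(missing_keys(config))
```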

Verify the install:

agentguard doctor    # provider, env, MCP reachability
agentguard version

Quickstart

*** Settings ***
Library    AgentGuard    provider=litellm    model=openrouter/anthropic/claude-sonnet-4-5

*** Test Cases ***
AgentGuard Should Be Loaded
    ${info}=    Get AgentGuard Info
    Log    ${info}

The kitchen-sink Library AgentGuard exposes every keyword. Prefer narrow imports for bigger suites:

*** Settings ***
Library    AgentGuard.MCP
Library    AgentGuard.Skill
Library    AgentGuard.Stats

Sub-library imports

Import line                   Purpose
Library AgentGuard            Kitchen-sink — every keyword reachable
Library AgentGuard.MCP        Test MCP servers (stdio / SSE / streamable-HTTP / in-memory)
Library AgentGuard.Skill      Discover, parse, validate, grade Agent Skills
Library AgentGuard.Tool       BFCL-style tool-call AST + trajectory matching
Library AgentGuard.Stats      Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N
Library AgentGuard.Judge      Classification-based LLM-as-Judge with Cohen's κ calibration
Library AgentGuard.Security   Default-deny skill scanner, redactor, sandbox, AIDefence
Library AgentGuard.Hook       Claude Code hook lifecycle (12 events × 4 handler types)
Library AgentGuard.SubAgent   A2A 1.0 task lifecycle, framework bridges
Library AgentGuard.Coding     Drive Claude Code / Codex / Aider / OpenCode + #42796 metric pack
Library AgentGuard.Benchmark  SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench
Library AgentGuard.Scenario   Unified scenario harness — drop-in for manykarim/rf-mcp tests/e2e/
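To give a feel for what "BFCL-style tool-call AST + trajectory matching" means, the sketch below checks one tool call against an expectation in which every parameter lists its acceptable values, then checks a whole call sequence in order. It is loosely modelled on that idea and is not the AgentGuard.Tool API; the dictionary shapes, the OPTIONAL sentinel, and both function names are assumptions made for illustration.

```python
# Hypothetical sentinel: a parameter whose acceptable list contains
# OPTIONAL may be omitted from the actual call.
OPTIONAL = object()


def call_matches(actual, expected):
    """Compare one call.

    actual:   {"name": str, "args": {param: value}}
    expected: {"name": str, "args": {param: [acceptable values]}}
    """
    if actual["name"] != expected["name"]:
        return False
    allowed = expected["args"]
    # Reject arguments the expectation does not know about.
    if any(param not in allowed for param in actual["args"]):
        return False
    for param, acceptable in allowed.items():
        if param not in actual["args"]:
            if OPTIONAL not in acceptable:
                return False  # a required parameter is missing
            continue
        if actual["args"][param] not in acceptable:
            return False
    return True


def trajectory_matches(actual_calls, expected_calls):
    """Strict in-order match of a whole call sequence."""
    return len(actual_calls) == len(expected_calls) and all(
        call_matches(a, e) for a, e in zip(actual_calls, expected_calls)
    )


expected = {
    "name": "get_weather",
    "args": {"city": ["Berlin", "berlin"], "unit": ["celsius", OPTIONAL]},
}
good = {"name": "get_weather", "args": {"city": "Berlin"}}
bad = {"name": "get_weather", "args": {"city": "Berlin", "unit": "kelvin"}}
print(call_matches(good, expected), call_matches(bad, expected))
```

Matching on structure rather than on the raw string means argument order and formatting differences in the model's output do not cause false failures.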

Keyword documentation (browsable HTML, generated by Robot Framework libdoc):

Operator-driven assertions

Every collapsible Get-style keyword accepts the standard (assertion_operator, assertion_expected, message) parameters from robotframework-assertion-engine, so a single keyword does both retrieval and assertion:

Tool Hit Rate              ${result}    >=    ${0.7}
Failed Tool Call Count     ${result}    ==    ${0}
Read Edit Ratio            ${session}   >=    ${0.5}

Without operator arguments the same keywords just return the value. See examples/13_assertion_engine_idiom.robot and ADR-022.
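The retrieve-or-assert behaviour can be pictured with a short Python sketch. This is not the robotframework-assertion-engine API; the operator table and the function name are hypothetical, and only the three parameter names are taken from the text above.

```python
import operator

# Hypothetical operator table; the real engine may support more operators.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def get_with_assertion(value, assertion_operator=None,
                       assertion_expected=None, message=None):
    """Return value; when an operator is given, assert against it first."""
    if assertion_operator is None:
        return value  # plain retrieval, no assertion
    check = OPERATORS[assertion_operator]
    if not check(value, assertion_expected):
        raise AssertionError(
            message
            or f"{value!r} should be {assertion_operator} {assertion_expected!r}"
        )
    return value


hit_rate = 0.8  # illustrative value only
print(get_with_assertion(hit_rate, ">=", 0.7))  # passes and returns the value
print(get_with_assertion(hit_rate))             # retrieval only
```

One keyword covering both roles avoids the older pattern of fetching a value into a variable and then checking it with a separate Should keyword.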

Examples

Runnable Robot suites under examples/:

File                               Topic
01_mcp_server_basics.robot         Connect to an MCP server, list and call tools
02_skill_grading.robot             Grade a SKILL.md against an LLM with Cohen's κ calibration
03_hook_block_destructive.robot    Synthesise hook events, assert blocking decisions
04_subagent_a2a.robot              A2A subagent task lifecycle + trajectory matching
05_coding_agent_metrics.robot      Compute the 12 #42796 behavioural metrics from a session
06_bfcl_tool_selection.robot       BFCL AST equality + trajectory comparison
07_sandbox_run.robot               Run untrusted code under default-deny Docker sandbox
08_swe_bench.robot                 SWE-bench Verified loader + pass@k gate
09_humaneval_live.robot            HumanEval live grading
10_rf_mcp_integration.robot        Drop-in replacement for manykarim/rf-mcp e2e patterns
11_agentskills_grading.robot       Grade manykarim/robotframework-agentskills SKILL.md files
12_mcp_scenario_replacement.robot  YAML-driven scenarios + live LLM driver
13_assertion_engine_idiom.robot    Side-by-side: operator form vs old Should-pair form
14_facade_imports.robot            Side-by-side import variants

Run any example:

PYTHONPATH=. robot --outputdir _out examples/01_mcp_server_basics.robot

Architecture (12 bounded contexts)

Context              Purpose
Provider             LiteLLM-backed LLMProviderAdapter + thin vendor adapters
MCP                  FastMCP client wrapper for stdio / SSE / streamable-http / in-memory
Skills               SKILL.md discovery, frontmatter validation, Inspect-AI grading
Hooks                Synthesise the 12 Claude Code hook events; assert handler decisions
SubAgents            A2A 1.0 task lifecycle + delegation-chain assertions
CodingAgent          Drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; normalise JSONL
Statistics           scipy-backed Mann-Whitney, Cliff's δ, bootstrap CI, pass@k, TARr@N
Judge                Classification-based LLM-as-Judge with calibration gating (Cohen's κ ≥ 0.7)
Security             Default-deny skill scanner, redactor, sandbox policy, AIDefence integration
Telemetry            OTel spans + Robot Framework listener embedding scorecards in log.html
BehavioralMetrics    The 12 calculators from anthropic/claude-code#42796
ToolCallCorrectness  BFCL AST/trajectory matcher used by MCP, Skills, SubAgents
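The Judge context gates on Cohen's κ ≥ 0.7. As a rough illustration of what that gate measures, the sketch below computes κ between judge labels and reference labels in plain Python. It is a textbook formula, not AgentGuard's code; the function name and the sample labels are made up for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only, not real grading data.
judge = ["pass", "pass", "fail", "fail", "pass",
         "fail", "pass", "fail", "pass", "pass"]
human = ["pass", "pass", "fail", "fail", "pass",
         "fail", "pass", "fail", "pass", "fail"]
kappa = cohens_kappa(judge, human)
print(f"kappa={kappa:.2f}, calibrated={kappa >= 0.7}")
```

Raw agreement here is 90%, but because half of it would be expected by chance, κ comes out lower; that correction is why κ is a stricter gate than plain accuracy.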

Aggregates, value objects, and ACLs are in docs/ddd/.

Performance

Hard ceilings — CI fails if any benchmark exceeds the matching budget by more than 20%.

Surface                              Budget
MCP in-memory roundtrip (p50 / p95)  ≤ 5 / 10 ms
MCP stdio roundtrip (p50)            ≤ 50 ms
BFCL AST match (mean per call)       ≤ 1 ms
mannwhitneyu n=30/30                 ≤ 5 ms
bootstrap n=30 / 1000 resamples      ≤ 100 ms
Library import + Suite Setup (cold)  ≤ 2 s
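The 20% rule works out as follows: a 50 ms budget only fails CI once a measurement goes above 60 ms. The sketch below shows that arithmetic with three budgets from the table; the dictionary keys and function names are hypothetical and this is not the project's actual CI check.

```python
def exceeds_budget(measured_ms, budget_ms, tolerance=0.20):
    """True when the measurement is more than `tolerance` above the budget."""
    return measured_ms > budget_ms * (1.0 + tolerance)


def failing_benchmarks(measured, budgets, tolerance=0.20):
    """Names of benchmarks whose measurement breaks the ceiling, sorted."""
    return sorted(
        name
        for name, budget_ms in budgets.items()
        if exceeds_budget(measured[name], budget_ms, tolerance)
    )


# Budgets in milliseconds, taken from the table; key names are made up.
budgets = {"mcp_stdio_p50": 50, "bfcl_ast_match_mean": 1, "bootstrap_n30": 100}
# Illustrative measurements only, not real benchmark results.
measured = {"mcp_stdio_p50": 58, "bfcl_ast_match_mean": 1.3, "bootstrap_n30": 95}
print(failing_benchmarks(measured, budgets))
```

With these made-up numbers only the BFCL match fails: 58 ms is over the 50 ms budget but inside the 20% margin, while 1.3 ms is above the 1.2 ms ceiling.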

Run the suite locally:

uv run pytest benchmarks/ --benchmark-only

Full budget table and cost model: docs/performance/.

Documentation

License

Apache-2.0 — see LICENSE.
