robotframework-agentguard
A Robot Framework library for testing MCP servers, Agent Skills, Hooks, SubAgents, and coding-agent CLIs — provider-agnostic via LiteLLM, BFCL-grade tool-call matching, statistical N≥10 by default.
AgentGuard turns the moving parts of an agent stack into Robot Framework keywords. Connect to an MCP server, grade a SKILL.md, drive Claude Code / Codex / Aider, replay an A2A subagent task, run BFCL tool-call comparisons, and gate the result with scipy-backed statistics — all without leaving a .robot file.
What it tests
- MCP servers — stdio / SSE / streamable-HTTP / in-memory, full FastMCP client surface
- Agent Skills — SKILL.md discovery, frontmatter validation, Inspect-AI grading
- Hooks — synthesise the 12 Claude Code hook events and assert handler decisions
- SubAgents — A2A 1.0 task lifecycle + LangGraph / CrewAI / AutoGen / OpenAI Agents bridges
- Coding agents — drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; emit the 12 #42796 behavioural metrics
- Public benchmarks — SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench
- Security — default-deny skill scanner, redactor, sandboxed execution, AIDefence integration
- Statistics — Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N
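Cliff's δ and Vargha-Delaney A in that last bullet are standard non-parametric effect sizes over all pairs of observations, related by δ = 2A − 1. A minimal, dependency-free sketch of the two statistics (illustrative only, not AgentGuard's scipy-backed implementation):

```python
def vargha_delaney_a(x, y):
    """Vargha-Delaney A: P(X > Y) + 0.5 * P(X == Y) over all (x, y) pairs."""
    gt = sum(xi > yi for xi in x for yi in y)
    eq = sum(xi == yi for xi in x for yi in y)
    return (gt + 0.5 * eq) / (len(x) * len(y))

def cliffs_delta(x, y):
    """Cliff's delta, related to A by delta = 2A - 1; ranges over [-1, 1]."""
    return 2 * vargha_delaney_a(x, y) - 1
```

A = 0.5 (δ = 0) means no stochastic difference between the two samples; A = 1.0 (δ = 1) means every value in `x` exceeds every value in `y`.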
Installation
Note: the package is pending publication to PyPI. Until then, install from the GitHub source.
From PyPI (once published):
pip install robotframework-agentguard
# or
uv add robotframework-agentguard
From source (today):
pip install git+https://github.com/manykarim/robotframework-agentguard.git
# or, with uv
uv add git+https://github.com/manykarim/robotframework-agentguard.git
Optional extras:
pip install 'robotframework-agentguard[bridges]' # LangGraph / CrewAI / AutoGen / OpenAI Agents
pip install 'robotframework-agentguard[benchmarks]' # SWE-bench, Aider, HumanEval, MBPP, LiveCodeBench
Configure the LLM provider with a .env file in your project root:
OPENROUTER_API_KEY=sk-or-...
# optional overrides
AGENTGUARD_DEFAULT_MODEL=openrouter/anthropic/claude-sonnet-4-5
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
Verify the install:
agentguard doctor # provider, env, MCP reachability
agentguard version
Quickstart
*** Settings ***
Library    AgentGuard    provider=litellm    model=openrouter/anthropic/claude-sonnet-4-5

*** Test Cases ***
AgentGuard Should Be Loaded
    ${info}=    Get AgentGuard Info
    Log    ${info}
The kitchen-sink import Library AgentGuard exposes every keyword. Prefer narrow imports for larger suites:
*** Settings ***
Library    AgentGuard.MCP
Library    AgentGuard.Skill
Library    AgentGuard.Stats
Sub-library imports

| Import line | Purpose |
|---|---|
| Library AgentGuard | Kitchen-sink — every keyword reachable |
| Library AgentGuard.MCP | Test MCP servers (stdio / SSE / streamable-HTTP / in-memory) |
| Library AgentGuard.Skill | Discover, parse, validate, grade Agent Skills |
| Library AgentGuard.Tool | BFCL-style tool-call AST + trajectory matching |
| Library AgentGuard.Stats | Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N |
| Library AgentGuard.Judge | Classification-based LLM-as-Judge with Cohen's κ calibration |
| Library AgentGuard.Security | Default-deny skill scanner, redactor, sandbox, AIDefence |
| Library AgentGuard.Hook | Claude Code hook lifecycle (12 events × 4 handler types) |
| Library AgentGuard.SubAgent | A2A 1.0 task lifecycle, framework bridges |
| Library AgentGuard.Coding | Drive Claude Code / Codex / Aider / OpenCode + #42796 metric pack |
| Library AgentGuard.Benchmark | SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench |
| Library AgentGuard.Scenario | Unified scenario harness — drop-in for manykarim/rf-mcp tests/e2e/ |
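The Judge library gates on Cohen's κ ≥ 0.7, a chance-corrected agreement statistic between two raters (e.g. an LLM judge and human labels). A minimal sketch of the statistic itself, independent of AgentGuard's calibration machinery:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.
    1.0 = perfect agreement, 0.0 = no better than chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement under independence of the two label distributions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label]
                     for label in set(rater_a) | set(rater_b)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)
```

A κ ≥ 0.7 gate means the judge's labels must agree with the reference labels substantially beyond what random labelling with the same marginals would produce.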
Keyword documentation (browsable HTML, generated by Robot Framework libdoc):
- https://manykarim.github.io/robotframework-agentguard/api/ — full per-library API reference, served via GitHub Pages
- docs/KEYWORDS.md — alphabetical text reference for all 147 keywords (Markdown, viewable on GitHub)
- docs/api/ — the source HTML files (regenerated via ./docs/api/generate.sh)
Operator-driven assertions
Every Get-style keyword accepts the standard (assertion_operator, assertion_expected, message) parameters from robotframework-assertion-engine, so a single keyword does both retrieval and assertion:
Tool Hit Rate    ${result}    >=    ${0.7}
Failed Tool Call Count    ${result}    ==    ${0}
Read Edit Ratio    ${session}    >=    ${0.5}
Without operator arguments the same keywords just return the value. See examples/13_assertion_engine_idiom.robot and ADR-022.
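The retrieve-or-assert idiom can be sketched in plain Python. This is a hedged illustration of the pattern robotframework-assertion-engine implements, not AgentGuard's actual code; `get_metric` is a hypothetical stand-in for any Get-style keyword:

```python
import operator

# Map assertion-operator strings to comparison functions.
OPERATORS = {"==": operator.eq, "!=": operator.ne,
             ">=": operator.ge, "<=": operator.le,
             ">": operator.gt, "<": operator.lt}

def get_metric(value, assertion_operator=None, assertion_expected=None, message=None):
    """Return the value; if an operator is supplied, assert it first."""
    if assertion_operator is not None:
        compare = OPERATORS[assertion_operator]
        if not compare(value, assertion_expected):
            raise AssertionError(
                message or f"{value!r} {assertion_operator} {assertion_expected!r} failed")
    return value
```

With no operator the call is a plain getter; with one, failure raises immediately, which is what lets a single keyword replace the old Get + Should Be pair.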
Examples
Runnable Robot suites under examples/:
| File | Topic |
|---|---|
| 01_mcp_server_basics.robot | Connect to an MCP server, list and call tools |
| 02_skill_grading.robot | Grade a SKILL.md against an LLM with Cohen's κ calibration |
| 03_hook_block_destructive.robot | Synthesise hook events, assert blocking decisions |
| 04_subagent_a2a.robot | A2A subagent task lifecycle + trajectory matching |
| 05_coding_agent_metrics.robot | Compute the 12 #42796 behavioural metrics from a session |
| 06_bfcl_tool_selection.robot | BFCL AST equality + trajectory comparison |
| 07_sandbox_run.robot | Run untrusted code under a default-deny Docker sandbox |
| 08_swe_bench.robot | SWE-bench Verified loader + pass@k gate |
| 09_humaneval_live.robot | HumanEval live grading |
| 10_rf_mcp_integration.robot | Drop-in replacement for manykarim/rf-mcp e2e patterns |
| 11_agentskills_grading.robot | Grade manykarim/robotframework-agentskills SKILL.md files |
| 12_mcp_scenario_replacement.robot | YAML-driven scenarios + live LLM driver |
| 13_assertion_engine_idiom.robot | Side-by-side: operator form vs old Should-pair form |
| 14_facade_imports.robot | Side-by-side import variants |
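The pass@k gate mentioned for 08_swe_bench.robot is the standard unbiased estimator from the HumanEval paper (Chen et al., 2021): given n generated samples of which c pass, it estimates the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = total samples, c = correct samples, k = draw size."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Computing the ratio of binomial coefficients directly (rather than 1 − (1 − c/n)^k) avoids the bias that the naive formula introduces when sampling without replacement.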
Run any example:
PYTHONPATH=. robot --outputdir _out examples/01_mcp_server_basics.robot
Architecture (12 bounded contexts)
| Context | Purpose |
|---|---|
| Provider | LiteLLM-backed LLMProviderAdapter + thin vendor adapters |
| MCP | FastMCP client wrapper for stdio / SSE / streamable-http / in-memory |
| Skills | SKILL.md discovery, frontmatter validation, Inspect-AI grading |
| Hooks | Synthesise the 12 Claude Code hook events; assert handler decisions |
| SubAgents | A2A 1.0 task lifecycle + delegation-chain assertions |
| CodingAgent | Drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; normalise JSONL |
| Statistics | scipy-backed Mann-Whitney, Cliff's δ, bootstrap CI, pass@k, TARr@N |
| Judge | Classification-based LLM-as-Judge with calibration gating (Cohen's κ ≥ 0.7) |
| Security | Default-deny skill scanner, redactor, sandbox policy, AIDefence integration |
| Telemetry | OTel spans + Robot Framework listener embedding scorecards in log.html |
| BehavioralMetrics | The 12 calculators from anthropic/claude-code#42796 |
| ToolCallCorrectness | BFCL AST/trajectory matcher used by MCP, Skills, SubAgents |
Aggregates, value objects, and ACLs are in docs/ddd/.
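The ToolCallCorrectness context compares tool calls structurally rather than as strings. A simplified illustration of the kind of AST-level check BFCL performs — match on function name and keyword-argument values, ignoring argument order — using Python's `ast` module (Python 3.9+ for `ast.unparse`); this is not AgentGuard's actual matcher:

```python
import ast

def calls_match(expected: str, actual: str) -> bool:
    """Simplified BFCL-style tool-call comparison: two calls match when
    the function name and keyword-argument values agree, regardless of
    the order the arguments were emitted in."""
    e = ast.parse(expected, mode="eval").body
    a = ast.parse(actual, mode="eval").body
    if not (isinstance(e, ast.Call) and isinstance(a, ast.Call)):
        return False

    def normalise(call):
        # (function name, {arg name: literal value}) — dict comparison
        # makes argument order irrelevant.
        return (ast.unparse(call.func),
                {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords})

    return normalise(e) == normalise(a)
```

String comparison would reject `get_weather(unit='C', city='Paris')` against `get_weather(city='Paris', unit='C')`; the AST form accepts it while still rejecting a genuinely different argument value.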
Performance
Hard ceilings — CI fails if any benchmark exceeds its budget by more than 20%.
| Surface | Budget |
|---|---|
| MCP in-memory roundtrip (p50 / p95) | ≤ 5 / 10 ms |
| MCP stdio roundtrip (p50) | ≤ 50 ms |
| BFCL AST match (mean per call) | ≤ 1 ms |
| mannwhitneyu n=30/30 | ≤ 5 ms |
| bootstrap n=30 / 1000 resamples | ≤ 100 ms |
| Library import + Suite Setup (cold) | ≤ 2 s |
Run the suite locally:
uv run pytest benchmarks/ --benchmark-only
Full budget table and cost model: docs/performance/.
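The "bootstrap n=30 / 1000 resamples" budget row times a bootstrap confidence interval over 30 observations. A sketch of the kind of computation being benchmarked, assuming the percentile method (AgentGuard's scipy-backed implementation may differ):

```python
import random
import statistics

def bootstrap_ci(sample, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    take the mean of each resample, and read off the alpha/2 and
    1 - alpha/2 quantiles of the resulting distribution."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(statistics.fmean(rng.choices(sample, k=n))
                   for _ in range(resamples))
    lo = means[int(alpha / 2 * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi
```

At n=30 with 1000 resamples this is 30,000 draws plus a sort, which is why the budget sits at 100 ms rather than the microsecond range of the AST matcher.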
Documentation
- Plan —
docs/PLAN.md - Keyword reference — https://manykarim.github.io/robotframework-agentguard/api/ (GitHub Pages) ·
docs/KEYWORDS.md(Markdown) - Architecture Decision Records —
docs/adr/ - Domain model —
docs/ddd/ - Performance budgets —
docs/performance/ - Research dossier —
docs/research/research.md - Contributing —
CONTRIBUTING.md - Security —
SECURITY.md
License
Apache-2.0 — see LICENSE.