robotframework-agentguard
A Robot Framework library for testing MCP servers, Agent Skills, Hooks, SubAgents, and coding-agent CLIs — provider-agnostic via LiteLLM, BFCL-grade tool-call matching, statistical N≥10 by default.
AgentGuard turns the moving parts of an agent stack into Robot Framework keywords. Connect to an MCP server, grade a SKILL.md, drive Claude Code / Codex / Aider, replay an A2A subagent task, run BFCL tool-call comparisons, and gate the result with scipy-backed statistics — all without leaving a .robot file.
What it tests
- MCP servers — stdio / SSE / streamable-HTTP / in-memory, full FastMCP client surface
- Agent Skills — SKILL.md discovery, frontmatter validation, Inspect-AI grading
- Hooks — synthesise the 12 Claude Code hook events and assert handler decisions
- SubAgents — A2A 1.0 task lifecycle + LangGraph / CrewAI / AutoGen / OpenAI Agents bridges
- Coding agents — drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; emit the 12 #42796 behavioural metrics
- Public benchmarks — SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench
- Security — default-deny skill scanner, redactor, sandboxed execution, AIDefence integration
- Statistics — Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N
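Cliff's δ and Vargha-Delaney A in that last bullet are standard non-parametric effect sizes over all pairs of observations, related by δ = 2A − 1. A minimal, dependency-free sketch of the two statistics (illustrative only, not AgentGuard's scipy-backed implementation):

```python
def vargha_delaney_a(x, y):
    """Vargha-Delaney A: P(X > Y) + 0.5 * P(X == Y) over all (x, y) pairs."""
    gt = sum(xi > yi for xi in x for yi in y)
    eq = sum(xi == yi for xi in x for yi in y)
    return (gt + 0.5 * eq) / (len(x) * len(y))

def cliffs_delta(x, y):
    """Cliff's delta, related to A by delta = 2A - 1; ranges over [-1, 1]."""
    return 2 * vargha_delaney_a(x, y) - 1
```

A = 0.5 (δ = 0) means no stochastic difference between the two samples; A = 1.0 (δ = 1) means every value in `x` exceeds every value in `y`.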
Installation
Note: the package is pending publication to PyPI. Until then, install from the GitHub source.
From PyPI (once published):
pip install robotframework-agentguard
# or
uv add robotframework-agentguard
From source (today):
pip install git+https://github.com/manykarim/robotframework-agentguard.git
# or, with uv
uv add git+https://github.com/manykarim/robotframework-agentguard.git
Optional extras:
pip install 'robotframework-agentguard[bridges]' # LangGraph / CrewAI / AutoGen / OpenAI Agents
pip install 'robotframework-agentguard[benchmarks]' # SWE-bench, Aider, HumanEval, MBPP, LiveCodeBench
Configure the LLM provider with a .env file in your project root:
OPENROUTER_API_KEY=sk-or-...
# optional overrides
AGENTGUARD_DEFAULT_MODEL=openrouter/anthropic/claude-sonnet-4-5
AGENTGUARD_JUDGE_MODEL=openrouter/openai/gpt-4o-mini
Verify the install:
agentguard doctor # provider, env, MCP reachability
agentguard version
Quickstart
*** Settings ***
Library    AgentGuard    provider=litellm    model=openrouter/anthropic/claude-sonnet-4-5

*** Test Cases ***
AgentGuard Should Be Loaded
    ${info}=    Get AgentGuard Info
    Log    ${info}
The kitchen-sink import Library AgentGuard exposes every keyword. Prefer narrow imports for larger suites:
*** Settings ***
Library    AgentGuard.MCP
Library    AgentGuard.Skill
Library    AgentGuard.Stats
Sub-library imports

| Import line | Purpose |
|---|---|
| Library AgentGuard | Kitchen-sink — every keyword reachable |
| Library AgentGuard.MCP | Test MCP servers (stdio / SSE / streamable-HTTP / in-memory) |
| Library AgentGuard.Skill | Discover, parse, validate, grade Agent Skills |
| Library AgentGuard.Tool | BFCL-style tool-call AST + trajectory matching |
| Library AgentGuard.Stats | Mann-Whitney U, Cliff's δ, Vargha-Delaney A, bootstrap CIs, pass@k, TARr@N |
| Library AgentGuard.Judge | Classification-based LLM-as-Judge with Cohen's κ calibration |
| Library AgentGuard.Security | Default-deny skill scanner, redactor, sandbox, AIDefence |
| Library AgentGuard.Hook | Claude Code hook lifecycle (12 events × 4 handler types) |
| Library AgentGuard.SubAgent | A2A 1.0 task lifecycle, framework bridges |
| Library AgentGuard.Coding | Drive Claude Code / Codex / Aider / OpenCode + #42796 metric pack |
| Library AgentGuard.Benchmark | SWE-bench Verified, Aider, HumanEval, MBPP, LiveCodeBench |
| Library AgentGuard.Scenario | Unified scenario harness — drop-in for manykarim/rf-mcp tests/e2e/ |
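The Judge library gates on Cohen's κ ≥ 0.7, a chance-corrected agreement statistic between two raters (e.g. an LLM judge and human labels). A minimal sketch of the statistic itself, independent of AgentGuard's calibration machinery:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.
    1.0 = perfect agreement, 0.0 = no better than chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labelled identically.
    p_observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement under independence of the two label distributions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label]
                     for label in set(rater_a) | set(rater_b)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)
```

A κ ≥ 0.7 gate means the judge's labels must agree with the reference labels substantially beyond what random labelling with the same marginals would produce.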
Keyword documentation (browsable HTML, generated by Robot Framework libdoc):
- https://manykarim.github.io/robotframework-agentguard/api/ — full per-library API reference, served via GitHub Pages
- docs/KEYWORDS.md — alphabetical text reference for all 147 keywords (Markdown, viewable on GitHub)
- docs/api/ — the source HTML files (regenerated via ./docs/api/generate.sh)
Operator-driven assertions
Every Get-style keyword accepts the standard (assertion_operator, assertion_expected, message) parameters from robotframework-assertion-engine, so a single keyword does both retrieval and assertion:
Tool Hit Rate    ${result}    >=    ${0.7}
Failed Tool Call Count    ${result}    ==    ${0}
Read Edit Ratio    ${session}    >=    ${0.5}
Without operator arguments the same keywords just return the value. See examples/13_assertion_engine_idiom.robot and ADR-022.
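The retrieve-or-assert idiom can be sketched in plain Python. This is a hedged illustration of the pattern robotframework-assertion-engine implements, not AgentGuard's actual code; `get_metric` is a hypothetical stand-in for any Get-style keyword:

```python
import operator

# Map assertion-operator strings to comparison functions.
OPERATORS = {"==": operator.eq, "!=": operator.ne,
             ">=": operator.ge, "<=": operator.le,
             ">": operator.gt, "<": operator.lt}

def get_metric(value, assertion_operator=None, assertion_expected=None, message=None):
    """Return the value; if an operator is supplied, assert it first."""
    if assertion_operator is not None:
        compare = OPERATORS[assertion_operator]
        if not compare(value, assertion_expected):
            raise AssertionError(
                message or f"{value!r} {assertion_operator} {assertion_expected!r} failed")
    return value
```

With no operator the call is a plain getter; with one, failure raises immediately, which is what lets a single keyword replace the old Get + Should Be pair.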
Examples
Runnable Robot suites under examples/:
| File | Topic |
|---|---|
| 01_mcp_server_basics.robot | Connect to an MCP server, list and call tools |
| 02_skill_grading.robot | Grade a SKILL.md against an LLM with Cohen's κ calibration |
| 03_hook_block_destructive.robot | Synthesise hook events, assert blocking decisions |
| 04_subagent_a2a.robot | A2A subagent task lifecycle + trajectory matching |
| 05_coding_agent_metrics.robot | Compute the 12 #42796 behavioural metrics from a session |
| 06_bfcl_tool_selection.robot | BFCL AST equality + trajectory comparison |
| 07_sandbox_run.robot | Run untrusted code under a default-deny Docker sandbox |
| 08_swe_bench.robot | SWE-bench Verified loader + pass@k gate |
| 09_humaneval_live.robot | HumanEval live grading |
| 10_rf_mcp_integration.robot | Drop-in replacement for manykarim/rf-mcp e2e patterns |
| 11_agentskills_grading.robot | Grade manykarim/robotframework-agentskills SKILL.md files |
| 12_mcp_scenario_replacement.robot | YAML-driven scenarios + live LLM driver |
| 13_assertion_engine_idiom.robot | Side-by-side: operator form vs old Should-pair form |
| 14_facade_imports.robot | Side-by-side import variants |
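The pass@k gate mentioned for 08_swe_bench.robot is the standard unbiased estimator from the HumanEval paper (Chen et al., 2021): given n generated samples of which c pass, it estimates the probability that at least one of k randomly drawn samples is correct. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = total samples, c = correct samples, k = draw size."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Computing the ratio of binomial coefficients directly (rather than 1 − (1 − c/n)^k) avoids the bias that the naive formula introduces when sampling without replacement.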
Run any example:
PYTHONPATH=. robot --outputdir _out examples/01_mcp_server_basics.robot
Architecture (12 bounded contexts)
| Context | Purpose |
|---|---|
| Provider | LiteLLM-backed LLMProviderAdapter + thin vendor adapters |
| MCP | FastMCP client wrapper for stdio / SSE / streamable-http / in-memory |
| Skills | SKILL.md discovery, frontmatter validation, Inspect-AI grading |
| Hooks | Synthesise the 12 Claude Code hook events; assert handler decisions |
| SubAgents | A2A 1.0 task lifecycle + delegation-chain assertions |
| CodingAgent | Drive Claude Code, Codex CLI, Aider, OpenCode, Cline, Continue; normalise JSONL |
| Statistics | scipy-backed Mann-Whitney, Cliff's δ, bootstrap CI, pass@k, TARr@N |
| Judge | Classification-based LLM-as-Judge with calibration gating (Cohen's κ ≥ 0.7) |
| Security | Default-deny skill scanner, redactor, sandbox policy, AIDefence integration |
| Telemetry | OTel spans + Robot Framework listener embedding scorecards in log.html |
| BehavioralMetrics | The 12 calculators from anthropic/claude-code#42796 |
| ToolCallCorrectness | BFCL AST/trajectory matcher used by MCP, Skills, SubAgents |
Aggregates, value objects, and ACLs are in docs/ddd/.
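The ToolCallCorrectness context compares tool calls structurally rather than as strings. A simplified illustration of the kind of AST-level check BFCL performs — match on function name and keyword-argument values, ignoring argument order — using Python's `ast` module (Python 3.9+ for `ast.unparse`); this is not AgentGuard's actual matcher:

```python
import ast

def calls_match(expected: str, actual: str) -> bool:
    """Simplified BFCL-style tool-call comparison: two calls match when
    the function name and keyword-argument values agree, regardless of
    the order the arguments were emitted in."""
    e = ast.parse(expected, mode="eval").body
    a = ast.parse(actual, mode="eval").body
    if not (isinstance(e, ast.Call) and isinstance(a, ast.Call)):
        return False

    def normalise(call):
        # (function name, {arg name: literal value}) — dict comparison
        # makes argument order irrelevant.
        return (ast.unparse(call.func),
                {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords})

    return normalise(e) == normalise(a)
```

String comparison would reject `get_weather(unit='C', city='Paris')` against `get_weather(city='Paris', unit='C')`; the AST form accepts it while still rejecting a genuinely different argument value.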
Performance
Hard ceilings — CI fails if any benchmark exceeds its budget by more than 20%.
| Surface | Budget |
|---|---|
| MCP in-memory roundtrip (p50 / p95) | ≤ 5 / 10 ms |
| MCP stdio roundtrip (p50) | ≤ 50 ms |
| BFCL AST match (mean per call) | ≤ 1 ms |
| mannwhitneyu n=30/30 | ≤ 5 ms |
| bootstrap n=30 / 1000 resamples | ≤ 100 ms |
| Library import + Suite Setup (cold) | ≤ 2 s |
Run the suite locally:
uv run pytest benchmarks/ --benchmark-only
Full budget table and cost model: docs/performance/.
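The "bootstrap n=30 / 1000 resamples" budget row times a bootstrap confidence interval over 30 observations. A sketch of the kind of computation being benchmarked, assuming the percentile method (AgentGuard's scipy-backed implementation may differ):

```python
import random
import statistics

def bootstrap_ci(sample, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean: resample with replacement,
    take the mean of each resample, and read off the alpha/2 and
    1 - alpha/2 quantiles of the resulting distribution."""
    rng = random.Random(seed)
    n = len(sample)
    means = sorted(statistics.fmean(rng.choices(sample, k=n))
                   for _ in range(resamples))
    lo = means[int(alpha / 2 * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi
```

At n=30 with 1000 resamples this is 30,000 draws plus a sort, which is why the budget sits at 100 ms rather than the microsecond range of the AST matcher.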
Documentation
- Plan —
docs/PLAN.md - Keyword reference — https://manykarim.github.io/robotframework-agentguard/api/ (GitHub Pages) ·
docs/KEYWORDS.md(Markdown) - Architecture Decision Records —
docs/adr/ - Domain model —
docs/ddd/ - Performance budgets —
docs/performance/ - Research dossier —
docs/research/research.md - Contributing —
CONTRIBUTING.md - Security —
SECURITY.md
License
Apache-2.0 — see LICENSE.