The testing framework for skill engineering. Test tool descriptions, prompt templates, agent skills, and custom agents with real LLMs. AI analyzes results and tells you what to fix.
pytest-skill-engineering
Skill Engineering. Test-driven. AI-analyzed.
A pytest plugin for skill engineering — test your MCP server tools, prompt templates, agent skills, and .agent.md instruction files with real LLMs. Red/Green/Refactor for the skill stack. Let AI analysis tell you what to fix.
Why?
Modern AI systems are built on skill engineering — the discipline of designing modular, reliable, callable capabilities that an LLM can discover, invoke, and orchestrate to perform real tasks. Skills are what separate "text generator" from "coding agent that actually does things."
An MCP server is the runtime for those skills. It doesn't ship alone: it comes bundled with the full skill engineering stack of tools (callable functions), prompt templates (server-side reasoning starters), agent skills (domain knowledge and behavioral guidelines), and .agent.md instruction files (specialist sub-agent definitions in VS Code / Claude Code format). Users layer their own prompt files (slash commands like /review) on top.
Your unit tests cover the server code. Nothing covers the skill stack. And the skill stack is what the LLM actually sees.
Skill engineering breaks in ways code tests can't catch:
- The tool description is too vague — the LLM picks the wrong tool or passes garbage parameters
- The prompt template renders correctly but the assembled message confuses the LLM
- A prompt file's slash command produces garbage output because the instructions are ambiguous
- The skill has the right facts but is structured so poorly the LLM skips it
- The .agent.md file has the right tools listed but the description is too vague to trigger subagent dispatch
And when you're improving it — how do you know version A is better than version B?
Skill engineering is iterative — prompt tuning, tool description refinement, .agent.md instructions, skill structure. You need A/B testing built in. Run both versions, same prompts, and let the leaderboard tell you which one wins on pass rate and cost.
That's what pytest-skill-engineering does: test the full skill engineering stack, compare variants, and get AI analysis that tells you exactly what to fix.
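As a rough illustration of ranking variants on pass rate and cost, here is a minimal sketch in plain Python. It does not use the plugin's API: the `VariantResult` class, the `rank` helper, the variant names, and all numbers are hypothetical, and the tie-breaking rule (cheaper variant wins) is an assumption rather than the plugin's documented behavior.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    """Aggregated results for one variant (hypothetical structure)."""

    name: str
    passed: int
    total: int
    cost_usd: float

    @property
    def pass_rate(self) -> float:
        # Guard against an empty run so the ratio is always defined.
        return self.passed / self.total if self.total else 0.0


def rank(results: list[VariantResult]) -> list[VariantResult]:
    # Highest pass rate first; on a tie, the cheaper variant ranks higher.
    return sorted(results, key=lambda r: (-r.pass_rate, r.cost_usd))


# Made-up numbers for three versions of the same tool description.
results = [
    VariantResult("description-v1", passed=7, total=10, cost_usd=0.042),
    VariantResult("description-v2", passed=9, total=10, cost_usd=0.051),
    VariantResult("description-v3", passed=9, total=10, cost_usd=0.038),
]

for place, r in enumerate(rank(results), start=1):
    print(f"{place}. {r.name}: {r.pass_rate:.0%} pass, ${r.cost_usd:.3f}")
```

Sorting on pass rate before cost reflects the idea that a cheap variant that fails more often is rarely the one you want to ship.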
How It Works
Write tests as natural language prompts — you assert on what happened. If a test fails, your tool descriptions or skill need work, not your code:
- Write a test: a prompt that describes what a user would say
- Run it: an LLM tries to use your tools and fails
- Fix the skill stack: improve tool descriptions, schemas, prompts, or .agent.md instructions until it passes
- AI analysis tells you what else to optimize: cost, redundant calls, unused tools
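The loop above can be illustrated without any LLM at all. The sketch below uses a deliberately crude stand-in that picks a tool by word overlap between the prompt and each tool description. Real models behave very differently, and `pick_tool`, `words`, and the `list_transactions` tool are hypothetical names invented for this example. It only shows why a vague description can send a request to the wrong tool and how rewriting the description, not the server code, turns the test green.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase word set, used for a crude overlap score."""
    return set(re.findall(r"[a-z]+", text.lower()))


def pick_tool(prompt: str, tools: dict[str, str]) -> str:
    """Toy stand-in for an LLM: choose the tool whose description
    shares the most words with the prompt."""
    prompt_words = words(prompt)
    return max(tools, key=lambda name: len(prompt_words & words(tools[name])))


prompt = "What's my checking balance?"

# Red: both descriptions are vague, so the wrong tool wins.
vague = {
    "get_balance": "Gets data.",
    "list_transactions": "Gets my account data.",
}

# Green: only the descriptions changed; the tool code is untouched.
improved = {
    "get_balance": "Return the current balance of a checking or savings account.",
    "list_transactions": "List recent transactions for an account.",
}

print(pick_tool(prompt, vague))     # picks list_transactions
print(pick_tool(prompt, improved))  # picks get_balance
```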
pytest-skill-engineering ships two test harnesses:
| | Eval + eval_run | CopilotEval + copilot_eval |
|---|---|---|
| Runs the LLM | Pydantic AI synthetic loop | Real GitHub Copilot (CLI SDK) |
| Model | Any provider (Azure, OpenAI, Copilot) | Copilot's active model only |
| MCP auth | You supply tokens / env vars | Copilot handles OAuth automatically |
| Introspection | Full per-call (tool name, args, timing) | Summary (tool names, final response) |
| Cost tracking | USD per test (via litellm pricing) | Premium requests (Copilot billing) |
| Setup | API keys + model config | gh auth login (Copilot subscription) |
Eval + eval_run — bring your own model
You configure the model, wire up MCP servers directly, and get full per-call introspection. Best for iterating on tool descriptions, A/B testing model variants, and cheap CI runs:
from pytest_skill_engineering import Eval, Provider, MCPServer


async def test_balance_query(eval_run):
    agent = Eval(
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[MCPServer(command=["python", "-m", "my_banking_server"])],
    )
    result = await eval_run(agent, "What's my checking balance?")
    assert result.success
    assert result.tool_was_called("get_balance")
CopilotEval + copilot_eval — use the real coding agent
Runs the actual GitHub Copilot coding agent — the same one your users have. No model setup, no API keys. Best for end-to-end testing: OAuth handled automatically, skills and custom agents loaded natively:
from pytest_skill_engineering.copilot import CopilotEval


async def test_skill(copilot_eval):
    agent = CopilotEval(skill_directories=["skills/my-skill"])
    result = await copilot_eval(agent, "What can you help me with?")
    assert result.success
→ Choosing a Test Harness — full trade-off guide
AI Analysis
AI analyzes your results and tells you what to fix: which model to deploy, how to improve tool descriptions, where to cut costs. See a sample report →
Quick Start
Using GitHub Copilot? Zero setup:
uv add pytest-skill-engineering[copilot]
gh auth login # one-time
pytest tests/
Using your own model (Azure, OpenAI, Anthropic…):
uv add pytest-skill-engineering
export AZURE_API_BASE=https://your-resource.openai.azure.com/
az login
pytest tests/
AI Analysis judge model (optional but recommended)
The AI analysis report needs a model to generate insights. Configure it in pyproject.toml:
GitHub Copilot:
[tool.pytest.ini_options]
addopts = "--aitest-summary-model=copilot/gpt-5-mini"
Azure OpenAI:
[tool.pytest.ini_options]
addopts = "--aitest-summary-model=azure/gpt-5.2-chat"
Features
- MCP Server Testing: Real models against real tool interfaces and bundled prompt templates
- Prompt File Testing: Test VS Code .prompt.md and Claude Code command files (slash commands) with load_prompt_file() / load_prompt_files()
- CLI Server Testing: Wrap CLIs as testable tool servers
- Real Coding Agent Testing: CopilotEval + copilot_eval runs the actual Copilot coding agent (native OAuth, skill loading, custom agent dispatch, exact user experience)
- .agent.md Testing: Load .agent.md files with Eval.from_agent_file() to test instructions with any model, or use load_custom_agent() + CopilotEval to test real custom agent dispatch
- Eval Comparison: Compare models, skills, .agent.md versions, and server configurations
- Eval Leaderboard: Auto-ranked by pass rate and cost
- Multi-Turn Sessions: Test conversations that build on context
- AI Analysis: Actionable feedback on tool descriptions, prompts, and costs
- Multi-Provider: Any model via Pydantic AI (OpenAI, Anthropic, Gemini, Azure, Bedrock, Mistral, and more)
- Copilot SDK Provider: Use copilot/gpt-5-mini for all LLM calls (judge, insights, scoring), with zero additional setup via pytest-skill-engineering[copilot]
- Clarification Detection: Catch evals that ask questions instead of acting
- Semantic Assertions: Built-in llm_assert fixture powered by pydantic-evals LLM judge
- Multi-Dimension Scoring: llm_score fixture for granular quality measurement across named dimensions
- Image Assertions: llm_assert_image for AI-graded visual evaluation of screenshots and charts
- Cost Estimation: Automatic per-test cost tracking with pricing from litellm + custom overrides
Who This Is For
- MCP server authors: Validate that LLMs can actually use your tools
- Copilot skill authors: Test skills and .agent.md instructions exactly as users experience them
- Eval builders: Compare models, prompts, and skills to find the best configuration
- Teams shipping AI systems: Catch LLM-facing regressions in CI/CD
Documentation
Requirements
- Python 3.11+
- pytest 9.0+
- An LLM provider (Azure, OpenAI, Anthropic, etc.) or a GitHub Copilot subscription (pytest-skill-engineering[copilot])
Acknowledgments
Inspired by agent-benchmark.
License
MIT