The testing framework for skill engineering. Test tool descriptions, prompt templates, agent skills, and custom agents with real LLMs. AI analyzes results and tells you what to fix.

These details have been verified by PyPI

Maintainers

sbroenne

Project description

pytest-skill-engineering

Skill Engineering. Test-driven. AI-analyzed.

A pytest plugin for skill engineering — test your MCP server tools, prompt templates, agent skills, and .agent.md instruction files with real LLMs. Red/Green/Refactor for the skill stack. Let AI analysis tell you what to fix.

Why?

Modern AI systems are built on skill engineering — the discipline of designing modular, reliable, callable capabilities that an LLM can discover, invoke, and orchestrate to perform real tasks. Skills are what separate "text generator" from "coding agent that actually does things."

An MCP server is the runtime for those skills. It doesn't ship alone — it comes bundled with the full skill engineering stack: tools (callable functions), prompt templates (server-side reasoning starters), agent skills (domain knowledge and behavioral guidelines), and .agent.md instruction files (specialist sub-agent definitions in VS Code / Claude Code format). Users layer on their own prompt files (slash commands like /review) on top.

Your unit tests cover the server code. Nothing covers the skill stack. And the skill stack is what the LLM actually sees.

Skill engineering breaks in ways code tests can't catch:

- The tool description is too vague, so the LLM picks the wrong tool or passes garbage parameters
- The prompt template renders correctly but the assembled message confuses the LLM
- A prompt file's slash command produces garbage output because the instructions are ambiguous
- The skill has the right facts but is structured so poorly that the LLM skips it
- The .agent.md file has the right tools listed but the description is too vague to trigger subagent dispatch
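
To make the first failure mode concrete, here is a small sketch contrasting a vague and a specific tool definition. The dictionaries follow the MCP tool shape (`name`, `description`, `inputSchema`); the contents of the `get_balance` tool and the `undocumented_params` helper are hypothetical illustrations, not part of pytest-skill-engineering.

```python
# Hypothetical tool definitions in the MCP tool shape (name, description, inputSchema).
vague_tool = {
    "name": "get_balance",
    "description": "Gets data.",
    "inputSchema": {
        "type": "object",
        "properties": {"account": {"type": "string"}},
        "required": ["account"],
    },
}

specific_tool = {
    "name": "get_balance",
    "description": "Return the current balance of one of the user's bank accounts.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "account": {
                "type": "string",
                "description": "Account type, for example 'checking' or 'savings'.",
            }
        },
        "required": ["account"],
    },
}


def undocumented_params(tool: dict) -> list[str]:
    """Return the names of parameters that carry no description (illustrative check)."""
    props = tool["inputSchema"].get("properties", {})
    return [name for name, schema in props.items() if not schema.get("description")]


print(undocumented_params(vague_tool))
print(undocumented_params(specific_tool))
```

A static check like this only catches missing text. Whether a description actually steers the model toward the right tool is what the LLM-backed tests are meant to measure.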

And when you're improving it — how do you know version A is better than version B?

Skill engineering is iterative — prompt tuning, tool description refinement, .agent.md instructions, skill structure. You need A/B testing built in. Run both versions, same prompts, and let the leaderboard tell you which one wins on pass rate and cost.
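
As a rough illustration of what "wins on pass rate and cost" could mean, the sketch below ranks variants by pass rate and breaks ties by cost. The `VariantResult` structure, the sample numbers, and the tie-breaking rule are assumptions made for this example; the plugin's actual leaderboard logic may differ.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    """Aggregated outcome for one variant (hypothetical structure)."""
    name: str
    passed: int
    total: int
    cost_usd: float

    @property
    def pass_rate(self) -> float:
        # Guard against a variant that ran no tests.
        return self.passed / self.total if self.total else 0.0


def rank(results: list[VariantResult]) -> list[VariantResult]:
    # Higher pass rate first; among equal pass rates, the cheaper variant wins.
    return sorted(results, key=lambda r: (-r.pass_rate, r.cost_usd))


# Made-up numbers for three versions of the same tool description.
results = [
    VariantResult("description-v1", passed=7, total=10, cost_usd=0.042),
    VariantResult("description-v2", passed=9, total=10, cost_usd=0.051),
    VariantResult("description-v3", passed=9, total=10, cost_usd=0.038),
]

for position, r in enumerate(rank(results), start=1):
    print(f"{position}. {r.name}: {r.pass_rate:.0%} pass, ${r.cost_usd:.3f}")
```

Running every variant against the same prompts is what makes the comparison meaningful: the only thing that changes between rows is the skill artifact under test.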

That's what pytest-skill-engineering does: test the full skill engineering stack, compare variants, and get AI analysis that tells you exactly what to fix.

How It Works

Write tests as natural language prompts — you assert on what happened. If a test fails, your tool descriptions or skill need work, not your code:

1. Write a test: a prompt that describes what a user would say
2. Run it: an LLM tries to use your tools and fails
3. Fix the skill stack: improve tool descriptions, schemas, prompts, or .agent.md instructions until it passes
4. AI analysis tells you what else to optimize: cost, redundant calls, unused tools

pytest-skill-engineering ships two test harnesses:

| | `Eval` + `eval_run` | `CopilotEval` + `copilot_eval` |
| --- | --- | --- |
| Runs the LLM | Pydantic AI synthetic loop | Real GitHub Copilot (CLI SDK) |
| Model | Any provider (Azure, OpenAI, Copilot) | Copilot's active model only |
| MCP auth | You supply tokens / env vars | Copilot handles OAuth automatically |
| Introspection | Full per-call (tool name, args, timing) | Summary (tool names, final response) |
| Cost tracking | USD per test (via litellm pricing) | Premium requests (Copilot billing) |
| Setup | API keys + model config | `gh auth login` (Copilot subscription) |

Eval + `eval_run` — bring your own model

You configure the model, wire up MCP servers directly, and get full per-call introspection. Best for iterating on tool descriptions, A/B testing model variants, and cheap CI runs:

from pytest_skill_engineering import Eval, Provider, MCPServer

async def test_balance_query(eval_run):
    # Choose the model and point the eval at the MCP server under test.
    agent = Eval(
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[MCPServer(command=["python", "-m", "my_banking_server"])],
    )
    # Send a natural-language prompt, then assert on what the LLM actually did.
    result = await eval_run(agent, "What's my checking balance?")
    assert result.success
    assert result.tool_was_called("get_balance")
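
To illustrate what per-call introspection (tool name, args, timing) makes possible, here is a self-contained sketch of a recorded run. `ToolCall` and `RunRecord` are hypothetical stand-ins written for this example, not the plugin's real result classes; only the idea of asserting on recorded tool calls comes from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical record shape)."""
    name: str
    args: dict
    duration_ms: float


@dataclass
class RunRecord:
    """Stand-in for a test result; not the plugin's real result class."""
    calls: list[ToolCall] = field(default_factory=list)

    def tool_was_called(self, name: str) -> bool:
        return any(call.name == name for call in self.calls)

    def call_count(self, name: str) -> int:
        return sum(1 for call in self.calls if call.name == name)


# A made-up trace: the model lists accounts, then fetches one balance.
record = RunRecord(calls=[
    ToolCall("list_accounts", {}, 12.5),
    ToolCall("get_balance", {"account": "checking"}, 30.1),
])

assert record.tool_was_called("get_balance")
assert not record.tool_was_called("transfer")
assert record.call_count("get_balance") == 1
```

With a full call log, a test can check not only that the right tool was used but also that it was called once with sensible arguments, which is the kind of detail a summary-only harness cannot expose.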

CopilotEval + `copilot_eval` — use the real coding agent

Runs the actual GitHub Copilot coding agent — the same one your users have. No model setup, no API keys. Best for end-to-end testing: OAuth handled automatically, skills and custom agents loaded natively:

from pytest_skill_engineering.copilot import CopilotEval

async def test_skill(copilot_eval):
    # Load the skill directory so Copilot picks it up natively.
    agent = CopilotEval(skill_directories=["skills/my-skill"])
    result = await copilot_eval(agent, "What can you help me with?")
    assert result.success

→ Choosing a Test Harness — full trade-off guide

AI Analysis

AI analyzes your results and tells you what to fix: which model to deploy, how to improve tool descriptions, where to cut costs. See a sample report →

Quick Start

Using GitHub Copilot? Zero setup:

uv add "pytest-skill-engineering[copilot]"  # quotes keep shells such as zsh from globbing the brackets
gh auth login  # one-time
pytest tests/

Using your own model (Azure, OpenAI, Anthropic…):

uv add pytest-skill-engineering
export AZURE_API_BASE=https://your-resource.openai.azure.com/
az login
pytest tests/

AI Analysis judge model (optional but recommended)

The AI analysis report needs a model to generate insights. Configure it in pyproject.toml:

GitHub Copilot:

[tool.pytest.ini_options]
addopts = "--aitest-summary-model=copilot/gpt-5-mini"

Azure OpenAI:

[tool.pytest.ini_options]
addopts = "--aitest-summary-model=azure/gpt-5.2-chat"

Features

- MCP Server Testing: Real models against real tool interfaces and bundled prompt templates
- Prompt File Testing: Test VS Code .prompt.md and Claude Code command files (slash commands) with load_prompt_file() / load_prompt_files()
- CLI Server Testing: Wrap CLIs as testable tool servers
- Real Coding Agent Testing: CopilotEval + copilot_eval runs the actual Copilot coding agent (native OAuth, skill loading, custom agent dispatch, exact user experience)
- .agent.md Testing: Load .agent.md files with Eval.from_agent_file() to test instructions with any model, or use load_custom_agent() + CopilotEval to test real custom agent dispatch
- Eval Comparison: Compare models, skills, .agent.md versions, and server configurations
- Eval Leaderboard: Auto-ranked by pass rate and cost
- Multi-Turn Sessions: Test conversations that build on context
- AI Analysis: Actionable feedback on tool descriptions, prompts, and costs
- Multi-Provider: Any model via Pydantic AI (OpenAI, Anthropic, Gemini, Azure, Bedrock, Mistral, and more)
- Copilot SDK Provider: Use copilot/gpt-5-mini for all LLM calls (judge, insights, scoring), with zero additional setup via pytest-skill-engineering[copilot]
- Clarification Detection: Catch evals that ask questions instead of acting
- Semantic Assertions: Built-in llm_assert fixture powered by pydantic-evals LLM judge
- Multi-Dimension Scoring: llm_score fixture for granular quality measurement across named dimensions
- Image Assertions: llm_assert_image for AI-graded visual evaluation of screenshots and charts
- Cost Estimation: Automatic per-test cost tracking with pricing from litellm + custom overrides
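
The cost estimation feature can be pictured as token counts multiplied by per-token prices, with custom overrides taking precedence over the default table. The sketch below shows that arithmetic only; the price values, the `example-model` name, and the `estimate_cost` function are placeholders invented for this example, and the plugin's real pricing lookup is not shown here.

```python
# Placeholder prices in USD per token; real values would come from a pricing source.
PRICES = {
    "example-model": {"input": 0.25e-6, "output": 2.00e-6},
}

# Custom overrides replace individual entries of the default table.
OVERRIDES = {
    "example-model": {"output": 1.50e-6},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one test run from its token usage."""
    price = {**PRICES[model], **OVERRIDES.get(model, {})}
    return input_tokens * price["input"] + output_tokens * price["output"]


cost = estimate_cost("example-model", input_tokens=4_000, output_tokens=500)
print(f"${cost:.6f}")
```

Summing this value over every LLM call in a test gives a per-test figure that can sit next to the pass rate when variants are compared.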

Who This Is For

- MCP server authors: Validate that LLMs can actually use your tools
- Copilot skill authors: Test skills and .agent.md instructions exactly as users experience them
- Eval builders: Compare models, prompts, and skills to find the best configuration
- Teams shipping AI systems: Catch LLM-facing regressions in CI/CD

Documentation

📚 Full Documentation

Requirements

- Python 3.11+
- pytest 9.0+
- An LLM provider (Azure, OpenAI, Anthropic, etc.) or a GitHub Copilot subscription (pytest-skill-engineering[copilot])

Acknowledgments

Inspired by agent-benchmark.

License

MIT

Project details

Release history

- 0.6.5: Apr 20, 2026
- 0.6.4: Apr 19, 2026
- 0.6.3: Apr 18, 2026
- 0.6.2: Apr 18, 2026
- 0.6.1: Apr 11, 2026
- 0.6.0: Apr 11, 2026
- 0.5.9: Mar 31, 2026
- 0.5.8: Mar 23, 2026
- 0.5.7: Mar 22, 2026
- 0.3.0: Mar 22, 2026
- 0.2.0: Mar 21, 2026
- 0.1.0: Mar 1, 2026
- 0.0.2: Feb 28, 2026 (this version)
- 0.0.1: Feb 23, 2026

Download files

Source Distribution

pytest_skill_engineering-0.0.2.tar.gz (138.6 kB)

Uploaded Feb 28, 2026 Source

Built Distribution

pytest_skill_engineering-0.0.2-py3-none-any.whl (178.4 kB)

Uploaded Feb 28, 2026 Python 3

File details

Details for the file pytest_skill_engineering-0.0.2.tar.gz.

File metadata

Download URL: pytest_skill_engineering-0.0.2.tar.gz
Upload date: Feb 28, 2026
Size: 138.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pytest_skill_engineering-0.0.2.tar.gz
Algorithm	Hash digest
SHA256	`73e080c45774dc401fc1388cb94145682c25201ef997dfee3d163cc73ce3bdf9`
MD5	`029360ae621cfb405e62815545c1cb30`
BLAKE2b-256	`8972017254c82bb67fc34ecbed68ef4585647c26fe33453c5727e674bda75766`

Provenance

The following attestation bundles were made for pytest_skill_engineering-0.0.2.tar.gz:

Publisher: release.yml on sbroenne/pytest-skill-engineering

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_skill_engineering-0.0.2.tar.gz
- Subject digest: 73e080c45774dc401fc1388cb94145682c25201ef997dfee3d163cc73ce3bdf9
- Sigstore transparency entry: 1005189196
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: sbroenne/pytest-skill-engineering@8b4225ab86a1aad4e1a6eb13050ebb6090b68e58
- Branch / Tag: refs/heads/main
- Owner: https://github.com/sbroenne
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8b4225ab86a1aad4e1a6eb13050ebb6090b68e58
- Trigger Event: workflow_dispatch

File details

Details for the file pytest_skill_engineering-0.0.2-py3-none-any.whl.

File metadata

Download URL: pytest_skill_engineering-0.0.2-py3-none-any.whl
Upload date: Feb 28, 2026
Size: 178.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pytest_skill_engineering-0.0.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a253a7359fdc2183e1f6f3efaca7ceba6df74d1586676bab052c3076c4dc05c0`
MD5	`3e7c6d335e12aa5db00f0b9d0c035674`
BLAKE2b-256	`f1cf68405c39b7e6280652d1059c48ccaffcbdf72dc46e44aa5d1c4c811a4b66`

Provenance

The following attestation bundles were made for pytest_skill_engineering-0.0.2-py3-none-any.whl:

Publisher: release.yml on sbroenne/pytest-skill-engineering

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_skill_engineering-0.0.2-py3-none-any.whl
- Subject digest: a253a7359fdc2183e1f6f3efaca7ceba6df74d1586676bab052c3076c4dc05c0
- Sigstore transparency entry: 1005189201
- Sigstore integration time: Feb 28, 2026
Source repository:
- Permalink: sbroenne/pytest-skill-engineering@8b4225ab86a1aad4e1a6eb13050ebb6090b68e58
- Branch / Tag: refs/heads/main
- Owner: https://github.com/sbroenne
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8b4225ab86a1aad4e1a6eb13050ebb6090b68e58
- Trigger Event: workflow_dispatch
