Automated end-to-end skill testing for LLM coding tools
skillprobe
Automated testing for LLM skills. Launches Claude Code or Cursor as subprocesses, runs scenarios in isolated workspaces, and reports what passed and what didn't.
Skills are just text injected into the LLM context, and LLMs are probabilistic, so skills will be ignored some percentage of the time no matter how carefully you word them. If you want hard enforcement, hooks are the right tool since they run deterministically every time. But hooks can only check things after the fact (linting, file restrictions, blocked commands). They can't guide the model toward better architectural decisions, teach it your team's domain conventions, set the tone of code review feedback, or help it reason through a multi-step workflow. Skills handle that side, and skillprobe measures how reliably they do it.
When you need this
If you write a few personal skills and tweak them by feel, you probably don't need this. That loop is fast and good enough for individual use.
Where it breaks down:
- Model updates break skills silently. Anthropic ships a new Sonnet, Cursor updates their agent, and a skill that worked last week now produces different output. Nobody notices because nobody retested.
- Teams sharing skills. When 20 engineers share a "code review" skill, one person's gut check isn't representative. You need coverage across scenarios to know whether the skill holds up.
- Publishing to marketplaces. At that point you're distributing software, not vibing with your own tool. "Ask the LLM to fix it" doesn't scale to reproducing someone else's problem.
- The endless tweak loop. After three rounds of edits you can't tell if the latest version is better or if you just moved the problem around. skillprobe gives you a measurable signal by running the same scenarios against both versions and comparing pass rates.
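The version comparison in the last bullet comes down to simple arithmetic. A minimal sketch, assuming you have collected one pass/fail outcome per scenario run for each skill version; `VersionResult` and `compare` are hypothetical names for illustration and not part of skillprobe:

```python
from dataclasses import dataclass


@dataclass
class VersionResult:
    """Pass/fail outcomes for one version of a skill (hypothetical container)."""
    label: str
    outcomes: list[bool]  # one entry per scenario run; True means it passed

    @property
    def pass_rate(self) -> float:
        # Guard against an empty run so we never divide by zero.
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0


def compare(old: VersionResult, new: VersionResult) -> float:
    """Return the change in pass rate when moving from old to new."""
    return new.pass_rate - old.pass_rate


# Made-up outcomes for illustration only.
v1 = VersionResult("skill-v1", [True, True, False, False])
v2 = VersionResult("skill-v2", [True, True, True, False])
print(f"{v1.label}: {v1.pass_rate:.0%}, {v2.label}: {v2.pass_rate:.0%}, "
      f"delta: {compare(v1, v2):+.0%}")
```

With only a handful of runs per version, a small delta can be noise, so it helps to run the same scenarios several times before trusting the comparison.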
Installation
pip install skillprobe
Or with uv:
uv tool install skillprobe
Or from source:
git clone https://github.com/Anyesh/skillprobe.git
cd skillprobe
uv sync
Quick start
The repo ships with example skills and test scenarios you can run immediately:
git clone https://github.com/Anyesh/skillprobe.git
cd skillprobe
uv sync
uv run skillprobe run examples/tests/test-clean-python.yaml
Running: examples/tests/test-clean-python.yaml
Harness: claude-code
Model: claude-haiku-4-5-20251001
Scenarios: 5
Parallel: 1
[PASS] no docstrings on simple functions (11.6s $0.0204)
[PASS] imports at top level (7.2s $0.0199)
[PASS] no obvious comments (7.2s $0.0187)
[FAIL] uses type hints (6.8s $0.0191)
    step 1: "Write a Python function that takes a list of integ"
      Pattern 'def \w+\(.*:.*\)' did not match
      Pattern '-> ' did not match
[PASS] skill does not block normal functionality (11.3s $0.0202)
4/5 passed (44.0s)
Total cost: $0.10
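The two patterns from the failing scenario can be tried by hand. A minimal sketch, assuming the regex assertion behaves like Python's `re.search` over the response text (the matching mode is an assumption here); the sample strings are made up for illustration:

```python
import re

# Patterns copied from the failing "uses type hints" scenario.
PARAM_HINT = r"def \w+\(.*:.*\)"
RETURN_HINT = r"-> "


def has_type_hints(source: str) -> bool:
    """True when both patterns are found somewhere in the text."""
    return all(re.search(p, source) for p in (PARAM_HINT, RETURN_HINT))


# Hypothetical model outputs, not real skillprobe transcripts.
hinted = "def total(values: list[int]) -> int:\n    return sum(values)"
unhinted = "def total(values):\n    return sum(values)"

print(has_type_hints(hinted))
print(has_type_hints(unhinted))
```

The first pattern needs a colon inside the parameter list and the second needs a return annotation, so a function written without hints fails both.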
Requires Claude Code or Cursor CLI installed and authenticated.
To generate tests for your own skill instead of writing YAML from scratch:
skillprobe init ./skills/my-skill --harness claude-code
skillprobe run tests/my-skill.yaml
Writing scenarios
Scenarios are YAML files describing what to test. Each scenario can have multiple conversational steps, a workspace fixture that gets copied fresh for every run, setup commands that prepare the workspace, and post-run assertions that check workspace state after everything finishes:
harness: claude-code
model: claude-haiku-4-5-20251001
timeout: 120
skill: ./skills/commit
scenarios:
  - name: "commit skill activates on request"
    workspace: fixtures/dirty-repo
    setup:
      - run: "echo 'change' >> file.txt && git add ."
    steps:
      - prompt: "commit my changes"
        assert:
          - type: contains
            value: "commit"
          - type: tool_called
            value: "Bash"
    after:
      - type: file_exists
        value: ".git/COMMIT_EDITMSG"
  - name: "does not activate for unrelated request"
    steps:
      - prompt: "explain what this project does"
        assert:
          - type: not_contains
            value: "commit"
Assertion types: contains, not_contains, regex, tool_called, file_exists, file_contains. Any assertion can be inverted with negate: true.
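To make the semantics of the text-based assertion types and `negate` concrete, here is a rough sketch. It is not skillprobe's implementation: `TextAssertion` and `check` are hypothetical names, case-sensitive matching is an assumption, and `tool_called`, `file_exists`, and `file_contains` are left out because they depend on harness and workspace state.

```python
import re
from dataclasses import dataclass


@dataclass
class TextAssertion:
    """Hypothetical mirror of one entry under `assert:` in the YAML."""
    type: str
    value: str
    negate: bool = False


def check(assertion: TextAssertion, response: str) -> bool:
    """Evaluate a text assertion against the model's response."""
    if assertion.type == "contains":
        ok = assertion.value in response  # assumed case-sensitive
    elif assertion.type == "not_contains":
        ok = assertion.value not in response
    elif assertion.type == "regex":
        ok = re.search(assertion.value, response) is not None
    else:
        raise ValueError(f"unsupported assertion type: {assertion.type}")
    # negate flips the result of whichever check ran.
    return not ok if assertion.negate else ok


reply = "I staged the files and created a commit."
print(check(TextAssertion("contains", "commit"), reply))
print(check(TextAssertion("contains", "commit", negate=True), reply))
```

Under this reading, `contains` with `negate: true` behaves the same as `not_contains`.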
Multi-run for measuring reliability
Since skills are probabilistic, a single pass/fail isn't always meaningful. Run the same prompt multiple times and set a pass rate threshold:
steps:
  - prompt: "Write a function with type hints"
    runs: 5
    min_pass_rate: 0.8
    assert:
      - type: regex
        value: "-> "
[PASS] uses type hints (21.0s $0.0416)
step 1: [ok] 4/5 passed (80%)
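The threshold logic, and how much it can swing with few runs, can be sketched as follows. This assumes a step passes when the observed pass rate is at least `min_pass_rate` (consistent with the 4/5 result above) and that runs are independent; the function names are hypothetical and not part of skillprobe.

```python
import math


def step_passes(outcomes: list[bool], min_pass_rate: float) -> bool:
    """True when the observed pass rate meets the threshold."""
    return sum(outcomes) / len(outcomes) >= min_pass_rate


def prob_meeting_threshold(p: float, runs: int, min_pass_rate: float) -> float:
    """Chance that a skill with true per-run reliability p clears the threshold.

    Uses the binomial distribution, assuming independent runs.
    """
    # Smallest number of passing runs that satisfies the threshold; the small
    # epsilon guards against floating-point error in the product.
    needed = math.ceil(min_pass_rate * runs - 1e-9)
    return sum(
        math.comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(needed, runs + 1)
    )


print(step_passes([True, True, True, True, False], 0.8))
print(round(prob_meeting_threshold(0.8, runs=5, min_pass_rate=0.8), 5))
print(round(prob_meeting_threshold(0.9, runs=5, min_pass_rate=0.8), 5))
```

Under these assumptions, a skill that truly works 80% of the time clears an 0.8 threshold on 5 runs only about 74% of the time, so either raise `runs` or set the threshold a little below the reliability you actually expect.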
Generating tests
init reads your SKILL.md, uses an LLM to generate test scenarios (positive activation, negative activation, behavioral correctness, edge cases), and writes a YAML file to review and tweak. Supports both Anthropic and OpenAI as the generation provider via --provider.
skillprobe init ./skills/commit --harness claude-code
Commands
skillprobe run <test.yaml> runs test scenarios against a real coding tool.
| Flag | Default | Description |
|---|---|---|
| --harness | from YAML | claude-code or cursor |
| --model | from YAML | Model to use for the tool under test |
| --parallel | 1 | Number of scenarios to run concurrently |
| --timeout | from YAML | Per-scenario timeout in seconds |
| --max-cost | none | Max USD spend (Claude Code only) |
skillprobe init <skill-dir> generates starter test YAML from a skill definition.
| Flag | Default | Description |
|---|---|---|
| --harness | claude-code | Target harness |
| --output | tests/<skill>.yaml | Output YAML path |
| --provider | anthropic | LLM provider for generation |
| --model | auto | Model for generation |
| --fixtures-dir | fixtures | Where to write fixture directories |
CI
skillprobe works in CI for catching regressions when models update or skills change. The CI runner needs the target tool's CLI installed and authenticated.
name: skill-tests
on:
  push:
    paths: ["skills/**", "tests/**"]
  schedule:
    - cron: "0 6 * * 1"
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g @anthropic-ai/claude-code
      - uses: astral-sh/setup-uv@v4
      - run: uv tool install skillprobe
      - run: skillprobe run tests/my-skill.yaml --harness claude-code
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Security
Scenario YAML files are executable content. Setup commands run with your full user permissions, and the harness launches the AI tool with --dangerously-skip-permissions (or --force for Cursor). Treat test YAML files like shell scripts. Workspaces are temporary copies deleted after each run, and file path assertions validate against workspace boundary escapes.
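The workspace boundary check can be pictured as follows. This is a sketch of the general technique (resolve the path, then confirm it still sits under the workspace root), not skillprobe's actual code; `resolve_inside` is a hypothetical helper and it relies on `Path.is_relative_to`, which needs Python 3.9 or later.

```python
from pathlib import Path


def resolve_inside(workspace: Path, relative: str) -> Path:
    """Resolve a path from an assertion and reject anything outside the workspace."""
    root = workspace.resolve()
    # Resolving collapses ".." segments and follows symlinks; joining with an
    # absolute path replaces the root entirely, which the check below catches.
    candidate = (root / relative).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {relative}")
    return candidate


workspace = Path(".")
print(resolve_inside(workspace, "src/app.py"))
try:
    resolve_inside(workspace, "../outside.txt")
except ValueError as exc:
    print(f"rejected: {exc}")
```

A check like this only protects file path assertions. Setup commands still run as ordinary shell commands with your permissions, which is why the YAML itself has to be trusted.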
Why not promptfoo
promptfoo tests prompts in isolation via direct API calls, outside the tool that will actually use them. skillprobe runs the real tool as a subprocess in a real workspace, testing the full stack: skill loading, tool use, file system interactions, multi-turn conversations. Works with subscriptions too since the tool under test handles its own auth.
File details
Details for the file skillprobe-0.2.0.tar.gz.
File metadata
- Download URL: skillprobe-0.2.0.tar.gz
- Upload date:
- Size: 71.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d5749225b74f100af7bc325aaee5eba547f45337702c6eb188050891e9934180 |
| MD5 | 2d0574a860f4ba8efc2fe8dbff9e3497 |
| BLAKE2b-256 | e2b61caf19371f0ad5a3b9b203f079e102c951ef1ddcc996e654764266f301a4 |
Provenance
The following attestation bundles were made for skillprobe-0.2.0.tar.gz:
Publisher: publish.yml on Anyesh/skillprobe
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skillprobe-0.2.0.tar.gz
- Subject digest: d5749225b74f100af7bc325aaee5eba547f45337702c6eb188050891e9934180
- Sigstore transparency entry: 1216316609
- Sigstore integration time:
- Permalink: Anyesh/skillprobe@f5f3e34512155dd477af53092ec1f54d50c9fef2
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Anyesh
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f5f3e34512155dd477af53092ec1f54d50c9fef2
- Trigger Event: push
File details
Details for the file skillprobe-0.2.0-py3-none-any.whl.
File metadata
- Download URL: skillprobe-0.2.0-py3-none-any.whl
- Upload date:
- Size: 19.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e3c67e01ab608ad51d1f9a2430f85ba11e99d56c6e461929e6e25d2dce4fea7c |
| MD5 | 19248d6a9c11cd6e859b8fcec7b90a39 |
| BLAKE2b-256 | 2eabb8b31388daad9a49de16c18904f511a65ffe9b7e49ebc2a69d3175caf932 |
Provenance
The following attestation bundles were made for skillprobe-0.2.0-py3-none-any.whl:
Publisher: publish.yml on Anyesh/skillprobe
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skillprobe-0.2.0-py3-none-any.whl
- Subject digest: e3c67e01ab608ad51d1f9a2430f85ba11e99d56c6e461929e6e25d2dce4fea7c
- Sigstore transparency entry: 1216316683
- Sigstore integration time:
- Permalink: Anyesh/skillprobe@f5f3e34512155dd477af53092ec1f54d50c9fef2
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/Anyesh
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f5f3e34512155dd477af53092ec1f54d50c9fef2
- Trigger Event: push