Automated end-to-end skill testing for LLM coding tools

skillprobe

Automated testing for LLM skills. Launches Claude Code or Cursor as subprocesses, runs scenarios in isolated workspaces, and reports what passed and what didn't.

Skills are just text injected into the LLM context, and LLMs are probabilistic, so a skill will get ignored some percentage of the time no matter how carefully you word it. If you want hard enforcement, hooks are the right tool since they run deterministically every time. But hooks can only check things after the fact (linting, file restrictions, blocked commands). They can't guide the model toward better architectural decisions, teach it your team's domain conventions, set the tone of code review feedback, or help it reason through a multi-step workflow. Skills handle that side, and skillprobe measures how reliably they do it.

When you need this

If you write a few personal skills and tweak them by feel, you probably don't need this. That loop is fast and good enough for individual use.

Where it breaks down:

  • Model updates break skills silently. Anthropic ships a new Sonnet, Cursor updates their agent, and a skill that worked last week now produces different output. Nobody notices because nobody retested.
  • Teams sharing skills. When 20 engineers share a "code review" skill, one person's gut check isn't representative. You need coverage across scenarios to know whether the skill holds up.
  • Publishing to marketplaces. At that point you're distributing software, not vibing with your own tool. "Ask the LLM to fix it" doesn't scale to reproducing someone else's problem.
  • The endless tweak loop. After three rounds of edits you can't tell if the latest version is better or if you just moved the problem around. skillprobe gives you a comparable signal by running the same scenarios against both versions and comparing pass rates.
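
The version comparison in the last bullet comes down to simple arithmetic on pass counts. The sketch below is not part of skillprobe: `RunSummary` and `compare` are hypothetical names, and the counts are made up. It only shows how two sets of results from the same scenarios could be compared side by side.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Pass/fail counts for one skill version (hypothetical structure)."""

    label: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty run so the division cannot fail.
        return self.passed / self.total if self.total else 0.0


def compare(old: RunSummary, new: RunSummary) -> str:
    """Describe the change in pass rate between two skill versions."""
    delta = new.pass_rate - old.pass_rate
    if delta > 0:
        verdict = "improved"
    elif delta < 0:
        verdict = "regressed"
    else:
        verdict = "unchanged"
    return (
        f"{old.label}: {old.pass_rate:.0%} -> "
        f"{new.label}: {new.pass_rate:.0%} ({verdict}, {delta:+.0%})"
    )


# Made-up numbers for illustration only.
before = RunSummary("v1", passed=14, total=20)
after = RunSummary("v2", passed=17, total=20)
print(compare(before, after))
```

With small scenario counts, a difference of one or two passes can be noise, so it is worth running enough scenarios (or repeated runs) before trusting the direction of the change.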

Installation

pip install skillprobe

Or with uv:

uv tool install skillprobe

Or from source:

git clone https://github.com/Anyesh/skillprobe.git
cd skillprobe
uv sync

Quick start

The repo ships with example skills and test scenarios you can run immediately:

git clone https://github.com/Anyesh/skillprobe.git
cd skillprobe
uv sync
uv run skillprobe run examples/tests/test-clean-python.yaml

Running: examples/tests/test-clean-python.yaml
  Harness: claude-code
  Model: claude-haiku-4-5-20251001
  Scenarios: 5
  Parallel: 1

  [PASS] no docstrings on simple functions (11.6s $0.0204)
  [PASS] imports at top level (7.2s $0.0199)
  [PASS] no obvious comments (7.2s $0.0187)
  [FAIL] uses type hints (6.8s $0.0191)
         step 1: "Write a Python function that takes a list of integ"
           Pattern 'def \w+\(.*:.*\)' did not match
           Pattern '-> ' did not match
  [PASS] skill does not block normal functionality (11.3s $0.0202)

  4/5 passed (44.0s)
  Total cost: $0.10
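
The two patterns reported for the failing scenario can be tried locally to see what they accept. This is a minimal sketch using Python's `re` module. It assumes search-style matching anywhere in the response, which may not be exactly how skillprobe applies them, and `has_type_hints` is a hypothetical helper.

```python
import re

# Patterns copied from the failing scenario output above.
PARAM_HINT = re.compile(r"def \w+\(.*:.*\)")
RETURN_HINT = re.compile(r"-> ")


def has_type_hints(source: str) -> bool:
    """True if both patterns are found somewhere in the text."""
    return bool(PARAM_HINT.search(source)) and bool(RETURN_HINT.search(source))


hinted = "def total(values: list[int]) -> int:\n    return sum(values)\n"
unhinted = "def total(values):\n    return sum(values)\n"

print(has_type_hints(hinted))    # True
print(has_type_hints(unhinted))  # False
```

Note that `.` does not cross newlines by default, so under these assumptions a signature split across several lines would not match the first pattern even when it is fully annotated.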

Requires the Claude Code or Cursor CLI to be installed and authenticated.

To generate tests for your own skill instead of writing YAML from scratch:

skillprobe init ./skills/my-skill --harness claude-code
skillprobe run tests/my-skill.yaml

Writing scenarios

Scenarios are YAML files describing what to test. Each scenario can have multiple conversational steps, a workspace fixture that gets copied fresh for every run, setup commands that prepare the workspace, and post-run assertions that check workspace state after everything finishes:

harness: claude-code
model: claude-haiku-4-5-20251001
timeout: 120
skill: ./skills/commit

scenarios:
  - name: "commit skill activates on request"
    workspace: fixtures/dirty-repo
    setup:
      - run: "echo 'change' >> file.txt && git add ."
    steps:
      - prompt: "commit my changes"
        assert:
          - type: contains
            value: "commit"
          - type: tool_called
            value: "Bash"
    after:
      - type: file_exists
        value: ".git/COMMIT_EDITMSG"

  - name: "does not activate for unrelated request"
    steps:
      - prompt: "explain what this project does"
        assert:
          - type: not_contains
            value: "commit"
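
To make the "copied fresh for every run" idea concrete, here is a rough Python sketch of the pattern: copy a fixture into a temporary directory, run a command there, and discard the copy. It is not skillprobe's code. `run_in_fresh_workspace` is a hypothetical helper, and a small Python one-liner stands in for a real setup command.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_fresh_workspace(fixture: Path, command: list[str], timeout: float) -> str:
    """Copy `fixture` to a temp dir, run `command` there, then clean up."""
    with tempfile.TemporaryDirectory(prefix="workspace-") as tmp:
        workspace = Path(tmp) / fixture.name
        shutil.copytree(fixture, workspace)
        result = subprocess.run(
            command,
            cwd=workspace,        # the command only sees the copy
            capture_output=True,
            text=True,
            timeout=timeout,      # fail instead of hanging forever
            check=True,
        )
        return result.stdout


# Build a throwaway fixture so the example is self-contained.
with tempfile.TemporaryDirectory() as src:
    fixture = Path(src) / "dirty-repo"
    fixture.mkdir()
    (fixture / "file.txt").write_text("original\n")

    # Stand-in for a setup step: append a line to file.txt and print the file.
    append_and_print = (
        "from pathlib import Path; p = Path('file.txt'); "
        "p.write_text(p.read_text() + 'change\\n'); print(p.read_text(), end='')"
    )
    output = run_in_fresh_workspace(
        fixture, [sys.executable, "-c", append_and_print], timeout=30
    )
    # The fixture itself is untouched, so the next run starts clean.
    fixture_after = (fixture / "file.txt").read_text()

print(output, end="")
print(fixture_after, end="")
```

Because every run works on its own copy, a scenario that modifies or deletes files cannot affect the next scenario that uses the same fixture.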

Assertion types: contains, not_contains, regex, tool_called, file_exists, file_contains. Any assertion can be inverted with negate: true.
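
As a mental model for the text-based assertion types, the sketch below evaluates `contains`, `not_contains`, and `regex` against a response string and applies `negate` last. It is an illustration under assumptions (case-sensitive matching, regex applied as a search), not skillprobe's implementation. `Assertion` and `check` are hypothetical names, and `tool_called`, `file_exists`, and `file_contains` are left out because they need harness and workspace data.

```python
import re
from dataclasses import dataclass


@dataclass
class Assertion:
    """One assertion entry, mirroring the type/value/negate keys in the YAML."""

    type: str
    value: str
    negate: bool = False


def check(assertion: Assertion, response: str) -> bool:
    """Evaluate one text assertion against a response string."""
    if assertion.type == "contains":
        result = assertion.value in response
    elif assertion.type == "not_contains":
        result = assertion.value not in response
    elif assertion.type == "regex":
        result = re.search(assertion.value, response) is not None
    else:
        raise ValueError(f"unsupported assertion type: {assertion.type}")
    # negate flips whatever the base check decided.
    return not result if assertion.negate else result


response = "I staged the files and created a commit."
print(check(Assertion("contains", "commit"), response))
print(check(Assertion("not_contains", "commit"), response))
print(check(Assertion("regex", r"-> ", negate=True), response))
```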

Multi-run for measuring reliability

Since skills are probabilistic, a single pass/fail isn't always meaningful. Run the same prompt multiple times and set a pass rate threshold:

steps:
  - prompt: "Write a function with type hints"
    runs: 5
    min_pass_rate: 0.8
    assert:
      - type: regex
        value: "-> "

  [PASS] uses type hints (21.0s $0.0416)
         step 1: [ok] 4/5 passed (80%)
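
The threshold check itself is a ratio comparison, and a short calculation shows why the choice of `runs` and `min_pass_rate` matters. The sketch below assumes each run passes independently with the same probability, which is a simplification. `step_passes` and `prob_step_passes` are hypothetical helpers, not skillprobe functions.

```python
from math import comb


def step_passes(outcomes: list[bool], min_pass_rate: float) -> bool:
    """A step passes when the fraction of passing runs meets the threshold."""
    return sum(outcomes) / len(outcomes) >= min_pass_rate


def prob_step_passes(p: float, runs: int, min_pass_rate: float) -> float:
    """Chance the step passes if each run independently passes with probability p."""
    return sum(
        comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(runs + 1)
        if k / runs >= min_pass_rate  # same rule as step_passes
    )


# 4 of 5 runs passed, as in the output above: 0.8 >= 0.8, so the step passes.
print(step_passes([True, True, True, True, False], min_pass_rate=0.8))

# How often would that step pass for skills of different true reliability?
for p in (0.8, 0.9):
    print(p, round(prob_step_passes(p, runs=5, min_pass_rate=0.8), 5))
```

Under these assumptions, a skill that is followed 80% of the time clears a 5-run, 0.8 threshold only about 74% of the time, and one followed 90% of the time clears it about 92% of the time. Setting the threshold exactly at the expected reliability therefore produces a flaky step; more runs or a slightly lower threshold gives a steadier signal.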

Generating tests

init reads your SKILL.md, uses an LLM to generate test scenarios (positive activation, negative activation, behavioral correctness, edge cases), and writes a YAML file for you to review and tweak. It supports both Anthropic and OpenAI as the generation provider via --provider.

skillprobe init ./skills/commit --harness claude-code

Commands

skillprobe run <test.yaml> runs test scenarios against a real coding tool.

Flag        Default     Description
--harness   from YAML   claude-code or cursor
--model     from YAML   Model to use for the tool under test
--parallel  1           Number of scenarios to run concurrently
--timeout   from YAML   Per-scenario timeout in seconds
--max-cost  none        Max USD spend (Claude Code only)

skillprobe init <skill-dir> generates starter test YAML from a skill definition.

Flag            Default              Description
--harness       claude-code          Target harness
--output        tests/<skill>.yaml   Output YAML path
--provider      anthropic            LLM provider for generation
--model         auto                 Model for generation
--fixtures-dir  fixtures             Where to write fixture directories

CI

skillprobe works in CI for catching regressions when models update or skills change. The CI runner needs the target tool's CLI installed and authenticated.

name: skill-tests
on:
  push:
    paths: ["skills/**", "tests/**"]
  schedule:
    - cron: "0 6 * * 1"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g @anthropic-ai/claude-code
      - uses: astral-sh/setup-uv@v4
      - run: uv tool install skillprobe
      - run: skillprobe run tests/my-skill.yaml --harness claude-code
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Security

Scenario YAML files are executable content. Setup commands run with your full user permissions, and the harness launches the AI tool with --dangerously-skip-permissions (or --force for Cursor). Treat test YAML files like shell scripts. Workspaces are temporary copies deleted after each run, and paths used in file assertions are validated so they cannot escape the workspace boundary.
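
A workspace boundary check of this kind is commonly done by resolving the path and confirming it still sits under the workspace root. The sketch below shows that general technique with `pathlib` (Python 3.9+ for `is_relative_to`). It is an assumption about the approach, not a copy of skillprobe's validation, and `resolve_inside` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def resolve_inside(workspace: Path, relative: str) -> Path:
    """Resolve `relative` against `workspace`, rejecting paths that escape it."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (root / relative).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {relative}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)

    # A path from the scenario example stays inside the workspace.
    print(resolve_inside(workspace, ".git/COMMIT_EDITMSG").name)

    # Parent-directory traversal and absolute paths are both rejected.
    for bad in ("../outside.txt", str(workspace.resolve().parent / "other")):
        try:
            resolve_inside(workspace, bad)
        except ValueError as exc:
            print("rejected:", exc)
```

A check like this only protects file assertions. Setup commands and the tool under test still run with full user permissions, so the advice to treat scenario files like shell scripts stands.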

Why not promptfoo

promptfoo tests prompts in isolation via direct API calls, outside the tool that will actually use them. skillprobe runs the real tool as a subprocess in a real workspace, testing the full stack: skill loading, tool use, file system interactions, and multi-turn conversations. It also works with subscriptions, since the tool under test handles its own auth.
