Automated end-to-end skill testing for LLM coding tools

Maintainers

anyesh

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.12
- Python :: 3.13
Topic
- Software Development :: Testing

Project description

skillprobe

Automated testing for LLM skills. Launches Claude Code or Cursor as subprocesses, runs scenarios in isolated workspaces, and reports what passed and what didn't.

Skills are just text injected into the LLM context, and LLMs are probabilistic, so a skill will be ignored some percentage of the time no matter how carefully you word it. If you want hard enforcement, hooks are the right tool, since they run deterministically every time. But hooks can only check things after the fact (linting, file restrictions, blocked commands). They can't guide the model toward better architectural decisions, teach it your team's domain conventions, set the tone of code review feedback, or help it reason through a multi-step workflow. Skills handle that side, and skillprobe measures how reliably they do it.

When you need this

If you write a few personal skills and tweak them by feel, you probably don't need this. That loop is fast and good enough for individual use.

Where it breaks down:

Model updates break skills silently. Anthropic ships a new Sonnet, Cursor updates their agent, and a skill that worked last week now produces different output. Nobody notices because nobody retested.
Teams sharing skills. When 20 engineers share a "code review" skill, one person's gut check isn't representative. You need coverage across scenarios to know whether the skill holds up.
Publishing to marketplaces. At that point you're distributing software, not vibing with your own tool. "Ask the LLM to fix it" doesn't scale to reproducing someone else's problem.
The endless tweak loop. After three rounds of edits you can't tell whether the latest version is better or you just moved the problem around. skillprobe gives you a concrete signal by running the same scenarios against both versions and comparing pass rates.
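
The comparison in the last point can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not skillprobe's own code: the function names and the result lists are hypothetical, and each list stands for the pass/fail outcomes of the same scenarios run against one version of a skill.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


def compare_versions(old: list[bool], new: list[bool]) -> str:
    """Summarize whether the new skill version passes more often."""
    old_rate, new_rate = pass_rate(old), pass_rate(new)
    if new_rate > old_rate:
        verdict = "improved"
    elif new_rate < old_rate:
        verdict = "regressed"
    else:
        verdict = "unchanged"
    return f"{old_rate:.0%} -> {new_rate:.0%} ({verdict})"


# Hypothetical outcomes for the same five scenarios on two skill versions.
old_results = [True, False, True, False, True]
new_results = [True, True, True, False, True]
print(compare_versions(old_results, new_results))  # 60% -> 80% (improved)
```

With only a handful of scenarios a one-scenario swing is a large change in rate, so the comparison is more trustworthy when each scenario is also run several times.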

Installation

pip install skillprobe

Or with uv:

uv tool install skillprobe

Or from source:

git clone https://github.com/Anyesh/skillprobe.git
cd skillprobe
uv sync

Quick start

The repo ships with example skills and test scenarios you can run immediately:

git clone https://github.com/Anyesh/skillprobe.git
cd skillprobe
uv sync
uv run skillprobe run examples/tests/test-clean-python.yaml

Running: examples/tests/test-clean-python.yaml
  Harness: claude-code
  Model: claude-haiku-4-5-20251001
  Scenarios: 5
  Parallel: 1

  [PASS] no docstrings on simple functions (11.6s $0.0204)
  [PASS] imports at top level (7.2s $0.0199)
  [PASS] no obvious comments (7.2s $0.0187)
  [FAIL] uses type hints (6.8s $0.0191)
         step 1: "Write a Python function that takes a list of integ"
           Pattern 'def \w+\(.*:.*\)' did not match
           Pattern '-> ' did not match
  [PASS] skill does not block normal functionality (11.3s $0.0202)

  4/5 passed (44.0s)
  Total cost: $0.10
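
To see why the "uses type hints" scenario failed, the two patterns from the report can be tried locally with Python's re module. This is a minimal sketch that assumes the patterns are applied as an unanchored search over the response text (how skillprobe applies them internally is not stated here); the two sample functions are hypothetical model outputs.

```python
import re

# The two patterns reported in the failing scenario above.
PATTERNS = [r"def \w+\(.*:.*\)", r"-> "]


def failed_patterns(text: str) -> list[str]:
    """Return the patterns that do not match anywhere in the text."""
    return [p for p in PATTERNS if re.search(p, text) is None]


# Hypothetical responses: one without type hints, one with them.
untyped = "def total(numbers):\n    return sum(numbers)\n"
typed = "def total(numbers: list[int]) -> int:\n    return sum(numbers)\n"

print(failed_patterns(untyped))  # both patterns fail
print(failed_patterns(typed))    # []
```

The first pattern needs a colon inside the parameter list, and the second needs a return annotation, so a function with neither fails both checks.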

Requires the Claude Code or Cursor CLI to be installed and authenticated.

To generate tests for your own skill instead of writing YAML from scratch:

skillprobe init ./skills/my-skill --harness claude-code
skillprobe run tests/my-skill.yaml

Writing scenarios

Scenarios are YAML files describing what to test. Each scenario can have multiple conversational steps, a workspace fixture that gets copied fresh for every run, setup commands that prepare the workspace, and post-run assertions that check workspace state after everything finishes:

harness: claude-code
model: claude-haiku-4-5-20251001
timeout: 120
skill: ./skills/commit

scenarios:
  - name: "commit skill activates on request"
    workspace: fixtures/dirty-repo
    setup:
      - run: "echo 'change' >> file.txt && git add ."
    steps:
      - prompt: "commit my changes"
        assert:
          - type: contains
            value: "commit"
          - type: tool_called
            value: "Bash"
    after:
      - type: file_exists
        value: ".git/COMMIT_EDITMSG"

  - name: "does not activate for unrelated request"
    steps:
      - prompt: "explain what this project does"
        assert:
          - type: not_contains
            value: "commit"

Assertion types: contains, not_contains, regex, tool_called, file_exists, file_contains. Any assertion can be inverted with negate: true.
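
As a rough mental model, the text-based assertion types and the negate flag could be evaluated as in the sketch below. The check function is hypothetical and is not skillprobe's implementation. It assumes case-sensitive matching and an unanchored regex search, and it leaves out tool_called, file_exists, and file_contains because they depend on the harness and the workspace.

```python
import re


def check(assertion: dict, response: str) -> bool:
    """Evaluate one text assertion against a response (illustrative only)."""
    kind = assertion["type"]
    value = assertion["value"]
    if kind == "contains":
        ok = value in response
    elif kind == "not_contains":
        ok = value not in response
    elif kind == "regex":
        ok = re.search(value, response) is not None
    else:
        raise ValueError(f"assertion type not covered by this sketch: {kind}")
    # negate: true flips the outcome of any assertion.
    return not ok if assertion.get("negate", False) else ok


response = "I staged the files and created a commit."
print(check({"type": "contains", "value": "commit"}, response))
print(check({"type": "contains", "value": "commit", "negate": True}, response))
```

Under these assumptions, contains with negate: true behaves the same as not_contains.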

Multi-run for measuring reliability

Since skills are probabilistic, a single pass/fail isn't always meaningful. Run the same prompt multiple times and set a pass rate threshold:

steps:
  - prompt: "Write a function with type hints"
    runs: 5
    min_pass_rate: 0.8
    assert:
      - type: regex
        value: "-> "

  [PASS] uses type hints (21.0s $0.0416)
         step 1: [ok] 4/5 passed (80%)
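
The threshold arithmetic can be sketched as follows. This is an illustration rather than skillprobe's code: it assumes a step passes when passed / runs is at least min_pass_rate (consistent with the 4/5 result above) and that runs are independent, and both function names are hypothetical. The second function estimates how often a step would clear the threshold for a given true success rate p.

```python
from math import comb


def step_passes(passed: int, runs: int, min_pass_rate: float) -> bool:
    """A step passes when the observed pass rate reaches the threshold."""
    return passed / runs >= min_pass_rate


def chance_of_meeting_threshold(p: float, runs: int, min_pass_rate: float) -> float:
    """Probability that enough of `runs` independent runs pass,
    if each run passes with probability p (binomial model)."""
    # Smallest number of passing runs that satisfies the threshold.
    needed = min(k for k in range(runs + 1) if step_passes(k, runs, min_pass_rate))
    return sum(
        comb(runs, k) * p**k * (1 - p) ** (runs - k)
        for k in range(needed, runs + 1)
    )


print(step_passes(4, 5, 0.8))  # True
print(round(chance_of_meeting_threshold(0.8, 5, 0.8), 3))
print(round(chance_of_meeting_threshold(0.9, 5, 0.8), 3))
```

Under this model, a skill that truly works 80% of the time clears a 0.8 threshold over 5 runs only about 74% of the time, so a threshold set right at the expected rate will produce regular false failures. More runs or a slightly lower threshold reduces that noise.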

Generating tests

skillprobe init reads your SKILL.md, uses an LLM to generate test scenarios (positive activation, negative activation, behavioral correctness, edge cases), and writes a YAML file for you to review and tweak. It supports both Anthropic and OpenAI as the generation provider via --provider.

skillprobe init ./skills/commit --harness claude-code

Commands

skillprobe run <test.yaml> runs test scenarios against a real coding tool.

Flag	Default	Description
`--harness`	from YAML	`claude-code` or `cursor`
`--model`	from YAML	Model to use for the tool under test
`--parallel`	1	Number of scenarios to run concurrently
`--timeout`	from YAML	Per-scenario timeout in seconds
`--max-cost`	none	Max USD spend (Claude Code only)

skillprobe init <skill-dir> generates starter test YAML from a skill definition.

Flag	Default	Description
`--harness`	`claude-code`	Target harness
`--output`	`tests/<skill>.yaml`	Output YAML path
`--provider`	`anthropic`	LLM provider for generation
`--model`	auto	Model for generation
`--fixtures-dir`	`fixtures`	Where to write fixture directories

CI

skillprobe works in CI for catching regressions when models update or skills change. The CI runner needs the target tool's CLI installed and authenticated.

name: skill-tests
on:
  push:
    paths: ["skills/**", "tests/**"]
  schedule:
    - cron: "0 6 * * 1"

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g @anthropic-ai/claude-code
      - uses: astral-sh/setup-uv@v4
      - run: uv tool install skillprobe
      - run: skillprobe run tests/my-skill.yaml --harness claude-code
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Security

Scenario YAML files are executable content. Setup commands run with your full user permissions, and the harness launches the AI tool with --dangerously-skip-permissions (or --force for Cursor). Treat test YAML files like shell scripts. Workspaces are temporary copies deleted after each run, and file path assertions validate against workspace boundary escapes.
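
A workspace boundary check of the kind mentioned above is commonly written with pathlib. The sketch below is one way to do it, not a copy of skillprobe's validation; resolve_inside is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def resolve_inside(workspace: Path, relative: str) -> Path:
    """Resolve a path from a scenario file and refuse anything that
    ends up outside the workspace directory."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks, so both
    # "../x" and a symlink pointing outside the workspace are caught.
    candidate = (root / relative).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {relative}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    print(resolve_inside(workspace, ".git/COMMIT_EDITMSG").name)
    try:
        resolve_inside(workspace, "../outside.txt")
    except ValueError as err:
        print(err)  # path escapes workspace: ../outside.txt
```

A check like this only protects file path assertions. Setup commands still run as ordinary shell commands with your user permissions, which is why the YAML itself has to be trusted.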

Why not promptfoo

promptfoo tests prompts in isolation via direct API calls, outside the tool that will actually use them. skillprobe runs the real tool as a subprocess in a real workspace, testing the full stack: skill loading, tool use, file system interactions, and multi-turn conversations. It also works with subscription plans, since the tool under test handles its own auth.

Release history

- 0.5.0: Apr 11, 2026
- 0.4.0: Apr 3, 2026
- 0.3.0: Apr 3, 2026
- 0.2.0: Apr 2, 2026 (this version)
- 0.1.1: Apr 1, 2026
- 0.1.0: Apr 1, 2026
