Automated end-to-end skill testing for LLM coding tools

These details have been verified by PyPI

Maintainers

anyesh

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.12
- Python :: 3.13
Topic
- Software Development :: Testing

Project description

skillprobe

AI coding tools like Claude Code and Cursor inject instructions into the LLM context behind the scenes, whether they call them skills, rules, or system prompts. Without dedicated tooling there's no good way to test whether those instructions are actually being followed. You write a skill that says "never add docstrings" and the model still adds them a good part of the time.

skillprobe automates the testing. It launches Claude Code or Cursor as subprocesses, runs your test scenarios in real workspaces, checks the output against assertions, and reports what passed and what didn't, all from a single command with no manual prompting required.

Who this is for (and who it isn't)

If you write a few skills for your own use and tweak them when something feels off, you probably don't need this. Most people create skills by asking an LLM to write one, try it a couple times, and if the output looks wrong they ask the LLM to adjust it. That loop is fast, cheap, and good enough for personal use.

Where that loop breaks down:

Model updates break skills silently. Anthropic ships a new Sonnet, Cursor updates their agent behavior, and a skill that worked last week now produces subtly different output. Nobody notices because nobody retested, and skillprobe exists to catch exactly that kind of silent regression.

Teams sharing skills across engineers. When 20 developers share a "code review" skill, one person's gut check isn't representative because everyone is hitting it with different prompts, different codebases, and different expectations. You need actual coverage across scenarios to know whether the skill holds up.

Publishing to marketplaces. Both Claude Code and Cursor now have plugin marketplaces where skill authors ship to thousands of users. At that point you're distributing software, not vibing with your own tool. User reports from strangers don't come with context, and "ask the LLM to fix it" doesn't scale to reproducing someone else's problem.

Breaking the endless tweak loop. You named a skill "clean-python" and told it to never add docstrings, but after three rounds of edits you're not sure if the latest version is actually better or if you just moved the problem around. skillprobe gives you a measurable "this version is better than the last one" signal by running the same scenarios against both and comparing pass rates.

If none of those situations apply to you, a simpler workflow (write skill, try it, adjust) is probably the right call. skillprobe is for when you need more confidence than vibing can provide.
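
The version comparison described above comes down to simple arithmetic on pass counts. A minimal sketch, assuming you have recorded the passed and total counts for each skill version yourself; `RunSummary` and `compare` are hypothetical names for illustration, not part of skillprobe:

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical container for one skill version's results."""

    version: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Guard against an empty run so we never divide by zero.
        return self.passed / self.total if self.total else 0.0


def compare(old: RunSummary, new: RunSummary) -> str:
    """Describe whether the new version improved on the old one."""
    delta = new.pass_rate - old.pass_rate
    if delta > 0:
        verdict = "improved"
    elif delta < 0:
        verdict = "regressed"
    else:
        verdict = "unchanged"
    return (
        f"{old.version} -> {new.version}: {verdict} "
        f"({old.pass_rate:.0%} -> {new.pass_rate:.0%})"
    )


if __name__ == "__main__":
    print(compare(RunSummary("v1", 6, 10), RunSummary("v2", 8, 10)))
```

Because model output is nondeterministic, a comparison like this is more trustworthy when each version is run over the same scenarios several times.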

Installation

pip install skillprobe

Or with uv:

uv tool install skillprobe

Or from source:

git clone https://github.com/Anyesh/skillprobe.git
cd skillprobe
uv sync

Quick start

Generate test scenarios from an existing skill, then run them:

skillprobe init ./skills/my-skill --harness claude-code
skillprobe run tests/my-skill.yaml

Running: tests/my-skill.yaml
  Harness: claude-code
  Model: claude-haiku-4-5-20251001
  Scenarios: 3
  Parallel: 1

  [PASS] commit skill activates on request (9.1s)
  [PASS] multi-turn refinement (12.3s)
  [FAIL] negative activation -- 'commit' found in response
         step 1: "explain what this project does"
           'commit' found in response

  2/3 passed (27.8s)
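
If you want to post-process a run, the report above can be scraped with a few lines of Python. This is a rough sketch that assumes the output keeps the `[PASS]`/`[FAIL]` layout shown in the sample; `summarize` is a hypothetical helper, and the text format is not a documented interface:

```python
import re

# Matches result lines such as "[PASS] multi-turn refinement (12.3s)".
RESULT_LINE = re.compile(r"^\s*\[(PASS|FAIL)\]\s+(.*)$")


def summarize(output: str) -> dict[str, list[str]]:
    """Group scenario names by status, assuming the layout shown above."""
    results: dict[str, list[str]] = {"PASS": [], "FAIL": []}
    for line in output.splitlines():
        match = RESULT_LINE.match(line)
        if match:
            status, rest = match.groups()
            # Drop a " -- reason" suffix, then a trailing "(9.1s)" timing.
            name = re.sub(r"\s+\(\d+(\.\d+)?s\)$", "", rest.split(" -- ")[0])
            results[status].append(name.strip())
    return results


SAMPLE = """\
  [PASS] commit skill activates on request (9.1s)
  [PASS] multi-turn refinement (12.3s)
  [FAIL] negative activation -- 'commit' found in response
  2/3 passed (27.8s)
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```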

Writing scenarios

Scenarios are YAML files describing what to test. Each scenario can have multiple conversational steps, a workspace fixture that gets copied fresh for every run, setup commands, and post-run assertions that check workspace state after everything finishes:

harness: claude-code
model: claude-haiku-4-5-20251001
timeout: 120
skill: ./skills/commit

scenarios:
  - name: "commit skill activates on request"
    workspace: fixtures/dirty-repo
    setup:
      - run: "echo 'change' >> file.txt && git add ."
    steps:
      - prompt: "commit my changes"
        assert:
          - type: contains
            value: "commit"
          - type: tool_called
            value: "Bash"
    after:
      - type: file_exists
        value: ".git/COMMIT_EDITMSG"

  - name: "does not activate for unrelated request"
    steps:
      - prompt: "explain what this project does"
        assert:
          - type: not_contains
            value: "commit"
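
To make the workspace lifecycle concrete, here is a minimal sketch of what "copied fresh for every run" can look like: copy the fixture to a temporary directory, run the setup commands there, then check workspace state afterwards. `fresh_workspace` and `file_exists` are hypothetical names, and this is an illustration under those assumptions rather than skillprobe's actual implementation:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def fresh_workspace(fixture: Path, setup: list[str]) -> Path:
    """Copy a fixture into a new temp directory and run setup commands there."""
    workspace = Path(tempfile.mkdtemp(prefix="scenario-")) / fixture.name
    shutil.copytree(fixture, workspace)
    for command in setup:
        # shell=True mirrors the one-line shell strings used under `setup: - run:`.
        subprocess.run(command, shell=True, cwd=workspace, check=True)
    return workspace


def file_exists(workspace: Path, relative: str) -> bool:
    """Rough equivalent of an `after` check of type file_exists."""
    return (workspace / relative).exists()
```

Because every scenario gets its own copy, a setup command or a tool action in one run cannot leak into the fixture or into another scenario.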

Supported assertion types: contains, not_contains, regex, tool_called, file_exists, and file_contains. Any assertion can be inverted with negate: true.
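
As a mental model for how these assertion types and `negate` fit together, the sketch below evaluates one assertion dict against a step's observed results. It is an assumption-laden illustration, not skillprobe's code: `check` is a hypothetical function, matching is assumed to be case-sensitive, and `file_contains` is left out because its value format isn't described here.

```python
import re
from pathlib import Path


def check(assertion: dict, response: str, tools: list[str], workspace: Path) -> bool:
    """Evaluate one assertion dict against a step's observed results."""
    kind, value = assertion["type"], assertion["value"]
    if kind == "contains":
        result = value in response
    elif kind == "not_contains":
        result = value not in response
    elif kind == "regex":
        result = re.search(value, response) is not None
    elif kind == "tool_called":
        result = value in tools
    elif kind == "file_exists":
        result = (workspace / value).exists()
    else:
        raise ValueError(f"unsupported assertion type: {kind}")
    # `negate: true` flips the outcome of whichever check ran.
    return not result if assertion.get("negate", False) else result
```

Under this model, `not_contains` behaves the same as `contains` with `negate: true`.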

Generating tests

You don't have to write scenario YAML from scratch. Point init at a skill directory and it reads the SKILL.md, uses an LLM to figure out what should be tested (positive activation, negative activation, behavioral correctness, edge cases), and writes a starter YAML file you can review and tweak:

skillprobe init ./skills/commit --harness claude-code

The init command supports both Anthropic and OpenAI as providers for test generation. It defaults to Anthropic with claude-sonnet-4-6; pass --provider openai and --model gpt-4o to use OpenAI instead. Either way, you need an API key for the provider you choose, set via ANTHROPIC_API_KEY or OPENAI_API_KEY.

Commands

skillprobe run <test.yaml> runs test scenarios against a real coding tool.

Flag	Default	Description
`--harness`	from YAML	`claude-code` or `cursor`
`--model`	from YAML	Model to use for the tool under test
`--parallel`	1	Number of scenarios to run concurrently
`--timeout`	from YAML	Per-scenario timeout in seconds
`--max-cost`	none	Max USD spend (Claude Code only)

skillprobe init <skill-dir> generates starter test YAML from a skill definition.

Flag	Default	Description
`--harness`	`claude-code`	Target harness
`--output`	`tests/<skill>.yaml`	Output YAML path
`--provider`	`anthropic`	LLM provider for generation
`--model`	auto	Model for generation
`--fixtures-dir`	`fixtures`	Where to write fixture directories

Using in CI

skillprobe works well in CI for catching regressions when models update or skills change. The CI environment needs the target tool's CLI installed and authenticated, since skillprobe spawns it as a subprocess.

# .github/workflows/skill-tests.yml
name: skill-tests

on:
  push:
    paths: ["skills/**", "tests/**"]
  schedule:
    - cron: "0 6 * * 1"  # weekly, Mondays at 06:00 UTC

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - run: npm install -g @anthropic-ai/claude-code

      - uses: astral-sh/setup-uv@v4

      - run: uv tool install skillprobe

      - run: skillprobe run tests/my-skill.yaml --harness claude-code
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Why not promptfoo

Tools like promptfoo test prompts in isolation by making their own API calls, outside the tool that will actually use them. skillprobe runs the real tools as subprocesses in real workspaces, so it tests the full stack: skill loading, tool use, file system interactions, and multi-turn conversations. It also works with subscription logins: the tool under test needs no API key, and only init requires one if you use it.

Release history

0.5.0

Apr 11, 2026

0.4.0

Apr 3, 2026

0.3.0

Apr 3, 2026

0.2.0

Apr 2, 2026

0.1.1

Apr 1, 2026

This version

0.1.0

Apr 1, 2026

Download files

Source Distribution

skillprobe-0.1.0.tar.gz (31.3 kB)

Uploaded Apr 1, 2026 Source

Built Distribution

skillprobe-0.1.0-py3-none-any.whl (18.4 kB)

Uploaded Apr 1, 2026 Python 3

File details

Details for the file skillprobe-0.1.0.tar.gz.

File metadata

Download URL: skillprobe-0.1.0.tar.gz
Upload date: Apr 1, 2026
Size: 31.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for skillprobe-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`c924b7f15f108e395502843ac67fa4384f9955ea385761b5cdc56bce53fa9c93`
MD5	`e1a79d3c4405409dd81a86797cda6d4c`
BLAKE2b-256	`ad05c138f1d7e926bfffb9f3d5193ed674687cce4fbc0742b7ff6d884c775f91`

Provenance

The following attestation bundles were made for skillprobe-0.1.0.tar.gz:

Publisher: publish.yml on Anyesh/skillprobe

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skillprobe-0.1.0.tar.gz
- Subject digest: c924b7f15f108e395502843ac67fa4384f9955ea385761b5cdc56bce53fa9c93
- Sigstore transparency entry: 1205400207
- Sigstore integration time: Apr 1, 2026
Source repository:
- Permalink: Anyesh/skillprobe@a85ef50e0466cfc1339387ad796a5438ab1946c3
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Anyesh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a85ef50e0466cfc1339387ad796a5438ab1946c3
- Trigger Event: push

File details

Details for the file skillprobe-0.1.0-py3-none-any.whl.

File metadata

Download URL: skillprobe-0.1.0-py3-none-any.whl
Upload date: Apr 1, 2026
Size: 18.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for skillprobe-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`03d57143d327599519ba057eb7afb2bf9a1dc4c4ee45d46ce5167031d210d798`
MD5	`1382a648fafc51a100ef766c61890d54`
BLAKE2b-256	`2317cc2842bcd01d0bcd9f3bbe340fc1fdadba6a2234c0eb10335db5ec9a411c`

Provenance

The following attestation bundles were made for skillprobe-0.1.0-py3-none-any.whl:

Publisher: publish.yml on Anyesh/skillprobe

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: skillprobe-0.1.0-py3-none-any.whl
- Subject digest: 03d57143d327599519ba057eb7afb2bf9a1dc4c4ee45d46ce5167031d210d798
- Sigstore transparency entry: 1205400210
- Sigstore integration time: Apr 1, 2026
Source repository:
- Permalink: Anyesh/skillprobe@a85ef50e0466cfc1339387ad796a5438ab1946c3
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Anyesh
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@a85ef50e0466cfc1339387ad796a5438ab1946c3
- Trigger Event: push

skillprobe 0.1.0
