The development workbench for Agent Skills — develop, test, and evaluate skills across agent frameworks.

SkillBench

The pytest for Agent Skills — develop, test, and evaluate skills across agent frameworks.

skills.sh     → discover skills
GitHub repos  → store skills
SkillBench    → develop, test, and evaluate skills

Quick Start

git clone https://github.com/YOUR_USERNAME/skillbench.git
cd skillbench
python3 -m venv .venv && source .venv/bin/activate
pip install .

# Run an eval in under 60 seconds (the azure adapter needs AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT set)
skillbench pull vercel-labs/skills/skills/find-skills
skillbench test ~/.skillbench/skills/vercel-labs/skills/find-skills \
  --adapter azure --model YOUR_DEPLOYMENT

What It Does

Pull skills from any GitHub repo into a local cache
Test skills against real agent frameworks (Claude, OpenAI, Azure, Aider, Codex, Cursor, Claude Code, OpenCode)
Evaluate across multiple frameworks in one command and compare results
Track run history with persistent SQLite storage and JSON/terminal reports
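
As a rough illustration of the run-history idea, the sketch below stores runs in SQLite and computes a per-adapter pass rate. The table layout, column names, and the helper functions `open_history`, `record_run`, and `pass_rates` are all hypothetical; SkillBench's actual schema in `store/history.py` may look quite different.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: SkillBench's real tables and columns may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    skill TEXT NOT NULL,
    adapter TEXT NOT NULL,
    passed INTEGER NOT NULL,
    failed INTEGER NOT NULL,
    started_at TEXT NOT NULL
)
"""


def open_history(path=":memory:"):
    """Open (or create) a history database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_run(conn, skill, adapter, passed, failed):
    """Insert one run and return its row id."""
    cur = conn.execute(
        "INSERT INTO runs (skill, adapter, passed, failed, started_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (skill, adapter, passed, failed, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid


def pass_rates(conn, skill):
    """Return {adapter: fraction of scenarios passed} across all runs of a skill."""
    rows = conn.execute(
        "SELECT adapter, SUM(passed), SUM(passed + failed) "
        "FROM runs WHERE skill = ? GROUP BY adapter",
        (skill,),
    ).fetchall()
    return {adapter: ok / total for adapter, ok, total in rows if total}
```

Passing a file path instead of `":memory:"` would make the history persist between invocations.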

Example: Testing the #1 Skill on skills.sh

# Pull the top skill (503K installs)
skillbench pull vercel-labs/skills/skills/find-skills

# Set up Azure OpenAI
export AZURE_OPENAI_API_KEY="your-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export SKILLBENCH_JUDGE_MODEL="your-deployment"

# Run evaluation
skillbench test ~/.skillbench/skills/vercel-labs/skills/find-skills \
  --adapter azure --model your-deployment

# Cross-framework comparison
skillbench eval ~/.skillbench/skills/vercel-labs/skills/find-skills \
  --frameworks azure,aider --model your-deployment
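
The commands above pull `vercel-labs/skills/skills/find-skills` and then test it from `~/.skillbench/skills/vercel-labs/skills/find-skills`, which suggests the cache is keyed by owner, repo, and skill name. The sketch below encodes that inferred mapping; the `cache_path` helper is hypothetical and the rule is a guess from this one example, not documented behavior.

```python
from pathlib import Path, PurePosixPath


def cache_path(source: str, cache_root: Path = Path.home() / ".skillbench" / "skills") -> Path:
    """Map an owner/repo/path source to a local cache directory.

    Assumption: the cache keeps the owner, the repo, and the last path
    component (the skill name), as in the find-skills example.
    """
    parts = PurePosixPath(source).parts
    if len(parts) < 3:
        # How the real CLI treats a bare owner/repo source is not covered here.
        raise ValueError(f"expected owner/repo/path, got {source!r}")
    owner, repo, skill_name = parts[0], parts[1], parts[-1]
    return cache_root / owner / repo / skill_name
```

Under this assumption, two skills with the same name in different subdirectories of one repo would collide, so the real cache may well use a different scheme.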

Define Scenarios

Scenarios are declarative YAML test cases:

skill: find-skills
scenarios:
  - name: search-for-react-skill
    prompt: "How do I optimize my React app? Can you find a skill for that?"
    assertions:
      - type: tool_called
        tool: Bash
      - type: output_contains
        value: "skills"
      - type: llm_judge
        criteria: "Agent searched for React skills and presented install commands"
    tags: [core]

Assertion types: tool_called, file_created, output_contains, and llm_judge, which auto-detects the judge provider (Anthropic, OpenAI, or Azure).
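
To make the deterministic assertion types concrete, here is a minimal sketch of how they could be checked against a finished run. The `RunTranscript` class, the `check_assertion` and `evaluate` functions, and the `path` key for `file_created` are assumptions made for illustration; this is not SkillBench's evaluator, and `llm_judge` is left out because it needs a model call.

```python
from dataclasses import dataclass, field


@dataclass
class RunTranscript:
    """Hypothetical record of one agent run; not SkillBench's data model."""
    output: str
    tools_called: list = field(default_factory=list)
    files_created: list = field(default_factory=list)


def check_assertion(assertion: dict, run: RunTranscript) -> bool:
    """Evaluate one deterministic assertion against a run."""
    kind = assertion["type"]
    if kind == "tool_called":
        return assertion["tool"] in run.tools_called
    if kind == "output_contains":
        return assertion["value"] in run.output
    if kind == "file_created":
        # The "path" key is an assumption; the source does not show this type's fields.
        return assertion["path"] in run.files_created
    raise ValueError(f"assertion type not handled in this sketch: {kind}")


def evaluate(scenario: dict, run: RunTranscript) -> list:
    """Return (assertion type, passed) pairs in declaration order."""
    return [(a["type"], check_assertion(a, run)) for a in scenario["assertions"]]


# The deterministic part of the scenario shown above, as a plain dict.
SCENARIO = {
    "name": "search-for-react-skill",
    "assertions": [
        {"type": "tool_called", "tool": "Bash"},
        {"type": "output_contains", "value": "skills"},
    ],
}
```

A scenario would count as passing only if every pair reports `True`.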

Or generate them automatically:

skillbench generate-scenarios ./my-skill/

CLI Reference

Command	Description
`skillbench init <name>`	Scaffold a new skill with SKILL.md template
`skillbench validate <path>`	Validate SKILL.md against the spec
`skillbench pull <source>`	Fetch skills from GitHub
`skillbench generate-scenarios <path>`	AI-generate test scenarios
`skillbench test <path>`	Run scenarios with a single adapter
`skillbench eval <path>`	Evaluate across multiple adapters
`skillbench history`	View past run results
`skillbench report <run-id>`	Display a previous run report
`skillbench benchmark pull`	Pull the top 100 skills from skills.sh
`skillbench contribute <path>`	(Coming in v0.2) Submit improvements back as a pull request

Adapters

Adapter	`--adapter`	Type	Requirements
Claude API	`claude`	API	`ANTHROPIC_API_KEY`
OpenAI API	`openai`	API	`OPENAI_API_KEY`
Azure OpenAI	`azure`	API	`AZURE_OPENAI_API_KEY` + `AZURE_OPENAI_ENDPOINT`
Custom endpoint	`openai`	API	`OPENAI_API_KEY` + `OPENAI_BASE_URL`
Aider	`aider`	CLI	`aider` CLI installed (`pip install aider-chat`)
Claude Code	`claude-code`	CLI	`claude` CLI installed
Cursor	`cursor`	CLI	`agent` CLI + `CURSOR_API_KEY`
OpenAI Codex	`codex`	CLI	`codex` CLI installed (`npm i -g @openai/codex`)
OpenCode	`opencode`	CLI	`opencode` CLI installed (opencode.ai)

API adapters simulate tool use in a sandbox — the model sees Bash/Read/Write tools and SkillBench executes them in an isolated workspace.

CLI adapters run the actual agent binary with real filesystem access in a temp workspace — true end-to-end testing.
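
The shared shape of both adapter kinds can be pictured as a small protocol plus a throwaway workspace. The sketch below is only an illustration: `Adapter`, `AdapterResult`, `FakeWriteAdapter`, and `run_isolated` are hypothetical names, and the real protocol in `adapters/base.py` may have a different signature.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Protocol


@dataclass
class AdapterResult:
    """Hypothetical result type: final output plus the tools the agent used."""
    output: str
    tools_called: list = field(default_factory=list)


class Adapter(Protocol):
    """Assumed adapter interface; not taken from SkillBench's source."""
    name: str

    def run(self, prompt: str, workspace: Path) -> AdapterResult: ...


class FakeWriteAdapter:
    """Stand-in adapter that pretends the agent used a Write tool."""
    name = "fake"

    def run(self, prompt: str, workspace: Path) -> AdapterResult:
        (workspace / "notes.txt").write_text(prompt)
        return AdapterResult(output="wrote notes.txt", tools_called=["Write"])


def run_isolated(adapter: Adapter, prompt: str):
    """Run an adapter in a temp workspace and report which files it left behind."""
    with tempfile.TemporaryDirectory(prefix="skillbench-") as tmp:
        workspace = Path(tmp)
        result = adapter.run(prompt, workspace)
        # Collect file names before the directory is deleted on exit.
        created = sorted(p.name for p in workspace.iterdir())
    return result, created
```

Listing the workspace before cleanup is one way a `file_created` assertion could be supported without touching the user's real files.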

Use --model to override the default model/deployment:

skillbench test ./my-skill --adapter azure --model my-gpt4o-deployment
skillbench test ./my-skill --adapter aider --model azure/my-deployment

Architecture

src/skillbench/
├── cli.py              # Typer CLI
├── core/
│   ├── skill.py        # SKILL.md parser + validator
│   ├── scenario.py     # Scenario models + YAML loader
│   ├── runner.py       # Scenario execution engine
│   ├── evaluator.py    # Assertions (deterministic + LLM judge)
│   └── scenario_gen.py # AI-assisted scenario generation
├── adapters/
│   ├── base.py         # Framework adapter protocol
│   ├── claude_api.py   # Anthropic Claude API
│   ├── openai_api.py   # OpenAI / Azure / compatible APIs
│   ├── aider.py        # Aider CLI
│   ├── claude_code.py  # Claude Code CLI
│   ├── cursor.py       # Cursor agent CLI
│   ├── codex.py        # OpenAI Codex CLI
│   └── opencode.py     # OpenCode CLI
├── registry/
│   ├── puller.py       # Pull skills from GitHub
│   ├── cache.py        # Local skill cache (~/.skillbench/skills/)
│   └── contribute.py   # PR workflow (v0.2)
├── store/
│   └── history.py      # SQLite run history
└── reports/
    └── generator.py    # Rich terminal + JSON reports

CI/CD

# .github/workflows/skill-eval.yml
name: Skill Evaluation
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install .
      - run: skillbench eval ./my-skill --frameworks azure -o results.json
        env:
          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          SKILLBENCH_JUDGE_MODEL: ${{ secrets.SKILLBENCH_JUDGE_MODEL }}

Exit codes: 0 = all scenarios pass, 1 = one or more failures. Pass -o to write machine-readable JSON results.
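
Outside GitHub Actions, the same exit-code contract can be used from a script. The sketch below wraps `skillbench eval` with `subprocess`, using only the flags shown in this README; the helper names `build_eval_command`, `describe_exit`, and `run_eval` are made up for illustration, and any exit code other than 0 or 1 is treated as unexpected because the README does not document others.

```python
import subprocess


def build_eval_command(skill_path, frameworks, output=None):
    """Assemble the `skillbench eval` argument list from documented flags."""
    cmd = ["skillbench", "eval", skill_path, "--frameworks", ",".join(frameworks)]
    if output:
        cmd += ["-o", output]
    return cmd


def describe_exit(code):
    """Translate the documented exit codes into a short message."""
    if code == 0:
        return "all scenarios passed"
    if code == 1:
        return "one or more scenarios failed"
    return f"unexpected exit code {code}"


def run_eval(skill_path, frameworks, output=None):
    """Run the evaluation and return the process exit code (needs skillbench on PATH)."""
    completed = subprocess.run(build_eval_command(skill_path, frameworks, output), check=False)
    return completed.returncode
```

A caller could then do `raise SystemExit(run_eval("./my-skill", ["azure"], "results.json"))` to propagate the result to a CI system.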

Benchmark Suite

SkillBench includes a manifest of the top 100 skills from skills.sh:

skillbench benchmark pull     # Pull all skills

Requirements

Python 3.11+
API keys for the adapters you want to use

Contributing

Contributions welcome! To get started:

git clone https://github.com/YOUR_USERNAME/skillbench.git
cd skillbench
python3 -m venv .venv && source .venv/bin/activate
pip install ".[dev]"
pytest tests/ -v

Please open an issue before submitting large changes.

License

MIT

Release history

0.1.0 (this version): Mar 13, 2026
0.0.1.dev1 (pre-release): Mar 12, 2026

Download files

Source Distribution

skillbench-0.1.0.tar.gz (42.1 kB)

Uploaded Mar 13, 2026 Source

Built Distribution

skillbench-0.1.0-py3-none-any.whl (50.4 kB)

Uploaded Mar 13, 2026 Python 3

File details

Details for the file skillbench-0.1.0.tar.gz.

File metadata

Download URL: skillbench-0.1.0.tar.gz
Upload date: Mar 13, 2026
Size: 42.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for skillbench-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1de947890fe928ec3480e60afcf8156bf16ed03496c958d5a0d6ee99f2021533`
MD5	`37c13145c389122ce46a3fdd78fb8e2e`
BLAKE2b-256	`f981a9dfef8503b2734d721d437790e228a3b82d2068e9fa4c289c2fe98045b3`

File details

Details for the file skillbench-0.1.0-py3-none-any.whl.

File metadata

Download URL: skillbench-0.1.0-py3-none-any.whl
Upload date: Mar 13, 2026
Size: 50.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.9.6

File hashes

Hashes for skillbench-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1ff08e6d730aa12ce4321c86b88c69b9842272130cf3ccc9d860cfae8462c530`
MD5	`a4f7476b4e30c396b38a34c8837a4dce`
BLAKE2b-256	`24f740bda33c6c493a4a5b4eae83cad5c6309e1b1d86945ba8f7c64c67c1b5b2`
