The development workbench for Agent Skills — develop, test, and evaluate skills across agent frameworks.

SkillBench

The pytest for Agent Skills — develop, test, and evaluate skills across agent frameworks.

skills.sh     → discover skills
GitHub repos  → store skills
SkillBench     → develop, test, and evaluate skills

Quick Start

git clone https://github.com/YOUR_USERNAME/skillbench.git
cd skillbench
python3 -m venv .venv && source .venv/bin/activate
pip install .

# Run an eval in under 60 seconds (the azure adapter needs AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT set)
skillbench pull vercel-labs/skills/skills/find-skills
skillbench test ~/.skillbench/skills/vercel-labs/skills/find-skills \
  --adapter azure --model YOUR_DEPLOYMENT

What It Does

  • Pull skills from any GitHub repo into a local cache
  • Test skills against real agent frameworks (Claude, OpenAI, Azure, Aider, Codex, Cursor, Claude Code, OpenCode)
  • Evaluate across multiple frameworks in one command and compare results
  • Track run history with persistent SQLite storage and JSON/terminal reports
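
To make the run-history idea concrete, here is a minimal sketch of persisting runs in SQLite with Python's standard `sqlite3` module. The table layout, column names, and the `open_history`, `record_run`, and `load_report` helpers are assumptions made for illustration; SkillBench's actual schema in `store/history.py` may look quite different.

```python
import json
import sqlite3

# Hypothetical schema: the real SkillBench tables and columns may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    skill   TEXT NOT NULL,
    adapter TEXT NOT NULL,
    passed  INTEGER NOT NULL,
    failed  INTEGER NOT NULL,
    report  TEXT NOT NULL
)
"""


def open_history(path=":memory:"):
    """Open (or create) a history database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_run(conn, skill, adapter, results):
    """Store one run; `results` maps scenario name -> True/False (passed)."""
    passed = sum(1 for ok in results.values() if ok)
    failed = len(results) - passed
    cur = conn.execute(
        "INSERT INTO runs (skill, adapter, passed, failed, report) "
        "VALUES (?, ?, ?, ?, ?)",
        (skill, adapter, passed, failed, json.dumps(results)),
    )
    conn.commit()
    return cur.lastrowid


def load_report(conn, run_id):
    """Return the stored per-scenario results for a run, or None if unknown."""
    row = conn.execute(
        "SELECT report FROM runs WHERE id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


conn = open_history()  # in-memory database for the demo
run_id = record_run(conn, "find-skills", "azure", {"search-for-react-skill": True})
print(run_id, load_report(conn, run_id))
```

Storing the full report as a JSON column keeps the sketch short; a real store would likely normalize scenarios into their own table so they can be queried individually.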

Example: Testing the #1 Skill on skills.sh

# Pull the top skill (503K installs)
skillbench pull vercel-labs/skills/skills/find-skills

# Set up Azure OpenAI
export AZURE_OPENAI_API_KEY="your-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export SKILLBENCH_JUDGE_MODEL="your-deployment"

# Run evaluation
skillbench test ~/.skillbench/skills/vercel-labs/skills/find-skills \
  --adapter azure --model your-deployment

# Cross-framework comparison
skillbench eval ~/.skillbench/skills/vercel-labs/skills/find-skills \
  --frameworks azure,aider --model your-deployment

Define Scenarios

Scenarios are declarative YAML test cases:

skill: find-skills
scenarios:
  - name: search-for-react-skill
    prompt: "How do I optimize my React app? Can you find a skill for that?"
    assertions:
      - type: tool_called
        tool: Bash
      - type: output_contains
        value: "skills"
      - type: llm_judge
        criteria: "Agent searched for React skills and presented install commands"
    tags: [core]

Assertion types: tool_called, file_created, output_contains, and llm_judge. The llm_judge assertion auto-detects which provider to use (Anthropic, OpenAI, or Azure).
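
The deterministic assertion types can be pictured as simple checks against a record of what the agent did. The sketch below is an illustration under stated assumptions, not SkillBench's evaluator: the `RunResult` class, the `check` and `evaluate` functions, and the `path` key for `file_created` are all hypothetical, and `llm_judge` is left out because it needs a model call.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical record of what an agent did during one scenario."""
    output: str = ""
    tools_called: list = field(default_factory=list)
    files_created: list = field(default_factory=list)


def check(assertion, result):
    """Evaluate one deterministic assertion against a run result."""
    kind = assertion["type"]
    if kind == "tool_called":
        return assertion["tool"] in result.tools_called
    if kind == "output_contains":
        return assertion["value"] in result.output
    if kind == "file_created":
        # The "path" key is an assumption; the source does not show this type's fields.
        return assertion["path"] in result.files_created
    raise ValueError(f"unsupported assertion type: {kind}")


def evaluate(scenario, result):
    """Return (passed, details) where details lists each assertion's outcome."""
    details = [(a["type"], check(a, result)) for a in scenario["assertions"]]
    return all(ok for _, ok in details), details


# A Python mirror of the YAML scenario above, minus the llm_judge assertion.
scenario = {
    "name": "search-for-react-skill",
    "assertions": [
        {"type": "tool_called", "tool": "Bash"},
        {"type": "output_contains", "value": "skills"},
    ],
}
result = RunResult(output="Found 3 skills for React", tools_called=["Bash"])
print(evaluate(scenario, result))
```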

Or generate them automatically:

skillbench generate-scenarios ./my-skill/

CLI Reference

Command                               Description
skillbench init <name>                Scaffold a new skill with SKILL.md template
skillbench validate <path>            Validate SKILL.md against the spec
skillbench pull <source>              Fetch skills from GitHub
skillbench generate-scenarios <path>  AI-generate test scenarios
skillbench test <path>                Run scenarios with a single adapter
skillbench eval <path>                Evaluate across multiple adapters
skillbench history                    View past run results
skillbench report <run-id>            Display a previous run report
skillbench benchmark pull             Pull the top 100 skills from skills.sh
skillbench contribute <path>          Submit improvements back as a PR (coming in v0.2)

Adapters

Adapter          --adapter    Type  Requirements
Claude API       claude       API   ANTHROPIC_API_KEY
OpenAI API       openai       API   OPENAI_API_KEY
Azure OpenAI     azure        API   AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT
Custom endpoint  openai       API   OPENAI_API_KEY + OPENAI_BASE_URL
Aider            aider        CLI   aider CLI installed (pip install aider-chat)
Claude Code      claude-code  CLI   claude CLI installed
Cursor           cursor       CLI   agent CLI + CURSOR_API_KEY
OpenAI Codex     codex        CLI   codex CLI installed (npm i -g @openai/codex)
OpenCode         opencode     CLI   opencode CLI installed (opencode.ai)

API adapters simulate tool use in a sandbox: the model sees Bash, Read, and Write tools, and SkillBench executes the resulting calls in an isolated workspace.

CLI adapters run the actual agent binary with real filesystem access in a temporary workspace, which gives true end-to-end testing.
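
One way to picture the isolated workspace is a temporary directory that tool handlers are not allowed to escape. The following is a rough sketch, assuming hypothetical `Workspace.write` and `Workspace.read` handlers for the Write and Read tools; it does not reflect how SkillBench actually implements its sandbox, and a Bash handler is omitted.

```python
import tempfile
from pathlib import Path


class Workspace:
    """Hypothetical temp-directory sandbox for Read/Write tool calls."""

    def __init__(self):
        self._tmp = tempfile.TemporaryDirectory(prefix="skillbench-")
        self.root = Path(self._tmp.name).resolve()

    def _resolve(self, relative):
        """Map a tool-supplied path into the workspace, rejecting escapes."""
        target = (self.root / relative).resolve()
        if not target.is_relative_to(self.root):  # Python 3.9+
            raise PermissionError(f"path escapes workspace: {relative}")
        return target

    def write(self, relative, content):
        """Handler for a Write tool call."""
        target = self._resolve(relative)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        return f"wrote {len(content)} characters to {relative}"

    def read(self, relative):
        """Handler for a Read tool call."""
        return self._resolve(relative).read_text(encoding="utf-8")

    def close(self):
        """Delete the workspace and everything in it."""
        self._tmp.cleanup()


demo = Workspace()
print(demo.write("notes/plan.md", "install the skill"))
print(demo.read("notes/plan.md"))
demo.close()
```

Resolving the path before the containment check matters: it normalizes `..` segments and absolute paths, so a request such as `../secrets.txt` is rejected instead of silently landing outside the workspace.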

Use --model to override the default model/deployment:

skillbench test ./my-skill --adapter azure --model my-gpt4o-deployment
skillbench test ./my-skill --adapter aider --model azure/my-deployment

Architecture

src/skillbench/
├── cli.py              # Typer CLI
├── core/
│   ├── skill.py        # SKILL.md parser + validator
│   ├── scenario.py     # Scenario models + YAML loader
│   ├── runner.py       # Scenario execution engine
│   ├── evaluator.py    # Assertions (deterministic + LLM judge)
│   └── scenario_gen.py # AI-assisted scenario generation
├── adapters/
│   ├── base.py         # Framework adapter protocol
│   ├── claude_api.py   # Anthropic Claude API
│   ├── openai_api.py   # OpenAI / Azure / compatible APIs
│   ├── aider.py        # Aider CLI
│   ├── claude_code.py  # Claude Code CLI
│   ├── cursor.py       # Cursor agent CLI
│   ├── codex.py        # OpenAI Codex CLI
│   └── opencode.py     # OpenCode CLI
├── registry/
│   ├── puller.py       # Pull skills from GitHub
│   ├── cache.py        # Local skill cache (~/.skillbench/skills/)
│   └── contribute.py   # PR workflow (v0.2)
├── store/
│   └── history.py      # SQLite run history
└── reports/
    └── generator.py    # Rich terminal + JSON reports
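
The adapter protocol in `adapters/base.py` is not shown here, but the general shape of such a layer can be sketched with `typing.Protocol`. Everything below is an assumption made for illustration: the `FrameworkAdapter` protocol, the `AdapterResult` fields, the `run` signature, and the toy `EchoAdapter` are hypothetical and not taken from the SkillBench source.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol, runtime_checkable


@dataclass
class AdapterResult:
    """Hypothetical normalized result that every adapter returns."""
    output: str
    tools_called: list = field(default_factory=list)


@runtime_checkable
class FrameworkAdapter(Protocol):
    """Hypothetical interface: one method that runs a prompt with a skill."""
    name: str

    def run(self, skill_path: str, prompt: str,
            model: Optional[str] = None) -> AdapterResult:
        ...


class EchoAdapter:
    """Toy adapter for the demo; a real one would call an API or a CLI."""
    name = "echo"

    def run(self, skill_path, prompt, model=None):
        return AdapterResult(output=f"[{model or 'default'}] {prompt}")


def run_across(adapters, skill_path, prompt):
    """Run one prompt through several adapters and key results by name."""
    return {a.name: a.run(skill_path, prompt) for a in adapters}


results = run_across([EchoAdapter()], "./my-skill", "hello")
print(results["echo"].output)
```

Because the protocol is structural, `EchoAdapter` satisfies it without inheriting from anything, which is what lets API-backed and CLI-backed adapters share one runner.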

CI/CD

# .github/workflows/skill-eval.yml
name: Skill Evaluation
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install .
      - run: skillbench eval ./my-skill --frameworks azure -o results.json
        env:
          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          SKILLBENCH_JUDGE_MODEL: ${{ secrets.SKILLBENCH_JUDGE_MODEL }}

Exit codes: 0 = all scenarios pass, 1 = one or more failures. Use -o to write machine-readable JSON results.
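
If a later CI step needs to inspect the JSON file itself, the pass/fail rule can be re-derived in a few lines. This is only a sketch: the report layout used here (a top-level `scenarios` list whose items carry `name` and `passed`) is a hypothetical schema, so check the actual contents of `results.json` before relying on it.

```python
import json


def exit_code_from_report(report):
    """Return 0 when every scenario passed, 1 otherwise.

    Assumes a hypothetical layout: {"scenarios": [{"name": ..., "passed": bool}]}.
    An empty scenario list counts as success in this sketch.
    """
    scenarios = report.get("scenarios", [])
    return 0 if all(s.get("passed") for s in scenarios) else 1


def failed_names(report):
    """List the names of scenarios that did not pass."""
    return [s.get("name") for s in report.get("scenarios", []) if not s.get("passed")]


# Inline sample standing in for the contents of results.json.
sample = json.loads(
    '{"scenarios": ['
    '{"name": "search-for-react-skill", "passed": true},'
    '{"name": "handles-no-match", "passed": false}]}'
)
print(exit_code_from_report(sample), failed_names(sample))
```

In a workflow step, the return value would typically be passed to `sys.exit` so that the job fails when any scenario does.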

Benchmark Suite

SkillBench includes a manifest of the top 100 skills from skills.sh:

skillbench benchmark pull     # Pull all skills

Requirements

  • Python 3.11+
  • API keys for the adapters you want to use

Contributing

Contributions welcome! To get started:

git clone https://github.com/YOUR_USERNAME/skillbench.git
cd skillbench
python3 -m venv .venv && source .venv/bin/activate
pip install ".[dev]"
pytest tests/ -v

Please open an issue before submitting large changes.

License

MIT
