The development workbench for Agent Skills — develop, test, and evaluate skills across agent frameworks.
SkillBench
The pytest for Agent Skills — develop, test, and evaluate skills across agent frameworks.
skills.sh → discover skills
GitHub repos → store skills
SkillBench → develop, test, and evaluate skills
Quick Start
git clone https://github.com/YOUR_USERNAME/skillbench.git
cd skillbench
python3 -m venv .venv && source .venv/bin/activate
pip install .
# Run an eval in under 60 seconds
skillbench pull vercel-labs/skills/skills/find-skills
skillbench test ~/.skillbench/skills/vercel-labs/skills/find-skills \
--adapter azure --model YOUR_DEPLOYMENT
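The two commands above suggest that a pulled source such as vercel-labs/skills/skills/find-skills ends up under ~/.skillbench/skills/vercel-labs/skills/find-skills. The sketch below shows one mapping that is consistent with that single example, assuming the cache layout is owner/repo/skill-name. The helper name `cache_path_for` is hypothetical and not part of SkillBench's API; the real puller may resolve paths differently.

```python
from pathlib import Path, PurePosixPath

# Assumed cache root, taken from the paths shown in the Quick Start.
CACHE_ROOT = Path.home() / ".skillbench" / "skills"


def cache_path_for(source: str, root: Path = CACHE_ROOT) -> Path:
    """Map 'owner/repo/path/to/skill' to '<root>/owner/repo/skill'.

    Hypothetical helper: the layout is inferred from one example only.
    """
    parts = PurePosixPath(source.strip("/")).parts
    if len(parts) < 3:
        raise ValueError(f"expected owner/repo/skill-path, got {source!r}")
    owner, repo, skill_name = parts[0], parts[1], parts[-1]
    return root / owner / repo / skill_name


if __name__ == "__main__":
    print(cache_path_for("vercel-labs/skills/skills/find-skills"))
```

Passing the resulting path to skillbench test, as in the Quick Start, avoids typing the cache location by hand.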
What It Does
- Pull skills from any GitHub repo into a local cache
- Test skills against real agent frameworks (Claude, OpenAI, Azure, Aider, Codex, Cursor, Claude Code, OpenCode)
- Evaluate across multiple frameworks in one command and compare results
- Track run history with persistent SQLite storage and JSON/terminal reports
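To make the run-history idea concrete, here is a minimal sketch of persisting runs in SQLite with Python's standard sqlite3 module. The table schema and the function names (`open_store`, `record_run`, `history`) are assumptions for illustration; SkillBench's actual store in store/history.py may look different.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: SkillBench's real tables may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS runs (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    skill      TEXT NOT NULL,
    adapter    TEXT NOT NULL,
    started_at TEXT NOT NULL,
    passed     INTEGER NOT NULL,
    failed     INTEGER NOT NULL,
    report     TEXT NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the history database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_run(conn, skill: str, adapter: str, results: dict) -> int:
    """Store one run; `results` maps scenario name to pass/fail."""
    passed = sum(1 for ok in results.values() if ok)
    cur = conn.execute(
        "INSERT INTO runs (skill, adapter, started_at, passed, failed, report)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (
            skill,
            adapter,
            datetime.now(timezone.utc).isoformat(),
            passed,
            len(results) - passed,
            json.dumps(results),  # keep the full report as JSON text
        ),
    )
    conn.commit()
    return cur.lastrowid


def history(conn, skill: str) -> list:
    """Return (id, adapter, passed, failed) rows for a skill, oldest first."""
    return conn.execute(
        "SELECT id, adapter, passed, failed FROM runs WHERE skill = ? ORDER BY id",
        (skill,),
    ).fetchall()
```

Storing the full report as JSON next to the summary counts keeps history queries cheap while still allowing a past report to be redisplayed.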
Example: Testing the #1 Skill on skills.sh
# Pull the top skill (503K installs)
skillbench pull vercel-labs/skills/skills/find-skills
# Set up Azure OpenAI
export AZURE_OPENAI_API_KEY="your-key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export SKILLBENCH_JUDGE_MODEL="your-deployment"
# Run evaluation
skillbench test ~/.skillbench/skills/vercel-labs/skills/find-skills \
--adapter azure --model your-deployment
# Cross-framework comparison
skillbench eval ~/.skillbench/skills/vercel-labs/skills/find-skills \
--frameworks azure,aider --model your-deployment
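A cross-framework comparison boils down to grouping scenario outcomes by framework. The sketch below assumes a flat list of per-scenario results; the `ScenarioResult` type and both helper functions are hypothetical and only illustrate the kind of summary such a comparison could produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """Hypothetical outcome of one scenario on one framework."""
    framework: str
    scenario: str
    passed: bool


def pass_rates(results: list) -> dict:
    """Fraction of scenarios passed, per framework."""
    totals = {}
    for r in results:
        passed, total = totals.get(r.framework, (0, 0))
        totals[r.framework] = (passed + int(r.passed), total + 1)
    return {fw: passed / total for fw, (passed, total) in totals.items()}


def disagreements(results: list) -> list:
    """Names of scenarios whose outcome differs between frameworks."""
    outcomes = {}
    for r in results:
        outcomes.setdefault(r.scenario, set()).add(r.passed)
    return sorted(name for name, seen in outcomes.items() if len(seen) > 1)
```

Scenarios that appear in `disagreements` are often the most useful ones to inspect, since they point at behavior that depends on the framework rather than on the skill.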
Define Scenarios
Scenarios are declarative YAML test cases:
skill: find-skills
scenarios:
  - name: search-for-react-skill
    prompt: "How do I optimize my React app? Can you find a skill for that?"
    assertions:
      - type: tool_called
        tool: Bash
      - type: output_contains
        value: "skills"
      - type: llm_judge
        criteria: "Agent searched for React skills and presented install commands"
    tags: [core]
Assertion types: tool_called, file_created, output_contains, and llm_judge (the judge provider is auto-detected: Anthropic, OpenAI, or Azure).
Or generate them automatically:
skillbench generate-scenarios ./my-skill/
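The assertion types listed above can be read as small predicates over what the agent did. The following is a rough sketch of that idea, not SkillBench's evaluator: `RunRecord` and `check` are hypothetical names, the "path" key for file_created is an assumption (the scenario example shows no file_created assertion), and the LLM judge is injected as a plain callable so the sketch runs without any API key.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Optional


@dataclass
class RunRecord:
    """Hypothetical record of what an agent did during one scenario."""
    output: str
    tool_calls: list = field(default_factory=list)  # tool names, in order
    workspace: Path = Path(".")


def check(
    assertion: dict,
    run: RunRecord,
    judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Evaluate one assertion dict (as loaded from scenario YAML)."""
    kind = assertion["type"]
    if kind == "tool_called":
        return assertion["tool"] in run.tool_calls
    if kind == "output_contains":
        return assertion["value"] in run.output
    if kind == "file_created":
        # Assumed key name; relative to the scenario workspace.
        return (run.workspace / assertion["path"]).exists()
    if kind == "llm_judge":
        if judge is None:
            raise ValueError("llm_judge needs a judge callable")
        return judge(assertion["criteria"], run.output)
    raise ValueError(f"unknown assertion type: {kind}")
```

The first three types are deterministic and cheap, so it can make sense to evaluate them before spending a model call on llm_judge.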
CLI Reference
| Command | Description |
|---|---|
| skillbench init <name> | Scaffold a new skill with SKILL.md template |
| skillbench validate <path> | Validate SKILL.md against the spec |
| skillbench pull <source> | Fetch skills from GitHub |
| skillbench generate-scenarios <path> | AI-generate test scenarios |
| skillbench test <path> | Run scenarios with a single adapter |
| skillbench eval <path> | Evaluate across multiple adapters |
| skillbench history | View past run results |
| skillbench report <run-id> | Display a previous run report |
| skillbench benchmark pull | Pull the top 100 skills from skills.sh |
| skillbench contribute <path> | (Coming in v0.2) PR improvements back |
Adapters
| Adapter | --adapter | Type | Requirements |
|---|---|---|---|
| Claude API | claude | API | ANTHROPIC_API_KEY |
| OpenAI API | openai | API | OPENAI_API_KEY |
| Azure OpenAI | azure | API | AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT |
| Custom endpoint | openai | API | OPENAI_API_KEY + OPENAI_BASE_URL |
| Aider | aider | CLI | aider CLI installed (pip install aider-chat) |
| Claude Code | claude-code | CLI | claude CLI installed |
| Cursor | cursor | CLI | agent CLI + CURSOR_API_KEY |
| OpenAI Codex | codex | CLI | codex CLI installed (npm i -g @openai/codex) |
| OpenCode | opencode | CLI | opencode CLI installed (opencode.ai) |
API adapters simulate tool use in a sandbox — the model sees Bash/Read/Write tools and SkillBench executes them in an isolated workspace.
CLI adapters run the actual agent binary with real filesystem access in a temp workspace — true end-to-end testing.
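As an illustration of the API-adapter approach described above, the sketch below implements Bash/Read/Write tools that operate inside a single temporary directory and records which tools were called. The `Workspace` class is hypothetical and is not SkillBench's implementation. Note that running a shell with a working directory is not real isolation; a production sandbox would need stronger confinement.

```python
import subprocess
import tempfile
from pathlib import Path


class Workspace:
    """Hypothetical tool executor confined to one temporary directory."""

    def __init__(self) -> None:
        self._tmp = tempfile.TemporaryDirectory(prefix="skillbench-")
        self.root = Path(self._tmp.name).resolve()
        self.calls = []  # tool names, in call order

    def _resolve(self, relative: str) -> Path:
        # Reject paths that resolve outside the workspace root.
        path = (self.root / relative).resolve()
        if not path.is_relative_to(self.root):
            raise PermissionError(f"{relative!r} escapes the workspace")
        return path

    def write(self, relative: str, content: str) -> None:
        self.calls.append("Write")
        target = self._resolve(relative)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

    def read(self, relative: str) -> str:
        self.calls.append("Read")
        return self._resolve(relative).read_text()

    def bash(self, command: str, timeout: float = 30.0) -> str:
        self.calls.append("Bash")
        # cwd only sets the starting directory; it does not jail the shell.
        done = subprocess.run(
            command,
            shell=True,
            cwd=self.root,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return done.stdout + done.stderr

    def close(self) -> None:
        """Delete the temporary workspace."""
        self._tmp.cleanup()
```

The recorded `calls` list is the kind of data a tool_called assertion could be checked against after the model finishes its turn.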
Use --model to override the default model/deployment:
skillbench test ./my-skill --adapter azure --model my-gpt4o-deployment
skillbench test ./my-skill --adapter aider --model azure/my-deployment
Architecture
src/skillbench/
├── cli.py                  # Typer CLI
├── core/
│   ├── skill.py            # SKILL.md parser + validator
│   ├── scenario.py         # Scenario models + YAML loader
│   ├── runner.py           # Scenario execution engine
│   ├── evaluator.py        # Assertions (deterministic + LLM judge)
│   └── scenario_gen.py     # AI-assisted scenario generation
├── adapters/
│   ├── base.py             # Framework adapter protocol
│   ├── claude_api.py       # Anthropic Claude API
│   ├── openai_api.py       # OpenAI / Azure / compatible APIs
│   ├── aider.py            # Aider CLI
│   ├── claude_code.py      # Claude Code CLI
│   ├── cursor.py           # Cursor agent CLI
│   ├── codex.py            # OpenAI Codex CLI
│   └── opencode.py         # OpenCode CLI
├── registry/
│   ├── puller.py           # Pull skills from GitHub
│   ├── cache.py            # Local skill cache (~/.skillbench/skills/)
│   └── contribute.py       # PR workflow (v0.2)
├── store/
│   └── history.py          # SQLite run history
└── reports/
    └── generator.py        # Rich terminal + JSON reports
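The tree lists adapters/base.py as a framework adapter protocol. One plausible shape for such a protocol, sketched with typing.Protocol, is shown below. Every name here (`FrameworkAdapter`, `AdapterResult`, `EchoAdapter`, `REGISTRY`, and the `run` signature) is an assumption made for illustration; the actual interface in base.py is not shown in this document.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol, runtime_checkable


@dataclass
class AdapterResult:
    """Hypothetical result returned by every adapter."""
    output: str
    tool_calls: list = field(default_factory=list)


@runtime_checkable
class FrameworkAdapter(Protocol):
    """Structural interface: anything with a name and a run() qualifies."""
    name: str

    def run(
        self, skill_path: str, prompt: str, model: Optional[str] = None
    ) -> AdapterResult:
        ...


class EchoAdapter:
    """Toy adapter that repeats the prompt; handy for dry runs."""
    name = "echo"

    def run(self, skill_path, prompt, model=None):
        return AdapterResult(output=f"[{model or 'default'}] {prompt}")


# Registry keyed by the value that would be passed to --adapter.
REGISTRY = {adapter.name: adapter for adapter in (EchoAdapter(),)}
```

A structural protocol lets API-backed and CLI-backed adapters share one runner without inheriting from a common base class.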
CI/CD
# .github/workflows/skill-eval.yml
name: Skill Evaluation
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install .
      - run: skillbench eval ./my-skill --frameworks azure -o results.json
        env:
          AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
          AZURE_OPENAI_ENDPOINT: ${{ secrets.AZURE_OPENAI_ENDPOINT }}
          SKILLBENCH_JUDGE_MODEL: ${{ secrets.SKILLBENCH_JUDGE_MODEL }}
Exit codes: 0 if all scenarios pass, 1 if any fail. Use -o to write machine-readable JSON results.
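The exit-code convention above can also be applied to a saved JSON report in a later pipeline step. The sketch below assumes a report shaped like {"scenarios": [{"name": ..., "passed": ...}]}; that schema is hypothetical, since the actual layout of results.json is not documented here, so the key names would need adjusting to the real file.

```python
import json
import sys


def exit_code(report: dict) -> int:
    """Return 0 if every scenario passed, 1 otherwise.

    Assumes a hypothetical schema: {"scenarios": [{"name", "passed"}, ...]}.
    """
    scenarios = report.get("scenarios", [])
    failed = [s["name"] for s in scenarios if not s["passed"]]
    for name in failed:
        print(f"FAILED: {name}", file=sys.stderr)
    return 1 if failed else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python gate.py results.json
    with open(sys.argv[1], encoding="utf-8") as fh:
        sys.exit(exit_code(json.load(fh)))
```

Keeping the gate in a separate step makes it possible to upload results.json as a build artifact even when the evaluation fails.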
Benchmark Suite
SkillBench includes a manifest of the top 100 skills from skills.sh:
skillbench benchmark pull # Pull all skills
Requirements
- Python 3.11+
- API keys for the adapters you want to use
Contributing
Contributions welcome! To get started:
git clone https://github.com/YOUR_USERNAME/skillbench.git
cd skillbench
python3 -m venv .venv && source .venv/bin/activate
pip install ".[dev]"
pytest tests/ -v
Please open an issue before submitting large changes.
License
MIT
File details
Details for the file skillbench-0.1.0.tar.gz.
File metadata
- Download URL: skillbench-0.1.0.tar.gz
- Upload date:
- Size: 42.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1de947890fe928ec3480e60afcf8156bf16ed03496c958d5a0d6ee99f2021533 |
| MD5 | 37c13145c389122ce46a3fdd78fb8e2e |
| BLAKE2b-256 | f981a9dfef8503b2734d721d437790e228a3b82d2068e9fa4c289c2fe98045b3 |
File details
Details for the file skillbench-0.1.0-py3-none-any.whl.
File metadata
- Download URL: skillbench-0.1.0-py3-none-any.whl
- Upload date:
- Size: 50.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ff08e6d730aa12ce4321c86b88c69b9842272130cf3ccc9d860cfae8462c530 |
| MD5 | a4f7476b4e30c396b38a34c8837a4dce |
| BLAKE2b-256 | 24f740bda33c6c493a4a5b4eae83cad5c6309e1b1d86945ba8f7c64c67c1b5b2 |