An open spec for A/B benchmarking Claude Code skills via declarative test suites.
skillevaluation
Does your Claude Code skill actually make the agent better? Prove it — with measured before/after numbers.
A Claude Code skill is just a folder — a SKILL.md plus some attachments. It's easy to write one and assume it helps. skillevaluation lets you measure the help: write a small eval.yaml next to your skill, and a runner executes each test case twice — once with the skill loaded, once without — then hands you a clear A/B delta on pass rate, speed, tokens, turns, and tool calls.
No more "I think this skill is good." Now you can say "this skill lifts pass rate 40 points and cuts tokens 43%" — and back it with reproducible cases.
The payoff
Here's the bundled gdpr-pii-classifier example — the same five cases, run with the skill and without it:
| Dimension | Without skill | With skill | Delta |
|---|---|---|---|
| Pass rate | 40% | 80% | +40 pts |
| Avg tokens | 3,210 | 1,840 | −43% |
| Avg turns | 8.2 | 4.6 | −44% |
| Avg duration | 22.8s | 14.2s | −38% |
| Avg tool calls | 5.4 | 3.0 | −44% |
The skill doubled the pass rate (40% to 80%) and made the agent faster and cheaper. That's exactly the kind of claim skillevaluation is built to produce.
Numbers above are illustrative of the example's shape — your real deltas depend on your agent runtime and model.
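The Delta column mixes two kinds of change: pass rate is reported in percentage points, while the other dimensions are relative changes against the without-skill arm. A minimal sketch of that arithmetic, using the illustrative numbers from the table, might look like this. The helpers `point_delta` and `relative_delta` are hypothetical names for this example, not part of skillevaluation.

```python
def point_delta(with_rate: float, without_rate: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return (with_rate - without_rate) * 100


def relative_delta(with_value: float, without_value: float) -> float:
    """Relative change versus the without-skill baseline, in percent."""
    return (with_value - without_value) / without_value * 100


# Illustrative values copied from the table above, as (without, with) pairs.
table = {
    "avg_tokens": (3210, 1840),
    "avg_turns": (8.2, 4.6),
    "avg_duration_s": (22.8, 14.2),
    "avg_tool_calls": (5.4, 3.0),
}

print(f"pass_rate: {point_delta(0.80, 0.40):+.0f} pts")
for name, (without, with_) in table.items():
    print(f"{name}: {relative_delta(with_, without):+.0f}%")
```

Keeping the two units apart matters: a move from 40% to 80% is +40 points but a +100% relative change, and quoting one as the other overstates or understates the lift.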
How it works
- Write eval.yaml next to your SKILL.md: a handful of declarative cases (a prompt, plain-English expectations, and optional shell validators).
- A runner executes each case twice: once with the skill loaded (the with arm), once without (the without arm).
- You get measured deltas: each case is classified (flip_to_pass, pass_kept, …) and aggregated into per-dimension lift.
                 ┌─ with skill ────▶ pass? + metrics ─┐
each case ──────▶┤                                    ├──▶ outcome ──▶ aggregate deltas
                 └─ without skill ─▶ pass? + metrics ─┘
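The outcome step in the diagram can be pictured as a small truth table over the two arms. The sketch below is an illustration only: the `classify` function is hypothetical, the labels are the ones this README lists, and both the mapping for each label and the use of `None` for an arm that produced no verdict are assumptions inferred from the label names. The library's own `classify_outcome` is the authority.

```python
from typing import Optional


def classify(with_passed: Optional[bool], without_passed: Optional[bool]) -> str:
    """Map the two arms' pass/fail results to an outcome label.

    None stands for an arm that produced no verdict (an assumption of this sketch).
    """
    if with_passed is None or without_passed is None:
        return "error"
    if with_passed and not without_passed:
        return "flip_to_pass"  # the skill turned a failure into a pass
    if with_passed and without_passed:
        return "pass_kept"  # passed either way
    if not with_passed and not without_passed:
        return "fail_kept"  # failed either way
    return "flip_to_fail"  # passed without the skill, failed with it


for arms in [(True, False), (True, True), (False, False), (False, True), (None, True)]:
    print(arms, "->", classify(*arms))
```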
Quickstart
pip install skillevaluation
1. Describe what "better" means. Drop an eval.yaml beside your SKILL.md:
# eval.yaml
cases:
  - name: tracks_with_id
    prompt: "Classify these schema fields and write JSON to /workspace/output.json: email, ip_address, name, age."
    expectations:
      - "The response classifies email as PII"
      - "The response identifies ip_address as pseudonymous (not PII)"
    validators:
      - cmd: "jq -e '.email.category == \"PII\"' /workspace/output.json"
        label: "email categorized as PII"
See the full five-case suite in examples/gdpr-pii-classifier/eval.yaml.
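A validator pairs a shell command with a label. One plausible way for a runner to evaluate it, assuming exit status 0 means the check passed (which is how `jq -e` reports a truthy result), is sketched below. The `run_validator` helper, its timeout, and the returned dict are hypothetical; spec/runner-contract.md defines what a conforming runner actually does.

```python
import subprocess


def run_validator(cmd: str, label: str, timeout_s: float = 30.0) -> dict:
    """Run one shell validator and report whether it exited with status 0."""
    try:
        proc = subprocess.run(
            cmd, shell=True, capture_output=True, text=True, timeout=timeout_s
        )
        passed = proc.returncode == 0
    except subprocess.TimeoutExpired:
        passed = False  # this sketch treats a hung validator as a failure
    return {"label": label, "passed": passed}


# Stand-in commands so the sketch runs without jq or a workspace file.
print(run_validator("exit 0", "always passes"))
print(run_validator("exit 1", "always fails"))
```

Shell validators give a deterministic check alongside the plain-English expectations, which need a judge to score.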
2. Score your A/B results. Once you've run each case with and without the skill, feed the per-arm results to the library and get the deltas back:
from skillevaluation.outcomes import classify_outcome
from skillevaluation.aggregation import CaseResult, CaseMetrics, compute_run_aggregates

results = [
    CaseResult(
        case_name="tracks_with_id",
        outcome=classify_outcome(with_passed=True, without_passed=False),
        with_skill=CaseMetrics(passed=True, duration_ms=14200, turns=4, total_tokens=1840, tool_call_count=3),
        without_skill=CaseMetrics(passed=False, duration_ms=22800, turns=8, total_tokens=3210, tool_call_count=5),
    ),
    # ... one CaseResult per case
]

agg = compute_run_aggregates(results)
print(agg.pass_rate)  # {'with_skill': 1.0, 'without_skill': 0.0, 'delta_pts': 100.0}
print(agg.to_dict())  # full per-dimension JSON, matching the wire schema
What actually runs the agent? That part is yours to bring. This repo defines the format, the scoring, and the spec — it does not ship the harness that drives Claude Code through each case. Wire your own agent loop to the runner contract, or use a conforming runner like DecimalAI that does the A/B execution for you.
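For orientation, the shape of a minimal A/B loop might look like the following. Everything here is a hypothetical stand-in: `ArmResult`, `run_ab`, and the `run_arm` callable are invented for this sketch, and `fake_agent` replaces a real Claude Code session. A conforming runner should follow spec/runner-contract.md rather than this outline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ArmResult:
    passed: bool
    total_tokens: int


def run_ab(
    cases: List[dict],
    run_arm: Callable[[dict, bool], ArmResult],
) -> Dict[str, Tuple[ArmResult, ArmResult]]:
    """Run every case twice: with the skill loaded, then without it."""
    results = {}
    for case in cases:
        with_arm = run_arm(case, True)
        without_arm = run_arm(case, False)
        results[case["name"]] = (with_arm, without_arm)
    return results


def fake_agent(case: dict, skill_loaded: bool) -> ArmResult:
    """Stub that pretends the skill always helps; swap in a real agent loop."""
    return ArmResult(passed=skill_loaded, total_tokens=1840 if skill_loaded else 3210)


results = run_ab([{"name": "tracks_with_id", "prompt": "..."}], fake_agent)
with_arm, without_arm = results["tracks_with_id"]
print(with_arm.passed, without_arm.passed)
```

Passing the agent in as a callable keeps the loop independent of any particular runtime, which is the separation this repo draws between the spec and the harness.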
Status: v0.1.0, pre-1.0. The format is stable enough to build on, but APIs may shift before v1; changes are logged in CHANGELOG.md.
What's in the box
A typed, dependency-light Python reference implementation (only needs PyYAML):
| Module | What it does |
|---|---|
| skillevaluation.parser | Parse + strictly validate eval.yaml |
| skillevaluation.outcomes | Classify each case: flip_to_pass / pass_kept / fail_kept / flip_to_fail / error |
| skillevaluation.aggregation | Per-dimension delta math, with an honest apples-to-oranges skip rule |
| skillevaluation.baseline | Baseline-cache key derivation (skip re-running an unchanged without arm) |
| skillevaluation.trajectory.format_v1 | Canonical agent-session rendering, so different runners' LLM judges agree |
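To make the baseline-cache idea concrete: the without arm does not depend on the skill, so its result can be reused as long as its inputs are unchanged. The sketch below shows the general technique of hashing a canonical serialization of those inputs. It is not the library's derivation: the `baseline_key` function is hypothetical, and the two fields hashed here (the case definition and a model name) are placeholder assumptions for whatever skillevaluation.baseline actually covers.

```python
import hashlib
import json


def baseline_key(case: dict, model: str) -> str:
    """Hash the inputs assumed to determine a without-skill run."""
    payload = json.dumps(
        {"case": case, "model": model}, sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


case = {"name": "tracks_with_id", "prompt": "Classify these schema fields."}
reordered = dict(reversed(list(case.items())))  # same content, different key order

key_a = baseline_key(case, "model-a")
key_b = baseline_key(reordered, "model-a")
key_c = baseline_key({**case, "prompt": "Classify other fields."}, "model-a")

print(key_a == key_b)  # key order does not affect the hash
print(key_a == key_c)  # a changed prompt yields a different key
```

Sorting keys before hashing is what makes the key stable across equivalent inputs, so a cosmetic reordering of the case does not force a re-run.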
Use it as a spec, not just a library
skillevaluation is an open spec, so any tool — in any language — can produce interoperable results. If you're building your own runner, start here:
- spec/eval-yaml.md: the file format
- spec/runner-contract.md: how to execute cases A/B and aggregate
- spec/llm-judge.md: the judge input/output contract
- spec/trajectory-format.md: canonical session rendering
- schemas/: JSON Schemas for every input and output
- CONFORMANCE.md + compatibility-tests/: golden in/out pairs your implementation must reproduce
Deliberately out of scope: live traffic-split experiments, external eval-score webhooks (DeepEval/LangSmith), catalog ranking or publish-gate policy, and the exact LLM-judge prompt wording (the contract is specified; the prompt is your choice).
Composing with agentversion
A skillevaluation run produces a numeric score; its sibling spec agentversion records that score as an evaluation gate on an agent manifest:
{
  "evaluation": {
    "gates": [
      {
        "name": "skillevaluation:gdpr-pii-classifier",
        "actual_score": 0.92,
        "threshold": 0.80,
        "passed": true,
        "evaluator_ref": "skillevaluation://eval-yaml@v1,hash:abc123…"
      }
    ]
  }
}
The evaluator_ref URI scheme is defined here; agentversion treats it as opaque.
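A gate entry like the one above could be assembled from a run's score with a few lines of Python. This is a sketch under assumptions: `make_gate` is a hypothetical helper, and treating `passed` as "score is at least the threshold" is inferred from the example values rather than quoted from either spec. The `evaluator_ref` string is copied from the example as-is, elision included.

```python
import json


def make_gate(name: str, actual_score: float, threshold: float, evaluator_ref: str) -> dict:
    """Build one gate entry; `passed` is assumed to mean score >= threshold."""
    return {
        "name": name,
        "actual_score": actual_score,
        "threshold": threshold,
        "passed": actual_score >= threshold,
        "evaluator_ref": evaluator_ref,
    }


gate = make_gate(
    "skillevaluation:gdpr-pii-classifier",
    0.92,
    0.80,
    "skillevaluation://eval-yaml@v1,hash:abc123…",
)
print(json.dumps({"evaluation": {"gates": [gate]}}, indent=2, ensure_ascii=False))
```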
Contributing
Contributions are genuinely welcome — especially new conformance cases that catch an edge the golden suite misses. See CONTRIBUTING.md. Dev setup is the usual:
git clone https://github.com/decimal-labs/skillevaluation
cd skillevaluation
pip install -e ".[dev]"
pytest
License