
An open spec for A/B benchmarking Claude Code skills via declarative test suites.


skillevaluation

Does your Claude Code skill actually make the agent better? Prove it — with measured before/after numbers.


A Claude Code skill is just a folder — a SKILL.md plus some attachments. It's easy to write one and assume it helps. skillevaluation lets you measure the help: write a small eval.yaml next to your skill, and a runner executes each test case twice — once with the skill loaded, once without — then hands you a clear A/B delta on pass rate, speed, tokens, turns, and tool calls.

No more "I think this skill is good." Now you can say "this skill lifts pass rate 40 points and cuts tokens 43%" — and back it with reproducible cases.


The payoff

Here's the bundled gdpr-pii-classifier example — the same five cases, run with the skill and without it:

| Dimension | Without skill | With skill | Delta |
|---|---|---|---|
| Pass rate | 40% | 80% | +40 pts |
| Avg tokens | 3,210 | 1,840 | −43% |
| Avg turns | 8.2 | 4.6 | −44% |
| Avg duration | 22.8s | 14.2s | −38% |
| Avg tool calls | 5.4 | 3.0 | −44% |

The skill doubled the pass rate (40% to 80%) and made the agent faster and cheaper. That's exactly the kind of claim skillevaluation is built to produce.

Numbers above are illustrative of the example's shape — your real deltas depend on your agent runtime and model.


How it works

  1. Write eval.yaml next to your SKILL.md — a handful of declarative cases (a prompt, plain-English expectations, and optional shell validators).
  2. A runner executes each case twice — once with the skill loaded (the with arm), once without (the without arm).
  3. You get measured deltas — each case is classified (flip_to_pass, pass_kept, …) and aggregated into per-dimension lift.
                    ┌─ with skill ────▶ pass? + metrics ─┐
   each case ──────▶┤                                    ├──▶ outcome ──▶ aggregate deltas
                    └─ without skill ─▶ pass? + metrics ─┘
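
The outcome step in the diagram can be sketched as a small pure function. This is only an illustration: the outcome labels come from this README, but the function name `sketch_outcome`, the use of `None` for an arm that did not finish, and the exact mapping are assumptions inferred from the label names, not the library's implementation.

```python
from typing import Optional


def sketch_outcome(with_passed: Optional[bool], without_passed: Optional[bool]) -> str:
    """Map the two arms' pass/fail results to an outcome label.

    Hypothetical helper: None stands for an arm that did not produce a
    result, which this sketch assumes is reported as "error".
    """
    if with_passed is None or without_passed is None:
        return "error"
    if with_passed and not without_passed:
        return "flip_to_pass"   # the skill turned a failure into a pass
    if with_passed and without_passed:
        return "pass_kept"      # passed either way
    if not with_passed and not without_passed:
        return "fail_kept"      # failed either way
    return "flip_to_fail"       # the skill turned a pass into a failure


print(sketch_outcome(True, False))
```

Read this way, only the two flip outcomes change the pass-rate delta; the two kept outcomes can still contribute to the speed and token dimensions.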

Quickstart

pip install skillevaluation

1. Describe what "better" means. Drop an eval.yaml beside your SKILL.md:

# eval.yaml
cases:
  - name: tracks_with_id
    prompt: "Classify these schema fields and write JSON to /workspace/output.json: email, ip_address, name, age."
    expectations:
      - "The response classifies email as PII"
      - "The response identifies ip_address as pseudonymous (not PII)"
    validators:
      - cmd: "jq -e '.email.category == \"PII\"' /workspace/output.json"
        label: "email categorized as PII"

See the full five-case suite in examples/gdpr-pii-classifier/eval.yaml.
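
If you are wiring your own runner, the validators entries could be evaluated roughly as below. This is a sketch under stated assumptions: it treats exit status 0 as a pass (which is how `jq -e` signals a match), and the helper names `run_validator` and `run_validators` and the timeout are hypothetical, not part of skillevaluation.

```python
import subprocess


def run_validator(cmd: str, timeout_s: float = 30.0) -> bool:
    """Run one validator command through the shell; True when it exits 0.

    Assumption: a timeout counts as a failed validator.
    """
    try:
        completed = subprocess.run(
            cmd, shell=True, capture_output=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False
    return completed.returncode == 0


def run_validators(validators: list[dict]) -> dict[str, bool]:
    """Return {label: passed} for each validator, falling back to cmd as label."""
    return {v.get("label", v["cmd"]): run_validator(v["cmd"]) for v in validators}


# Stand-in commands so the sketch runs without jq or a /workspace directory.
print(run_validators([
    {"cmd": "exit 0", "label": "always passes"},
    {"cmd": "exit 1", "label": "always fails"},
]))
```

Because the commands run with `shell=True`, they should come only from eval.yaml files you trust.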

2. Score your A/B results. Once you've run each case with and without the skill, feed the per-arm results to the library and get the deltas back:

from skillevaluation.outcomes import classify_outcome
from skillevaluation.aggregation import CaseResult, CaseMetrics, compute_run_aggregates

results = [
    CaseResult(
        case_name="tracks_with_id",
        outcome=classify_outcome(with_passed=True, without_passed=False),
        with_skill=CaseMetrics(passed=True,  duration_ms=14200, turns=4, total_tokens=1840, tool_call_count=3),
        without_skill=CaseMetrics(passed=False, duration_ms=22800, turns=8, total_tokens=3210, tool_call_count=5),
    ),
    # ... one CaseResult per case
]

agg = compute_run_aggregates(results)
print(agg.pass_rate)   # {'with_skill': 1.0, 'without_skill': 0.0, 'delta_pts': 100.0}
print(agg.to_dict())   # full per-dimension JSON, matching the wire schema
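
For intuition about the numbers, the delta arithmetic can be reproduced by hand. The two helpers below are hypothetical and independent of the library; they assume pass-rate lift is reported in percentage points and the other dimensions as percent change relative to the without arm, which matches the figures in the payoff table.

```python
def pass_rate_delta_pts(with_passes: int, without_passes: int, total: int) -> float:
    """Pass-rate lift in percentage points (with arm minus without arm)."""
    return 100.0 * (with_passes - without_passes) / total


def pct_change(without_value: float, with_value: float) -> float:
    """Relative change of an averaged metric, as a percentage of the without arm."""
    return 100.0 * (with_value - without_value) / without_value


# Five cases: 2 pass without the skill (40%), 4 pass with it (80%).
print(pass_rate_delta_pts(with_passes=4, without_passes=2, total=5))
# Average tokens from the payoff table: 3,210 without, 1,840 with.
print(round(pct_change(3210, 1840)))
```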

What actually runs the agent? That part is yours to bring. This repo defines the format, the scoring, and the spec — it does not ship the harness that drives Claude Code through each case. Wire your own agent loop to the runner contract, or use a conforming runner like DecimalAI that does the A/B execution for you.

Status: v0.1.0, pre-1.0. The format is stable enough to build on, but APIs may shift before v1 — changes are logged in CHANGELOG.md.


What's in the box

A typed, dependency-light Python reference implementation (only needs PyYAML):

Module What it does
skillevaluation.parser Parse + strictly validate eval.yaml
skillevaluation.outcomes Classify each case: flip_to_pass / pass_kept / fail_kept / flip_to_fail / error
skillevaluation.aggregation Per-dimension delta math, with an honest apples-to-oranges skip rule
skillevaluation.baseline Baseline-cache key derivation (skip re-running an unchanged without arm)
skillevaluation.trajectory.format_v1 Canonical agent-session rendering, so different runners' LLM judges agree
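
To illustrate the idea behind baseline caching, here is a sketch of a deterministic cache key. The function name and the inputs hashed here (prompt, model, runner version) are assumptions for illustration only; the actual derivation is whatever skillevaluation.baseline and the spec define.

```python
import hashlib
import json


def sketch_baseline_key(prompt: str, model: str, runner_version: str) -> str:
    """Derive a stable key for a without-arm run.

    Hypothetical: serialize the inputs as canonical JSON (sorted keys,
    fixed separators) so equal inputs always hash to the same digest.
    """
    payload = json.dumps(
        {"prompt": prompt, "model": model, "runner_version": runner_version},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = sketch_baseline_key("Classify these schema fields", "example-model", "0.1.0")
print(key[:12])
```

If none of the hashed inputs change, the key is unchanged and a runner could reuse the cached without-arm result instead of re-running it.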

Use it as a spec, not just a library

skillevaluation is an open spec, so any tool — in any language — can produce interoperable results. If you're building your own runner, start here:

Deliberately out of scope: live traffic-split experiments, external eval-score webhooks (DeepEval/LangSmith), catalog ranking or publish-gate policy, and the exact LLM-judge prompt wording (the contract is specified; the prompt is your choice).


Composing with agentversion

A skillevaluation run produces a numeric score; its sibling spec agentversion records that score as an evaluation gate on an agent manifest:

{
  "evaluation": {
    "gates": [
      {
        "name": "skillevaluation:gdpr-pii-classifier",
        "actual_score": 0.92,
        "threshold": 0.80,
        "passed": true,
        "evaluator_ref": "skillevaluation://eval-yaml@v1,hash:abc123…"
      }
    ]
  }
}

The evaluator_ref URI scheme is defined here; agentversion treats it as opaque.
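
A gate entry like the one above could be assembled from a run's score as follows. This is a sketch: `make_gate` is a hypothetical helper, and the rule that a gate passes when the score is at least the threshold is an assumption consistent with the example (0.92 against 0.80), not something this README specifies.

```python
import json


def make_gate(name: str, actual_score: float, threshold: float, evaluator_ref: str) -> dict:
    """Build one evaluation-gate entry in the shape shown above.

    Assumption: a gate passes when actual_score >= threshold.
    """
    return {
        "name": name,
        "actual_score": actual_score,
        "threshold": threshold,
        "passed": actual_score >= threshold,
        "evaluator_ref": evaluator_ref,
    }


gate = make_gate(
    name="skillevaluation:gdpr-pii-classifier",
    actual_score=0.92,
    threshold=0.80,
    # Placeholder reference; build the real URI per the evaluator_ref scheme.
    evaluator_ref="skillevaluation://<your-evaluator-ref>",
)
print(json.dumps({"evaluation": {"gates": [gate]}}, indent=2))
```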


Contributing

Contributions are genuinely welcome — especially new conformance cases that catch an edge the golden suite misses. See CONTRIBUTING.md. Dev setup is the usual:

git clone https://github.com/decimal-labs/skillevaluation
cd skillevaluation
pip install -e ".[dev]"
pytest

License

Apache 2.0.
