
An open spec for A/B benchmarking skills via declarative test suites.


skillevaluation

Does your skill actually make the agent better? Prove it — with measured before/after numbers.


A skill is just a folder — a SKILL.md plus some attachments. It's easy to write one and assume it helps. skillevaluation lets you measure the help: write a small eval.yaml next to your skill, and a runner executes each test case twice — once with the skill loaded, once without — then hands you a clear A/B delta on pass rate, speed, tokens, turns, and tool calls.

No more "I think this skill is good." Now you can say "this skill lifts pass rate 40 points and cuts tokens 43%" — and back it with reproducible cases.


The payoff

Here's the bundled gdpr-pii-classifier example — the same five cases, run with the skill and without it:

Dimension        Without skill   With skill   Delta
Pass rate        40%             80%          +40 pts
Avg tokens       3,210           1,840        −43%
Avg turns        8.2             4.6          −44%
Avg duration     22.8s           14.2s        −38%
Avg tool calls   5.4             3.0          −44%

The skill doubled the pass rate (40% to 80%) and made the agent faster and cheaper on every other dimension. That's exactly the kind of claim skillevaluation is built to produce.

Numbers above are illustrative of the example's shape — your real deltas depend on your agent runtime and model.
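
The Delta column mixes two kinds of arithmetic: pass rate reads as a percentage-point difference, while the other dimensions read as relative change against the without-skill arm. Here is a minimal sketch of that arithmetic, assuming that interpretation. The helper names are hypothetical and are not part of skillevaluation.

```python
def point_delta(with_pct: float, without_pct: float) -> float:
    """Percentage-point difference between two rates given in percent."""
    return with_pct - without_pct


def relative_delta_pct(with_value: float, without_value: float) -> int:
    """Relative change versus the without-skill baseline, as a whole percent."""
    return round((with_value - without_value) / without_value * 100)


# Figures copied from the illustrative table above.
pass_rate_delta = point_delta(80, 40)
token_delta = relative_delta_pct(1840, 3210)
turn_delta = relative_delta_pct(4.6, 8.2)
duration_delta = relative_delta_pct(14.2, 22.8)
tool_call_delta = relative_delta_pct(3.0, 5.4)

print(pass_rate_delta, token_delta, turn_delta, duration_delta, tool_call_delta)
```

Keeping the two kinds of delta apart matters when quoting results: "+40 pts" and "+100%" describe the same pass-rate change.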


How it works

  1. Write eval.yaml next to your SKILL.md — a handful of declarative cases (a prompt, plain-English expectations, and optional shell validators).
  2. A runner executes each case twice — once with the skill loaded (the with arm), once without (the without arm).
  3. You get measured deltas — each case is classified (flip_to_pass, pass_kept, …) and aggregated into per-dimension lift.
                    ┌─ with skill ────▶ pass? + metrics ─┐
   each case ──────▶┤                                    ├──▶ outcome ──▶ aggregate deltas
                    └─ without skill ─▶ pass? + metrics ─┘
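
The classification step can be pictured as a small truth table over the two arms. The sketch below is a standalone illustration, not the library's implementation. The label names come from this README, but the mapping from pass/fail pairs to labels is inferred from those names, and using None to mark an arm that produced no verdict is an assumption made here.

```python
from typing import Optional


def sketch_outcome(with_passed: Optional[bool], without_passed: Optional[bool]) -> str:
    """Map the two arms' pass/fail results to an outcome label.

    None stands for an arm that did not produce a verdict (an assumption
    for this sketch; the library may signal errors differently).
    """
    if with_passed is None or without_passed is None:
        return "error"
    if with_passed and not without_passed:
        return "flip_to_pass"   # the skill turned a failure into a pass
    if with_passed and without_passed:
        return "pass_kept"      # passed either way; the skill did not hurt
    if not with_passed and not without_passed:
        return "fail_kept"      # failed either way; the skill did not help
    return "flip_to_fail"       # the skill turned a pass into a failure


print(sketch_outcome(True, False))
```

Under this reading, only flip_to_pass and flip_to_fail move the pass-rate delta. The two "kept" outcomes still contribute to the token, turn, and duration comparisons.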

Quickstart

pip install skillevaluation

1. Describe what "better" means. Drop an eval.yaml beside your SKILL.md:

# eval.yaml
cases:
  - name: tracks_with_id
    prompt: "Classify these schema fields and write JSON to /workspace/output.json: email, ip_address, name, age."
    expectations:
      - "The response classifies email as PII"
      - "The response identifies ip_address as pseudonymous (not PII)"
    validators:
      - cmd: "jq -e '.email.category == \"PII\"' /workspace/output.json"
        label: "email categorized as PII"

See the full five-case suite in examples/gdpr-pii-classifier/eval.yaml.
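
The validator in the case above relies on jq -e, which exits non-zero when its last output is false or null. Here is a rough sketch of how a runner might evaluate validators, assuming a validator passes exactly when its command exits with status 0. The function name run_validators and the timeout default are hypothetical; the spec may define additional rules.

```python
import subprocess
from typing import Dict, List


def run_validators(validators: List[Dict[str, str]], timeout_s: float = 30.0) -> Dict[str, bool]:
    """Run each validator's shell command and report pass/fail keyed by label."""
    results: Dict[str, bool] = {}
    for validator in validators:
        try:
            completed = subprocess.run(
                validator["cmd"],
                shell=True,           # validators are shell snippets, as in eval.yaml
                capture_output=True,  # keep validator output off the console
                timeout=timeout_s,
            )
            passed = completed.returncode == 0
        except subprocess.TimeoutExpired:
            passed = False            # treat a hung validator as a failure
        # Fall back to the command text when no label is given.
        results[validator.get("label", validator["cmd"])] = passed
    return results


demo = run_validators([
    {"cmd": "exit 0", "label": "always passes"},
    {"cmd": "exit 1", "label": "always fails"},
])
print(demo)
```

Because validators run through a shell, a runner built this way should only execute eval.yaml files it trusts, ideally inside the same sandbox as the agent's workspace.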

2. Score your A/B results. Once you've run each case with and without the skill, feed the per-arm results to the library and get the deltas back:

from skillevaluation.outcomes import classify_outcome
from skillevaluation.aggregation import CaseResult, CaseMetrics, compute_run_aggregates

results = [
    CaseResult(
        case_name="tracks_with_id",
        outcome=classify_outcome(with_passed=True, without_passed=False),
        with_skill=CaseMetrics(passed=True,  duration_ms=14200, turns=4, total_tokens=1840, tool_call_count=3),
        without_skill=CaseMetrics(passed=False, duration_ms=22800, turns=8, total_tokens=3210, tool_call_count=5),
    ),
    # ... one CaseResult per case
]

agg = compute_run_aggregates(results)
print(agg.pass_rate)   # {'with_skill': 1.0, 'without_skill': 0.0, 'delta_pts': 100.0}
print(agg.to_dict())   # full per-dimension JSON, matching the wire schema

What actually runs the agent? That part is yours to bring. This repo defines the format, the scoring, and the spec — it does not ship the harness that drives the agent through each case. Wire your own agent loop to the runner contract, or use a conforming runner like DecimalAI that does the A/B execution for you.
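
For a sense of what a home-grown harness has to do, here is a minimal sketch of the A/B loop. Everything in it is hypothetical: the ArmResult and PairedResult shapes, the run_agent signature, and the stub agent are stand-ins. The real runner contract is defined by the spec, not by this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ArmResult:
    """What one arm of one case produced (hypothetical shape)."""
    passed: bool
    total_tokens: int


@dataclass
class PairedResult:
    """Both arms of a single case, kept side by side for scoring."""
    case_name: str
    with_skill: ArmResult
    without_skill: ArmResult


# Hypothetical agent signature: (prompt, skill_loaded) -> ArmResult
AgentFn = Callable[[str, bool], ArmResult]


def run_ab(cases: List[Dict[str, str]], run_agent: AgentFn) -> List[PairedResult]:
    """Run every case twice, with and without the skill, and pair the results."""
    paired = []
    for case in cases:
        with_arm = run_agent(case["prompt"], True)
        without_arm = run_agent(case["prompt"], False)
        paired.append(PairedResult(case["name"], with_arm, without_arm))
    return paired


def fake_agent(prompt: str, skill_loaded: bool) -> ArmResult:
    """Stand-in for a real agent: passes only when the skill is loaded."""
    return ArmResult(passed=skill_loaded, total_tokens=1840 if skill_loaded else 3210)


pairs = run_ab([{"name": "tracks_with_id", "prompt": "Classify these schema fields."}], fake_agent)
print(pairs[0].with_skill.passed, pairs[0].without_skill.passed)
```

Each PairedResult carries what is needed to build one CaseResult for the scoring step shown earlier.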

Status: v0.1.1, pre-1.0. The format is stable enough to build on, but APIs may shift before v1 — changes are logged in CHANGELOG.md.


What's in the box

A typed, dependency-light Python reference implementation (only needs PyYAML):

Module                                 What it does
skillevaluation.parser                 Parse + strictly validate eval.yaml
skillevaluation.outcomes               Classify each case: flip_to_pass / pass_kept / fail_kept / flip_to_fail / error
skillevaluation.aggregation            Per-dimension delta math, with an honest apples-to-oranges skip rule
skillevaluation.baseline               Baseline-cache key derivation (skip re-running an unchanged without arm)
skillevaluation.trajectory.format_v1   Canonical agent-session rendering, so different runners' LLM judges agree
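
The baseline cache rests on one observation: the without arm never sees the skill, so its result can be reused until something that affects it changes. Here is a sketch of one way to derive such a key, assuming the relevant inputs are the case prompt, the model, and a runtime version. The inputs skillevaluation.baseline actually hashes may differ; the function name and fields below are hypothetical.

```python
import hashlib
import json


def sketch_baseline_key(prompt: str, model: str, runtime_version: str) -> str:
    """Derive a stable cache key from the inputs that shape the without arm."""
    # Canonical JSON (sorted keys, fixed separators) so that equal inputs
    # always serialize to the same bytes.
    payload = json.dumps(
        {"prompt": prompt, "model": model, "runtime_version": runtime_version},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = sketch_baseline_key("Classify these fields.", "example-model", "1.0")
key_b = sketch_baseline_key("Classify these fields.", "example-model", "1.0")
key_c = sketch_baseline_key("Classify these fields!", "example-model", "1.0")
print(key_a == key_b, key_a == key_c)
```

Skill contents are deliberately left out of the key in this sketch: editing SKILL.md should invalidate only the with arm.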

Use it as a spec, not just a library

skillevaluation is an open spec, so any tool, in any language, can produce interoperable results. If you're building your own runner, start with the spec in this repo.

Deliberately out of scope: live traffic-split experiments, external eval-score webhooks (DeepEval/LangSmith), catalog ranking or publish-gate policy, and the exact LLM-judge prompt wording (the contract is specified; the prompt is your choice).


Contributing

Contributions are genuinely welcome — especially new conformance cases that catch an edge the golden suite misses. See CONTRIBUTING.md. Dev setup is the usual:

git clone https://github.com/decimal-labs/skillevaluation
cd skillevaluation
pip install -e ".[dev]"
pytest

License

Apache 2.0.
