An open spec for A/B benchmarking skills via declarative test suites.

skillevaluation

Does your skill actually make the agent better? Prove it — with measured before/after numbers.

A skill is just a folder — a SKILL.md plus some attachments. It's easy to write one and assume it helps. skillevaluation lets you measure the help: write a small eval.yaml next to your skill, and a runner executes each test case twice — once with the skill loaded, once without — then hands you a clear A/B delta on pass rate, speed, tokens, turns, and tool calls.

No more "I think this skill is good." Now you can say "this skill lifts pass rate 40 points and cuts tokens 43%" — and back it with reproducible cases.

The payoff

Here's the bundled gdpr-pii-classifier example — the same five cases, run with the skill and without it:

Dimension	Without skill	With skill	Delta
Pass rate	40%	80%	+40 pts
Avg tokens	3,210	1,840	−43%
Avg turns	8.2	4.6	−44%
Avg duration	22.8s	14.2s	−38%
Avg tool calls	5.4	3.0	−44%

The skill doubled the pass rate and made the agent faster and cheaper. That's exactly the kind of claim skillevaluation is built to produce.

Numbers above are illustrative of the example's shape — your real deltas depend on your agent runtime and model.
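
The Delta column is plain arithmetic: pass rate is compared in percentage points, and the other dimensions as a relative change against the without-skill arm. Here is a minimal sketch that reproduces the column from the table's illustrative numbers. The helper names are hypothetical and not part of the skillevaluation API, and the relative-change convention is inferred from the table rather than taken from the spec.

```python
def delta_points(without_pct: float, with_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return with_pct - without_pct


def relative_change_pct(without_value: float, with_value: float) -> float:
    """Relative change of the with-skill arm against the without-skill baseline."""
    return (with_value - without_value) / without_value * 100


# Illustrative numbers copied from the table above: (without skill, with skill).
table = {
    "Avg tokens": (3210, 1840),
    "Avg turns": (8.2, 4.6),
    "Avg duration (s)": (22.8, 14.2),
    "Avg tool calls": (5.4, 3.0),
}

print(f"Pass rate: {delta_points(40, 80):+.0f} pts")
for name, (without_value, with_value) in table.items():
    print(f"{name}: {relative_change_pct(without_value, with_value):+.0f}%")
```

Keeping points and percentages apart matters: a move from 40% to 80% is +40 points, but a 100% relative increase.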

How it works

1. Write eval.yaml next to your SKILL.md: a handful of declarative cases (a prompt, plain-English expectations, and optional shell validators).
2. A runner executes each case twice: once with the skill loaded (the with arm), once without (the without arm).
3. You get measured deltas: each case is classified (flip_to_pass, pass_kept, …) and aggregated into per-dimension lift.

                    ┌─ with skill ────▶ pass? + metrics ─┐
   each case ──────▶┤                                    ├──▶ outcome ──▶ aggregate deltas
                    └─ without skill ─▶ pass? + metrics ─┘
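
The outcome step in the diagram can be pictured as a small truth table over the two arms. The sketch below is an illustration only: `sketch_classify` is a hypothetical stand-in, not the library's `classify_outcome`. The meaning of each label is inferred from its name, and using `None` for an arm that produced no verdict is an assumption.

```python
from typing import Optional


def sketch_classify(with_passed: Optional[bool], without_passed: Optional[bool]) -> str:
    """Map the two arms' pass/fail results to an outcome label.

    None stands for an arm that did not produce a verdict (assumption).
    """
    if with_passed is None or without_passed is None:
        return "error"
    if with_passed and not without_passed:
        return "flip_to_pass"   # the skill turned a failure into a pass
    if with_passed and without_passed:
        return "pass_kept"      # passed either way
    if not with_passed and not without_passed:
        return "fail_kept"      # failed either way
    return "flip_to_fail"       # the skill turned a pass into a failure


for arms in [(True, False), (True, True), (False, False), (False, True), (None, True)]:
    print(arms, "->", sketch_classify(*arms))
```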

Quickstart

pip install skillevaluation

1. Describe what "better" means. Drop an eval.yaml beside your SKILL.md:

# eval.yaml
cases:
  - name: tracks_with_id
    prompt: "Classify these schema fields and write JSON to /workspace/output.json: email, ip_address, name, age."
    expectations:
      - "The response classifies email as PII"
      - "The response identifies ip_address as pseudonymous (not PII)"
    validators:
      - cmd: "jq -e '.email.category == \"PII\"' /workspace/output.json"
        label: "email categorized as PII"

See the full five-case suite in examples/gdpr-pii-classifier/eval.yaml.
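
To make the `validators` entries concrete, here is a rough sketch of how a runner might execute them. It assumes a validator passes when its shell command exits with status 0, which is how `jq -e` signals a true result. The authoritative rules live in the spec, and `run_validators` is a hypothetical helper, not part of the package.

```python
import subprocess


def run_validators(validators: list[dict]) -> list[tuple[str, bool]]:
    """Run each validator command in a shell; exit code 0 counts as a pass (assumption)."""
    results = []
    for validator in validators:
        completed = subprocess.run(
            validator["cmd"],
            shell=True,            # validators are written as shell command strings
            capture_output=True,   # keep command output out of the report
            text=True,
        )
        results.append((validator["label"], completed.returncode == 0))
    return results


# Stand-in commands so the sketch runs without jq or /workspace/output.json.
demo = [
    {"cmd": "exit 0", "label": "always passes"},
    {"cmd": "exit 1", "label": "always fails"},
]
for label, ok in run_validators(demo):
    print(f"{'PASS' if ok else 'FAIL'}  {label}")
```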

2. Score your A/B results. Once you've run each case with and without the skill, feed the per-arm results to the library and get the deltas back:

from skillevaluation.outcomes import classify_outcome
from skillevaluation.aggregation import CaseResult, CaseMetrics, compute_run_aggregates

results = [
    CaseResult(
        case_name="tracks_with_id",
        outcome=classify_outcome(with_passed=True, without_passed=False),
        with_skill=CaseMetrics(passed=True,  duration_ms=14200, turns=4, total_tokens=1840, tool_call_count=3),
        without_skill=CaseMetrics(passed=False, duration_ms=22800, turns=8, total_tokens=3210, tool_call_count=5),
    ),
    # ... one CaseResult per case
]

agg = compute_run_aggregates(results)
print(agg.pass_rate)   # with only the single case above: {'with_skill': 1.0, 'without_skill': 0.0, 'delta_pts': 100.0}
print(agg.to_dict())   # full per-dimension JSON, matching the wire schema
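
For intuition about where a pass-rate delta comes from, the following sketch derives per-arm pass rates from a list of outcome labels. It is not the library's implementation. `sketch_pass_rate` is hypothetical, the label semantics are inferred from their names, and leaving `error` cases out of the denominator is an assumption.

```python
# Which arms passed under each outcome label (inferred from the label names).
ARM_PASSED = {
    "flip_to_pass": {"with_skill": True, "without_skill": False},
    "pass_kept": {"with_skill": True, "without_skill": True},
    "fail_kept": {"with_skill": False, "without_skill": False},
    "flip_to_fail": {"with_skill": False, "without_skill": True},
}


def sketch_pass_rate(outcomes: list[str]) -> dict[str, float]:
    """Per-arm pass rate (0..1) and the delta in percentage points."""
    scored = [o for o in outcomes if o in ARM_PASSED]  # drop "error" cases (assumption)
    if not scored:
        raise ValueError("no scorable cases")
    total = len(scored)
    with_passes = sum(ARM_PASSED[o]["with_skill"] for o in scored)
    without_passes = sum(ARM_PASSED[o]["without_skill"] for o in scored)
    return {
        "with_skill": with_passes / total,
        "without_skill": without_passes / total,
        "delta_pts": 100 * (with_passes - without_passes) / total,
    }


# Hypothetical five-case breakdown that would yield a 40% -> 80% pass rate.
outcomes = ["flip_to_pass", "flip_to_pass", "pass_kept", "pass_kept", "fail_kept"]
print(sketch_pass_rate(outcomes))
```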

What actually runs the agent? That part is yours to bring. This repo defines the format, the scoring, and the spec — it does not ship the harness that drives the agent through each case. Wire your own agent loop to the runner contract, or use a conforming runner like DecimalAI that does the A/B execution for you.
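
If you wire your own agent loop, the core of the A/B execution is small. The sketch below shows only its shape. The `AgentFn` signature and `run_ab` are hypothetical, and a real runner following spec/runner-contract.md would also collect metrics and apply expectations and validators to decide pass or fail.

```python
from typing import Callable

# Hypothetical agent signature: (prompt, skill_loaded) -> did the case pass?
AgentFn = Callable[[str, bool], bool]


def run_ab(prompts: dict[str, str], run_agent: AgentFn) -> dict[str, dict[str, bool]]:
    """Run every case twice: once with the skill loaded, once without."""
    results = {}
    for case_name, prompt in prompts.items():
        results[case_name] = {
            "with_skill": run_agent(prompt, True),
            "without_skill": run_agent(prompt, False),
        }
    return results


def fake_agent(prompt: str, skill_loaded: bool) -> bool:
    """Stand-in for a real agent loop: passes only when the skill is loaded."""
    return skill_loaded


print(run_ab({"tracks_with_id": "Classify these schema fields and write JSON"}, fake_agent))
```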

Status: v0.1.1, pre-1.0. The format is stable enough to build on, but APIs may shift before v1 — changes are logged in CHANGELOG.md.

What's in the box

A typed, dependency-light Python reference implementation (only needs PyYAML):

Module	What it does
`skillevaluation.parser`	Parse + strictly validate `eval.yaml`
`skillevaluation.outcomes`	Classify each case: `flip_to_pass` / `pass_kept` / `fail_kept` / `flip_to_fail` / `error`
`skillevaluation.aggregation`	Per-dimension delta math, with an honest apples-to-oranges skip rule
`skillevaluation.baseline`	Baseline-cache key derivation (skip re-running an unchanged without arm)
`skillevaluation.trajectory.format_v1`	Canonical agent-session rendering, so different runners' LLM judges agree
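
The baseline cache is easiest to picture as a content hash. If nothing the without arm depends on has changed, the key is the same and the earlier result can be reused. The sketch below is a hypothetical illustration, not the derivation used by `skillevaluation.baseline`. Which fields feed the key (here a case definition and some runtime settings) is an assumption, and the skill's own files are left out on the reasoning that the without arm never loads them.

```python
import hashlib
import json


def sketch_baseline_key(case: dict, runtime: dict) -> str:
    """SHA-256 over a canonical JSON encoding of what the without arm depends on."""
    payload = json.dumps(
        {"case": case, "runtime": runtime},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # no whitespace variation
        ensure_ascii=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


case = {"name": "tracks_with_id", "prompt": "Classify these schema fields"}
runtime = {"model": "example-model", "runner_version": "0.0.0"}  # hypothetical fields

key = sketch_baseline_key(case, runtime)
reordered = dict(reversed(list(case.items())))
print(key == sketch_baseline_key(reordered, runtime))                     # same content, same key
print(key == sketch_baseline_key({**case, "prompt": "changed"}, runtime))  # changed prompt, new key
```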

Use it as a spec, not just a library

skillevaluation is an open spec, so any tool — in any language — can produce interoperable results. If you're building your own runner, start here:

spec/eval-yaml.md — the file format
spec/runner-contract.md — how to execute cases A/B and aggregate
spec/llm-judge.md — the judge input/output contract
spec/trajectory-format.md — canonical session rendering
schemas/ — JSON Schemas for every input and output
CONFORMANCE.md + compatibility-tests/ — golden in/out pairs your implementation must reproduce

Deliberately out of scope: live traffic-split experiments, external eval-score webhooks (DeepEval/LangSmith), catalog ranking or publish-gate policy, and the exact LLM-judge prompt wording (the contract is specified; the prompt is your choice).

Contributing

Contributions are genuinely welcome — especially new conformance cases that catch an edge the golden suite misses. See CONTRIBUTING.md. Dev setup is the usual:

git clone https://github.com/decimal-labs/skillevaluation
cd skillevaluation
pip install -e ".[dev]"
pytest

License

Apache 2.0.

Release history

0.1.1 (this version): May 30, 2026
0.1.0: May 30, 2026

Download files

Source Distribution

skillevaluation-0.1.1.tar.gz (51.4 kB), uploaded May 30, 2026, Source

Built Distribution

skillevaluation-0.1.1-py3-none-any.whl (20.7 kB), uploaded May 30, 2026, Python 3

File details

Details for the file skillevaluation-0.1.1.tar.gz.

File metadata

Download URL: skillevaluation-0.1.1.tar.gz
Upload date: May 30, 2026
Size: 51.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for skillevaluation-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`506ad90a51d4793c6b64f58ddfda9ce75706a29982b5167ade010d51c0642b5e`
MD5	`12cb74a74571200f34aa37930a7dc9f1`
BLAKE2b-256	`0a66ab3ea558ad86f05141c98eda266786411d8c55a3859281c160b5f0328d42`

File details

Details for the file skillevaluation-0.1.1-py3-none-any.whl.

File metadata

Download URL: skillevaluation-0.1.1-py3-none-any.whl
Upload date: May 30, 2026
Size: 20.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for skillevaluation-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1810647aa2bc61f4621ce58344b2136f4783ce69c796d731cce288cbd3e61caf`
MD5	`d576e2750af92b583b44a73145245ae3`
BLAKE2b-256	`aa6972e8c999572847d0e6ef307cc820a282bda24cf3c4cd061e752d4d60c2db`
