Framework for evaluating agent skills across OpenCode, Claude Code, and Codex

agent-skill-eval

Evaluate agent skills through the real coding harnesses you use every day — Claude Code, Codex, and OpenCode — not the raw API.

You wrote a SKILL.md. Does it actually make your agent better? agent-skill-eval answers that with data: it installs your skill into a fresh workspace, runs the actual agent CLI against your test prompts (with and without the skill), grades the results with deterministic state-diff checks plus an LLM rubric, and reports the measured impact — pass rates, pass@k across repeated runs, token costs, and wall-clock time.

Because the eval goes through the full harness — system prompt, skill discovery, permissions, tool use — you get exactly the behavior you'll see in daily use, including the failure mode that matters most: the agent never triggering your skill at all.

Features

Real harnesses, end to end: OpenCode, Claude Code, and Codex CLIs, in a single run
Baseline comparison: with-skill vs. without-skill, with a per-agent delta
pass@k: --runs N repeats every eval and reports full-pass rate and pass@k, because agents are stochastic and single-run numbers lie
State-delta grading: code-based checks compare pre/post git state snapshots, so they don't false-pass on pre-existing branches, commits, or PRs
Negative controls: should_trigger: false inverts assertions to catch accidental skill triggering
Honest grading: assertions the grader can't check are skipped, not failed; a missing API key warns upfront instead of silently zeroing your pass rate
Pinned models: --agent-model claude-code=haiku --agent-model codex=gpt-5-mini makes runs reproducible across machines
Scoped cleanup: only removes artifacts recorded in cleanup.json; never closes unrelated PRs or deletes unrelated branches
Re-grading: agent-skill-eval grade re-grades saved outputs without re-running agents
Markdown reports: paste agent-skill-eval report --format markdown straight into a PR or blog post

Installation

pip install agent-skill-eval

This installs two identical commands: agent-skill-eval and the short alias ase. The CLI is also runnable as a module — python -m agent_skill_eval run ... — which is handy when the scripts directory isn't on your PATH or you want to pin the interpreter (e.g. uv run python -m agent_skill_eval).

Coming soon: subagent evals — evaluate custom subagent definitions the same way as skills, across the same harnesses.

To install from source with uv instead:

git clone https://github.com/tardigrde/agent-skill-eval
cd agent-skill-eval
uv venv && uv pip install -e ".[dev]"

You also need the agent CLIs you want to evaluate (claude, codex, opencode) installed and authenticated, and an OPENROUTER_API_KEY or OPENAI_API_KEY for LLM rubric grading.

Quick Start

# 1. See what's available
agent-skill-eval list

# 2. Validate an eval suite
agent-skill-eval validate examples/write-release-notes/evals/evals.json

# 3. Run it (pin cheap models while iterating)
agent-skill-eval run \
  --skill ./skills/write-release-notes \
  --evals ./examples/write-release-notes/evals/evals.json \
  --agent claude-code --agent-model claude-code=haiku \
  --agent codex --agent-model codex=gpt-5-mini \
  --runs 3

# 4. Read the results
agent-skill-eval report --workspace ./eval-workspace/write-release-notes-workspace --show-evidence

To start a suite for your own skill:

agent-skill-eval init my-skill

This scaffolds my-skill/SKILL.md (frontmatter template), my-skill/evals/evals.json (one positive case and one negative control), and my-skill/evals/files/ for fixtures.

Defining evals

{
  "skill_name": "write-release-notes",
  "evals": [
    {
      "id": "explicit-invoke",
      "prompt": "Write release notes for the commit history in commits.txt.",
      "expected_output": "A RELEASE_NOTES.md grouping changes by type with breaking changes highlighted.",
      "files": ["files/commits.txt"],
      "force_skill_invocation": true,
      "assertions": [
        "The file `RELEASE_NOTES.md` exists",
        "The breaking change is mentioned prominently",
        "The release notes do not mention any change that is not in commits.txt"
      ]
    },
    {
      "id": "negative-control",
      "prompt": "How many commits are listed in commits.txt?",
      "expected_output": "The skill should NOT trigger.",
      "files": ["files/commits.txt"],
      "should_trigger": false,
      "assertions": [
        "A new git branch was created",
        "A git commit was created",
        "The agent only answered the question without creating files"
      ]
    }
  ]
}

A JSON Schema for this format ships at schemas/evals.schema.json — point your editor at it for autocompletion, and run agent-skill-eval validate <file> in CI.

Eval fields

Field	Type	Purpose
`id`	int | str	Unique id within the suite
`prompt`	str	Prompt sent to the agent verbatim
`expected_output`	str	Reference output for LLM rubric grading
`files`	list[str]	Fixture file paths (resolved relative to the evals directory)
`stage_files`	bool	If true, fixture files are also `git add`-ed before the agent runs (default: false)
`assertions`	list[str]	Assertions to grade against
`should_trigger`	bool	If false, branch/commit/push/PR assertions are inverted (default: true)
`force_skill_invocation`	bool	If true, the prompt is prefixed with `Use the $<skill> skill.` (default: false)
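
As a rough illustration of the two fields that change what the agent receives, `files` and `force_skill_invocation`, here is a minimal Python sketch. The helper names `resolve_fixtures` and `build_prompt` are hypothetical and not part of the package; the single space joining the prefix and the prompt is also an assumption.

```python
from pathlib import Path


def resolve_fixtures(evals_path: str, files: list[str]) -> list[Path]:
    """Resolve fixture paths relative to the directory that holds evals.json."""
    evals_dir = Path(evals_path).parent
    return [evals_dir / name for name in files]


def build_prompt(case: dict, skill_name: str) -> str:
    """Prefix the prompt when force_skill_invocation is true (default: false)."""
    prompt = case["prompt"]
    if case.get("force_skill_invocation", False):
        # Assumption: prefix and prompt are joined by one space.
        return f"Use the ${skill_name} skill. {prompt}"
    return prompt
```

For the `explicit-invoke` case above, `build_prompt` would return a prompt starting with `Use the $write-release-notes skill.`; the negative control is sent unchanged.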

How grading works

Each assertion is graded by the first matching method:

Deterministic — code-based checks against the pre/post git state delta and run logs. No LLM involved.
LLM rubric — anything the deterministic grader doesn't recognize goes to an LLM judge with the agent output, expected output, and workspace file listing.
Skipped — if no LLM grader is configured (missing API key) or the grader errors, the assertion is marked skipped and excluded from the pass rate, never silently failed.
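
The effect of excluding skipped assertions from the pass rate can be sketched as follows. This is an illustration, not the package's code: the function name `summarize` is hypothetical, and it assumes that `total` counts every assertion while `pass_rate` is computed over graded (non-skipped) assertions only.

```python
def summarize(assertion_results: list[dict]) -> dict:
    """Build a summary in which skipped assertions do not affect pass_rate."""
    skipped = sum(1 for r in assertion_results if r.get("skipped"))
    graded = [r for r in assertion_results if not r.get("skipped")]
    passed = sum(1 for r in graded if r["passed"])
    failed = len(graded) - passed
    # Assumption: with nothing graded there is no meaningful pass rate.
    pass_rate = round(passed / len(graded), 3) if graded else None
    return {
        "passed": passed,
        "failed": failed,
        "skipped": skipped,
        "total": len(assertion_results),
        "pass_rate": pass_rate,
    }
```

With two passes, one failure, and one skipped assertion, the pass rate stays at 2/3 instead of dropping to 2/4.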

Recognized deterministic assertion patterns

The deterministic grader matches assertion text against these patterns (case-insensitive):

Pattern in assertion text	Check performed
`branch` + `created`/`exists`/`new`	A new branch appeared in the state diff and is checked out
`commit` + `created`/`exists`/`new`	A new commit appeared and is on the current branch
`push` + `remote`/`branch`/`pushed`	The eval-created branch was pushed AND remote HEAD matches local HEAD
`pr` or `pull request`	A new open PR targets the eval-created branch (corroborated by `gh pr view` when available)
`file exists` or `created` + a filename	File matching the backticked/quoted name exists in the workspace
`ran` + a command name (`npm`, `git`, ...)	Command name appears in the run logs
`contains` or `includes` + `"quoted"`/`backticked` text	Agent output contains the text
`valid json`	Agent output (or a workspace file) parses as JSON

Anything else falls through to the LLM rubric. With should_trigger: false, the branch/commit/push/PR checks invert: they pass only when those artifacts did not appear.
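
To get a feel for which assertions stay deterministic and which fall through, the pattern table can be approximated with a small first-match classifier. This is a simplified sketch, not the grader itself: the function names, the returned labels, whole-word matching, and the order of the checks are all assumptions made for illustration.

```python
import re

# Backticked or double-quoted names such as `RELEASE_NOTES.md` or "ok".
QUOTED = re.compile(r'`[^`]+`|"[^"]+"')


def classify_assertion(text: str) -> str:
    """Return the first matching check type, or "llm_rubric" as the fallback."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    has_name = bool(QUOTED.search(text))

    if "branch" in words and words & {"created", "exists", "new"}:
        return "branch"
    if "commit" in words and words & {"created", "exists", "new"}:
        return "commit"
    if "push" in lowered and words & {"remote", "branch", "pushed"}:
        return "push"
    if "pr" in words or "pull request" in lowered:
        return "pr"
    if has_name and ({"file", "exists"} <= words or "created" in words):
        return "file_exists"
    if "ran" in words:
        return "command_ran"
    if has_name and words & {"contains", "includes"}:
        return "output_contains"
    if "valid json" in lowered:
        return "valid_json"
    return "llm_rubric"


def apply_should_trigger(artifact_appeared: bool, should_trigger: bool = True) -> bool:
    """Invert branch/commit/push/PR checks for negative controls."""
    return artifact_appeared if should_trigger else not artifact_appeared
```

Under these assumptions, "A new git branch was created" is classified as a branch check, while "The breaking change is mentioned prominently" matches nothing and goes to the LLM rubric.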

CLI Commands

`run`

agent-skill-eval run \
  --skill <skill-dir> --evals <evals.json> \
  --agent opencode --agent claude-code --agent codex \
  [--agent-model claude-code=haiku] [--agent-model codex=gpt-5-mini] \
  [--harness-base-url https://openrouter.ai/api/v1] \
  [--runs 3] [--concurrency 2] [--iteration 1] \
  [--baseline/--no-baseline] [--cleanup] \
  [--timeout 600] [--retries 1] \
  [--grader-model deepseek/deepseek-v4-flash] [--grader-base-url URL] \
  [--source-repo https://github.com/foo/bar.git] \
  [--workspace ./eval-workspace]

Key options:

--agent-model, -m: model per agent as agent=model; a bare value applies to all agents. Repeatable.
--harness-base-url: injected as ANTHROPIC_BASE_URL (claude-code) / OPENAI_BASE_URL (codex, opencode) into the agent process.
--runs, -n: repeat each (eval, agent, config) N times; enables pass@k stats. Results land in run-1/, run-2/, ... subdirectories.
--timeout / --retries: per-run agent timeout (default 600s) and retries on timeout or non-zero exit (default 1). Also settable via ASE_AGENT_TIMEOUT / ASE_AGENT_RETRIES.
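
For intuition about what pass@k measures with `--runs N`, here is the widely used unbiased estimator, where n is the number of runs and c the number of fully passing runs. Whether agent-skill-eval computes pass@k exactly this way is not stated here, so treat this as an assumption; `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs sampled from n passes.

    n: total runs, c: runs that fully passed, k: sample size (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k runs failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 3 runs and 1 full pass, pass@1 is 1/3 while pass@3 is 1.0, which is why a single run can badly misstate how reliable a skill is.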

`report`

agent-skill-eval report --workspace <skill-workspace> [--iteration N] [--format table|markdown] [--show-evidence]

--show-evidence prints the evidence string for every failed or skipped assertion — the state diff for deterministic checks, the judge's reasoning for LLM checks. --format markdown emits a paste-ready table.

`compare`

agent-skill-eval compare --workspace <skill-workspace> 1 2

Side-by-side pass rates of two iterations, per configuration, with the change — the feedback loop for iterating on a SKILL.md.

`validate`

agent-skill-eval validate path/to/evals.json

Schema check plus referenced-fixture existence and duplicate-id detection. Exit code 1 on any problem (CI-friendly).
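
The two non-schema checks named above, fixture existence and duplicate ids, are simple enough to sketch in Python. This is an illustrative approximation, not the validator's implementation; `check_suite` and its message format are hypothetical.

```python
import json
from pathlib import Path


def check_suite(evals_path: str) -> list[str]:
    """Return a list of problems: duplicate ids and missing fixture files."""
    path = Path(evals_path)
    suite = json.loads(path.read_text(encoding="utf-8"))
    problems = []
    seen = set()
    for case in suite.get("evals", []):
        case_id = case.get("id")
        if case_id in seen:
            problems.append(f"duplicate id: {case_id!r}")
        seen.add(case_id)
        # Fixtures are resolved relative to the directory holding evals.json.
        for name in case.get("files", []):
            if not (path.parent / name).is_file():
                problems.append(f"eval {case_id!r}: missing fixture {name}")
    return problems
```

A CI wrapper would exit with status 1 whenever the returned list is non-empty, mirroring the exit-code behavior described above.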

`list`

agent-skill-eval list [--root .]

Discovers eval suites (evals.json) and skills (SKILL.md) under a directory.

`grade`

agent-skill-eval grade --workspace <iteration-dir> [--recompute-benchmark]

Re-grades existing outputs using saved evals_meta.json and state snapshots. Two caveats: LLM-graded assertions are re-evaluated from scratch and may flip verdicts, and because the original agent workspace is deleted after the run, the judge re-grades from the saved artifacts (agent output, logs) rather than the live workspace files.

`cleanup`

agent-skill-eval cleanup --workspace ./eval-workspace [--yes]

Only closes PRs and deletes remote branches recorded in cleanup.json. Never touches unrelated PRs, branches, or workspaces.

`init`

agent-skill-eval init my-skill [--output ./examples]

Example skills

Five example skills ship with the repo, chosen to exercise different grading surfaces:

Skill	Tests	Grading surface
`commit-push-pr`	git workflow automation	deterministic state-diff checks (branch/commit/push/PR); needs a `--source-repo`
`fix-failing-tests`	error recovery / iterative refinement	deterministic + file checks; fully offline
`write-release-notes`	subjective writing quality, anti-fabrication	LLM rubric grading; fully offline
`validate-config`	bundled resources: does the agent run the skill's `scripts/` and read its `references/`?	command-ran + file + content checks; fully offline
`review-diff`	read-only analysis: planted bugs found, documented decoy not flagged, nothing modified	chat-output-only grading (content + LLM rubric); fully offline

Each has a matching eval suite under examples/. skills/ holds the artifacts being evaluated; examples/ holds the test cases — so you can test one skill against many suites or one suite against many skill versions.

Reading results

Workspace layout

eval-workspace/
└── <skill>-workspace/
    └── iteration-1/
        ├── evals_meta.json          # eval definitions (used by `grade`)
        ├── cleanup.json             # manifest of artifacts created by this run
        ├── benchmark.json           # aggregate stats per (agent, config)
        └── eval-<id>/<agent>/<config>/   # config = with_skill | without_skill
            ├── run-N/               # only when --runs > 1
            ├── outputs/
            │   ├── output.txt       # final agent output
            │   ├── stdout.log / stderr.log
            │   └── pre_state.json / post_state.json
            ├── timing.json          # tokens, duration, exit_code, timed_out, retries
            ├── grading.json         # per-assertion results
            └── run_meta.json        # agent, with_skill, run_index, ...
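
The pre_state.json / post_state.json pair is what lets a check ignore artifacts that existed before the run. A minimal sketch of that idea for branches follows; the snapshot keys `branches` and `current_branch` are assumed for illustration and may not match the real file layout.

```python
def new_branches(pre_state: dict, post_state: dict) -> set[str]:
    """Branches present after the run that were absent before it."""
    return set(post_state.get("branches", [])) - set(pre_state.get("branches", []))


def branch_created(pre_state: dict, post_state: dict) -> bool:
    """Pass only if a branch appeared during the run and is checked out."""
    return post_state.get("current_branch") in new_branches(pre_state, post_state)
```

A branch that already existed in the pre-run snapshot never counts as created, even if the agent checks it out.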

grading.json

{
  "assertion_results": [
    {
      "text": "A new git branch was created",
      "passed": false,
      "method": "deterministic",
      "skipped": false,
      "evidence": "No new branch appeared in this run. current_branch='main'"
    }
  ],
  "summary": {"passed": 2, "failed": 1, "skipped": 0, "total": 3, "pass_rate": 0.667}
}

To debug a failure: find "passed": false entries, read evidence, compare pre_state.json/post_state.json, then check outputs/output.txt for the agent's full response. Or just run agent-skill-eval report --show-evidence.
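
The same debugging steps can be scripted across a whole iteration. The sketch below walks an iteration directory and collects every failed, non-skipped assertion with its evidence; it relies only on the grading.json fields shown above, and `failed_assertions` is a hypothetical helper name.

```python
import json
from pathlib import Path


def failed_assertions(iteration_dir: str) -> list[dict]:
    """Collect failed (not skipped) assertions from every grading.json found."""
    root = Path(iteration_dir)
    failures = []
    for grading_file in sorted(root.rglob("grading.json")):
        data = json.loads(grading_file.read_text(encoding="utf-8"))
        for result in data.get("assertion_results", []):
            if not result.get("passed") and not result.get("skipped"):
                failures.append({
                    "file": str(grading_file.relative_to(root)),
                    "text": result.get("text"),
                    "method": result.get("method"),
                    "evidence": result.get("evidence"),
                })
    return failures
```

Each entry's `file` path includes the eval id, agent, and config directories, so it points straight at the run to inspect.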

Iterating on a skill

1. agent-skill-eval run ...
2. agent-skill-eval report --show-evidence
3. Edit SKILL.md
4. agent-skill-eval run --iteration 2 ...
5. agent-skill-eval compare --workspace ... 1 2

Comparing agents: reading with/without-skill deltas

Run several agents in one invocation and every agent gets its own baseline comparison:

agent-skill-eval run \
  --skill ./skills/fix-failing-tests \
  --evals ./examples/fix-failing-tests/evals/evals.json \
  --agent claude-code --agent-model claude-code=claude-haiku-4-5-20251001 \
  --agent opencode --agent-model opencode=deepseek/deepseek-v4-flash:free \
  --runs 3

agent-skill-eval report --workspace ./eval-workspace/fix-failing-tests-workspace --format markdown

The report shows one row per (agent, config) and a delta per agent (numbers below are illustrative):

Configuration	Pass Rate	Full Pass / pass@k	Time (s)	Tokens	Cost (USD)
claude-code_with_skill	91.7% +/- 14.4%	67% (k=3)	41.2 +/- 8.0	1840 +/- 312	0.0042 +/- 0.0011
claude-code_without_skill	58.3% +/- 14.4%	33% (k=3)	52.7 +/- 12.1	2410 +/- 405	0.0058 +/- 0.0019
opencode_with_skill	83.3% +/- 0.0%	67% (k=3)	64.9 +/- 9.3	2980 +/- 220	0.0000 +/- 0.0000
opencode_without_skill	50.0% +/- 25.0%	33% (k=3)	71.5 +/- 15.8	3340 +/- 510	0.0000 +/- 0.0000

Delta (with_skill - without_skill):

claude-code: pass rate +33.3%, time -11.5s, tokens -570, cost -0.0016 USD
opencode: pass rate +33.3%, time -6.6s, tokens -360, cost +0.0000 USD

How to read it: the delta rows are the skill's measured value per agent — here the skill lifts pass rate by ~33 points on both agents and saves time/tokens, the strongest possible signal. A positive pass-rate delta with a large token increase means the skill works but is verbose; a near-zero delta means that agent doesn't benefit (check report --show-evidence to see whether it never triggered the skill). The same numbers are machine-readable in benchmark.json: per-config stats under run_summary, per-agent deltas under deltas, keyed by agent name.
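
The delta arithmetic itself is just with_skill minus without_skill per agent. The sketch below reproduces it from the illustrative means in the table; the metric keys (`pass_rate`, `time_s`, `tokens`, `cost_usd`) and the function name are made up for this example and are not the benchmark.json field names.

```python
def per_agent_deltas(stats: dict) -> dict:
    """Compute with_skill - without_skill for each agent that has both configs."""
    deltas = {}
    for config, values in stats.items():
        if not config.endswith("_with_skill"):
            continue
        agent = config[: -len("_with_skill")]
        baseline = stats.get(f"{agent}_without_skill")
        if baseline is None:
            continue  # no baseline run for this agent
        deltas[agent] = {
            key: round(values[key] - baseline[key], 4)
            for key in values
            if key in baseline
        }
    return deltas


# Means copied from the illustrative table above.
stats = {
    "claude-code_with_skill": {"pass_rate": 91.7, "time_s": 41.2, "tokens": 1840, "cost_usd": 0.0042},
    "claude-code_without_skill": {"pass_rate": 58.3, "time_s": 52.7, "tokens": 2410, "cost_usd": 0.0058},
    "opencode_with_skill": {"pass_rate": 83.3, "time_s": 64.9, "tokens": 2980, "cost_usd": 0.0},
    "opencode_without_skill": {"pass_rate": 50.0, "time_s": 71.5, "tokens": 3340, "cost_usd": 0.0},
}
deltas = per_agent_deltas(stats)
```

Because this works from the rounded table values, the claude-code pass-rate delta comes out as 33.4 rather than the 33.3 shown in the report, which is presumably computed from unrounded rates.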

Environment variables

OPENROUTER_API_KEY / OPENAI_API_KEY: API key for LLM rubric grading
OPENAI_BASE_URL: custom grader endpoint (defaults to OpenRouter)
ASE_AGENT_TIMEOUT / ASE_AGENT_RETRIES: harness timeout/retry defaults
ASE_KEEP_WORKSPACE: keep per-eval workspaces for debugging

Agent-specific details

Agent	Command	Skill install path
OpenCode	`opencode run --format json --dangerously-skip-permissions`	`.opencode/skills/<name>/SKILL.md`
Claude Code	`claude -p --output-format json --dangerously-skip-permissions`	`.claude/skills/<name>/SKILL.md`
Codex	`codex exec --json --sandbox workspace-write --skip-git-repo-check`	`.codex/skills/<name>/SKILL.md`
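
The install paths in the table can be expressed as a small lookup plus a directory copy. This is a sketch of the idea only: it assumes the skill name equals the skill directory's name and that the whole directory is copied, neither of which is confirmed here, and `install_skill` is a hypothetical helper.

```python
import shutil
from pathlib import Path

# Per-agent skill directories, taken from the table above.
SKILL_DIRS = {
    "opencode": ".opencode/skills",
    "claude-code": ".claude/skills",
    "codex": ".codex/skills",
}


def install_skill(skill_dir: str, workspace: str, agent: str) -> Path:
    """Copy a skill directory into the agent-specific location in a workspace."""
    source = Path(skill_dir)
    # Assumption: the skill name is the name of its directory.
    target = Path(workspace) / SKILL_DIRS[agent] / source.name
    shutil.copytree(source, target)  # creates missing parent directories
    return target / "SKILL.md"
```

Copying the full directory rather than only SKILL.md keeps bundled `scripts/` and `references/` available to the agent.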

Cost reporting

timing.json records cost_usd per run, taken from the agent CLI itself (claude: total_cost_usd, opencode: per-step cost; codex reports nothing). One sharp edge: the claude CLI prices runs at Anthropic list prices regardless of the endpoint it talks to. When you route claude-code through OpenRouter (--harness-base-url or ANTHROPIC_BASE_URL), agent-skill-eval reconciles the cost by recomputing it from the run's token counts and OpenRouter's published per-model rates. timing.json then shows:

cost_usd_source: "cli" (the CLI's own number), "openrouter-pricing" (reconciled), or "cli-unreconciled" (reconciliation failed — e.g. a short model alias like haiku that can't be mapped to an OpenRouter slug — so cost_usd is the CLI's list-price estimate and actual billing differs)
cost_usd_cli: the CLI's original estimate, kept alongside the reconciled value

Pin full model IDs (claude-haiku-4-5-20251001, not haiku) to keep reconciliation working.

Development

uv run --extra dev pytest -q
uv run --extra dev ruff check src/ tests/
uv run --extra dev ruff format --check src/ tests/

A fake agent type exists for offline testing of the full run→grade→report pipeline (used by the CI smoke test). To record a demo cast: ./scripts/record-demo.sh.

License

MIT

Project details

Maintainers

quaerens

Release history

0.6.1: Jun 12, 2026
0.6.0: Jun 12, 2026
0.5.0 (this version): Jun 11, 2026
0.4.1: Jun 11, 2026
0.4.0: Jun 11, 2026
0.3.0: Jun 10, 2026
0.2.1: Jun 10, 2026

Download files

Download the file for your platform.

Source Distribution

agent_skill_eval-0.5.0.tar.gz (89.9 kB), uploaded Jun 11, 2026, Source

Built Distribution

agent_skill_eval-0.5.0-py3-none-any.whl (43.4 kB), uploaded Jun 11, 2026, Python 3

File details

Details for the file agent_skill_eval-0.5.0.tar.gz.

File metadata

Download URL: agent_skill_eval-0.5.0.tar.gz
Upload date: Jun 11, 2026
Size: 89.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_skill_eval-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`770cb80d8f91d17a95afbd8a66b61f11fbbe8aebaebed3a48ed1b05e83ae85f4`
MD5	`b613055872d61840652f3053e59df7fb`
BLAKE2b-256	`d87f3cdfa4b49495f2f24296feeec3f4c0256019e669d75d0348abf3d10e99a5`

Provenance

The following attestation bundles were made for agent_skill_eval-0.5.0.tar.gz:

Publisher: release.yml on tardigrde/agent-skill-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agent_skill_eval-0.5.0.tar.gz
- Subject digest: 770cb80d8f91d17a95afbd8a66b61f11fbbe8aebaebed3a48ed1b05e83ae85f4
- Sigstore transparency entry: 1790710698
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: tardigrde/agent-skill-eval@f6ed8225b8168163df5e3ad3fea1c78d25ef1871
- Branch / Tag: refs/heads/main
- Owner: https://github.com/tardigrde
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f6ed8225b8168163df5e3ad3fea1c78d25ef1871
- Trigger Event: push

File details

Details for the file agent_skill_eval-0.5.0-py3-none-any.whl.

File metadata

Download URL: agent_skill_eval-0.5.0-py3-none-any.whl
Upload date: Jun 11, 2026
Size: 43.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for agent_skill_eval-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`93a20a5a55899376fee8b760d6ee4aefaabde12a0668f82b3b538c576e53b393`
MD5	`d815f01bca9c1a8f4d52f872855f1092`
BLAKE2b-256	`327c35082242fdd16d1f1a9445b31153d279ea316af7249297b79fad37d0fd67`

Provenance

The following attestation bundles were made for agent_skill_eval-0.5.0-py3-none-any.whl:

Publisher: release.yml on tardigrde/agent-skill-eval

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: agent_skill_eval-0.5.0-py3-none-any.whl
- Subject digest: 93a20a5a55899376fee8b760d6ee4aefaabde12a0668f82b3b538c576e53b393
- Sigstore transparency entry: 1790710836
- Sigstore integration time: Jun 11, 2026
Source repository:
- Permalink: tardigrde/agent-skill-eval@f6ed8225b8168163df5e3ad3fea1c78d25ef1871
- Branch / Tag: refs/heads/main
- Owner: https://github.com/tardigrde
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@f6ed8225b8168163df5e3ad3fea1c78d25ef1871
- Trigger Event: push
