Auditor for Claude Code skills and slash commands. Validates structured output against schemas using layered evaluation.
clauditor
Automated quality checks for Agent Skills. A skill is a reusable instruction file (SKILL.md) that tells Claude how to do a task. clauditor answers three questions about every run: Did it run? Did it return the right structure? Was the answer actually good? The first two checks cost pennies at most and run in CI; the third is for release gates.
Contents
Why clauditor? · Install · One-minute example · Installing /clauditor · Using /clauditor · Quick Start · Three Layers · Suggest · CLI Reference · Pytest Integration · Eval Spec Format · Skill compatibility · Authentication · Reference docs
Why clauditor?
Three checks, cheap to expensive:
- Layer 1 — Did your skill produce the right structure? A free, instant check that runs in CI. Catches: "the output is missing the Venues section," "no phone numbers were extracted," "the URL is malformed." No LLM calls; pure regex and string matching against your assertions.
- Layer 2 — Did it pull the right fields? A small LLM (Anthropic's Haiku model, ~pennies per run) reads the skill's output and validates it against a schema you write. Catches: "the venue's address field is empty," "tier-1 entries are missing a website URL."
- Layer 3 — Was the answer actually useful? A stronger LLM (Anthropic's Sonnet model, ~dollars per run) grades the output against your rubric. Run on release, not every commit. Catches: "the venues are too far from the requested area," "the recommendations don't match the kid's age range."
The same eval.json file drives all three layers. You write it once; clauditor uses it for static checks, structured-field grading, and rubric scoring.
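To make the "pure regex and string matching" idea of Layer 1 concrete, here is a minimal sketch of a deterministic format check. The `PATTERNS` table and the `count_matches` helper are illustrative assumptions, not clauditor's actual implementation, and the regexes are deliberately simple.

```python
import re

# Illustrative patterns only (assumption); clauditor's real format
# definitions may be stricter or shaped differently.
PATTERNS = {
    "phone_us": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "url": re.compile(r"https?://\S+"),
}


def count_matches(output: str, fmt: str) -> int:
    """Count non-overlapping matches of a named format in the skill output."""
    return len(PATTERNS[fmt].findall(output))


sample = """## Venues
- Example Park, (408) 555-0134, https://example.com/park
- Example Museum, 408-555-0199, https://example.org/museum
"""

# A structural check passes when the expected pieces are present.
print("Venues" in sample)                 # section heading present?
print(count_matches(sample, "phone_us"))  # how many phone numbers?
print(count_matches(sample, "url"))       # how many URLs?
```

Checks like these need no network access or credentials, which is why this layer can run on every commit.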
Install
pip install clauditor-eval
clauditor --version
Layer 1 works without any LLM credentials. Layers 2 & 3 and propose-eval need either an Anthropic API key (ANTHROPIC_API_KEY) or the claude CLI installed and signed in to a Claude Pro/Max subscription — see Authentication.
Installing from source (for contributors)
git clone https://github.com/wjduenow/clauditor.git
cd clauditor
uv sync --dev
uv is the project's package manager; it's a faster drop-in for pip + venv.
One-minute example
Both paths below assume you already have a SKILL.md — clauditor checks skills, it doesn't write them. A skill lives in .claude/skills/<name>/SKILL.md (modern layout) or .claude/commands/<name>.md (older layout); run ls .claude/ from your project root if you're not sure which you have.
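If you would rather check the two layouts programmatically than with ls, a small sketch along these lines works. `find_skill_file` is a hypothetical helper, not part of clauditor, and preferring the modern layout when both exist is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_skill_file(project_root: Path, name: str) -> Optional[Path]:
    """Return the skill file for `name`, or None if neither layout exists."""
    candidates = [
        # Modern layout: one directory per skill.
        project_root / ".claude" / "skills" / name / "SKILL.md",
        # Older layout: one markdown file per command.
        project_root / ".claude" / "commands" / f"{name}.md",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_skill_file(Path.cwd(), "my-skill")` from the project root returns the path to pass to clauditor, or `None` if the skill does not exist yet.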
I want a minimal eval spec I'll fill in myself. clauditor init reads your SKILL.md and writes a bare-bones eval.json next to it — no LLM call, no tokens spent. You add assertions and grading criteria from there.
clauditor init .claude/skills/my-skill/SKILL.md # generate a starter eval spec
clauditor validate .claude/skills/my-skill/SKILL.md # run Layer 1 against the skill
Expected output:
✓ Running /my-skill...
4/4 assertions passed (100%)
I want clauditor to write a richer eval spec for me. clauditor propose-eval sends your SKILL.md (and optionally a captured real-world run) to an LLM and gets back a populated eval.json with assertions, sections, and grading criteria already drafted.
clauditor propose-eval .claude/skills/my-skill/SKILL.md --dry-run # preview the prompt — no tokens spent
clauditor propose-eval .claude/skills/my-skill/SKILL.md # LLM writes the eval spec
clauditor validate .claude/skills/my-skill/SKILL.md # run Layer 1 against it
--dry-run prints the prompt clauditor would send to the LLM without making any API call — a cost-free preview so you can iterate on inputs before spending tokens.
Swap validate for grade once you've added grading_criteria to the spec to run Layer 3.
Installing the /clauditor slash command
If you use Claude Code interactively, you can type /clauditor <skill> in the prompt instead of running CLI commands — Claude reads the eval spec, runs the checks, and shows you results in-line. This is optional; the CLI works without it.
From your project root, uv run clauditor setup creates a symlink at .claude/skills/clauditor pointing at the bundled Claude Code skill; pip install --upgrade clauditor-eval then picks up skill updates automatically. Restart Claude Code once if .claude/skills/ did not exist before.
Flags and details
- --unlink: remove the /clauditor symlink. Refuses symlinks not pointing at the installed clauditor package, so it won't touch user-authored skills.
- --force: overwrite an existing file or symlink at .claude/skills/clauditor.
- --project-dir PATH: override project-root detection (default walks up for .git/ or .claude/).
Edits under .claude/skills/clauditor/ are hot-reloaded by Claude Code. uv run clauditor doctor reports the symlink's health (absent / installed / stale / wrong-target / unmanaged).
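The default project-root detection described for --project-dir (walking up until a .git/ or .claude/ directory is found) can be sketched as follows. This is a minimal illustration under that description; `find_project_root` is a hypothetical name and clauditor's real detection may handle more cases.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` until a directory holds .git/ or .claude/."""
    start = start.resolve()
    for directory in (start, *start.parents):
        if (directory / ".git").is_dir() or (directory / ".claude").is_dir():
            return directory
    return None  # reached the filesystem root without finding a marker


# Demo in a throwaway directory tree: <tmp>/.claude and <tmp>/src/pkg.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / ".claude").mkdir()
    nested = root / "src" / "pkg"
    nested.mkdir(parents=True)
    print(find_project_root(nested) == root)
```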
Using /clauditor in Claude Code
Invoke the slash command with a skill path — Claude locates the eval spec, runs L1, and asks before spending tokens on L3; if no sibling <skill>.eval.json exists, Claude offers clauditor propose-eval as an LLM-assisted bootstrap (with a cost-free --dry-run preview) before grading. If L3 reports failing criteria, Claude offers clauditor suggest to propose a unified diff of SKILL.md edits motivated by the specific failing criterion ids.
/clauditor .claude/commands/my-skill.md
Full reference: docs/skill-usage.md.
Quick Start
A new skill goes from "untested" to "covered" in four steps: clauditor capture records a real run, clauditor propose-eval bootstraps a full three-layer spec from the SKILL.md plus that capture, clauditor validate tightens L1 assertions, then the spec wires into pytest for regression coverage.
clauditor capture my-skill -- "initial context" # save real output → tests/eval/captured/
clauditor propose-eval .claude/commands/my-skill.md # LLM writes the spec from SKILL.md + capture
clauditor validate .claude/commands/my-skill.md
clauditor validate .claude/commands/my-skill.md --json # CI mode
Covered in the full reference: the capture command and interactive-skill limitations, propose-eval options, pytest fixtures (clauditor_runner, clauditor_asserter, clauditor_spec). Full reference: docs/quick-start.md.
Three Layers of Validation
L1 catches shape regressions for free, L2 uses Haiku to validate structured fields, L3 uses Sonnet to grade against a rubric — all three drive off the same eval spec:
{"assertions": [...], "sections": [...], "grading_criteria": [...]}
Full reference: docs/layers.md.
LLM-assisted skill improvement (clauditor suggest)
When clauditor grade returns failing L3 criteria, clauditor suggest reads that iteration's grading.json, asks Sonnet to propose minimal SKILL.md edits keyed to the failing criterion ids, and writes a unified diff plus a JSON sidecar with motivated_by, anchor, confidence, and per-edit rationale. Every proposal is hard-validated so its anchor appears exactly once in the target SKILL.md before anything lands on disk — no silent drift, no blind patches.
clauditor grade .claude/skills/my-skill/SKILL.md # produces grading.json
clauditor suggest .claude/skills/my-skill/SKILL.md # reads latest grading.json
# → unified diff on stdout; sidecar at .clauditor/suggestions/my-skill-<ts>.{diff,json}
Review, git apply (or hand-edit), then re-run clauditor grade to measure the score delta. The sidecar is stable (schema_version: 1) for downstream tooling.
Covered in the full reference: traceability via motivated_by, the anchor-safety contract, sidecar field-by-field reference, --from-iteration, --with-transcripts, --model, --json, and the full worked walkthrough. Full reference: docs/skill-usage.md#proposing-skill-improvements.
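The anchor-safety rule (an anchor must appear exactly once in the target SKILL.md) is easy to picture in code. The sketch below is an assumption about the shape of such a check, with a hypothetical `anchor_is_safe` helper; it is not clauditor's validator.

```python
def anchor_is_safe(skill_text: str, anchor: str) -> bool:
    """True only when `anchor` is non-empty and occurs exactly once.

    str.count counts non-overlapping occurrences, which is enough for
    typical anchors such as headings or full sentences.
    """
    return bool(anchor) and skill_text.count(anchor) == 1


skill_text = (
    "# My skill\n\n"
    "## Steps\n1. Search\n2. Summarize\n\n"
    "## Output\nReturn a list.\n"
)

print(anchor_is_safe(skill_text, "## Steps"))    # unique heading
print(anchor_is_safe(skill_text, "## "))         # ambiguous: matches two headings
print(anchor_is_safe(skill_text, "## Missing"))  # absent from the file
```

Rejecting both the ambiguous and the missing case is what keeps an edit from landing in the wrong place.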
CLI Reference
Stable exit-code contract (0 = pass, 1 = skill failed, 2 = input error, 3 = Anthropic error). grade auto-increments iteration slots under .clauditor/iteration-N/<skill>/ and appends metrics to history.jsonl.
clauditor init <skill.md> # Starter eval.json
clauditor propose-eval <skill.md> # LLM-assisted EvalSpec bootstrap
clauditor lint <skill.md> # Static agentskills.io spec conformance
clauditor validate <skill.md> # Layer 1 assertions
clauditor grade <skill.md> # Layer 3 quality grading
clauditor compare --skill <s> --from 1 --to 2 # Diff iterations
clauditor trend <skill> --metric total.total # History + sparkline
clauditor badge <skill.md> # Shields.io endpoint JSON for README embed
Covered in the full reference: every subcommand flag (--variance, --iteration, --diff, …), exit codes, history.jsonl shape, clauditor trend metric paths. Full reference: docs/cli-reference.md.
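A CI wrapper can lean on the exit-code contract above. The sketch below maps the four documented codes to labels and shows one way to invoke clauditor validate from Python; the helper names are hypothetical, and `run_validate` assumes clauditor is installed and on PATH.

```python
import subprocess
from typing import List

# Meanings taken from the exit-code contract stated in this README.
EXIT_MEANINGS = {
    0: "pass",
    1: "skill failed",
    2: "input error",
    3: "Anthropic error",
}


def describe_exit(code: int) -> str:
    """Translate a clauditor exit code into a human-readable label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def build_validate_command(skill_path: str) -> List[str]:
    """Command line for a Layer 1 run in CI mode (JSON output)."""
    return ["clauditor", "validate", skill_path, "--json"]


def run_validate(skill_path: str) -> str:
    """Run Layer 1 and label the result; requires clauditor on PATH."""
    completed = subprocess.run(
        build_validate_command(skill_path),
        capture_output=True,
        text=True,
    )
    return describe_exit(completed.returncode)


print(describe_exit(0))
print(describe_exit(2))
```

Separating "skill failed" (1) from "input error" (2) and "Anthropic error" (3) lets a pipeline retry infrastructure failures without masking real regressions.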
Pytest Integration
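def test_my_skill(clauditor_runner, clauditor_asserter):
    # Run the skill with its arguments and capture the output.
    result = clauditor_runner.run("my-skill", '"San Jose, CA"')
    # Layer 1 style check: the output must contain the word "Results".
    clauditor_asserter(result).assert_contains("Results")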
Full reference: docs/pytest-plugin.md.
Eval Spec Format
A <skill-name>.eval.json lives next to the skill's .md file and drives all three layers. In plain English, it lists: what input to test the skill with, structural rules the output must satisfy (assertions), fields the output should contain (sections + tiers, used by Layer 2), and rubric questions for the LLM judge (grading criteria, used by Layer 3).
{
  "skill_name": "find-kid-activities",
  "test_args": "\"Cupertino, CA\" --ages 4-6",
  "assertions": [
    {"id": "has_venues", "type": "contains", "needle": "Venues"}
  ],
  "sections": [
    {
      "name": "Venues",
      "tiers": [
        {
          "label": "default",
          "min_entries": 3,
          "fields": [
            {"id": "v_name", "name": "name", "required": true}
          ]
        }
      ]
    }
  ],
  "grading_criteria": [
    {"id": "distance_ok", "criterion": "Are all venues within the specified distance?"}
  ]
}
In this example: test_args is the prompt clauditor passes to the skill. The single L1 assertion checks the output literally contains the word "Venues". The sections block tells Layer 2 to find a "Venues" section with at least 3 entries, each with a name field. The grading_criteria block gives Layer 3 a yes/no question to grade the output on.
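To show how the assertions block maps onto a check, here is a minimal sketch that loads a trimmed copy of the spec above and evaluates its contains assertion against some output. `run_contains_assertions` is a hypothetical helper written for illustration; it handles only the contains type and is not how clauditor is implemented.

```python
import json

# Trimmed copy of the example spec; only the fields this sketch reads.
SPEC_JSON = """
{
  "skill_name": "find-kid-activities",
  "assertions": [
    {"id": "has_venues", "type": "contains", "needle": "Venues"}
  ]
}
"""


def run_contains_assertions(spec: dict, output: str) -> dict:
    """Map each `contains` assertion id to True/False for the given output."""
    results = {}
    for assertion in spec.get("assertions", []):
        if assertion["type"] == "contains":
            results[assertion["id"]] = assertion["needle"] in output
    return results


spec = json.loads(SPEC_JSON)
print(run_contains_assertions(spec, "## Venues\n1. Example Park"))
print(run_contains_assertions(spec, "No results found."))
```

Keying results by assertion id mirrors how the spec names each rule, so a failure can be traced back to a single line of the eval file.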
Optional blocks (input_files, output_files, variance, trigger_tests) add staging, file-based output capture, variance measurement, and trigger precision.
Covered in the full reference: the full eval-spec JSON shape, input_files staging rules, output_file / output_files capture, and the format validation DSL (phone_us, url, domain, … or inline regex). Full reference: docs/eval-spec-reference.md.
Alignment with agentskills.io
clauditor implements (and extends) the workflow at agentskills.io/skill-creation/evaluating-skills:
| agentskills.io concept | clauditor |
|---|---|
| Test case (prompt + expected + files) | .eval.json with test_args, input_files, sections, grading_criteria |
| Deterministic assertions | Layer 1 — assertions.py, FORMAT_REGISTRY (20 types) |
| LLM-judged structural checks | Layer 2 — grader.py, tiered schema extraction |
| Rubric quality grading | Layer 3 — quality_grader.py, per-criterion scoring + variance |
| Regression + longitudinal history | clauditor compare, .clauditor/history.jsonl, clauditor trend --metric <dotted.path> |
| Per-iteration workspace | .clauditor/iteration-N/<skill>/ with sidecars + run-*/ transcripts |
Beyond the spec: trigger precision testing, tiered extraction, pytest plugin, input_files staging, blind A/B judge, baseline pair runs, transcript capture, LLM-driven skill improvement proposer (clauditor suggest), LLM-assisted EvalSpec bootstrap (clauditor propose-eval), Pro/Max subscription-auth option (--no-api-key) for research-heavy skills that exceed the API-tier rate limit, static spec-conformance check (clauditor lint). Out of scope: human-in-the-loop feedback capture.
Note: --no-api-key only affects the subprocess; the six LLM-mediated commands (grade, propose-eval, suggest, triggers, extract, compare --blind) route their own Anthropic call through a pluggable transport that accepts either ANTHROPIC_API_KEY or a claude CLI subscription by default. See Authentication and API Keys.
Skill compatibility
clauditor works for most skills out of the box. A few patterns need a workaround or aren't supported yet:
- Skills with parallel sub-tasks (the Task(run_in_background=true) pattern): pass --sync-tasks to force them to run sequentially. Output capture works correctly, but the async behavior itself (race conditions, late-arriving results) is not tested; you're evaluating a slightly different execution model than what ships.
- Skills that ask the user mid-run (e.g. AskUserQuestion to clarify intent): not supported directly. clauditor runs skills non-interactively, so the question never gets an answer and the run hangs. The fix is usually to take all parameters in the initial prompt; see the worked before/after example and the not_contains AskUserQuestion regression assertion in docs/skill-usage.md#recipe-skills-that-ask-the-user-mid-run, with examples/.claude/skills/find-kid-activities/SKILL.eval.json as the canonical anchor.
- Skills whose correctness depends on async timing: cannot be tested accurately yet. Blocked on an upstream Claude Code feature.
Technical detail and upstream tracking
clauditor invokes skills through claude -p (non-interactive print mode), which is a strict subset of the interactive Claude Code runtime.
- Works: sequential Task calls, parallel tool calls in the parent turn, every standard tool (WebSearch, WebFetch, Bash, Read, Write, Edit).
- Works with --sync-tasks: skills using Task(run_in_background=true). The flag sets CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1 in the subprocess env, resolving the #97 output-truncation case.
- Loud failure today: skills whose correctness depends on true async semantics. This is blocked on upstream Claude Code gaining headless background-task polling, tracked in anthropics/claude-code#52917 and catalogued in docs/adr/transport-research-103.md.
Full matrix and refactoring recipes: docs/skill-usage.md#skill-compatibility.
Authentication and API Keys
Do I need an Anthropic API key?
- Just running Layer 1 checks (validate, lint, init, capture) → no key needed. These are deterministic and never call an LLM.
- You have a Claude Pro/Max subscription and the claude CLI installed → no API key needed. clauditor's default auto transport detects the CLI on PATH and uses your subscription auth for grading. Pass --transport cli to be explicit.
- Otherwise → set ANTHROPIC_API_KEY (get one at console.anthropic.com) before running grade, extract, propose-eval, suggest, triggers, or compare --blind.
The six LLM-mediated commands above route their Anthropic call through a pluggable transport — either the HTTP SDK (--transport api) or a subprocess to the local claude CLI (--transport cli). The default auto setting picks CLI when available, else API. Full reference: docs/transport-architecture.md.
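The auto rule ("CLI when available, else API") can be sketched in a few lines. This is a simplified illustration: `resolve_transport` is a hypothetical name, and the real selection logic may also consider auth state and precedence rules covered in the transport reference.

```python
import shutil
from typing import Callable, Optional


def resolve_transport(
    requested: str = "auto",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Resolve 'auto' to 'cli' when the claude binary is on PATH, else 'api'."""
    if requested in ("cli", "api"):
        return requested  # an explicit choice always wins
    if requested != "auto":
        raise ValueError(f"unknown transport: {requested!r}")
    return "cli" if which("claude") else "api"


# Inject a fake PATH lookup so the demo does not depend on this machine.
print(resolve_transport("auto", which=lambda name: "/usr/local/bin/claude"))
print(resolve_transport("auto", which=lambda name: None))
```

Passing the lookup function as a parameter keeps the decision testable without installing or removing the claude CLI.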
clauditor also supports multi-provider grading: pass --grading-provider {anthropic,openai,auto} (or set CLAUDITOR_GRADING_PROVIDER / EvalSpec.grading_provider) to route the LLM-grader call through the OpenAI SDK with OPENAI_API_KEY instead. Under the default auto, clauditor infers the provider from the grading_model prefix (claude-* → anthropic, gpt-* / o[0-9]+* → openai). When grading_model is unset (the default), auto falls back to anthropic — the subscription-first historical default — so a spec with no provider/model fields grades against Claude Sonnet exactly as it did before multi-provider support landed.
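The prefix-based inference under auto can be written out as a small function. The sketch below follows the rules stated above; `infer_provider` is a hypothetical name, the model strings are placeholders, and raising on an unrecognized prefix is an assumption, since that case is not described here.

```python
import re
from typing import Optional

# "o" followed by one or more digits, e.g. the o-series naming pattern.
_OPENAI_O_SERIES = re.compile(r"^o[0-9]+")


def infer_provider(grading_model: Optional[str]) -> str:
    """Infer the grading provider from the model name prefix."""
    if not grading_model:
        return "anthropic"  # unset model falls back to the historical default
    if grading_model.startswith("claude-"):
        return "anthropic"
    if grading_model.startswith("gpt-") or _OPENAI_O_SERIES.match(grading_model):
        return "openai"
    # Assumption: unknown prefixes are rejected rather than guessed.
    raise ValueError(f"cannot infer provider for model {grading_model!r}")


for model in (None, "claude-some-model", "gpt-some-model", "o3-some-model"):
    print(model, "->", infer_provider(model))
```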
Running clauditor grade <skill> --transport cli is the one-liner for subscription auth end-to-end: it implicitly strips ANTHROPIC_API_KEY / ANTHROPIC_AUTH_TOKEN from the skill subprocess env, so both the grader and the skill use subscription auth. Pass --transport api to keep the keys.
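Both the key stripping described here and the --sync-tasks flag from the Skill compatibility section come down to editing the environment handed to the skill subprocess. A minimal sketch, assuming a hypothetical `build_skill_env` helper (the variable names are the ones given in this README; the function itself is not clauditor's API):

```python
from typing import Dict, Mapping

# Auth variables that subscription mode removes from the skill subprocess.
AUTH_VARS = ("ANTHROPIC_API_KEY", "ANTHROPIC_AUTH_TOKEN")


def build_skill_env(
    base: Mapping[str, str],
    strip_auth: bool,
    sync_tasks: bool = False,
) -> Dict[str, str]:
    """Return a copy of `base` adjusted for the skill subprocess."""
    env = dict(base)  # never mutate the caller's environment
    if strip_auth:
        for name in AUTH_VARS:
            env.pop(name, None)
    if sync_tasks:
        env["CLAUDE_CODE_DISABLE_BACKGROUND_TASKS"] = "1"
    return env


parent = {"PATH": "/usr/bin", "ANTHROPIC_API_KEY": "placeholder-key"}
print(build_skill_env(parent, strip_auth=True, sync_tasks=True))
```

In real use the result would be passed as the `env` argument of `subprocess.run`, with `os.environ` as the base mapping.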
Running skills under a non-Claude harness. clauditor can also drive your skill through the OpenAI Codex CLI instead of Claude Code: pass --harness codex (or set CLAUDITOR_HARNESS / EvalSpec.harness) to validate, grade, capture, or run. Codex needs CODEX_API_KEY or OPENAI_API_KEY in the subprocess env; the default auto harness picks Claude when claude is on PATH, else Codex. The harness axis is independent of the grader provider: an eval can run under Codex while the L3 grader still calls Claude (or vice versa). Full reference: docs/codex-harness.md.
Cost and observability
Every clauditor grade iteration writes a context.json sidecar capturing the harness, provider, model, sandbox mode, reasoning-token count, and an estimated cost_usd for the grader calls. clauditor trend and clauditor audit aggregate across iterations and refuse to silently average across mismatched harness or provider axes — pass --cross-harness / --cross-provider to opt in. Full reference: docs/cost-tracking.md and docs/audit-trend-workflow.md.
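The "refuse to silently average across mismatched axes" behavior can be illustrated with a small aggregation function. This is a sketch under assumptions: `mean_cost` is hypothetical, and the `harness` and `provider` key names are guesses at the context.json shape (only `cost_usd` is named in this README).

```python
from typing import Any, Iterable, Mapping


def mean_cost(
    contexts: Iterable[Mapping[str, Any]],
    cross_harness: bool = False,
    cross_provider: bool = False,
) -> float:
    """Average cost_usd across iterations, refusing mixed axes by default."""
    items = list(contexts)
    if not items:
        raise ValueError("no iterations to aggregate")
    harnesses = {item["harness"] for item in items}
    providers = {item["provider"] for item in items}
    if len(harnesses) > 1 and not cross_harness:
        raise ValueError(f"mixed harnesses {sorted(harnesses)}; opt in to combine")
    if len(providers) > 1 and not cross_provider:
        raise ValueError(f"mixed providers {sorted(providers)}; opt in to combine")
    return sum(item["cost_usd"] for item in items) / len(items)


same_axis = [
    {"harness": "claude", "provider": "anthropic", "cost_usd": 0.25},
    {"harness": "claude", "provider": "anthropic", "cost_usd": 0.75},
]
print(mean_cost(same_axis))
```

Failing loudly by default matters because an average over runs from different harnesses or providers compares costs that were never measured on the same basis.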
Reference docs
- docs/architecture.md: how clauditor works under the hood (mermaid diagrams of the grade flow)
- docs/quick-start.md: tutorial walkthrough from init → validate → pytest
- docs/layers.md: the three-layer framework in depth
- docs/cli-reference.md: full subcommand + flag + exit-code reference
- docs/eval-spec-reference.md: complete .eval.json schema
- docs/pytest-plugin.md: pytest fixtures and options
- docs/skill-usage.md: using /clauditor in Claude Code
- docs/skills.md: catalog of skills shipped with this repo, with live badge status
- docs/badges.md: shields.io badges from iteration sidecars (clauditor badge)
- docs/stream-json-schema.md: claude stream-json parser contract
- docs/codex-stream-schema.md: codex NDJSON parser contract (sibling of stream-json-schema)
- docs/transport-architecture.md: CLI vs SDK transport, auth-state matrix, precedence, migration
- docs/codex-harness.md: running skills under the OpenAI Codex CLI
- docs/cost-tracking.md: cost_usd, reasoning tokens, the pricing table, and per-iteration context.json
- docs/audit-trend-workflow.md: measuring regressions over iterations with audit / trend / compare / badge
- CONTRIBUTING.md: maintainer pre-release dogfood gate + contribution workflow
License
Apache 2.0