Auditor for Claude Code skills and slash commands. Validates structured output against schemas using layered evaluation.

Maintainers

wjduenow

clauditor

Automated quality checks for Agent Skills. A skill is a reusable instruction file (SKILL.md) that tells Claude how to do a task. clauditor answers three questions about every run: Did it run? Did it return the right structure? Was the answer actually good? The first two checks cost pennies at most and run in CI; the third is for release gates.

Contents

Why clauditor? · Install · One-minute example · Installing /clauditor · Using /clauditor · Quick Start · Three Layers · Suggest · CLI Reference · Pytest Integration · Eval Spec Format · Skill compatibility · Authentication · Reference docs

Why clauditor?

Three checks, cheap to expensive:

Layer 1 — Did your skill produce the right structure? A free, instant check that runs in CI. Catches: "the output is missing the Venues section," "no phone numbers were extracted," "the URL is malformed." No LLM calls; pure regex and string matching against your assertions.
Layer 2 — Did it pull the right fields? A small LLM (Anthropic's Haiku model, ~pennies per run) reads the skill's output and validates it against a schema you write. Catches: "the venue's address field is empty," "tier-1 entries are missing a website URL."
Layer 3 — Was the answer actually useful? A stronger LLM (Anthropic's Sonnet model, ~dollars per run) grades the output against your rubric. Run on release, not every commit. Catches: "the venues are too far from the requested area," "the recommendations don't match the kid's age range."

The same eval.json file drives all three layers. You write it once; clauditor uses it for static checks, structured-field grading, and rubric scoring.
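To make the Layer 1 idea concrete, here is a minimal sketch of deterministic checks driven by a list of assertions. It is not clauditor's implementation: the "contains" type and "needle" key mirror the example spec in this README, while the "regex" type, the "pattern" key, and the sample output are hypothetical and exist only for illustration.

```python
import re

# Illustrative stand-in for Layer 1 style checks; not clauditor's own code.
# "contains"/"needle" mirror the README's example spec. The "regex" type and
# its "pattern" key are hypothetical additions for this sketch.
def run_static_checks(output: str, assertions: list) -> dict:
    results = {}
    for assertion in assertions:
        if assertion["type"] == "contains":
            # Plain substring match, no LLM involved.
            results[assertion["id"]] = assertion["needle"] in output
        elif assertion["type"] == "regex":
            results[assertion["id"]] = re.search(assertion["pattern"], output) is not None
        else:
            raise ValueError(f"unsupported assertion type: {assertion['type']}")
    return results

# Made-up skill output used only to exercise the checks.
sample_output = "## Venues\n1. Example Park, (408) 555-0100\n"
checks = [
    {"id": "has_venues", "type": "contains", "needle": "Venues"},
    {"id": "has_phone", "type": "regex", "pattern": r"\(\d{3}\) \d{3}-\d{4}"},
]
results = run_static_checks(sample_output, checks)
print(results)
```

Because checks like these are pure string operations, they are cheap enough to run on every commit.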

Install

pip install clauditor-eval
clauditor --version

Layer 1 works without any LLM credentials. Layers 2 & 3 and propose-eval need either an Anthropic API key (ANTHROPIC_API_KEY) or the claude CLI installed and signed in to a Claude Pro/Max subscription — see Authentication.

Installing from source (for contributors)

git clone https://github.com/wjduenow/clauditor.git
cd clauditor
uv sync --dev

uv is the project's package manager; it's a faster drop-in for pip + venv.

One-minute example

Both paths below assume you already have a SKILL.md — clauditor checks skills, it doesn't write them. A skill lives in .claude/skills/<name>/SKILL.md (modern layout) or .claude/commands/<name>.md (older layout); run ls .claude/ from your project root if you're not sure which you have.
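If you would rather check the two layouts programmatically than with ls, a small sketch along these lines could work. The function name find_skill_file and the choice to try the modern layout first are assumptions made for this example, not clauditor behavior.

```python
from pathlib import Path
from typing import Optional
import tempfile

# Hypothetical helper: look for a skill in the two layouts described above.
# Trying the modern layout first is an assumption of this sketch.
def find_skill_file(project_root: Path, name: str) -> Optional[Path]:
    candidates = [
        project_root / ".claude" / "skills" / name / "SKILL.md",   # modern layout
        project_root / ".claude" / "commands" / f"{name}.md",      # older layout
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None

# Demonstration against a throwaway directory that uses the older layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    legacy = root / ".claude" / "commands" / "my-skill.md"
    legacy.parent.mkdir(parents=True)
    legacy.write_text("# my-skill\n")
    found_layout = find_skill_file(root, "my-skill").relative_to(root).as_posix()
    missing = find_skill_file(root, "other-skill")

print(found_layout)
```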

I want a minimal eval spec I'll fill in myself. clauditor init reads your SKILL.md and writes a bare-bones eval.json next to it — no LLM call, no tokens spent. You add assertions and grading criteria from there.

clauditor init .claude/skills/my-skill/SKILL.md       # generate a starter eval spec
clauditor validate .claude/skills/my-skill/SKILL.md   # run Layer 1 against the skill

Expected output:

✓ Running /my-skill...
4/4 assertions passed (100%)

I want clauditor to write a richer eval spec for me. clauditor propose-eval sends your SKILL.md (and optionally a captured real-world run) to an LLM and gets back a populated eval.json with assertions, sections, and grading criteria already drafted.

clauditor propose-eval .claude/skills/my-skill/SKILL.md --dry-run  # preview the prompt — no tokens spent
clauditor propose-eval .claude/skills/my-skill/SKILL.md            # LLM writes the eval spec
clauditor validate    .claude/skills/my-skill/SKILL.md             # run Layer 1 against it

--dry-run prints the prompt clauditor would send to the LLM without making any API call — a cost-free preview so you can iterate on inputs before spending tokens.

To run Layer 3, add grading_criteria to the spec and swap validate for grade.

Installing the /clauditor slash command

If you use Claude Code interactively, you can type /clauditor <skill> in the prompt instead of running CLI commands — Claude reads the eval spec, runs the checks, and shows you results in-line. This is optional; the CLI works without it.

From your project root, uv run clauditor setup creates a symlink at .claude/skills/clauditor pointing at the bundled Claude Code skill; pip install --upgrade clauditor-eval then picks up skill updates automatically. Restart Claude Code once if .claude/skills/ did not exist before.

Flags and details

--unlink — remove the /clauditor symlink. Refuses symlinks not pointing at the installed clauditor package, so it won't touch user-authored skills.
--force — overwrite an existing file or symlink at .claude/skills/clauditor.
--project-dir PATH — override project-root detection (default walks up for .git/ or .claude/).

Edits under .claude/skills/clauditor/ are hot-reloaded by Claude Code. uv run clauditor doctor reports the symlink's health (absent / installed / stale / wrong-target / unmanaged).
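The default project-root detection mentioned under --project-dir (walking up until a .git/ or .claude/ directory appears) can be pictured with the sketch below. This is only an illustration of the walk-up idea; the function name and the None return when nothing is found are assumptions, and clauditor's real logic may differ.

```python
from pathlib import Path
from typing import Optional
import tempfile

# Hypothetical illustration of "walk up for .git/ or .claude/".
def find_project_root(start: Path) -> Optional[Path]:
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for directory in [start, *start.parents]:
        if (directory / ".git").exists() or (directory / ".claude").exists():
            return directory
    return None

# Demonstration: a nested working directory inside a project with .claude/.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / ".claude").mkdir()
    nested = root / "src" / "pkg"
    nested.mkdir(parents=True)
    detected = find_project_root(nested)
    expected_root = root

print(detected == expected_root)
```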

Using /clauditor in Claude Code

Invoke the slash command with a skill path. Claude locates the eval spec, runs L1, and asks before spending tokens on L3. If no sibling <skill>.eval.json exists, Claude offers clauditor propose-eval as an LLM-assisted bootstrap (with a cost-free --dry-run preview) before grading. If L3 reports failing criteria, Claude offers clauditor suggest, which proposes a unified diff of SKILL.md edits motivated by the specific failing criterion ids.

/clauditor .claude/commands/my-skill.md

Full reference: docs/skill-usage.md.

Quick Start

A new skill goes from "untested" to "covered" in four steps: clauditor capture records a real run, clauditor propose-eval bootstraps a full three-layer spec from the SKILL.md plus that capture, clauditor validate tightens L1 assertions, then the spec wires into pytest for regression coverage.

clauditor capture my-skill -- "initial context"   # save real output → tests/eval/captured/
clauditor propose-eval .claude/commands/my-skill.md  # LLM writes the spec from SKILL.md + capture
clauditor validate .claude/commands/my-skill.md
clauditor validate .claude/commands/my-skill.md --json  # CI mode

Covered in the full reference: the capture command and interactive-skill limitations, propose-eval options, pytest fixtures (clauditor_runner, clauditor_asserter, clauditor_spec). Full reference: docs/quick-start.md.

Three Layers of Validation

L1 catches shape regressions for free, L2 uses Haiku to validate structured fields, L3 uses Sonnet to grade against a rubric — all three drive off the same eval spec:

{"assertions": [...], "sections": [...], "grading_criteria": [...]}

Full reference: docs/layers.md.

LLM-assisted skill improvement (`clauditor suggest`)

When clauditor grade returns failing L3 criteria, clauditor suggest reads that iteration's grading.json, asks Sonnet to propose minimal SKILL.md edits keyed to the failing criterion ids, and writes a unified diff plus a JSON sidecar with motivated_by, anchor, confidence, and per-edit rationale. Every proposal is hard-validated so its anchor appears exactly once in the target SKILL.md before anything lands on disk — no silent drift, no blind patches.

clauditor grade .claude/skills/my-skill/SKILL.md      # produces grading.json
clauditor suggest .claude/skills/my-skill/SKILL.md    # reads latest grading.json
# → unified diff on stdout; sidecar at .clauditor/suggestions/my-skill-<ts>.{diff,json}

Review, git apply (or hand-edit), then re-run clauditor grade to measure the score delta. The sidecar is stable (schema_version: 1) for downstream tooling.
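The anchor check described above (an anchor must appear exactly once in the target SKILL.md) is easy to picture in code. The following is a sketch of that rule only, not clauditor's validator; the function name and sample text are made up, and str.count counts non-overlapping occurrences.

```python
# Hypothetical sketch of the "anchor appears exactly once" rule.
def anchor_is_safe(skill_text: str, anchor: str) -> bool:
    # An empty anchor is rejected; otherwise require exactly one occurrence.
    return bool(anchor) and skill_text.count(anchor) == 1

# Made-up SKILL.md content for the demonstration.
skill_text = "## Steps\nSearch for venues.\n## Output\nList venues.\n"

unique_anchor = anchor_is_safe(skill_text, "## Output")      # occurs once
ambiguous_anchor = anchor_is_safe(skill_text, "venues")      # occurs twice
missing_anchor = anchor_is_safe(skill_text, "## Missing")    # occurs zero times
print(unique_anchor, ambiguous_anchor, missing_anchor)
```

Rejecting both the zero-match and multi-match cases is what keeps an edit from being applied at the wrong place in the file.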

Covered in the full reference: traceability via motivated_by, the anchor-safety contract, sidecar field-by-field reference, --from-iteration, --with-transcripts, --model, --json, and the full worked walkthrough. Full reference: docs/skill-usage.md#proposing-skill-improvements.

CLI Reference

Stable exit-code contract (0 = pass, 1 = skill failed, 2 = input error, 3 = Anthropic error). grade auto-increments iteration slots under .clauditor/iteration-N/<skill>/ and appends metrics to history.jsonl.

clauditor init <skill.md>             # Starter eval.json
clauditor propose-eval <skill.md>     # LLM-assisted EvalSpec bootstrap
clauditor lint <skill.md>             # Static agentskills.io spec conformance
clauditor validate <skill.md>         # Layer 1 assertions
clauditor grade <skill.md>            # Layer 3 quality grading
clauditor compare --skill <s> --from 1 --to 2  # Diff iterations
clauditor trend <skill> --metric total.total   # History + sparkline
clauditor badge <skill.md>            # Shields.io endpoint JSON for README embed

Covered in the full reference: every subcommand flag (--variance, --iteration, --diff, …), exit codes, history.jsonl shape, clauditor trend metric paths. Full reference: docs/cli-reference.md.
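A CI wrapper can lean on the exit-code contract above. The sketch below maps the four documented codes to labels and shows one way to invoke clauditor validate with --json through subprocess; the helper names are hypothetical, and the shape of the JSON output is not assumed here.

```python
import subprocess

# The four codes come from the exit-code contract stated in this README.
EXIT_MEANINGS = {
    0: "pass",
    1: "skill failed",
    2: "input error",
    3: "Anthropic error",
}

def describe_exit(code: int) -> str:
    # Anything outside the documented contract is reported as unexpected.
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def run_validate(skill_path: str) -> int:
    # Hypothetical wrapper: requires clauditor on PATH; returns its exit code.
    completed = subprocess.run(
        ["clauditor", "validate", skill_path, "--json"],
        capture_output=True,
        text=True,
    )
    return completed.returncode

# Example (not executed here):
#   code = run_validate(".claude/skills/my-skill/SKILL.md")
#   print(describe_exit(code))
print(describe_exit(2))
```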

Pytest Integration

def test_my_skill(clauditor_runner, clauditor_asserter):
    result = clauditor_runner.run("my-skill", '"San Jose, CA"')
    clauditor_asserter(result).assert_contains("Results")

Full reference: docs/pytest-plugin.md.

Eval Spec Format

A <skill-name>.eval.json lives next to the skill's .md file and drives all three layers. In plain English, it lists: what input to test the skill with, structural rules the output must satisfy (assertions), fields the output should contain (sections + tiers, used by Layer 2), and rubric questions for the LLM judge (grading criteria, used by Layer 3).

{
  "skill_name": "find-kid-activities",
  "test_args": "\"Cupertino, CA\" --ages 4-6",

  "assertions": [
    {"id": "has_venues", "type": "contains", "needle": "Venues"}
  ],

  "sections": [
    {
      "name": "Venues",
      "tiers": [
        {
          "label": "default",
          "min_entries": 3,
          "fields": [
            {"id": "v_name", "name": "name", "required": true}
          ]
        }
      ]
    }
  ],

  "grading_criteria": [
    {"id": "distance_ok", "criterion": "Are all venues within the specified distance?"}
  ]
}

In this example: test_args is the prompt clauditor passes to the skill. The single L1 assertion checks the output literally contains the word "Venues". The sections block tells Layer 2 to find a "Venues" section with at least 3 entries, each with a name field. The grading_criteria block gives Layer 3 a yes/no question to grade the output on.
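Before handing a spec to clauditor, it can be handy to sanity-check it yourself. The sketch below rebuilds the example spec as a Python dict, reports which layers it provides input for, and looks for repeated ids. The helper names are hypothetical, and treating duplicate ids as a problem is an assumption of this sketch rather than a documented clauditor rule.

```python
from collections import Counter

# The example spec from this section, written as a Python dict.
spec = {
    "skill_name": "find-kid-activities",
    "test_args": '"Cupertino, CA" --ages 4-6',
    "assertions": [{"id": "has_venues", "type": "contains", "needle": "Venues"}],
    "sections": [
        {
            "name": "Venues",
            "tiers": [
                {
                    "label": "default",
                    "min_entries": 3,
                    "fields": [{"id": "v_name", "name": "name", "required": True}],
                }
            ],
        }
    ],
    "grading_criteria": [
        {"id": "distance_ok", "criterion": "Are all venues within the specified distance?"}
    ],
}

def layers_configured(spec: dict) -> dict:
    # assertions feed Layer 1, sections feed Layer 2, grading_criteria feed Layer 3.
    return {
        "L1": bool(spec.get("assertions")),
        "L2": bool(spec.get("sections")),
        "L3": bool(spec.get("grading_criteria")),
    }

def duplicate_ids(spec: dict) -> list:
    # Collect every "id" shown in the example shape and report repeats.
    ids = [a["id"] for a in spec.get("assertions", [])]
    ids += [c["id"] for c in spec.get("grading_criteria", [])]
    for section in spec.get("sections", []):
        for tier in section.get("tiers", []):
            ids += [f["id"] for f in tier.get("fields", [])]
    return sorted(i for i, n in Counter(ids).items() if n > 1)

layers = layers_configured(spec)
duplicates = duplicate_ids(spec)
print(layers, duplicates)
```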

Optional blocks (input_files, output_files, variance, trigger_tests) add staging, file-based output capture, variance measurement, and trigger precision.

Covered in the full reference: the full eval-spec JSON shape, input_files staging rules, output_file / output_files capture, and the format validation DSL (phone_us, url, domain, … or inline regex). Full reference: docs/eval-spec-reference.md.

Alignment with agentskills.io

clauditor implements (and extends) the workflow at agentskills.io/skill-creation/evaluating-skills:

| agentskills.io concept | clauditor |
| --- | --- |
| Test case (prompt + expected + files) | `.eval.json` with `test_args`, `input_files`, `sections`, `grading_criteria` |
| Deterministic assertions | Layer 1: `assertions.py`, `FORMAT_REGISTRY` (20 types) |
| LLM-judged structural checks | Layer 2: `grader.py`, tiered schema extraction |
| Rubric quality grading | Layer 3: `quality_grader.py`, per-criterion scoring + variance |
| Regression + longitudinal history | `clauditor compare`, `.clauditor/history.jsonl`, `clauditor trend --metric <dotted.path>` |
| Per-iteration workspace | `.clauditor/iteration-N/<skill>/` with sidecars + `run-*/` transcripts |

Beyond the spec: trigger precision testing, tiered extraction, pytest plugin, input_files staging, blind A/B judge, baseline pair runs, transcript capture, LLM-driven skill improvement proposer (clauditor suggest), LLM-assisted EvalSpec bootstrap (clauditor propose-eval), Pro/Max subscription-auth option (--no-api-key) for research-heavy skills that exceed the API-tier rate limit, static spec-conformance check (clauditor lint). Out of scope: human-in-the-loop feedback capture.

Note: --no-api-key only affects the subprocess; the six LLM-mediated commands (grade, propose-eval, suggest, triggers, extract, compare --blind) route their own Anthropic call through a pluggable transport that accepts either ANTHROPIC_API_KEY or a claude CLI subscription by default. See Authentication and API Keys.

Skill compatibility

clauditor works for most skills out of the box. A few patterns need a workaround or aren't supported yet:

Skills with parallel sub-tasks (the Task(run_in_background=true) pattern): pass --sync-tasks to force them to run sequentially. Output capture works correctly, but the async behavior itself (race conditions, late-arriving results) is not tested — you're evaluating a slightly different execution model than what ships.
Skills that ask the user mid-run (e.g. AskUserQuestion to clarify intent): not supported directly — clauditor runs skills non-interactively, so the question never gets an answer and the run hangs. The fix is usually to take all parameters in the initial prompt; see the worked before/after example and the not_contains AskUserQuestion regression assertion in docs/skill-usage.md#recipe-skills-that-ask-the-user-mid-run, with examples/.claude/skills/find-kid-activities/SKILL.eval.json as the canonical anchor.
Skills whose correctness depends on async timing: cannot be tested accurately yet. Blocked on an upstream Claude Code feature.

Technical detail and upstream tracking

clauditor invokes skills through claude -p (non-interactive print mode), which is a strict subset of the interactive Claude Code runtime.

Works: sequential Task calls, parallel tool calls in the parent turn, every standard tool (WebSearch, WebFetch, Bash, Read, Write, Edit).
Works with --sync-tasks: skills using Task(run_in_background=true). The flag sets CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1 in the subprocess env, resolving the #97 output-truncation case.
Loud failure today: skills whose correctness depends on true async semantics. This is blocked on upstream Claude Code gaining headless background-task polling, tracked in anthropics/claude-code#52917 and catalogued in docs/adr/transport-research-103.md.

Full matrix and refactoring recipes: docs/skill-usage.md#skill-compatibility.

Authentication and API Keys

Do I need an Anthropic API key?

Just running Layer 1 checks (validate, lint, init, capture) → no key needed. These are deterministic and never call an LLM.
You have a Claude Pro/Max subscription and the claude CLI installed → no API key needed. clauditor's default auto transport detects the CLI on PATH and uses your subscription auth for grading. Pass --transport cli to be explicit.
Otherwise → set ANTHROPIC_API_KEY (get one at console.anthropic.com) before running grade, extract, propose-eval, suggest, triggers, or compare --blind.

The six LLM-mediated commands above route their Anthropic call through a pluggable transport — either the HTTP SDK (--transport api) or a subprocess to the local claude CLI (--transport cli). The default auto setting picks CLI when available, else API. Full reference: docs/transport-architecture.md.
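The auto rule described above (CLI when the claude binary is available, else API) can be sketched as follows. This is an illustration only: pick_transport is a hypothetical name, and the injectable which parameter exists purely so the example can be exercised without a real claude install.

```python
import shutil
from typing import Callable, Optional

# Hypothetical sketch of the "auto picks CLI when available, else API" rule.
def pick_transport(
    requested: str = "auto",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    if requested in ("api", "cli"):
        # An explicit choice wins over detection.
        return requested
    if requested != "auto":
        raise ValueError(f"unknown transport: {requested}")
    # Detection: is a `claude` executable on PATH?
    return "cli" if which("claude") else "api"

# Fake lookups stand in for machines with and without the CLI installed.
with_cli = pick_transport("auto", which=lambda name: "/usr/local/bin/claude")
without_cli = pick_transport("auto", which=lambda name: None)
explicit = pick_transport("api", which=lambda name: "/usr/local/bin/claude")
print(with_cli, without_cli, explicit)
```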

clauditor also supports multi-provider grading: pass --grading-provider {anthropic,openai,auto} (or set CLAUDITOR_GRADING_PROVIDER / EvalSpec.grading_provider) to route the LLM-grader call through the OpenAI SDK with OPENAI_API_KEY instead. Under the default auto, clauditor infers the provider from the grading_model prefix (claude-* → anthropic, gpt-* / o[0-9]+* → openai). When grading_model is unset (the default), auto falls back to anthropic — the subscription-first historical default — so a spec with no provider/model fields grades against Claude Sonnet exactly as it did before multi-provider support landed.
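The prefix rules for auto can be written down in a few lines. The sketch below follows the mapping stated above; raising ValueError for an unrecognized prefix is an assumption of this example (the README does not say what clauditor does in that case), and the model names in the demonstration are placeholders rather than real model identifiers.

```python
import re
from typing import Optional

# Sketch of provider inference under "auto", following the stated prefix rules.
def infer_provider(grading_model: Optional[str]) -> str:
    if not grading_model:
        # Unset model: fall back to anthropic, as described above.
        return "anthropic"
    if grading_model.startswith("claude-"):
        return "anthropic"
    # gpt-* or o[0-9]+* (the letter "o" followed by one or more digits).
    if grading_model.startswith("gpt-") or re.match(r"o[0-9]+", grading_model):
        return "openai"
    # Assumption for this sketch: unknown prefixes are rejected.
    raise ValueError(f"cannot infer provider from model name: {grading_model}")

# Placeholder model names, used only to exercise each branch.
inferred = {
    name: infer_provider(name)
    for name in (None, "claude-example", "gpt-example", "o3-example")
}
print(inferred)
```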

Running clauditor grade <skill> --transport cli is the one-liner for subscription auth end-to-end: it implicitly strips ANTHROPIC_API_KEY / ANTHROPIC_AUTH_TOKEN from the skill subprocess env, so both the grader and the skill use subscription auth. Pass --transport api to keep the keys.

Running skills under a non-Claude harness. Clauditor can also drive your skill through the OpenAI Codex CLI instead of Claude Code: pass --harness codex (or set CLAUDITOR_HARNESS / EvalSpec.harness) to validate, grade, capture, or run. Codex needs CODEX_API_KEY or OPENAI_API_KEY in the subprocess env; the default auto harness picks Claude when claude is on PATH, else Codex. The harness axis is independent of the grader provider — an eval can run under Codex while the L3 grader still calls Claude (or vice versa). Full reference: docs/codex-harness.md.

Cost and observability

Every clauditor grade iteration writes a context.json sidecar capturing the harness, provider, model, sandbox mode, reasoning-token count, and an estimated cost_usd for the grader calls. clauditor trend and clauditor audit aggregate across iterations and refuse to silently average across mismatched harness or provider axes — pass --cross-harness / --cross-provider to opt in. Full reference: docs/cost-tracking.md and docs/audit-trend-workflow.md.
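The "refuse to silently average across mismatched axes" behavior can be illustrated with a small sketch. Everything here is hypothetical apart from the cost_usd name: the "harness" and "provider" keys, the records, and the choice of averaging cost as the aggregate are assumptions for the example, not the documented context.json schema.

```python
from statistics import mean

# Hypothetical sketch: aggregate only when the axes match, unless opted in.
def average_cost(records, cross_harness=False, cross_provider=False):
    harnesses = {r["harness"] for r in records}
    providers = {r["provider"] for r in records}
    if len(harnesses) > 1 and not cross_harness:
        raise ValueError(f"mixed harnesses {sorted(harnesses)}; opt in to aggregate")
    if len(providers) > 1 and not cross_provider:
        raise ValueError(f"mixed providers {sorted(providers)}; opt in to aggregate")
    return mean(r["cost_usd"] for r in records)

# Made-up per-iteration records with mismatched harnesses.
records = [
    {"harness": "claude", "provider": "anthropic", "cost_usd": 0.5},
    {"harness": "codex", "provider": "anthropic", "cost_usd": 1.5},
]

try:
    average_cost(records)
    refused = False
except ValueError:
    refused = True

opted_in = average_cost(records, cross_harness=True)
print(refused, opted_in)
```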

Reference docs

docs/architecture.md — how clauditor works under the hood (mermaid diagrams of the grade flow)
docs/quick-start.md — tutorial walkthrough from init → validate → pytest
docs/layers.md — the three-layer framework in depth
docs/cli-reference.md — full subcommand + flag + exit-code reference
docs/eval-spec-reference.md — complete .eval.json schema
docs/pytest-plugin.md — pytest fixtures and options
docs/skill-usage.md — using /clauditor in Claude Code
docs/skills.md — catalog of skills shipped with this repo, with live badge status
docs/badges.md — shields.io badges from iteration sidecars (clauditor badge)
docs/stream-json-schema.md — claude stream-json parser contract
docs/codex-stream-schema.md — codex NDJSON parser contract (sibling of stream-json-schema)
docs/transport-architecture.md — CLI vs SDK transport, auth-state matrix, precedence, migration
docs/codex-harness.md — running skills under the OpenAI Codex CLI
docs/cost-tracking.md — cost_usd, reasoning tokens, the pricing table, and per-iteration context.json
docs/audit-trend-workflow.md — measuring regressions over iterations with audit / trend / compare / badge
CONTRIBUTING.md — maintainer pre-release dogfood gate + contribution workflow

License

Apache 2.0

Release history

0.1.3

May 26, 2026

This version

0.1.2

May 22, 2026

0.1.1

Apr 26, 2026

0.1.0

Apr 25, 2026

Download files

Source Distribution

clauditor_eval-0.1.2.tar.gz (804.1 kB view details)

Uploaded May 22, 2026 Source

Built Distribution

clauditor_eval-0.1.2-py3-none-any.whl (368.6 kB view details)

Uploaded May 22, 2026 Python 3

File details

Details for the file clauditor_eval-0.1.2.tar.gz.

File metadata

Download URL: clauditor_eval-0.1.2.tar.gz
Upload date: May 22, 2026
Size: 804.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clauditor_eval-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`3d4a4f207cdea6c3e5e27d17e6098b728263d164feb22c06d7e29aec7c9660b6`
MD5	`fb4b8e812c98fefaa61a3309d6355223`
BLAKE2b-256	`ffcc28671076a99805cd95f620d1014d84e1061629b9aead51f1fa4314387b98`

Provenance

The following attestation bundles were made for clauditor_eval-0.1.2.tar.gz:

Publisher: publish.yml on wjduenow/clauditor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: clauditor_eval-0.1.2.tar.gz
- Subject digest: 3d4a4f207cdea6c3e5e27d17e6098b728263d164feb22c06d7e29aec7c9660b6
- Sigstore transparency entry: 1598045633
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: wjduenow/clauditor@5f225e44f03e411054dfe11eca7a70862a6bd1e0
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/wjduenow
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5f225e44f03e411054dfe11eca7a70862a6bd1e0
- Trigger Event: release

File details

Details for the file clauditor_eval-0.1.2-py3-none-any.whl.

File metadata

Download URL: clauditor_eval-0.1.2-py3-none-any.whl
Upload date: May 22, 2026
Size: 368.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for clauditor_eval-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`05ee3884b92477cb30c7c0713f9b5ae145a3905b74e7b4e5be563957db395898`
MD5	`3ca335b67b49879a43f65684d243f0b1`
BLAKE2b-256	`ebd41088b73c1a698062a61cd6747012ca57bc95d422492dcb8e78acfea08712`

Provenance

The following attestation bundles were made for clauditor_eval-0.1.2-py3-none-any.whl:

Publisher: publish.yml on wjduenow/clauditor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: clauditor_eval-0.1.2-py3-none-any.whl
- Subject digest: 05ee3884b92477cb30c7c0713f9b5ae145a3905b74e7b4e5be563957db395898
- Sigstore transparency entry: 1598045780
- Sigstore integration time: May 22, 2026
Source repository:
- Permalink: wjduenow/clauditor@5f225e44f03e411054dfe11eca7a70862a6bd1e0
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/wjduenow
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@5f225e44f03e411054dfe11eca7a70862a6bd1e0
- Trigger Event: release

clauditor-eval 0.1.2
