Deterministic quality scorer for AI agent instruction files. Multi-format (SKILL.md, CLAUDE.md, .cursorrules, AGENTS.md), 8-dimension scoring with security, anti-gaming detection, zero dependencies.

Schliff

Your AI instructions are silently degrading. Schliff catches it.

Deterministic quality scoring for CLAUDE.md, SKILL.md, .cursorrules, AGENTS.md, and system prompts. No LLM, no API key — same input, same score. Python 3.9+, zero core dependencies (optional schliff[evolve] adds litellm for the evolution loop).

# In Claude Code — installs the skill + slash commands
/plugin marketplace add Zandereins/schliff
/plugin install schliff@schliff

Quick alternative — just the skill, via the open skills registry:

npx skills add Zandereins/schliff

Or the Python CLI, for CI / pre-commit:

pip install schliff

Then run schliff demo for an instant self-contained example (works on any install), or schliff score path/to/SKILL.md on your own file:

$ schliff demo
schliff v7.2.0

  structure      ████████░░   78/100  good
  efficiency     ████░░░░░░   35/100  poor
  composability  ██░░░░░░░░   20/100  poor
  clarity        ██████████  100/100  perfect

  Structural Score  █████░░░░░░░░░░░░░░░  23.4/100  [F]
  ℹ Scored 4/7 dimensions — the score can't exceed 42% until the rest are measured. Run /schliff:init to add an eval suite and score: triggers, quality, edges.
  → 7 deterministic fixes available. Run `/schliff:auto` in Claude Code to apply.

  Tokens: 100 / 1,000 (ok)

  This is a deliberately bad skill. Try schliff on your own skills!

A real optimization

@wan-huiyan ran schliff on the 1,331-line SKILL.md for agent-review-panel, a multi-agent code-review skill. Two optimization rounds later: 340 lines, 75% fewer tokens, structure 65 → 100, composability 56 → 91. A/B tested on a 1,132-line document — identical review quality with a quarter of the tokens.

| Skill | Score | Rounds | Author |
| --- | --- | --- | --- |
| agent-review-panel | 75 [B] → 85.6 [A] | 2 | @wan-huiyan |
| shieldclaw (OpenClaw) | 68.3 [C] → 94.6 [A] | 1 | @Zandereins |

Score yours: schliff score path/to/SKILL.md · share what you find


Seen in the wild

A root CLAUDE.md written for modelcontextprotocol/servers (Anthropic's official MCP reference repo) merged to main on April 17th, 2026. Running schliff on it returned 59.2/100 at 40% weight coverage (pre-v8 scale; full-denominator scoring would place this lower) — a useful measurement of where the file actually needed work and where the scorer was structurally unfair for a project-root document. Full walkthrough →


What the data says

We scored 120 public instruction files across 60 source repos. Mean grade: D. 59% below C. Adding one companion eval suite lifts the mean +22 points.

  • Composability is the real weak spot — mean 30.4/100. Files tell agents what to do, rarely where to stop or hand off
  • No companion eval suite in the corpus — verified 0/60 source repos ship an eval-suite.json, evals/, or any test artifact. Three dimensions stay unmeasured, locking 45% of the score
  • Hedging dilutes intent — efficiency averages 52.8/100. "You might want to consider" is noise
  • Format alone doesn't save you — AGENTS.md averages 64.8, SKILL.md 55.4. Skipping frontmatter costs ~15 points regardless of format

Read the full report → · Reproduce it


What Schliff Catches

| Dimension | Weight | What it catches |
| --- | --- | --- |
| structure | 15% | Missing frontmatter, empty headers, no examples, dead content |
| triggers | 20% | Eval-suite trigger accuracy, false positives, missed activations |
| quality | 20% | Thin assertions, missing feature coverage, low coherence |
| edges | 15% | No edge cases defined, missing categories (invalid, scale, unicode) |
| efficiency | 10% | Hedging, filler words, repetition, low signal-to-noise |
| composability | 10% | Missing scope boundaries, no error behavior, no handoff points |
| clarity | 5% | Contradictions, vague references, ambiguous instructions |
| security | 5% (opt-in) | Hardcoded secrets, unsafe commands, exposed credentials |

Grades: S (≥95) · A (≥85) · B (≥75) · C (≥65) · D (≥50) · E (≥35) · F (<35). Full methodology: docs/SCORING.md
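
To make the arithmetic concrete, here is a minimal sketch of how the weights and grade bands above could combine. It is a reconstruction, not Schliff's actual code. It assumes the composite is the weighted sum of the measured dimensions divided by the total weight of the seven default dimensions (95, because security is opt-in). That assumption reproduces the 23.4 shown by schliff demo. The names WEIGHTS, GRADE_BANDS, composite, and grade are hypothetical.

```python
# Weights copied from the dimension table (security is opt-in and left out).
WEIGHTS = {
    "structure": 15, "triggers": 20, "quality": 20, "edges": 15,
    "efficiency": 10, "composability": 10, "clarity": 5,
}

# Grade thresholds copied from the "Grades" line; anything below 35 is F.
GRADE_BANDS = [(95, "S"), (85, "A"), (75, "B"), (65, "C"), (50, "D"), (35, "E")]


def composite(scores):
    """Weighted sum of measured dimensions over the full default weight.

    ASSUMPTION: unmeasured dimensions contribute zero to the numerator,
    while the denominator always covers all seven default dimensions.
    """
    total_weight = sum(WEIGHTS.values())  # 95
    earned = sum(WEIGHTS[name] * value for name, value in scores.items())
    return round(earned / total_weight, 1)


def grade(score):
    """Map a 0-100 composite onto the letter bands."""
    for threshold, letter in GRADE_BANDS:
        if score >= threshold:
            return letter
    return "F"


# Scores from the `schliff demo` output (no eval suite, so 4 of 7 dimensions).
demo = {"structure": 78, "efficiency": 35, "composability": 20, "clarity": 100}
demo_score = composite(demo)
print(demo_score, grade(demo_score))  # 23.4 F

# Best case with the same four dimensions at 100 each.
ceiling = composite({name: 100 for name in demo})
print(ceiling)  # 42.1
```

Under this assumption the ceiling of 42.1 matches the demo's note that the score cannot exceed 42% until the eval-suite dimensions are measured.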


Quick Start

schliff score path/to/SKILL.md          # score any instruction file
schliff score --url https://github.com/user/repo/blob/main/SKILL.md
schliff suggest path/to/SKILL.md         # ranked fixes with impact estimates
schliff compare skill-v1.md skill-v2.md  # side-by-side comparison
schliff doctor                           # scan all installed skills

schliff suggest returns the top fixes with their estimated point impact:

$ schliff suggest demo/bad-skill/SKILL.md
TOP FIXES (estimated impact):
 1. [ ~25] Create eval-suite.json with trigger test cases (should_trigger: true/false prompts)
 2. [  ~8] Create eval-suite.json with 3+ test cases, each with typed assertions (contains, pattern, excludes, format)
 3. [  +2] Add 'description: <what this skill does and when to use it>' to frontmatter
 4. [  +2] Add handoff points: 'Then use X skill for...', 'If Y, instead use Z skill'
 5. [  +2] Add 'Use this skill when...' (positive scope) AND 'Do NOT use for...' (negative scope) sections

Current: 22.6 [F]  →  Estimated after fixes: ~60.6 [D]

Evolution Engine

Close the loop: score → patch → re-score → stop at plateau. One command.

pip install schliff[evolve]
schliff evolve path/to/SKILL.md
  structure         70 → 100     Frontmatter, examples, concrete commands
  triggers           0 → 100     Description keywords, negative boundaries
  quality            0 →  95     Eval suite generated, assertions added
  edges              0 → 100     Edge cases synthesized
  efficiency        35 →  93     Hedging removed, information density up
  composability     30 →  90     Scope boundaries, error behavior, deps
  clarity           90 → 100     Vague references resolved

  Composite         54 [D] → 98 [S]    18 iterations, 12 kept / 6 reverted

The engine applies deterministic patches first (free, no LLM), then uses an LLM for what rules can't fix — structural reorganization, example generation, edge case synthesis. Only improvements that pass all dimension guards are kept; rejects are reverted automatically.
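
The keep-or-revert loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's implementation. The function evolve, its patience parameter, the toy scorer, and the toy patches are all hypothetical. The "dimension guard" is modeled as "no dimension may drop", and the plateau is modeled as a run of consecutive rejected patches.

```python
from typing import Callable, Dict, Iterable, Tuple

Scores = Dict[str, float]


def composite(scores: Scores) -> float:
    """Unweighted mean, used here only to keep the sketch short."""
    return sum(scores.values()) / len(scores)


def evolve(
    text: str,
    score_fn: Callable[[str], Scores],
    patches: Iterable[Callable[[str], str]],
    patience: int = 3,
) -> Tuple[str, int, int]:
    """Apply patches one by one; keep a patch only if the composite rises
    and no single dimension regresses. Stop after `patience` rejections
    in a row (a stand-in for plateau detection)."""
    best_text, best = text, score_fn(text)
    kept = reverted = stale = 0
    for patch in patches:
        candidate = patch(best_text)
        cand = score_fn(candidate)
        improved = composite(cand) > composite(best)
        guarded = all(cand[dim] >= best[dim] for dim in best)
        if improved and guarded:
            best_text, best = candidate, cand
            kept += 1
            stale = 0
        else:
            reverted += 1  # candidate is discarded; best_text is unchanged
            stale += 1
            if stale >= patience:
                break
    return best_text, kept, reverted


# Toy scorer: rewards frontmatter, penalizes the hedge word "maybe".
def toy_score(text: str) -> Scores:
    return {
        "structure": 100.0 if text.startswith("---") else 0.0,
        "efficiency": max(0.0, 100.0 - 10.0 * text.count("maybe")),
    }


toy_patches = [
    lambda t: "---\nname: x\n---\n" + t,   # adds frontmatter (helps)
    lambda t: t + " maybe",                # adds hedging (hurts, reverted)
    lambda t: t.replace("maybe ", ""),     # removes hedging (helps)
]

final_text, kept, reverted = evolve("maybe do the thing", toy_score, toy_patches)
print(kept, reverted)  # 2 1
```

The guard is the design choice worth noting: a patch that raises the composite by trading one dimension against another is still rejected, which keeps the loop from optimizing a single number at the expense of the rest.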


CI Integration

schliff verify path/to/SKILL.md --min-score 75 --regression
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/Zandereins/schliff
    rev: v7.2.0
    hooks:
      - id: schliff-verify
        args: ['--min-score', '75']

--regression fails the check if the score dropped versus the last successful run (history file in .schliff/history.jsonl). Pair with GitHub Actions to gate PRs on instruction-file quality the same way you gate on test coverage.
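
For pipelines that call the CLI from a script rather than through pre-commit, a thin wrapper is enough. The sketch below uses only the flags documented here (--min-score, --regression) and relies on schliff verify exiting with 0 on pass and non-zero on failure. The helper names build_verify_command and failing_files are hypothetical, and the example file list is made up.

```python
import subprocess
from typing import Callable, Iterable, List


def build_verify_command(path: str, min_score: int = 75,
                         regression: bool = True) -> List[str]:
    """Assemble the `schliff verify` invocation for one file."""
    cmd = ["schliff", "verify", path, "--min-score", str(min_score)]
    if regression:
        cmd.append("--regression")
    return cmd


def failing_files(paths: Iterable[str],
                  run: Callable = subprocess.run) -> List[str]:
    """Run the gate on each file and collect those with a non-zero exit code.

    `run` is injectable so the function can be exercised without the CLI.
    """
    failed = []
    for path in paths:
        result = run(build_verify_command(path))
        if result.returncode != 0:
            failed.append(path)
    return failed


print(build_verify_command("path/to/SKILL.md"))
# ['schliff', 'verify', 'path/to/SKILL.md', '--min-score', '75', '--regression']
```

In a CI job, a non-empty result from failing_files(["CLAUDE.md", "AGENTS.md"]) would be the signal to fail the build.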


Anti-Gaming

Schliff detects score inflation. The benchmark suite catches 6 gaming patterns:

| Gaming attempt | How caught |
| --- | --- |
| Empty headers | Content check: empty sections penalized |
| Keyword stuffing | Dedup + frequency cap |
| Copy-paste examples | Repeated-line detection (94 → 43) |
| Contradictions | "always X" vs "never X" finder |
| Bloated preamble | Signal-to-noise via sqrt density curve |
| Missing scope | 10 composability sub-checks |
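
As an illustration of how a pattern-based detector of this kind can work, here is a minimal sketch of an "always X" vs "never X" finder. It is not Schliff's detector. The regular expression, the function name find_contradictions, and the normalization rules are assumptions chosen for brevity.

```python
import re

# Hypothetical pattern: captures the phrase that follows "always" or "never",
# up to the next punctuation mark or the end of the line.
_RULE = re.compile(
    r"\b(always|never)\s+(.+?)\s*(?:[.;,!?]|$)",
    re.IGNORECASE | re.MULTILINE,
)


def find_contradictions(text):
    """Return phrases that appear after both 'always' and 'never'."""
    always, never = set(), set()
    for match in _RULE.finditer(text):
        # Lowercase and collapse whitespace so trivial variants compare equal.
        phrase = " ".join(match.group(2).lower().split())
        bucket = always if match.group(1).lower() == "always" else never
        bucket.add(phrase)
    return always & never


sample = (
    "Always run the tests.\n"
    "Never run the tests.\n"
    "Always commit small changes"
)
print(sorted(find_contradictions(sample)))  # ['run the tests']
```

A matcher like this only sees literal overlap, so it will miss a contradiction phrased in different words.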

How it compares to other AI-instruction linters

| | agnix | AgentLinter | Schliff |
| --- | --- | --- | --- |
| Approach | 399 rules across 9 tools | 8-dim scoring + secret scan | 7-dim 0–100 composite + evolution loop |
| Stack | Rust (npm/cargo/brew) | Node.js (npx) | Python 3.9+ stdlib |
| Core dependencies | Rust toolchain | npm/node | None (core) |
| Output | Pass/fail rule violations | Score + diagnostics | Composite grade + ranked fixes + auto-improve |
| Evolution loop | | | schliff evolve (54→98 in 18 iter) |
| Anti-gaming detection | | | ✅ 6 detectors |
| CI gate (regression) | via action | via CLI exit | --min-score + --regression |

agnix is great if you want immediate rule-based validation with zero scoring nuance. AgentLinter adds scoring but no evolution. Schliff is the only tool that gives you a deterministic composite you can gate PRs on, plus an evolution engine that closes the loop.

How it differs from autoresearch

Inspired by Karpathy's autoresearch — but Schliff is a linter, not a research loop. schliff score runs in CI without touching the improvement loop.

| | autoresearch | Schliff |
| --- | --- | --- |
| Target | ML training scripts | AI instruction files |
| Patches | 100% LLM | ~32% deterministic, rest LLM |
| Scoring | 1 metric | 7 dimensions + optional runtime |
| Anti-gaming | None | 6 detection vectors |
| Dependencies | ML frameworks | Python 3.9+ stdlib only (core) |

All commands

| Command | Purpose |
| --- | --- |
| schliff demo | See schliff in action instantly |
| schliff score <path> | Score any instruction file |
| schliff score --url <url> | Score a remote file (HTTPS-only) |
| schliff score --tokens | Section-by-section token breakdown |
| schliff suggest <path> | Ranked fixes with estimated impact |
| schliff compare <a> <b> | Side-by-side comparison with deltas |
| schliff diff <path> | Score delta vs. previous commit |
| schliff verify <path> | CI gate: exit 0/1, supports --min-score and --regression |
| schliff doctor | Scan all installed skills |
| schliff badge <path> | Generate markdown badge |
| schliff report <path> | Markdown quality report |
| schliff evolve <path> | Autonomous improvement loop |

Claude Code skills (require bash install.sh):

| Command | Purpose |
| --- | --- |
| /schliff:auto | Autonomous improvement with EMA-based stopping |
| /schliff:init <path> | Bootstrap eval suite + baseline |
| /schliff:analyze | One-shot gap analysis |
| /schliff:mesh | Detect trigger conflicts across skills |
| /schliff:report | Generate shareable report with badge |

Architecture

flowchart TB
    subgraph Scoring ["Scoring Engine (deterministic, no LLM)"]
        SKILL[SKILL.md + eval-suite.json] --> PARSE[Parse & Extract]
        PARSE --> S1[Structure]
        PARSE --> S2[Triggers]
        PARSE --> S3[Quality]
        PARSE --> S4[Edges]
        PARSE --> S5[Efficiency]
        PARSE --> S6[Composability]
        PARSE --> S7[Clarity]
        PARSE --> S8[Security]
        S1 & S2 & S3 & S4 & S5 & S6 & S7 & S8 --> COMPOSITE[Weighted Composite + Grade]
    end

    subgraph Loop ["Evolution Engine (optional, LLM-powered)"]
        COMPOSITE --> GRADIENT[Identify Weakest Dimension]
        GRADIENT --> PATCH[Generate Patch]
        PATCH --> APPLY[Apply + Re-score]
        APPLY -->|improved| KEEP[Keep]
        APPLY -->|worse| REVERT[Revert]
        KEEP & REVERT --> PLATEAU{Plateau?}
        PLATEAU -->|no| GRADIENT
        PLATEAU -->|yes| DONE[Done]
    end

~32% of patches are applied deterministically (confidence=high, single-edit effort); the rest fall back to the LLM, which handles structural reorganization, example generation, and edge case synthesis. (source: skills/schliff/scripts/measure_patch_ratio.py)


When Schliff isn't the right tool

  • LLM-based semantic understanding — Schliff is pattern-based. If you need a model to reason about whether two paragraphs are semantically contradictory (vs. structurally), a scorer like AgentLinter will catch cases schliff misses
  • Creative-writing prompts — the weights and dimensions are opinionated for coding-agent instructions. Applied to persona prompts or creative-writing system prompts, the composite number will be misleading
  • You don't want to write an eval suite — 45% of the score stays unmeasured without one. The remaining 55% (structure, efficiency, composability, clarity) still gives useful signal, but the A/S grades are out of reach

Limitations

The structural score measures file organization, not runtime effectiveness. A skill scoring 95/100 can still produce wrong output — use --runtime scoring for that.

The trigger scorer uses TF-IDF heuristics. Skills with generic domain vocabulary may hit a precision ceiling around 75-80.


Badge

schliff badge path/to/SKILL.md

Schliff: 99 [S]

Contributing

Found a scoring bug? Add a test case and open an issue. Want to improve scoring logic? Edit scoring/*.py, run the tests, PR the diff.

License

MIT


schliff (German) — the finishing cut. "Den letzten Schliff geben" = to give something its final polish.

