Deterministic quality scorer for AI agent instruction files. Multi-format (SKILL.md, CLAUDE.md, .cursorrules, AGENTS.md), 8-dimension scoring with security, anti-gaming detection, zero dependencies.
Project description
Schliff
Your AI instructions are silently degrading. Schliff catches it.
Deterministic quality scoring for CLAUDE.md, SKILL.md, .cursorrules, AGENTS.md, and system prompts. No LLM, no API key — same input, same score. Python 3.9+, zero core dependencies (optional schliff[evolve] adds litellm for the evolution loop).
pip install schliff
schliff score path/to/SKILL.md
$ schliff score demo/bad-skill/SKILL.md
schliff v7.1.1
structure ███████░░░ 70/100 fair
efficiency ████░░░░░░ 35/100 poor
composability ███░░░░░░░ 30/100 poor
clarity █████████░ 90/100 great
Structural Score ███████████░░░░░░░░░ 53.8/100 [D]
⚠ 4/8 dimensions measured (weight coverage: 40%). Unmeasured: triggers, quality, edges
→ 13 deterministic fixes available. Run `/schliff:auto` in Claude Code to apply.
Tokens: 378 / 1,000 (ok)
A real optimization
@wan-huiyan ran schliff on the 1,331-line SKILL.md for agent-review-panel, a multi-agent code-review skill. Two optimization rounds later: 340 lines, 75% fewer tokens, structure 65 → 100, composability 56 → 91. A/B tested on a 1,132-line document — identical review quality with a quarter of the tokens.
| Skill | Score | Rounds | Author |
|---|---|---|---|
| agent-review-panel | 75 [C] → 85.6 [A] | 2 | @wan-huiyan |
| shieldclaw (OpenClaw) | 68 [C] → 94.6 [A] | 1 | @Zandereins |
| demo bad-skill | 54 [D] → 98.3 [S] | 18 auto | @Zandereins |
Score yours: schliff score path/to/SKILL.md — share what you find
Seen in the wild
A root CLAUDE.md written for modelcontextprotocol/servers (Anthropic's official MCP reference repo) merged to main on April 17th, 2026. Running schliff on it returned 59.2/100 at 40% weight coverage — a useful measurement of where the file actually needed work and where the scorer was structurally unfair for a project-root document. Full walkthrough →
What the data says
We scored 120 public instruction files across 60 source repos. Mean grade: D. 59% below C. Adding one companion eval suite lifts the mean +22 points.
- Composability is the real weak spot — mean 30.4/100. Files tell agents what to do, rarely where to stop or hand off
- No companion eval suite in the corpus — verified 0/60 source repos ship an
eval-suite.json,evals/, or any test artifact. Three dimensions stay unmeasured, locking 45% of the score - Hedging dilutes intent — efficiency averages 52.8/100. "You might want to consider" is noise
- Format alone doesn't save you — AGENTS.md averages 64.8, SKILL.md 55.4. Skipping frontmatter costs ~15 points regardless of format
Read the full report → · Reproduce it
What Schliff Catches
| Dimension | Weight | What it catches |
|---|---|---|
| structure | 15% | Missing frontmatter, empty headers, no examples, dead content |
| triggers | 20% | Eval-suite trigger accuracy, false positives, missed activations |
| quality | 20% | Thin assertions, missing feature coverage, low coherence |
| edges | 15% | No edge cases defined, missing categories (invalid, scale, unicode) |
| efficiency | 10% | Hedging, filler words, repetition, low signal-to-noise |
| composability | 10% | Missing scope boundaries, no error behavior, no handoff points |
| clarity | 5% | Contradictions, vague references, ambiguous instructions |
| security | 5% | (opt-in) Hardcoded secrets, unsafe commands, exposed credentials |
Grades: S (≥95) · A (≥85) · B (≥75) · C (≥65) · D (≥50) · E (≥35) · F (<35). Full methodology: docs/SCORING.md
Quick Start
schliff score path/to/SKILL.md # score any instruction file
schliff score --url https://github.com/user/repo/blob/main/SKILL.md
schliff suggest path/to/SKILL.md # ranked fixes with impact estimates
schliff compare skill-v1.md skill-v2.md # side-by-side comparison
schliff doctor # scan all installed skills
schliff suggest returns the top fixes with their estimated point impact:
$ schliff suggest demo/bad-skill/SKILL.md
TOP FIXES (estimated impact):
1. [ ~25] Create eval-suite.json with trigger test cases (should_trigger: true/false prompts)
2. [ ~8] Create eval-suite.json with 3+ test cases, each with typed assertions
3. [ +2] Add 'description: <what this skill does and when to use it>' to frontmatter
4. [ +2] Add handoff points: 'Then use X skill for...', 'If Y, instead use Z skill'
5. [ +2] Add 'Use this skill when...' + 'Do NOT use for...' scope sections
Current: 53.8 [D] → Estimated after fixes: ~91.8 [A]
Evolution Engine
Close the loop: score → patch → re-score → stop at plateau. One command.
pip install schliff[evolve]
schliff evolve path/to/SKILL.md
structure 70 → 100 Frontmatter, examples, concrete commands
triggers 0 → 100 Description keywords, negative boundaries
quality 0 → 95 Eval suite generated, assertions added
edges 0 → 100 Edge cases synthesized
efficiency 35 → 93 Hedging removed, information density up
composability 30 → 90 Scope boundaries, error behavior, deps
clarity 90 → 100 Vague references resolved
Composite 54 [D] → 98 [S] 18 iterations, 12 kept / 6 reverted
The engine applies deterministic patches first (free, no LLM), then uses an LLM for what rules can't fix — structural reorganization, example generation, edge case synthesis. Only improvements that pass all dimension guards are kept; rejects are reverted automatically.
CI Integration
schliff verify path/to/SKILL.md --min-score 75 --regression
# .pre-commit-config.yaml
repos:
- repo: https://github.com/Zandereins/schliff
rev: v7.1.1
hooks:
- id: schliff-verify
args: ['--min-score', '75']
--regression fails the check if the score dropped versus the last successful run (history file in .schliff/history.jsonl). Pair with GitHub Actions to gate PRs on instruction-file quality the same way you gate on test coverage.
Anti-Gaming
Schliff detects score inflation. The benchmark suite catches 6 gaming patterns:
| Gaming attempt | How caught |
|---|---|
| Empty headers | Content check — empty sections penalized |
| Keyword stuffing | Dedup + frequency cap |
| Copy-paste examples | Repeated-line detection (94 → 43) |
| Contradictions | "always X" vs "never X" finder |
| Bloated preamble | Signal-to-noise via sqrt density curve |
| Missing scope | 10 composability sub-checks |
How it compares to other AI-instruction linters
| agnix | AgentLinter | Schliff | |
|---|---|---|---|
| Approach | 399 rules across 9 tools | 8-dim scoring + secret scan | 7-dim 0–100 composite + evolution loop |
| Stack | Rust (npm/cargo/brew) | Node.js (npx) | Python 3.9+ stdlib |
| Core dependencies | Rust toolchain | npm/node | None (core) |
| Output | Pass/fail rule violations | Score + diagnostics | Composite grade + ranked fixes + auto-improve |
| Evolution loop | — | — | ✅ schliff evolve (54→98 in 18 iter) |
| Anti-gaming detection | — | — | ✅ 6 detectors |
| CI gate (regression) | via action | via CLI exit | ✅ --min-score + --regression |
agnix is great if you want immediate rule-based validation with zero scoring nuance. AgentLinter adds scoring but no evolution. Schliff is the only tool that gives you a deterministic composite you can gate PRs on, plus an evolution engine that closes the loop.
How it differs from autoresearch
Inspired by Karpathy's autoresearch — but Schliff is a linter, not a research loop. schliff score runs in CI without touching the improvement loop.
| autoresearch | Schliff | |
|---|---|---|
| Target | ML training scripts | AI instruction files |
| Patches | 100% LLM | 60-70% deterministic, 30-40% LLM |
| Scoring | 1 metric | 7 dimensions + optional runtime |
| Anti-gaming | None | 6 detection vectors |
| Dependencies | ML frameworks | Python 3.9+ stdlib only (core) |
All commands
| Command | Purpose |
|---|---|
schliff demo |
See schliff in action instantly |
schliff score <path> |
Score any instruction file |
schliff score --url <url> |
Score a remote file (HTTPS-only) |
schliff score --tokens |
Section-by-section token breakdown |
schliff suggest <path> |
Ranked fixes with estimated impact |
schliff compare <a> <b> |
Side-by-side comparison with deltas |
schliff diff <path> |
Score delta vs. previous commit |
schliff verify <path> |
CI gate — exit 0/1, --min-score, --regression |
schliff doctor |
Scan all installed skills |
schliff badge <path> |
Generate markdown badge |
schliff report <path> |
Markdown quality report |
schliff evolve <path> |
Autonomous improvement loop |
Claude Code skills (require bash install.sh):
| Command | Purpose |
|---|---|
/schliff:auto |
Autonomous improvement with EMA-based stopping |
/schliff:init <path> |
Bootstrap eval suite + baseline |
/schliff:analyze |
One-shot gap analysis |
/schliff:mesh |
Detect trigger conflicts across skills |
/schliff:report |
Generate shareable report with badge |
Architecture
flowchart TB
subgraph Scoring ["Scoring Engine (deterministic, no LLM)"]
SKILL[SKILL.md + eval-suite.json] --> PARSE[Parse & Extract]
PARSE --> S1[Structure]
PARSE --> S2[Triggers]
PARSE --> S3[Quality]
PARSE --> S4[Edges]
PARSE --> S5[Efficiency]
PARSE --> S6[Composability]
PARSE --> S7[Clarity]
PARSE --> S8[Security]
S1 & S2 & S3 & S4 & S5 & S6 & S7 & S8 --> COMPOSITE[Weighted Composite + Grade]
end
subgraph Loop ["Evolution Engine (optional, LLM-powered)"]
COMPOSITE --> GRADIENT[Identify Weakest Dimension]
GRADIENT --> PATCH[Generate Patch]
PATCH --> APPLY[Apply + Re-score]
APPLY -->|improved| KEEP[Keep]
APPLY -->|worse| REVERT[Revert]
KEEP & REVERT --> PLATEAU{Plateau?}
PLATEAU -->|no| GRADIENT
PLATEAU -->|yes| DONE[Done]
end
60-70% of patches follow deterministic rules. The LLM handles structural reorganization, example generation, edge case synthesis.
When Schliff isn't the right tool
- LLM-based semantic understanding — Schliff is pattern-based. If you need a model to reason about whether two paragraphs are semantically contradictory (vs. structurally), a scorer like AgentLinter will catch cases schliff misses
- Creative-writing prompts — the weights and dimensions are opinionated for coding-agent instructions. Applied to persona prompts or creative-writing system prompts, the composite number will be misleading
- You don't want to write an eval suite — 45% of the score stays unmeasured without one. The remaining 55% (structure, efficiency, composability, clarity) still gives useful signal, but the A/S grades are out of reach
Limitations
The structural score measures file organization, not runtime effectiveness. A skill scoring 95/100 can still produce wrong output — use --runtime scoring for that.
The trigger scorer uses TF-IDF heuristics. Skills with generic domain vocabulary may hit a precision ceiling around 75-80.
Badge
schliff badge path/to/SKILL.md
Contributing
Found a scoring bug? Add a test case and open an issue. Want to improve scoring logic? Edit scoring/*.py, run the tests, PR the diff.
License
MIT
schliff (German) — the finishing cut. "Den letzten Schliff geben" = to give something its final polish.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file schliff-7.1.1.tar.gz.
File metadata
- Download URL: schliff-7.1.1.tar.gz
- Upload date:
- Size: 319.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4ca999b6c4a8c20c604c525e73e34285eea41ef957f48ec186058b59edb69fe7
|
|
| MD5 |
41c0601dd568df8d8550d04373522377
|
|
| BLAKE2b-256 |
4aebbe4751fd46a40e202262fb9c16f37d861fb6bc0a872f1ee34acab91b7a60
|
Provenance
The following attestation bundles were made for schliff-7.1.1.tar.gz:
Publisher:
publish.yml on Zandereins/schliff
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
schliff-7.1.1.tar.gz -
Subject digest:
4ca999b6c4a8c20c604c525e73e34285eea41ef957f48ec186058b59edb69fe7 - Sigstore transparency entry: 1339440420
- Sigstore integration time:
-
Permalink:
Zandereins/schliff@79cf468f5b29c49cdd653e5c92dcc28a266d0f61 -
Branch / Tag:
refs/tags/v7.1.1 - Owner: https://github.com/Zandereins
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@79cf468f5b29c49cdd653e5c92dcc28a266d0f61 -
Trigger Event:
release
-
Statement type:
File details
Details for the file schliff-7.1.1-py3-none-any.whl.
File metadata
- Download URL: schliff-7.1.1-py3-none-any.whl
- Upload date:
- Size: 376.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a4ab49324e2f96884ef86f99d2d1f7e34aa5733ab47ab13100b2af1c2ab7df91
|
|
| MD5 |
780813e54d717590b47eba266baebde0
|
|
| BLAKE2b-256 |
65a37f45b00fd6be64a4858b07c44e8d7ec50b0ddbbaf040b1c765e5e6836756
|
Provenance
The following attestation bundles were made for schliff-7.1.1-py3-none-any.whl:
Publisher:
publish.yml on Zandereins/schliff
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
schliff-7.1.1-py3-none-any.whl -
Subject digest:
a4ab49324e2f96884ef86f99d2d1f7e34aa5733ab47ab13100b2af1c2ab7df91 - Sigstore transparency entry: 1339440462
- Sigstore integration time:
-
Permalink:
Zandereins/schliff@79cf468f5b29c49cdd653e5c92dcc28a266d0f61 -
Branch / Tag:
refs/tags/v7.1.1 - Owner: https://github.com/Zandereins
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@79cf468f5b29c49cdd653e5c92dcc28a266d0f61 -
Trigger Event:
release
-
Statement type: