Autonomous skill improvement and measurement framework for Claude Code
Schliff
The finishing cut for Claude Code skills.
Baseline: █████░░░░░░░░░░░░░░░ 54.0/100 [D]
After 18x: ████████████████████ 98.3/100 [S]
What changed:
Structure 70 → 100 Added description, examples, concrete commands
Efficiency 35 → 93 Removed hedging language, improved density
Composability 30 → 90 Added scope, error behavior, dependencies
Clarity 90 → 100 Resolved vague references
You wrote a skill. It worked. Three weeks later, triggers misfire, edge cases slip through, instructions contradict themselves. Schliff measures the damage (deterministic scoring, no LLM needed) and fixes it autonomously (Claude Code applies patches, measures delta, reverts regressions).
Try It — Demo in 3 minutes
Note: Schliff commands (/schliff:*) run inside Claude Code, not in a regular terminal. Claude's intelligence decides which patches to apply: the scorer is deterministic, the improvement loop uses the LLM.
# 1. Install once (terminal, ~1 min)
git clone https://github.com/Zandereins/schliff.git && bash schliff/install.sh
# 2. Score the included demo skill (Claude Code, ~10 sec)
/schliff:init demo/bad-skill/SKILL.md
# 3. Watch it improve the demo skill (Claude Code, ~2 min)
/schliff:auto
What you'll see on the demo skill: 18 autonomous iterations. Each one: patch → measure → keep or revert. Score climbs from 54 [D] to 98 [S]. Stops when ROI plateaus. Real-world skills take longer and may not reach [S] — complex skills plateau around [A] to [S] depending on their eval suite coverage.
Prerequisites: Python 3.9+, Bash, Git, jq
Already have skills? Run /schliff:doctor to scan all installed skills and show health grades + token costs.
What Schliff Fixes
Real improvements from the included demo skill:
| Problem | What Schliff does | Result |
|---|---|---|
| Triggers misfire | Keyword matching + negative boundaries | 0% → 89% accuracy |
| Missing structure | Added examples, edge cases, frontmatter | 75 → 100/100 |
| Vague instructions | Replaced hedging with concrete commands | 35 → 93/100 |
| No scope boundaries | Added handoff declarations + "do NOT use" | 40 → 100/100 |
Automated. No human intervention. Stops when ROI plateaus.
This Is For You If
- Skill Creator: run /schliff:init on your v1 skill to get a baseline + eval suite
- Skill Maintainer: run /schliff:auto to grind any skill from [C] to [S] overnight
- Fleet Manager (10+ skills): run /schliff:doctor to scan everything, detect conflicts + token costs
- Quality Gate: run /schliff:eval before shipping, or use the GitHub Action in CI
Why It Works
Autonomous — Runs unattended. Applies patches, measures delta, reverts regressions, stops when ROI drops. No prompts, no babysitting.
Deterministic scoring — The 7-dimension scorer is pure Python, no LLM. Same input, same output. The improvement loop (/schliff:auto) runs inside Claude Code — Claude decides which patches to apply, but 60-70% of fixes follow deterministic rules (frontmatter, noise removal, TODO cleanup).
Empirical — 7 scoring dimensions (structure, triggers, quality, edges, efficiency, composability, clarity) + optional runtime validation against actual Claude behavior.
Learns — Episodic memory remembers which strategies worked across sessions. Predicts success before trying. Your 50th skill improves faster than your 1st.
Scales — MinHash + LSH mesh analysis detects trigger conflicts across 50+ skills in O(n). Doctor command shows health grades for your entire skill collection.
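To make the MinHash + LSH idea concrete, here is a minimal sketch of how trigger-keyword sets could be compared without scoring every pair of skills. It is not Schliff's implementation: the signature length, band count, hash function, helper names, and the sample skills are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64  # signature length (assumed, not Schliff's setting)
BANDS = 32       # LSH bands of 2 rows each (assumed)


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token, salted by the seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(tokens: set[str]) -> tuple[int, ...]:
    """One minimum per seeded hash function; tokens must be non-empty."""
    return tuple(min(_hash(t, seed) for t in tokens) for seed in range(NUM_HASHES))


def estimate_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """The fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def candidate_pairs(signatures: dict[str, tuple[int, ...]]) -> set[tuple[str, str]]:
    """Bucket signatures band by band; skills sharing a bucket become candidates."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Hypothetical trigger keywords: two skills with duplicate triggers, one unrelated.
skills = {
    "deploy-a": {"deploy", "release", "ship", "production"},
    "deploy-b": {"deploy", "release", "ship", "production"},
    "lint": {"lint", "format", "style", "check"},
}
signatures = {name: minhash(tokens) for name, tokens in skills.items()}
print(candidate_pairs(signatures))
```

Bucketing is a single pass over the skills, which is where the linear behavior comes from; only skills that land in the same bucket are compared in detail afterwards.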
Autoresearch for Claude Code
Inspired by Karpathy's autoresearch (50K+ stars) — Schliff applies the same autonomous improvement loop to Claude Code skills:
| | Karpathy's autoresearch | Schliff |
|---|---|---|
| Target | ML training scripts | Claude Code SKILL.md files |
| Metric | 1 (val_bpb) | 7 dimensions |
| Patches | 100% LLM | 60-70% deterministic |
| Memory | None | Cross-session episodic store |
| Fleet | 1 file | 50+ skills (Doctor + Mesh) |
Both run overnight. Both stop when ROI plateaus. Both improve unattended.
Commands
Core
| Command | What It Does |
|---|---|
| /schliff | Full autonomous loop with GOAL + METRIC |
| /schliff:doctor | Scan ALL installed skills, show health summary |
| /schliff:auto | Self-driving auto-improve (deterministic patches, no prompts) |
| /schliff:init | Bootstrap eval suite + baseline from any SKILL.md |
| /schliff:report | Generate shareable markdown report with badge |
Analyze & Debug
| Command | What It Does |
|---|---|
| /schliff:analyze | One-shot gap analysis with ranked recommendations |
| /schliff:bench | Establish quality baseline for a skill |
| /schliff:eval | Run eval suite assertions |
| /schliff:mesh | Detect trigger conflicts across all installed skills |
| /schliff:triage | Cluster failures, auto-generate fixes |
| /schliff:log-failure | Log a skill failure for later triage |
| /schliff:update | Update Schliff to latest version |
How It Scores — 7 dimensions + optional runtime
Two modes, one decision:
Structural Score (default): Instant, zero LLM cost. Pure Python analysis of file organization, trigger keywords, eval coverage, edge cases, efficiency, composability, and clarity. No API calls needed. Use schliff score SKILL.md from any terminal or /schliff:bench in Claude Code.
Runtime Score (--runtime): Invokes Claude with test prompts, validates actual behavior against assertions. Requires Claude CLI. Use before shipping to production.
Improvement Loop (/schliff:auto): Runs inside Claude Code. Claude reads the scorer output, picks the highest-impact fix, patches the SKILL.md, re-scores, keeps or reverts. This is where the LLM intelligence lives. The scorer is the ruler; Claude is the craftsman.
| Dimension | Weight | What It Measures |
|---|---|---|
| Structure | 15% | Frontmatter, headers, examples, progressive disclosure |
| Trigger Accuracy | 20% | TF-IDF keyword overlap against eval suite prompts |
| Eval Coverage | 20% | Assertion breadth and eval suite coverage |
| Edge Coverage | 15% | Edge case definitions in eval suite |
| Token Efficiency | 10% | Information density, signal-to-noise ratio |
| Composability | 10% | Scope boundaries, handoff declarations |
| Clarity | 5% | Contradiction detection, vague references, ambiguity |
| Runtime (opt-in) | 10% | Actual Claude behavior against assertions |
Grades: S (>=95), A (>=85), B (>=75), C (>=65), D (>=50), E (>=35), F (<35).
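As a rough illustration of how the weights and grade thresholds above could combine, the sketch below computes a weighted average over whichever dimensions were measured and maps it to a letter. Two things are assumptions rather than documented behavior: that "quality" in the dashboard corresponds to the Eval Coverage row, and that weights are renormalized over the measured dimensions (the table's weights total 95% without runtime and 105% with it). Under those assumptions the dashboard values shown below in this README come out at 95.4, which is suggestive but not a confirmation of the actual formula; see docs/SCORING.md for the real method.

```python
# Weights copied from the table; "quality" is assumed to be the Eval Coverage row.
WEIGHTS = {
    "structure": 0.15, "triggers": 0.20, "quality": 0.20, "edges": 0.15,
    "efficiency": 0.10, "composability": 0.10, "clarity": 0.05, "runtime": 0.10,
}

# Grade thresholds from the "Grades" line, highest first; anything lower is F.
GRADES = [(95, "S"), (85, "A"), (75, "B"), (65, "C"), (50, "D"), (35, "E")]


def composite(scores: dict[str, float]) -> float:
    """Weighted average, renormalized over the dimensions present (assumed)."""
    total_weight = sum(WEIGHTS[name] for name in scores)
    weighted = sum(WEIGHTS[name] * value for name, value in scores.items())
    return weighted / total_weight


def grade(score: float) -> str:
    """Map a 0-100 score to its letter grade."""
    for threshold, letter in GRADES:
        if score >= threshold:
            return letter
    return "F"


# Per-dimension values taken from the Health Dashboard example in this README.
dashboard = {
    "structure": 100, "triggers": 95, "quality": 91, "edges": 100,
    "efficiency": 84, "composability": 100, "clarity": 100,
}
total = composite(dashboard)
print(f"{total:.1f}/100 [{grade(total)}]")  # prints 95.4/100 [S]
```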
Full scoring methodology: docs/SCORING.md
Dashboard — Health overview for any skill
======================================================================
Schliff Health Dashboard: schliff
======================================================================
Structural Score: ███████████████████░ 95.4/100 [S]
[7/8 dimensions, 90% coverage]
Dimensions:
structure ██████████ 100/100
triggers █████████░ 95/100
quality █████████░ 91/100
edges ██████████ 100/100
efficiency ████████░░ 84/100
composability ██████████ 100/100
clarity ██████████ 100/100
======================================================================
Auto-Improve — Autonomous grinding with EMA-based stopping
Scoring baseline...
Baseline: 95.4/100 (7 dims)
--- Iteration 1 ---
Stopping: composite >= 98 (95.4)
Schliff Auto-Improve Complete
──────────────────────────────────────────────────
Score: 95 → 95.4/100 ███████████████████░ (+0.0) [S]
Iters: 0 | Kept: 0 | Time: 1s
Stop: composite >= 98 (95.4)
(Already near-optimal — consider runtime eval for further gains)
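The patch, measure, keep-or-revert cycle with an EMA-based stop can be sketched in a few lines. This is an illustrative model only, not the code behind /schliff:auto: the function name, the smoothing factor, the minimum-gain cutoff, and the toy scorer and patches are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

Patch = Callable[[str], str]


def auto_improve(
    skill: str,
    patches: Iterable[Patch],
    score: Callable[[str], float],
    target: float = 98.0,   # stop once the score reaches this (assumed default)
    alpha: float = 0.5,     # EMA smoothing factor (assumed)
    min_gain: float = 0.5,  # stop when the smoothed gain drops below this (assumed)
) -> Tuple[str, float, int]:
    """Apply patches one by one, keeping improvements and reverting regressions."""
    best = score(skill)
    ema_gain: Optional[float] = None
    kept = 0
    for patch in patches:
        if best >= target:
            break
        candidate = patch(skill)
        new_score = score(candidate)
        gain = new_score - best
        if gain > 0:
            skill, best = candidate, new_score  # keep the patch
            kept += 1
        # Otherwise the candidate is discarded, which is the "revert".
        realized = max(gain, 0.0)
        ema_gain = realized if ema_gain is None else alpha * realized + (1 - alpha) * ema_gain
        if ema_gain < min_gain:
            break  # return on effort has plateaued
    return skill, best, kept


def toy_score(text: str) -> float:
    """Toy deterministic scorer: penalize hedging words and leftover TODOs."""
    return 100 - 20 * text.count("maybe") - 10 * text.count("TODO")


toy_patches = [
    lambda s: s.replace("TODO ", ""),   # helps: removes a TODO
    lambda s: s + " maybe",             # hurts: adds hedging, gets reverted
    lambda s: s.replace("maybe ", ""),  # helps: removes hedging
]

result = auto_improve("maybe run tests. TODO maybe deploy.", toy_patches, toy_score)
print(result)  # prints ('run tests. deploy.', 100, 2)
```

Because the scorer is deterministic, a rejected patch can be dropped safely: re-scoring the previous text always gives the previous score.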
Doctor — Scan all installed skills at once
======================================================================
Schliff Doctor — Skill Health Check
======================================================================
1 skills scanned | 1 healthy | 4 mesh issues
Skill Score Grade Dims Issues Action
--------------------------------------------------------------------
schliff 90 [A] 7/8 0 Healthy
Mesh Health: 68/100 (4 cross-skill issues)
Run /schliff:mesh for details.
NOTE: Scores are STRUCTURAL — they measure file organization,
not runtime effectiveness. Use --runtime for validated scoring.
======================================================================
What's New in v6.0
| Feature | Description |
|---|---|
| Rebrand to Schliff | "The finishing cut" — German for polish/grind |
| Clarity as Default | 7th dimension always active (contradictions, vague refs, ambiguity) |
| Token Cost Estimation | Doctor shows per-skill token cost + fleet total |
| GitHub Action | Zandereins/schliff@v6 — CI quality gate with PR comments |
| pip CLI | schliff score SKILL.md — works without Claude Code |
| Actionable Doctor | Copy-paste commands with full skill paths |
| Trigger Confidence | Small eval suites (<8 triggers) capped at score 60 |
| Context-aware Contradictions | "run tests" vs "run tests in production" distinguished |
| Anti-gaming | Empty headers, repetitive markers, binary composability fixed |
| 443 Tests (unit + integration + proof) | +70 stress tests, +28 edge cases, +76 patterns, +20 golden files |
| 40 Security Fixes | Shell injection, prompt injection, ReDoS, supply chain |
Quality & Security
Schliff scores itself — 7 dimensions, same engine, no exceptions.
| Metric | Value | What This Means |
|---|---|---|
| Structural Score | 95.4 / 100 [S] | Production-ready. 10 composability sub-checks, all passing. |
| Tests | 443 passing | 318 unit + 99 integration + 20 self + 6 proof. Every scorer rule tested. |
| Security | 40 fixes | Shell injection, prompt injection, ReDoS, supply chain. |
| Dimensions | 7 + runtime | Transparent, rule-based, explainable scoring. |
| Journey | v1.0 (62.5) → v6.0 (95.4) | 7 major versions. Continuous improvement, no regressions. |
Scoring methodology | Security details
GitHub Action
Score skills in CI. Block PRs that regress. The Codecov for SKILL.md files.
- uses: Zandereins/schliff@v6
  with:
    skill-path: '.claude/skills/my-skill/SKILL.md'
    minimum-score: '75'    # blocks PR if below
    comment-on-pr: 'true'  # posts score table on PR
CLI
Score any skill without Claude Code:
pip install schliff
schliff score path/to/SKILL.md # score a skill
schliff score path/to/SKILL.md --json # JSON output
schliff doctor # scan all installed skills
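If you want a quality gate outside GitHub Actions, the JSON output can be consumed from a script. The sketch below is a guess at how that might look: the shape of the JSON is not documented here, so the field name is passed in by the caller, and "composite_score" in the usage comment is a hypothetical key, not a documented one. It also assumes the command exits with a non-zero status only on failure.

```python
import json
import subprocess


def score_skill(path: str) -> dict:
    """Run `schliff score <path> --json` and parse its stdout as JSON."""
    result = subprocess.run(
        ["schliff", "score", path, "--json"],
        capture_output=True,
        text=True,
        check=True,  # assumes a non-zero exit status means the run failed
    )
    return json.loads(result.stdout)


def passes_gate(report: dict, key: str, minimum: float) -> bool:
    """True when the numeric field `key` in the report meets the minimum."""
    return float(report[key]) >= minimum


# Usage (hypothetical key name; inspect the real JSON output first):
#   report = score_skill("path/to/SKILL.md")
#   ok = passes_gate(report, "composite_score", 75)
```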
Ecosystem
skill-creator builds a v1 skill. Schliff grinds it to production quality.
skill-creator → v1 SKILL.md → /schliff:auto → autonomous grinding → ship
- skill-creator — generate the first draft
- autoresearch — generalized autonomous research for Claude Code
Badge
Score your skill and add this to your README:
[![Schliff: 95 [S]](https://img.shields.io/badge/Schliff-95%2F100_%5BS%5D-brightgreen)](https://github.com/Zandereins/schliff)
Contributing
Found a bug in the scorer? Add a test case to eval-suite.json and open an issue.
Want to improve scoring logic? Edit score-skill.py, run bash test-integration.sh, and PR the diff.
Next Steps
- Try the 3-minute demo: see a skill go from [D] to [S]
- Run /schliff:doctor on your own skills: instant health check
- Add the GitHub Action to your CI: quality gate for every PR
- Read the scoring methodology: understand what each dimension measures
Questions? Open an issue — we respond fast.
License
MIT — do whatever you want.
Built by Franz Paul with Claude Code.