
Autonomous skill improvement and measurement framework for Claude Code


Schliff

The finishing cut for Claude Code skills.

Schliff improving a skill from 56.9 to 99.9

Baseline:  █████░░░░░░░░░░░░░░░  54.0/100  [D]
After 18x: ████████████████████  98.3/100  [S]

What changed:
  Structure         70 → 100     Added description, examples, concrete commands
  Efficiency        35 → 93      Removed hedging language, improved density
  Composability     30 → 90      Added scope, error behavior, dependencies
  Clarity           90 → 100     Resolved vague references

You wrote a skill. It worked. Three weeks later, triggers misfire, edge cases slip through, instructions contradict themselves. Schliff measures the damage (deterministic scoring, no LLM needed) and fixes it autonomously (Claude Code applies patches, measures delta, reverts regressions).



Try It — Demo in 3 minutes

Note: Schliff commands (/schliff:*) run inside Claude Code, not in a regular terminal. Claude's intelligence decides which patches to apply — the scorer is deterministic, the improvement loop uses the LLM.

# 1. Install once (terminal, ~1 min)
git clone https://github.com/Zandereins/schliff.git && bash schliff/install.sh

# 2. Score the included demo skill (Claude Code, ~10 sec)
/schliff:init demo/bad-skill/SKILL.md

# 3. Watch it improve the demo skill (Claude Code, ~2 min)
/schliff:auto

What you'll see on the demo skill: 18 autonomous iterations. Each one: patch → measure → keep or revert. Score climbs from 54 [D] to 98 [S]. Stops when ROI plateaus. Real-world skills take longer and may not reach [S] — complex skills plateau around [A] to [S] depending on their eval suite coverage.

Prerequisites: Python 3.9+, Bash, Git, jq

Already have skills? Run /schliff:doctor to scan all installed skills and show health grades + token costs.


What Schliff Fixes

Real improvements from the included demo skill:

| Problem | What Schliff does | Result |
|---|---|---|
| Triggers misfire | Keyword matching + negative boundaries | 0% → 89% accuracy |
| Missing structure | Added examples, edge cases, frontmatter | 75 → 100/100 |
| Vague instructions | Replaced hedging with concrete commands | 35 → 93/100 |
| No scope boundaries | Added handoff declarations + "do NOT use" | 40 → 100/100 |

Automated. No human intervention. Stops when ROI plateaus.


This Is For You If

  • Skill Creator — Run /schliff:init on your v1 skill to get a baseline + eval suite
  • Skill Maintainer — Run /schliff:auto to grind any skill from [C] to [S] overnight
  • Fleet Manager (10+ skills) — Run /schliff:doctor to scan everything, detect conflicts + token costs
  • Quality Gate — Run /schliff:eval before shipping, or use the GitHub Action in CI

Why It Works

Autonomous — Runs unattended. Applies patches, measures delta, reverts regressions, stops when ROI drops. No prompts, no babysitting.

Deterministic scoring — The 7-dimension scorer is pure Python, no LLM. Same input, same output. The improvement loop (/schliff:auto) runs inside Claude Code — Claude decides which patches to apply, but 60-70% of fixes follow deterministic rules (frontmatter, noise removal, TODO cleanup).

Empirical — 7 scoring dimensions (structure, triggers, quality, edges, efficiency, composability, clarity) + optional runtime validation against actual Claude behavior.

Learns — Episodic memory remembers which strategies worked across sessions. Predicts success before trying. Your 50th skill improves faster than your 1st.

Scales — MinHash + LSH mesh analysis detects trigger conflicts across 50+ skills in O(n). Doctor command shows health grades for your entire skill collection.
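To illustrate how MinHash with LSH banding can surface overlapping triggers without comparing every pair of skills, here is a minimal sketch. It is not Schliff's implementation: the word-shingle size, the signature length, the band count, and the function names are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 32  # signature length (assumed)
BANDS = 8        # LSH bands, 4 rows each (assumed)


def shingles(text, k=2):
    """Split a trigger description into overlapping word k-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(items):
    """Build a signature: the minimum hash value per seeded hash function."""
    signature = []
    for seed in range(NUM_HASHES):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def candidate_conflicts(skills):
    """Return pairs of skill names that share at least one LSH bucket."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(set)
    for name, text in skills.items():
        sig = minhash(shingles(text))
        for band in range(BANDS):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Hypothetical skills: two with identical triggers, one unrelated.
skills = {
    "fmt-a": "format python code with black",
    "fmt-b": "format python code with black",
    "deploy": "deploy kubernetes cluster to staging",
}
print(candidate_conflicts(skills))
```

Each skill is hashed once and dropped into a fixed number of buckets, so the bucketing pass grows linearly with the number of skills. Only skills that collide in some band become candidate pairs, which a fuller check could then compare more precisely.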


Autoresearch for Claude Code

Inspired by Karpathy's autoresearch (50K+ stars) — Schliff applies the same autonomous improvement loop to Claude Code skills:

| | Karpathy's autoresearch | Schliff |
|---|---|---|
| Target | ML training scripts | Claude Code SKILL.md files |
| Metric | 1 (val_bpb) | 7 dimensions |
| Patches | 100% LLM | 60-70% deterministic |
| Memory | None | Cross-session episodic store |
| Fleet | 1 file | 50+ skills (Doctor + Mesh) |

Both run overnight. Both stop when ROI plateaus. Both improve unattended.


Commands

Core

| Command | What It Does |
|---|---|
| /schliff | Full autonomous loop with GOAL + METRIC |
| /schliff:doctor | Scan ALL installed skills, show health summary |
| /schliff:auto | Self-driving auto-improve (deterministic patches, no prompts) |
| /schliff:init | Bootstrap eval suite + baseline from any SKILL.md |
| /schliff:report | Generate shareable markdown report with badge |

Analyze & Debug

| Command | What It Does |
|---|---|
| /schliff:analyze | One-shot gap analysis with ranked recommendations |
| /schliff:bench | Establish quality baseline for a skill |
| /schliff:eval | Run eval suite assertions |
| /schliff:mesh | Detect trigger conflicts across all installed skills |
| /schliff:triage | Cluster failures, auto-generate fixes |
| /schliff:log-failure | Log a skill failure for later triage |
| /schliff:update | Update Schliff to latest version |

How It Scores — 7 dimensions + optional runtime

Two scoring modes, plus the improvement loop that builds on them:

Structural Score (default) — Instant, zero LLM cost. Pure Python analysis of file organization, trigger keywords, eval coverage, edge cases, efficiency, composability. No API calls needed. Use schliff score SKILL.md from any terminal or /schliff:bench in Claude Code.

Runtime Score (--runtime) — Invokes Claude with test prompts, validates actual behavior against assertions. Requires Claude CLI. Use before shipping to production.

Improvement Loop (/schliff:auto) — Runs inside Claude Code. Claude reads the scorer output, picks the highest-impact fix, patches the SKILL.md, re-scores, keeps or reverts. This is where the LLM intelligence lives. The scorer is the ruler; Claude is the craftsman.
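The keep-or-revert cycle described above can be sketched as follows. This is an illustration under stated assumptions, not Schliff's code: `improve`, `toy_score`, and the two lambda patches are hypothetical, and the toy scorer (which only penalizes hedging words) stands in for the real 7-dimension scorer.

```python
import re

HEDGES = {"maybe", "probably", "perhaps"}  # hypothetical hedge-word list


def toy_score(text):
    """Stand-in scorer: start at 100 and lose 10 points per hedging word."""
    words = re.findall(r"[a-z]+", text.lower())
    return max(0.0, 100.0 - 10.0 * sum(w in HEDGES for w in words))


def improve(text, patches, score):
    """Apply each patch in turn; keep it only if the score rises."""
    best = score(text)
    kept = 0
    for patch in patches:
        candidate = patch(text)
        candidate_score = score(candidate)
        if candidate_score > best:
            # Improvement: adopt the patched text as the new baseline.
            text, best, kept = candidate, candidate_score, kept + 1
        # Otherwise the candidate is discarded, which is the "revert".
    return text, best, kept


skill = "Maybe run the tests. Probably deploy."
patches = [
    lambda t: t.replace("Maybe ", ""),   # removes one hedge
    lambda t: t + " Perhaps retry.",     # adds a hedge, so it regresses
]
final_text, final_score, kept = improve(skill, patches, toy_score)
print(final_text, final_score, kept)
```

Because the scorer is deterministic, the same patch sequence always yields the same keep and revert decisions, which is what makes the measured delta trustworthy.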

| Dimension | Weight | What It Measures |
|---|---|---|
| Structure | 15% | Frontmatter, headers, examples, progressive disclosure |
| Trigger Accuracy | 20% | TF-IDF keyword overlap against eval suite prompts |
| Eval Coverage | 20% | Assertion breadth and eval suite coverage |
| Edge Coverage | 15% | Edge case definitions in eval suite |
| Token Efficiency | 10% | Information density, signal-to-noise ratio |
| Composability | 10% | Scope boundaries, handoff declarations |
| Clarity | 5% | Contradiction detection, vague references, ambiguity |
| Runtime (opt-in) | 10% | Actual Claude behavior against assertions |
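The weights above total 95% without the runtime dimension and 105% with it, and this README does not say how they are combined. One plausible reading, shown here as an assumption, is a weighted mean over whichever dimensions were actually scored. The short dimension keys follow the dashboard output in this README; mapping `quality` to the Eval Coverage row is an inference, not something the README confirms.

```python
# Weights from the table above, keyed by the dashboard's short names.
WEIGHTS = {
    "structure": 15,
    "triggers": 20,
    "quality": 20,       # assumed to correspond to "Eval Coverage"
    "edges": 15,
    "efficiency": 10,
    "composability": 10,
    "clarity": 5,
    "runtime": 10,       # opt-in, absent unless runtime scoring ran
}


def composite(scores):
    """Weighted mean over the dimensions present in `scores` (assumed rule)."""
    total_weight = sum(WEIGHTS[dim] for dim in scores)
    return sum(WEIGHTS[dim] * value for dim, value in scores.items()) / total_weight


# Dimension values taken from the dashboard output in this README.
dashboard = {
    "structure": 100,
    "triggers": 95,
    "quality": 91,
    "edges": 100,
    "efficiency": 84,
    "composability": 100,
    "clarity": 100,
}
print(round(composite(dashboard), 1))
```

Under this assumption the dashboard's seven dimension values come out at 95.4, which matches the structural score the dashboard reports.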

Grades: S (>=95), A (>=85), B (>=75), C (>=65), D (>=50), E (>=35), F (<35).
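The grade thresholds above translate directly into a lookup. The function name `grade` is only illustrative; the cutoffs are the ones listed.

```python
# (letter, minimum score) pairs, highest first, as listed above.
GRADE_FLOORS = (("S", 95), ("A", 85), ("B", 75), ("C", 65), ("D", 50), ("E", 35))


def grade(score):
    """Map a 0-100 score to its letter grade."""
    for letter, floor in GRADE_FLOORS:
        if score >= floor:
            return letter
    return "F"


for value in (54.0, 98.3, 90, 34.9):
    print(value, grade(value))
```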

Full scoring methodology: docs/SCORING.md

Dashboard: Health overview for any skill

======================================================================
  Schliff Health Dashboard: schliff
======================================================================

  Structural Score: ███████████████████░  95.4/100  [S]
    [7/8 dimensions, 90% coverage]

  Dimensions:
    structure       ██████████  100/100
    triggers        █████████░   95/100
    quality         █████████░   91/100
    edges           ██████████  100/100
    efficiency      ████████░░   84/100
    composability   ██████████  100/100
    clarity         ██████████  100/100
======================================================================

Auto-Improve: Autonomous grinding with EMA-based stopping

Scoring baseline...
Baseline: 95.4/100 (7 dims)

--- Iteration 1 ---
Stopping: composite >= 98 (95.4)

  Schliff Auto-Improve Complete
  ──────────────────────────────────────────────────
  Score:  95 → 95.4/100  ███████████████████░  (+0.0)  [S]
  Iters:  0  |  Kept: 0  |  Time: 1s
  Stop:   composite >= 98 (95.4)
  (Already near-optimal — consider runtime eval for further gains)
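
The heading for this output mentions EMA-based stopping. As a rough illustration of the idea, and not Schliff's actual rule, the sketch below smooths per-iteration score gains with an exponential moving average and stops once the smoothed gain falls below a threshold. The function name, `alpha`, and `min_gain` are assumptions.

```python
def plateau_iteration(scores, alpha=0.5, min_gain=1.0):
    """Return the iteration at which the EMA of gains drops below min_gain.

    `scores[0]` is the baseline; `scores[i]` is the score after iteration i.
    Returns None if the smoothed gain never falls below the threshold.
    """
    ema = None
    for i in range(1, len(scores)):
        gain = scores[i] - scores[i - 1]
        # The first gain seeds the average; later gains are blended in.
        ema = gain if ema is None else alpha * gain + (1 - alpha) * ema
        if ema < min_gain:
            return i
    return None


# One big early win, then nothing: the smoothed gain decays 10, 5, 2.5, 1.25, 0.625.
print(plateau_iteration([50, 60, 60, 60, 60, 60]))
# Steady gains of 10 per iteration never trigger the stop.
print(plateau_iteration([50, 60, 70, 80]))
```

Smoothing keeps a single flat iteration from ending the run early, while a sustained run of small gains still brings the average down.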

Doctor: Scan all installed skills at once

======================================================================
  Schliff Doctor — Skill Health Check
======================================================================

  1 skills scanned | 1 healthy | 4 mesh issues

  Skill                      Score  Grade   Dims  Issues  Action
  --------------------------------------------------------------------
  schliff                   90    [A]    7/8       0  Healthy

  Mesh Health: 68/100 (4 cross-skill issues)
  Run /schliff:mesh for details.

  NOTE: Scores are STRUCTURAL — they measure file organization,
  not runtime effectiveness. Use --runtime for validated scoring.
======================================================================

What's New in v6.0

| Feature | Description |
|---|---|
| Rebrand to Schliff | "The finishing cut", German for polish/grind |
| Clarity as Default | 7th dimension always active (contradictions, vague refs, ambiguity) |
| Token Cost Estimation | Doctor shows per-skill token cost + fleet total |
| GitHub Action | Zandereins/schliff@v6: CI quality gate with PR comments |
| pip CLI | schliff score SKILL.md: works without Claude Code |
| Actionable Doctor | Copy-paste commands with full skill paths |
| Trigger Confidence | Small eval suites (<8 triggers) capped at score 60 |
| Context-aware Contradictions | "run tests" vs "run tests in production" distinguished |
| Anti-gaming | Empty headers, repetitive markers, binary composability fixed |
| 443 Tests | (unit + integration + proof) +70 stress tests, +28 edge cases, +76 patterns, +20 golden files |
| 40 Security Fixes | Shell injection, prompt injection, ReDoS, supply chain |

Quality & Security

Schliff scores itself — 7 dimensions, same engine, no exceptions.

| Metric | Value | What This Means |
|---|---|---|
| Structural Score | 95.4 / 100 [S] | Production-ready. 10 composability sub-checks, all passing. |
| Tests | 443 passing | 318 unit + 99 integration + 20 self + 6 proof. Every scorer rule tested. |
| Security | 40 fixes | Shell injection, prompt injection, ReDoS, supply chain. |
| Dimensions | 7 + runtime | Transparent, rule-based, explainable scoring. |
| Journey | v1.0 (62.5) → v6.0 (95.4) | 7 major versions. Continuous improvement, no regressions. |

Scoring methodology | Security details


GitHub Action

Score skills in CI. Block PRs that regress. The Codecov for SKILL.md files.

- uses: Zandereins/schliff@v6
  with:
    skill-path: '.claude/skills/my-skill/SKILL.md'
    minimum-score: '75'      # blocks PR if below
    comment-on-pr: 'true'    # posts score table on PR

CLI

Score any skill without Claude Code:

pip install schliff

schliff score path/to/SKILL.md          # score a skill
schliff score path/to/SKILL.md --json   # JSON output
schliff doctor                           # scan all installed skills
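
For scripting around the CLI, a wrapper along these lines might work. It is a sketch under assumptions: the command `schliff score <path> --json` comes from this README, but the shape of the JSON output is not documented here, so the `score` key is a placeholder guess, and both function names are hypothetical.

```python
import json
import subprocess


def passes_gate(report_json, minimum, key="score"):
    """Return True if the report's score meets the minimum.

    `key` is a guess at the JSON schema; adjust it to the real output.
    """
    report = json.loads(report_json)
    return float(report[key]) >= minimum


def score_and_gate(skill_path, minimum=75.0):
    """Run the CLI on one skill and apply the gate to its JSON output."""
    result = subprocess.run(
        ["schliff", "score", skill_path, "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI itself fails
    )
    return passes_gate(result.stdout, minimum)


# The gate logic can be tried without the CLI installed:
print(passes_gate('{"score": 80.5}', 75.0))
```

Keeping the JSON parsing separate from the subprocess call means the threshold logic can be tested without Schliff installed.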

Ecosystem

skill-creator builds a v1 skill. Schliff grinds it to production quality.

skill-creator → v1 SKILL.md → /schliff:auto → autonomous grinding → ship

Badge

Score your skill and add this to your README:

[![Schliff: 95 [S]](https://img.shields.io/badge/Schliff-95%2F100_%5BS%5D-brightgreen)](https://github.com/Zandereins/schliff)

Schliff: 95 [S]
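
If the badge needs regenerating whenever a score changes, the markdown above can be built from the score and grade. This is a small sketch: the function name is hypothetical, and using `brightgreen` for every grade is an assumption, since the README only shows the [S] badge.

```python
from urllib.parse import quote


def badge_markdown(score, grade, color="brightgreen"):
    """Build the badge markdown in the same shape as the example above."""
    # Percent-encode "/" and the brackets; "_" stays and renders as a space.
    message = quote(f"{score}/100_[{grade}]", safe="")
    label = f"Schliff: {score} [{grade}]"
    image = f"https://img.shields.io/badge/Schliff-{message}-{color}"
    return f"[![{label}]({image})](https://github.com/Zandereins/schliff)"


print(badge_markdown(95, "S"))
```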


Contributing

Found a bug in the scorer? Add a test case to eval-suite.json and open an issue. Want to improve scoring logic? Edit score-skill.py, run bash test-integration.sh, and PR the diff.


Next Steps

  1. Try the 3-minute demo — see a skill go from [D] to [S]
  2. Run /schliff:doctor on your own skills — instant health check
  3. Add the GitHub Action to your CI — quality gate for every PR
  4. Read the scoring methodology — understand what each dimension measures

Questions? Open an issue — we respond fast.


License

MIT — do whatever you want.


Built by Franz Paul with Claude Code.

