Deterministic quality scorer for AI agent instruction files. Multi-format (SKILL.md, CLAUDE.md, .cursorrules, AGENTS.md), 8-dimension scoring with security, anti-gaming detection, zero dependencies.

Maintainers

FPaolo

Project description

Schliff

Your AI instruction files silently degrade — and nothing catches it. A trigger phrase rots. An edge case slips. Your SKILL.md balloons past its token budget. No error, no red test — just an agent that quietly gets worse.

Schliff is a deterministic quality scorer for AI instruction files. Same input, same score: every time, on every machine. Think of it as Ruff for SKILL.md, CLAUDE.md, and AGENTS.md. It measures the things linters miss, the same way every time, so degradation shows up as a number that drops instead of a bug you chase.

Schliff scores the instruction files that drive your AI agents — skills, system prompts, project memory — against an explicit, versioned rubric, so you can gate a release on the number in CI. No LLM judge in the critical path. No network. No randomness. Just a rule engine you can read, pin, and trust.

pip install schliff
schliff score path/to/SKILL.md

schliff v8.4.0

  structure      ████████░░   78/100  good
  triggers       ███████░░░   72/100  good
  quality        ██████░░░░   64/100  fair
  edges          █████░░░░░   55/100  fair
  efficiency     ████████░░   80/100  good
  composability  ███████░░░   70/100  good
  clarity        ██████████  100/100  perfect

  Structural Score  ██████████████░░░░░░  71.2/100  [C]

  Tokens: 740 / 1,000 (ok)

No model in the loop produced that number. Run it again on another laptop and you get 71.2 again. That is the whole point.

A real catch

A SKILL.md for ShieldClaw, a real prompt-injection-defense skill (now archived), scored 68.3 [C], and Schliff showed exactly why: composability was 20/100 (no scope boundaries, no I/O contract, no handoffs), and 3 of 7 dimensions were unmeasurable because there was no eval suite. After adding the missing scope section and an eval suite, the same file scored 94.6 [A] with all 7 dimensions measured.

State	Score	Grade	Dimensions measured
Before	68.3	C	4/7 (no eval suite)
After	94.6	A	7/7

Defects you would otherwise ship get caught as a number that is too low. See the full case study.

Why deterministic?

Most "AI quality" tools ask another LLM to grade your prompt. That makes the score non-reproducible (re-run it, get a different number), un-auditable (the rubric lives in a hidden prompt), and trivially gameable (write for the judge, not the user). A score you can't reproduce isn't a measurement — it's a vibe. You can't gate a release on a number that drifts.

Schliff takes the opposite position:

Reproducible. The headline composite is computed from a canonical, versioned weight registry. Calibration is off by default, so verify, badge, and the leaderboard return the same score on your laptop and in CI.
Auditable. Every dimension is a readable scorer in scripts/scoring/. The weights are a dict you can open. There is no hidden judge prompt.
Anti-gaming by design. A dedicated guard layer (guards.py) plus per-scorer heuristics detect padding, keyword stuffing, and structure-mimicry instead of rewarding them.
Zero core dependencies. Core Schliff is stdlib-only and runs on Python ≥ 3.10. (Optional [evolve] / [judge] extras pull in LLM clients for an opt-in smoke-test only — never for scoring.)

Because the number is stable, it does real work:

Diff it across two commits to see exactly what a refactor cost or earned.
Gate a pull request on a minimum score, with a non-zero exit code below the line.
Compare two files side by side on the same rubric.

An optional LLM judge exists for exploratory work, but it is never part of the deterministic score. The number you gate on is rule-based, end to end.

The 8 scored dimensions

For the SKILL.md family, Schliff runs 8 scorers per file. 7 of them form the headline composite; security and runtime are reported as separate opt-in signals so a security warning never silently inflates or deflates your quality grade.

Dimension	Weight	In headline?
`structure`	0.15	✅
`triggers`	0.20	✅
`quality`	0.20	✅
`edges`	0.15	✅
`efficiency`	0.10	✅
`composability`	0.10	✅
`clarity`	0.05	✅
`security`	0.05	Separate signal (gate threshold 70)
`runtime`	—	Separate signal (no profile weight)

The seven headline weights sum to 0.95 and are renormalized to sum to 1.0; that renormalized set is the canonical basis.

Note: security is a side signal for the SKILL.md / CLAUDE.md / .cursorrules / AGENTS.md family, but a core 0.15 headline dimension for the system_prompt format, which uses its own scorer set. Only runtime is excluded everywhere.
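As a minimal sketch of that renormalization, the weights below are copied from the table above; the names `HEADLINE_WEIGHTS` and `renormalize` are illustrative and are not taken from Schliff's registry.py.

```python
# Headline weights as listed in the dimension table (security and runtime excluded).
HEADLINE_WEIGHTS = {
    "structure": 0.15,
    "triggers": 0.20,
    "quality": 0.20,
    "edges": 0.15,
    "efficiency": 0.10,
    "composability": 0.10,
    "clarity": 0.05,
}


def renormalize(weights: dict[str, float]) -> dict[str, float]:
    """Scale the weights so they sum to 1.0 while keeping their ratios."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


# Hypothetical usage: the raw weights sum to 0.95, the normalized ones to 1.0.
normalized = renormalize(HEADLINE_WEIGHTS)
```

After renormalization each dimension keeps its relative importance; `triggers`, for example, moves from 0.20 to 0.20 / 0.95.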

The composite: a full-denominator model

Schliff does not renormalize across whatever you happened to measure. Unmeasured dimensions contribute 0 and stay in the denominator, so coverage gaps lower your ceiling instead of disappearing. Your score ceiling equals your measurement coverage. Measure only 4 of the 7 headline dimensions (structure, efficiency, composability, and clarity, which together carry 0.40 of the 0.95 headline weight) and your maximum possible score is capped at about 42%, with an explicit warning:

ℹ Scored 4/7 dimensions — the score can't exceed 42% until the rest
  are measured. Run /schliff:init to add an eval suite and score:
  triggers, quality, edges.

This is deliberate. A partial measurement is an honest partial score, never a flattering one. Unmeasured work is missing points, not invisible. To lift the ceiling, measure more — don't hide the gap.
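The full-denominator idea can be sketched in a few lines. This is an illustration built from the weights in the dimension table, not the code in composite.py; the function names are assumptions.

```python
# Headline weights from the dimension table; names below are illustrative only.
WEIGHTS = {
    "structure": 0.15,
    "triggers": 0.20,
    "quality": 0.20,
    "edges": 0.15,
    "efficiency": 0.10,
    "composability": 0.10,
    "clarity": 0.05,
}


def composite(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted mean over ALL weighted dimensions; unmeasured ones count as 0."""
    total_weight = sum(weights.values())
    earned = sum(weight * scores.get(name, 0.0) for name, weight in weights.items())
    return earned / total_weight


def ceiling(measured: set[str], weights: dict[str, float] = WEIGHTS) -> float:
    """Best possible composite when only `measured` dimensions have scores."""
    return composite({name: 100.0 for name in measured}, weights)


# Without an eval suite, triggers, quality, and edges are unmeasured.
no_eval_ceiling = ceiling({"structure", "efficiency", "composability", "clarity"})
```

Under these assumptions `no_eval_ceiling` is 0.40 / 0.95 of 100, roughly 42.1, which lines up with the 42% in the warning above.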

Grade scale

S ≥ 95 · A ≥ 85 · B ≥ 75 · C ≥ 65 · D ≥ 50 · E ≥ 35 · F < 35
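Read as code, the scale is a descending threshold lookup. The sketch below uses the cut-offs listed above; the `grade` function and `GRADE_THRESHOLDS` name are hypothetical and may not match how Schliff implements it.

```python
# (minimum score, letter) pairs in descending order, taken from the grade scale.
GRADE_THRESHOLDS = [(95, "S"), (85, "A"), (75, "B"), (65, "C"), (50, "D"), (35, "E")]


def grade(score: float) -> str:
    """Return the first letter whose minimum the score reaches, else F."""
    for minimum, letter in GRADE_THRESHOLDS:
        if score >= minimum:
            return letter
    return "F"
```

With these thresholds, the scores quoted earlier in this README map as shown: 71.2 and 68.3 are both C, and 94.6 is an A.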

Multi-format support

One engine, five instruction-file formats — each with its own token budget and scorer set:

Format	Token budget	Scorers
`SKILL.md`	1,000	shared 8-scorer registry
`CLAUDE.md`	2,000	shared 8-scorer registry
`.cursorrules`	500	shared 8-scorer registry
`AGENTS.md`	3,000	shared 8 scorers + `operational_coverage` (own 3-dim headline)
system prompts	1,500	dedicated set (`structure_prompt`, `output_contract`, `efficiency`, `clarity`, `security`, `composability`, `completeness`)

Format is auto-detected; override with --format (skill, claude, cursor, agents, system-prompt).

Install

pip install schliff                  # core, stdlib-only
pip install "schliff[evolve,judge]"  # optional LLM-judge / evolve extras

Install	Pulls in	When you need it
`schliff`	stdlib only	Scoring, verify, badge, CI — everything that gates a release
`schliff[judge]`	LLM client	Opt-in exploratory LLM-judge smoke-test (never scoring)
`schliff[evolve]`	LLM client	Opt-in autonomous-improvement extras

GitHub Action

Gate pull requests on instruction-file quality. The action defaults to your repo-root AGENTS.md and posts a scored comment on every PR:

# .github/workflows/agents-lint.yml
name: AGENTS.md Lint
on: [pull_request]
jobs:
  score:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: Zandereins/schliff@v1
        with:
          minimum-score: '75'   # optional: fail the PR below this score

Set skill-path: to lint a SKILL.md, CLAUDE.md, or .cursorrules instead of the default AGENTS.md.

Prefer not to depend on a third-party action? The dependency-light equivalent:

      - run: pip install schliff
      - run: schliff verify AGENTS.md --min-score 75

schliff verify exits non-zero below the threshold — a clean CI gate either way.
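If you drive the gate from a Python build script rather than from YAML, the exit code is all you need to inspect. The sketch below only wraps `subprocess.run`; the command list mirrors the `schliff verify AGENTS.md --min-score 75` line above, and `run_gate` is a hypothetical helper, not part of Schliff.

```python
import subprocess

# Same command as the CI step above, split into arguments.
VERIFY_COMMAND = ["schliff", "verify", "AGENTS.md", "--min-score", "75"]


def run_gate(command: list[str]) -> bool:
    """Run a gate command and report whether it exited with status 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0
```

Calling `run_gate(VERIFY_COMMAND)` assumes `schliff` is installed and on your PATH; it would return `False` whenever the command exits non-zero, which the README says happens below the threshold.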

pre-commit

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/Zandereins/schliff
    rev: v8.4.0
    hooks:
      - id: schliff-verify
        args: ['--min-score', '75']

CLI

schliff <command> [path] [options]

Command	What it does
`score`	Score a file and print the grade bar
`verify`	CI gate — exit 0/1 based on a minimum score
`doctor`	Scan and grade every installed skill
`badge`	Generate a Markdown score badge
`diff`	Explain score changes between two git commits
`compare`	Compare two files side by side
`suggest`	Rank fixes by estimated score impact
`report`	Generate a Markdown score report
`demo`	Score a built-in bad skill to see Schliff in action
`evolve`	Improve an instruction file's score
`version`	Print the version

The version is single-sourced: the CLI resolves it at runtime via importlib.metadata.version("schliff"), falling back to dev when run from a source checkout where the package is not installed.
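That lookup-with-fallback pattern looks roughly like this. `importlib.metadata.version` and `PackageNotFoundError` are standard-library names; the wrapper function `resolve_version` is an illustrative name, not necessarily what cli.py uses.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(package: str = "schliff") -> str:
    """Return the installed distribution's version, or "dev" if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        # No installed metadata, e.g. when running from a source checkout.
        return "dev"
```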

Optional: closing the loop

Beyond grading, Schliff can apply fixes. The improvement engine measures first, then fixes (not the other way around):

Score the file across all dimensions.
Generate deterministic patch gradients for the weakest dimensions.
Apply the safe, rule-based patches automatically — ~32% of suggested fixes apply deterministically through the apply gate (confidence=high, single-edit; canonical measurement: measure_patch_ratio.py). The rest are handed to an optional LLM.
Re-score and keep the change only if the score improved — otherwise revert.
Stop on plateau detection or when the target is reached.
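The measure, patch, re-score, keep-or-revert loop described in these steps can be sketched generically. Everything here is a simplified assumption: `score` and the patch functions are caller-supplied stand-ins, and `patience` is a made-up plateau rule rather than Schliff's actual plateau detection.

```python
from typing import Callable, Iterable

Scorer = Callable[[str], float]
Patch = Callable[[str], str]


def improve(
    text: str,
    score: Scorer,
    patches: Iterable[Patch],
    target: float,
    patience: int = 2,
) -> tuple[str, float]:
    """Apply patches one at a time, keeping a change only if the score rises."""
    best_text, best_score = text, score(text)  # measure first
    stalled = 0
    for patch in patches:
        if best_score >= target or stalled >= patience:
            break  # target reached or plateau detected
        candidate = patch(best_text)
        candidate_score = score(candidate)  # re-score after the change
        if candidate_score > best_score:
            best_text, best_score = candidate, candidate_score
            stalled = 0
        else:
            stalled += 1  # revert: the candidate is simply discarded
    return best_text, best_score
```

Because the scorer is deterministic, the keep-or-revert decision is reproducible: the same file and the same patches always yield the same result.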

It also carries cross-session episodic memory (episodic_store.py), so improvement runs learn from prior attempts instead of repeating them. Drive it from Claude Code with /schliff:auto, or use schliff evolve directly. This is an optional convenience layer — the deterministic score is the product.

→ 7 deterministic fixes available. Run `/schliff:auto` to apply.

How it works

The full methodology — scorer internals, the full-denominator composite, the anti-gaming guards, and the calibration model — lives in docs/SCORING.md. Calibration is strictly opt-in: ambient auto-calibrated weights apply only when SCHLIFF_CALIBRATED_WEIGHTS is set and only for the interactive score command, and Schliff emits a weight_source=calibrated warning flagging that such scores are not comparable to the canonical scale. Everything that gates a release stays canonical.

scripts/
├── cli.py                  # CLI entrypoint + dynamic version resolution
├── scoring/
│   ├── registry.py         # canonical weights, scorer lists, headline exclusions
│   ├── composite.py        # full-denominator composite model
│   ├── formats.py          # format detection + token budgets
│   ├── guards.py           # anti-gaming detection
│   └── structure.py · triggers.py · quality.py · edges.py · …
├── text_gradient.py        # deterministic patch gradients (apply gate)
├── episodic_store.py       # cross-session episodic memory
└── measure_patch_ratio.py  # canonical source for the patch-ratio claim
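The opt-in calibration rule described above reduces to a small decision. The sketch below is a hypothetical restatement of that rule: the environment variable name comes from the text, while `weight_source` and its treatment of "set" as "present in the environment" are assumptions.

```python
from typing import Mapping

CALIBRATION_ENV_VAR = "SCHLIFF_CALIBRATED_WEIGHTS"


def weight_source(command: str, environ: Mapping[str, str]) -> str:
    """Pick calibrated weights only for `score` with the env var present."""
    if command == "score" and CALIBRATION_ENV_VAR in environ:
        return "calibrated"
    # verify, badge, and every other command stay on the canonical weights.
    return "canonical"
```

In a real script you would pass `os.environ` as `environ`; a plain dict is enough for experimenting.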

Positioning

LLM-judge tools ask a model how good your prompt feels — a different answer every run. Schliff computes how good it measurably is — the same answer every run, in a number you can pin to a commit and gate a release on.

Ruff lints your Python. Biome lints your JS. Schliff lints the instruction files that drive your AI — deterministically, with no model in the loop.

Contributing & links

⭐ Star the repo: github.com/Zandereins/schliff
📖 Docs: docs/SCORING.md
🧪 Playground: schliff-playground.vercel.app — paste a SKILL.md, get a live structural score (or schliff demo in the CLI)
🏆 Leaderboard: schliff-leaderboard.vercel.app

Structural score = the composite renormalized over the dimensions Schliff can measure deterministically without an eval suite (structure, efficiency, composability, clarity). It is what the web playground reports. The full 7-dimension composite additionally folds in triggers, quality, and edges — which require an eval suite (schliff init).

Validated by 1,347 tests (unit + integration) in skills/schliff/tests, with separate self and proof suites via test-self.sh and test-integration.sh.

License

MIT © Franz Paul

Release history

Version	Released
8.4.0 (this version)	Jul 3, 2026
8.3.0	Jun 25, 2026
8.2.0	Jun 11, 2026
8.1.0	Jun 3, 2026
8.0.0	May 29, 2026
7.2.0	Apr 24, 2026
7.1.1	Apr 18, 2026
7.1.0	Mar 27, 2026
7.0.0	Mar 26, 2026
6.3.0	Mar 26, 2026
6.2.0	Mar 25, 2026
6.1.0	Mar 24, 2026
6.0.1	Mar 24, 2026

Download files

Source Distribution

schliff-8.4.0.tar.gz (239.8 kB)

Uploaded Jul 3, 2026 Source

Built Distribution

schliff-8.4.0-py3-none-any.whl (277.1 kB)

Uploaded Jul 3, 2026 Python 3

File details

Details for the file schliff-8.4.0.tar.gz.

File metadata

Download URL: schliff-8.4.0.tar.gz
Upload date: Jul 3, 2026
Size: 239.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for schliff-8.4.0.tar.gz
Algorithm	Hash digest
SHA256	`14d1600a4c5abe9e397b0791289142038b8163f42ac4299c6fc811cac28bb09d`
MD5	`fc80ea336486c516941d06f2442e0fa8`
BLAKE2b-256	`4da6ccbbf13acaecfa21ab06ac49492760e36d2968056e07656960cb0005d9fa`
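To check a downloaded archive against the SHA256 digest listed above, a short standard-library script is enough. The expected digest is copied from the table; the helper names are illustrative, and the call at the end assumes the archive sits in your current directory.

```python
import hashlib

# SHA256 digest for schliff-8.4.0.tar.gz, copied from the table above.
EXPECTED_SHA256 = "14d1600a4c5abe9e397b0791289142038b8163f42ac4299c6fc811cac28bb09d"


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large archives are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected: str) -> bool:
    """Compare the file's SHA256 with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_digest("schliff-8.4.0.tar.gz", EXPECTED_SHA256)` should be true for an untampered download of that file.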

Provenance

The following attestation bundles were made for schliff-8.4.0.tar.gz:

Publisher: publish.yml on Zandereins/schliff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: schliff-8.4.0.tar.gz
- Subject digest: 14d1600a4c5abe9e397b0791289142038b8163f42ac4299c6fc811cac28bb09d
- Sigstore transparency entry: 2059952226
- Sigstore integration time: Jul 3, 2026
Source repository:
- Permalink: Zandereins/schliff@8ed786952d1715718f4c1f09431c26f6d1315a45
- Branch / Tag: refs/tags/v8.4.0
- Owner: https://github.com/Zandereins
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8ed786952d1715718f4c1f09431c26f6d1315a45
- Trigger Event: release

File details

Details for the file schliff-8.4.0-py3-none-any.whl.

File metadata

Download URL: schliff-8.4.0-py3-none-any.whl
Upload date: Jul 3, 2026
Size: 277.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for schliff-8.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0cbade341dd8d70d72a4e02d24ff17f2245f3113cbb9497113f457a9f1beb94c`
MD5	`1f6b8b8cf5705f55a0849093906bded3`
BLAKE2b-256	`85f35d8bb3f99238f9abf139358ec5445f4b66635844808f9d303907e5540c73`

Provenance

The following attestation bundles were made for schliff-8.4.0-py3-none-any.whl:

Publisher: publish.yml on Zandereins/schliff

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: schliff-8.4.0-py3-none-any.whl
- Subject digest: 0cbade341dd8d70d72a4e02d24ff17f2245f3113cbb9497113f457a9f1beb94c
- Sigstore transparency entry: 2059952456
- Sigstore integration time: Jul 3, 2026
Source repository:
- Permalink: Zandereins/schliff@8ed786952d1715718f4c1f09431c26f6d1315a45
- Branch / Tag: refs/tags/v8.4.0
- Owner: https://github.com/Zandereins
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@8ed786952d1715718f4c1f09431c26f6d1315a45
- Trigger Event: release
