Evidence-first structured scoring. Class-aware rubric templates for deterministic dim sets across runs.
Project description
hermes-rubric
Evidence-first structured scoring for LLM-judged artifacts. 62.9% chance-corrected agreement (Cohen's κ = 0.629, N=96 paired runs) across three model families on the batch-equivalence test set. 115 tests with two adversarial gates. Forces a three-stage scaffold — synthesize a domain rubric, collect per-dimension citations, then score against the evidence — so the number at the end has an audit trail.
Cross-model Cohen's κ = 0.629 (62.9% chance-corrected agreement) across 96 paired runs on the batch-equivalence test set — 5 fixture targets (T1–T5) spanning paper-quality, deploy-readiness, and email-quality scoring; full target list at experiments/batch-equiv-2026-04-25/RESULTS.md. Per-backend: Gemini 2.5 Flash κ=0.642 (N=47), Qwen-Plus κ=0.621 (N=47); Claude κ=0.527 reported at N=2 — too few pairs for a stable estimate, included for transparency only. Passes the pre-registered ≥0.6 reproducibility floor. Raw runs and the aggregation script live in experiments/batch-equiv-2026-04-25/ — clone, run compute_kappa.py, get the same number. 115 tests, including two adversarial gates that fail the build if the scaffold breaks. Most LLM-as-judge tools score in one prompt and call it consistent; hermes-rubric forces three stages plus capping rules, and its adversarial tests catch fluency-inflation on every release.
echo "rate this paper" | hermes-rubric --target paper.md # score with full audit trail
Without a scaffold, LLM scores reward fluency. Well-written garbage outscores substantive-but-rough work. Re-run the same input — the number shifts. There's no audit trail, and no way to argue with it.
hermes-rubric replaces that with three sequential stages: (1) synthesize a domain-specific rubric from your intent + context + target type, (2) collect per-dimension evidence citations (file:line or quoted passage), explicitly hedging dimensions where evidence is thin, (3) score against the rubric and citations only. Fabricated claims can't outscore evidenced ones — enforced by adversarial test. See Examples below and evals/ for the worked-example reproducibility receipts.
Install
pip install hermes-rubric
Python 3.10+. No API key required out-of-the-box — works with the Claude Code CLI (claude) or local Ollama. See Backends for the full plugin matrix (Anthropic, OpenAI, Google, Qwen).
Quick start
hermes-rubric \
--intent "rate this as a publication-ready research artifact" \
--context STYLE-GUIDE.md \
--target paper.md \
--out result.json
Output (truncated):
{
"rubric": {"dimensions": [{"id": "claim_density", "weight": 3}, ...]},
"evidence_citations": [
{"dim_id": "claim_density", "citation": "paper.md:42", "quote": "..."}
],
"per_dim_scores": [{"dim_id": "claim_density", "score": 8, "rationale": "..."}, ...],
"aggregate": 8.7,
"max_possible": 10.0,
"hedge_dims": ["Reproducibility"],
"hedge_note": "1 dimension had thin evidence — score less reliable: Reproducibility",
"dim_summaries": [
{"dim_id": "claim_density", "name": "Claim Density", "score": 8, "weight": 3, "hedged": false}
],
"receipt": {"backend": "claude-cli", "timestamp_utc": "...", "input_hashes": {...}}
}
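The aggregate in the output above is a weighted roll-up of the per-dimension scores. A minimal sketch of one plausible aggregation (a weight-normalized mean; the exact formula hermes-rubric uses isn't shown in this README, so treat this as an assumption, not the implementation):

```python
def weighted_aggregate(per_dim_scores, weights):
    """Weighted mean of per-dimension scores on a 0-10 scale.

    `per_dim_scores` maps dim_id -> score, `weights` maps dim_id -> weight,
    mirroring the shape of the JSON above. The real aggregation formula
    may differ (assumption).
    """
    total_weight = sum(weights[d] for d in per_dim_scores)
    return sum(score * weights[d] for d, score in per_dim_scores.items()) / total_weight

scores = {"claim_density": 8, "reproducibility": 5}
weights = {"claim_density": 3, "reproducibility": 1}
print(round(weighted_aggregate(scores, weights), 2))  # (8*3 + 5*1) / 4 = 7.25
```

Heavily weighted dimensions dominate, which is why a hedged high-weight dimension degrades trust in the aggregate more than a hedged low-weight one.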
CLI
hermes-rubric --target <path> [options]
hermes-rubric kappa <result_a.json> <result_b.json> # cross-backend agreement
| Flag | Default | Purpose |
|---|---|---|
| `--target <path>` | required | File or directory to score |
| `--intent <text>` | required (unless `--artifact-class`) | One-sentence goal |
| `--context <path>` | required (unless `--artifact-class`) | Context for rubric synthesis |
| `--target-type <label>` | `document` | Tag for the target kind (e.g. `paper`, `tool`, `repo`) |
| `--out <path>` | stdout | Output JSON path |
| `--backend <name>` | auto-detect | One of: `claude-cli`, `ollama-local`, `dashscope-qwen`, `google-gemini`, `openai`, `openai-sdk`, `google-genai`, or any registered plugin |
| `--scope-class <name>` | none | `gate-plan` / `sweep-plan` / `results-bundle` — biases the synthesizer toward the right axes |
| `--intent-debias` | off | Prepend a debias preamble that neutralizes valence-loaded framing in the intent |
| `--artifact-class <name>` | none | Use a deterministic class template instead of LLM synthesis (see Class-aware mode) |
| `--batch` | off | Bundle evidence + scoring into one LLM call per stage; falls back to per-dim on parse failure |
| `--target-window-bytes <n>` | 8000 | Truncation cap for target/context content; oversize files emit a stderr warning |
| `--verbose` | off | Print stage progress to stderr |
Subcommand kappa: computes Cohen's κ between two completed runs. See hermes-rubric kappa --help.
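The κ the kappa subcommand reports can be sanity-checked by hand. Below is a self-contained sketch of unweighted Cohen's κ over two raters' categorical labels; how hermes-rubric bins 0–10 scores into categories is not specified in this README, so the labels here are illustrative:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled independently at their marginal rates.
    expected = sum(count_a[c] * count_b.get(c, 0) for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["hi", "hi", "lo", "mid", "lo", "hi"]
b = ["hi", "mid", "lo", "mid", "lo", "hi"]
print(round(cohens_kappa(a, b), 3))  # → 0.75
```

κ = 1 means perfect agreement, κ = 0 means agreement no better than chance, which is why the pre-registered ≥0.6 floor is a meaningful bar rather than a raw percent-agreement number.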
Class-aware mode
When you score the same kind of artifact repeatedly, Stage-1 LLM synthesis re-invents the dim set on every run — same target, three runs, three different rubric hashes. Class templates fix that:
hermes-rubric --artifact-class social-post --target post.md --out result.json
Each class is a YAML at hermes_rubric/classes/<name>.yaml defining a fixed dim set, weights, voice priors, and class-specific slop signatures. Same input + same class = same rubric across runs, so dim-by-dim diff actually means something. Bundled classes: social-post, show-hn-post, linkedin-post, outreach-email.
To add your own: in a development checkout (pip install -e .), drop a YAML next to the bundled ones. For installed distributions, fork the repo or maintain class YAMLs in your own package and load them via hermes_rubric.classes.load_class() — see src/hermes_rubric/classes/__init__.py for the loader.
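A class file might look like the sketch below. The schema is not documented in this README, so every field name here (`dimensions`, `weight`, `voice_priors`, `slop_signatures`) is illustrative rather than authoritative; check the bundled YAMLs under hermes_rubric/classes/ for the real shape:

```yaml
# hermes_rubric/classes/my-changelog.yaml — illustrative schema only, not the real one
name: my-changelog
dimensions:
  - id: specificity
    weight: 3
  - id: user_impact
    weight: 2
voice_priors:
  - "terse, past-tense, no marketing adjectives"
slop_signatures:
  - "we're excited to announce"
```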
What changes for you immediately
After pip install hermes-rubric, the next time you ask a model to score something:
- Every score comes with a citation list — `file:line` or quoted passage per dimension. No more "8.4/10" with no audit trail.
- Dimensions where evidence was thin get clamped to [3, 7] and flagged as `hedge_dims`. The model can't bury weak evidence under a confident number.
- Re-running the same input + backend + rubric source produces the same score (within ±1) — receipts record the input hashes, so drift is detectable.
- Fluent-but-empty prose stops outscoring substantive-but-rough work — an adversarial test in `tests/test_adversarial.py` fails the build if it does.
- Domain-specific rubrics auto-synthesize from your intent + context, instead of falling back to a generic "academic quality" template.
Most users notice the receipts more than the score. The score is the headline; the audit trail is the product.
Known limitations (honest list)
- The Stage-1 LLM rubric synthesis introduces a generic-rubric tail when context is sparse. Mitigated by `--artifact-class <name>` for repeated artifact types; not yet auto-suggested.
- κ measured on N=96 paired runs across 5 fixture targets (T1–T5) — that's evidence for batch-vs-per-dim equivalence on this test set, not yet a generalization claim across all artifact domains. Cross-domain κ (paper-quality vs deploy-readiness vs lead-score) is on the roadmap (see `experiments/rubric-quality-PROPOSAL.md`).
- The Anthropic SDK backend exists, but the cross-model κ figure includes only N=2 Claude pairs — a small sample; the deferred Claude paper-grade run is noted in ACTIONABLES.md.
- Stage-2 evidence collection is deterministic given a synthesized rubric, but Stage 1 is not — the same intent + context can produce slightly different rubric dim sets across runs. Use `--artifact-class` for full reproducibility.
Examples
Three real worked examples ship in-repo:
- `evals/wedge-variance/` — variance comparison: hermes-rubric `aggregate` vs raw 0–10 LLM rating, same target × same backend. Demonstrates the variance-reduction wedge with a reproducible runner.
- `applied/papers-20260423.md` — two publicly published Zenodo papers scored on publication-readiness as worked examples (Asymmetric Burden of Proof, Taxonomy of Epistemic Failure Modes).
- `calibration/dataset.jsonl` — 7 labeled cases with human scores, used for cross-backend κ measurement and as a regression fixture.
Verify the cross-model κ claim yourself
The "Cohen's κ = 0.629" headline is the load-bearing public claim. Reproduce it from the raw artifacts in-repo:
git clone https://github.com/hermes-labs-ai/hermes-rubric && cd hermes-rubric
python experiments/batch-equiv-2026-04-25/compute_kappa.py
# Per-target κ table, per-backend mean, overall mean. Should match RESULTS.md.
If the script's output doesn't match the README number, file an issue — the chain is broken and we want to know.
What the output means
- `aggregate` — weighted score (0–10). Signal, not verdict.
- `hedge_dims` — dimensions where evidence was thin. Scores in these dims are clamped to [3, 7]. The more hedged dimensions, the less you should trust the aggregate.
- `evidence_citations` — every score ties back to a quoted passage or `file:line`. This is the audit trail.
- `receipt` — same inputs + same backend + same rubric source produces scores within ±1 across runs. The receipt records backend, timestamp, input hashes, and rubric source (`synthesized` vs `class-template`).
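Programmatically, these fields make it easy to gate on hedging — for example, refusing to act on an aggregate when too much rubric weight sits in hedged dimensions. The threshold below is an arbitrary example, not a recommended value:

```python
def trustworthy(result, max_hedged_weight_frac=0.25):
    """Reject an aggregate if too much rubric weight sits in hedged dimensions."""
    dims = result["dim_summaries"]
    total = sum(d["weight"] for d in dims)
    hedged = sum(d["weight"] for d in dims if d["hedged"])
    return hedged / total <= max_hedged_weight_frac

result = {
    "aggregate": 8.7,
    "dim_summaries": [
        {"dim_id": "claim_density", "weight": 3, "hedged": False},
        {"dim_id": "reproducibility", "weight": 1, "hedged": True},
    ],
}
print(trustworthy(result))  # hedged fraction 1/4 = 0.25, within threshold → True
```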
Backends
Seven built-in backends, auto-detected in priority order. Force one with --backend <name>:
| Backend | Requires | Notes |
|---|---|---|
| `claude-cli` | Claude Code installed (`claude --print`) | Default. Highest consistency. |
| `ollama-local` | Ollama running locally (default `qwen3.5:14b`) | Zero cost, offline. Fallback chain: `gemma3:12b` → `gemma3:4b` → `mistral:7b` → `qwen3.5:9b` → `qwen3.5:4b`. |
| `dashscope-qwen` | `DASHSCOPE_API_KEY` | Alibaba Cloud Qwen. |
| `google-gemini` | `GOOGLE_GEMINI_API_KEY` | REST. |
| `openai` | `OPENAI_API_KEY` | REST, no SDK dep. |
| `openai-sdk` | `OPENAI_API_KEY` + `pip install hermes-rubric[openai]` | Uses the official SDK. |
| `google-genai` | `GOOGLE_GEMINI_API_KEY` + `pip install hermes-rubric[google]` | Uses the `google-genai` SDK. |
Plugging in your own backend
Backends conform to a single BackendProtocol:
class BackendProtocol(Protocol):
name: str
def call(self, prompt: str, max_tokens: int = 2048) -> str: ...
def detect_available(self) -> bool: ...
Register at runtime:
from hermes_rubric.backends import register
class MyBackend:
name = "my-backend"
def call(self, prompt, max_tokens=2048): ...
def detect_available(self): ...
register(MyBackend())
Or ship as a third-party package via the hermes_rubric.backends entry-point group:
# pyproject.toml of your plugin package
[project.entry-points."hermes_rubric.backends"]
my-backend = "my_pkg.backend:MyBackend"
hermes-rubric discovers entry-point plugins on first call. See src/hermes_rubric/backends.py for the reference implementation.
Library usage
from hermes_rubric.synthesize import synthesize
from hermes_rubric.evidence import collect_evidence
from hermes_rubric.score import score_dimensions, compute_aggregate
rubric = synthesize(
intent="...",
context_summary="...",
target_type="paper",
target_excerpt="...",
)
evidence = collect_evidence(
rubric=rubric,
target_content="...",
target_path="paper.md",
)
scores = score_dimensions(rubric=rubric, evidence_list=evidence)
result = compute_aggregate(rubric=rubric, scores=scores)
When to use it
- Scoring artifacts where fluency-vs-substance divergence matters (papers, proposals, PRs, cold emails, lead dossiers).
- You need an audit trail — "the model said 8.7" isn't enough; you need to know why.
- You're calibrating against a specific style guide or rubric and generic "quality vibes" won't do.
- You want the same input to produce a score you can reproduce and defend.
When not to use it
- Binary pass/fail gates — use a deterministic linter instead.
- Single-sentence inputs — there's no evidence surface for the rubric to cite.
- Scoring at high volume where cost matters more than fidelity — use a cheaper heuristic.
- Adversarial scoring where the author controls both the artifact and the rubric synthesis.
Calibration
- `calibration/dataset.jsonl` — 7 labeled cases across paper-quality, tool-fit, and deploy-readiness domains. All targets are publicly available artifacts (Zenodo papers, public OSS tools).
- `calibration/META-RUBRIC.md` — the rubric for evaluating rubric generators. 7 dimensions, each motivated by a specific LLM failure mode from the taxonomy below.
- `calibration/failure-mode-taxonomy.md` — 24 failure modes mined from the Hermes Labs research corpus (1,892 experiment records + named post-mortem incidents). Each FM cites a source artifact.
Evals
- `evals/wedge-variance/` — variance comparison: hermes-rubric `aggregate` vs raw 0–10 LLM rating, same target × same backend. Demonstrates the variance-reduction wedge.
- `applied/papers-20260423.md` — two publicly published research papers scored on publication-readiness as worked examples:

  | Paper | Aggregate |
  |---|---|
  | Taxonomy of Epistemic Failure Modes | 6.9 |
  | Asymmetric Burden of Proof | 6.5 |

  Each score has a full rubric + citations + per-dimension rationale in the file.
Running the tests
git clone https://github.com/hermes-labs-ai/hermes-rubric
cd hermes-rubric
pip install -e ".[dev]"
pytest
115 tests (111 passing + 4 skipped) across 14 files, including two adversarial gates that fail the build if the scaffold breaks, plus a mechanical doc-consistency gate (tests/test_docs_consistency.py) that fails CI if the README's stated test count and the `pytest --collect-only` count drift apart:
- `test_fluency_does_not_inflate_evidence_score` — a fluent rewrite of weak evidence must not outscore a substantive-but-rough version by more than 1 point.
- `test_fabricated_claim_does_not_outscore_evidenced_claim` — claims without supporting evidence are capped at ≤3.
CI status: see the GitHub Actions badge at the top of this README.
License
MIT — see LICENSE.
About Hermes Labs
Hermes Labs builds AI audit infrastructure for enterprise AI systems — EU AI Act readiness, ISO 42001 evidence bundles, continuous compliance monitoring, agent-level risk testing. We work with teams shipping AI into regulated environments.
Our OSS philosophy — read this if you're deciding whether to depend on us:
- Everything we release is free, forever. MIT or Apache-2.0. No "open core," no SaaS tier upsell, no paid version with the features you actually need. You can run this repo commercially, without talking to us.
- We open-source our own infrastructure. The tools we release are what Hermes Labs uses internally — we don't publish demo code, we publish production code.
- We sell audit work, not licenses. If you want an ANNEX-IV pack, an ISO 42001 evidence bundle, gap analysis against the EU AI Act, or agent-level red-teaming delivered as a report, that's at hermes-labs.ai. If you just want the code to run it yourself, it's right here.
The Hermes Labs OSS audit stack (public, production-grade, no SaaS):
Static audit (before deployment)
- lintlang — Static linter for AI agent configs, tool descriptions, system prompts. `pip install lintlang`
- scaffold-lint — Static linter for LLM prompt scaffolds. `pip install scaffold-lint`
- rule-audit — Static prompt audit — contradictions, coverage gaps, priority ambiguities
- intent-verify — Repo intent verification + spec-drift checks
- repo-audit — Multi-signal repo readiness check
Runtime observability (while the agent runs)
- little-canary — Prompt injection detection via sacrificial canary-model probes
- suy-sideguy — Runtime policy guard — user-space enforcement + forensic reports
- colony-probe — Prompt confidentiality audit — detects system-prompt reconstruction
Scoring & regression (to prove what changed)
- hermes-rubric — Evidence-first structured scoring (this tool). `pip install hermes-rubric`
- hermes-jailbench — Jailbreak regression benchmark. `pip install hermes-jailbench`
- agent-convergence-scorer — Score how similar N agent outputs are. `pip install agent-convergence-scorer`
Supporting infra
Natural pairing: scaffold-lint catches how much scaffolding you have. lintlang catches how well-structured it is. rule-audit catches what the rules contradict. hermes-rubric scores the thing the agent finally produced — with citations.
Built by Hermes Labs · @roli-lpci
File details
Details for the file hermes_rubric-1.0.0.tar.gz.
File metadata
- Download URL: hermes_rubric-1.0.0.tar.gz
- Upload date:
- Size: 343.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `157e8510d1d12a45c9ecad172361d69cd91f25761fd395a7d9bcbf33fb825050` |
| MD5 | `caea6635b36edaead20cf38ce0cf43fa` |
| BLAKE2b-256 | `c540b7999b5a993eabf29870462b89917dc5aea336edbff564b3ea59159d3ca9` |
File details
Details for the file hermes_rubric-1.0.0-py3-none-any.whl.
File metadata
- Download URL: hermes_rubric-1.0.0-py3-none-any.whl
- Upload date:
- Size: 48.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f8056f2c6acc57579e9b1bb934e6fd33118a867bed636ab5815cb9473d8be014` |
| MD5 | `6c94402d97bcf6f4dd47d820c0324ab8` |
| BLAKE2b-256 | `a151fefbc5d2b007af5e9f4a650345c393f42795a13792e6899740a321b4a1f2` |