Evidence-first structured scoring. Class-aware rubric templates for deterministic dim sets across runs.
Project description
hermes-rubric
Evidence-first structured scoring for LLM-judged artifacts. 62.9% chance-corrected agreement (Cohen's κ = 0.629, N=96 paired runs) across three model families on the batch-equivalence test set. 115 tests with two adversarial gates. Forces a three-stage scaffold — synthesize a domain rubric, collect per-dimension citations, then score against the evidence — so the number at the end has an audit trail.
Cross-model Cohen's κ = 0.629 (62.9% chance-corrected agreement) across 96 paired runs on the batch-equivalence test set — 5 fixture targets (T1–T5) spanning paper-quality, deploy-readiness, and email-quality scoring; full target list at experiments/batch-equiv-2026-04-25/RESULTS.md. Per-backend: Gemini 2.5 Flash κ=0.642 (N=47), Qwen-Plus κ=0.621 (N=47); Claude κ=0.527 reported at N=2 — too few pairs for a stable estimate, included for transparency only. Passes the pre-registered ≥0.6 reproducibility floor. Raw runs and the aggregation script live in experiments/batch-equiv-2026-04-25/ — clone, run compute_kappa.py, get the same number. 115 tests, including two adversarial gates that fail the build if the scaffold breaks. Most LLM-as-judge tools score in one prompt and call it consistent; hermes-rubric forces three stages plus capping rules, and its adversarial tests catch fluency-inflation on every release.
echo "rate this paper" | hermes-rubric --target paper.md # score with full audit trail
Without a scaffold, LLM scores reward fluency. Well-written garbage outscores substantive-but-rough work. Re-run the same input — the number shifts. There's no audit trail, and no way to argue with it.
hermes-rubric replaces that with three sequential stages: (1) synthesize a domain-specific rubric from your intent + context + target type, (2) collect per-dimension evidence citations (file:line or quoted passage), explicitly hedging dimensions where evidence is thin, (3) score against the rubric and citations only. Fabricated claims can't outscore evidenced ones — enforced by adversarial test. See Examples below and evals/ for the worked-example reproducibility receipts.
Install
pip install hermes-rubric
Python 3.10+. No API key required out-of-the-box — works with the Claude Code CLI (claude) or local Ollama. See Backends for the full plugin matrix (Anthropic, OpenAI, Google, Qwen).
Quick start
hermes-rubric \
--intent "rate this as a publication-ready research artifact" \
--context STYLE-GUIDE.md \
--target paper.md \
--out result.json
Output (truncated):
{
"rubric": {"dimensions": [{"id": "claim_density", "weight": 3}, ...]},
"evidence_citations": [
{"dim_id": "claim_density", "citation": "paper.md:42", "quote": "..."}
],
"per_dim_scores": [{"dim_id": "claim_density", "score": 8, "rationale": "..."}, ...],
"aggregate": 8.7,
"max_possible": 10.0,
"hedge_dims": ["Reproducibility"],
"hedge_note": "1 dimension had thin evidence — score less reliable: Reproducibility",
"dim_summaries": [
{"dim_id": "claim_density", "name": "Claim Density", "score": 8, "weight": 3, "hedged": false}
],
"receipt": {"backend": "claude-cli", "timestamp_utc": "...", "input_hashes": {...}}
}
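The aggregate in the output above is a weighted roll-up of the per-dimension scores. A minimal sketch of one plausible aggregation (a weight-normalized mean; the exact formula hermes-rubric uses isn't shown in this README, so treat this as an assumption, not the implementation):

```python
def weighted_aggregate(per_dim_scores, weights):
    """Weighted mean of per-dimension scores on a 0-10 scale.

    `per_dim_scores` maps dim_id -> score, `weights` maps dim_id -> weight,
    mirroring the shape of the JSON above. The real aggregation formula
    may differ (assumption).
    """
    total_weight = sum(weights[d] for d in per_dim_scores)
    return sum(score * weights[d] for d, score in per_dim_scores.items()) / total_weight

scores = {"claim_density": 8, "reproducibility": 5}
weights = {"claim_density": 3, "reproducibility": 1}
print(round(weighted_aggregate(scores, weights), 2))  # (8*3 + 5*1) / 4 = 7.25
```

Heavily weighted dimensions dominate, which is why a hedged high-weight dimension degrades trust in the aggregate more than a hedged low-weight one.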
CLI
hermes-rubric --target <path> [options]
hermes-rubric kappa <result_a.json> <result_b.json> # cross-backend agreement
| Flag | Default | Purpose |
|---|---|---|
| `--target <path>` | required | File or directory to score |
| `--intent <text>` | required (unless `--artifact-class`) | One-sentence goal |
| `--context <path>` | required (unless `--artifact-class`) | Context for rubric synthesis |
| `--target-type <label>` | `document` | Tag for the target kind (e.g. `paper`, `tool`, `repo`) |
| `--out <path>` | stdout | Output JSON path |
| `--backend <name>` | auto-detect | One of: `claude-cli`, `ollama-local`, `dashscope-qwen`, `google-gemini`, `openai`, `openai-sdk`, `google-genai`, or any registered plugin |
| `--scope-class <name>` | none | `gate-plan` / `sweep-plan` / `results-bundle` — biases the synthesizer toward the right axes |
| `--intent-debias` | off | Prepend a debias preamble that neutralizes valence-loaded framing in the intent |
| `--artifact-class <name>` | none | Use a deterministic class template instead of LLM synthesis (see Class-aware mode) |
| `--batch` | off | Bundle evidence + scoring into one LLM call per stage; falls back to per-dim on parse failure |
| `--target-window-bytes <n>` | 8000 | Truncation cap for target/context content; oversize files emit a stderr warning |
| `--verbose` | off | Print stage progress to stderr |
Subcommand kappa: computes Cohen's κ between two completed runs. See hermes-rubric kappa --help.
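The κ the kappa subcommand reports can be sanity-checked by hand. Below is a self-contained sketch of unweighted Cohen's κ over two raters' categorical labels; how hermes-rubric bins 0–10 scores into categories is not specified in this README, so the labels here are illustrative:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both raters labeled independently at their marginal rates.
    expected = sum(count_a[c] * count_b.get(c, 0) for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["hi", "hi", "lo", "mid", "lo", "hi"]
b = ["hi", "mid", "lo", "mid", "lo", "hi"]
print(round(cohens_kappa(a, b), 3))  # → 0.75
```

κ = 1 means perfect agreement, κ = 0 means agreement no better than chance, which is why the pre-registered ≥0.6 floor is a meaningful bar rather than a raw percent-agreement number.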
Class-aware mode
When you score the same kind of artifact repeatedly, Stage-1 LLM synthesis re-invents the dim set on every run — same target, three runs, three different rubric hashes. Class templates fix that:
hermes-rubric --artifact-class social-post --target post.md --out result.json
Each class is a YAML at hermes_rubric/classes/<name>.yaml defining a fixed dim set, weights, voice priors, and class-specific slop signatures. Same input + same class = same rubric across runs, so dim-by-dim diff actually means something. Bundled classes: social-post, show-hn-post, linkedin-post, outreach-email.
To add your own: in a development checkout (pip install -e .), drop a YAML next to the bundled ones. For installed distributions, fork the repo or maintain class YAMLs in your own package and load them via hermes_rubric.classes.load_class() — see src/hermes_rubric/classes/__init__.py for the loader.
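A class file might look like the sketch below. The schema is not documented in this README, so every field name here (`dimensions`, `weight`, `voice_priors`, `slop_signatures`) is illustrative rather than authoritative; check the bundled YAMLs under hermes_rubric/classes/ for the real shape:

```yaml
# hermes_rubric/classes/my-changelog.yaml — illustrative schema only, not the real one
name: my-changelog
dimensions:
  - id: specificity
    weight: 3
  - id: user_impact
    weight: 2
voice_priors:
  - "terse, past-tense, no marketing adjectives"
slop_signatures:
  - "we're excited to announce"
```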
What changes for you immediately
After pip install hermes-rubric, the next time you ask a model to score something:
- Every score comes with a citation list — `file:line` or quoted passage per dimension. No more "8.4/10" with no audit trail.
- Dimensions where evidence was thin get clamped to [3, 7] and flagged as `hedge_dims`. The model can't bury weak evidence under a confident number.
- Re-running the same input + backend + rubric source produces the same score (within ±1) — receipts record the input hashes, so drift is detectable.
- Fluent-but-empty prose stops outscoring substantive-but-rough work — an adversarial test in `tests/test_adversarial.py` fails the build if it does.
- Domain-specific rubrics auto-synthesize from your intent + context, instead of falling back to a generic "academic quality" template.
Most users notice the receipts more than the score. The score is the headline; the audit trail is the product.
Known limitations (honest list)
- The Stage-1 LLM rubric synthesis introduces a generic-rubric tail when context is sparse. Mitigated by `--artifact-class <name>` for repeated artifact types; not yet auto-suggested.
- κ measured on N=96 paired runs across 5 fixture targets (T1–T5) — that's evidence for batch-vs-per-dim equivalence on this test set, not yet a generalization claim across all artifact domains. Cross-domain κ (paper-quality vs deploy-readiness vs lead-score) is on the roadmap (see `experiments/rubric-quality-PROPOSAL.md`).
- The Anthropic SDK backend exists, but the cross-model κ figure includes only N=2 Claude pairs — a small sample; the deferred Claude paper-grade run is noted in ACTIONABLES.md.
- Stage-2 evidence collection is deterministic given a synthesized rubric, but Stage 1 is not — the same intent + context can produce slightly different rubric dim sets across runs. Use `--artifact-class` for full reproducibility.
Examples
Three real worked examples ship in-repo:
- `evals/wedge-variance/` — variance comparison: hermes-rubric `aggregate` vs raw 0–10 LLM rating, same target × same backend. Demonstrates the variance-reduction wedge with a reproducible runner.
- `applied/papers-20260423.md` — two publicly published Zenodo papers scored on publication-readiness as worked examples (Asymmetric Burden of Proof, Taxonomy of Epistemic Failure Modes).
- `calibration/dataset.jsonl` — 7 labeled cases with human scores, used for cross-backend κ measurement and as a regression fixture.
Verify the cross-model κ claim yourself
The "Cohen's κ = 0.629" headline is the load-bearing public claim. Reproduce it from the raw artifacts in-repo:
git clone https://github.com/hermes-labs-ai/hermes-rubric && cd hermes-rubric
python experiments/batch-equiv-2026-04-25/compute_kappa.py
# Per-target κ table, per-backend mean, overall mean. Should match RESULTS.md.
If the script's output doesn't match the README number, file an issue — the chain is broken and we want to know.
What the output means
- `aggregate` — weighted score (0–10). Signal, not verdict.
- `hedge_dims` — dimensions where evidence was thin. Scores in these dims are clamped to [3, 7]. The more hedged dimensions, the less you should trust the aggregate.
- `evidence_citations` — every score ties back to a quoted passage or `file:line`. This is the audit trail.
- `receipt` — same inputs + same backend + same rubric source produces scores within ±1 across runs. The receipt records backend, timestamp, input hashes, and rubric source (`synthesized` vs `class-template`).
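Programmatically, these fields make it easy to gate on hedging — for example, refusing to act on an aggregate when too much rubric weight sits in hedged dimensions. The threshold below is an arbitrary example, not a recommended value:

```python
def trustworthy(result, max_hedged_weight_frac=0.25):
    """Reject an aggregate if too much rubric weight sits in hedged dimensions."""
    dims = result["dim_summaries"]
    total = sum(d["weight"] for d in dims)
    hedged = sum(d["weight"] for d in dims if d["hedged"])
    return hedged / total <= max_hedged_weight_frac

result = {
    "aggregate": 8.7,
    "dim_summaries": [
        {"dim_id": "claim_density", "weight": 3, "hedged": False},
        {"dim_id": "reproducibility", "weight": 1, "hedged": True},
    ],
}
print(trustworthy(result))  # hedged fraction 1/4 = 0.25, within threshold → True
```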
Backends
Seven built-in backends, auto-detected in priority order. Force one with --backend <name>:
| Backend | Requires | Notes |
|---|---|---|
| `claude-cli` | Claude Code installed (`claude --print`) | Default. Highest consistency. |
| `ollama-local` | Ollama running locally (default `qwen3.5:14b`) | Zero cost, offline. Fallback chain: `gemma3:12b` → `gemma3:4b` → `mistral:7b` → `qwen3.5:9b` → `qwen3.5:4b`. |
| `dashscope-qwen` | `DASHSCOPE_API_KEY` | Alibaba Cloud Qwen. |
| `google-gemini` | `GOOGLE_GEMINI_API_KEY` | REST. |
| `openai` | `OPENAI_API_KEY` | REST, no SDK dep. |
| `openai-sdk` | `OPENAI_API_KEY` + `pip install hermes-rubric[openai]` | Uses the official SDK. |
| `google-genai` | `GOOGLE_GEMINI_API_KEY` + `pip install hermes-rubric[google]` | Uses the `google-genai` SDK. |
Plugging in your own backend
Backends conform to a single BackendProtocol:
class BackendProtocol(Protocol):
name: str
def call(self, prompt: str, max_tokens: int = 2048) -> str: ...
def detect_available(self) -> bool: ...
Register at runtime:
from hermes_rubric.backends import register
class MyBackend:
name = "my-backend"
def call(self, prompt, max_tokens=2048): ...
def detect_available(self): ...
register(MyBackend())
Or ship as a third-party package via the hermes_rubric.backends entry-point group:
# pyproject.toml of your plugin package
[project.entry-points."hermes_rubric.backends"]
my-backend = "my_pkg.backend:MyBackend"
hermes-rubric discovers entry-point plugins on first call. See src/hermes_rubric/backends.py for the reference implementation.
Library usage
from hermes_rubric.synthesize import synthesize
from hermes_rubric.evidence import collect_evidence
from hermes_rubric.score import score_dimensions, compute_aggregate
rubric = synthesize(
intent="...",
context_summary="...",
target_type="paper",
target_excerpt="...",
)
evidence = collect_evidence(
rubric=rubric,
target_content="...",
target_path="paper.md",
)
scores = score_dimensions(rubric=rubric, evidence_list=evidence)
result = compute_aggregate(rubric=rubric, scores=scores)
When to use it
- Scoring artifacts where fluency-vs-substance divergence matters (papers, proposals, PRs, cold emails, lead dossiers).
- You need an audit trail — "the model said 8.7" isn't enough; you need to know why.
- You're calibrating against a specific style guide or rubric and generic "quality vibes" won't do.
- You want the same input to produce a score you can reproduce and defend.
When not to use it
- Binary pass/fail gates — use a deterministic linter instead.
- Single-sentence inputs — there's no evidence surface for the rubric to cite.
- Scoring at high volume where cost matters more than fidelity — use a cheaper heuristic.
- Adversarial scoring where the author controls both the artifact and the rubric synthesis.
Calibration
- `calibration/dataset.jsonl` — 7 labeled cases across paper-quality, tool-fit, and deploy-readiness domains. All targets are publicly available artifacts (Zenodo papers, public OSS tools).
- `calibration/META-RUBRIC.md` — the rubric for evaluating rubric generators. 7 dimensions, each motivated by a specific LLM failure mode from the taxonomy below.
- `calibration/failure-mode-taxonomy.md` — 24 failure modes mined from the Hermes Labs research corpus (1,892 experiment records + named post-mortem incidents). Each FM cites a source artifact.
Evals
- `evals/wedge-variance/` — variance comparison: hermes-rubric `aggregate` vs raw 0–10 LLM rating, same target × same backend. Demonstrates the variance-reduction wedge.
- `applied/papers-20260423.md` — two publicly published research papers scored on publication-readiness as worked examples:

  | Paper | Aggregate |
  |---|---|
  | Taxonomy of Epistemic Failure Modes | 6.9 |
  | Asymmetric Burden of Proof | 6.5 |

  Each score has a full rubric + citations + per-dimension rationale in the file.
Running the tests
git clone https://github.com/hermes-labs-ai/hermes-rubric
cd hermes-rubric
pip install -e ".[dev]"
pytest
115 tests (111 passing + 4 skipped) across 14 files, including two adversarial gates that fail the build if the scaffold breaks, plus a mechanical doc-consistency gate (tests/test_docs_consistency.py) that fails CI if the README's stated test count and the `pytest --collect-only` count drift apart:
- `test_fluency_does_not_inflate_evidence_score` — a fluent rewrite of weak evidence must not outscore a substantive-but-rough version by more than 1 point.
- `test_fabricated_claim_does_not_outscore_evidenced_claim` — claims without supporting evidence are capped at ≤3.
CI status: see the GitHub Actions badge at the top of this README.
License
MIT — see LICENSE.
About Hermes Labs
Hermes Labs builds AI audit infrastructure for enterprise AI systems — EU AI Act readiness, ISO 42001 evidence bundles, continuous compliance monitoring, agent-level risk testing. We work with teams shipping AI into regulated environments.
Our OSS philosophy — read this if you're deciding whether to depend on us:
- Everything we release is free, forever. MIT or Apache-2.0. No "open core," no SaaS tier upsell, no paid version with the features you actually need. You can run this repo commercially, without talking to us.
- We open-source our own infrastructure. The tools we release are what Hermes Labs uses internally — we don't publish demo code, we publish production code.
- We sell audit work, not licenses. If you want an ANNEX-IV pack, an ISO 42001 evidence bundle, gap analysis against the EU AI Act, or agent-level red-teaming delivered as a report, that's at hermes-labs.ai. If you just want the code to run it yourself, it's right here.
The Hermes Labs OSS audit stack (public, production-grade, no SaaS):
Static audit (before deployment)
- lintlang — Static linter for AI agent configs, tool descriptions, system prompts. `pip install lintlang`
- scaffold-lint — Static linter for LLM prompt scaffolds. `pip install scaffold-lint`
- rule-audit — Static prompt audit — contradictions, coverage gaps, priority ambiguities
- intent-verify — Repo intent verification + spec-drift checks
- repo-audit — Multi-signal repo readiness check
Runtime observability (while the agent runs)
- little-canary — Prompt injection detection via sacrificial canary-model probes
- suy-sideguy — Runtime policy guard — user-space enforcement + forensic reports
- colony-probe — Prompt confidentiality audit — detects system-prompt reconstruction
Scoring & regression (to prove what changed)
- hermes-rubric — Evidence-first structured scoring (this tool). `pip install hermes-rubric`
- hermes-jailbench — Jailbreak regression benchmark. `pip install hermes-jailbench`
- agent-convergence-scorer — Score how similar N agent outputs are. `pip install agent-convergence-scorer`
Supporting infra
Natural pairing: scaffold-lint catches how much scaffolding you have. lintlang catches how well-structured it is. rule-audit catches what the rules contradict. hermes-rubric scores the thing the agent finally produced — with citations.
Built by Hermes Labs · @roli-lpci
File details
Details for the file hermes_rubric-1.0.0.tar.gz.
File metadata
- Download URL: hermes_rubric-1.0.0.tar.gz
- Upload date:
- Size: 343.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `157e8510d1d12a45c9ecad172361d69cd91f25761fd395a7d9bcbf33fb825050` |
| MD5 | `caea6635b36edaead20cf38ce0cf43fa` |
| BLAKE2b-256 | `c540b7999b5a993eabf29870462b89917dc5aea336edbff564b3ea59159d3ca9` |
File details
Details for the file hermes_rubric-1.0.0-py3-none-any.whl.
File metadata
- Download URL: hermes_rubric-1.0.0-py3-none-any.whl
- Upload date:
- Size: 48.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `f8056f2c6acc57579e9b1bb934e6fd33118a867bed636ab5815cb9473d8be014` |
| MD5 | `6c94402d97bcf6f4dd47d820c0324ab8` |
| BLAKE2b-256 | `a151fefbc5d2b007af5e9f4a650345c393f42795a13792e6899740a321b4a1f2` |