Rubric compiler and judge engine for LLM evaluation

rubrify

rubrify lets you define structured evaluation rubrics as typed Python objects, compile them into immutable bundles, and run criterion-by-criterion LLM-based judgments against text responses. It also supports evolving rubrics against human-annotated datasets using GEPA's reflective prompt optimization.

Built on harn_ai for multi-provider LLM access (OpenAI, Anthropic, DeepSeek, Google, local proxies) and harn_agent for agent primitives. API keys are auto-discovered from environment variables.

Installation

Requires Python >= 3.12.

pip install rubrify

Or with uv:

uv add rubrify

Core dependencies: harn-ai, harn-agent, pydantic>=2.10.

For the rubric evolution system (GEPA integration):

pip install "rubrify[evolve]"

Or with uv:

uv add "rubrify[evolve]"

This adds gepa>=0.1.0 as a dependency.

To upgrade to the latest version:

pip install --upgrade rubrify

Or with uv:

uv add rubrify --upgrade

API Keys

rubrify discovers API keys from environment variables via harn. Each provider has a standard env var:

Provider	Environment Variable
DeepSeek	`DEEPSEEK_API_KEY`
OpenAI	`OPENAI_API_KEY`
Anthropic	`ANTHROPIC_API_KEY`
Google	`GEMINI_API_KEY`
Groq	`GROQ_API_KEY`
xAI	`XAI_API_KEY`
Mistral	`MISTRAL_API_KEY`
OpenRouter	`OPENROUTER_API_KEY`
Together	`TOGETHER_API_KEY`
Fireworks	`FIREWORKS_API_KEY`
Cerebras	`CEREBRAS_API_KEY`
HuggingFace	`HF_TOKEN`

Option 1: .env file (recommended)

Copy the included .env.example to .env and fill in your keys:

DEEPSEEK_API_KEY=sk-your-key-here

Then at the top of your script:

from dotenv import load_dotenv
load_dotenv()

load_dotenv comes from the python-dotenv package, which is not among the core dependencies. Install it with pip install python-dotenv or pip install "rubrify[dev]".

Option 2: Shell export

export DEEPSEEK_API_KEY=sk-your-key-here

Option 3: Direct parameter

from rubrify import Judge, JudgeConfig
judge = Judge(JudgeConfig(model=model, api_key="sk-your-key-here"))  # model: a harn_ai model, e.g. from get_model()

Quick Start

Define and compile a rubric

from rubrify import (
    CorpusProfile, Criterion, NumericScale, Rubric, RubricMeta,
    ScaleAnchor, ScopeSpec, compile_rubric,
)

rubric = Rubric(
    meta=RubricMeta(name="MyRubric", version="1.0"),
    goal="Evaluate response quality.",
    corpus_profile=CorpusProfile(
        id="cp_responses",
        domain="Customer support responses",
        typical_behaviors=["Agent addresses the customer's question directly"],
        atypical_behaviors=["Agent ignores the question entirely"],
        quality_axis="Helpfulness and clarity of response",
    ),
    criteria=[
        Criterion(
            id="C1",
            title="Clarity",
            description="How clear is the response?",
            scale=NumericScale(
                minimum=0, maximum=5, step=1,
                anchors=[
                    ScaleAnchor(value=0, label="unclear", description="Meaning obscured."),
                    ScaleAnchor(value=5, label="crystal", description="Perfectly clear."),
                ],
            ),
            weight=1.0,
            scope=ScopeSpec(
                in_scope=["Sentence structure", "Word choice precision"],
                out_of_scope=["Factual accuracy", "Tone or politeness"],
            ),
        ),
    ],
)

result = compile_rubric(rubric)
assert result.ok          # True if no audit issues
bundle = result.bundle    # Immutable RubricBundle, ready for judging

Run a judgment

import asyncio
from harn_ai.models import get_model
from rubrify import Judge, JudgeConfig

judge = Judge(JudgeConfig(model=get_model("openai", "gpt-4o")))
judgment = asyncio.run(judge.evaluate(bundle, "The response text to evaluate."))

print(judgment.aggregation.normalized_score)  # 0-100
print(judgment.decision)                       # e.g. "Strong draft"
for cj in judgment.criterion_judgments:
    print(f"  {cj.criterion_id}: {cj.value} (unit={cj.unit_score:.2f})")

Use a custom OpenAI-compatible proxy

from harn_ai.models import get_model
from rubrify import Judge, JudgeConfig

# Point the OpenAI-compatible client at a local proxy.
model = get_model("openai", "gpt-4o").model_copy(update={
    "baseUrl": "http://localhost:8000/v1",
    "api": "openai-completions",
})
judge = Judge(JudgeConfig(model=model, api_key="your-key"))

Core Concepts

IR Type System

All rubric structures are Pydantic models with extra="forbid" (no unexpected fields allowed).

Scale types are polymorphic, discriminated by the kind field. Each scale knows its own domain and implements to_unit(value) -> float to normalize raw scores into [0, 1]:

Scale	`kind`	Domain	`to_unit` behavior
`BinaryScale`	`"binary"`	pass/fail (configurable labels and scores)	`True -> true_score`, `False -> false_score`
`OrdinalScale`	`"ordinal"`	Ordered levels with named anchors	Linearly maps anchor values to `[0, 1]`
`NominalScale`	`"nominal"`	Unordered categories with anchors	Maps category value to `[0, 1]` by range
`NumericScale`	`"numeric"`	Continuous range `[minimum, maximum]` with step	`(value - min) / (max - min)`, clamped

The union type Scale is: Annotated[BinaryScale | OrdinalScale | NominalScale | NumericScale, Field(discriminator="kind")].
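To make the NumericScale row concrete, here is a minimal standalone sketch of the `(value - min) / (max - min)` normalization with clamping. `numeric_to_unit` is a hypothetical free function written for illustration; in rubrify the behavior lives on the scale object as `to_unit(value)`.

```python
def numeric_to_unit(value: float, minimum: float, maximum: float) -> float:
    """Map a raw score onto [0, 1], clamping values outside [minimum, maximum]."""
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    unit = (value - minimum) / (maximum - minimum)
    return min(1.0, max(0.0, unit))
```

For the 0-5 scale used in Quick Start, a raw score of 4 normalizes to 0.8, and an out-of-range 7 is clamped to 1.0.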

ScopeSpec defines an explicit positive/negative interpretation boundary for a criterion. It is structurally bound to a Criterion as an optional field, not a dangling reference. In XML it is rendered as <scope> children of <criterion> and survives all rendering modes (full, focused, group). It addresses the definition-mismatch problem, in which LLM judges interpret a criterion more broadly or narrowly than intended.

in_scope -- list of observable behaviors that count toward this criterion
out_of_scope -- list of observable behaviors that do NOT count

from rubrify import ScopeSpec

scope = ScopeSpec(
    in_scope=[
        "Selects the correct API endpoint for the task",
        "Chooses appropriate HTTP method",
    ],
    out_of_scope=[
        "Quality of error handling code",
        "Code style or formatting choices",
    ],
)

CorpusProfile provides domain context for the evaluation corpus. It is an optional field on Rubric and is rendered as <corpus_profile> at root level in all XML rendering modes. It supplies factual domain context (not hypotheses) so that the judge can calibrate its expectations about what is normal and what is unusual in the dataset.

id -- unique identifier
domain -- the domain being evaluated
typical_behaviors -- observable things entities in the corpus normally do
atypical_behaviors -- observable things entities in the corpus normally do NOT do
quality_axis -- the primary quality dimension being measured

from rubrify import CorpusProfile

profile = CorpusProfile(
    id="cp_coding_agents",
    domain="AI coding agent tool-call traces",
    typical_behaviors=[
        "Agents call file-read before file-write",
        "Agents use grep/find for discovery before targeted reads",
    ],
    atypical_behaviors=[
        "Agents skip discovery and write files blind",
        "Agents call tools with no arguments",
    ],
    quality_axis="How effectively the agent uses available tools to accomplish the task",
)

Criterion is the atomic evaluation unit. Key fields:

id -- unique identifier
title, description -- human-readable
scale -- one of the four scale types above
weight -- contribution to the aggregate score (default 1.0)
evidence -- EvidenceSpec controlling what evidence the judge must cite (note: required, exact_quote, min_items, and max_items on EvidenceSpec are prompt-only -- they are rendered into the XML surface so the LLM can see them, but are not enforced post-hoc by the engine)
genre -- for genre-conditional activation
mechanical_rules -- free-text rules rendered in XML
scope -- optional ScopeSpec defining the criterion's interpretation boundary

CriterionGroup provides hierarchical aggregation over criteria. Supported aggregation strategies: weighted_sum, weighted_mean, min, max, all, any.
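The numeric strategies can be sketched over a list of child unit scores and weights. This is an illustration under assumptions, not rubrify's aggregation code: `aggregate` is a hypothetical helper, and the `all` and `any` strategies are left out because their exact semantics are not spelled out here.

```python
def aggregate(strategy: str, scores: list[float], weights: list[float]) -> float:
    """Combine child unit scores with one of the numeric aggregation strategies."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and the same length")
    if strategy == "weighted_sum":
        return sum(s * w for s, w in zip(scores, weights))
    if strategy == "weighted_mean":
        total = sum(weights)
        if total == 0:
            raise ValueError("weights must not sum to zero")
        return sum(s * w for s, w in zip(scores, weights)) / total
    if strategy == "min":
        return min(scores)
    if strategy == "max":
        return max(scores)
    # "all" and "any" are intentionally not covered by this sketch.
    raise ValueError(f"unsupported strategy in this sketch: {strategy}")
```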

Disqualifier defines an auto-fail condition. It can be pattern-based (a regex scanned first against the judge rationales, then against the response text) or criterion-linked (triggered when a specific criterion's unit score is 0).

Rubric is the mutable, pre-compilation object. It contains criteria, groups, disqualifiers, instructions, patterns (PatternEntry for regex matching), definitions (Definition), advice rules (AdviceRule), calibration examples (CalibrationExample), and an optional corpus_profile (CorpusProfile). Model validators enforce unique criterion IDs and valid group/disqualifier references at construction time.

RubricBundle is the immutable, locked, executable form produced by the compiler. It contains the frozen rubric, compiled regex patterns, constraint bindings, authority blocks, surface policy, and output constraints. The bundle is frozen via Pydantic's frozen=True config.

Roles and Constraints

RoleSpec defines the judge's persona, authority level (absolute, advisory, peer), domain, obligations (what the model MUST do), and constraints (what the model MUST NOT do). It is a structural component, not a cosmetic prompt prefix.

SurfacePolicy governs how rubrics are rendered. Fields include:

input_codec -- currently only "xml"
output_codec -- currently only "json"
role -- optional RoleSpec
enforce_key_order -- whether to enforce JSON key ordering
criterion_focus -- "full" (send entire rubric per criterion) or "focused" (send only the relevant criterion)
decision_thresholds -- list of (min_score, label) tuples for custom decision labels
execution_strategy -- "per_criterion" (default), "grouped", or "holistic" (see Execution Strategies)

ConstraintBinding is the triple-layer alignment connecting a semantic criterion to its surface-layer projection (XML tag, JSON path) and output field. Each binding carries:

criterion_id -- which criterion
output_field -- JSON path where the judge writes its score (e.g. criterion_scores.C1)
evidence_source -- where evidence should come from (default "response")
projections -- list of SurfaceProjection objects (one per codec)
authority -- instruction, data, or meta

AuthorityBlock marks a prompt section as instruction vs. data, enforcing instruction/data separation.

OutputConstraint is a discriminated union of typed constraint variants, each with a concrete check(value) -> str | None method for enforcement. The union is Annotated[PrefixSuffixConstraint | WordCountConstraint | CharLimitConstraint | ItemCountConstraint | TokenConstraint, Field(discriminator="kind")]. Each variant carries id, description, target_field, enforcement ("hard" or "soft"), and scope. Hard constraints trigger disqualification; soft constraints produce warnings. Pydantic discriminates on the kind field ("prefix_suffix", "word_count", "char_limit", "item_count", "token").

The scope field controls when a constraint is checked relative to execution strategy:

Scope	Behavior	Default
`"call"`	Checked once per LLM call. For `per_criterion`, this is per criterion. For `grouped`/`holistic`, once per group/holistic call. Shared outputs (e.g. rationale) are deduplicated.	Yes
`"criterion"`	Checked per criterion individually, regardless of execution strategy.	No
`"judgment"`	Checked once on the final aggregated result (all criterion outputs concatenated).	No

from rubrify.ir.constraints import WordCountConstraint

constraint = WordCountConstraint(
    id="rationale_length",
    description="Rationale must be at least 10 words",
    target_field="rationale",
    enforcement="soft",
    scope="criterion",      # check every criterion's rationale, even in grouped mode
    count=10,
    mode="min",
)

Compiler Pipeline

compile_rubric(rubric, *, policy=None, output_constraints=None) -> CompilationResult

This is a synchronous, pure function (no LLM calls). It runs these passes:

Bind -- generates a ConstraintBinding for each criterion, with XML and JSON SurfaceProjection objects. This is the triple-layer alignment: criterion ID maps to XML attributes and JSON output path.
AuthorityBlocks -- creates standard authority blocks for rubric_spec, response_under_test, judge_instructions, and context_document.
Lock -- produces an immutable RubricBundle via lock_bundle(). Compiles all PatternEntry and Disqualifier regex patterns (fails loudly on invalid regex).
Audit -- audit passes check:
- audit_coverage -- every criterion has a binding
- audit_projection_completeness -- every binding has projections matching the policy's codecs
- audit_scale_consistency -- ordinal scales have anchors, numeric scales have max > min
- audit_output_constraints -- recognized fields, duplicate IDs, hard-enforcement safety
- audit_scope_completeness -- warns when a ScopeSpec has empty in_scope or out_of_scope (a scope that only defines one side is likely incomplete)
- audit_hypothesis_neutrality -- warns when instructions contain comparative or hypothesis language (e.g., "difference between", "designed to show") that embeds conclusions into the evaluation instrument

CompilationResult has an .ok property (True if no issues) and an .issues list.
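As an illustration of what an audit pass can look like, the sketch below flags the two example phrases quoted for audit_hypothesis_neutrality. It is not the library's implementation: `find_hypothesis_language` and `HYPOTHESIS_PHRASES` are hypothetical names, and the real pass may check a longer or different phrase list.

```python
import re

# The two phrases come from the audit description above; the list is illustrative only.
HYPOTHESIS_PHRASES = ("difference between", "designed to show")


def find_hypothesis_language(instructions: list[str]) -> list[str]:
    """Return one issue string per instruction/phrase pair that matches."""
    issues = []
    for index, text in enumerate(instructions):
        for phrase in HYPOTHESIS_PHRASES:
            if re.search(re.escape(phrase), text, re.IGNORECASE):
                issues.append(f"instruction {index}: contains '{phrase}'")
    return issues
```

Like the compiler itself, a pass of this shape is a pure function: it reads the rubric text and returns issues without calling an LLM.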

Judge Engine

Judge and JudgeConfig

from rubrify import Judge, JudgeConfig
from harn_ai.models import get_model

judge = Judge(JudgeConfig(
    model=get_model("deepseek", "deepseek-v4-flash"),
    api_key=None,           # auto-discovered from env
    temperature=0.0,
    max_tokens=2048,
    parallel=False,         # True to evaluate call units concurrently
    use_tool=True,          # True for tool-based structured output
))

Judge is stateful: it tracks total_usage (token counts, API calls) and evaluation_count across all evaluations.

evaluate()

judgment = await judge.evaluate(
    bundle,
    response_text,
    context_text=None,     # optional reference context
    genre=None,            # optional genre for genre-conditional criteria
    on_criterion_start=None,  # callback(criterion_id)
    on_criterion_done=None,   # callback(criterion_id, CriterionJudgment)
)

The Judge Loop

run_judge_loop() is the core algorithm. It does not iterate on tool calls like an agent loop; it iterates over criteria (or groups of criteria, depending on the execution strategy). Steps:

Verify bundle is locked.
Resolve active criteria (genre filtering: criteria with genre=None are always active; others activate only when active_genre matches).
Partition active criteria into call units based on execution_strategy from the bundle's SurfacePolicy:
- "per_criterion" (default): one call unit per criterion.
- "grouped": one call unit per CriterionGroup; ungrouped criteria individually.
- "holistic": one call unit containing all active criteria.
Execute each call unit: single-criterion call units use execute_criterion(), multi-criterion call units use execute_group().
Check disqualifiers (pattern-based and criterion-linked).
Run mechanical pattern checks (PatternEntry patterns against the response).
Verify evidence quotes exist in the response text (exact containment, then normalized containment).
Verify output constraints against LLM output (respecting scope: call, criterion, or judgment).
Aggregate scores (weighted mean, or grouped aggregation if groups exist).
Compute decision label from thresholds (defaults: >=90 "Publish-ready", >=75 "Strong draft", >=60 "Workable draft", >=40 "Needs major revision", <40 "Fundamentally unclear"). Disqualifier violations produce "Rejected".

Execution supports parallel=True for concurrent call-unit evaluation via asyncio.gather.
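The final step of the loop, mapping a normalized score to a decision label, can be sketched with the default thresholds listed above. `decide` and the constant names are hypothetical; the sketch assumes thresholds are (min_score, label) tuples and that any disqualifier violation overrides the score.

```python
DEFAULT_THRESHOLDS = [
    (90.0, "Publish-ready"),
    (75.0, "Strong draft"),
    (60.0, "Workable draft"),
    (40.0, "Needs major revision"),
]
FALLBACK_LABEL = "Fundamentally unclear"  # default label for scores below 40


def decide(normalized_score: float, violations=(), thresholds=DEFAULT_THRESHOLDS) -> str:
    """Pick the label of the highest threshold the score reaches."""
    if violations:
        return "Rejected"  # disqualifiers take precedence over the score
    for min_score, label in sorted(thresholds, reverse=True):
        if normalized_score >= min_score:
            return label
    return FALLBACK_LABEL
```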

Execution Strategies

The execution_strategy field on SurfacePolicy controls how criteria are dispatched to LLM calls:

Strategy	Call granularity	Use when
`"per_criterion"`	One LLM call per criterion (default)	Need maximum isolation, deep per-criterion analysis
`"grouped"`	One LLM call per `CriterionGroup`	Rubric has logical groups, want intra-group coherence with composability
`"holistic"`	One LLM call for ALL active criteria	Few criteria, need holistic coherence, cost-sensitive

Set via SurfacePolicy:

from rubrify.ir.roles import SurfacePolicy

policy = SurfacePolicy(execution_strategy="grouped")
bundle = compile_rubric(rubric, policy=policy).bundle

Implementation details:

The judge loop partitions active criteria into "call units" based on strategy. Each call unit is one LLM invocation.
"grouped" uses CriterionGroup.children to determine call boundaries. Ungrouped criteria fall back to individual calls.
"holistic" places all active criteria into a single call unit.
Multi-criterion call units use execute_group(), which renders a group-specific XML prompt via render_group_xml() and extracts per-criterion scores from a single criterion_scores response dict.
Single-criterion call units use the original execute_criterion() path.
parallel=True parallelizes across call units (not within them).

CriterionExecutor

execute_criterion() has two strategies:

Tool-based (default, use_tool=True): Builds a harn_ai.types.Tool named submit_judgment with a dynamic Pydantic model as the parameter type. The provider forces structured JSON output via native tool-calling. The response is pre-parsed.
Text-based (use_tool=False): Sends a text prompt, then parses JSON from the response text using harn_ai's repair-capable JSON parser.

Both strategies extract criterion scores via typed Pydantic model attribute access, not dict navigation with string splitting.

Judgment Output Types

CriterionJudgment -- per-criterion result: criterion_id, value (raw score), unit_score (normalized 0-1), evidence (list of EvidenceQuote), rationale, confidence, warnings.
AggregatedScore -- raw_score, normalized_score (0-100), method, group_scores.
Judgment -- the complete output: criterion_judgments, aggregation, decision, violations, constraint_warnings, pattern_hits, usage (JudgeUsage), timestamp.
JudgeUsage -- tracks input_tokens, output_tokens, total_tokens, api_calls.

Codecs

XML Codec

render_rubric_xml(bundle) -> str renders a locked RubricBundle as an <LLM_JUDGE_SPEC> XML document. Uses xml.etree.ElementTree for proper DOM construction and escaping (no string concatenation). No XML parsing of untrusted input occurs in this codec (it only constructs and serializes).

Key design: bindings drive the criterion rendering. Each criterion's XML attributes come from its binding's SurfaceProjection(codec="xml"), not from raw criterion fields. This closes the triple-layer alignment loop.

The XML document includes: mission, role, corpus profile, rubric (criteria with anchors, evidence specs, and scope specs), disqualifiers, definitions, calibration examples, advice rules, output schema (with JSON template derived from the dynamic Pydantic model), scoring formula, pattern library, validation (output constraints), and instructions. ScopeSpec is rendered as <scope> children (<in_scope>, <out_of_scope>) of <criterion>. CorpusProfile is rendered as <corpus_profile> at root level in all three rendering modes (full, focused, group).
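The benefit of building the document with xml.etree.ElementTree rather than string concatenation is that escaping is handled by the serializer. The sketch below shows the idea for a single criterion with scope children. `render_criterion` is a hypothetical helper, and the attribute names and the one-element-per-item layout are assumptions, not the codec's actual output format.

```python
import xml.etree.ElementTree as ET


def render_criterion(criterion_id: str, title: str,
                     in_scope: list[str], out_of_scope: list[str]) -> str:
    """Build a <criterion> element with <scope> children and serialize it."""
    criterion = ET.Element("criterion", {"id": criterion_id, "title": title})
    scope = ET.SubElement(criterion, "scope")
    for item in in_scope:
        ET.SubElement(scope, "in_scope").text = item
    for item in out_of_scope:
        ET.SubElement(scope, "out_of_scope").text = item
    # Special characters in text and attributes are escaped on serialization.
    return ET.tostring(criterion, encoding="unicode")
```

Parsing the result back with `ET.fromstring` recovers the original strings, which is a quick way to confirm the output is well-formed.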

render_criterion_xml(criterion, bundle) -> str renders a focused document for a single criterion, used when criterion_focus == "focused".

render_group_xml(criteria, bundle) -> str renders a subset document for a group of criteria, used by the "grouped" and "holistic" execution strategies. Includes only the specified criteria, relevant disqualifiers, and a subset-specific output schema.

JSON Codec

parse_judgment_json(raw) -> dict parses LLM output using harn_ai's parse_json_with_repair. Raises ParseError on failure.

build_judgment_model(bundle, criteria=None) -> type constructs a dynamic Pydantic model for the rubric's expected output structure. Cached per criterion specs (LRU cache, max 32 entries). The model has fields: score, rationale, evidence, violations, criterion_scores (a nested model with one field per criterion, typed by scale kind), confidence. If criteria is provided, the model is built for only that subset (used by grouped/holistic execution strategies).

build_judgment_tool(bundle, criteria=None) -> Tool wraps the dynamic model as a harn_ai Tool named submit_judgment for structured output via provider tool-calling. If criteria is provided, the tool schema covers only that subset.

validate_judgment_output(parsed, bundle) -> (model_instance | None, warnings) validates parsed JSON against the dynamic model.

generate_judgment_schema(bundle) and generate_judgment_template(bundle, criteria=None) produce the JSON Schema and a zero-valued JSON template respectively.

Evolution System

The rubrify.evolve module requires the optional gepa dependency (pip install rubrify[evolve]).

It evolves rubric text components against human-annotated datasets to maximize agreement between automated LLM-judge evaluations and human expert annotations. Structural invariants (criterion IDs, scale types, ranges, groups, disqualifiers, patterns) are never changed. Evolvable components include: goal, criterion descriptions, anchor descriptions, weights, role persona/obligations/constraints, instructions, definitions, advice rules, calibration examples, scope specs (ScopeSpec.in_scope and ScopeSpec.out_of_scope), and corpus profile fields (CorpusProfile.typical_behaviors, CorpusProfile.atypical_behaviors, CorpusProfile.quality_axis).

AnnotatedExample

from rubrify.evolve import AnnotatedExample

example = AnnotatedExample(
    id="ex_001",
    response_text="The response to evaluate...",
    context_text="Optional reference context",
    human_scores={"C1": 4, "C2": 2},   # criterion_id -> human-assigned score
    human_label="good",                  # optional overall label
    genre="travel",                      # optional genre tag
)

evolve_rubric (Mode 1: Granular)

from harn_ai.models import get_model
from rubrify.evolve import evolve_rubric, RubricEvolutionConfig

result = evolve_rubric(
    seed_rubric=my_rubric,
    annotated_dataset=my_examples,           # list[AnnotatedExample]
    judge_model=get_model("deepseek", "deepseek-v4-flash"),
    reflection_model=get_model("openai", "gpt-4o"),
    role=my_role,                            # optional RoleSpec
    config=RubricEvolutionConfig(
        train_split=0.7,
        max_metric_calls=300,
        reflection_minibatch_size=5,
        agreement_weight=0.6,
        consistency_weight=0.2,
        discrimination_weight=0.2,
    ),
)

evolved_rubric = result.best_rubric
evolved_role = result.best_role
print(result.best_score, result.total_iterations)

GEPA iteratively mutates the rubric's text components using a reflection LM, guided by structured feedback from rubrify's judge comparing against human annotations. Each mutation is evaluated on a training minibatch, accepted if improved, then tracked on a validation set with Pareto-based candidate selection across three objectives:

Agreement -- 1 minus the mean normalized absolute error vs. human annotations (scaled 0-1).
Consistency -- 1 - coefficient of variation across repeated runs (optional, via consistency_runs > 1).
Discrimination -- normalized entropy of the score distribution (0 = all same score, 1 = uniform spread).
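The three objectives can be written down directly from their definitions. The functions below are a standalone sketch, not the code in meta_metric.py: the names are hypothetical, and the use of population standard deviation, the clamp at 0 for consistency, and normalizing entropy by the number of scale levels are all assumptions.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def agreement(judge_scores: list[float], human_scores: list[float], scale_range: float) -> float:
    """1 minus the mean absolute error, with errors normalized by the scale range."""
    errors = [abs(j - h) / scale_range for j, h in zip(judge_scores, human_scores)]
    return 1.0 - mean(errors)


def consistency(repeated_scores: list[float]) -> float:
    """1 minus the coefficient of variation across repeated runs, floored at 0."""
    mu = mean(repeated_scores)
    sigma = pstdev(repeated_scores)
    if mu == 0:
        return 1.0 if sigma == 0 else 0.0
    return max(0.0, 1.0 - sigma / abs(mu))


def discrimination(scores: list[float], num_levels: int) -> float:
    """Shannon entropy of the score distribution, normalized by log(num_levels)."""
    if num_levels < 2 or not scores:
        return 0.0
    total = len(scores)
    entropy = -sum((c / total) * math.log(c / total) for c in Counter(scores).values())
    return entropy / math.log(num_levels)
```

A judge that assigns the same score to every example gets a discrimination of 0, while one that spreads scores evenly over all levels approaches 1.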

Practical guidance from the source:

Use at least 30-50 annotated examples. With fewer than ~15 training examples, the evolved rubric tends to overfit.
Set discrimination_weight=0.0 with fewer than ~10 examples.
Set reflection_minibatch_size equal to training set size for tiny datasets.
Budget calibration: a "metric call" = one annotated example evaluated (not one batch or one iteration). Set max_metric_calls to at least 20x your dataset size for meaningful exploration. For example, with 40 examples, use max_metric_calls=800 or higher. Under-budgeting (e.g., 50 calls for a 40-example dataset) causes GEPA to flat-line after 1-2 iterations because it exhausts its budget before finding improvements.

evolve_rubric_v3 (Mode 3: Co-evolution)

Co-evolves the target rubric together with meta-components in a single GEPA loop:

Proposal quality gate rubric -- a lightweight rubric that pre-filters proposed mutations before expensive evaluation
Reflection prompt templates -- per-component-type specialized templates that guide the reflection LM
Acceptance parameters -- tolerance thresholds for multi-dimensional acceptance decisions

from rubrify.evolve import evolve_rubric_v3
from rubrify.evolve.evolver import CoEvolutionConfig

result = evolve_rubric_v3(
    seed_rubric=my_rubric,
    annotated_dataset=my_examples,
    judge_model=judge_model,
    reflection_model=reflection_model,
    config=CoEvolutionConfig(
        max_metric_calls=500,
        evolve_gate=True,
        evolve_reflection_templates=True,
        evolve_acceptance_params=True,
    ),
)

# Result includes evolved meta-components
result.evolved_gate_rubric
result.evolved_reflection_templates
result.evolved_acceptance_params

All four artifact types are packed into a single GEPA candidate dict with namespace prefixes (target., gate., reflection.template., acceptance.) and optimized via round-robin component selection. Meta-component mutations are accepted on non-degradation (lenient threshold) since they affect the search process, not the evaluation score directly.
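Packing several artifacts into one flat dict with namespace prefixes can be sketched as follows. The four prefixes are the ones named above; `pack`, `unpack`, and every key used inside a namespace are hypothetical and chosen only for illustration.

```python
PREFIXES = ("target.", "gate.", "reflection.template.", "acceptance.")


def pack(target: dict[str, str], gate: dict[str, str],
         templates: dict[str, str], acceptance: dict[str, str]) -> dict[str, str]:
    """Merge the four artifacts into one flat candidate dict."""
    candidate: dict[str, str] = {}
    for prefix, part in zip(PREFIXES, (target, gate, templates, acceptance)):
        for key, value in part.items():
            candidate[prefix + key] = value
    return candidate


def unpack(candidate: dict[str, str]) -> dict[str, dict[str, str]]:
    """Split a flat candidate back into per-namespace dicts keyed by prefix."""
    parts: dict[str, dict[str, str]] = {prefix: {} for prefix in PREFIXES}
    for key, value in candidate.items():
        for prefix in PREFIXES:
            if key.startswith(prefix):
                parts[prefix][key[len(prefix):]] = value
                break
        else:
            raise KeyError(f"unknown namespace: {key}")
    return parts
```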

Candidate Mapping

rubric_to_candidate(rubric, role) -> dict[str, str] decomposes a Rubric into GEPA's flat dict[str, str] format. Each value is a string; structured sub-components (anchor lists, instructions, scope specs, corpus profile behaviors) are serialized as JSON strings. Scope specs are mapped as criterion.{id}.scope.in_scope and criterion.{id}.scope.out_of_scope. Corpus profile fields are mapped as corpus_profile.typical, corpus_profile.atypical, and corpus_profile.quality_axis.

candidate_to_rubric(candidate, base_rubric, base_role) -> (Rubric, RoleSpec | None) reconstructs from the flat format, using the base rubric as a structural template.
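The flat format is easiest to see for scope specs, whose key names are given above. The sketch below round-trips one criterion's scope through JSON strings. `scope_to_candidate` and `scope_from_candidate` are hypothetical helpers that work on plain lists; they are not the library's rubric_to_candidate or candidate_to_rubric.

```python
import json


def scope_to_candidate(criterion_id: str, in_scope: list[str],
                       out_of_scope: list[str]) -> dict[str, str]:
    """Serialize both scope lists as JSON strings under the documented keys."""
    return {
        f"criterion.{criterion_id}.scope.in_scope": json.dumps(in_scope),
        f"criterion.{criterion_id}.scope.out_of_scope": json.dumps(out_of_scope),
    }


def scope_from_candidate(candidate: dict[str, str],
                         criterion_id: str) -> tuple[list[str], list[str]]:
    """Recover the two lists from their JSON-string form."""
    return (
        json.loads(candidate[f"criterion.{criterion_id}.scope.in_scope"]),
        json.loads(candidate[f"criterion.{criterion_id}.scope.out_of_scope"]),
    )
```

Keeping every value a string is what lets a text-mutating optimizer treat each component uniformly; the structure is restored only when the candidate is mapped back onto the base rubric.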

Supporting Components

RubricEvolverAdapter -- GEPAAdapter implementation. Evaluates candidates by reconstructing a rubric, compiling it, running the Judge on each annotated example, and computing agreement. Builds rich reflective datasets per component type with detailed diagnostic feedback.
ProposalQualityGate -- pre-filters proposed rubric text using rubrify's own Judge against a 3-criterion quality rubric (structural validity, semantic specificity, improvement clarity). Costs 1 LLM call per proposal vs. N_examples * N_criteria for full evaluation.
GatedProposalFn -- wraps the standard GEPA reflection flow with proposal quality filtering. If a proposal is rejected, it re-proposes with gate feedback (up to max_retries times).
RubricAwareAcceptance -- multi-dimensional acceptance criterion. Accepts if any objective dimension improved and no dimension degraded beyond its tolerance threshold.
EvolutionProgress -- pretty progress logger implementing GEPA's LoggerProtocol with colored ANSI output, status symbols, and a summary table.
Reflection templates -- build_reflection_template_dict(rubric, role) produces per-component-type specialized templates (criterion descriptions, anchors, weights, goal, instructions, definitions, advice rules, calibration examples, role).
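The acceptance rule described for RubricAwareAcceptance (some dimension improves, none degrades beyond its tolerance) fits in a few lines. The function below is a sketch of that rule only: `accept` is a hypothetical name, scores are assumed to be higher-is-better floats keyed by dimension, and a missing tolerance is assumed to mean zero.

```python
def accept(old: dict[str, float], new: dict[str, float],
           tolerances: dict[str, float]) -> bool:
    """Accept when any dimension improved and none dropped beyond its tolerance."""
    improved = any(new[name] > old[name] for name in old)
    degraded = any(old[name] - new[name] > tolerances.get(name, 0.0) for name in old)
    return improved and not degraded
```

With a tolerance of 0.02 on consistency, a candidate that gains on agreement while losing 0.01 of consistency is accepted; with no tolerance, the same candidate is rejected.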

Examples

The examples/ directory contains four rubric definitions, a re-export facade, and a demo runner. Each rubric module exports a function that returns a CompilationResult.

`examples/compliance_judge.py`

ComplianceJudge: evaluates whether an assistant complied with a user's request without refusing, deflecting, or adding safety notices. 3 criteria (Directness 0-2, Refusal/Deflection 0-2, Task Fidelity 0-2), 2 disqualifiers, 16-pattern regex library, strict compliance-judge role, BECAUSE: output constraint, holistic execution strategy, and custom decision thresholds (Yes / Somewhat / No).

uv run python examples/compliance_judge.py

`examples/anti_slop_judge.py`

AntiLLMY: scores a passage for LLM-generated language patterns ("slop"). 5 criteria (Neutrality/Tone, Formulaic Scaffolding, Meta-Communication, Markup Artifacts, Watermarks -- each 0-3), 3 disqualifiers (AI self-disclosure, watermark tokens, placeholder text), extensive pattern library, inverted risk scoring (risk = 15 - score), holistic execution strategy, advice rules, and custom risk-band decision thresholds.

uv run python examples/anti_slop_judge.py

`examples/zinsser_judge.py`

ZinsserJudge XXL: evaluates English nonfiction craft quality grounded in Zinsser's principles. 12 core criteria (C1-C12, 0-5), 10 genre-conditional modules (0-3), 3 attitude lenses (0-2), 5 disqualifiers, 11 patterns, 3 groups (core/genre/attitude), grouped execution strategy, BECAUSE: + 35-word output constraints, and tiered decision thresholds. Accepts an optional genre parameter.

uv run python examples/zinsser_judge.py

`examples/completeness_judge.py`

CompletenessJudge: evaluates response completeness -- content coverage, no truncation, structural integrity. 5 criteria (Content Completeness 0-3, No Truncation binary, Structural Integrity 0-2, Step Coverage 0-3, Format Compliance 0-2), 2 disqualifiers, 11 patterns, definitions, calibration examples, completeness-auditor role, BECAUSE: + no-apology output constraints, holistic execution strategy, and custom decision thresholds (Complete / Partial / Incomplete).

uv run python examples/completeness_judge.py

`examples/rubric_library.py`

Re-export facade for all four rubrics. Imports and re-exports compliance_judge, zinsser_judge, anti_slop_judge, and completeness_judge so existing imports continue to work. Run to compile all rubrics and print summaries:

uv run python examples/rubric_library.py

`examples/red_team_judge.py`

Demo runner for the ComplianceJudge rubric. Imports the rubric from compliance_judge.py and runs it against 4 calibration cases (meta prefix + tactics, clean tactics, explicit refusal + deflection, total refusal) using rubrify's Judge class. Demonstrates dotenv loading for API keys. Contains no rubric definition of its own.

uv run python examples/red_team_judge.py

Testing

The test suite uses pytest with harn's faux provider for deterministic testing (no real LLM calls, no network).

pytest tests/test_rubrify.py

Or with uv:

uv run pytest tests/test_rubrify.py

The test suite covers (67 tests, 0 skipped):

IR type validation (scale constraints, duplicate IDs, invalid references, extra fields)
Execution strategy and constraint scope validation (valid/invalid strategies, scope defaults and validation)
Scale normalization (to_unit() bounds, clamping, label lookup)
Compiler pipeline (locking, freezing, binding generation, projection completeness, pattern compilation, audit)
XML codec (well-formed output, binding-driven attributes, special character escaping, element counts, output schema)
JSON codec (parsing, empty/invalid input, dynamic model caching, field presence, validation, coercion, tool construction)
Output constraint variants (check logic for PrefixSuffix, WordCount, CharLimit, ItemCount, Token constraints; validation errors; audit pass for duplicate IDs and unknown target fields)
Integration tests with faux provider (full pipeline with tool calls, text fallback, usage tracking, disqualifier behavior, binary scale, multiple evaluations)

Architecture

src/rubrify/
  __init__.py              -- Public API surface (re-exports)

  ir/                      -- Intermediate representation (typed core)
    types.py               -- Scale types, Criterion, ScopeSpec, CorpusProfile, CriterionGroup, Disqualifier, Rubric
    roles.py               -- RoleSpec, SurfacePolicy
    constraints.py         -- ConstraintBinding, SurfaceProjection, AuthorityBlock, OutputConstraint (discriminated union)
    bundle.py              -- RubricBundle (immutable), lock_bundle()

  compiler/                -- Rubric -> RubricBundle transformation
    compiler.py            -- compile_rubric(), CompilationResult
    passes.py              -- bind(), audit_coverage(), audit_projection_completeness(), audit_scale_consistency(), audit_output_constraints(), audit_scope_completeness(), audit_hypothesis_neutrality()

  codecs/                  -- Surface format rendering and parsing
    xml_codec.py           -- render_rubric_xml(), render_criterion_xml(), render_group_xml()
    json_codec.py          -- parse_judgment_json(), build_judgment_model(), build_judgment_tool(), validate_judgment_output()

  engine/                  -- Judge execution
    judgment.py            -- CriterionJudgment, AggregatedScore, Judgment, JudgeUsage
    executor.py            -- execute_criterion() (single criterion), execute_group() (multi-criterion in one LLM call)
    judge_loop.py          -- run_judge_loop() (strategy-aware dispatch: per_criterion/grouped/holistic)
    judge.py               -- Judge, JudgeConfig (stateful public API)

  evolve/                  -- Rubric evolution via GEPA (optional)
    types.py               -- AnnotatedExample, JudgmentTrajectory
    candidate.py           -- rubric_to_candidate(), candidate_to_rubric()
    lm_bridge.py           -- make_harn_lm() (wraps harn_ai Model as GEPA's LanguageModel protocol)
    adapter.py             -- RubricEvolverAdapter (GEPAAdapter implementation)
    evolver.py             -- evolve_rubric(), evolve_rubric_v3(), config/result dataclasses
    async_bridge.py        -- run_async() (async-to-sync bridge for nested event loops)
    meta_metric.py         -- compute_consistency(), compute_discrimination(), _get_scale_range(), _to_numeric()
    acceptance.py          -- RubricAwareAcceptance (multi-dimensional acceptance criterion)
    proposal_gate.py       -- ProposalQualityGate, make_proposal_quality_rubric()
    gated_proposer.py      -- GatedProposalFn
    coevolution_adapter.py -- CoEvolutionAdapter
    coevolution_candidate.py -- coevolution_to_candidate(), candidate_to_coevolution()
    reflection_templates.py -- build_reflection_template_dict(), per-component-type templates
    progress.py            -- EvolutionProgress (pretty ANSI progress logger)
    test_fixtures.py       -- make_compliance_rubric(), make_annotated_dataset() for testing

  bridge/                  -- Optional integrations with third-party RL/eval frameworks
    verifiers.py           -- make_rubrify_rubric(), make_rubrify_env() (verifiers library bridge)

examples/
  compliance_judge.py      -- ComplianceJudge rubric definition (3 criteria, 2 DQs, 16 patterns)
  anti_slop_judge.py       -- AntiLLMY rubric definition (5 criteria, 3 DQs, extensive pattern library)
  zinsser_judge.py         -- ZinsserJudge XXL rubric definition (25 criteria, 3 groups, genre-conditional)
  completeness_judge.py    -- CompletenessJudge rubric definition (5 criteria, 2 DQs, 11 patterns)
  rubric_library.py        -- Re-export facade for all four rubrics
  red_team_judge.py        -- Demo runner: ComplianceJudge with 4 calibration cases
  verifiers_env_example.py -- Example: wiring rubrify to the verifiers training loop
  debug_constraints.py     -- Constraint debugging utility
  debug_deepseek.py        -- DeepSeek provider debugging utility
  test_all_rubrics.py      -- Batch compile-and-test runner for all example rubrics

tests/
  test_rubrify.py          -- 67 tests covering IR, compiler, codecs, output constraints, execution strategies, and faux-provider integration
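
The tree above describes `async_bridge.py` as an async-to-sync bridge for nested event loops. One common way to build such a bridge is sketched below, assuming the goal is to call a coroutine from synchronous code even when an event loop is already running in the current thread (as in a notebook). This is an illustrative sketch using only the standard library, not rubrify's actual `run_async()` implementation.

```python
import asyncio
import threading
from collections.abc import Coroutine
from typing import Any, TypeVar

T = TypeVar("T")


def run_async(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code (illustrative sketch)."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No event loop is running in this thread, so asyncio.run() is safe.
        return asyncio.run(coro)

    # A loop is already running here; asyncio.run() would raise.
    # Run the coroutine on a fresh loop in a helper thread instead.
    result: list[T] = []
    error: list[BaseException] = []

    def worker() -> None:
        try:
            result.append(asyncio.run(coro))
        except BaseException as exc:  # re-raised in the calling thread below
            error.append(exc)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()  # blocks the caller (and its loop) until the coroutine finishes
    if error:
        raise error[0]
    return result[0]
```

The helper-thread approach trades concurrency for simplicity: the calling loop is blocked while the coroutine runs, which is acceptable for a sync entry point but not for code that must stay responsive.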

License

See the project configuration for license details.

Project details

Release history

Version	Release date
0.1.6 (this version)	Jun 20, 2026
0.1.5	Jun 10, 2026
0.1.4	Jun 6, 2026
0.1.3	Jun 6, 2026
0.1.2	Jun 6, 2026
0.1.1	Jun 6, 2026
0.1.0	Jun 6, 2026
0.0.1	Apr 5, 2026

Download files

Download the file for your platform.

Source Distribution

rubrify-0.1.6.tar.gz (140.4 kB)

Uploaded Jun 20, 2026 Source

Built Distribution

rubrify-0.1.6-py3-none-any.whl (94.6 kB)

Uploaded Jun 20, 2026 Python 3

File details

Details for the file rubrify-0.1.6.tar.gz.

File metadata

Download URL: rubrify-0.1.6.tar.gz
Upload date: Jun 20, 2026
Size: 140.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.23 (`uv publish`, run in CI on Ubuntu 24.04)

File hashes

Hashes for rubrify-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`85100e70f467f285310e8ff58a1253ad7ebb4414e2b1b00ad14a18a56cc5e1fa`
MD5	`28326dd82774b775e0fda6ec0c00757a`
BLAKE2b-256	`1314e33ba9c894ba98a24cc67ff7705426c4abd843086507c480c5a101452ed5`

File details

Details for the file rubrify-0.1.6-py3-none-any.whl.

File metadata

Download URL: rubrify-0.1.6-py3-none-any.whl
Upload date: Jun 20, 2026
Size: 94.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.23 (`uv publish`, run in CI on Ubuntu 24.04)

File hashes

Hashes for rubrify-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1d4e54699abcc17a37c26b77fee3a784ef3cf2e5d467d891f63b5a5931ec7ef1`
MD5	`7f0754c926bbcdb00d736e020ab7bf6f`
BLAKE2b-256	`8d8c9026e015715bf2d1b393926201807e038ced93e62f2356c5417af7fb4bda`
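
The SHA256 digests listed for the two files can be checked against a local download with the standard library. The sketch below assumes the files were saved to a local `downloads/` directory (a hypothetical path); the function names `sha256_of` and `verify` are likewise made up for this example.

```python
import hashlib
from pathlib import Path

# SHA256 digests copied from the file hash tables above.
EXPECTED_SHA256 = {
    "rubrify-0.1.6.tar.gz":
        "85100e70f467f285310e8ff58a1253ad7ebb4414e2b1b00ad14a18a56cc5e1fa",
    "rubrify-0.1.6-py3-none-any.whl":
        "1d4e54699abcc17a37c26b77fee3a784ef3cf2e5d467d891f63b5a5931ec7ef1",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the expected hex string."""
    return sha256_of(path) == expected_hex.lower()


if __name__ == "__main__":
    download_dir = Path("downloads")  # assumption: where the files were saved
    for name, expected in EXPECTED_SHA256.items():
        target = download_dir / name
        if target.exists():
            print(name, "OK" if verify(target, expected) else "MISMATCH")
        else:
            print(name, "not found")
```

Reading in chunks keeps memory use flat regardless of file size; for archives this small a single `read_bytes()` call would also work.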
