rubrify
Rubric compiler and judge engine for LLM evaluation.
rubrify lets you define structured evaluation rubrics as typed Python objects, compile them into immutable bundles, and run criterion-by-criterion LLM-based judgments against text responses. It also supports evolving rubrics against human-annotated datasets using GEPA's reflective prompt optimization.
Built on harn_ai for multi-provider LLM access (OpenAI, Anthropic, DeepSeek, Google, local proxies) and harn_agent for agent primitives. API keys are auto-discovered from environment variables.
Installation
Requires Python >= 3.12.
pip install rubrify
Or with uv:
uv add rubrify
Core dependencies: harn-ai, harn-agent, pydantic>=2.10.
For the rubric evolution system (GEPA integration):
pip install rubrify[evolve]
Or with uv:
uv add rubrify[evolve]
This adds gepa>=0.1.0 as a dependency.
To upgrade to the latest version:
pip install --upgrade rubrify
Or with uv:
uv add rubrify --upgrade
API Keys
rubrify discovers API keys from environment variables via harn. Each provider has a standard env var:
| Provider | Environment Variable |
|---|---|
| DeepSeek | DEEPSEEK_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| Google | GEMINI_API_KEY |
| Groq | GROQ_API_KEY |
| xAI | XAI_API_KEY |
| Mistral | MISTRAL_API_KEY |
| OpenRouter | OPENROUTER_API_KEY |
| Together | TOGETHER_API_KEY |
| Fireworks | FIREWORKS_API_KEY |
| Cerebras | CEREBRAS_API_KEY |
| HuggingFace | HF_TOKEN |
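Before running a judgment, it can help to confirm which of these variables the current process can actually see. The following is a minimal standard-library sketch, not part of rubrify or harn: the `PROVIDER_ENV_VARS` mapping (a subset of the table) and the `visible_keys` helper are hypothetical names introduced for illustration.

```python
import os

# Subset of the table above. This mapping and the helper below are
# illustrative only; they are not a rubrify or harn API.
PROVIDER_ENV_VARS = {
    "DeepSeek": "DEEPSEEK_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Groq": "GROQ_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def visible_keys(env=None):
    """Return provider names whose variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


if __name__ == "__main__":
    print("Providers with a key set:", visible_keys() or "none")
```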
Option 1: .env file (recommended)
Copy the included .env.example to .env and fill in your keys:
DEEPSEEK_API_KEY=sk-your-key-here
Then at the top of your script:
from dotenv import load_dotenv
load_dotenv()
Install with pip install python-dotenv or pip install rubrify[dev].
Option 2: Shell export
export DEEPSEEK_API_KEY=sk-your-key-here
Option 3: Direct parameter
judge = Judge(JudgeConfig(model=model, api_key="sk-your-key-here"))
Quick Start
Define and compile a rubric
from rubrify import (
    Criterion, NumericScale, Rubric, RubricMeta, ScaleAnchor,
    compile_rubric,
)
rubric = Rubric(
    meta=RubricMeta(name="MyRubric", version="1.0"),
    goal="Evaluate response quality.",
    criteria=[
        Criterion(
            id="C1",
            title="Clarity",
            description="How clear is the response?",
            scale=NumericScale(
                minimum=0, maximum=5, step=1,
                anchors=[
                    ScaleAnchor(value=0, label="unclear", description="Meaning obscured."),
                    ScaleAnchor(value=5, label="crystal", description="Perfectly clear."),
                ],
            ),
            weight=1.0,
        ),
    ],
)
result = compile_rubric(rubric)
assert result.ok # True if no audit issues
bundle = result.bundle # Immutable RubricBundle, ready for judging
Run a judgment
import asyncio
from harn_ai.models import get_model
from rubrify import Judge, JudgeConfig
judge = Judge(JudgeConfig(model=get_model("openai", "gpt-4o")))
judgment = asyncio.run(judge.evaluate(bundle, "The response text to evaluate."))
print(judgment.aggregation.normalized_score) # 0-100
print(judgment.decision) # e.g. "Strong draft"
for cj in judgment.criterion_judgments:
    print(f" {cj.criterion_id}: {cj.value} (unit={cj.unit_score:.2f})")
Use a custom OpenAI-compatible proxy
from harn_ai.models import get_model
from rubrify import Judge, JudgeConfig
# Point the OpenAI-style model at a local proxy endpoint.
model = get_model("openai", "gpt-4o").model_copy(update={
    "baseUrl": "http://localhost:8000/v1",
    "api": "openai-completions",
})
judge = Judge(JudgeConfig(model=model, api_key="your-key"))
Core Concepts
IR Type System
All rubric structures are Pydantic models with extra="forbid" (no unexpected fields allowed).
Scale types are polymorphic, discriminated by the kind field. Each scale knows its own domain and implements to_unit(value) -> float to normalize raw scores into [0, 1]:
| Scale | kind | Domain | to_unit behavior |
|---|---|---|---|
| BinaryScale | "binary" | pass/fail (configurable labels and scores) | True -> true_score, False -> false_score |
| OrdinalScale | "ordinal" | Ordered levels with named anchors | Linearly maps anchor values to [0, 1] |
| NominalScale | "nominal" | Unordered categories with anchors | Maps category value to [0, 1] by range |
| NumericScale | "numeric" | Continuous range [minimum, maximum] with step | (value - min) / (max - min), clamped |
The union type Scale is: Annotated[BinaryScale | OrdinalScale | NominalScale | NumericScale, Field(discriminator="kind")].
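To make the normalization concrete, here is a small standalone sketch of the numeric and binary rules from the table. It does not use rubrify's classes; the function names and the default `true_score`/`false_score` values are assumptions chosen for illustration.

```python
def numeric_to_unit(value: float, minimum: float, maximum: float) -> float:
    """Map a raw score to [0, 1] via (value - min) / (max - min), clamped."""
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    unit = (value - minimum) / (maximum - minimum)
    return min(1.0, max(0.0, unit))


def binary_to_unit(value: bool, true_score: float = 1.0, false_score: float = 0.0) -> float:
    """True maps to true_score, False to false_score (defaults are assumed)."""
    return true_score if value else false_score


print(numeric_to_unit(4, 0, 5))  # 0.8
print(numeric_to_unit(7, 0, 5))  # 1.0 (clamped to the top of the range)
```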
Criterion is the atomic evaluation unit. Key fields:
- id -- unique identifier
- title, description -- human-readable
- scale -- one of the four scale types above
- weight -- contribution to the aggregate score (default 1.0)
- evidence -- EvidenceSpec controlling what evidence the judge must cite (note: required, exact_quote, min_items, and max_items on EvidenceSpec are prompt-only -- they are rendered into the XML surface so the LLM can see them, but are not enforced post-hoc by the engine)
- genre -- for genre-conditional activation
- mechanical_rules -- free-text rules rendered in XML
CriterionGroup provides hierarchical aggregation over criteria. Supported aggregation strategies: weighted_sum, weighted_mean, min, max, all, any.
Disqualifier defines an auto-fail condition. Can be pattern-based (regex scanned first against judge rationales, then against the response text) or criterion-linked (triggers when a specific criterion's unit score is 0).
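The two disqualifier kinds can be sketched with the standard library alone. This is an illustration of the described behavior, not rubrify's implementation: the function names, the return values, and the case-insensitive flag are assumptions.

```python
import re


def pattern_disqualifier_hit(pattern, rationales, response_text):
    """Scan judge rationales first, then the response text.

    Returns "rationale", "response", or None depending on where the
    pattern first matched.
    """
    regex = re.compile(pattern, re.IGNORECASE)  # flag choice is an assumption
    for text in rationales:
        if regex.search(text):
            return "rationale"
    if regex.search(response_text):
        return "response"
    return None


def criterion_disqualifier_hit(criterion_id, unit_scores):
    """Criterion-linked: triggers when the linked criterion's unit score is 0."""
    return unit_scores.get(criterion_id) == 0


print(pattern_disqualifier_hit(r"as an ai", ["Clear prose."], "As an AI, I cannot help."))
```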
Rubric is the mutable, pre-compilation object. It contains criteria, groups, disqualifiers, instructions, patterns (PatternEntry for regex matching), definitions (Definition), advice rules (AdviceRule), and calibration examples (CalibrationExample). Model validators enforce unique criterion IDs and valid group/disqualifier references at construction time.
RubricBundle is the immutable, locked, executable form produced by the compiler. It contains the frozen rubric, compiled regex patterns, constraint bindings, authority blocks, surface policy, and output constraints. The bundle is frozen via Pydantic's frozen=True config.
Roles and Constraints
RoleSpec defines the judge's persona, authority level (absolute, advisory, peer), domain, obligations (what the model MUST do), and constraints (what the model MUST NOT do). It is a structural component, not a cosmetic prompt prefix.
SurfacePolicy governs how rubrics are rendered. Fields include:
- input_codec -- currently only "xml"
- output_codec -- currently only "json"
- role -- optional RoleSpec
- enforce_key_order -- whether to enforce JSON key ordering
- criterion_focus -- "full" (send entire rubric per criterion) or "focused" (send only the relevant criterion)
- decision_thresholds -- list of (min_score, label) tuples for custom decision labels
- execution_strategy -- "per_criterion" (default), "grouped", or "holistic" (see Execution Strategies)
ConstraintBinding is the triple-layer alignment connecting a semantic criterion to its surface-layer projection (XML tag, JSON path) and output field. Each binding carries:
- criterion_id -- which criterion
- output_field -- JSON path where the judge writes its score (e.g. criterion_scores.C1)
- evidence_source -- where evidence should come from (default "response")
- projections -- list of SurfaceProjection objects (one per codec)
- authority -- instruction, data, or meta
AuthorityBlock marks a prompt section as instruction vs. data, enforcing instruction/data separation.
OutputConstraint is a discriminated union of typed constraint variants, each with a concrete check(value) -> str | None method for enforcement. The union is Annotated[PrefixSuffixConstraint | WordCountConstraint | CharLimitConstraint | ItemCountConstraint | TokenConstraint, Field(discriminator="kind")]. Each variant carries id, description, target_field, enforcement ("hard" or "soft"), and scope. Hard constraints trigger disqualification; soft constraints produce warnings. Pydantic discriminates on the kind field ("prefix_suffix", "word_count", "char_limit", "item_count", "token").
The scope field controls when a constraint is checked relative to execution strategy:
| Scope | Behavior | Default |
|---|---|---|
| "call" | Checked once per LLM call. For per_criterion, this is per criterion. For grouped/holistic, once per group/holistic call. Shared outputs (e.g. rationale) are deduplicated. | Yes |
| "criterion" | Checked per criterion individually, regardless of execution strategy. | No |
| "judgment" | Checked once on the final aggregated result (all criterion outputs concatenated). | No |
from rubrify.ir.constraints import WordCountConstraint
constraint = WordCountConstraint(
    id="rationale_length",
    description="Rationale must be at least 10 words",
    target_field="rationale",
    enforcement="soft",
    scope="criterion",  # check every criterion's rationale, even in grouped mode
    count=10,
    mode="min",
)
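The `check(value) -> str | None` contract and the hard/soft split can be illustrated without rubrify. The sketch below is a hypothetical stand-in, not the library's `WordCountConstraint`: the class name, the `"max"` mode, the message wording, and the `route` helper are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WordCountSketch:
    """Hypothetical stand-in for a word-count constraint."""

    count: int
    mode: str  # "min" appears in the example above; "max" is assumed
    enforcement: str = "soft"

    def check(self, value: str) -> str | None:
        """Return None when satisfied, otherwise a human-readable message."""
        words = len(value.split())
        if self.mode == "min" and words < self.count:
            return f"expected at least {self.count} words, got {words}"
        if self.mode == "max" and words > self.count:
            return f"expected at most {self.count} words, got {words}"
        return None


def route(constraint: WordCountSketch, value: str) -> tuple[str, str | None]:
    """Hard failures disqualify; soft failures become warnings."""
    message = constraint.check(value)
    if message is None:
        return ("ok", None)
    outcome = "disqualified" if constraint.enforcement == "hard" else "warning"
    return (outcome, message)


print(route(WordCountSketch(count=10, mode="min"), "too short"))
```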
Compiler Pipeline
compile_rubric(rubric, *, policy=None, output_constraints=None) -> CompilationResult
This is a synchronous, pure function (no LLM calls). It runs these passes:
- Bind -- generates a ConstraintBinding for each criterion, with XML and JSON SurfaceProjection objects. This is the triple-layer alignment: criterion ID maps to XML attributes and JSON output path.
- AuthorityBlocks -- creates standard authority blocks for rubric_spec, response_under_test, judge_instructions, and context_document.
- Lock -- produces an immutable RubricBundle via lock_bundle(). Compiles all PatternEntry and Disqualifier regex patterns (fails loudly on invalid regex).
- Audit -- audit passes check:
  - audit_coverage -- every criterion has a binding
  - audit_projection_completeness -- every binding has projections matching the policy's codecs
  - audit_scale_consistency -- ordinal scales have anchors, numeric scales have max > min
  - audit_output_constraints -- recognized fields, duplicate IDs, hard-enforcement safety
CompilationResult has a .ok property (True if no issues) and .issues list.
Judge Engine
Judge and JudgeConfig
from rubrify import Judge, JudgeConfig
from harn_ai.models import get_model
judge = Judge(JudgeConfig(
    model=get_model("deepseek", "deepseek-v4-flash"),
    api_key=None,      # auto-discovered from env
    temperature=0.0,
    max_tokens=2048,
    parallel=False,    # True to evaluate criteria concurrently
    use_tool=True,     # True for tool-based structured output
))
Judge is stateful: it tracks total_usage (token counts, API calls) and evaluation_count across all evaluations.
evaluate()
judgment = await judge.evaluate(
    bundle,
    response_text,
    context_text=None,        # optional reference context
    genre=None,               # optional genre for genre-conditional criteria
    on_criterion_start=None,  # callback(criterion_id)
    on_criterion_done=None,   # callback(criterion_id, CriterionJudgment)
)
The Judge Loop
run_judge_loop() is the core algorithm. It does not iterate on tool calls like an agent loop; it iterates over criteria (or groups of criteria, depending on the execution strategy). Steps:
- Verify bundle is locked.
- Resolve active criteria (genre filtering: criteria with genre=None are always active; others activate only when active_genre matches).
- Partition active criteria into call units based on execution_strategy from the bundle's SurfacePolicy:
  - "per_criterion" (default): one call unit per criterion.
  - "grouped": one call unit per CriterionGroup; ungrouped criteria individually.
  - "holistic": one call unit containing all active criteria.
- Execute each call unit: single-criterion call units use execute_criterion(), multi-criterion call units use execute_group().
- Check disqualifiers (pattern-based and criterion-linked).
- Run mechanical pattern checks (PatternEntry patterns against the response).
- Verify evidence quotes exist in the response text (exact containment, then normalized containment).
- Verify output constraints against LLM output (respecting scope: call, criterion, or judgment).
- Aggregate scores (weighted mean, or grouped aggregation if groups exist).
- Compute decision label from thresholds (defaults: >=90 "Publish-ready", >=75 "Strong draft", >=60 "Workable draft", >=40 "Needs major revision", <40 "Fundamentally unclear"). Disqualifier violations produce "Rejected".
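The final decision step can be sketched as a threshold lookup. The labels and cut-offs below are the defaults listed in the last step; the function name `decide` and its signature are assumptions, not rubrify's API.

```python
# Default thresholds as (min_score, label) pairs, from the step above.
DEFAULT_THRESHOLDS = [
    (90, "Publish-ready"),
    (75, "Strong draft"),
    (60, "Workable draft"),
    (40, "Needs major revision"),
]
FALLBACK_LABEL = "Fundamentally unclear"  # applies below the lowest threshold


def decide(normalized_score, violations=(), thresholds=DEFAULT_THRESHOLDS):
    """Return the decision label for a 0-100 score.

    Any disqualifier violation short-circuits to "Rejected".
    """
    if violations:
        return "Rejected"
    # Walk thresholds from highest to lowest and take the first one met.
    for min_score, label in sorted(thresholds, reverse=True):
        if normalized_score >= min_score:
            return label
    return FALLBACK_LABEL


print(decide(80))           # Strong draft
print(decide(95, ["dq1"]))  # Rejected
```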
Execution supports parallel=True for concurrent call-unit evaluation via asyncio.gather.
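As a rough illustration of the difference between sequential and concurrent dispatch, the sketch below runs placeholder call units with and without `asyncio.gather`. `evaluate_unit` is a hypothetical stand-in for an LLM call and always returns a fixed score; nothing here is rubrify's executor.

```python
import asyncio


async def evaluate_unit(unit):
    """Stand-in for one LLM call covering the criteria in `unit`."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return {criterion_id: 1.0 for criterion_id in unit}


async def run_units(call_units, parallel):
    """Evaluate call units concurrently or one at a time, then merge results."""
    if parallel:
        # gather preserves the order of its arguments in the result list.
        results = await asyncio.gather(*(evaluate_unit(u) for u in call_units))
    else:
        results = [await evaluate_unit(u) for u in call_units]
    merged = {}
    for result in results:
        merged.update(result)
    return merged


scores = asyncio.run(run_units([["C1"], ["C2", "C3"]], parallel=True))
print(scores)
```

With real network calls, the concurrent path takes roughly as long as the slowest call unit rather than the sum of all of them.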
Execution Strategies
The execution_strategy field on SurfacePolicy controls how criteria are dispatched to LLM calls:
| Strategy | Call granularity | Use when |
|---|---|---|
| "per_criterion" | One LLM call per criterion (default) | Need maximum isolation, deep per-criterion analysis |
| "grouped" | One LLM call per CriterionGroup | Rubric has logical groups, want intra-group coherence with composability |
| "holistic" | One LLM call for ALL active criteria | Few criteria, need holistic coherence, cost-sensitive |
Set via SurfacePolicy:
from rubrify.ir.roles import SurfacePolicy
policy = SurfacePolicy(execution_strategy="grouped")
bundle = compile_rubric(rubric, policy=policy).bundle
Implementation details:
- The judge loop partitions active criteria into "call units" based on strategy. Each call unit is one LLM invocation.
- "grouped" uses CriterionGroup.children to determine call boundaries. Ungrouped criteria fall back to individual calls.
- "holistic" places all active criteria into a single call unit.
- Multi-criterion call units use execute_group(), which renders a group-specific XML prompt via render_group_xml() and extracts per-criterion scores from a single criterion_scores response dict.
- Single-criterion call units use the original execute_criterion() path.
- parallel=True parallelizes across call units (not within them).
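The partitioning rule can be sketched as a pure function over criterion IDs. This is an illustration under stated assumptions, not the code in judge_loop.py: the `partition` name and the `groups` shape (group ID mapped to a list of child criterion IDs) are hypothetical.

```python
def partition(active_ids, groups, strategy):
    """Split active criterion IDs into call units (one LLM call each).

    groups: mapping of group id -> list of child criterion ids (assumed shape).
    """
    if strategy == "per_criterion":
        return [[cid] for cid in active_ids]
    if strategy == "holistic":
        return [list(active_ids)] if active_ids else []
    if strategy == "grouped":
        active = set(active_ids)
        units, seen = [], set()
        for children in groups.values():
            # Keep only active children that no earlier group already claimed.
            unit = [cid for cid in children if cid in active and cid not in seen]
            if unit:
                units.append(unit)
                seen.update(unit)
        # Ungrouped criteria fall back to individual calls.
        units.extend([cid] for cid in active_ids if cid not in seen)
        return units
    raise ValueError(f"unknown execution strategy: {strategy!r}")


print(partition(["C1", "C2", "C3"], {"core": ["C1", "C2"]}, "grouped"))
# [['C1', 'C2'], ['C3']]
```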
CriterionExecutor
execute_criterion() has two strategies:
- Tool-based (default, use_tool=True): Builds a harn_ai.types.Tool named submit_judgment with a dynamic Pydantic model as the parameter type. The provider forces structured JSON output via native tool-calling. The response is pre-parsed.
- Text-based (use_tool=False): Sends a text prompt, then parses JSON from the response text using harn_ai's repair-capable JSON parser.
Both strategies extract criterion scores via typed Pydantic model attribute access, not dict navigation with string splitting.
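To show why the text-based path needs a tolerant parser, here is a deliberately simpler technique than harn_ai's repair-capable parser: it only cuts out the outermost `{...}` span from the model's reply and hands it to `json.loads`, with no repair of malformed JSON. The function name is hypothetical.

```python
import json


def extract_json_object(text: str) -> dict:
    """Pull the outermost {...} span out of free-form model output.

    Raises ValueError if no object is found or the span is not valid JSON
    (json.JSONDecodeError is a ValueError subclass).
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])


raw = 'Here is my judgment: {"criterion_scores": {"C1": 4}, "rationale": "Clear."} Done.'
parsed = extract_json_object(raw)
print(parsed["criterion_scores"]["C1"])  # 4
```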
Judgment Output Types
- CriterionJudgment -- per-criterion result: criterion_id, value (raw score), unit_score (normalized 0-1), evidence (list of EvidenceQuote), rationale, confidence, warnings.
- AggregatedScore -- raw_score, normalized_score (0-100), method, group_scores.
- Judgment -- the complete output: criterion_judgments, aggregation, decision, violations, constraint_warnings, pattern_hits, usage (JudgeUsage), timestamp.
- JudgeUsage -- tracks input_tokens, output_tokens, total_tokens, api_calls.
Codecs
XML Codec
render_rubric_xml(bundle) -> str renders a locked RubricBundle as an <LLM_JUDGE_SPEC> XML document. Uses xml.etree.ElementTree for proper DOM construction and escaping (no string concatenation). No XML parsing of untrusted input occurs in this codec (it only constructs and serializes).
Key design: bindings drive the criterion rendering. Each criterion's XML attributes come from its binding's SurfaceProjection(codec="xml"), not from raw criterion fields. This closes the triple-layer alignment loop.
The XML document includes: mission, role, rubric (criteria with anchors and evidence specs), disqualifiers, definitions, calibration examples, advice rules, output schema (with JSON template derived from the dynamic Pydantic model), scoring formula, pattern library, validation (output constraints), and instructions.
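The benefit of building the document through `xml.etree.ElementTree` rather than string concatenation is that escaping happens during serialization. A minimal sketch follows; only the `LLM_JUDGE_SPEC` root name comes from the description above, while the child element names, attribute names, and the `render_spec_sketch` helper are assumptions.

```python
import xml.etree.ElementTree as ET


def render_spec_sketch(criteria):
    """Build a tiny <LLM_JUDGE_SPEC> document from (id, title, description) tuples."""
    root = ET.Element("LLM_JUDGE_SPEC")
    rubric = ET.SubElement(root, "rubric")  # child element names are assumptions
    for criterion_id, title, description in criteria:
        node = ET.SubElement(rubric, "criterion", {"id": criterion_id, "title": title})
        node.text = description
    # Special characters in text and attributes are escaped on serialization.
    return ET.tostring(root, encoding="unicode")


xml_text = render_spec_sketch([("C1", 'Clarity & "tone"', "Is x < y explained?")])
print(xml_text)
```

Parsing the output back with `ET.fromstring` recovers the original title and description unchanged, which is the property that hand-built strings tend to break.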
render_criterion_xml(criterion, bundle) -> str renders a focused document for a single criterion, used when criterion_focus == "focused".
render_group_xml(criteria, bundle) -> str renders a subset document for a group of criteria, used by the "grouped" and "holistic" execution strategies. Includes only the specified criteria, relevant disqualifiers, and a subset-specific output schema.
JSON Codec
parse_judgment_json(raw) -> dict parses LLM output using harn_ai's parse_json_with_repair. Raises ParseError on failure.
build_judgment_model(bundle, criteria=None) -> type constructs a dynamic Pydantic model for the rubric's expected output structure. Cached per criterion specs (LRU cache, max 32 entries). The model has fields: score, rationale, evidence, violations, criterion_scores (a nested model with one field per criterion, typed by scale kind), confidence. If criteria is provided, the model is built for only that subset (used by grouped/holistic execution strategies).
build_judgment_tool(bundle, criteria=None) -> Tool wraps the dynamic model as a harn_ai Tool named submit_judgment for structured output via provider tool-calling. If criteria is provided, the tool schema covers only that subset.
validate_judgment_output(parsed, bundle) -> (model_instance | None, warnings) validates parsed JSON against the dynamic model.
generate_judgment_schema(bundle) and generate_judgment_template(bundle, criteria=None) produce the JSON Schema and a zero-valued JSON template respectively.
Evolution System
The rubrify.evolve module requires the optional gepa dependency (pip install rubrify[evolve]).
It evolves rubric text components against human-annotated datasets to maximize agreement between automated LLM-judge evaluations and human expert annotations. Structural invariants (criterion IDs, scale types, ranges, groups, disqualifiers, patterns) are never changed. Only text components and weights are evolvable: goal, criterion descriptions, anchor descriptions, weights, role persona/obligations/constraints, instructions, definitions, advice rules, and calibration examples.
AnnotatedExample
from rubrify.evolve import AnnotatedExample
example = AnnotatedExample(
    id="ex_001",
    response_text="The response to evaluate...",
    context_text="Optional reference context",
    human_scores={"C1": 4, "C2": 2},  # criterion_id -> human-assigned score
    human_label="good",               # optional overall label
    genre="travel",                   # optional genre tag
)
evolve_rubric (Mode 1: Granular)
from harn_ai.models import get_model
from rubrify.evolve import evolve_rubric, RubricEvolutionConfig
result = evolve_rubric(
    seed_rubric=my_rubric,
    annotated_dataset=my_examples,  # list[AnnotatedExample]
    judge_model=get_model("deepseek", "deepseek-v4-flash"),
    reflection_model=get_model("openai", "gpt-4o"),
    role=my_role,                   # optional RoleSpec
    config=RubricEvolutionConfig(
        train_split=0.7,
        max_metric_calls=300,
        reflection_minibatch_size=5,
        agreement_weight=0.6,
        consistency_weight=0.2,
        discrimination_weight=0.2,
    ),
)
evolved_rubric = result.best_rubric
evolved_role = result.best_role
print(result.best_score, result.total_iterations)
GEPA iteratively mutates the rubric's text components using a reflection LM, guided by structured feedback from rubrify's judge comparing against human annotations. Each mutation is evaluated on a training minibatch, accepted if improved, then tracked on a validation set with Pareto-based candidate selection across three objectives:
- Agreement -- normalized absolute error vs. human annotations (1 - mean error, scaled 0-1).
- Consistency -- 1 - coefficient of variation across repeated runs (optional, via consistency_runs > 1).
- Discrimination -- normalized entropy of the score distribution (0 = all same score, 1 = uniform spread).
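The three objectives can be sketched with the standard library. These follow the one-line definitions above but are not the code in meta_metric.py: the function signatures, the use of population standard deviation, the handling of a zero mean, and normalizing entropy by `log(num_levels)` are all assumptions.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def agreement(judge_scores, human_scores, scale_range):
    """1 - mean absolute error, with each error scaled by the scale's range."""
    errors = [abs(j - h) / scale_range for j, h in zip(judge_scores, human_scores)]
    return 1.0 - mean(errors)


def consistency(repeated_scores):
    """1 - coefficient of variation across repeated runs, floored at 0."""
    mu = mean(repeated_scores)
    if mu == 0:
        # CV is undefined at zero mean; treat identical runs as fully consistent.
        return 1.0 if len(set(repeated_scores)) == 1 else 0.0
    return max(0.0, 1.0 - pstdev(repeated_scores) / abs(mu))


def discrimination(scores, num_levels):
    """Shannon entropy of observed scores divided by log(num_levels)."""
    if num_levels < 2 or not scores:
        return 0.0
    total = len(scores)
    entropy = -sum((c / total) * math.log(c / total) for c in Counter(scores).values())
    return entropy / math.log(num_levels)


print(agreement([4, 2], [5, 2], scale_range=5))  # 0.9
print(discrimination([1, 1, 1], num_levels=4))   # zero: every score is identical
```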
Practical guidance from the source:
- 30-50 annotated examples minimum. Fewer than ~15 training examples leads to overfitting.
- Set discrimination_weight=0.0 with fewer than ~10 examples.
- Set reflection_minibatch_size equal to training set size for tiny datasets.
- Budget 100-500 metric calls for real improvement.
evolve_rubric_v3 (Mode 3: Co-evolution)
Co-evolves the target rubric together with meta-components in a single GEPA loop:
- Proposal quality gate rubric -- a lightweight rubric that pre-filters proposed mutations before expensive evaluation
- Reflection prompt templates -- per-component-type specialized templates that guide the reflection LM
- Acceptance parameters -- tolerance thresholds for multi-dimensional acceptance decisions
from rubrify.evolve import evolve_rubric_v3
from rubrify.evolve.evolver import CoEvolutionConfig
result = evolve_rubric_v3(
    seed_rubric=my_rubric,
    annotated_dataset=my_examples,
    judge_model=judge_model,
    reflection_model=reflection_model,
    config=CoEvolutionConfig(
        max_metric_calls=500,
        evolve_gate=True,
        evolve_reflection_templates=True,
        evolve_acceptance_params=True,
    ),
)
# Result includes evolved meta-components
result.evolved_gate_rubric
result.evolved_reflection_templates
result.evolved_acceptance_params
All four artifact types are packed into a single GEPA candidate dict with namespace prefixes (target., gate., reflection.template., acceptance.) and optimized via round-robin component selection. Meta-component mutations are accepted on non-degradation (lenient threshold) since they affect the search process, not the evaluation score directly.
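The namespacing idea can be sketched as a pack/unpack pair over flat string dicts, with a round-robin walk over the packed keys. The four prefixes come from the paragraph above; the component names (`goal`, `criteria_text`, and so on), the function names, and the use of `itertools.cycle` for round-robin selection are assumptions for illustration.

```python
from itertools import cycle, islice

# Namespace prefixes as listed above.
PREFIXES = {
    "target": "target.",
    "gate": "gate.",
    "reflection": "reflection.template.",
    "acceptance": "acceptance.",
}


def pack(parts):
    """Merge per-artifact dicts into one flat candidate with prefixed keys."""
    candidate = {}
    for kind, components in parts.items():
        for name, text in components.items():
            candidate[PREFIXES[kind] + name] = text
    return candidate


def unpack(candidate):
    """Split a flat candidate back into per-artifact dicts."""
    parts = {kind: {} for kind in PREFIXES}
    for key, text in candidate.items():
        for kind, prefix in PREFIXES.items():
            if key.startswith(prefix):
                parts[kind][key[len(prefix):]] = text
                break
        else:
            raise KeyError(f"unrecognized namespace in key: {key!r}")
    return parts


parts = {
    "target": {"goal": "Evaluate response quality."},  # hypothetical component names
    "gate": {"criteria_text": "Is the proposal specific?"},
    "reflection": {"weights": "Suggest new weights."},
    "acceptance": {"tolerance": "0.05"},
}
candidate = pack(parts)
# Round-robin: visit one component per iteration, wrapping around.
schedule = list(islice(cycle(candidate), 6))
print(schedule)
```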
Candidate Mapping
rubric_to_candidate(rubric, role) -> dict[str, str] decomposes a Rubric into GEPA's flat dict[str, str] format. Each value is a string; structured sub-components (anchor lists, instructions) are serialized as JSON strings.
candidate_to_rubric(candidate, base_rubric, base_role) -> (Rubric, RoleSpec | None) reconstructs from the flat format, using the base rubric as a structural template.
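The round trip can be illustrated with plain dicts standing in for the Rubric model. This is a sketch of the idea only: the key naming scheme (`criterion.C1.description`), the dict-based rubric, and the function names are hypothetical and do not reflect the actual keys produced by rubric_to_candidate().

```python
import copy
import json


def to_candidate(rubric):
    """Flatten evolvable text into dict[str, str]; lists become JSON strings."""
    candidate = {"goal": rubric["goal"]}
    for crit in rubric["criteria"]:
        prefix = f"criterion.{crit['id']}"
        candidate[f"{prefix}.description"] = crit["description"]
        candidate[f"{prefix}.anchors"] = json.dumps(crit["anchors"])
    return candidate


def from_candidate(candidate, base_rubric):
    """Rebuild from the flat form, using base_rubric as the structural template."""
    rebuilt = copy.deepcopy(base_rubric)  # ids and scale types stay untouched
    rebuilt["goal"] = candidate.get("goal", rebuilt["goal"])
    for crit in rebuilt["criteria"]:
        prefix = f"criterion.{crit['id']}"
        crit["description"] = candidate.get(f"{prefix}.description", crit["description"])
        if f"{prefix}.anchors" in candidate:
            crit["anchors"] = json.loads(candidate[f"{prefix}.anchors"])
    return rebuilt


base = {
    "goal": "Evaluate response quality.",
    "criteria": [
        {"id": "C1", "scale": "numeric", "description": "How clear is the response?",
         "anchors": ["Meaning obscured.", "Perfectly clear."]},
    ],
}
candidate = to_candidate(base)
candidate["criterion.C1.description"] = "Is the main point stated plainly?"
evolved = from_candidate(candidate, base)
print(evolved["criteria"][0]["description"])
```

Because reconstruction starts from a copy of the base, a mutated candidate can change wording but cannot add, remove, or retype criteria.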
Supporting Components
- RubricEvolverAdapter -- GEPAAdapter implementation. Evaluates candidates by reconstructing a rubric, compiling it, running the Judge on each annotated example, and computing agreement. Builds rich reflective datasets per component type with detailed diagnostic feedback.
- ProposalQualityGate -- pre-filters proposed rubric text using rubrify's own Judge against a 3-criterion quality rubric (structural validity, semantic specificity, improvement clarity). Costs 1 LLM call per proposal vs. N_examples * N_criteria for full evaluation.
- GatedProposalFn -- wraps the standard GEPA reflection flow with proposal quality filtering. If a proposal is rejected, it re-proposes with gate feedback (up to max_retries times).
- RubricAwareAcceptance -- multi-dimensional acceptance criterion. Accepts if any objective dimension improved and no dimension degraded beyond its tolerance threshold.
- EvolutionProgress -- pretty progress logger implementing GEPA's LoggerProtocol with colored ANSI output, status symbols, and a summary table.
- Reflection templates -- build_reflection_template_dict(rubric, role) produces per-component-type specialized templates (criterion descriptions, anchors, weights, goal, instructions, definitions, advice rules, calibration examples, role).
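The acceptance rule described for RubricAwareAcceptance can be written as a small predicate. This is a sketch of the stated rule, not the class itself: the `accept` function, the dict-of-floats inputs, and a default tolerance of 0 for unlisted dimensions are assumptions.

```python
def accept(old, new, tolerance):
    """Accept when some dimension improved and none degraded past its tolerance.

    old, new: objective name -> score. tolerance: objective name -> allowed drop
    (dimensions missing from `tolerance` are assumed to allow no drop).
    """
    improved = any(new[name] > old[name] for name in old)
    degraded = any(old[name] - new[name] > tolerance.get(name, 0.0) for name in old)
    return improved and not degraded


old = {"agreement": 0.70, "consistency": 0.90}
tolerance = {"consistency": 0.05}

# Agreement rises, consistency dips within tolerance.
print(accept(old, {"agreement": 0.75, "consistency": 0.88}, tolerance))  # True
# Agreement rises, but consistency drops too far.
print(accept(old, {"agreement": 0.75, "consistency": 0.80}, tolerance))  # False
```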
Examples
The examples/ directory contains four rubric definitions, a re-export facade, and a demo runner. Each rubric module exports a function that returns a CompilationResult.
examples/compliance_judge.py
ComplianceJudge: evaluates whether an assistant complied with a user's request without refusing, deflecting, or adding safety notices. 3 criteria (Directness 0-2, Refusal/Deflection 0-2, Task Fidelity 0-2), 2 disqualifiers, 16-pattern regex library, strict compliance-judge role, BECAUSE: output constraint, holistic execution strategy, and custom decision thresholds (Yes / Somewhat / No).
uv run python examples/compliance_judge.py
examples/anti_slop_judge.py
AntiLLMY: scores a passage for LLM-generated language patterns ("slop"). 5 criteria (Neutrality/Tone, Formulaic Scaffolding, Meta-Communication, Markup Artifacts, Watermarks -- each 0-3), 3 disqualifiers (AI self-disclosure, watermark tokens, placeholder text), extensive pattern library, inverted risk scoring (risk = 15 - score), holistic execution strategy, advice rules, and custom risk-band decision thresholds.
uv run python examples/anti_slop_judge.py
examples/zinsser_judge.py
ZinsserJudge XXL: evaluates English nonfiction craft quality grounded in Zinsser's principles. 12 core criteria (C1-C12, 0-5), 10 genre-conditional modules (0-3), 3 attitude lenses (0-2), 5 disqualifiers, 11 patterns, 3 groups (core/genre/attitude), grouped execution strategy, BECAUSE: + 35-word output constraints, and tiered decision thresholds. Accepts an optional genre parameter.
uv run python examples/zinsser_judge.py
examples/completeness_judge.py
CompletenessJudge: evaluates response completeness -- content coverage, no truncation, structural integrity. 5 criteria (Content Completeness 0-3, No Truncation binary, Structural Integrity 0-2, Step Coverage 0-3, Format Compliance 0-2), 2 disqualifiers, 11 patterns, definitions, calibration examples, completeness-auditor role, BECAUSE: + no-apology output constraints, holistic execution strategy, and custom decision thresholds (Complete / Partial / Incomplete).
uv run python examples/completeness_judge.py
examples/rubric_library.py
Re-export facade for all four rubrics. Imports and re-exports compliance_judge, zinsser_judge, anti_slop_judge, and completeness_judge so existing imports continue to work. Run to compile all rubrics and print summaries:
uv run python examples/rubric_library.py
examples/red_team_judge.py
Demo runner for the ComplianceJudge rubric. Imports the rubric from compliance_judge.py and runs it against 4 calibration cases (meta prefix + tactics, clean tactics, explicit refusal + deflection, total refusal) using rubrify's Judge class. Demonstrates dotenv loading for API keys. Contains no rubric definition of its own.
uv run python examples/red_team_judge.py
Testing
The test suite uses pytest with harn's faux provider for deterministic testing (no real LLM calls, no network).
pytest tests/test_rubrify.py
Or with uv:
uv run pytest tests/test_rubrify.py
The test suite covers (67 tests, 0 skipped):
- IR type validation (scale constraints, duplicate IDs, invalid references, extra fields)
- Execution strategy and constraint scope validation (valid/invalid strategies, scope defaults and validation)
- Scale normalization (to_unit() bounds, clamping, label lookup)
- Compiler pipeline (locking, freezing, binding generation, projection completeness, pattern compilation, audit)
- XML codec (well-formed output, binding-driven attributes, special character escaping, element counts, output schema)
- JSON codec (parsing, empty/invalid input, dynamic model caching, field presence, validation, coercion, tool construction)
- Output constraint variants (check logic for PrefixSuffix, WordCount, CharLimit, ItemCount, Token constraints; validation errors; audit pass for duplicate IDs and unknown target fields)
- Integration tests with faux provider (full pipeline with tool calls, text fallback, usage tracking, disqualifier behavior, binary scale, multiple evaluations)
Architecture
src/rubrify/
  __init__.py -- Public API surface (re-exports)
  ir/ -- Intermediate representation (typed core)
    types.py -- Scale types, Criterion, CriterionGroup, Disqualifier, Rubric
    roles.py -- RoleSpec, SurfacePolicy
    constraints.py -- ConstraintBinding, SurfaceProjection, AuthorityBlock, OutputConstraint (discriminated union)
    bundle.py -- RubricBundle (immutable), lock_bundle()
  compiler/ -- Rubric -> RubricBundle transformation
    compiler.py -- compile_rubric(), CompilationResult
    passes.py -- bind(), audit_coverage(), audit_projection_completeness(), audit_scale_consistency(), audit_output_constraints()
  codecs/ -- Surface format rendering and parsing
    xml_codec.py -- render_rubric_xml(), render_criterion_xml(), render_group_xml()
    json_codec.py -- parse_judgment_json(), build_judgment_model(), build_judgment_tool(), validate_judgment_output()
  engine/ -- Judge execution
    judgment.py -- CriterionJudgment, AggregatedScore, Judgment, JudgeUsage
    executor.py -- execute_criterion() (single criterion), execute_group() (multi-criterion in one LLM call)
    judge_loop.py -- run_judge_loop() (strategy-aware dispatch: per_criterion/grouped/holistic)
    judge.py -- Judge, JudgeConfig (stateful public API)
  evolve/ -- Rubric evolution via GEPA (optional)
    types.py -- AnnotatedExample, JudgmentTrajectory
    candidate.py -- rubric_to_candidate(), candidate_to_rubric()
    lm_bridge.py -- make_harn_lm() (wraps harn_ai Model as GEPA's LanguageModel protocol)
    adapter.py -- RubricEvolverAdapter (GEPAAdapter implementation)
    evolver.py -- evolve_rubric(), evolve_rubric_v3(), config/result dataclasses
    async_bridge.py -- run_async() (async-to-sync bridge for nested event loops)
    meta_metric.py -- compute_consistency(), compute_discrimination(), _get_scale_range(), _to_numeric()
    acceptance.py -- RubricAwareAcceptance (multi-dimensional acceptance criterion)
    proposal_gate.py -- ProposalQualityGate, make_proposal_quality_rubric()
    gated_proposer.py -- GatedProposalFn
    coevolution_adapter.py -- CoEvolutionAdapter
    coevolution_candidate.py -- coevolution_to_candidate(), candidate_to_coevolution()
    reflection_templates.py -- build_reflection_template_dict(), per-component-type templates
    progress.py -- EvolutionProgress (pretty ANSI progress logger)
    test_fixtures.py -- make_compliance_rubric(), make_annotated_dataset() for testing
  bridge/ -- Optional integrations with third-party RL/eval frameworks
    verifiers.py -- make_rubrify_rubric(), make_rubrify_env() (verifiers library bridge)
examples/
  compliance_judge.py -- ComplianceJudge rubric definition (3 criteria, 2 DQs, 16 patterns)
  anti_slop_judge.py -- AntiLLMY rubric definition (5 criteria, 3 DQs, extensive pattern library)
  zinsser_judge.py -- ZinsserJudge XXL rubric definition (25 criteria, 3 groups, genre-conditional)
  completeness_judge.py -- CompletenessJudge rubric definition (5 criteria, 2 DQs, 11 patterns)
  rubric_library.py -- Re-export facade for all four rubrics
  red_team_judge.py -- Demo runner: ComplianceJudge with 4 calibration cases
  verifiers_env_example.py -- Example: wiring rubrify to the verifiers training loop
  debug_constraints.py -- Constraint debugging utility
  debug_deepseek.py -- DeepSeek provider debugging utility
  test_all_rubrics.py -- Batch compile-and-test runner for all example rubrics
tests/
  test_rubrify.py -- 67 tests covering IR, compiler, codecs, output constraints, execution strategies, and faux-provider integration
License
See the project configuration for license details.