
Rubricon

Specification-first generation for LLMs. Cross the rubricon — produce evaluation criteria before generation and use them as conditioning targets with failure-weighted reattention.


pip install rubricon

from rubricon import RubriconPipeline, RubriconConfig

cfg = RubriconConfig.from_dict({
    "generator": {"model": "gpt-4o"},                    # generation model
    "evaluator": {"model": "claude-sonnet-4-20250514"},  # cross-model judge
    "criteria": {"n": 5},                                # criteria per query
    "convergence": {"threshold": 0.6},                   # passing threshold (tau)
    "iteration": {"max_iterations": 3},                  # refinement-loop cap
})
result = RubriconPipeline(cfg).run("Explain quantum entanglement to a high school student.")
print(result.response, result.rubric_adherence_score, result.all_pass)

Highly configurable by design

| Layer | Plugin protocol | Built-ins | Override via |
|---|---|---|---|
| Backends | LLMBackend | litellm, mock | backend_registry.register("vllm") |
| Evaluators | Evaluator | llm_judge, regex, function, ensemble | per-criterion mix in config |
| Reattention | ReattentionStrategy | focal (FWRL), uniform, softmax | reattention.strategy |
| Convergence | ConvergencePolicy | all_pass, mean_threshold, no_improvement, composite | convergence.policy |
| Templates | Jinja2 | bundled defaults | templates.* paths in config |
| Budget | hard limits | tokens, cost, wall, iters | budget: {...} |
| Callbacks | PipelineCallback | console, jsonl | callbacks: [{type: ...}] |
| Retry | RetryConfig | constant/linear/exponential | retry: {...} |

Every knob lives on a single Pydantic RubriconConfig. Layered loading: defaults < YAML < env (RUBRICON_*) < kwargs.
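As a sketch of the plugin surface, registering a custom backend might look like the following. Only the hook name backend_registry.register("vllm") comes from the table above; the import path, the protocol's method name, and the registration call shape are assumptions to verify against the package source.

# Hypothetical sketch of a custom backend plugin. The import path, the
# method name on the LLMBackend protocol, and the registration call shape
# are assumptions; only backend_registry.register("vllm") appears in the docs.
from rubricon import backend_registry  # import path assumed

class VLLMBackend:
    """Stand-in for a class satisfying the LLMBackend protocol."""

    def complete(self, prompt: str, **kwargs) -> str:
        # A real backend would call a vLLM server here; a canned reply
        # keeps the sketch self-contained.
        return "stub response"

backend_registry.register("vllm")(VLLMBackend)  # exact call shape assumed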

rubricon run "Explain entropy" --config configs/prod.yaml
rubricon eval --prompts mtbench.json --config configs/prod.yaml --output results.jsonl
rubricon plugins
rubricon config show --config configs/prod.yaml

The original efa research package (paper code, baselines, ablations) ships in the same wheel for reproducibility — from efa import EFAPipeline keeps working.


Evaluation-First Attention (EFA) — research

What if LLMs knew what "good" looks like before they started writing?

EFA inverts the generate-then-evaluate paradigm by producing evaluation criteria before generation and using them as structured conditioning targets with failure-proportional reweighting — like TDD for text generation.

Author: Karthick Raja M | Affiliation: Independent Researcher, Chennai, India | Date: March 2026

Paper (PDF) | LaTeX Source


News

  • [2026-03-30] Full MT-Bench results: 960 runs (12 methods × 80 prompts) with cross-model evaluation (MiniMax-M2.5 gen + Qwen-3.5-9B eval). EFA achieves 96.2% APR, +25pp over single-pass. Paper updated with results, figures, honest FWRL analysis.
  • [2026-03-29] Best-of-N baseline completed (80/80 prompts). All 12 methods now have full MT-Bench coverage.
  • [2026-03-23] Initial release: full pipeline, 7 baselines, 4 ablations, cross-model evaluation support, 11 unit tests.

The Problem

Current LLM generation pipelines follow a generate-then-evaluate pattern. This has three fundamental limitations:

  1. Evaluation is disconnected from generation. The model has no awareness of quality dimensions during generation. Rubric-based reward models (CARMO, OpenRubrics) generate criteria but use them only for post-hoc scoring — never as generation targets.

  2. Refinement feedback is holistic, not dimensional. Self-Refine (Madaan et al., 2023) produces feedback like "the response lacks specificity," requiring the model to self-diagnose which dimensions failed. Diagnosis — not repair — is the bottleneck (RefineBench, 2025).

  3. All quality dimensions receive equal emphasis. Whether a response nails relevance but fails on accuracy, or the reverse, refinement passes treat all dimensions uniformly. No system allocates more generation budget to failing dimensions.

The Solution

EFA operates in three phases (a minimal loop sketch follows the list):

  1. Criteria Generation: A dedicated LLM call analyzes the prompt and produces query-specific evaluation criteria with measurable rubrics.
  2. Criteria-Masked Progressive Generation (CMPG): The generator sees criteria one-at-a-time via progressive unmasking — like causal masking over quality dimensions.
  3. Failure-Weighted Reattention (FWRL): Per-criterion scores map to emphasis weight adjustments, amplifying focus on failing dimensions in subsequent passes — like focal loss at inference time.
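
Here is that loop as a minimal sketch. The helpers are trivial stand-ins for the LLM calls, not names from the codebase:

# Hypothetical sketch of the three-phase EFA loop; helpers are stand-ins.
def generate_criteria(prompt, n):
    return [f"criterion {i + 1}" for i in range(n)]      # Phase 1 stand-in

def score(prompt, response, criterion):
    return 0.7                                           # judge stand-in

def refine(prompt, response, criteria, weights):
    return response + " (refined)"                       # generator stand-in

def efa_loop(prompt, n_criteria=5, tau=0.6, k_max=3, alpha=2.0):
    criteria = generate_criteria(prompt, n_criteria)     # Phase 1
    weights = [1.0] * len(criteria)
    response = "initial draft"                           # Phase 2 would use CMPG
    for _ in range(k_max):
        scores = [score(prompt, response, c) for c in criteria]
        if all(s >= tau for s in scores):                # converged: all pass
            break
        weights = [w * (1 + alpha * max(0.0, tau - s))   # Phase 3: FWRL boost
                   for w, s in zip(weights, scores)]
        response = refine(prompt, response, criteria, weights)
    return response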

[Figure: EFA architecture]


How EFA Differs from Prior Work

| Method | Criteria-Aware Generation | Per-Criterion Scoring | Failure-Proportional Reweighting | Progressive Masking |
|---|---|---|---|---|
| Single-pass | - | - | - | - |
| Self-Refine (Madaan et al., 2023) | - | - | - | - |
| Reflexion (Shinn et al., 2023) | - | - | - | - |
| CARMO (Zhang et al., 2025) | - | Yes | - | - |
| ReFeed (2025) | - | Yes (3 dims) | - | - |
| RSD (Xu et al., 2025) | Process reward | Step-level | - | - |
| EFA (ours) | Yes | Yes | Yes | Yes |

EFA is the first system that combines all four: criteria-aware generation via dynamic rubrics, per-criterion scoring, failure-proportional reweighting, and progressive unmasking.


Novel Mechanisms

1. Criteria-Masked Progressive Generation (CMPG)

Instead of showing all criteria at once, CMPG progressively unmasks them — ensuring foundational criteria are satisfied before the model optimizes for secondary dimensions:

Sub-step 1: Generate with {c₁} only          → draft d₁    (e.g., factual accuracy)
Sub-step 2: Refine with {c₁, c₂}             → draft d₂    (+ completeness)
Sub-step 3: Refine with {c₁, c₂, c₃}         → draft d₃    (+ clarity)
...
Sub-step n: Refine with {c₁, c₂, ..., cₙ}    → response R  (all criteria)

Intuition: Like curriculum learning at inference time — satisfy fundamentals before polish.
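
The schedule in code, as a sketch (generate_draft and refine_draft are hypothetical stand-ins for generator calls):

# Hypothetical sketch of CMPG: one criterion is unmasked per sub-step.
def generate_draft(prompt, visible):
    return f"draft covering {visible}"                   # generator stand-in

def refine_draft(prompt, draft, visible):
    return f"{draft}, revised for {visible[-1]}"         # refiner stand-in

def cmpg_generate(prompt, criteria):
    draft = generate_draft(prompt, visible=criteria[:1])  # sub-step 1: c1 only
    for i in range(2, len(criteria) + 1):
        # each refinement pass sees one more criterion than the last
        draft = refine_draft(prompt, draft, visible=criteria[:i])
    return draft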

2. Failure-Weighted Reattention Loop (FWRL)

Failed criteria receive emphasis-weight boosts proportional to how badly they failed:

w_i^(k+1) = w_i^(k) * (1 + α * max(0, τ - s_i^(k)))
  • Passed criteria (s ≥ τ): weights unchanged — don't fix what isn't broken
  • Failed criteria (s < τ): weight boost proportional to failure gap
  • Checkpoint mechanism: locks previously-passing criteria to prevent regression

Intuition: Like focal loss for inference — focus compute budget where it's needed most.
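
The same update in code, with the checkpoint lock included. A minimal sketch; the exact semantics of the regression tolerance ε (listed in the hyperparameter table below) are assumed here:

# FWRL weight update per the formula above, plus an assumed checkpoint rule:
# a locked criterion stays locked while its score stays within eps of tau.
def fwrl_update(weights, scores, locked, tau=0.6, alpha=2.0, eps=0.1):
    new_weights, new_locked = [], []
    for w, s, was_locked in zip(weights, scores, locked):
        lock = s >= tau or (was_locked and s >= tau - eps)
        if lock:
            new_weights.append(w)                            # passed: unchanged
        else:
            new_weights.append(w * (1 + alpha * (tau - s)))  # boost by failure gap
        new_locked.append(lock)
    return new_weights, new_locked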


Experiment Results

960 runs completed: 12 methods × 80 MT-Bench prompts. Cross-model evaluation: MiniMax-M2.5 (generator) + Qwen-3.5-9B via Ollama (evaluator) — eliminates self-preference bias.

[Figure: APR comparison across methods]

Main Results

| Method | RAS ↑ | APR (%) ↑ | TTC ↓ |
|---|---|---|---|
| Single-pass | 0.881 | 71.2 | 9,210 |
| Rubric-then-Score | 0.863 | 72.5 | 9,469 |
| All-Criteria-at-Once | 0.925 | 90.0 | 13,408 |
| Uniform Reattention | 0.935 | 90.0 | 17,814 |
| Best-of-5 | 0.957 | 91.2 | 21,999 |
| Self-Refine | 0.953 | 92.5 | 13,652 |
| FusioN | 0.956 | 92.5 | 15,449 |
| EFA (Full) | 0.962 | 96.2 | 16,449 |

Key Findings

  1. EFA beats all 7 baselines on APR: 96.2% vs best baseline 92.5% (+3.7pp). +25.0pp over single-pass generation.

  2. Iteration is the biggest driver (−12.4pp): Removing iterative refinement drops APR from 96.2% to 83.8%. Multi-pass generation with criterion-level feedback is essential.

  3. Dynamic criteria matter (−7.4pp): Replacing per-query criteria with a fixed universal set drops APR from 96.2% to 88.8%.

  4. CMPG provides measurable gains (−3.7pp): Progressive masking over quality dimensions outperforms presenting all criteria simultaneously.

  5. FWRL shows no measurable contribution (0.0pp): Removing failure-weighted reattention yields identical APR (96.2%) and slightly higher RAS (0.967). We attribute this to a ceiling effect with a strong generator — the model fixes failing criteria even with uniform weights. This is honestly reported in the paper.

  6. EFA is more token-efficient than brute-force: 75% of Best-of-5's token cost with +5.0pp higher APR.

[Figures: ablation study; cost-quality tradeoff]

Metrics

| Metric | Full Name | What It Measures | Range |
|---|---|---|---|
| RAS | Rubric Adherence Score | Mean per-criterion score across all criteria | [0, 1] |
| APR | All-Pass Rate | % of prompts where every criterion meets threshold τ | [0%, 100%] |
| ITC | Iterations to Convergence | Mean iterations before all criteria pass or K_max is hit | [1, K_max] |
| TTC | Total Token Cost | Total tokens consumed across the full pipeline | Tokens |
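
Given per-criterion scores, the two headline metrics reduce to a few lines; a sketch following the definitions above:

# RAS and APR computed from per-criterion scores, per the table above.
def ras(scores):                      # per-criterion scores in [0, 1], one prompt
    return sum(scores) / len(scores)  # mean per-criterion score

def apr(score_lists, tau=0.6):        # one score list per prompt
    n_pass = sum(all(s >= tau for s in scores) for scores in score_lists)
    return 100.0 * n_pass / len(score_lists)  # % of prompts passing every criterion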

Quick Start

Installation

git clone https://github.com/karthyick/evaluation-first-attention.git
cd evaluation-first-attention
pip install -e ".[dev]"

API Keys

EFA works with any LiteLLM-compatible model:

# Cloud APIs (pick one or more)
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GROQ_API_KEY="gsk_..."

# Local inference (Ollama — free, no API key needed)
# ollama serve && ollama pull qwen3.5:9b

Basic Usage

from efa import EFAPipeline

pipeline = EFAPipeline(
    model="gpt-4o",                              # Generator
    evaluator_model="claude-sonnet-4-20250514",   # Cross-model evaluator
    n_criteria=5,
    threshold=0.6,
    max_iterations=3,
    alpha=2.0,
)

result = pipeline.run("Explain quantum entanglement to a high school student")

print(result.response)                    # Final response
print(result.rubric_adherence_score)      # RAS
print(result.all_pass)                    # APR (bool)
print(result.n_iterations)                # ITC
print(result.total_tokens)                # TTC
print([c.name for c in result.criteria])  # Criterion names
print(result.final_scores)                # Per-criterion scores in [0, 1]

Local Mode (Ollama — zero cost)

pipeline = EFAPipeline(
    model="ollama/qwen3.5:9b",
    evaluator_model="ollama/qwen3.5:9b",
    n_criteria=3,
    threshold=0.6,
    max_iterations=2,
)

Running Experiments

# Full suite: EFA + 7 baselines + 4 ablations (12 methods)
python experiments/run_experiment.py --config configs/ollama_local.yaml

# Single method
python experiments/run_experiment.py --method efa --prompts sample --max-prompts 10

# Cross-model evaluation
python experiments/run_experiment.py --config configs/groq_gemini.yaml

Baselines (7)

| # | Baseline | What It Isolates |
|---|---|---|
| 1 | Single-pass | No criteria, no refinement; pure baseline |
| 2 | Self-Refine (Madaan et al., 2023) | Holistic feedback loop without criteria structure |
| 3 | Rubric-then-Score | Criteria used for eval only, not generation conditioning |
| 4 | All-Criteria-at-Once | No progressive masking; tests CMPG's value |
| 5 | Uniform Reattention | No failure weighting; tests FWRL's value |
| 6 | Best-of-N | Independent sampling; cheapest scaling alternative |
| 7 | FusioN (Agarwal et al., 2025) | Multi-candidate synthesis: generate N, synthesize one superior response |

Ablations (4)

| Ablation | Component Removed | Expected Impact |
|---|---|---|
| -DynCriteria | Dynamic per-query criteria → fixed universal set | Tests value of query-specific rubrics |
| -CMPG | Progressive masking → all criteria shown at once | Tests curriculum-style unmasking |
| -FWRL | Failure weighting → uniform weights on all criteria | Tests targeted vs. uniform reattention |
| -Iteration | Refinement loop → single pass with criteria | Tests value of iterative improvement |

Project Structure

evaluation-first-attention/
├── src/efa/                      # Core EFA implementation
│   ├── pipeline.py               # Full pipeline — Algorithm 1 from paper
│   ├── criteria_generator.py     # Component 1: Dynamic criteria generation
│   ├── progressive_generator.py  # Component 2: CMPG progressive masking
│   ├── evaluator.py              # Component 3a: Per-criterion evaluation
│   ├── reattention.py            # Component 3b: FWRL weight updates + checkpointing
│   ├── baselines.py              # All 7 baseline implementations
│   ├── models.py                 # Data models (Criterion, EvaluationResult, etc.)
│   └── llm_client.py             # LiteLLM abstraction with retry + JSON repair
├── experiments/
│   ├── prompts/                  # Benchmark prompt sets (sample, MT-Bench, etc.)
│   ├── results/                  # Experiment outputs (JSON)
│   └── run_experiment.py         # Experiment runner with rich table output
├── configs/                      # Hyperparameter configs (YAML)
│   ├── default.yaml              # GPT-4o + Claude cross-model
│   ├── ollama_local.yaml         # Local Ollama (zero cost)
│   └── groq_gemini.yaml          # Groq + Gemini cross-model
├── scripts/                      # Visualization and analysis scripts
├── tests/                        # Unit tests (11 passing)
├── docs/                         # Diagrams, screenshots, analysis guides
└── paper/                        # LaTeX source + compiled PDF

Hyperparameters

| Parameter | Symbol | Default | Description |
|---|---|---|---|
| n_criteria | n | 5 | Number of evaluation criteria per query |
| threshold | τ | 0.6 | Passing threshold (rubric score >= 3/5) |
| max_iterations | K_max | 3 | Maximum refinement loops |
| alpha | α | 2.0 | Reattention strength (higher = more aggressive reweighting) |
| epsilon | ε | 0.1 | Regression tolerance for checkpoint locking |

See configs/alpha_sensitivity.yaml for alpha sweep configuration (α ∈ {1.0, 2.0, 5.0}).
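
For intuition on the sweep: for a criterion failing at s = 0.4 against τ = 0.6, the single-pass boost factor 1 + α·(τ − s) works out to 1.2 at α = 1.0, 1.4 at α = 2.0, and 2.0 at α = 5.0. A quick check:

# Boost factor for one failing criterion (s = 0.4, tau = 0.6) per alpha.
tau, s = 0.6, 0.4
for alpha in (1.0, 2.0, 5.0):
    print(alpha, round(1 + alpha * max(0.0, tau - s), 3))  # -> 1.2, 1.4, 2.0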

Tests

python -m pytest tests/ -v

11 tests covering: priority label mapping, evaluation scoring, FWRL weight updates, checkpoint locking, convergence detection.

Reproducibility

  • All experiment results are saved as JSON in experiments/results/ with per-prompt, per-method scores.
  • Configs capture exact hyperparameters used for each experiment run.
  • Local experiments (Ollama) are fully reproducible at zero cost.
  • API-based experiments may produce different results due to model versioning and temperature sampling.
  • Pre-computed results from our initial run are included in the repo.

Citation

@article{mohan2026efa,
  title={Evaluation-First Generation: Specification-Driven LLM Output Quality
         via Dynamic Rubric Conditioning and Iterative Criteria Refinement},
  author={Mohan, Karthick Raja},
  year={2026},
  month={March}
}

License

MIT License - Karthick Raja M, 2026
