rubric

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

gavinbains jbesgen

These details have not been verified by PyPI

Project description

Rubric: A Python library for LLM-based evaluation using weighted rubrics.

Installation

uv add rubric

Usage

Quick Start with Default Generate Functions

For quick testing, use the built-in Gemini generate functions:

export GEMINI_API_KEY=your_api_key_here

import asyncio
from rubric import Rubric, default_per_criterion_generate_fn
from rubric.autograders import PerCriterionGrader

async def main():
    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "Response mentions Paris"},
        {"weight": 5.0, "requirement": "Response is concise"}
    ])

    grader = PerCriterionGrader(generate_fn=default_per_criterion_generate_fn)
    result = await rubric.grade("Paris is the capital of France.", autograder=grader)
    print(f"Score: {result.score}")

asyncio.run(main())

See examples/basic_usage.py for more examples with all three autograder types.

Custom Generate Function with OpenAI

For production use, implement your own generate_fn with structured outputs:

import asyncio
import os
from openai import AsyncOpenAI
from rubric import Rubric, PerCriterionOutput
from rubric.autograders import PerCriterionGrader

# Declare custom generate function with any model and inference provider
async def generate_with_openai(system_prompt: str, user_prompt: str) -> PerCriterionOutput:
    client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = await client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_schema", "json_schema": {
            "name": "criterion_output",
            "schema": PerCriterionOutput.model_json_schema()
        }},
        max_tokens=400,
        temperature=0.0,
    )
    content = response.choices[0].message.content or "{}"
    return PerCriterionOutput.model_validate_json(content)

async def main():
    # Build rubric
    rubric = Rubric.from_dict([
        {"weight": 10.0, "requirement": "States Q4 2023 base margin as 17.2%"},
        {"weight": 8.0, "requirement": "Explicitly uses Shapley attribution for decomposition"},
        {"weight": -15.0, "requirement": "Uses total deliveries instead of cash-only deliveries"}
    ])

    # Select autograder strategy
    grader = PerCriterionGrader(
        generate_fn=generate_with_openai,
        system_prompt="This overrides the default grader system prompt",
    )

    # Grade output
    result = await rubric.grade(
        query="Input query...",
        to_grade="Output to evaluate...",
        autograder=grader
    )

    print(f"Score: {result.score:.2f}")  # Score is 0.0-1.0
    for criterion in result.report:
        print(f"  [{criterion.verdict}] {criterion.requirement}")
        print(f"    → {criterion.reason}")

asyncio.run(main())

Autograder Strategies

PerCriterionGrader

Evaluates each criterion in parallel inference calls.

Scoring Formula:

For each criterion $i$:

If verdict = MET, contribution = $w_i$
If verdict = UNMET, contribution = 0

Final score:

$$ \text{score} = \max\left(0, \min\left(1, \frac{\sum_{i=1}^{n} \mathbb{1}[\text{verdict}i = \text{MET}] \cdot w_i}{\sum{i=1}^{n} \max(0, w_i)}\right)\right) $$

Where:

$w_i$ = weight of criterion $i$
$\mathbb{1}[\text{verdict}_i = \text{MET}]$ = 1 if criterion is MET, 0 otherwise
Denominator = $\sum_{i=1}^{n} \max(0, w_i)$ (positive weights only)
Numerator = sum of weights for MET criteria
Result clamped to [0, 1]

All-Negative Criteria Rubrics:

For rubrics containing only negative criteria (e.g., error detection rubrics), a different formula is used:

$$ \text{score} = \max\left(0, \min\left(1, 1 + \frac{\sum_{i=1}^{n} \mathbb{1}[\text{verdict}i = \text{MET}] \cdot w_i}{\sum{i=1}^{n} |w_i|}\right)\right) $$

This ensures:

Score = 1.0 when all errors are avoided (all criteria UNMET)
Score = 0.0 when all errors are present (all criteria MET)
Proportional scores for partial error presence

PerCriterionOneShotGrader

PerCriterionOneShotGrader makes 1 inference call that evaluates all criteria together and returns a structured output, unlike PerCriterionGrader which makes $n$ inference calls.

Scoring Formula:

Same as PerCriterionGrader:

$$ \text{score} = \max\left(0, \min\left(1, \frac{\sum_{i=1}^{n} \mathbb{1}[\text{verdict}i = \text{MET}] \cdot w_i}{\sum{i=1}^{n} \max(0, w_i)}\right)\right) $$

RubricAsJudgeGrader

Holistic evaluation where the model returns a final score directly.

Scoring Formula:

The model is instructed to mentally evaluate all criteria and return a score from 0-100:

$$ \text{score} = \frac{\text{LLM-judged score}}{100} $$

Clamped to [0, 1]. The model is guided to use the same weighted scoring logic, but computes the result in-context rather than aggregating score post-hoc.

raw_score Consistency: The LLM's 0-100 score is converted to weighted-sum semantics for raw_score, ensuring consistency with other graders:

raw_score = (llm_score / 100.0) * total_positive_weight

The original LLM score is preserved in llm_raw_score for debugging.

Default System Prompts

Each autograder uses a specialized system prompt optimized for its evaluation approach:

PerCriterionGrader - Detailed criterion-by-criterion evaluation with strict JSON formatting requirements. The prompt instructs the LLM to evaluate each criterion independently, handling both positive and negative criteria with specific response formats.

PerCriterionOneShotGrader - Streamlined prompt for evaluating all criteria in a single response. Focuses on providing verdicts (MET/UNMET) and explanations for each criterion in a structured JSON format.

RubricAsJudgeGrader - Holistic evaluation prompt that asks the LLM to consider the output as a whole and provide a single overall score from 0-100, taking into account the weights of all criteria.

You can view the complete default prompts in the source files:

Customizing System Prompts: You can override the default system prompt by passing a system_prompt parameter to any autograder:

grader = PerCriterionGrader(
    generate_fn=your_function,
    system_prompt="Your custom system prompt here"
)

XML Tag Structure: The autograders wrap content in <response> XML tags. If a query is provided (optional), it's wrapped in <query> tags. If you provide a custom system prompt, ensure it handles the response structure you're using:

<!-- Plain string response -->
<response>
{content}
</response>

<!-- Or nested with thinking/output -->
<response>
<thinking>{thinking_content}</thinking>
<output>{output_content}</output>
</response>

The structure depends on what you pass to rubric.grade(). Customize your system prompt to handle your preferred format.

Customization

You can customize grading at multiple levels:

1. Custom generate_fn (most common) Pass any typed function that returns a Pydantic model. Use any LLM provider (OpenAI, Anthropic, local models, etc.):

from rubric import PerCriterionOutput

async def your_custom_function(system_prompt: str, user_prompt: str) -> PerCriterionOutput:
    # Your LLM call here with structured outputs
    ...
    return PerCriterionOutput(criterion_status="MET", explanation="...")

grader = PerCriterionGrader(generate_fn=your_custom_function)

Each autograder requires a specific return type:

PerCriterionGrader → PerCriterionOutput
PerCriterionOneShotGrader → OneShotOutput
RubricAsJudgeGrader → RubricAsJudgeOutput

2. Create custom autograder Subclass Autograder and implement the abstract methods:

judge() - Evaluates the submission and returns raw results
aggregate() - Transforms judge results into an EvaluationReport

The generate_fn pattern is optional - you can make LLM calls directly, use multiple functions, or skip LLMs entirely.

3. Override system prompts Customize the default prompts for built-in autograders:

grader = PerCriterionGrader(
    generate_fn=your_function,
    system_prompt="Your custom system prompt here"
)

Error Handling

In v2.0.0, validation happens at generation time via Pydantic models. Your generate_fn is responsible for:

Structured outputs - Use your LLM provider's structured output features (JSON schema, function calling, etc.) to ensure valid responses
Retry logic - Implement retries within your generate_fn if needed
Validation - Return a validated Pydantic model (PerCriterionOutput, OneShotOutput, or RubricAsJudgeOutput)

If your generate_fn returns invalid data, Pydantic will raise a ValidationError.

Example with retries:

from pydantic import ValidationError
from rubric import PerCriterionOutput

async def generate_with_retries(system_prompt: str, user_prompt: str, max_retries: int = 3) -> PerCriterionOutput:
    for attempt in range(max_retries):
        try:
            response = await your_llm_call(system_prompt, user_prompt)
            return PerCriterionOutput.model_validate_json(response)
        except ValidationError as e:
            if attempt == max_retries - 1:
                raise
            continue  # Retry on validation error

Best practice: Use structured outputs (JSON schema constrained decoding) in your LLM client to avoid validation errors entirely.

Score Fields

The EvaluationReport returned by rubric.grade() contains several score fields:

Field	Description
`score`	Final score (0-1 if normalized, raw weighted sum if `normalize=False`)
`raw_score`	Weighted sum before normalization. Consistent semantics across all graders.
`llm_raw_score`	Original LLM output before conversion. For `RubricAsJudgeGrader`, this is the 0-100 score.
`report`	Per-criterion breakdown (None for `RubricAsJudgeGrader`)

Cross-Grader Consistency: raw_score uses weighted-sum semantics across all graders, enabling direct comparison:

# Same rubric, different graders - raw_score is comparable
result1 = await rubric.grade(text, autograder=PerCriterionGrader())
result2 = await rubric.grade(text, autograder=RubricAsJudgeGrader())

# Both raw_scores are on the same scale (weighted sum)
print(result1.raw_score)      # e.g., 12.75
print(result2.raw_score)      # e.g., 12.75 (converted from LLM's 85/100)
print(result2.llm_raw_score)  # e.g., 85.0 (original LLM output)

Loading Rubrics

# Direct construction
rubric = Rubric([
    Criterion(weight=10.0, requirement="States Q4 2023 base margin as 17.2%"),
    Criterion(weight=8.0, requirement="Explicitly uses Shapley attribution for decomposition"),
    Criterion(weight=-15.0, requirement="Uses total deliveries instead of cash-only deliveries")
])

# From list of dictionaries
rubric = Rubric.from_dict([
    {"weight": 10.0, "requirement": "States Q4 2023 base margin as 17.2%"},
    {"weight": 8.0, "requirement": "Explicitly uses Shapley attribution for decomposition"},
    {"weight": -15.0, "requirement": "Uses total deliveries instead of cash-only deliveries"}
])

# From JSON string
rubric = Rubric.from_json('[{"weight": 10.0, "requirement": "Example requirement"}]')

# From YAML string
yaml_data = '''
- weight: 10.0
  requirement: "Example requirement"
'''
rubric = Rubric.from_yaml(yaml_data)

# From files
rubric = Rubric.from_file('rubric.json')
rubric = Rubric.from_file('rubric.yaml')

JSON Format

[
  {
    "weight": 10.0,
    "requirement": "States Q4 2023 base margin as 17.2%"
  },
  {
    "weight": 8.0,
    "requirement": "Explicitly uses Shapley attribution for decomposition"
  },
  {
    "weight": -15.0,
    "requirement": "Uses total deliveries instead of cash-only deliveries"
  }
]

YAML Format

- weight: 10.0
  requirement: "States Q4 2023 base margin as 17.2%"
- weight: 8.0
  requirement: "Explicitly uses Shapley attribution for decomposition"
- weight: -15.0
  requirement: "Uses total deliveries instead of cash-only deliveries"

Requirements

Python 3.10+
An LLM API (e.g., OpenAI, Anthropic, OpenRouter) - set appropriate API keys as environment variables

License

MIT License - see LICENSE file for details.

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

gavinbains jbesgen

These details have not been verified by PyPI

Release history Release notifications | RSS feed

2.2.0

Jan 21, 2026

2.1.0

Jan 21, 2026

2.0.1

Jan 14, 2026

This version

2.0.0

Jan 14, 2026

1.3.2

Jan 10, 2026

1.3.1

Jan 8, 2026

1.3.0

Jan 7, 2026

1.2.8

Jan 6, 2026

1.2.7

Jan 5, 2026

1.2.6

Dec 15, 2025

1.2.5

Dec 15, 2025

1.2.4

Nov 3, 2025

1.2.3

Nov 3, 2025

1.2.2

Oct 29, 2025

1.2.1

Oct 28, 2025

1.2.0

Oct 27, 2025

1.1.8

Oct 24, 2025

1.1.7

Oct 23, 2025

1.1.6

Oct 23, 2025

1.1.5

Oct 23, 2025

1.1.4

Oct 23, 2025

1.1.3

Oct 22, 2025

1.1.2

Oct 23, 2025

1.1.1

Oct 18, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

rubric-2.0.0.tar.gz (18.7 kB view details)

Uploaded Jan 14, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

rubric-2.0.0-py3-none-any.whl (26.4 kB view details)

Uploaded Jan 14, 2026 Python 3

File details

Details for the file rubric-2.0.0.tar.gz.

File metadata

Download URL: rubric-2.0.0.tar.gz
Upload date: Jan 14, 2026
Size: 18.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.9.25 {"installer":{"name":"uv","version":"0.9.25","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for rubric-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`e148fd394745f97116bc62e37d012c0802312a52f0c6d79c552cc9ba5dbd22f8`
MD5	`3d9582db06481306ec157c7e7526061e`
BLAKE2b-256	`75847d7f753d98fcdcd996bbe622165a9c29480510c2c0f3561b1caef822abab`

See more details on using hashes here.

File details

Details for the file rubric-2.0.0-py3-none-any.whl.

File metadata

Download URL: rubric-2.0.0-py3-none-any.whl
Upload date: Jan 14, 2026
Size: 26.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.9.25 {"installer":{"name":"uv","version":"0.9.25","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for rubric-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`097eaf1a081fd93e0582818fdc4c94df351a6d9ccf1f86c3e386d3fff05fba36`
MD5	`f6753864def928331ddc682fcb277955`
BLAKE2b-256	`3a2eadd63af87fb2e3539ac4899754a0766748485b0ea98de46afe0c0004548d`

See more details on using hashes here.

rubric 2.0.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Rubric: A Python library for LLM-based evaluation using weighted rubrics.

Installation

Usage

Quick Start with Default Generate Functions

Custom Generate Function with OpenAI

Autograder Strategies

PerCriterionGrader

PerCriterionOneShotGrader

RubricAsJudgeGrader

Default System Prompts

Customization

Error Handling

Score Fields

Loading Rubrics

JSON Format

YAML Format

Requirements

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes