
A pytest plugin for rubric-based LLM-as-judge testing with auto-discovery and calibration

pytest-llm-rubric

Experimental — this plugin is in early development. APIs may change without notice.

Minimal pytest plugin for LLM-as-a-Judge — simple semantic PASS/FAIL checks against text or documents.

Why pytest?

Your CI already runs pytest. Semantic text checks shouldn't need a separate framework. Just another test file.

Use When

  • Wording varies but meaning must be preserved
  • Exact string assertions are too brittle
  • Tests need binary semantic judgments: PASS or FAIL

Examples:

  • Agent skill regression — instruction docs still contain required rules after edits
  • Prompt regression — LLM output quality hasn't degraded after prompt changes
  • Doc generation CI — auto-generated docs include all required sections
  • Translation fidelity — specific meanings are preserved across languages

Not a general essay grader or multi-dimensional scoring system.

Quick Start

Prerequisites

pip install pytest-llm-rubric  # or: uv add --dev pytest-llm-rubric
ollama serve                   # start Ollama (if not already running)
ollama pull granite4:3b        # any chat model works

Minimal Test

def test_mentions_deadline(judge_llm):
    # In practice, text is usually much longer —
    # policy docs, generated reports, LLM outputs, etc.
    text = "The report is due by March 31st."
    criterion = "The delivery deadline is mentioned."
    response = judge_llm.complete([
        {"role": "system", "content": "Does this text express the criterion? Reply PASS or FAIL."},
        {"role": "user", "content": f"TEXT:\n{text}\n\nCRITERION:\n{criterion}"},
    ])
    assert "PASS" in response.upper()
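
The substring check above also accepts a reply such as "FAIL, this does not pass", because the upper-cased text still contains PASS. One possible tightening is sketched below. `parse_verdict` is a hypothetical helper, not part of pytest-llm-rubric; it accepts a reply only when exactly one of the two verdict words appears as a whole word.

```python
import re


def parse_verdict(response: str) -> bool:
    """Return True for PASS, False for FAIL; raise ValueError if ambiguous.

    Hypothetical helper, not part of pytest-llm-rubric.
    """
    # Collect whole-word verdicts so "PASSES" or "FAILED" do not count.
    verdicts = set(re.findall(r"\b(PASS|FAIL)\b", response.upper()))
    if verdicts == {"PASS"}:
        return True
    if verdicts == {"FAIL"}:
        return False
    # Both words, or neither, appeared: refuse to guess.
    raise ValueError(f"ambiguous judge reply: {response!r}")
```

In a test this would replace the substring assertion with `assert parse_verdict(response)`, so an ambiguous reply surfaces as an error instead of a silent PASS.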

Execution Flow

  1. Discover — find an available LLM backend (local Ollama by default)
  2. Calibrate — run 12 golden tests to verify reliable PASS/FAIL judgment (skippable)
  3. Provide — expose the judge_llm session fixture on success
  4. Skip — skip dependent tests on backend absence or calibration failure (not fail)

By default, only local Ollama is tried. Paid cloud APIs require explicit opt-in.
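
The four steps above can be sketched as a single decision function. This is only an illustration of the flow, not the plugin's implementation: `resolve_judge`, `discover`, and `calibrate` are hypothetical names, and a real fixture would presumably call `pytest.skip(reason)` where this sketch returns a reason string.

```python
from typing import Callable, Optional, Tuple


def resolve_judge(
    discover: Callable[[], Optional[object]],
    calibrate: Callable[[object], bool],
) -> Tuple[Optional[object], Optional[str]]:
    """Return (backend, None) on success or (None, skip_reason) otherwise.

    Hypothetical sketch of the discover -> calibrate -> provide/skip flow.
    """
    backend = discover()  # step 1: find an available backend
    if backend is None:
        return None, "no LLM backend available"  # step 4: skip, not fail
    if not calibrate(backend):  # step 2: golden-test calibration
        return None, "calibration failed"  # step 4: skip, not fail
    return backend, None  # step 3: provide the judge
```

Returning a reason instead of raising keeps the sketch runnable without pytest installed; the skip-instead-of-fail behaviour is what lets a machine without an LLM still report a green test run.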

Example: Policy Document Checks

Verify that each policy document semantically expresses required rules.

import pytest
from pathlib import Path
from pytest_llm_rubric import JudgeLLM

DOCS_DIR = Path("policies")
POLICY_DOCS = sorted(DOCS_DIR.rglob("*.md"))
REQUIRED_RULES = [
    "Personal data must be encrypted at rest",
    "Access logs are retained for at least 90 days",
    "Third-party integrations require security review",
]

@pytest.mark.parametrize("doc", POLICY_DOCS)
@pytest.mark.parametrize("rule", REQUIRED_RULES)
def test_policy_expresses_rule(judge_llm: JudgeLLM, doc, rule):
    text = doc.read_text()
    response = judge_llm.complete([
        {"role": "system", "content": "Does this document express the criterion? Reply PASS or FAIL."},
        {"role": "user", "content": f"DOCUMENT:\n{text}\n\nCRITERION:\n{rule}"},
    ])
    assert "PASS" in response.upper(), f"{doc} is missing rule: {rule}"

Configuration

Variable | Default
PYTEST_LLM_RUBRIC_BACKEND | (empty) = Ollama only. Accepted values: ollama, anthropic, openai, auto
PYTEST_LLM_RUBRIC_MODEL | Provider-specific default
PYTEST_LLM_RUBRIC_<PROVIDER>_MODEL | Overrides MODEL per provider
PYTEST_LLM_RUBRIC_SKIP_CALIBRATION | (disabled)

Model resolution, highest priority first: PYTEST_LLM_RUBRIC_<PROVIDER>_MODEL, then PYTEST_LLM_RUBRIC_MODEL, then the provider default in defaults.py.
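
The precedence can be pictured as a short lookup. The sketch below is an assumption-laden illustration, not the plugin's code: `resolve_model` is a hypothetical name, the provider part of the variable is assumed to be upper-cased (for example PYTEST_LLM_RUBRIC_OLLAMA_MODEL), and the default is passed in rather than read from defaults.py.

```python
from typing import Mapping


def resolve_model(provider: str, env: Mapping[str, str], default: str) -> str:
    """Pick a model name: per-provider variable, then shared variable, then default.

    Hypothetical sketch; the upper-cased provider segment is an assumption.
    """
    per_provider = env.get(f"PYTEST_LLM_RUBRIC_{provider.upper()}_MODEL")
    if per_provider:
        return per_provider
    shared = env.get("PYTEST_LLM_RUBRIC_MODEL")
    if shared:
        return shared
    return default
```

Called as `resolve_model("ollama", os.environ, "granite4:3b")`, it would return the per-provider value when set, fall back to the shared one, and otherwise use the supplied default.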

Backend Behavior

  • (empty) — Ollama only. Safe default, no API costs.
  • auto — Ollama → Anthropic → OpenAI (first available)
  • ollama / anthropic / openai — use only the specified backend

If no backend is available or calibration fails, dependent tests are skipped (not failed).
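
As a rough sketch of the rules above, the value of PYTEST_LLM_RUBRIC_BACKEND maps to an ordered list of backends to try. `backend_candidates` is a hypothetical helper written for illustration; how the plugin treats unknown values or letter case is not stated here, so raising on unknown values and lower-casing the input are assumptions.

```python
from typing import List

KNOWN_BACKENDS = ("ollama", "anthropic", "openai")


def backend_candidates(setting: str) -> List[str]:
    """Map the backend setting to the ordered list of backends to try.

    Hypothetical sketch; case-insensitivity and the error on unknown
    values are assumptions, not documented plugin behaviour.
    """
    value = setting.strip().lower()
    if value == "":
        return ["ollama"]  # safe default: local only, no API costs
    if value == "auto":
        return list(KNOWN_BACKENDS)  # first available wins
    if value in KNOWN_BACKENDS:
        return [value]  # use only the specified backend
    raise ValueError(f"unknown backend setting: {setting!r}")
```

Keeping paid providers out of the empty-setting path is what makes the default safe: a cloud API is only ever contacted after an explicit opt-in.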

CI

Set PYTEST_LLM_RUBRIC_BACKEND and the matching provider credentials in your CI secrets.

env:
  PYTEST_LLM_RUBRIC_BACKEND: openai  # or: anthropic
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Markers

Tests that use the judge_llm fixture automatically receive the llm_rubric marker.

pytest -m "not llm_rubric"  # run everything except LLM-judged tests
pytest -m llm_rubric        # run only LLM-judged tests

Custom Backend

Override the fixture for a custom LLM client or internal gateway.

import pytest
import requests

class MyBackend:
    def complete(self, messages, max_tokens=256):
        # Call your internal LLM gateway; fail loudly on hangs or HTTP errors.
        resp = requests.post(
            "https://internal-llm.corp/v1/chat",
            json={"messages": messages},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["content"]

@pytest.fixture(scope="session")
def judge_llm():
    return MyBackend()
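
Judging from the example above, the fixture only needs an object whose `complete(messages, max_tokens)` method returns text. That is an inference from the example, not a documented contract. Under that assumption, a canned backend can stand in for a real model when checking test plumbing offline. `CannedBackend` is a hypothetical name.

```python
from typing import Dict, List


class CannedBackend:
    """Offline stand-in that always returns the same reply.

    Hypothetical helper; it assumes the judge only needs a
    complete(messages, max_tokens) method returning a string.
    """

    def __init__(self, reply: str = "PASS") -> None:
        self.reply = reply
        self.calls: List[List[Dict[str, str]]] = []

    def complete(self, messages: List[Dict[str, str]], max_tokens: int = 256) -> str:
        # Record the prompt so a test can inspect what was sent.
        self.calls.append(messages)
        return self.reply
```

Returning `CannedBackend("FAIL")` from an overridden `judge_llm` fixture is a cheap way to confirm that a failing verdict really fails the test, without any model running.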

Custom System Prompt

Tweak the calibration system prompt if your model needs specific instructions to pass calibration.

from pytest_llm_rubric.calibration import calibrate, JUDGE_SYSTEM_PROMPT

# llm is any backend object with a complete(messages) method, e.g. MyBackend().
result = calibrate(llm, system_prompt="Your custom prompt here.")

The default JUDGE_SYSTEM_PROMPT is used when system_prompt is omitted.
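
To make the role of the system prompt concrete, here is a rough picture of what a calibration pass might look like. It is not the plugin's implementation: `run_calibration`, the two golden cases, and the rule that every case must match exactly are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Messages = List[Dict[str, str]]

# Hypothetical golden cases: (text, criterion, expected verdict).
GOLDEN_CASES: Sequence[Tuple[str, str, str]] = (
    ("The report is due by March 31st.", "The delivery deadline is mentioned.", "PASS"),
    ("The report covers Q1 revenue.", "The delivery deadline is mentioned.", "FAIL"),
)


def run_calibration(
    complete: Callable[[Messages], str],
    system_prompt: str,
    cases: Sequence[Tuple[str, str, str]] = GOLDEN_CASES,
) -> bool:
    """Return True only if every golden case gets the expected verdict.

    Hypothetical sketch; the all-cases-must-match rule is an assumption.
    """
    for text, criterion, expected in cases:
        reply = complete([
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"TEXT:\n{text}\n\nCRITERION:\n{criterion}"},
        ])
        if reply.strip().upper() != expected:
            return False
    return True
```

Seen this way, a custom system prompt matters because it is sent with every golden case: a model that adds explanations or formatting around its verdict can fail calibration even when its judgment is sound.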

Find Best Local Model

uv run python -m pytest_llm_rubric.find_local_model

Runs calibration against all local Ollama models and recommends the smallest one that passes.
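
The selection rule described above amounts to a filter followed by a minimum. The sketch below shows only that rule: `smallest_passing` and the (name, size_bytes, passed) tuples are hypothetical, and how the real command measures model size is not stated here.

```python
from typing import Iterable, Optional, Tuple


def smallest_passing(results: Iterable[Tuple[str, int, bool]]) -> Optional[str]:
    """Return the name of the smallest model that passed calibration.

    Each result is a hypothetical (name, size_bytes, passed) tuple.
    Returns None when no model passed.
    """
    passing = [(size, name) for name, size, passed in results if passed]
    if not passing:
        return None
    # min() compares sizes first; names only break ties.
    return min(passing)[1]
```

Preferring the smallest passing model keeps the judged tests fast on CI runners and laptops while still requiring the calibration bar to be met.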

Development

git clone https://github.com/ugai/pytest-llm-rubric.git
cd pytest-llm-rubric
uv sync
uv run pre-commit install           # ruff + ty on every commit
uv run pytest -m "not integration"  # no LLM calls, runs offline
uv run ruff check src/ tests/
uv run ruff format src/ tests/
uv run ty check src/

References

This plugin's design, which decomposes evaluation into multiple binary PASS/FAIL criteria instead of multi-level scoring, aligns with Anthropic's recommended practices.

License

MIT
