A pytest plugin for rubric-based LLM-as-judge testing with auto-discovery and preflight

pytest-llm-rubric

Experimental — this plugin is in early development. APIs may change without notice.

Minimal pytest plugin for LLM-as-a-Judge — simple semantic PASS/FAIL checks against text or documents.

Why pytest?

Your CI already runs pytest. Semantic text checks shouldn't need a separate framework. Just another test file.

Use When

  • Wording varies but meaning must be preserved
  • Exact string assertions are too brittle
  • Tests need binary semantic judgments: PASS or FAIL

Examples:

  • Agent skill regression — instruction docs still contain required rules after edits
  • Prompt regression — LLM output quality hasn't degraded after prompt changes
  • Doc generation CI — auto-generated docs include all required sections
  • Translation fidelity — specific meanings are preserved across languages

Not a general essay grader or multi-dimensional scoring system.

Quick Start

Prerequisites

pip install pytest-llm-rubric          # or: uv add --dev pytest-llm-rubric
ollama serve                           # start Ollama (if not already running)
ollama pull gpt-oss:20b               # default model (or set PYTEST_LLM_RUBRIC_MODEL)

Minimal Test

def test_mentions_deadline(judge_llm):
    # In practice, text is usually much longer —
    # policy docs, generated reports, LLM outputs, etc.
    text = "The report is due by March 31st."
    assert judge_llm.judge(text, "The delivery deadline is mentioned.")

Execution Flow

  1. Discover — auto-detect available backends based on installed extras and env vars
  2. Preflight — verify the discovered backend can reliably judge PASS/FAIL before exposing it as judge_llm (skippable)
  3. Provide, skip, or fail — expose the judge_llm session fixture on success. If the default (empty) backend is unavailable or preflight fails, dependent tests are skipped. If an explicit backend is unavailable, tests fail

Paid cloud APIs never run unless explicitly configured.
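
The provide, skip, or fail decision in step 3 can be sketched as a small function. This only illustrates the rules described above and is not the plugin's actual code; `resolve_outcome` and its parameters are hypothetical names. The text states the skip-on-preflight-failure rule clearly only for the default provider, so the sketch assumes it applies to every provider.

```python
def resolve_outcome(provider: str, backend_available: bool, preflight_ok: bool) -> str:
    """Return "provide", "skip", or "fail" for a given discovery result.

    `provider` mirrors PYTEST_LLM_RUBRIC_PROVIDER; "" means the default.
    """
    explicit = provider != ""
    if not backend_available:
        # An explicit provider that cannot be reached points to a CI
        # misconfiguration, so it fails loudly; the default just skips.
        return "fail" if explicit else "skip"
    if not preflight_ok:
        # Assumption: a preflight failure skips regardless of provider.
        return "skip"
    return "provide"
```

For example, `resolve_outcome("openai", False, True)` returns `"fail"`, while `resolve_outcome("", False, True)` returns `"skip"`.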

Example: Policy Document Checks

Verify that each policy document semantically expresses required rules.

import pytest
from pathlib import Path
from pytest_llm_rubric import JudgeLLM

POLICY_DOCS = sorted(Path("docs/policies").rglob("*.md"))
REQUIRED_RULES = [
    "Personal data must be encrypted at rest",
    "Access logs are retained for at least 90 days",
    "Third-party integrations require security review",
]

# @pytest.mark.flaky(reruns=2)  # requires `pytest-rerunfailures` (recommended)
@pytest.mark.parametrize("doc", POLICY_DOCS)
@pytest.mark.parametrize("rule", REQUIRED_RULES)
def test_policy_expresses_rule(judge_llm: JudgeLLM, doc, rule):
    assert judge_llm.judge(doc.read_text(), rule), f"{doc} is missing rule: {rule}"
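
Stacking two `parametrize` decorators makes pytest generate one test per (doc, rule) combination, so the number of judge calls grows as documents × rules. A rough way to estimate that count before running the suite is sketched below; `expand_cases` is a hypothetical helper and the file names are made up.

```python
from itertools import product


def expand_cases(docs: list[str], rules: list[str]) -> list[tuple[str, str]]:
    """List every (doc, rule) pair that stacked parametrization would produce."""
    return list(product(docs, rules))


docs = ["access.md", "retention.md"]  # hypothetical file names
rules = [
    "Personal data must be encrypted at rest",
    "Access logs are retained for at least 90 days",
    "Third-party integrations require security review",
]
print(len(expand_cases(docs, rules)))  # 6 judge calls per run, before any reruns
```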

Configuration

All configuration is through environment variables.

Provider selection

PYTEST_LLM_RUBRIC_PROVIDER    Extra                 API key                 If unavailable
(empty)                       none (included)       -                       tests skip
ollama                        none (included)       -                       tests fail
anthropic                     [anthropic]           ANTHROPIC_API_KEY       tests fail
openai                        [openai]              OPENAI_API_KEY          tests fail
auto                          any of the above      -                       tests fail
<other> (e.g. mistral, groq)  install provider SDK  provider's own env var  tests fail

auto tries Ollama → Anthropic → OpenAI, using the first available. If the default (empty) provider is unavailable or preflight fails, dependent tests are skipped. If an explicit provider is set but unavailable, tests fail to surface CI misconfigurations.
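
The `auto` ordering amounts to a first-match search. The sketch below takes the availability check as a callable, because how the plugin actually detects an available provider is not described here; `first_available` and `AUTO_ORDER` are hypothetical names.

```python
from typing import Callable, Optional, Sequence

AUTO_ORDER = ("ollama", "anthropic", "openai")  # order stated in the text above


def first_available(
    is_available: Callable[[str], bool],
    order: Sequence[str] = AUTO_ORDER,
) -> Optional[str]:
    """Return the first provider in `order` that reports as available, else None."""
    for name in order:
        if is_available(name):
            return name
    return None
```

With `first_available(lambda name: name in {"anthropic", "openai"})` the result is `"anthropic"`, since Ollama is tried first and reported unavailable.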

Providers beyond the built-in three are passed through to any-llm, which handles API key and base URL resolution for 38+ providers.

CI example:

env:
  PYTEST_LLM_RUBRIC_PROVIDER: openai  # or: anthropic, mistral, groq, ...
  PYTEST_LLM_RUBRIC_MODEL: gpt-5.4-nano
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Model selection

Override the default model with PYTEST_LLM_RUBRIC_MODEL. Curated provider defaults are in defaults.py. Passthrough providers (mistral, groq, etc.) require PYTEST_LLM_RUBRIC_MODEL to be set.

Pro tip: Models with verbose reasoning traces (e.g. qwen3.5 in thinking mode) can be much slower on PASS/FAIL tasks. gpt-oss is a good default — fast despite using medium-level reasoning.

Skipping preflight

Set PYTEST_LLM_RUBRIC_SKIP_PREFLIGHT=1 to bypass the built-in golden tests.

Markers

Tests that use the judge_llm fixture automatically receive the llm_rubric marker.

pytest -m "not llm_rubric"  # run everything except LLM-judged tests
pytest -m llm_rubric        # run only LLM-judged tests
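
For readers curious how automatic marking of this kind can work, here is a minimal sketch using pytest's `pytest_collection_modifyitems` hook. It is one plausible approach and not necessarily how this plugin implements it.

```python
def pytest_collection_modifyitems(items):
    """Attach the `llm_rubric` marker to every test that requests `judge_llm`."""
    for item in items:
        # `fixturenames` lists every fixture a test uses, including ones
        # pulled in indirectly through other fixtures.
        if "judge_llm" in getattr(item, "fixturenames", ()):
            item.add_marker("llm_rubric")
```

Placed in a `conftest.py`, a hook like this runs once after collection. When `--strict-markers` is enabled, the marker name also has to be registered.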

Flaky test mitigation

LLM-based tests are inherently non-deterministic — the same input may produce different judgments across runs. This is a feature, not a bug: deterministic settings (temperature=0) would undermine the fuzzy semantic matching that makes this approach valuable.

Preflight screens out models that are too unreliable, but borderline cases may still produce occasional flaky results. Rather than fighting non-determinism, use pytest's existing ecosystem:

pip install pytest-rerunfailures
pytest --reruns 2 -m llm_rubric  # rerun failed LLM tests up to 2 times

See the pytest documentation on flaky tests for more strategies.
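
To get a feel for how much reruns help, the arithmetic can be sketched as follows. It assumes every attempt fails spuriously with the same probability and independently of the others, which real LLM calls only approximate; `residual_failure_rate` is a hypothetical helper.

```python
def residual_failure_rate(p_spurious_fail: float, reruns: int) -> float:
    """Probability that a test still fails after `reruns` extra attempts.

    Assumes independent attempts with an identical spurious-failure rate.
    """
    if not 0.0 <= p_spurious_fail <= 1.0:
        raise ValueError("p_spurious_fail must be between 0 and 1")
    if reruns < 0:
        raise ValueError("reruns must be non-negative")
    # The test is reported as failed only if every attempt fails.
    return p_spurious_fail ** (reruns + 1)
```

Under these assumptions, a check that fails spuriously 10% of the time fails about 0.1% of the time with `--reruns 2`. The same mechanism gives a genuinely failing document extra chances to pass, so reruns trade false failures for false passes.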

Customization

Custom backend

Override the judge_llm fixture for a custom LLM client or internal gateway.

import pytest
import requests
from pytest_llm_rubric import AnyLLMJudge

class MyBackend(AnyLLMJudge):
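    def complete(self, messages, max_output_tokens=256, response_format=None):
        # Call your internal LLM gateway (placeholder URL)
        resp = requests.post(
            "https://internal-llm.corp/v1/chat",
            json={"messages": messages},
            timeout=60,  # avoid hanging the test session on a stalled gateway
        )
        resp.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
        return resp.json()["content"]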

@pytest.fixture(scope="session")
def judge_llm():
    return MyBackend("my-model", "custom")

Extending AnyLLMJudge gives you the judge() convenience method for free. If you prefer a standalone class, implement both complete() and judge() (see the JudgeLLM protocol).
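
A standalone class might look like the sketch below. The method signatures are inferred from the examples in this README (`complete()` returning the raw reply, `judge()` returning a truthy value on PASS), so check them against the actual JudgeLLM protocol before relying on this. `CannedJudge` is a hypothetical name, and its verdict parsing is deliberately simplistic.

```python
from typing import Any, Optional


class CannedJudge:
    """Standalone backend sketch that returns a fixed reply (handy offline)."""

    def __init__(self, reply: str = "PASS") -> None:
        self.reply = reply

    def complete(
        self,
        messages: list[dict[str, str]],
        max_output_tokens: int = 256,
        response_format: Optional[Any] = None,
    ) -> str:
        # A real backend would send `messages` to an LLM here.
        return self.reply

    def judge(self, text: str, criterion: str) -> bool:
        messages = [
            {"role": "system", "content": "Reply with exactly PASS or FAIL."},
            {"role": "user", "content": f"DOCUMENT:\n{text}\n\nCRITERION:\n{criterion}"},
        ]
        # Simplified parsing (assumption): only an exact PASS counts as a pass.
        return self.complete(messages).strip().upper() == "PASS"
```

Returning such an object from a session-scoped `judge_llm` fixture lets the rest of a suite run without any LLM calls.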

Message-level API

The judge() method covers most use cases. For full control over messages, use complete() directly:

from pytest_llm_rubric import parse_verdict

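def test_custom_prompt(judge_llm):
    # Example inputs; replace with the document and criterion under test
    text = "The report is due by March 31st."
    criterion = "The delivery deadline is mentioned."
    response = judge_llm.complete([
        {"role": "system", "content": "Your custom system prompt. Reply PASS or FAIL."},
        {"role": "user", "content": f"DOCUMENT:\n{text}\n\nCRITERION:\n{criterion}"},
    ])
    verdict = parse_verdict(response)
    assert verdict == "PASS"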

Custom system prompt

Tweak the preflight system prompt if your model needs specific instructions to pass preflight.

from pytest_llm_rubric.preflight import preflight, JUDGE_SYSTEM_PROMPT

# `llm` is your judge backend instance (for example, a custom backend object)
result = preflight(llm, system_prompt="Your custom prompt here.")

The default JUDGE_SYSTEM_PROMPT is used when system_prompt is omitted.

Find Best Local Model

uv run python -m pytest_llm_rubric.find_local_model

Runs preflight against all local Ollama models and recommends the smallest one that passes.
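
The selection rule (smallest model that passes preflight) can be sketched as below. The record type, the helper name, and the use of size in bytes as the measure of "smallest" are assumptions for illustration, not the tool's actual internals.

```python
from typing import NamedTuple, Optional


class ModelResult(NamedTuple):
    name: str
    size_bytes: int
    passed: bool  # whether the model passed preflight


def recommend(results: list[ModelResult]) -> Optional[ModelResult]:
    """Pick the smallest model that passed preflight, or None if none did."""
    passing = [r for r in results if r.passed]
    return min(passing, key=lambda r: r.size_bytes, default=None)
```

Given hypothetical results where a 2 GB model fails and 5 GB and 13 GB models pass, `recommend` returns the 5 GB entry.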

Not sure which models to pull? These tools help you find models that fit your hardware:

  • canirun.ai — browser-based hardware detection, shows which models and quantization levels your machine can handle
  • llmfit — CLI tool that scores models by fit, speed, and quality for your specific GPU/RAM

Development

git clone https://github.com/ugai/pytest-llm-rubric.git
cd pytest-llm-rubric
uv sync --extra ollama
uv run pre-commit install           # ruff + ty on every commit
uv run pytest -m "not integration"  # no LLM calls, runs offline
uv run ruff check src/ tests/
uv run ruff format src/ tests/
uv run ty check src/

References

This plugin's design, which decomposes evaluation into multiple binary PASS/FAIL criteria instead of multi-level scoring, aligns with Anthropic's recommended practices.

License

MIT
