A pytest plugin for rubric-based LLM-as-judge testing with auto-discovery and preflight

pytest-llm-rubric

Experimental — this plugin is in early development. APIs may change without notice.

Pytest plugin for LLM-as-a-Judge semantic PASS/FAIL checks.
Just a thin layer between pytest and your LLM stack.

Use Cases

Catch semantic regressions in:

  • Agent skills: instruction docs still contain their required rules after edits
  • Prompts: LLM output quality hasn't degraded after changes
  • Generated docs: auto-generated content includes all required sections
  • Translations: specific meanings are preserved across languages

Not a general essay grader or multi-dimensional scoring system.

Quick Start

Install and configure with a local Ollama model:

pip install pytest-llm-rubric
ollama pull gpt-oss:20b
export PYTEST_LLM_RUBRIC_MODELS="ollama:gpt-oss:20b"

See Model selection for other backends.

# test code
def test_semantic_check(judge_llm):
    text = "The quick brown fox jumps over the lazy dog."
    assert judge_llm.judge(text, "Two animals appear in the text.")

    results = [
        judge_llm.judge(text, "A fox leaps over a dog."),
        judge_llm.judge(text, "The dog is beneath the fox."),
    ]
    assert sum(results) / len(results) >= 0.5
# output
$ pytest test_example.py -v
================================= LLM Rubric ==================================
Model: ollama:gpt-oss:20b  Preflight: preflight passed (12/12) in 231.8s
3 passed, 0 failed
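
The pass-rate assertion in the test above can be factored into a small helper. The following is a minimal sketch, not part of the plugin: `pass_rate`, `SupportsJudge`, and `StubJudge` are hypothetical names, and the stub only mimics a `judge(text, criterion)` method returning a truthy or falsy value (as the quick-start test suggests) so that the example runs without an LLM.

```python
from typing import Iterable, Protocol


class SupportsJudge(Protocol):
    """Anything with a judge(text, criterion) method (assumed shape)."""

    def judge(self, text: str, criterion: str) -> bool: ...


def pass_rate(judge: SupportsJudge, text: str, criteria: Iterable[str]) -> float:
    """Return the fraction of criteria the judge marks as passing."""
    results = [bool(judge.judge(text, criterion)) for criterion in criteria]
    if not results:
        raise ValueError("at least one criterion is required")
    return sum(results) / len(results)


class StubJudge:
    """Offline stand-in: passes a criterion only if it is in `accepted`."""

    def __init__(self, accepted: set[str]) -> None:
        self.accepted = accepted

    def judge(self, text: str, criterion: str) -> bool:
        return criterion in self.accepted


stub = StubJudge({"A fox leaps over a dog."})
rate = pass_rate(
    stub,
    "The quick brown fox jumps over the lazy dog.",
    ["A fox leaps over a dog.", "The dog is beneath the fox."],
)
```

In a real test the stub would be replaced by the judge_llm fixture, for example `assert pass_rate(judge_llm, text, criteria) >= 0.5`.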

How It Works

  1. Discover - resolve the LLM backend from PYTEST_LLM_RUBRIC_MODELS
  2. Preflight - run a sanity check to verify that the backend can reliably judge PASS/FAIL (skippable)
  3. Provide - pass the judge_llm fixture to your tests
    • If the backend is unavailable, tests fail
    • If preflight fails, tests are skipped
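
The three steps above can be sketched as a single function. This is an illustration under stated assumptions, not the plugin's source: `set_up_judge`, `Outcome`, and the two callbacks are hypothetical, and the sketch does not fall through to the next candidate when preflight fails, which may differ from the plugin's actual behavior.

```python
from enum import Enum
from typing import Callable, Optional, Sequence


class Outcome(Enum):
    READY = "ready"  # fixture can be provided to tests
    FAIL = "fail"    # no backend could be reached
    SKIP = "skip"    # backend reached, but preflight failed


def set_up_judge(
    candidates: Sequence[str],
    is_available: Callable[[str], bool],
    preflight: Callable[[str], bool],
    skip_preflight: bool = False,
) -> tuple[Outcome, Optional[str]]:
    # 1. Discover: take the first candidate whose backend responds.
    model = next((m for m in candidates if is_available(m)), None)
    if model is None:
        return Outcome.FAIL, None
    # 2. Preflight: confirm the model can judge PASS/FAIL, unless skipped.
    if not skip_preflight and not preflight(model):
        return Outcome.SKIP, model
    # 3. Provide: the model is ready to back the fixture.
    return Outcome.READY, model


outcome, model = set_up_judge(
    ["ollama:gpt-oss:20b", "anthropic:claude-haiku-4-5"],
    is_available=lambda m: m.startswith("anthropic:"),  # pretend Ollama is down
    preflight=lambda m: True,
)
```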

Example: Policy Document Checks

Verify that each policy document expresses required rules.

import pytest
from pathlib import Path
from pytest_llm_rubric import JudgeLLM

POLICY_DOC = Path("docs/policies/data-security.md")
REQUIRED_RULES = [
    "Personal data must be encrypted at rest",
    "Access logs are retained for at least 90 days",
    "Third-party integrations require security review",
]

@pytest.mark.flaky(reruns=2)  # requires `pytest-rerunfailures`
@pytest.mark.parametrize("rule", REQUIRED_RULES)
def test_data_security_policy(judge_llm: JudgeLLM, rule):
    assert judge_llm.judge(POLICY_DOC.read_text(), rule)

Configuration

Model selection

Set PYTEST_LLM_RUBRIC_MODELS to one or more provider:model values:

Value                                          Description
---------------------------------------------  ---------------------------------------------------
ollama:gpt-oss:20b                             Ollama
anthropic:claude-haiku-4-5                     Requires ANTHROPIC_API_KEY *
openai:gpt-5.4-nano                            Requires OPENAI_API_KEY *
groq:llama-3.3-70b                             Requires GROQ_API_KEY *
ollama:gpt-oss:20b,anthropic:claude-haiku-4-5  Comma-separated: use first available
auto                                           Try the default model list
(unset)                                        Error, unless llm_rubric_models is configured in ini

Cloud providers (*) need their SDK via any-llm-sdk: pip install any-llm-sdk[anthropic] (or [openai], [groq]). Ollama is included by default.
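
Note that Ollama tags such as gpt-oss:20b contain a colon of their own, so a provider:model value has to be split on the first colon only. The sketch below shows one way to parse such a list; `parse_model_list` is a hypothetical helper rather than a plugin function, and it does not handle the special value auto.

```python
def parse_model_list(value: str) -> list[tuple[str, str]]:
    """Split 'provider:model,provider:model' into (provider, model) pairs."""
    pairs = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas and stray whitespace
        # Split on the first colon only, so 'ollama:gpt-oss:20b'
        # yields provider 'ollama' and model 'gpt-oss:20b'.
        provider, sep, model = entry.partition(":")
        if not sep or not provider or not model:
            raise ValueError(f"expected provider:model, got {entry!r}")
        pairs.append((provider, model))
    return pairs


parsed = parse_model_list("ollama:gpt-oss:20b,anthropic:claude-haiku-4-5")
```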

# GitHub Actions workflow
env:
  PYTEST_LLM_RUBRIC_MODELS: anthropic:claude-haiku-4-5
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Fallback list

Model resolution order: env var PYTEST_LLM_RUBRIC_MODELS > ini option llm_rubric_models.

Important: The default list includes cloud providers (Anthropic, OpenAI). If their API keys are set, auto may incur API costs. To avoid this, list only the providers you intend to use.

# pyproject.toml
[tool.pytest.ini_options]
llm_rubric_models = [
    "ollama:qwen3.5:9b",
    "anthropic:claude-haiku-4-5",
]
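
The resolution order described in this section (environment variable first, then the ini option, otherwise an error) can be sketched as follows. `resolve_model_setting` is a hypothetical function, and treating an empty environment variable as unset is an assumption of this sketch, not documented plugin behavior.

```python
from typing import Mapping, Optional, Sequence

ENV_VAR = "PYTEST_LLM_RUBRIC_MODELS"


def resolve_model_setting(
    environ: Mapping[str, str],
    ini_models: Optional[Sequence[str]],
) -> list[str]:
    """Return the configured model entries, preferring the env var."""
    raw = environ.get(ENV_VAR, "").strip()
    if raw:
        # The env var wins and may hold a comma-separated list.
        return [part.strip() for part in raw.split(",") if part.strip()]
    if ini_models:
        return list(ini_models)
    raise RuntimeError(f"set {ENV_VAR} or configure llm_rubric_models")


from_env = resolve_model_setting(
    {ENV_VAR: "ollama:gpt-oss:20b"}, ["anthropic:claude-haiku-4-5"]
)
from_ini = resolve_model_setting(
    {}, ["ollama:qwen3.5:9b", "anthropic:claude-haiku-4-5"]
)
```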

Skipping preflight

Set PYTEST_LLM_RUBRIC_SKIP_PREFLIGHT=1 to bypass the built-in golden tests, or add an ini option:

[tool.pytest.ini_options]
llm_rubric_skip_preflight = true

The env var takes precedence over the ini option.

Markers

Tests that use the judge_llm fixture automatically receive the llm_rubric marker, so you can run or skip them selectively:

pytest -m llm_rubric        # run only LLM-judged tests
pytest -m "not llm_rubric"  # skip LLM-judged tests
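
One way a plugin can attach a marker based on fixture usage is a collection hook. The sketch below is an assumption about the mechanism, not the plugin's source. `pytest_collection_modifyitems`, `item.fixturenames`, and `item.add_marker` are standard pytest APIs; `_FakeItem` is a hypothetical stand-in used only to exercise the hook without running pytest.

```python
FIXTURE_NAME = "judge_llm"
MARKER_NAME = "llm_rubric"


def pytest_collection_modifyitems(config, items):
    """Add the marker to every collected test that requests the fixture."""
    for item in items:
        # `fixturenames` lists all fixtures a test requests,
        # directly or through other fixtures.
        if FIXTURE_NAME in getattr(item, "fixturenames", ()):
            item.add_marker(MARKER_NAME)


class _FakeItem:
    """Minimal stand-in for a pytest item, for demonstration only."""

    def __init__(self, fixturenames):
        self.fixturenames = fixturenames
        self.markers = []

    def add_marker(self, name):
        self.markers.append(name)


uses_judge = _FakeItem(["judge_llm"])
plain = _FakeItem(["tmp_path"])
pytest_collection_modifyitems(config=None, items=[uses_judge, plain])
```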

Find best local model

A tiny CLI utility that runs preflight against all local Ollama models and recommends the smallest one that passes.

$ uv run python -m pytest_llm_rubric.find_local_model --base-url http://localhost:11434 gemma4:e2b gemma4:e4b gemma4:26b
Found 3 model(s) in Ollama. Running preflight...

  gemma4:e2b                     ( 6.7GB) ... FAIL (0/12 stopped at 1/12)
  gemma4:e4b                     ( 8.9GB) ... FAIL (0/12 stopped at 1/12)
  gemma4:26b                     (16.8GB) ... PASS (12/12)
Recommended: gemma4:26b (smallest passing model)
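
The recommendation rule, "smallest model that passes", can be sketched in a few lines. `PreflightResult` and `recommend` are hypothetical names; the data mirrors the sample output above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreflightResult:
    model: str
    size_gb: float
    passed: bool


def recommend(results: list[PreflightResult]) -> Optional[str]:
    """Return the smallest model that passed preflight, or None."""
    passing = [r for r in results if r.passed]
    if not passing:
        return None
    return min(passing, key=lambda r: r.size_gb).model


results = [
    PreflightResult("gemma4:e2b", 6.7, False),
    PreflightResult("gemma4:e4b", 8.9, False),
    PreflightResult("gemma4:26b", 16.8, True),
]
best = recommend(results)
```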

These tools can also help you find models that fit your hardware:

  • canirun.ai - browser-based, shows which models fit your hardware
  • llmfit - CLI tool that scores models by fit, speed, and quality

Advanced Usage

complete()

complete() gives you full control over the LLM interaction: you provide the messages and get back the raw response. Use it when judge() is too opinionated.

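from pathlib import Path

from pytest_llm_rubric import parse_verdict

POLICY_DOC = Path("docs/policies/data-security.md")  # same document as in the policy example

def test_custom_prompt(judge_llm):
    document = POLICY_DOC.read_text()
    rule = "Personal data must be encrypted at rest"
    # Supply the full conversation yourself; complete() returns the raw response.
    response = judge_llm.complete([
        {"role": "system", "content": "You are a compliance auditor. Reply PASS or FAIL."},
        {"role": "user", "content": f"DOCUMENT:\n{document}\n\nRULE:\n{rule}"},
    ])
    passed = parse_verdict(response) == "PASS"
    # Record the result under a readable criterion name.
    judge_llm.record(criterion="encryption at rest", passed=passed)
    assert passed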
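
With a custom prompt, a model may wrap its verdict in extra text or mention both words. The exact behavior of parse_verdict is not described here, so the following is a separate, hypothetical helper (`extract_verdict`) showing one strict way to read a raw response: accept it only when exactly one of PASS or FAIL appears as a whole word.

```python
import re
from typing import Optional

_VERDICT = re.compile(r"\b(PASS|FAIL)\b")


def extract_verdict(response: str) -> Optional[str]:
    """Return 'PASS' or 'FAIL' if exactly one of them appears, else None."""
    found = set(_VERDICT.findall(response.upper()))
    # Ambiguous ("PASS or FAIL?") and empty responses yield None.
    return found.pop() if len(found) == 1 else None
```

Treating `None` as a failure keeps an ambiguous response from silently passing a test.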

Custom backend

Override the judge_llm fixture for a custom LLM client or internal gateway.

import pytest
import requests
from pytest_llm_rubric import AnyLLMJudge, register_judge

class MyBackend(AnyLLMJudge):
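    def complete(self, messages, max_output_tokens=256, response_format=None):
        # Forward the messages to the internal gateway; fail fast on timeouts and HTTP errors.
        resp = requests.post("https://internal-llm.corp/v1/chat", json={"messages": messages}, timeout=60)
        resp.raise_for_status()
        return resp.json()["content"]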

# Override the fixture directly
@pytest.fixture(scope="session")
def judge_llm(request):
    judge = MyBackend("my-model", "internal")
    register_judge(request.config, judge, model="internal:my-model")
    return judge

Extend AnyLLMJudge and override complete(). Call register_judge() in your fixture so the terminal summary picks up the results.

AI coding assistant CLIs as backends

AI coding assistant CLIs like Claude Code or GitHub Copilot can also be used as backends without an API key:

import subprocess
from pytest_llm_rubric import AnyLLMJudge

class ClaudeCLIBackend(AnyLLMJudge):
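    def complete(self, messages, max_output_tokens=256, response_format=None):
        # Only the last message is sent; earlier (e.g. system) messages are dropped.
        prompt = messages[-1]["content"]
        result = subprocess.run(
            ["claude", "-p", prompt],  # or ["copilot", "-p", prompt]
            capture_output=True, encoding="utf-8", timeout=300,
            check=True,  # raise if the CLI exits with a non-zero status
        )
        return result.stdout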
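
The backend above forwards only messages[-1], so any system message is lost. One option is to flatten the whole conversation into a single prompt before handing it to the CLI. `flatten_messages` is a hypothetical helper, and whether a given CLI pays attention to the role labels is an assumption of this sketch.

```python
def flatten_messages(messages: list[dict[str, str]]) -> str:
    """Join chat messages into one prompt, labelling each by role."""
    parts = []
    for message in messages:
        role = message.get("role", "user").upper()
        parts.append(f"[{role}]\n{message.get('content', '')}")
    return "\n\n".join(parts)


prompt = flatten_messages([
    {"role": "system", "content": "Reply PASS or FAIL."},
    {"role": "user", "content": "Is encryption at rest required?"},
])
```

Inside `complete()`, `prompt = flatten_messages(messages)` would then replace `prompt = messages[-1]["content"]`.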

Parallel execution (pytest-xdist)

Works with pytest-xdist. Preflight runs once across workers. This is not extensively tested yet, so please report any issues.

pip install pytest-xdist
pytest -n auto -m llm_rubric

Flaky tests

LLM-based tests are inherently non-deterministic. Preflight screens out unreliable models, but borderline cases may still flake. Use pytest-rerunfailures to retry:

pip install pytest-rerunfailures
pytest --reruns 2 -m llm_rubric  # rerun failed LLM tests up to 2 times

Deterministic settings (temperature=0) would undermine the fuzzy semantic matching that makes this approach valuable. See the pytest documentation on flaky tests for more strategies.
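
An alternative to rerunning a whole test is to ask the judge several times within one test and take the majority. The sketch below is a generic helper, not a plugin feature; `majority_vote` is a hypothetical name, and each extra run costs an additional LLM call.

```python
from typing import Callable


def majority_vote(check: Callable[[], bool], runs: int = 3) -> bool:
    """Run `check` up to `runs` times (odd) and return the majority result."""
    if runs < 1 or runs % 2 == 0:
        raise ValueError("runs must be a positive odd number")
    passes = 0
    for completed in range(1, runs + 1):
        passes += bool(check())
        failures = completed - passes
        # Stop early once either side has an unbeatable lead.
        if passes > runs // 2 or failures > runs // 2:
            break
    return passes > runs // 2


outcomes = iter([True, False, True])  # simulated judge answers
verdict = majority_vote(lambda: next(outcomes))
```

In a test this could read `assert majority_vote(lambda: judge_llm.judge(text, rule))`.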

Development

git clone https://github.com/ugai/pytest-llm-rubric.git
cd pytest-llm-rubric
uv sync
uv run pre-commit install           # ruff + ty on every commit
uv run pytest -m "not integration"  # no LLM calls, runs offline
uv run ruff check src/ tests/
uv run ruff format src/ tests/
uv run ty check src/

References

This plugin's design (binary PASS/FAIL criteria rather than multi-level scoring) aligns with Anthropic's recommended practices.

License

MIT
