A pytest plugin for rubric-based LLM-as-judge testing with auto-discovery and preflight

Maintainers
- ugai
Development Status
- 3 - Alpha
Framework
- Pytest

Project description

pytest-llm-rubric

Experimental — this plugin is in early development. APIs may change without notice.

Pytest plugin for LLM-as-a-Judge semantic PASS/FAIL checks.
Just a thin layer between pytest and your LLM stack.

Use Cases

Catch semantic regressions in:

Agent skills: instruction docs still contain rules after edits
Prompts: LLM output quality hasn't degraded after changes
Generated docs: auto-generated content includes all required sections
Translations: specific meanings are preserved across languages

Not a general essay grader or multi-dimensional scoring system.

Quick Start

Install and configure with a local Ollama model:

pip install pytest-llm-rubric
ollama pull gpt-oss:20b
export PYTEST_LLM_RUBRIC_MODELS="ollama:gpt-oss:20b"

See Model selection for other backends.

# test code
def test_semantic_check(judge_llm):
    text = "The quick brown fox jumps over the lazy dog."
    assert judge_llm.judge(text, "Two animals appear in the text.")

    results = [
        judge_llm.judge(text, "A fox leaps over a dog."),
        judge_llm.judge(text, "The dog is beneath the fox."),
    ]
    assert sum(results) / len(results) >= 0.5

# output
$ pytest test_example.py -v
================================= LLM Rubric ==================================
Model: ollama:gpt-oss:20b  Preflight: preflight passed (12/12) in 231.8s
3 passed, 0 failed
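
The pass-rate assertion in the test above can be factored into a small helper. This is only a sketch and not part of the plugin: `pass_rate` and `StubJudge` are hypothetical names, and the stub stands in for the `judge_llm` fixture so the snippet runs without an LLM.

```python
from typing import Iterable, Protocol


class Judge(Protocol):
    """Assumed shape of the judge: judge(text, criterion) returns a truthy verdict."""

    def judge(self, text: str, criterion: str) -> bool: ...


def pass_rate(judge: Judge, text: str, criteria: Iterable[str]) -> float:
    """Return the fraction of criteria that the judge marks as PASS."""
    verdicts = [bool(judge.judge(text, criterion)) for criterion in criteria]
    if not verdicts:
        raise ValueError("at least one criterion is required")
    return sum(verdicts) / len(verdicts)


class StubJudge:
    """Hypothetical offline stand-in that replays canned verdicts."""

    def __init__(self, verdicts: dict[str, bool]) -> None:
        self._verdicts = verdicts

    def judge(self, text: str, criterion: str) -> bool:
        return self._verdicts[criterion]
```

In a real test you would pass the `judge_llm` fixture instead of the stub, for example `assert pass_rate(judge_llm, text, criteria) >= 0.5`.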

How It Works

Discover - resolve the LLM backend from PYTEST_LLM_RUBRIC_MODELS (or the llm_rubric_models ini option)
Preflight - run a sanity check to verify the backend can reliably judge PASS/FAIL (skippable)
Provide - pass the judge_llm fixture to your tests
- If the backend is unavailable, tests fail
- If preflight fails, tests are skipped
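
The decision flow described above can be summarized in a few lines. This is a sketch of the documented behavior, not the plugin's actual implementation; `Outcome` and `decide_outcome` are hypothetical names.

```python
from enum import Enum


class Outcome(Enum):
    RUN = "run"    # judge_llm is provided and the test executes
    SKIP = "skip"  # preflight failed, so the test is skipped
    FAIL = "fail"  # no usable backend, so the test fails


def decide_outcome(
    backend_available: bool,
    preflight_passed: bool,
    skip_preflight: bool = False,
) -> Outcome:
    """Map the discover/preflight results to what happens to a test."""
    if not backend_available:
        return Outcome.FAIL
    if skip_preflight or preflight_passed:
        return Outcome.RUN
    return Outcome.SKIP
```

The asymmetry matters in CI: a missing backend fails the run, while a model that cannot pass preflight only produces skips.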

Example: Policy Document Checks

Verify that each policy document expresses required rules.

import pytest
from pathlib import Path
from pytest_llm_rubric import JudgeLLM

POLICY_DOC = Path("docs/policies/data-security.md")
REQUIRED_RULES = [
    "Personal data must be encrypted at rest",
    "Access logs are retained for at least 90 days",
    "Third-party integrations require security review",
]

@pytest.mark.flaky(reruns=2)  # requires `pytest-rerunfailures`
@pytest.mark.parametrize("rule", REQUIRED_RULES)
def test_data_security_policy(judge_llm: JudgeLLM, rule):
    assert judge_llm.judge(POLICY_DOC.read_text(), rule)

Configuration

Model selection

Set PYTEST_LLM_RUBRIC_MODELS to one or more provider:model values:

Value	Description
`ollama:gpt-oss:20b`	Ollama
`anthropic:claude-haiku-4-5`	Requires `ANTHROPIC_API_KEY` *
`openai:gpt-5.4-nano`	Requires `OPENAI_API_KEY` *
`groq:llama-3.3-70b`	Requires `GROQ_API_KEY` *
`ollama:gpt-oss:20b,anthropic:claude-haiku-4-5`	Comma-separated: use first available
`auto`	Try the default model list
(unset)	Error, unless `llm_rubric_models` is configured in ini

Cloud providers (*) need their SDK via any-llm-sdk: pip install any-llm-sdk[anthropic] (or [openai], [groq]). Ollama is included by default.

# GitHub Actions workflow
env:
  PYTEST_LLM_RUBRIC_MODELS: anthropic:claude-haiku-4-5
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
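
Note that a model name may itself contain a colon (`ollama:gpt-oss:20b`), so a provider:model value has to be split on the first colon only. The sketch below illustrates one way to parse the comma-separated form and pick the first available entry. It is an assumption about reasonable behavior, not the plugin's parser; `parse_model_specs` and `first_available` are hypothetical names, and the special value `auto` is deliberately not handled.

```python
from typing import Callable, Optional

ModelSpec = tuple[str, str]  # (provider, model)


def parse_model_specs(value: str) -> list[ModelSpec]:
    """Split 'provider:model,provider:model' into (provider, model) pairs."""
    specs: list[ModelSpec] = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the FIRST colon only: the model part may contain colons.
        provider, sep, model = item.partition(":")
        if not sep or not provider or not model:
            raise ValueError(f"expected provider:model, got {item!r}")
        specs.append((provider, model))
    return specs


def first_available(
    specs: list[ModelSpec],
    is_available: Callable[[ModelSpec], bool],
) -> Optional[ModelSpec]:
    """Return the first spec that the availability check accepts, else None."""
    for spec in specs:
        if is_available(spec):
            return spec
    return None
```

Here `is_available` is a placeholder for whatever check applies to your setup, such as a reachable Ollama server or a configured API key.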

Fallback list

Model resolution order: env var PYTEST_LLM_RUBRIC_MODELS > ini option llm_rubric_models.

Important: The default list includes cloud providers (Anthropic, OpenAI). If their API keys are set, auto may incur API costs. To avoid this, list only providers you intend to use.

# pyproject.toml
[tool.pytest.ini_options]
llm_rubric_models = [
    "ollama:qwen3.5:9b",
    "anthropic:claude-haiku-4-5",
]
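
The resolution order can be pictured as follows. This is a minimal sketch rather than the plugin's code: `resolve_model_setting` is a hypothetical name, and treating an empty environment variable as unset is an assumption.

```python
from typing import Mapping, Sequence

ENV_VAR = "PYTEST_LLM_RUBRIC_MODELS"


def resolve_model_setting(
    environ: Mapping[str, str],
    ini_models: Sequence[str],
) -> list[str]:
    """Return the model list, preferring the env var over the ini option."""
    raw = environ.get(ENV_VAR, "").strip()
    if raw:
        # Assumption: an empty or whitespace-only value counts as unset.
        return [item.strip() for item in raw.split(",") if item.strip()]
    if ini_models:
        return list(ini_models)
    raise LookupError(
        f"set {ENV_VAR} or configure llm_rubric_models in the ini options"
    )
```

With the pyproject.toml above and no environment variable, this returns the two configured entries; exporting the variable replaces that list rather than extending it.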

Skipping preflight

Set PYTEST_LLM_RUBRIC_SKIP_PREFLIGHT=1 to bypass the built-in golden tests, or add an ini option:

[tool.pytest.ini_options]
llm_rubric_skip_preflight = true

The env var takes precedence over the ini option.

Markers

Tests that use the judge_llm fixture automatically receive the llm_rubric marker, so you can run or skip them selectively:

pytest -m llm_rubric        # run only LLM-judged tests
pytest -m "not llm_rubric"  # skip LLM-judged tests

Find best local model

A tiny CLI utility that runs preflight against all local Ollama models and recommends the smallest one that passes.

$ uv run python -m pytest_llm_rubric.find_local_model --base-url http://localhost:11434 gemma4:e2b gemma4:e4b gemma4:26b
Found 3 model(s) in Ollama. Running preflight...

  gemma4:e2b                     ( 6.7GB) ... FAIL (0/12 stopped at 1/12)
  gemma4:e4b                     ( 8.9GB) ... FAIL (0/12 stopped at 1/12)
  gemma4:26b                     (16.8GB) ... PASS (12/12)
Recommended: gemma4:26b (smallest passing model)
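
The recommendation rule ("smallest passing model") is easy to express in code. The sketch below is illustrative only; `PreflightResult` and `recommend_smallest` are hypothetical names and do not reflect the utility's internals.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PreflightResult:
    model: str
    size_gb: float
    passed: bool


def recommend_smallest(results: Iterable[PreflightResult]) -> Optional[str]:
    """Return the name of the smallest model that passed, or None if none did."""
    passing = [result for result in results if result.passed]
    if not passing:
        return None
    return min(passing, key=lambda result: result.size_gb).model


# Values taken from the sample output above.
results = [
    PreflightResult("gemma4:e2b", 6.7, False),
    PreflightResult("gemma4:e4b", 8.9, False),
    PreflightResult("gemma4:26b", 16.8, True),
]
recommended = recommend_smallest(results)
```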

These tools can also help you find models that fit your hardware:

canirun.ai - browser-based, shows which models fit your hardware
llmfit - CLI tool that scores models by fit, speed, and quality

Advanced Usage

`complete()`

complete() gives you full control over the LLM interaction: you provide the messages and get back the raw response. Use it when judge() is too opinionated.

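from pathlib import Path

from pytest_llm_rubric import parse_verdict

POLICY_DOC = Path("docs/policies/data-security.md")  # same document as the policy example above

def test_custom_prompt(judge_llm):
    document = POLICY_DOC.read_text()
    rule = "Personal data must be encrypted at rest"
    response = judge_llm.complete([
        {"role": "system", "content": "You are a compliance auditor. Reply PASS or FAIL."},
        {"role": "user", "content": f"DOCUMENT:\n{document}\n\nRULE:\n{rule}"},
    ])
    # Turn the raw response into a verdict and record it for the terminal summary
    passed = parse_verdict(response) == "PASS"
    judge_llm.record(criterion="encryption at rest", passed=passed)
    assert passed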
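
If your custom prompt asks for a different reply format, you may also want your own parsing rule for the raw response. The sketch below is a hypothetical alternative, not the implementation of `parse_verdict`, whose exact behavior is not described here. `extract_verdict` is an invented name, and taking the last PASS/FAIL token is an assumption that suits models that reason before answering.

```python
import re
from typing import Optional

# Match PASS or FAIL as whole words, ignoring case.
_VERDICT = re.compile(r"\b(PASS|FAIL)\b", re.IGNORECASE)


def extract_verdict(response: str) -> Optional[str]:
    """Return 'PASS' or 'FAIL' from the last verdict token, or None if absent."""
    matches = _VERDICT.findall(response)
    if not matches:
        return None
    return matches[-1].upper()
```

Returning None instead of guessing lets the test fail loudly when the model ignores the requested format.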

Custom backend

Override the judge_llm fixture for a custom LLM client or internal gateway.

import pytest
import requests
from pytest_llm_rubric import AnyLLMJudge, register_judge

class MyBackend(AnyLLMJudge):
    def complete(self, messages, max_output_tokens=256, response_format=None):
        # max_output_tokens and response_format are not forwarded in this example
        resp = requests.post(
            "https://internal-llm.corp/v1/chat",
            json={"messages": messages},
            timeout=60,  # avoid hanging the test session on a stalled gateway
        )
        resp.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
        return resp.json()["content"]

# Override the fixture directly
@pytest.fixture(scope="session")
def judge_llm(request):
    judge = MyBackend("my-model", "internal")
    register_judge(request.config, judge, model="internal:my-model")
    return judge

Extend AnyLLMJudge and override complete(). Call register_judge() in your fixture so the terminal summary picks up the results.

AI coding assistant CLIs as backends

AI coding assistant CLIs like Claude Code or GitHub Copilot can also be used as backends without an API key:

import subprocess
from pytest_llm_rubric import AnyLLMJudge

class ClaudeCLIBackend(AnyLLMJudge):
    def complete(self, messages, max_output_tokens=256, response_format=None):
        prompt = messages[-1]["content"]
        result = subprocess.run(
            ["claude", "-p", prompt],  # or ["copilot", "-p", prompt]
            capture_output=True, timeout=300,
        )
        return result.stdout.decode("utf-8")
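
The backend above forwards only `messages[-1]["content"]`, so a system message is silently dropped. One option is to flatten the whole conversation into a single prompt first. This is a sketch under assumptions: `flatten_messages` is a hypothetical helper, and whether a given CLI pays attention to the role labels is not guaranteed.

```python
def flatten_messages(messages: list[dict[str, str]]) -> str:
    """Join chat messages into one prompt, labelling each part with its role."""
    parts = []
    for message in messages:
        role = message.get("role", "user").upper()
        content = message.get("content", "").strip()
        if content:  # skip empty messages
            parts.append(f"[{role}]\n{content}")
    return "\n\n".join(parts)
```

In the class above you would then use `prompt = flatten_messages(messages)` in place of `messages[-1]["content"]`.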

Parallel execution (pytest-xdist)

Works with pytest-xdist. Preflight runs once across workers. This has not been extensively tested yet; please report any issues.

pip install pytest-xdist
pytest -n auto -m llm_rubric

Flaky tests

LLM-based tests are inherently non-deterministic. Preflight screens out unreliable models, but borderline cases may still flake. Use pytest-rerunfailures to retry:

pip install pytest-rerunfailures
pytest --reruns 2 -m llm_rubric  # rerun failed LLM tests up to 2 times

Deterministic settings (temperature=0) would undermine the fuzzy semantic matching that makes this approach valuable. See the pytest documentation on flaky tests for more strategies.
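
To get a feel for what reruns buy you, here is a small worked calculation. It assumes each attempt is independent and has the same chance of passing, which is a simplification; `pass_probability` is a hypothetical helper, not part of the plugin or pytest-rerunfailures.

```python
def pass_probability(p_single: float, reruns: int) -> float:
    """Chance that at least one of (1 + reruns) independent attempts passes."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if reruns < 0:
        raise ValueError("reruns must be non-negative")
    attempts = reruns + 1
    return 1.0 - (1.0 - p_single) ** attempts


# A correct criterion that passes 90% of the time, with --reruns 2:
borderline = pass_probability(0.9, 2)   # 1 - 0.1**3, about 0.999
# A criterion that should fail but wrongly passes 10% of the time:
false_pass = pass_probability(0.1, 2)   # 1 - 0.9**3, about 0.271
```

Reruns cut both ways: they hide occasional false failures, but they also raise the chance that a genuinely failing criterion slips through, so keep the rerun count small.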

Development

git clone https://github.com/ugai/pytest-llm-rubric.git
cd pytest-llm-rubric
uv sync
uv run pre-commit install           # ruff + ty on every commit
uv run pytest -m "not integration"  # no LLM calls, runs offline
uv run ruff check src/ tests/
uv run ruff format src/ tests/
uv run ty check src/

References

This plugin's design — binary PASS/FAIL criteria, not multi-level scoring — aligns with Anthropic's recommended practices:

Define success criteria and build evaluations — binary classification with clear rubrics over qualitative scales
Skill authoring best practices — expected_behavior as individually verifiable statements, not a single aggregate score

License

MIT

Release history

- 0.4.0 (this version): Apr 7, 2026
- 0.3.0: Mar 28, 2026
- 0.2.0: Mar 26, 2026
- 0.1.0: Mar 20, 2026

Download files

Source Distribution

pytest_llm_rubric-0.4.0.tar.gz (17.3 kB)

Uploaded Apr 7, 2026 Source

Built Distribution

pytest_llm_rubric-0.4.0-py3-none-any.whl (20.8 kB)

Uploaded Apr 7, 2026 Python 3

File details

Details for the file pytest_llm_rubric-0.4.0.tar.gz.

File metadata

Download URL: pytest_llm_rubric-0.4.0.tar.gz
Upload date: Apr 7, 2026
Size: 17.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pytest_llm_rubric-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`ef0a7413bd329f837f7d306275b16084c4bff4f7eb087d7a8805c8d09cdec6f0`
MD5	`13024075a4ebf652d11854c05bc7928b`
BLAKE2b-256	`9a3cb8f71223b42981a893c741a9b9c238493a2612a37fdd730a26c7777062b7`

Provenance

The following attestation bundles were made for pytest_llm_rubric-0.4.0.tar.gz:

Publisher: release.yml on ugai/pytest-llm-rubric

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_llm_rubric-0.4.0.tar.gz
- Subject digest: ef0a7413bd329f837f7d306275b16084c4bff4f7eb087d7a8805c8d09cdec6f0
- Sigstore transparency entry: 1245642341
- Sigstore integration time: Apr 7, 2026
Source repository:
- Permalink: ugai/pytest-llm-rubric@e58c725a5ecba3ba3304c12d850898eebfbf326c
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/ugai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e58c725a5ecba3ba3304c12d850898eebfbf326c
- Trigger Event: push

File details

Details for the file pytest_llm_rubric-0.4.0-py3-none-any.whl.

File metadata

Download URL: pytest_llm_rubric-0.4.0-py3-none-any.whl
Upload date: Apr 7, 2026
Size: 20.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pytest_llm_rubric-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bf8a111c4d3f9ab86f61a0f5bdf7bf8e59a119ab0440a84c39aed557e3f7e3a4`
MD5	`6375559fb17970fb195e08267073837e`
BLAKE2b-256	`af290c334be3edd4f42c6761f3c730fe3c507fde4995091346d92e8b440ed64e`

Provenance

The following attestation bundles were made for pytest_llm_rubric-0.4.0-py3-none-any.whl:

Publisher: release.yml on ugai/pytest-llm-rubric

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pytest_llm_rubric-0.4.0-py3-none-any.whl
- Subject digest: bf8a111c4d3f9ab86f61a0f5bdf7bf8e59a119ab0440a84c39aed557e3f7e3a4
- Sigstore transparency entry: 1245642342
- Sigstore integration time: Apr 7, 2026
Source repository:
- Permalink: ugai/pytest-llm-rubric@e58c725a5ecba3ba3304c12d850898eebfbf326c
- Branch / Tag: refs/tags/v0.4.0
- Owner: https://github.com/ugai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e58c725a5ecba3ba3304c12d850898eebfbf326c
- Trigger Event: push
