A pytest plugin for rubric-based LLM-as-judge testing with auto-discovery and preflight
pytest-llm-rubric
Experimental — this plugin is in early development. APIs may change without notice.
Pytest plugin for LLM-as-a-Judge semantic PASS/FAIL checks.
Just a thin layer between pytest and your LLM stack.
Use Cases
Catch semantic regressions in:
- Agent skills: instruction docs still contain rules after edits
- Prompts: LLM output quality hasn't degraded after changes
- Generated docs: auto-generated content includes all required sections
- Translations: specific meanings are preserved across languages
Not a general essay grader or multi-dimensional scoring system.
Quick Start
Install and configure with a local Ollama model:
pip install pytest-llm-rubric
ollama pull gpt-oss:20b
export PYTEST_LLM_RUBRIC_MODELS="ollama:gpt-oss:20b"
See Model selection for other backends.
# test code
def test_semantic_check(judge_llm):
    text = "The quick brown fox jumps over the lazy dog."
    assert judge_llm.judge(text, "Two animals appear in the text.")

    results = [
        judge_llm.judge(text, "A fox leaps over a dog."),
        judge_llm.judge(text, "The dog is beneath the fox."),
    ]
    assert sum(results) / len(results) >= 0.5
# output
$ pytest test_example.py -v
================================= LLM Rubric ==================================
Model: ollama:gpt-oss:20b Preflight: preflight passed (12/12) in 231.8s
3 passed, 0 failed
How It Works
- Discover - resolve the LLM backend from PYTEST_LLM_RUBRIC_MODELS
- Preflight - run a sanity-check to verify the backend can reliably judge PASS/FAIL (skippable)
- Provide - pass the judge_llm fixture to your tests
  - If the backend is unavailable, tests fail
  - If preflight fails, tests are skipped
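The three stages above can be read as a small decision rule. The sketch below is only an illustration of that rule, not the plugin's actual code; `Outcome` and `decide_outcome` are hypothetical names introduced here.

```python
from enum import Enum


class Outcome(Enum):
    RUN = "run"
    SKIP = "skip"
    FAIL = "fail"


def decide_outcome(
    backend_available: bool,
    preflight_passed: bool,
    skip_preflight: bool = False,
) -> Outcome:
    """Map the discover/preflight results to what happens to a judged test."""
    if not backend_available:
        return Outcome.FAIL  # discovery found no usable backend: tests fail
    if skip_preflight:
        return Outcome.RUN  # preflight was bypassed by configuration
    if not preflight_passed:
        return Outcome.SKIP  # the judge is not trusted: tests are skipped
    return Outcome.RUN
```

The asymmetry is the point of the design as described: a missing backend is treated as a failure, while an unreliable judge only causes a skip.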
Example: Policy Document Checks
Verify that each policy document expresses required rules.
import pytest
from pathlib import Path
from pytest_llm_rubric import JudgeLLM
POLICY_DOC = Path("docs/policies/data-security.md")
REQUIRED_RULES = [
    "Personal data must be encrypted at rest",
    "Access logs are retained for at least 90 days",
    "Third-party integrations require security review",
]

@pytest.mark.flaky(reruns=2)  # requires `pytest-rerunfailures`
@pytest.mark.parametrize("rule", REQUIRED_RULES)
def test_data_security_policy(judge_llm: JudgeLLM, rule):
    assert judge_llm.judge(POLICY_DOC.read_text(), rule)
Configuration
Model selection
Set PYTEST_LLM_RUBRIC_MODELS to one or more provider:model values:
| Value | Description |
|---|---|
| ollama:gpt-oss:20b | Ollama |
| anthropic:claude-haiku-4-5 | Requires ANTHROPIC_API_KEY * |
| openai:gpt-5.4-nano | Requires OPENAI_API_KEY * |
| groq:llama-3.3-70b | Requires GROQ_API_KEY * |
| ollama:gpt-oss:20b,anthropic:claude-haiku-4-5 | Comma-separated: use first available |
| auto | Try the default model list |
| (unset) | Error, unless llm_rubric_models is configured in ini |
Cloud providers (*) need their SDK via any-llm-sdk: pip install any-llm-sdk[anthropic] (or [openai], [groq]). Ollama is included by default.
# GitHub Actions workflow
env:
  PYTEST_LLM_RUBRIC_MODELS: anthropic:claude-haiku-4-5
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
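Note that Ollama model names contain a colon themselves (gpt-oss:20b), so a provider:model value has to be split on the first colon only. The sketch below shows one way to parse such values; `parse_model_spec` and `parse_model_list` are hypothetical helpers, not part of the plugin, and the special value auto is deliberately not handled.

```python
def parse_model_spec(spec: str) -> tuple[str, str]:
    """Split 'provider:model' on the first colon only."""
    provider, sep, model = spec.strip().partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return provider, model


def parse_model_list(value: str) -> list[tuple[str, str]]:
    """Parse a comma-separated list, preserving the order of preference."""
    return [parse_model_spec(part) for part in value.split(",") if part.strip()]


print(parse_model_list("ollama:gpt-oss:20b,anthropic:claude-haiku-4-5"))
```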
Fallback list
Model resolution order: env var PYTEST_LLM_RUBRIC_MODELS > ini option llm_rubric_models.
[!IMPORTANT] The default list includes cloud providers (Anthropic, OpenAI). If their API keys are set, auto may incur API costs. To avoid this, list only providers you intend to use.
# pyproject.toml
[tool.pytest.ini_options]
llm_rubric_models = [
    "ollama:qwen3.5:9b",
    "anthropic:claude-haiku-4-5",
]
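The resolution order stated above (environment variable first, then the ini option, otherwise an error) can be sketched as follows. `resolve_models` is a hypothetical function written for illustration; how the plugin expands auto or reads the ini option internally is not modeled here.

```python
import os
from typing import Optional, Sequence


def resolve_models(env_value: Optional[str], ini_value: Sequence[str]) -> list[str]:
    """Return the candidate model list, preferring the environment variable."""
    if env_value and env_value.strip():
        return [part.strip() for part in env_value.split(",") if part.strip()]
    if ini_value:
        return list(ini_value)
    raise RuntimeError(
        "no models configured: set PYTEST_LLM_RUBRIC_MODELS or llm_rubric_models"
    )


# The ini list is used only when the environment variable is unset or empty.
ini_models = ["ollama:qwen3.5:9b", "anthropic:claude-haiku-4-5"]
print(resolve_models(os.environ.get("PYTEST_LLM_RUBRIC_MODELS"), ini_models))
```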
Skipping preflight
Set PYTEST_LLM_RUBRIC_SKIP_PREFLIGHT=1 to bypass the built-in golden tests, or add an ini option:
[tool.pytest.ini_options]
llm_rubric_skip_preflight = true
The env var takes precedence over the ini option.
Markers
Tests that use the judge_llm fixture automatically receive the llm_rubric marker, so you can run or skip them selectively:
pytest -m llm_rubric # run only LLM-judged tests
pytest -m "not llm_rubric" # skip LLM-judged tests
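For readers curious how this kind of automatic marking can work, the sketch below uses pytest's standard `pytest_collection_modifyitems` hook in a conftest.py. It is an assumption about the general technique, not a copy of the plugin's implementation, and is only needed if you want the same behavior for a fixture of your own.

```python
# conftest.py (sketch): tag every collected test that requests a given fixture.
FIXTURE_NAME = "judge_llm"   # fixture to look for
MARKER_NAME = "llm_rubric"   # marker to attach


def pytest_collection_modifyitems(config, items):
    """Add the marker to each test whose fixture closure includes FIXTURE_NAME."""
    for item in items:
        # Non-function items (e.g. doctests) may not expose fixturenames.
        if FIXTURE_NAME in getattr(item, "fixturenames", ()):
            item.add_marker(MARKER_NAME)
```

A custom marker should also be registered (for example under `markers` in the pytest configuration) so that pytest does not warn about an unknown mark.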
Find best local model
A tiny CLI utility that runs preflight against all local Ollama models and recommends the smallest one that passes.
$ uv run python -m pytest_llm_rubric.find_local_model --base-url http://localhost:11434 gemma4:e2b gemma4:e4b gemma4:26b
Found 3 model(s) in Ollama. Running preflight...
gemma4:e2b ( 6.7GB) ... FAIL (0/12 stopped at 1/12)
gemma4:e4b ( 8.9GB) ... FAIL (0/12 stopped at 1/12)
gemma4:26b (16.8GB) ... PASS (12/12)
Recommended: gemma4:26b (smallest passing model)
These tools can also help you find models that fit your hardware:
- canirun.ai - browser-based, shows which models fit your hardware
- llmfit - CLI tool that scores models by fit, speed, and quality
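The recommendation rule shown in the sample output (smallest passing model) is simple enough to sketch. The code below is an illustration using the sizes and results from that output; `PreflightResult` and `recommend` are hypothetical names, not the utility's API.

```python
from typing import NamedTuple, Optional, Sequence


class PreflightResult(NamedTuple):
    model: str
    size_gb: float
    passed: bool


def recommend(results: Sequence[PreflightResult]) -> Optional[PreflightResult]:
    """Return the smallest model that passed preflight, or None if none passed."""
    passing = [r for r in results if r.passed]
    return min(passing, key=lambda r: r.size_gb) if passing else None


results = [
    PreflightResult("gemma4:e2b", 6.7, False),
    PreflightResult("gemma4:e4b", 8.9, False),
    PreflightResult("gemma4:26b", 16.8, True),
]
best = recommend(results)
print(best.model if best else "no passing model")  # prints "gemma4:26b"
```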
Advanced Usage
complete()
complete() gives you full control over the LLM interaction: you provide the messages and get back the raw response. Use it when judge() is too opinionated.
from pytest_llm_rubric import parse_verdict

# POLICY_DOC is the Path defined in the policy document example above.
def test_custom_prompt(judge_llm):
    response = judge_llm.complete([
        {"role": "system", "content": "You are a compliance auditor. Reply PASS or FAIL."},
        {"role": "user", "content": f"DOCUMENT:\n{POLICY_DOC.read_text()}\n\nRULE:\nPersonal data must be encrypted at rest"},
    ])
    passed = parse_verdict(response) == "PASS"
    judge_llm.record(criterion="encryption at rest", passed=passed)
    assert passed
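If you write your own prompt, you may also want your own verdict parsing. The sketch below is a standalone, hypothetical parser (`simple_parse_verdict`), not the plugin's `parse_verdict`, whose exact behavior is not described here. It treats a response as decisive only when it contains one of the two verdict words and not the other.

```python
import re
from typing import Optional

# Whole-word match, so "passes" or "failure" are not mistaken for a verdict.
_VERDICT = re.compile(r"\b(PASS|FAIL)\b", re.IGNORECASE)


def simple_parse_verdict(response: str) -> Optional[str]:
    """Return "PASS" or "FAIL", or None when the response is missing or ambiguous."""
    found = {match.upper() for match in _VERDICT.findall(response)}
    return found.pop() if len(found) == 1 else None
```

Returning None for ambiguous output lets the test decide whether to fail, skip, or retry instead of silently guessing.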
Custom backend
Override the judge_llm fixture for a custom LLM client or internal gateway.
import pytest
import requests
from pytest_llm_rubric import AnyLLMJudge, register_judge

class MyBackend(AnyLLMJudge):
    def complete(self, messages, max_output_tokens=256, response_format=None):
        resp = requests.post("https://internal-llm.corp/v1/chat", json={"messages": messages})
        return resp.json()["content"]

# Override the fixture directly
@pytest.fixture(scope="session")
def judge_llm(request):
    judge = MyBackend("my-model", "internal")
    register_judge(request.config, judge, model="internal:my-model")
    return judge
Extend AnyLLMJudge and override complete(). Call register_judge() in your fixture so the terminal summary picks up the results.
AI coding assistant CLIs as backends
AI coding assistant CLIs like Claude Code or GitHub Copilot can also be used as backends without an API key:
import subprocess
from pytest_llm_rubric import AnyLLMJudge

class ClaudeCLIBackend(AnyLLMJudge):
    def complete(self, messages, max_output_tokens=256, response_format=None):
        prompt = messages[-1]["content"]
        result = subprocess.run(
            ["claude", "-p", prompt],  # or ["copilot", "-p", prompt]
            capture_output=True, timeout=300,
        )
        return result.stdout.decode("utf-8")
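The example above forwards only the last message, so a system message in the list never reaches the CLI. If that matters for your prompts, one option is to flatten the whole list into a single prompt string first. `messages_to_prompt` below is a hypothetical helper, and the bracketed role labels are an arbitrary convention chosen for this sketch.

```python
def messages_to_prompt(messages: list[dict]) -> str:
    """Join chat messages into one prompt, labelling each part with its role."""
    parts = []
    for message in messages:
        role = message.get("role", "user").upper()
        parts.append(f"[{role}]\n{message['content']}")
    return "\n\n".join(parts)


prompt = messages_to_prompt([
    {"role": "system", "content": "Reply PASS or FAIL."},
    {"role": "user", "content": "DOCUMENT: ... RULE: ..."},
])
print(prompt)
```

Inside a backend's complete(), the result could then replace the single-message prompt passed to the CLI.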
Parallel execution (pytest-xdist)
Works with pytest-xdist. Preflight runs once across workers. This is not extensively tested yet; please report any issues.
pip install pytest-xdist
pytest -n auto -m llm_rubric
Flaky tests
LLM-based tests are inherently non-deterministic. Preflight screens out unreliable models, but borderline cases may still flake. Use pytest-rerunfailures to retry:
pip install pytest-rerunfailures
pytest --reruns 2 -m llm_rubric # rerun failed LLM tests up to 2 times
Deterministic settings (temperature=0) would undermine the fuzzy semantic matching that makes this approach valuable. See the pytest documentation on flaky tests for more strategies.
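Besides rerunning failed tests, a borderline criterion can be judged several times inside one test and decided by majority. This is a sketch of that idea under the assumption that `judge()` returns a truthy or falsy value, as the Quick Start suggests; `majority_vote` is a hypothetical helper, not part of the plugin.

```python
from typing import Callable


def majority_vote(judge_once: Callable[[], bool], runs: int = 3) -> bool:
    """Call judge_once `runs` times and return True if more than half pass."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    passes = sum(1 for _ in range(runs) if judge_once())
    return passes * 2 > runs


# In a test this might look like (judge_llm is the plugin's fixture):
#     assert majority_vote(lambda: judge_llm.judge(text, "A fox leaps over a dog."))
```

Each extra run costs another LLM call, so an odd, small number of runs is usually the practical choice.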
Development
git clone https://github.com/ugai/pytest-llm-rubric.git
cd pytest-llm-rubric
uv sync
uv run pre-commit install # ruff + ty on every commit
uv run pytest -m "not integration" # no LLM calls, runs offline
uv run ruff check src/ tests/
uv run ruff format src/ tests/
uv run ty check src/
References
This plugin's design — binary PASS/FAIL criteria, not multi-level scoring — aligns with Anthropic's recommended practices:
- Define success criteria and build evaluations: binary classification with clear rubrics over qualitative scales
- Skill authoring best practices: expected_behavior as individually verifiable statements, not a single aggregate score
License
MIT