LLMJudge
Provider-agnostic, reproducible, typed LLM-as-a-judge — a small primitive you can depend on.
LLMJudge is one tiny, well-tested module for scoring model outputs with an LLM judge — the part most projects re-implement badly. The core has zero required runtime dependencies, a stable typed API, and runs fully offline in tests via a deterministic mock.
Install
pip install llm-judge-kit # core only, zero deps
pip install "llm-judge-kit[openai]" # + OpenAI-compatible provider
pip install "llm-judge-kit[anthropic]" # + Anthropic provider
pip install "llm-judge-kit[all]" # all providers
Quickstart
This runs as-is — no API key, deterministic (examples/quickstart.py):
from llm_judge_kit import Judge, MockProvider
# MockProvider(fixed_score=...) keeps this example deterministic and offline.
judge = Judge(provider=MockProvider(fixed_score=0.9), rubric="factuality")
result = judge.score(
    prompt="What is the capital of France?",
    response="The capital of France is Paris.",
)
assert result > 0.8 # a JudgeResult compares like its float score
print(f"score={result.score} confidence={result.confidence}")
print(f"passed={result.passed()} reason={result.reason!r}")
With a real model, swap the provider for a spec string — nothing else changes:
judge = Judge(provider="openai:gpt-5", rubric="factuality")
result = judge.score(prompt, response)
if not result.passed(0.7):
    print("Failed:", result.reason, result.violations)
Core concepts
| Piece | What it is |
|---|---|
| Judge(provider, rubric) | Pairs a model backend with a rubric; score() → JudgeResult. |
| JudgeResult | Frozen, typed verdict: score, confidence, reason, evidence, violations, raw, metadata. Compares and casts like its score. |
| Provider | A Protocol with one method, complete(prompt) -> ProviderResponse. |
| Rubric | Declarative description of what to evaluate; renders a strict-JSON judging prompt. |
JudgeResult ergonomics
r = judge.score(prompt, response)
r.score # float in [0, 1]
float(r) # same number
r > 0.8 # compares like its score
r.passed(0.7) # bool against a threshold
r.reason # short justification
r.evidence # tuple of supporting quotes/facts
r.violations # tuple of failed criteria
r.metadata # provider, model, token usage, latency, cost
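To make "compares and casts like its score" concrete, here is a minimal sketch of how a frozen, score-like verdict could be built with the standard library. `SketchResult` is a hypothetical stand-in written for illustration, not the library's `JudgeResult`; the default threshold of 0.5 is also an assumption, since the README does not state it.

```python
from dataclasses import dataclass
from functools import total_ordering


@total_ordering
@dataclass(frozen=True, eq=False)
class SketchResult:
    """Hypothetical score-like verdict (not the library's JudgeResult)."""

    score: float
    reason: str = ""
    violations: tuple = ()

    def __float__(self) -> float:
        return self.score

    def __eq__(self, other: object) -> bool:
        # Compare against plain numbers or other results by score.
        if isinstance(other, (SketchResult, int, float)):
            return float(self) == float(other)
        return NotImplemented

    def __lt__(self, other: object) -> bool:
        # total_ordering derives >, >=, <= from __lt__ and __eq__.
        if isinstance(other, (SketchResult, int, float)):
            return float(self) < float(other)
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.score)

    def passed(self, threshold: float = 0.5) -> bool:
        # Assumed semantics: pass when the score meets the threshold.
        return self.score >= threshold
```

With this shape, `SketchResult(score=0.9) > 0.8` and `float(SketchResult(score=0.9))` behave like the plain float, while `frozen=True` blocks accidental mutation of a verdict.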
Built-in rubrics
factuality, groundedness (requires context=), relevance,
instruction_following, safety. List them with
llm_judge_kit.available_rubrics().
judge = Judge(provider="openai:gpt-5", rubric="groundedness")
result = judge.score(question, answer, context=retrieved_docs) # RAG check
Consensus (vote across models)
Run several judge models and aggregate — confidence reflects how much they
agree (examples/consensus.py):
judge = Judge.consensus(
    ["openai:gpt-5", "anthropic:claude-opus-4-8", "ollama:llama3"],
    rubric="factuality",
)
result = judge.score(prompt, response)
result.score # mean (or median) of member scores
result.confidence # high when members agree, low when they diverge
result.metadata["votes"] # each member's score
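One way to picture the aggregation described above is the sketch below. The agreement formula is an assumption made for illustration (the README does not say how confidence is computed): it maps the population standard deviation of the member scores onto [0, 1]. `aggregate_votes` is a hypothetical helper, not part of the library.

```python
from statistics import mean, median, pstdev


def aggregate_votes(votes: dict, use_median: bool = False) -> tuple:
    """Return (score, confidence) for member scores that lie in [0, 1]."""
    if not votes:
        raise ValueError("need at least one vote")
    scores = list(votes.values())
    score = median(scores) if use_median else mean(scores)
    # For values in [0, 1] the population std-dev is at most 0.5,
    # so 1 - 2 * pstdev is 1.0 on full agreement and 0.0 on a 0/1 split.
    confidence = 1.0 - 2.0 * pstdev(scores)
    return score, max(0.0, min(1.0, confidence))
```

Using the median instead of the mean makes the score less sensitive to a single outlier judge, at the cost of ignoring how far that outlier sits from the rest.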
Reliability & caching
Both wrappers are providers, so they compose around any backend
(examples/reliability_and_cache.py):
from llm_judge_kit import Judge, OpenAIProvider, RetryProvider, CachingProvider
provider = CachingProvider(  # memoize identical calls
    RetryProvider(  # retry w/ backoff + timeout
        OpenAIProvider(model="gpt-5"), retries=3, timeout=30,
    )
)
judge = Judge(provider=provider, rubric="factuality")
The cache key is version + provider + model + prompt + kwargs, so it is stable
and invalidates correctly across library versions. Logs are emitted on the
llm_judge_kit logger (silent by default; call enable_debug_logging() to see them).
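As a rough illustration of a key built from version + provider + model + prompt + kwargs, the sketch below hashes a canonical JSON encoding of those fields. The function name, the JSON encoding, and the choice of SHA-256 are assumptions for the example; the library's actual key derivation may differ.

```python
import hashlib
import json


def cache_key(version: str, provider: str, model: str, prompt: str, **kwargs: object) -> str:
    """Hypothetical cache key: a digest over every input that affects the result."""
    payload = json.dumps(
        {
            "version": version,
            "provider": provider,
            "model": model,
            "prompt": prompt,
            "kwargs": kwargs,
        },
        sort_keys=True,  # keyword order must not change the key
        default=repr,    # fall back to repr() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the library version in the payload is what makes entries written by an older release miss after an upgrade, instead of returning verdicts produced by a different prompt template.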
Integrations
pytest — eval as ordinary tests
Installing llm_judge_kit registers a pytest plugin (no conftest wiring). The
llm_judge_kit fixture turns an eval into a normal test; a failure reads like any
other failing assertion (score, reason, violations):
def test_answer_is_grounded(llm_judge_kit):
    llm_judge_kit.assert_passes(
        prompt="How tall is the Eiffel Tower?",
        response=my_rag_pipeline("How tall is the Eiffel Tower?"),
        rubric="groundedness",
        context=retrieved_docs,
        threshold=0.7,
    )
Pick the judge model once for the whole suite — it defaults to mock (offline),
so tests are green until you point them at a real model:
pytest --llm-judge-kit-provider "openai:gpt-5" # or: export LLM_JUDGE_KIT_PROVIDER=...
Any framework
LLMJudge judges strings, so it drops into any stack — LangChain, LlamaIndex, DSPy, a raw script — with no adapter. Whatever produces the output, pass it in:
output = my_chain.invoke(question) # LangChain / LlamaIndex / your code
result = Judge(provider="openai:gpt-5", rubric="relevance").score(question, output)
Extend without touching the core
Add a rubric (examples/custom_rubric.py):
from llm_judge_kit import Rubric, register_rubric
register_rubric(Rubric(
    name="conciseness",
    description="Whether the response is as short as possible while complete.",
    criteria=("No filler or repetition.", "Every sentence earns its place."),
))
judge = Judge(provider="openai:gpt-5", rubric="conciseness")
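For intuition about how a declarative rubric can turn into a strict-JSON judging prompt, here is a small sketch. `SketchRubric`, its `render` method, the prompt wording, and the reply schema are all assumptions for illustration; they are not the library's `Rubric` or its real prompt.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchRubric:
    """Hypothetical stand-in for a rubric; not the library's Rubric class."""

    name: str
    description: str
    criteria: tuple = ()

    def render(self, prompt: str, response: str, context: Optional[str] = None) -> str:
        """Build a judging prompt that asks for a single JSON object."""
        lines = [
            f"You are grading a response for {self.name}: {self.description}",
            "Criteria:",
            *[f"- {c}" for c in self.criteria],
            f"PROMPT:\n{prompt}",
            f"RESPONSE:\n{response}",
        ]
        if context is not None:
            lines.append(f"CONTEXT:\n{context}")
        # Assumed reply schema, chosen to mirror the JudgeResult fields.
        schema = {"score": 0.0, "confidence": 0.0, "reason": "", "evidence": [], "violations": []}
        lines.append("Reply with one JSON object and nothing else, shaped like:")
        lines.append(json.dumps(schema))
        return "\n".join(lines)
```

Keeping the rubric as plain data and the rendering in one method means a new rubric only has to supply a name, a description, and criteria.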
Add a provider — implement one method, optionally register a scheme:
from llm_judge_kit import ProviderResponse, register_provider
class MyProvider:
    name = "mine"
    def complete(self, prompt: str, **kwargs: object) -> ProviderResponse:
        return ProviderResponse(text=call_my_model(prompt))
register_provider("mine", lambda model: MyProvider())
judge = Judge(provider="mine:v1", rubric="relevance")
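The pattern above (a one-method protocol plus a scheme registry that resolves spec strings such as "mine:v1") can be sketched with the standard library alone. Every name below (`SketchResponse`, `SketchProvider`, `register`, `resolve`, `EchoProvider`) is hypothetical and only mirrors the shape described in this README; the library's internals may look different.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol, runtime_checkable


@dataclass(frozen=True)
class SketchResponse:
    text: str


@runtime_checkable
class SketchProvider(Protocol):
    """Structural type: any object with a matching complete() qualifies."""

    def complete(self, prompt: str, **kwargs: object) -> SketchResponse: ...


_REGISTRY: Dict[str, Callable[[str], SketchProvider]] = {}


def register(scheme: str, factory: Callable[[str], SketchProvider]) -> None:
    _REGISTRY[scheme] = factory


def resolve(spec: str) -> SketchProvider:
    """Turn 'scheme:model' into a provider instance via the registry."""
    scheme, _, model = spec.partition(":")
    if scheme not in _REGISTRY:
        raise ValueError(f"unknown provider scheme: {scheme!r}")
    return _REGISTRY[scheme](model)


class EchoProvider:
    """Toy backend: no inheritance needed, it just has complete()."""

    def __init__(self, model: str) -> None:
        self.model = model

    def complete(self, prompt: str, **kwargs: object) -> SketchResponse:
        return SketchResponse(text=f"[{self.model}] {prompt}")


register("echo", EchoProvider)
```

Because the protocol is structural, a custom backend never has to import or subclass anything from the core; it only has to expose the one method.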
CLI & batch evaluation
Score a whole dataset and get a report — JSON, Markdown, or HTML. A dataset is
JSON Lines (prompt + response, optional context/reference/id); see
examples/sample_dataset.jsonl.
llm-judge-kit eval cases.jsonl --provider openai:gpt-5 --rubric factuality --format md
llm-judge-kit eval cases.jsonl --fail-under 0.9 # exit non-zero in CI if pass rate drops
llm-judge-kit compare cases.jsonl --provider openai:gpt-5 --provider anthropic:claude-opus-4-8
llm-judge-kit report report.json --format html -o report.html
Same thing in code (examples/benchmark_report.py):
from llm_judge_kit import Judge, load_dataset, run_benchmark, render_markdown
cases = load_dataset("cases.jsonl")
judge = Judge(provider="openai:gpt-5", rubric="factuality")
report = run_benchmark(judge, cases, provider="openai:gpt-5", rubric="factuality")
print(report.pass_rate, report.mean_score)
print(render_markdown(report))
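To show what a JSON Lines dataset and the two headline numbers amount to, here is a standalone sketch that does not use the library. `parse_jsonl` and `summarize` are hypothetical helpers, and the 0.7 pass threshold is an assumption borrowed from the examples above rather than a documented default.

```python
import json
from statistics import mean


def parse_jsonl(text: str) -> list:
    """Parse JSON Lines; each case needs 'prompt' and 'response'."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        if not isinstance(case, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = {"prompt", "response"} - case.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing {sorted(missing)}")
        cases.append(case)
    return cases


def summarize(scores: list, threshold: float = 0.7) -> tuple:
    """Return (pass_rate, mean_score) for a list of scores in [0, 1]."""
    if not scores:
        raise ValueError("no scores to summarize")
    passed = sum(1 for s in scores if s >= threshold)
    return passed / len(scores), mean(scores)
```

A CI gate in the spirit of `--fail-under 0.9` is then a comparison of the returned pass rate against 0.9, exiting non-zero when it falls short.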
Why depend on this
- Easy to depend on: zero transitive deps in the core; provider SDKs are opt-in extras.
- Reproducible: deterministic offline MockProvider; all unit tests run without network.
- Typed: mypy --strict clean; ships py.typed.
- Robust parsing: recovers JSON from markdown fences, prose, and trailing commas.
- Extensible: new provider / rubric / judge without core changes.
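The robust-parsing point can be illustrated with a short sketch that pulls a JSON object out of a fenced or chatty model reply and retries after dropping trailing commas. `extract_json` is a hypothetical helper written for this example, not the library's parser, and its comma handling is deliberately naive.

```python
import json
import re

_TICKS = "`" * 3  # a markdown code fence, built this way to keep the example readable
_FENCE = re.compile(_TICKS + r"(?:json)?\s*(.*?)" + _TICKS, re.DOTALL | re.IGNORECASE)
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def extract_json(text: str) -> dict:
    """Best-effort recovery of one JSON object from a model reply."""
    fenced = _FENCE.search(text)
    candidate = fenced.group(1) if fenced else text
    # Keep everything from the first '{' to the last '}' to skip prose.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    body = candidate[start : end + 1]
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # Naive repair: also rewrites a ',}' that sits inside a string value.
        return json.loads(_TRAILING_COMMA.sub(r"\1", body))
```

Trying a strict parse first and repairing only on failure keeps well-formed replies untouched.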
Development
uv sync --all-extras
uv run ruff check . && uv run ruff format --check . && uv run mypy src && uv run pytest --cov=llm_judge_kit --cov-report=term-missing
See CONTRIBUTING.md. The plan of record is in ROADMAP.md.
License
MIT — see LICENSE.