LLMJudge Kit

Provider-agnostic, reproducible, typed LLM-as-a-judge — a small primitive you can depend on.

LLMJudge Kit is one tiny, well-tested module for scoring model outputs with an LLM judge — the part most projects re-implement badly. The core has zero required runtime dependencies, a stable typed API, and runs fully offline in tests via a deterministic mock.

Install

pip install llm-judge-kit                 # core only, zero deps
pip install "llm-judge-kit[openai]"       # + OpenAI-compatible provider
pip install "llm-judge-kit[anthropic]"    # + Anthropic provider
pip install "llm-judge-kit[all]"          # all providers

Quickstart

This runs as-is — no API key, deterministic (examples/quickstart.py):

from llm_judge_kit import Judge, MockProvider

# MockProvider(fixed_score=...) keeps this example deterministic and offline.
judge = Judge(provider=MockProvider(fixed_score=0.9), rubric="factuality")

result = judge.score(
    prompt="What is the capital of France?",
    response="The capital of France is Paris.",
)

assert result > 0.8  # a JudgeResult compares like its float score
print(f"score={result.score}  confidence={result.confidence}")
print(f"passed={result.passed()}  reason={result.reason!r}")

With a real model, swap the provider for a spec string — nothing else changes:

judge = Judge(provider="openai:gpt-5", rubric="factuality")
result = judge.score(prompt, response)
if not result.passed(0.7):
    print("Failed:", result.reason, result.violations)

Core concepts

| Piece | What it is |
| --- | --- |
| Judge(provider, rubric) | Pairs a model backend with a rubric; score() returns a JudgeResult. |
| JudgeResult | Frozen, typed verdict: score, confidence, reason, evidence, violations, raw, metadata. Compares and casts like its score. |
| Provider | A Protocol with one method, complete(prompt) -> ProviderResponse. |
| Rubric | Declarative description of what to evaluate; renders a strict-JSON judging prompt. |

JudgeResult ergonomics

r = judge.score(prompt, response)
r.score            # float in [0, 1]
float(r)           # same number
r > 0.8            # compares like its score
r.passed(0.7)      # bool against a threshold
r.reason           # short justification
r.evidence         # tuple of supporting quotes/facts
r.violations       # tuple of failed criteria
r.metadata         # provider, model, token usage, latency, cost
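
To make "compares and casts like its score" concrete, here is a minimal sketch of a frozen result type with that behavior. It is an illustration, not the library's implementation: the class name SketchResult, the reduced field set, the default threshold of 0.5, and the use of >= in passed() are all assumptions.

```python
from dataclasses import dataclass
from functools import total_ordering


@total_ordering
@dataclass(frozen=True, eq=False)
class SketchResult:
    """Hypothetical stand-in for a verdict object (not llm_judge_kit's class)."""

    score: float
    reason: str = ""
    violations: tuple = ()

    def __float__(self) -> float:
        # float(result) yields the bare score.
        return self.score

    def __eq__(self, other: object) -> bool:
        if isinstance(other, (int, float, SketchResult)):
            return self.score == float(other)
        return NotImplemented

    def __lt__(self, other: object) -> bool:
        # total_ordering derives >, <=, >= from __lt__ and __eq__.
        if isinstance(other, (int, float, SketchResult)):
            return self.score < float(other)
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.score)

    def passed(self, threshold: float = 0.5) -> bool:
        # Assumption: a score equal to the threshold counts as a pass.
        return self.score >= threshold


r = SketchResult(score=0.9, reason="Matches the reference.")
```

Because the dataclass is frozen, assigning to r.score after construction raises an error, which keeps a verdict safe to cache or share between threads.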

Built-in rubrics

factuality, groundedness (requires context=), relevance, instruction_following, safety, coherence, completeness. List them with llm_judge_kit.available_rubrics().

judge = Judge(provider="openai:gpt-5", rubric="groundedness")
result = judge.score(question, answer, context=retrieved_docs)  # RAG check

Consensus (vote across models)

Run several judge models and aggregate — confidence reflects how much they agree (examples/consensus.py):

judge = Judge.consensus(
    ["openai:gpt-5", "anthropic:claude-opus-4-8", "ollama:llama3"],
    rubric="factuality",
)
result = judge.score(prompt, response)
result.score          # mean (or median) of member scores
result.confidence     # high when members agree, low when they diverge
result.metadata["votes"]   # each member's score
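
The README does not spell out how member votes are combined, so the following is only one plausible sketch of "mean score, confidence from agreement". The function name aggregate and the confidence formula (one minus the normalized standard deviation) are assumptions made for illustration.

```python
from statistics import mean, pstdev


def aggregate(votes: list) -> tuple:
    """Return (score, confidence) for member scores in [0, 1].

    Hypothetical aggregation rule; the library may use a different one.
    """
    if not votes:
        raise ValueError("need at least one vote")
    score = mean(votes)
    # For values in [0, 1] the population standard deviation is at most 0.5,
    # so dividing by 0.5 maps disagreement onto [0, 1].
    spread = pstdev(votes) / 0.5
    confidence = max(0.0, 1.0 - spread)
    return score, confidence


agree_score, agree_conf = aggregate([0.9, 0.9, 0.9])  # full agreement
split_score, split_conf = aggregate([0.0, 1.0])       # maximal disagreement
```

With full agreement the confidence is 1.0; with one judge at 0 and one at 1 it drops to 0.0 while the score sits at 0.5.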

Reliability & caching

Both wrappers are providers, so they compose around any backend (examples/reliability_and_cache.py):

from llm_judge_kit import Judge, OpenAIProvider, RetryProvider, CachingProvider

provider = CachingProvider(                      # memoize identical calls
    RetryProvider(                               # retry w/ backoff + timeout
        OpenAIProvider(model="gpt-5"), retries=3, timeout=30,
    )
)
judge = Judge(provider=provider, rubric="factuality")

The cache key is version + provider + model + prompt + kwargs, so it is stable and invalidates correctly across library versions. Logs are emitted on the llm_judge_kit logger (silent by default; call enable_debug_logging() to see them).
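
As an illustration of why such a key is stable, here is a minimal sketch that hashes the same five ingredients. The function name cache_key, the JSON serialization, and the choice of SHA-256 are assumptions; the library's actual key derivation may differ.

```python
import hashlib
import json


def cache_key(version: str, provider: str, model: str, prompt: str, **kwargs: object) -> str:
    """Hypothetical key derivation: hash a canonical JSON payload."""
    payload = json.dumps(
        {
            "version": version,
            "provider": provider,
            "model": model,
            "prompt": prompt,
            "kwargs": kwargs,
        },
        sort_keys=True,  # keyword order must not change the key
        default=repr,    # fall back to repr() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = cache_key("0.1.2", "openai", "gpt-5", "Judge this.", temperature=0, seed=1)
b = cache_key("0.1.2", "openai", "gpt-5", "Judge this.", seed=1, temperature=0)
c = cache_key("0.2.0", "openai", "gpt-5", "Judge this.", temperature=0, seed=1)
```

Here a and b are identical because the keys are sorted before hashing, while c differs because the version string is part of the payload, which is what makes cached verdicts expire on upgrade.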

Integrations

pytest — eval as ordinary tests

Installing llm_judge_kit registers a pytest plugin (no conftest wiring). The llm_judge_kit fixture turns an eval into a normal test; a failure reads like any other failing assertion (score, reason, violations):

def test_answer_is_grounded(llm_judge_kit):
    llm_judge_kit.assert_passes(
        prompt="How tall is the Eiffel Tower?",
        response=my_rag_pipeline("How tall is the Eiffel Tower?"),
        rubric="groundedness",
        context=retrieved_docs,
        threshold=0.7,
    )

Pick the judge model once for the whole suite — it defaults to mock (offline), so tests are green until you point them at a real model:

pytest --llm-judge-kit-provider "openai:gpt-5"      # or: export LLM_JUDGE_KIT_PROVIDER=...

Runnable example: examples/test_with_pytest.py.

Any framework

LLMJudge Kit judges strings, so it drops into any stack — LangChain, LlamaIndex, DSPy, a raw script — with no adapter. Whatever produces the output, pass it in:

output = my_chain.invoke(question)          # LangChain / LlamaIndex / your code
result = Judge(provider="openai:gpt-5", rubric="relevance").score(question, output)

Extend without touching the core

Add a rubric (examples/custom_rubric.py):

from llm_judge_kit import Rubric, register_rubric

register_rubric(Rubric(
    name="conciseness",
    description="Whether the response is as short as possible while complete.",
    criteria=("No filler or repetition.", "Every sentence earns its place."),
))
judge = Judge(provider="openai:gpt-5", rubric="conciseness")

Add a provider — implement one method, optionally register a scheme:

from llm_judge_kit import ProviderResponse, register_provider

class MyProvider:
    name = "mine"
    def complete(self, prompt: str, **kwargs: object) -> ProviderResponse:
        return ProviderResponse(text=call_my_model(prompt))

register_provider("mine", lambda model: MyProvider())
judge = Judge(provider="mine:v1", rubric="relevance")

CLI & batch evaluation

Score a whole dataset and get a report — JSON, Markdown, or HTML. A dataset is JSON Lines (prompt + response, optional context/reference/id); see examples/sample_dataset.jsonl.

llm-judge-kit eval cases.jsonl --provider openai:gpt-5 --rubric factuality --format md
llm-judge-kit eval cases.jsonl --fail-under 0.9            # exit non-zero in CI if pass rate drops
llm-judge-kit compare cases.jsonl --provider openai:gpt-5 --provider anthropic:claude-opus-4-8
llm-judge-kit report report.json --format html -o report.html
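
A dataset file in the JSON Lines shape described above can be produced with the standard library alone. This is a sketch: the helper names write_jsonl and read_jsonl are hypothetical, the case contents are made up for illustration, and passing context as a single string is an assumption (check examples/sample_dataset.jsonl for the exact shape).

```python
import json
import tempfile
from pathlib import Path

# Illustrative cases: prompt and response are required, the rest is optional.
cases = [
    {
        "id": "fr-capital",
        "prompt": "What is the capital of France?",
        "response": "The capital of France is Paris.",
    },
    {
        "id": "eiffel",
        "prompt": "How tall is the Eiffel Tower?",
        "response": "About 330 metres.",
        "context": "The Eiffel Tower is about 330 metres tall.",
    },
]


def write_jsonl(path: Path, rows: list) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list:
    """Read a JSON Lines file back, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "cases.jsonl"
    write_jsonl(dataset_path, cases)
    loaded = read_jsonl(dataset_path)
```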

Try it now — offline, on the shipped sample dataset (deterministic mock scores; point --provider at a real model for real verdicts):

$ llm-judge-kit eval examples/sample_dataset.jsonl --rubric factuality --format md
# LLMJudge Kit report

- **provider:** mock
- **rubric:** factuality
- **threshold:** 0.50
- **cases:** 3
- **passed:** 1 (33.3%)
- **mean score:** 0.596

| # | id | score | result | reason |
| --- | --- | ---: | :---: | --- |
| 1 | fr-capital | 0.997 | PASS | Deterministic mock verdict. |
| 2 | math | 0.349 | FAIL | Deterministic mock verdict. |
| 3 | eiffel | 0.442 | FAIL | Deterministic mock verdict. |

Pass --format html -o report.html instead to get a shareable, self-contained HTML report.

Same thing in code (examples/benchmark_report.py):

from llm_judge_kit import Judge, load_dataset, run_benchmark, render_markdown

cases = load_dataset("cases.jsonl")
judge = Judge(provider="openai:gpt-5", rubric="factuality")
report = run_benchmark(judge, cases, provider="openai:gpt-5", rubric="factuality")
print(report.pass_rate, report.mean_score)
print(render_markdown(report))

Use it as a CI gate

To block a merge when answer quality regresses, pass --fail-under, which makes the command exit non-zero when the pass rate drops below the given value; the core installs with no heavy dependency tree:

# .github/workflows/eval.yml  (steps inside your job)
- run: pip install "llm-judge-kit[openai]"
- run: llm-judge-kit eval cases.jsonl --rubric factuality --provider openai:gpt-5 --fail-under 0.9
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Or keep evals as ordinary pytest tests via the bundled plugin (see Integrations) and run them in your existing test job.
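
The gate itself reduces to a pass-rate comparison. The sketch below reproduces the numbers from the sample report above (threshold 0.50, three cases) and shows how a --fail-under value would turn them into an exit code. The function names, the use of >= for both comparisons, and the exit code 1 are assumptions; the README only promises a non-zero exit.

```python
def pass_rate(scores: list, threshold: float = 0.5) -> float:
    """Fraction of cases whose score reaches the threshold."""
    if not scores:
        raise ValueError("no cases to evaluate")
    return sum(score >= threshold for score in scores) / len(scores)


def gate_exit_code(rate: float, fail_under: float) -> int:
    """Hypothetical gate: 0 when the pass rate is high enough, else 1."""
    return 0 if rate >= fail_under else 1


# Scores taken from the sample report shown earlier.
sample_scores = [0.997, 0.349, 0.442]
rate = pass_rate(sample_scores)
mean_score = sum(sample_scores) / len(sample_scores)

print(f"passed: {rate:.1%}")              # passed: 33.3%
print(f"mean score: {mean_score:.3f}")    # mean score: 0.596
print(gate_exit_code(rate, fail_under=0.9))  # 1
```

Note that the gate tracks the pass rate rather than the mean score, so one very high score cannot mask several failing cases.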

Why depend on this

  • Easy to depend on: zero transitive deps in the core; provider SDKs are opt-in extras.
  • Reproducible: deterministic offline MockProvider; all unit tests run without network.
  • Typed: mypy --strict clean; ships py.typed.
  • Robust parsing: recovers JSON from markdown fences, prose, and trailing commas.
  • Extensible: new provider / rubric / judge without core changes.
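
To show what recovering JSON from fences, prose, and trailing commas can involve, here is a minimal sketch using only the standard library. It is not the library's parser: the function name extract_json and the three-step strategy are assumptions, and the trailing-comma cleanup is deliberately naive (it would also rewrite a ",}" sequence inside a string value).

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)
TRAILING_COMMA = re.compile(r",\s*([}\]])")


def extract_json(text: str) -> dict:
    """Best-effort recovery of one JSON object from a model reply."""
    # 1. Prefer the contents of a fenced block when the reply has one.
    fenced = FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1)
    # 2. Keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    candidate = text[start : end + 1]
    # 3. Drop commas that sit directly before a closing brace or bracket.
    candidate = TRAILING_COMMA.sub(r"\1", candidate)
    return json.loads(candidate)


fenced_reply = "Sure! " + FENCE + 'json\n{"score": 0.9, "reason": "ok",}\n' + FENCE
prose_reply = 'The verdict is {"score": 0.2} as requested.'

from_fence = extract_json(fenced_reply)
from_prose = extract_json(prose_reply)
```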

Development

uv sync --all-extras
uv run ruff check . && uv run ruff format --check . && uv run mypy src && uv run pytest --cov=llm_judge_kit --cov-report=term-missing

See CONTRIBUTING.md. The plan of record is in ROADMAP.md.

License

MIT — see LICENSE.
