LLMJudge Kit

Provider-agnostic, reproducible, typed LLM-as-a-judge — a small primitive you can depend on.

LLMJudge Kit is one tiny, well-tested module for scoring model outputs with an LLM judge — the part most projects re-implement badly. The core has zero required runtime dependencies, a stable typed API, and runs fully offline in tests via a deterministic mock.

Install

pip install llm-judge-kit                 # core only, zero deps
pip install "llm-judge-kit[openai]"       # + OpenAI-compatible provider
pip install "llm-judge-kit[anthropic]"    # + Anthropic provider
pip install "llm-judge-kit[all]"          # all providers

Quickstart

This runs as-is — no API key, deterministic (examples/quickstart.py):

from llm_judge_kit import Judge, MockProvider

# MockProvider(fixed_score=...) keeps this example deterministic and offline.
judge = Judge(provider=MockProvider(fixed_score=0.9), rubric="factuality")

result = judge.score(
    prompt="What is the capital of France?",
    response="The capital of France is Paris.",
)

assert result > 0.8  # a JudgeResult compares like its float score
print(f"score={result.score}  confidence={result.confidence}")
print(f"passed={result.passed()}  reason={result.reason!r}")

With a real model, swap the provider for a spec string — nothing else changes:

judge = Judge(provider="openai:gpt-5", rubric="factuality")
result = judge.score(prompt, response)
if not result.passed(0.7):
    print("Failed:", result.reason, result.violations)

Core concepts

Piece                      What it is
Judge(provider, rubric)    Pairs a model backend with a rubric; score() returns a JudgeResult.
JudgeResult                Frozen, typed verdict: score, confidence, reason, evidence, violations, raw, metadata. Compares and casts like its score.
Provider                   A Protocol with one method, complete(prompt) -> ProviderResponse.
Rubric                     Declarative description of what to evaluate; renders a strict-JSON judging prompt.

JudgeResult ergonomics

r = judge.score(prompt, response)
r.score            # float in [0, 1]
float(r)           # same number
r > 0.8            # compares like its score
r.passed(0.7)      # bool against a threshold
r.reason           # short justification
r.evidence         # tuple of supporting quotes/facts
r.violations       # tuple of failed criteria
r.metadata         # provider, model, token usage, latency, cost
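
To make "compares and casts like its score" concrete, here is a minimal standalone sketch of how a frozen, score-like value object can behave that way. The Verdict class, its fields, and the 0.5 default threshold are assumptions for illustration only; this is not the library's JudgeResult implementation.

```python
from dataclasses import dataclass
from functools import total_ordering


@total_ordering
@dataclass(frozen=True, eq=False)
class Verdict:
    """Hypothetical stand-in for a score-like result (not the library's class)."""

    score: float
    reason: str = ""

    def __float__(self) -> float:
        # float(verdict) yields the underlying score.
        return self.score

    def __eq__(self, other: object) -> bool:
        if isinstance(other, (Verdict, int, float)):
            return self.score == float(other)
        return NotImplemented

    def __lt__(self, other: object) -> bool:
        # total_ordering derives >, <=, >= from __lt__ and __eq__.
        if isinstance(other, (Verdict, int, float)):
            return self.score < float(other)
        return NotImplemented

    def __hash__(self) -> int:
        # Defining __eq__ disables the default hash, so restore one.
        return hash(self.score)

    def passed(self, threshold: float = 0.5) -> bool:
        # The default threshold here is an assumption for this sketch.
        return self.score >= threshold


verdict = Verdict(score=0.9, reason="Matches the reference answer.")
print(verdict > 0.8, float(verdict), verdict.passed(0.7))
```

Comparing against plain numbers is what lets a bare assert result > 0.8 read naturally in a test.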

Built-in rubrics

factuality, groundedness (requires context=), relevance, instruction_following, safety. List them with llm_judge_kit.available_rubrics().

judge = Judge(provider="openai:gpt-5", rubric="groundedness")
result = judge.score(question, answer, context=retrieved_docs)  # RAG check

Consensus (vote across models)

Run several judge models and aggregate — confidence reflects how much they agree (examples/consensus.py):

judge = Judge.consensus(
    ["openai:gpt-5", "anthropic:claude-opus-4-8", "ollama:llama3"],
    rubric="factuality",
)
result = judge.score(prompt, response)
result.score          # mean (or median) of member scores
result.confidence     # high when members agree, low when they diverge
result.metadata["votes"]   # each member's score
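
One way to picture the aggregation is the standalone sketch below. The aggregate function and the specific confidence formula (one minus twice the population standard deviation) are assumptions chosen for illustration; the library may compute agreement differently.

```python
from statistics import mean, median, pstdev


def aggregate(scores: list[float], use_median: bool = False) -> tuple[float, float]:
    """Combine member scores in [0, 1] into (score, confidence).

    Hypothetical formula: for scores in [0, 1] the population standard
    deviation is at most 0.5, so 1 - 2 * stdev maps full agreement to 1.0
    and maximal disagreement to 0.0.
    """
    if not scores:
        raise ValueError("at least one member score is required")
    combined = median(scores) if use_median else mean(scores)
    confidence = max(0.0, 1.0 - 2.0 * pstdev(scores))
    return combined, confidence


votes = [0.9, 0.85, 0.4]  # illustrative member scores, not real model output
score, confidence = aggregate(votes)
print(f"score={score:.2f} confidence={confidence:.2f}")
```

The median variant is less sensitive to a single outlying judge, at the cost of ignoring how far that judge diverged.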

Reliability & caching

Both wrappers are providers, so they compose around any backend (examples/reliability_and_cache.py):

from llm_judge_kit import Judge, OpenAIProvider, RetryProvider, CachingProvider

provider = CachingProvider(                      # memoize identical calls
    RetryProvider(                               # retry w/ backoff + timeout
        OpenAIProvider(model="gpt-5"), retries=3, timeout=30,
    )
)
judge = Judge(provider=provider, rubric="factuality")

The cache key is version + provider + model + prompt + kwargs, so identical calls always map to the same entry and entries written by an older library version are never reused after an upgrade. Logs are emitted on the llm_judge_kit logger (silent by default; call enable_debug_logging() to see them).
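
As a rough illustration of a key built from those five components, the sketch below hashes a canonical JSON encoding. The cache_key function, the field names, and the choice of SHA-256 are assumptions; the library's actual key derivation is not shown in this document.

```python
import hashlib
import json


def cache_key(version: str, provider: str, model: str, prompt: str, **kwargs: object) -> str:
    """Hypothetical cache key: a SHA-256 digest over a canonical JSON payload."""
    payload = json.dumps(
        {
            "version": version,
            "provider": provider,
            "model": model,
            "prompt": prompt,
            "kwargs": kwargs,
        },
        sort_keys=True,         # keyword order must not change the key
        separators=(",", ":"),  # fixed separators keep the encoding canonical
        default=repr,           # fall back to repr() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = cache_key("0.1.1", "openai", "gpt-5", "Is Paris in France?", temperature=0)
print(key[:16])
```

Including the version in the payload is what makes an upgrade produce fresh keys instead of serving verdicts parsed by older code.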

Integrations

pytest — eval as ordinary tests

Installing llm-judge-kit registers a pytest plugin (no conftest wiring). The llm_judge_kit fixture turns an eval into a normal test; a failure reads like any other failing assertion (score, reason, violations):

def test_answer_is_grounded(llm_judge_kit):
    llm_judge_kit.assert_passes(
        prompt="How tall is the Eiffel Tower?",
        response=my_rag_pipeline("How tall is the Eiffel Tower?"),
        rubric="groundedness",
        context=retrieved_docs,
        threshold=0.7,
    )

Pick the judge model once for the whole suite — it defaults to mock (offline), so tests are green until you point them at a real model:

pytest --llm-judge-kit-provider "openai:gpt-5"      # or: export LLM_JUDGE_KIT_PROVIDER=...

Runnable example: examples/test_with_pytest.py.

Any framework

LLMJudge Kit judges strings, so it drops into any stack — LangChain, LlamaIndex, DSPy, a raw script — with no adapter. Whatever produces the output, pass it in:

output = my_chain.invoke(question)          # LangChain / LlamaIndex / your code
result = Judge(provider="openai:gpt-5", rubric="relevance").score(question, output)

Extend without touching the core

Add a rubric (examples/custom_rubric.py):

from llm_judge_kit import Rubric, register_rubric

register_rubric(Rubric(
    name="conciseness",
    description="Whether the response is as short as possible while complete.",
    criteria=("No filler or repetition.", "Every sentence earns its place."),
))
judge = Judge(provider="openai:gpt-5", rubric="conciseness")

Add a provider — implement one method, optionally register a scheme:

from llm_judge_kit import ProviderResponse, register_provider

class MyProvider:
    name = "mine"

    def complete(self, prompt: str, **kwargs: object) -> ProviderResponse:
        # call_my_model is a placeholder for your own model client.
        return ProviderResponse(text=call_my_model(prompt))

register_provider("mine", lambda model: MyProvider())
judge = Judge(provider="mine:v1", rubric="relevance")
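
The pattern behind a one-method provider plus a scheme registry can be sketched without the library at all. Everything below (Response, CompletionBackend, EchoBackend, register, resolve) is a hypothetical name invented for this sketch, and the "scheme:model" parsing is an assumption about how a spec string might be resolved.

```python
from collections.abc import Callable
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Response:
    """Hypothetical response container for this sketch."""

    text: str


@runtime_checkable
class CompletionBackend(Protocol):
    """Structural type: any object with a matching complete() qualifies."""

    def complete(self, prompt: str, **kwargs: object) -> Response: ...


class EchoBackend:
    """Toy backend; note that it does not inherit from the Protocol."""

    def complete(self, prompt: str, **kwargs: object) -> Response:
        return Response(text=f"echo: {prompt}")


_REGISTRY: dict[str, Callable[[str], CompletionBackend]] = {}


def register(scheme: str, factory: Callable[[str], CompletionBackend]) -> None:
    _REGISTRY[scheme] = factory


def resolve(spec: str) -> CompletionBackend:
    # Split "scheme:model" once; the model part is handed to the factory.
    scheme, _, model = spec.partition(":")
    if scheme not in _REGISTRY:
        raise ValueError(f"unknown provider scheme: {scheme!r}")
    return _REGISTRY[scheme](model)


register("echo", lambda model: EchoBackend())
backend = resolve("echo:v1")
print(isinstance(backend, CompletionBackend), backend.complete("hi").text)
```

Because the Protocol is structural, third-party backends need no import from the core to be accepted, which is what keeps extension free of core changes.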

CLI & batch evaluation

Score a whole dataset and get a report — JSON, Markdown, or HTML. A dataset is JSON Lines (prompt + response, optional context/reference/id); see examples/sample_dataset.jsonl.

llm-judge-kit eval cases.jsonl --provider openai:gpt-5 --rubric factuality --format md
llm-judge-kit eval cases.jsonl --fail-under 0.9            # exit non-zero in CI if pass rate drops
llm-judge-kit compare cases.jsonl --provider openai:gpt-5 --provider anthropic:claude-opus-4-8
llm-judge-kit report report.json --format html -o report.html

Same thing in code (examples/benchmark_report.py):

from llm_judge_kit import Judge, load_dataset, run_benchmark, render_markdown

cases = load_dataset("cases.jsonl")
judge = Judge(provider="openai:gpt-5", rubric="factuality")
report = run_benchmark(judge, cases, provider="openai:gpt-5", rubric="factuality")
print(report.pass_rate, report.mean_score)
print(render_markdown(report))
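
For readers preparing a dataset by hand, the JSON Lines shape described above (required prompt and response; optional context, reference, and id) can be checked with the standard library. The Case class, parse_jsonl, and pass_rate below are hypothetical helpers written for this sketch, not the library's load_dataset or report API, and the 0.7 threshold is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Case:
    """Hypothetical record type mirroring the documented dataset fields."""

    prompt: str
    response: str
    context: Any = None
    reference: Optional[str] = None
    id: Optional[str] = None


def parse_jsonl(text: str) -> list[Case]:
    """Parse one JSON object per non-blank line, validating required keys."""
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = {"prompt", "response"} - record.keys()
        if missing:
            raise ValueError(f"line {lineno}: missing {sorted(missing)}")
        cases.append(
            Case(
                prompt=record["prompt"],
                response=record["response"],
                context=record.get("context"),
                reference=record.get("reference"),
                id=record.get("id"),
            )
        )
    return cases


def pass_rate(scores: list[float], threshold: float = 0.7) -> float:
    """Fraction of scores at or above the threshold (0.0 for no scores)."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


sample = '{"prompt": "2+2?", "response": "4", "id": "math-1"}\n'
print(parse_jsonl(sample)[0].id, pass_rate([0.9, 0.5, 0.8, 0.7]))
```

A CI gate in the spirit of --fail-under would then compare pass_rate(...) against the chosen floor and exit non-zero when it falls short.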

Why depend on this

  • Easy to depend on — zero transitive deps in the core; provider SDKs are opt-in extras.
  • Reproducible — deterministic offline MockProvider; all unit tests run without network.
  • Typed — mypy --strict clean; ships py.typed.
  • Robust parsing — recovers JSON from markdown fences, prose, and trailing commas.
  • Extensible — new provider / rubric / judge without core changes.
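
The "robust parsing" point can be illustrated with a small standalone sketch that pulls a JSON object out of a fenced block or surrounding prose and retries after dropping trailing commas. The extract_json function and its regexes are assumptions for illustration; the library's own parser may work differently and handle more cases.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to avoid a literal one
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def extract_json(text: str) -> dict:
    """Best-effort recovery of a single JSON object from model output."""
    fenced = _FENCED.search(text)
    candidate = fenced.group(1) if fenced else text
    # Take the outermost braces so leading and trailing prose is ignored.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    snippet = candidate[start : end + 1]
    try:
        return json.loads(snippet)
    except json.JSONDecodeError:
        # Naive repair: drop commas directly before a closing brace or bracket.
        # Limitation: this would also alter such sequences inside string values.
        return json.loads(_TRAILING_COMMA.sub(r"\1", snippet))


reply = 'Verdict: {"score": 0.5, "violations": ["vague",],} Hope that helps.'
print(extract_json(reply))
```

Trying a strict parse first means well-formed output is never touched by the repair step.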

Development

uv sync --all-extras
uv run ruff check . && uv run ruff format --check . && uv run mypy src && uv run pytest --cov=llm_judge_kit --cov-report=term-missing

See CONTRIBUTING.md. The plan of record is in ROADMAP.md.

License

MIT — see LICENSE.
