Provider-agnostic, reproducible, typed LLM-as-a-judge — a small primitive you can depend on.

LLMJudge

LLMJudge is one tiny, well-tested module for scoring model outputs with an LLM judge — the part most projects re-implement badly. The core has zero required runtime dependencies, a stable typed API, and runs fully offline in tests via a deterministic mock.

Install

pip install llm-judge-kit                 # core only, zero deps
pip install "llm-judge-kit[openai]"       # + OpenAI-compatible provider
pip install "llm-judge-kit[anthropic]"    # + Anthropic provider
pip install "llm-judge-kit[all]"          # all providers

Quickstart

This runs as-is — no API key, deterministic (examples/quickstart.py):

from llm_judge_kit import Judge, MockProvider

# MockProvider(fixed_score=...) keeps this example deterministic and offline.
judge = Judge(provider=MockProvider(fixed_score=0.9), rubric="factuality")

result = judge.score(
    prompt="What is the capital of France?",
    response="The capital of France is Paris.",
)

assert result > 0.8  # a JudgeResult compares like its float score
print(f"score={result.score}  confidence={result.confidence}")
print(f"passed={result.passed()}  reason={result.reason!r}")

With a real model, swap the provider for a spec string — nothing else changes:

judge = Judge(provider="openai:gpt-5", rubric="factuality")
result = judge.score(prompt, response)
if not result.passed(0.7):
    print("Failed:", result.reason, result.violations)

Core concepts

Piece | What it is
--- | ---
Judge(provider, rubric) | Pairs a model backend with a rubric; score() returns a JudgeResult.
JudgeResult | Frozen, typed verdict: score, confidence, reason, evidence, violations, raw, metadata. Compares and casts like its score.
Provider | A Protocol with one method, complete(prompt) -> ProviderResponse.
Rubric | Declarative description of what to evaluate; renders a strict-JSON judging prompt.

JudgeResult ergonomics

r = judge.score(prompt, response)
r.score            # float in [0, 1]
float(r)           # same number
r > 0.8            # compares like its score
r.passed(0.7)      # bool against a threshold
r.reason           # short justification
r.evidence         # tuple of supporting quotes/facts
r.violations       # tuple of failed criteria
r.metadata         # provider, model, token usage, latency, cost
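
For readers curious how a result object can "compare and cast like its score", here is a minimal standalone sketch. SketchResult is a hypothetical stand-in, not the library's JudgeResult, and the 0.5 default threshold is an assumption (the README does not state the real default).

```python
from dataclasses import dataclass
from functools import total_ordering


@total_ordering
@dataclass(frozen=True, eq=False)
class SketchResult:
    """Hypothetical stand-in for JudgeResult; not the library's class."""

    score: float
    reason: str = ""

    def __float__(self) -> float:
        # float(result) yields the bare score.
        return self.score

    def __eq__(self, other: object) -> bool:
        if isinstance(other, (SketchResult, int, float)):
            return float(self) == float(other)
        return NotImplemented

    def __lt__(self, other: object) -> bool:
        # total_ordering derives >, >=, <= from __lt__ and __eq__.
        if isinstance(other, (SketchResult, int, float)):
            return float(self) < float(other)
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self.score)

    def passed(self, threshold: float = 0.5) -> bool:
        # Assumed default threshold; chosen only for illustration.
        return self.score >= threshold


r = SketchResult(score=0.9, reason="matches known facts")
```

Freezing the dataclass keeps a verdict immutable once produced, which is what makes it safe to hash and to share between reports.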

Built-in rubrics

factuality, groundedness (requires context=), relevance, instruction_following, safety. List them with llm_judge_kit.available_rubrics().

judge = Judge(provider="openai:gpt-5", rubric="groundedness")
result = judge.score(question, answer, context=retrieved_docs)  # RAG check

Consensus (vote across models)

Run several judge models and aggregate — confidence reflects how much they agree (examples/consensus.py):

judge = Judge.consensus(
    ["openai:gpt-5", "anthropic:claude-opus-4-8", "ollama:llama3"],
    rubric="factuality",
)
result = judge.score(prompt, response)
result.score          # mean (or median) of member scores
result.confidence     # high when members agree, low when they diverge
result.metadata["votes"]   # each member's score
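
The README does not spell out how agreement is turned into a confidence value. As one possible reading, the sketch below averages member scores and maps their spread to a confidence in [0, 1]. The function name aggregate_votes and the formula 1 - 2 * stddev are assumptions for illustration, not the library's implementation.

```python
from statistics import mean, median, pstdev


def aggregate_votes(votes, use_median=False):
    """Combine per-model scores into (score, confidence).

    votes maps a provider spec to its score in [0, 1].
    """
    scores = list(votes.values())
    score = median(scores) if use_median else mean(scores)
    # For values in [0, 1] the population std-dev is at most 0.5,
    # so 1 - 2 * stddev lands in [0, 1]: 1.0 means full agreement.
    confidence = 1.0 - 2.0 * pstdev(scores)
    return score, max(0.0, min(1.0, confidence))


agree_score, agree_conf = aggregate_votes({"a": 0.8, "b": 0.8, "c": 0.8})
split_score, split_conf = aggregate_votes({"a": 0.0, "b": 1.0})
```

Using the median instead of the mean makes the aggregate less sensitive to a single outlier judge, at the cost of ignoring how far that outlier strayed.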

Reliability & caching

Both wrappers are providers, so they compose around any backend (examples/reliability_and_cache.py):

from llm_judge_kit import Judge, OpenAIProvider, RetryProvider, CachingProvider

provider = CachingProvider(                      # memoize identical calls
    RetryProvider(                               # retry w/ backoff + timeout
        OpenAIProvider(model="gpt-5"), retries=3, timeout=30,
    )
)
judge = Judge(provider=provider, rubric="factuality")
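
To illustrate why "wrappers are providers" makes composition work, the sketch below builds two standalone wrappers that expose the same complete() method as whatever they wrap. RetrySketch, CacheSketch, and FlakyBackend are hypothetical names; timeouts are left out, and treating retries=3 as "three extra attempts after the first" is an assumption.

```python
import time


class RetrySketch:
    """Hypothetical retry wrapper; same complete() as the object it wraps."""

    def __init__(self, inner, retries=3, base_delay=0.5, sleep=time.sleep):
        self.inner = inner
        self.retries = retries
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests need not actually wait

    def complete(self, prompt, **kwargs):
        for attempt in range(self.retries + 1):
            try:
                return self.inner.complete(prompt, **kwargs)
            except Exception:
                if attempt == self.retries:
                    raise  # out of retries: surface the last error
                self.sleep(self.base_delay * 2**attempt)  # exponential backoff


class CacheSketch:
    """Hypothetical in-memory memoizer keyed on prompt and kwargs."""

    def __init__(self, inner):
        self.inner = inner
        self._store = {}

    def complete(self, prompt, **kwargs):
        # Assumes kwargs values are hashable.
        key = (prompt, tuple(sorted(kwargs.items())))
        if key not in self._store:
            self._store[key] = self.inner.complete(prompt, **kwargs)
        return self._store[key]


class FlakyBackend:
    """Fails a fixed number of times, then answers; stands in for a model."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def complete(self, prompt, **kwargs):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("transient failure")
        return f"answer to {prompt!r}"


backend = FlakyBackend(failures=2)
provider = CacheSketch(RetrySketch(backend, retries=3, sleep=lambda seconds: None))
first = provider.complete("ping")   # two failures, then success on the third call
second = provider.complete("ping")  # served from the cache; backend not called
```

Putting the cache outermost means a cached answer skips the retry logic entirely, while a cache miss still benefits from it.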

The cache key combines the library version, provider, model, prompt, and kwargs, so identical calls map to the same entry and entries written by an older library version are not reused after an upgrade. Logs are emitted on the llm_judge_kit logger, which is silent by default; call enable_debug_logging() to see them.
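
A key built from those five parts could look like the sketch below. The function cache_key and the JSON-plus-SHA-256 encoding are assumptions chosen for illustration; the README names the key's components but not how they are serialized.

```python
import hashlib
import json


def cache_key(version, provider, model, prompt, **kwargs):
    """Hypothetical helper: derive a stable key from the five named parts."""
    payload = json.dumps(
        {
            "version": version,
            "provider": provider,
            "model": model,
            "prompt": prompt,
            "kwargs": kwargs,
        },
        sort_keys=True,       # kwargs order must not change the key
        ensure_ascii=False,
        default=repr,         # fall back to repr for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = cache_key("0.1.0", "openai", "gpt-5", "Is Paris in France?", temperature=0, seed=1)
k2 = cache_key("0.1.0", "openai", "gpt-5", "Is Paris in France?", seed=1, temperature=0)
k3 = cache_key("0.2.0", "openai", "gpt-5", "Is Paris in France?", temperature=0, seed=1)
```

Here k1 and k2 are equal because key order is normalized, while k3 differs because the version is part of the hashed payload.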

Integrations

pytest — eval as ordinary tests

Installing llm_judge_kit registers a pytest plugin (no conftest wiring). The llm_judge_kit fixture turns an eval into a normal test; a failure reads like any other failing assertion (score, reason, violations):

def test_answer_is_grounded(llm_judge_kit):
    llm_judge_kit.assert_passes(
        prompt="How tall is the Eiffel Tower?",
        response=my_rag_pipeline("How tall is the Eiffel Tower?"),
        rubric="groundedness",
        context=retrieved_docs,
        threshold=0.7,
    )

Pick the judge model once for the whole suite — it defaults to mock (offline), so tests are green until you point them at a real model:

pytest --llm-judge-kit-provider "openai:gpt-5"      # or: export LLM_JUDGE_KIT_PROVIDER=...

Any framework

LLMJudge judges strings, so it drops into any stack — LangChain, LlamaIndex, DSPy, a raw script — with no adapter. Whatever produces the output, pass it in:

output = my_chain.invoke(question)          # LangChain / LlamaIndex / your code
result = Judge(provider="openai:gpt-5", rubric="relevance").score(question, output)

Extend without touching the core

Add a rubric (examples/custom_rubric.py):

from llm_judge_kit import Rubric, register_rubric

register_rubric(Rubric(
    name="conciseness",
    description="Whether the response is as short as possible while complete.",
    criteria=("No filler or repetition.", "Every sentence earns its place."),
))
judge = Judge(provider="openai:gpt-5", rubric="conciseness")
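
The README says a rubric "renders a strict-JSON judging prompt" without showing the rendered text. The sketch below is one guess at what such rendering could involve. SketchRubric, render_prompt, and the exact prompt wording are all hypothetical; the real template may differ substantially.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchRubric:
    """Hypothetical mirror of the fields used in register_rubric above."""

    name: str
    description: str
    criteria: tuple[str, ...] = ()


def render_prompt(rubric, prompt, response):
    """Build a judging prompt that asks for a JSON-only verdict."""
    lines = [
        f"You are grading a response for {rubric.name}: {rubric.description}",
        "Criteria:",
        *[f"{i}. {text}" for i, text in enumerate(rubric.criteria, 1)],
        "",
        f"PROMPT:\n{prompt}",
        f"RESPONSE:\n{response}",
        "",
        # Illustrative instruction only; not the library's actual wording.
        'Reply with JSON only: {"score": <0..1>, "reason": "<short>", '
        '"violations": [<failed criteria>]}',
    ]
    return "\n".join(lines)


rubric = SketchRubric(
    name="conciseness",
    description="Whether the response is as short as possible while complete.",
    criteria=("No filler or repetition.", "Every sentence earns its place."),
)
text = render_prompt(rubric, "Summarize the report.", "The report says X.")
```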

Add a provider — implement one method, optionally register a scheme:

from llm_judge_kit import ProviderResponse, register_provider

class MyProvider:
    name = "mine"
    def complete(self, prompt: str, **kwargs: object) -> ProviderResponse:
        return ProviderResponse(text=call_my_model(prompt))

register_provider("mine", lambda model: MyProvider())
judge = Judge(provider="mine:v1", rubric="relevance")
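
The "one method" contract and the scheme:model spec strings can be pictured with plain typing.Protocol and a small registry. Everything below (SketchResponse, SketchProvider, register, resolve, EchoProvider) is a hypothetical reconstruction for illustration; the library's own Provider and register_provider may carry more fields or checks.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, runtime_checkable


@dataclass(frozen=True)
class SketchResponse:
    text: str


@runtime_checkable
class SketchProvider(Protocol):
    """Structural type: any object with a matching complete() qualifies."""

    def complete(self, prompt: str, **kwargs: object) -> SketchResponse: ...


_FACTORIES: dict[str, Callable[[str], SketchProvider]] = {}


def register(scheme: str, factory: Callable[[str], SketchProvider]) -> None:
    _FACTORIES[scheme] = factory


def resolve(spec: str) -> SketchProvider:
    """Turn 'scheme:model' into a provider instance via the registry."""
    scheme, _, model = spec.partition(":")
    try:
        factory = _FACTORIES[scheme]
    except KeyError:
        raise ValueError(f"unknown provider scheme: {scheme!r}") from None
    return factory(model)


class EchoProvider:
    """No inheritance needed; the method signature is the whole contract."""

    def __init__(self, model: str) -> None:
        self.model = model

    def complete(self, prompt: str, **kwargs: object) -> SketchResponse:
        return SketchResponse(text=f"{self.model}:{prompt}")


register("echo", EchoProvider)
echo = resolve("echo:v1")
```

Because the protocol is structural, third-party backends can satisfy it without importing anything from the package that defines it.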

CLI & batch evaluation

Score a whole dataset and get a report — JSON, Markdown, or HTML. A dataset is JSON Lines (prompt + response, optional context/reference/id); see examples/sample_dataset.jsonl.

llm-judge-kit eval cases.jsonl --provider openai:gpt-5 --rubric factuality --format md
llm-judge-kit eval cases.jsonl --fail-under 0.9            # exit non-zero in CI if pass rate drops
llm-judge-kit compare cases.jsonl --provider openai:gpt-5 --provider anthropic:claude-opus-4-8
llm-judge-kit report report.json --format html -o report.html
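
For preparing a dataset file, a small validator along these lines may help catch malformed rows before a run. It assumes only what is stated above (one JSON object per line, with prompt and response required); iter_cases is a hypothetical helper, not the library's load_dataset.

```python
import json

REQUIRED_KEYS = ("prompt", "response")


def iter_cases(lines):
    """Yield one dict per non-empty JSON Lines row, checking required keys."""
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        case = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in case]
        if missing:
            raise ValueError(f"line {lineno}: missing {missing}")
        yield case


sample = [
    '{"id": "q1", "prompt": "Capital of France?", "response": "Paris."}',
    "",
    '{"prompt": "2 + 2?", "response": "4", "reference": "4"}',
]
cases = list(iter_cases(sample))
```

To validate a file on disk, pass the open file object directly, since iterating a text file yields its lines.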

Same thing in code (examples/benchmark_report.py):

from llm_judge_kit import Judge, load_dataset, run_benchmark, render_markdown

cases = load_dataset("cases.jsonl")
judge = Judge(provider="openai:gpt-5", rubric="factuality")
report = run_benchmark(judge, cases, provider="openai:gpt-5", rubric="factuality")
print(report.pass_rate, report.mean_score)
print(render_markdown(report))
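
The two headline numbers, and the CI gate behind --fail-under, come down to simple arithmetic. The sketch below assumes a case passes when its score meets a threshold and that the gate compares pass rate against the --fail-under value; summarize, exit_code, and the 0.7 threshold are illustrative choices, not the library's code.

```python
def summarize(scores, threshold=0.7):
    """Return (pass_rate, mean_score) for a list of scores in [0, 1]."""
    if not scores:
        raise ValueError("no scores to summarize")
    passed = sum(1 for score in scores if score >= threshold)
    return passed / len(scores), sum(scores) / len(scores)


def exit_code(pass_rate, fail_under):
    """Non-zero when the pass rate drops below the gate, as a CI job expects."""
    return 1 if pass_rate < fail_under else 0


pass_rate, mean_score = summarize([0.9, 0.8, 0.5, 1.0])
```

With these four scores, three of four clear the threshold, so a gate of 0.9 would fail the build while a gate of 0.75 would pass it.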

Why depend on this

  • Easy to depend on — zero transitive deps in the core; provider SDKs are opt-in extras.
  • Reproducible — deterministic offline MockProvider; all unit tests run without network.
  • Typed: mypy --strict clean; ships py.typed.
  • Robust parsing — recovers JSON from markdown fences, prose, and trailing commas.
  • Extensible — new provider / rubric / judge without core changes.
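
As a rough picture of the "robust parsing" point, the sketch below recovers a JSON object from a reply that wraps it in a markdown fence or prose, or that leaves trailing commas. recover_json is a hypothetical helper written for this illustration and is likely simpler than the library's parser; in particular, the trailing-comma regex would also rewrite a ",}" sequence inside a string value.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example nests safely in a fence
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def recover_json(text):
    """Extract and parse the first JSON object found in a model reply."""
    fenced = _FENCED.search(text)
    if fenced:
        text = fenced.group(1)  # prefer the contents of a fenced block
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    candidate = text[start : end + 1]  # drop surrounding prose
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Second attempt: remove commas that directly precede } or ].
        return json.loads(_TRAILING_COMMA.sub(r"\1", candidate))


from_fence = recover_json("Sure! " + FENCE + 'json\n{"score": 0.9,}\n' + FENCE)
from_prose = recover_json('The verdict is {"score": 0.5, "violations": ["a",]} as asked.')
```

Trying a strict parse first keeps well-formed replies untouched; the lenient pass only runs when the strict one fails.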

Development

uv sync --all-extras
uv run ruff check . && uv run ruff format --check . && uv run mypy src && uv run pytest --cov=llm_judge_kit --cov-report=term-missing

See CONTRIBUTING.md. The plan of record is in ROADMAP.md.

License

MIT — see LICENSE.
