llm-evalkit

Deterministic eval gates and reliability primitives for LLM pipelines.


Most LLM eval tooling is either LLM-as-judge (non-deterministic, expensive, not CI-friendly) or a heavy enterprise suite. llm-evalkit is neither.

It gives you two things:

  • Eval gates: code-only quality dimensions that run the same way every time. Drop them into any pipeline, run them in CI, get a pass/fail with a reason.
  • Reliability primitives: retry with backoff, model fallback chains, and a circuit breaker. The building blocks for LLM pipelines that hold up in production.

Install

pip install llm-evalgate

Note that the distribution on PyPI is named llm-evalgate, while the package you import is llm_evalkit.

Quickstart

Eval gates

from llm_evalkit import EvalHarness
from llm_evalkit.eval.dimensions import BlocklistDimension, ReadabilityDimension, SchemaComplianceDimension

harness = EvalHarness([
    BlocklistDimension(terms=["confidential", "internal use only"]),
    ReadabilityDimension(threshold=0.3),
    SchemaComplianceDimension(required_fields=["title:", "summary:"]),
])

report = harness.run(llm_output)  # llm_output: the text your model produced

if not report.passed:
    print(report)
    # EvalReport: FAIL
    #   FAIL [blocklist] score=0.000 — prohibited terms found: ['confidential']
    #   PASS [readability] score=0.612 — Flesch ease=61.2, FK grade=8.4
    #   PASS [schema_compliance] score=1.000 — all 2 required fields present

Custom dimension

import json

from llm_evalkit import Dimension, EvalHarness

class JsonDimension(Dimension):
    """Scores 1.0 when the output parses as JSON, 0.0 otherwise."""

    def evaluate(self, text: str) -> tuple[float, str]:
        try:
            json.loads(text)
            return 1.0, "valid JSON"
        except json.JSONDecodeError as e:
            return 0.0, f"invalid JSON: {e}"

# threshold=1.0: only a perfect score passes the gate
harness = EvalHarness([JsonDimension(threshold=1.0)])
report = harness.run('{"key": "value"}')
assert report.passed

Retry

from llm_evalkit.reliable import retry

@retry(max_attempts=3, backoff=2.0)
def call_llm(prompt: str) -> str:
    return client.messages.create(...)
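
To make the retry behaviour concrete, here is a minimal standalone sketch of a retry decorator with exponential backoff. It is not llm-evalkit's implementation: the name retry_sketch, the base_delay and sleep parameters, and the reading of backoff as a delay multiplier are all assumptions made for illustration.

```python
import functools
import time


def retry_sketch(max_attempts=3, backoff=2.0, base_delay=1.0, sleep=time.sleep):
    """Hypothetical stand-in for a retry decorator (not the library's code)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= backoff  # assumed: backoff multiplies the wait
        return wrapper
    return decorator


# Usage: a call that fails twice, then succeeds on the third attempt.
calls = {"n": 0}
delays = []  # records the waits instead of actually sleeping


@retry_sketch(max_attempts=3, backoff=2.0, sleep=delays.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = flaky()
```

Injecting sleep keeps the sketch testable without real waiting. A production version would usually retry only on specific exception types rather than on every Exception.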

Fallback chain

from llm_evalkit.reliable import with_fallback, with_fallback_chain

# two-model fallback
result = with_fallback(
    primary=lambda: call_model("claude-opus-4-8", prompt),
    fallback=lambda: call_model("claude-sonnet-4-6", prompt),
)

# ordered chain — first success wins
result = with_fallback_chain([
    lambda: call_model("claude-opus-4-8", prompt),
    lambda: call_model("claude-sonnet-4-6", prompt),
    lambda: call_model("claude-haiku-4-5", prompt),
])
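
The "first success wins" rule can be sketched in a few lines. This is an illustrative stand-in rather than the library's code: fallback_chain_sketch is a hypothetical name, and the choice to re-raise the last error when every callable fails is an assumption.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def fallback_chain_sketch(callables: Sequence[Callable[[], T]]) -> T:
    """Hypothetical stand-in: return the first result that does not raise."""
    if not callables:
        raise ValueError("at least one callable is required")
    last_error = None
    for fn in callables:
        try:
            return fn()  # first success wins; later entries are never called
        except Exception as exc:
            last_error = exc  # remember the failure and try the next entry
    raise last_error  # assumed behaviour: every entry failed


def primary():
    raise TimeoutError("primary model unavailable")


def secondary():
    return "answer from secondary"


def tertiary():
    return "answer from tertiary"


answer = fallback_chain_sketch([primary, secondary, tertiary])
```

Because the entries are zero-argument callables, nothing runs until the chain reaches it, which is why the examples above wrap each model call in a lambda.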

Circuit breaker

from llm_evalkit.reliable import CircuitBreaker, CircuitOpenError

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60)

try:
    with breaker:
        result = call_llm(prompt)
except CircuitOpenError:
    result = cached_response  # serve from cache while circuit is open
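
For readers unfamiliar with the pattern, the following is a rough sketch of the state machine a circuit breaker typically implements: closed, open after repeated failures, and a single trial call once the recovery timeout has elapsed. The class and exception names, the injectable clock, and the exact half-open behaviour are assumptions for illustration and may differ from llm-evalkit's CircuitBreaker.

```python
import time


class CircuitOpenSketchError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreakerSketch:
    """Hypothetical stand-in for a circuit breaker (not the library's code)."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def __enter__(self):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenSketchError("circuit is open")
            # Timeout elapsed: let one trial call through (half-open).
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Success closes the circuit and clears the failure count.
            self._failures = 0
            self._opened_at = None
        else:
            self._failures += 1
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
        return False  # never swallow the caller's exception


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreakerSketch(failure_threshold=2, recovery_timeout=60.0, clock=lambda: now[0])


def failing_call():
    raise ConnectionError("upstream down")


events = []
for _ in range(3):
    try:
        with breaker:
            failing_call()
    except CircuitOpenSketchError:
        events.append("rejected")
    except ConnectionError:
        events.append("failed")

now[0] = 61.0  # past the recovery timeout: one trial call is allowed
with breaker:
    events.append("trial ok")
```

The third call is rejected without reaching the upstream service at all, which is the point of the pattern: a struggling dependency gets time to recover while callers serve a cached or degraded response.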

Built-in dimensions

Dimension | What it checks | Default threshold
BlocklistDimension | No prohibited terms in output | 1.0 (zero tolerance)
ReadabilityDimension | Flesch Reading Ease score | 0.3 (college-level prose)
SchemaComplianceDimension | Required fields are present | 1.0 (all fields)
FactualGroundingDimension | Numeric claims traceable to evidence | 0.85

All dimensions follow the same interface: evaluate(text) -> (score, detail). Writing a new one takes about ten lines, as the JsonDimension example above shows.
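
As an illustration of what a readability dimension computes, here is a standalone sketch of the standard Flesch Reading Ease formula, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The syllable counter is a crude heuristic of our own, and mapping the result to a 0 to 1 score by dividing by 100 is an assumption inferred from the sample report (score=0.612 next to Flesch ease=61.2); the library's tokenisation and syllable counting may differ.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease formula over naive sentence and word splits."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def readability_score(text: str) -> tuple[float, str]:
    """Assumed mapping: ease / 100, clamped to the range [0, 1]."""
    ease = flesch_reading_ease(text)
    score = min(max(ease / 100.0, 0.0), 1.0)
    return score, f"Flesch ease={ease:.1f}"


score, detail = readability_score("The cat sat on the mat.")
```

Under this mapping, the default threshold of 0.3 corresponds to a Flesch Reading Ease of 30, which is conventionally the lower bound of college-level text.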

Why deterministic?

LLM-as-judge eval is useful for research. In production pipelines, you need:

  • The same input to produce the same pass/fail result every run
  • CI to catch regressions without burning tokens on every commit
  • An audit trail that doesn't depend on a model that may drift

llm-evalkit eval dimensions are pure functions. No model calls, no network, no randomness.
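
Because a check like this is a pure function, it can be exercised in CI like any other unit test. The sketch below uses a hypothetical blocklist_check function rather than the library's BlocklistDimension, and its case-insensitive substring matching is an assumption.

```python
def blocklist_check(text: str, terms: list[str]) -> tuple[float, str]:
    """Hypothetical pure check: 0.0 if any term appears, 1.0 otherwise."""
    lowered = text.lower()
    found = [t for t in terms if t.lower() in lowered]  # assumed: case-insensitive
    if found:
        return 0.0, f"prohibited terms found: {found}"
    return 1.0, "no prohibited terms found"


def test_blocklist_is_deterministic():
    text = "This memo is CONFIDENTIAL."
    terms = ["confidential", "internal use only"]
    first = blocklist_check(text, terms)
    # The same input must give the identical result on every run.
    assert all(blocklist_check(text, terms) == first for _ in range(100))
    assert first == (0.0, "prohibited terms found: ['confidential']")


def test_clean_text_passes():
    score, detail = blocklist_check("Quarterly summary for all readers.", ["confidential"])
    assert score == 1.0
    assert detail == "no prohibited terms found"
```

Tests of this shape need no API key, no network access, and no token budget, so they can run on every commit.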

Composing with a pipeline

from llm_evalkit import EvalHarness
from llm_evalkit.eval.dimensions import BlocklistDimension, ReadabilityDimension
from llm_evalkit.reliable import retry, with_fallback

harness = EvalHarness([
    BlocklistDimension(terms=["[REDACTED]", "TODO"]),
    ReadabilityDimension(threshold=0.2),
])

@retry(max_attempts=3, backoff=2.0)
def generate(prompt: str) -> str:
    return with_fallback(
        primary=lambda: call_model("claude-opus-4-8", prompt),
        fallback=lambda: call_model("claude-sonnet-4-6", prompt),
    )

output = generate(prompt)
report = harness.run(output)
if not report.passed:
    raise ValueError(f"Output failed eval gate:\n{report}")

License

MIT
