Deterministic eval gates and reliability primitives for LLM pipelines

llm-evalgate


Most LLM eval tooling is either LLM-as-judge (non-deterministic, expensive, not CI-friendly) or a heavy enterprise suite. llm-evalgate is neither.

It gives you two things:

  • Eval gates: code-only quality dimensions that run the same way every time. Drop them into any pipeline, run them in CI, get a pass/fail with a reason.
  • Reliability primitives: retry with backoff, model fallback chains, and a circuit breaker. The building blocks for LLM pipelines that hold up in production.

Install

pip install llm-evalgate

Quickstart

Eval gates

from llm_evalgate import EvalHarness
from llm_evalgate.eval.dimensions import BlocklistDimension, ReadabilityDimension, SchemaComplianceDimension

harness = EvalHarness([
    BlocklistDimension(terms=["confidential", "internal use only"]),
    ReadabilityDimension(threshold=0.3),
    SchemaComplianceDimension(required_fields=["title:", "summary:"]),
])

# llm_output is the model's response text (a str) produced by your pipeline
report = harness.run(llm_output)

if not report.passed:
    print(report)
    # EvalReport: FAIL
    #   FAIL [blocklist] score=0.000 — prohibited terms found: ['confidential']
    #   PASS [readability] score=0.612 — Flesch ease=61.2, FK grade=8.4
    #   PASS [schema_compliance] score=1.000 — all 2 required fields present
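
To make the gate idea concrete without the library installed, here is a minimal standalone sketch of how per-dimension scores can be folded into one pass/fail result with reasons. Every name in it (Result, blocklist, required_fields, run_gate) is hypothetical and chosen for illustration; it is not llm-evalgate's implementation, and the pass rule "score >= threshold" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-ins for illustration; these are NOT llm-evalgate's classes.
Check = Callable[[str], tuple[float, str]]


@dataclass(frozen=True)
class Result:
    name: str
    score: float
    threshold: float
    detail: str

    @property
    def passed(self) -> bool:
        # Assumed rule: a check passes when its score meets the threshold.
        return self.score >= self.threshold


def blocklist(terms: list[str]) -> Check:
    """Score 1.0 when no term appears (case-insensitive), else 0.0."""
    def check(text: str) -> tuple[float, str]:
        lowered = text.lower()
        found = [t for t in terms if t.lower() in lowered]
        if found:
            return 0.0, f"prohibited terms found: {found}"
        return 1.0, "no prohibited terms"
    return check


def required_fields(fields: list[str]) -> Check:
    """Score is the fraction of required field markers present in the text."""
    def check(text: str) -> tuple[float, str]:
        missing = [f for f in fields if f not in text]
        score = (len(fields) - len(missing)) / len(fields) if fields else 1.0
        if missing:
            return score, f"missing fields: {missing}"
        return score, f"all {len(fields)} required fields present"
    return check


def run_gate(text: str, checks: dict[str, tuple[Check, float]]) -> tuple[bool, list[Result]]:
    """Run every check; the gate passes only if all checks pass."""
    results = []
    for name, (check, threshold) in checks.items():
        score, detail = check(text)
        results.append(Result(name, score, threshold, detail))
    return all(r.passed for r in results), results


checks = {
    "blocklist": (blocklist(["confidential"]), 1.0),
    "schema_compliance": (required_fields(["title:", "summary:"]), 1.0),
}
passed, results = run_gate("title: Q3\nsummary: This is confidential.", checks)
for r in results:
    print("PASS" if r.passed else "FAIL", r.name, f"score={r.score:.3f}", r.detail)
```

Running every check instead of stopping at the first failure keeps the report complete, so one CI run shows all the reasons an output was rejected.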

Custom dimension

import json

from llm_evalgate import Dimension, EvalHarness

class JsonDimension(Dimension):
    """Scores 1.0 when the text parses as JSON, 0.0 otherwise."""
    def evaluate(self, text: str) -> tuple[float, str]:
        try:
            json.loads(text)
            return 1.0, "valid JSON"
        except json.JSONDecodeError as e:
            return 0.0, f"invalid JSON: {e}"

# threshold=1.0 means only fully valid JSON passes
harness = EvalHarness([JsonDimension(threshold=1.0)])
report = harness.run('{"key": "value"}')
assert report.passed

Retry

from llm_evalgate.reliable import retry

# `client` is an already-configured LLM SDK client, defined elsewhere.
@retry(max_attempts=3, backoff=2.0)
def call_llm(prompt: str) -> str:
    return client.messages.create(...)
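
For readers unfamiliar with the pattern, the following is a standalone sketch of retry with exponential backoff. It is not the library's code: retry_sketch, base_delay, and the injectable sleep parameter are hypothetical, and treating backoff as a multiplier on the delay is an assumption about what the parameter means.

```python
import functools
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_sketch(
    max_attempts: int = 3,
    backoff: float = 2.0,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
):
    """Retry a function, multiplying the delay by `backoff` after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(fn: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs) -> T:
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= backoff
            raise AssertionError("unreachable")
        return wrapper
    return decorator


# Usage: a function that fails twice, then succeeds on the third attempt.
delays: list[float] = []
calls = {"n": 0}


@retry_sketch(max_attempts=3, backoff=2.0, base_delay=0.5, sleep=delays.append)
def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky()
```

In real use you would likely narrow the caught exception types to transient errors such as timeouts and rate limits, so that programming errors fail immediately.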

Fallback chain

from llm_evalgate.reliable import with_fallback, with_fallback_chain

# two-model fallback
result = with_fallback(
    primary=lambda: call_model("claude-opus-4-8", prompt),
    fallback=lambda: call_model("claude-sonnet-4-6", prompt),
)

# ordered chain — first success wins
result = with_fallback_chain([
    lambda: call_model("claude-opus-4-8", prompt),
    lambda: call_model("claude-sonnet-4-6", prompt),
    lambda: call_model("claude-haiku-4-5", prompt),
])
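
The "first success wins" behavior can be sketched in a few lines of plain Python. The helper below (first_success, a hypothetical name) illustrates the pattern and is not the source of with_fallback_chain; how the real function reports a chain where every candidate fails is not specified above, so the RuntimeError here is an assumption.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_success(candidates: Sequence[Callable[[], T]]) -> T:
    """Call each candidate in order and return the first result that does not raise."""
    if not candidates:
        raise ValueError("need at least one candidate")
    errors: list[Exception] = []
    for candidate in candidates:
        try:
            return candidate()
        except Exception as exc:
            errors.append(exc)  # remember the failure, then try the next one
    raise RuntimeError(f"all {len(candidates)} candidates failed: {errors!r}") from errors[-1]


# Usage with stub "models" that record the order in which they were tried.
tried: list[str] = []


def failing(name: str) -> Callable[[], str]:
    def call() -> str:
        tried.append(name)
        raise TimeoutError(f"{name} timed out")
    return call


def working(name: str) -> Callable[[], str]:
    def call() -> str:
        tried.append(name)
        return f"answer from {name}"
    return call


result = first_success([failing("model-a"), working("model-b"), working("model-c")])
```

Passing zero-argument callables (as the lambdas above do) defers each model call until it is actually needed, so later models in the chain cost nothing when an earlier one succeeds.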

Circuit breaker

from llm_evalgate.reliable import CircuitBreaker, CircuitOpenError

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60)

try:
    with breaker:
        result = call_llm(prompt)
except CircuitOpenError:
    result = cached_response  # serve from cache while circuit is open
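
The state machine behind a circuit breaker is small enough to sketch. The class below is an illustrative standalone version, not llm-evalgate's CircuitBreaker: the names BreakerSketch and OpenError, the injectable clock, and the half-open rule (one trial call after the timeout, reopening on failure) are all assumptions.

```python
import time
from typing import Callable, Optional


class OpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class BreakerSketch:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 60.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def __enter__(self) -> "BreakerSketch":
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise OpenError("circuit is open")
            # Timeout elapsed: half-open, so let one trial call through.
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            # Success closes the circuit and clears the failure count.
            self._failures = 0
            self._opened_at = None
        else:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or reopen after a failed trial
        return False  # never swallow the caller's exception


# Usage with a fake clock so the recovery timeout can be simulated.
now = [0.0]
breaker = BreakerSketch(failure_threshold=2, recovery_timeout=30.0, clock=lambda: now[0])


def attempt(should_fail: bool) -> str:
    try:
        with breaker:
            if should_fail:
                raise ConnectionError("upstream down")
            return "live"
    except OpenError:
        return "cached"
    except ConnectionError:
        return "error"


outcomes = [attempt(True), attempt(True), attempt(False)]  # two failures open the circuit
now[0] = 31.0  # recovery timeout has elapsed
outcomes.append(attempt(False))  # trial call succeeds and closes the circuit
outcomes.append(attempt(False))
```

Using a monotonic clock avoids surprises when the system wall clock is adjusted while the circuit is open.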

Built-in dimensions

Dimension                 | What it checks                       | Default threshold
BlocklistDimension        | No prohibited terms in output        | 1.0 (zero tolerance)
ReadabilityDimension      | Flesch Reading Ease score            | 0.3 (college-level prose)
SchemaComplianceDimension | Required fields are present          | 1.0 (all fields)
FactualGroundingDimension | Numeric claims traceable to evidence | 0.85

All dimensions follow the same interface: evaluate(text) -> (score, detail). Writing a new one is ten lines.
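
As an illustration of a check that fits the evaluate(text) -> (score, detail) shape, here is a standalone sketch of a readability score based on the standard Flesch Reading Ease formula. It is not the library's ReadabilityDimension: the syllable counter is a rough heuristic, and mapping the score as ease / 100 clamped to [0, 1] is inferred from the sample report earlier on this page (score=0.612 next to Flesch ease=61.2), not confirmed.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))


def readability(text: str) -> tuple[float, str]:
    """Return (score, detail); the score is ease / 100 clamped to [0, 1] (assumed mapping)."""
    ease = flesch_reading_ease(text)
    score = min(max(ease / 100.0, 0.0), 1.0)
    return score, f"Flesch ease={ease:.1f}"


plain_score, plain_detail = readability("The cat sat on the mat. It was warm.")
dense_score, dense_detail = readability(
    "Organizational interoperability necessitates comprehensive infrastructural standardization."
)
```

Short sentences of one-syllable words land at the top of the scale, while a single sentence of long polysyllabic words falls below a 0.3 threshold.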

Why deterministic?

LLM-as-judge eval is useful for research. In production pipelines, you need:

  • The same input to produce the same pass/fail result every run
  • CI to catch regressions without burning tokens on every commit
  • An audit trail that doesn't depend on a model that may drift

llm-evalgate eval dimensions are pure functions. No model calls, no network, no randomness.
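
Because the checks are pure functions, they can be pinned with ordinary unit tests. The sketch below shows one way to do that in a pytest-style file using a tiny standalone gate; the gate function, the BLOCKED and REQUIRED constants, and the golden cases are all hypothetical stand-ins, not part of llm-evalgate.

```python
# Hypothetical stand-in gate; in a real project you would call your EvalHarness here.
BLOCKED = ("confidential", "internal use only")
REQUIRED = ("title:", "summary:")


def gate(text: str) -> tuple[bool, str]:
    """Pure function: no model calls, no network, no randomness."""
    lowered = text.lower()
    hits = [term for term in BLOCKED if term in lowered]
    if hits:
        return False, f"prohibited terms found: {hits}"
    missing = [field for field in REQUIRED if field not in text]
    if missing:
        return False, f"missing fields: {missing}"
    return True, "ok"


# Golden cases: known inputs with the verdict they must keep across commits.
GOLDEN_CASES = [
    ("title: Launch plan\nsummary: Ship on Friday.", True),
    ("title: Launch plan", False),  # summary field missing
    ("title: Notes\nsummary: Internal use only.", False),  # blocklisted phrase
]


def test_golden_cases() -> None:
    for text, expected in GOLDEN_CASES:
        passed, reason = gate(text)
        assert passed is expected, f"{text!r}: {reason}"


def test_same_input_same_verdict() -> None:
    sample = GOLDEN_CASES[0][0]
    verdicts = {gate(sample) for _ in range(50)}
    assert len(verdicts) == 1  # identical result on every run
```

A regression in a check then shows up as a failing test on the commit that introduced it, with no tokens spent.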

Composing with a pipeline

from llm_evalgate import EvalHarness
from llm_evalgate.eval.dimensions import BlocklistDimension, ReadabilityDimension
from llm_evalgate.reliable import retry, with_fallback

harness = EvalHarness([
    BlocklistDimension(terms=["[REDACTED]", "TODO"]),
    ReadabilityDimension(threshold=0.2),
])

# Each retry attempt runs the full fallback: primary first, then fallback.
@retry(max_attempts=3, backoff=2.0)
def generate(prompt: str) -> str:
    return with_fallback(
        primary=lambda: call_model("claude-opus-4-8", prompt),
        fallback=lambda: call_model("claude-sonnet-4-6", prompt),
    )

output = generate(prompt)
report = harness.run(output)
if not report.passed:
    raise ValueError(f"Output failed eval gate:\n{report}")

License

MIT
