Deterministic eval gates and reliability primitives for LLM pipelines

llm-evalkit

Most LLM eval tooling is either LLM-as-judge (non-deterministic, expensive, not CI-friendly) or a heavy enterprise suite. llm-evalkit is neither.

It gives you two things:

Eval gates: code-only quality dimensions that run the same way every time. Drop them into any pipeline, run them in CI, get a pass/fail with a reason.
Reliability primitives: retry with backoff, model fallback chains, and a circuit breaker. The building blocks for LLM pipelines that hold up in production.

Install

pip install llm-evalgate

The distribution on PyPI is named llm-evalgate; the package you import is llm_evalkit.

Quickstart

Eval gates

from llm_evalkit import EvalHarness
from llm_evalkit.eval.dimensions import BlocklistDimension, ReadabilityDimension, SchemaComplianceDimension

harness = EvalHarness([
    BlocklistDimension(terms=["confidential", "internal use only"]),
    ReadabilityDimension(threshold=0.3),
    SchemaComplianceDimension(required_fields=["title:", "summary:"]),
])

report = harness.run(llm_output)

if not report.passed:
    print(report)
    # EvalReport: FAIL
    #   FAIL [blocklist] score=0.000 — prohibited terms found: ['confidential']
    #   PASS [readability] score=0.612 — Flesch ease=61.2, FK grade=8.4
    #   PASS [schema_compliance] score=1.000 — all 2 required fields present
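
To make the pass/fail rule concrete, here is a minimal stand-alone sketch of how a harness could combine per-dimension scores into one verdict. It does not use llm-evalkit: `ResultSketch`, `run_gates`, and the two check functions are hypothetical names. The rule that a dimension passes when its score meets its threshold, and that the report passes only when every dimension does, is an assumption inferred from the sample report above rather than confirmed library behavior.

```python
from dataclasses import dataclass


@dataclass
class ResultSketch:
    """One dimension's outcome (hypothetical stand-in, not the library's type)."""
    name: str
    score: float
    threshold: float
    detail: str

    @property
    def passed(self) -> bool:
        # Assumed rule: a dimension passes when score >= threshold.
        return self.score >= self.threshold


def run_gates(text, gates):
    """gates: list of (name, threshold, check), where check(text) -> (score, detail)."""
    results = []
    for name, threshold, check in gates:
        score, detail = check(text)
        results.append(ResultSketch(name, score, threshold, detail))
    # Assumed rule: the overall verdict passes only if every dimension passes.
    return all(r.passed for r in results), results


def has_fields(text):
    required = ["title:", "summary:"]
    present = [f for f in required if f in text]
    return len(present) / len(required), f"{len(present)} of {len(required)} required fields present"


def no_todo(text):
    return (0.0, "found TODO") if "TODO" in text else (1.0, "no TODO markers")


gates = [("schema", 1.0, has_fields), ("blocklist", 1.0, no_todo)]
ok, results = run_gates("title: Q3 report\nsummary: TODO", gates)
for r in results:
    print("PASS" if r.passed else "FAIL", r.name, r.detail)
```

Each check returns a score and a reason, so a failing run can say which gate tripped and why.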

Custom dimension

import json

from llm_evalkit import Dimension, EvalHarness


class JsonDimension(Dimension):
    """Scores 1.0 when the output parses as JSON, 0.0 otherwise."""

    def evaluate(self, text: str) -> tuple[float, str]:
        try:
            json.loads(text)
            return 1.0, "valid JSON"
        except json.JSONDecodeError as e:
            return 0.0, f"invalid JSON: {e}"


harness = EvalHarness([JsonDimension(threshold=1.0)])
report = harness.run('{"key": "value"}')
assert report.passed

Retry

from llm_evalkit.reliable import retry

# `client` is your LLM SDK client, created elsewhere; call arguments are elided here.
@retry(max_attempts=3, backoff=2.0)
def call_llm(prompt: str) -> str:
    return client.messages.create(...)
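
For readers unfamiliar with the pattern, the following is a rough stdlib-only sketch of what a retry decorator with exponential backoff typically does. It is not llm-evalkit's implementation: `retry_sketch`, `base_delay`, and the injectable `sleep` parameter are hypothetical, and treating `backoff` as a delay multiplier is an assumption about what that argument means.

```python
import functools
import time


def retry_sketch(max_attempts=3, backoff=2.0, base_delay=0.5, sleep=time.sleep):
    """Retry a function, multiplying the wait by `backoff` after each failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= backoff
        return wrapper
    return decorator


calls = {"n": 0}
delays = []  # records the waits instead of actually sleeping


@retry_sketch(max_attempts=3, backoff=2.0, base_delay=0.5, sleep=delays.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = flaky()
```

Passing `sleep` in as a parameter keeps the sketch testable without real waiting; production code would usually also restrict which exception types are retried.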

Fallback chain

from llm_evalkit.reliable import with_fallback, with_fallback_chain

# two-model fallback
result = with_fallback(
    primary=lambda: call_model("claude-opus-4-8", prompt),
    fallback=lambda: call_model("claude-sonnet-4-6", prompt),
)

# ordered chain — first success wins
result = with_fallback_chain([
    lambda: call_model("claude-opus-4-8", prompt),
    lambda: call_model("claude-sonnet-4-6", prompt),
    lambda: call_model("claude-haiku-4-5", prompt),
])
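
The "first success wins" behavior can be sketched in a few lines of plain Python. This is an illustration of the idea only: `fallback_chain_sketch` and `unavailable` are hypothetical names, and how the real `with_fallback_chain` reports a total failure is not documented here, so the `RuntimeError` below is an assumption.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def fallback_chain_sketch(candidates: Sequence[Callable[[], T]]) -> T:
    """Call each zero-argument candidate in order and return the first result."""
    if not candidates:
        raise ValueError("need at least one candidate")
    errors = []
    for candidate in candidates:
        try:
            return candidate()
        except Exception as exc:
            errors.append(exc)  # remember the failure, then try the next one
    # Every candidate failed: raise, chained to the last underlying error.
    raise RuntimeError(f"all {len(errors)} candidates failed") from errors[-1]


def unavailable():
    raise ConnectionError("model unavailable")


answer = fallback_chain_sketch([
    unavailable,
    lambda: "from second model",
    lambda: "never reached",
])
```

Wrapping each call in a lambda defers execution, so later models are only invoked when the earlier ones raise.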

Circuit breaker

from llm_evalkit.reliable import CircuitBreaker, CircuitOpenError

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60)

try:
    with breaker:
        result = call_llm(prompt)
except CircuitOpenError:
    result = cached_response  # serve from cache while circuit is open
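
As a rough picture of the state machine behind a circuit breaker, here is a stand-alone sketch written as a context manager. The class and exception names (`CircuitBreakerSketch`, `CircuitOpenSketchError`) and the injectable `clock` are hypothetical. The half-open behavior (one trial call allowed after the timeout) is the conventional pattern and an assumption here, not a description of llm-evalkit's internals.

```python
import time


class CircuitOpenSketchError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreakerSketch:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def __enter__(self):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenSketchError("circuit is open")
            # Timeout elapsed: half-open, so let this one trial call through.
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Success closes the circuit and clears the failure count.
            self._failures = 0
            self._opened_at = None
        else:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
        return False  # never swallow the caller's exception


now = [0.0]  # fake clock so the example runs instantly
breaker = CircuitBreakerSketch(failure_threshold=2, recovery_timeout=10.0, clock=lambda: now[0])


def failing_call():
    raise TimeoutError("upstream timed out")


outcomes = []
for _ in range(3):
    try:
        with breaker:
            failing_call()
    except CircuitOpenSketchError:
        outcomes.append("short-circuited")
    except TimeoutError:
        outcomes.append("failed")

now[0] = 11.0  # pretend the recovery timeout has elapsed
with breaker:
    outcomes.append("recovered")
```

After two failures the third call is rejected without touching the upstream service, which is the point of the pattern: a struggling dependency stops receiving traffic until the timeout gives it a chance to recover.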

Built-in dimensions

| Dimension | What it checks | Default threshold |
| --- | --- | --- |
| `BlocklistDimension` | No prohibited terms in output | 1.0 (zero tolerance) |
| `ReadabilityDimension` | Flesch Reading Ease score | 0.3 (college-level prose) |
| `SchemaComplianceDimension` | Required fields are present | 1.0 (all fields) |
| `FactualGroundingDimension` | Numeric claims traceable to evidence | 0.85 |

All dimensions follow the same interface: `evaluate(text) -> (score, detail)`. Writing a new one takes about ten lines.
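
To show what a code-only readability check can look like, here is a small sketch of the standard Flesch Reading Ease formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The syllable counter is a crude vowel-group heuristic. Dividing the ease value by 100 to get a 0 to 1 score is an assumption suggested by the sample report (score 0.612 alongside ease 61.2), and all function names are hypothetical. The library's own tokenization and scoring may differ.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, with a minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def readability_score(text: str) -> float:
    """Clamp ease / 100 into [0, 1]; the / 100 mapping is an assumption."""
    return min(1.0, max(0.0, flesch_reading_ease(text) / 100.0))


simple = "The cat sat on the mat. The dog ran to the park."
dense = "Organizational interoperability necessitates comprehensive infrastructural standardization."

print(readability_score(simple), readability_score(dense))
```

Short sentences built from short words score high, while long, polysyllabic sentences score low, so a threshold around 0.3 rejects only fairly dense prose.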

Why deterministic?

LLM-as-judge eval is useful for research. In production pipelines, you need:

The same input to produce the same pass/fail result every run
CI to catch regressions without burning tokens on every commit
An audit trail that doesn't depend on a model that may drift

llm-evalkit eval dimensions are pure functions. No model calls, no network, no randomness.
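
The "pure function" property is easy to check in a test. The sketch below uses a hypothetical stand-alone `blocklist_gate` function, not the library's `BlocklistDimension`, and its case-insensitive matching is an assumption. It runs the same input a hundred times and confirms that only one distinct result comes back.

```python
def blocklist_gate(text: str, terms: list[str]) -> tuple[float, str]:
    """Pure function: the result depends only on the arguments."""
    lowered = text.lower()
    found = [t for t in terms if t.lower() in lowered]
    if found:
        return 0.0, f"prohibited terms found: {found}"
    return 1.0, "no prohibited terms"


sample = "This draft is CONFIDENTIAL until launch."
terms = ["confidential", "internal use only"]

# Collect results in a set: identical (score, detail) tuples collapse to one entry.
runs = {blocklist_gate(sample, terms) for _ in range(100)}
assert len(runs) == 1
```

A check like this costs no tokens and no network access, so it can run on every commit.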

Composing with a pipeline

from llm_evalkit import EvalHarness
from llm_evalkit.eval.dimensions import BlocklistDimension, ReadabilityDimension
from llm_evalkit.reliable import retry, with_fallback

harness = EvalHarness([
    BlocklistDimension(terms=["[REDACTED]", "TODO"]),
    ReadabilityDimension(threshold=0.2),
])

@retry(max_attempts=3, backoff=2.0)
def generate(prompt: str) -> str:
    return with_fallback(
        primary=lambda: call_model("claude-opus-4-8", prompt),
        fallback=lambda: call_model("claude-sonnet-4-6", prompt),
    )

output = generate(prompt)
report = harness.run(output)
if not report.passed:
    raise ValueError(f"Output failed eval gate:\n{report}")

License

MIT

Release history

0.3.0: Jun 4, 2026
0.2.0: Jun 2, 2026
0.1.1: Jun 1, 2026
0.1.0 (this version): Jun 1, 2026

Download files

Source Distribution

llm_evalgate-0.1.0.tar.gz (9.1 kB)

Uploaded Jun 1, 2026 Source

Built Distribution

llm_evalgate-0.1.0-py3-none-any.whl (11.2 kB)

Uploaded Jun 1, 2026 Python 3

File details

Details for the file llm_evalgate-0.1.0.tar.gz.

File metadata

Download URL: llm_evalgate-0.1.0.tar.gz
Upload date: Jun 1, 2026
Size: 9.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for llm_evalgate-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `1815fe1dd37c19f0821e21a3bff1983aa37509d7058c0313d72306853624fb80` |
| MD5 | `9e1cde877b1771dabee0baa9d32549c4` |
| BLAKE2b-256 | `7210efec2a5ff09bb6c94705df73e7c6a22ed04e367c21eb91b02c0cdea271c8` |

File details

Details for the file llm_evalgate-0.1.0-py3-none-any.whl.

File metadata

Download URL: llm_evalgate-0.1.0-py3-none-any.whl
Upload date: Jun 1, 2026
Size: 11.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for llm_evalgate-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `89b36213b66d58fefc14512d71b6e9c3e7eab219194390010bca5e344bebd8d7` |
| MD5 | `892fc27891f820785ab4ce418e720b40` |
| BLAKE2b-256 | `c87ce76aad778b23c101bba53ef55a8cc3dffd15228a6da4c9cf85e976b5a1d1` |
