Deterministic eval gates and reliability primitives for LLM pipelines
llm-evalgate
Most LLM eval tooling is either LLM-as-judge (non-deterministic, expensive, not CI-friendly) or a heavy enterprise suite. llm-evalgate is neither.
It gives you two things:
- Eval gates: code-only quality dimensions that run the same way every time. Drop them into any pipeline, run them in CI, get a pass/fail with a reason.
- Reliability primitives: retry with backoff, model fallback chains, and a circuit breaker. The building blocks for LLM pipelines that hold up in production.
Install
pip install llm-evalgate
Quickstart
Eval gates
from llm_evalgate import EvalHarness
from llm_evalgate.eval.dimensions import BlocklistDimension, ReadabilityDimension, SchemaComplianceDimension
harness = EvalHarness([
    BlocklistDimension(terms=["confidential", "internal use only"]),
    ReadabilityDimension(threshold=0.3),
    SchemaComplianceDimension(required_fields=["title:", "summary:"]),
])
report = harness.run(llm_output)
if not report.passed:
    print(report)
# EvalReport: FAIL
# FAIL [blocklist] score=0.000 — prohibited terms found: ['confidential']
# PASS [readability] score=0.612 — Flesch ease=61.2, FK grade=8.4
# PASS [schema_compliance] score=1.000 — all 2 required fields present
Custom dimension
import json

from llm_evalgate import Dimension, EvalHarness


class JsonDimension(Dimension):
    """Scores 1.0 when the text parses as JSON, 0.0 otherwise."""

    def evaluate(self, text: str) -> tuple[float, str]:
        try:
            json.loads(text)
            return 1.0, "valid JSON"
        except json.JSONDecodeError as e:
            return 0.0, f"invalid JSON: {e}"


harness = EvalHarness([JsonDimension(threshold=1.0)])
report = harness.run('{"key": "value"}')
assert report.passed
Retry
from llm_evalgate.reliable import retry
@retry(max_attempts=3, backoff=2.0)
def call_llm(prompt: str) -> str:
    return client.messages.create(...)
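For readers new to the pattern, here is a minimal standalone sketch of what a retry decorator with exponential backoff typically does. It is not llm-evalgate's implementation: the name `retry_sketch`, the `base_delay` and `sleep` parameters, and the reading of `backoff` as a multiplier applied to the wait between attempts are all assumptions made for illustration.

```python
import functools
import time


def retry_sketch(max_attempts=3, backoff=2.0, base_delay=1.0, sleep=time.sleep):
    """Hypothetical stand-in for a retry decorator; not the library's code."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(delay)
                    delay *= backoff  # assumption: backoff multiplies the wait

        return wrapper

    return decorator


# Usage: a flaky function that succeeds on its third call.
calls = {"n": 0}
waits = []  # records the delays instead of actually sleeping


@retry_sketch(max_attempts=3, backoff=2.0, base_delay=0.5, sleep=waits.append)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = flaky()
```

Injecting `sleep` keeps the sketch testable without real waiting. A production version would usually also restrict which exception types are retried.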
Fallback chain
from llm_evalgate.reliable import with_fallback, with_fallback_chain
# two-model fallback
result = with_fallback(
    primary=lambda: call_model("claude-opus-4-8", prompt),
    fallback=lambda: call_model("claude-sonnet-4-6", prompt),
)
# ordered chain — first success wins
result = with_fallback_chain([
    lambda: call_model("claude-opus-4-8", prompt),
    lambda: call_model("claude-sonnet-4-6", prompt),
    lambda: call_model("claude-haiku-4-5", prompt),
])
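The "first success wins" behavior can be sketched in a few lines of plain Python. This is an illustration of the general technique under stated assumptions, not the library's source: `fallback_chain_sketch` is a hypothetical name, and raising a `RuntimeError` chained to the last failure when every callable fails is an assumed design choice.

```python
def fallback_chain_sketch(callables):
    """Return the first successful result; hypothetical, not the library's code."""
    errors = []
    for fn in callables:
        try:
            return fn()  # first success wins; later callables never run
        except Exception as exc:
            errors.append(exc)
    # Assumption: when everything fails, raise and chain the last error.
    last = errors[-1] if errors else None
    raise RuntimeError(f"all {len(errors)} callables failed") from last


# Usage: the first callable fails, so the second one supplies the result.
def failing():
    raise TimeoutError("primary model timed out")


chain_result = fallback_chain_sketch([
    failing,
    lambda: "from second model",
    lambda: "never reached",
])
```

Passing zero-argument callables (lambdas) rather than results is what makes the chain lazy: a fallback model is only called if everything before it raised.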
Circuit breaker
from llm_evalgate.reliable import CircuitBreaker, CircuitOpenError
breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=60)
try:
    with breaker:
        result = call_llm(prompt)
except CircuitOpenError:
    result = cached_response  # serve from cache while circuit is open
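To make the open/closed behavior concrete, here is a minimal sketch of a context-manager circuit breaker. The class names `CircuitBreakerSketch` and `CircuitOpenSketchError`, the injectable `clock`, and the exact half-open rule (one trial call after the timeout) are assumptions for illustration; the library's internals may differ.

```python
import time


class CircuitOpenSketchError(Exception):
    """Raised when the sketch breaker refuses a call."""


class CircuitBreakerSketch:
    """Hypothetical breaker; not llm-evalgate's implementation."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def __enter__(self):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenSketchError("circuit is open")
            # Timeout elapsed: allow one trial call through (half-open).
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.failures = 0  # success closes the circuit
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
        return False  # never swallow the caller's exception


# Usage with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreakerSketch(failure_threshold=2, recovery_timeout=10.0, clock=lambda: now[0])

for _ in range(2):  # two failures reach the threshold and open the circuit
    try:
        with breaker:
            raise ConnectionError("upstream down")
    except ConnectionError:
        pass

try:
    with breaker:
        state = "call went through"
except CircuitOpenSketchError:
    state = "served from cache"

now[0] = 11.0  # past the recovery timeout: a trial call is allowed
with breaker:
    recovered = True
```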
Built-in dimensions
| Dimension | What it checks | Default threshold |
|---|---|---|
| BlocklistDimension | No prohibited terms in output | 1.0 (zero tolerance) |
| ReadabilityDimension | Flesch Reading Ease score | 0.3 (college-level prose) |
| SchemaComplianceDimension | Required fields are present | 1.0 (all fields) |
| FactualGroundingDimension | Numeric claims traceable to evidence | 0.85 |
All dimensions follow the same interface: evaluate(text) -> (score, detail). Writing a new one is ten lines.
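As an illustration of a code-only dimension, the sketch below computes the standard Flesch Reading Ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), and maps it to a 0 to 1 score. The vowel-group syllable counter is a rough heuristic, and dividing the ease value by 100 is an assumption inferred from the sample report in the quickstart (ease 61.2, score 0.612). None of the function names come from the library.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease; returns 0.0 for empty input."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def readability_sketch(text: str, threshold: float = 0.3) -> tuple[float, str, bool]:
    """Hypothetical gate: (score, detail, passed)."""
    ease = flesch_reading_ease(text)
    score = min(1.0, max(0.0, ease / 100.0))  # assumption: score = ease / 100, clamped
    return score, f"Flesch ease={ease:.1f}", score >= threshold


# Usage: six one-syllable words in one sentence read as very easy prose.
score, detail, passed = readability_sketch("The cat sat on the mat.")
```

Because the function uses only string operations and arithmetic, the same text always yields the same score.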
Why deterministic?
LLM-as-judge eval is useful for research. In production pipelines, you need:
- The same input to produce the same pass/fail result every run
- CI to catch regressions without burning tokens on every commit
- An audit trail that doesn't depend on a model that may drift
llm-evalgate eval dimensions are pure functions. No model calls, no network, no randomness.
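A determinism claim like this is easy to check in CI. The sketch below uses a hypothetical stand-in, `blocklist_sketch`, rather than the library's own dimension, and assumes case-insensitive substring matching. The test function follows the pytest naming convention but runs as plain Python too.

```python
def blocklist_sketch(text: str, terms: list[str]) -> tuple[float, str]:
    """Hypothetical pure-function check: case-insensitive substring match."""
    lowered = text.lower()
    found = sorted(t for t in terms if t.lower() in lowered)
    if found:
        return 0.0, f"prohibited terms found: {found}"
    return 1.0, "no prohibited terms found"


def test_gate_is_deterministic():
    text = "This memo is CONFIDENTIAL."
    terms = ["confidential", "internal use only"]
    # Collect 100 runs into a set: a pure function yields exactly one distinct result.
    results = {blocklist_sketch(text, terms) for _ in range(100)}
    assert results == {(0.0, "prohibited terms found: ['confidential']")}


test_gate_is_deterministic()
```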
Composing with a pipeline
from llm_evalgate import EvalHarness
from llm_evalgate.eval.dimensions import BlocklistDimension, ReadabilityDimension
from llm_evalgate.reliable import retry, with_fallback
harness = EvalHarness([
    BlocklistDimension(terms=["[REDACTED]", "TODO"]),
    ReadabilityDimension(threshold=0.2),
])


@retry(max_attempts=3, backoff=2.0)
def generate(prompt: str) -> str:
    return with_fallback(
        primary=lambda: call_model("claude-opus-4-8", prompt),
        fallback=lambda: call_model("claude-sonnet-4-6", prompt),
    )


output = generate(prompt)
report = harness.run(output)
if not report.passed:
    raise ValueError(f"Output failed eval gate:\n{report}")
License
MIT