Gateframe
Behavioral validation for LLM outputs in production workflows.
Schema validation, "does this JSON have the right keys?", is a solved problem. Instructor, Pydantic AI, and similar tools handle it well. gateframe solves a different problem: does this output behave correctly given the context it was generated in? Does it stay within the decision boundaries this workflow requires? When it fails, does it fail in a way your system can recover from, or does it fail silently?
from pydantic import BaseModel
from gateframe import (
    ValidationContract,
    StructuralRule,
    BoundaryRule,
    ConfidenceRule,
    AllowedValues,
    FailureMode,
)

class TriageDecision(BaseModel):
    action: str
    priority: str
    confidence: float
    rationale: str

contract = ValidationContract(
    name="triage_decision",
    rules=[
        StructuralRule(schema=TriageDecision),
        BoundaryRule(
            check=AllowedValues("action", {"treat", "observe", "refer", "discharge"}),
            name="action_boundary",
            failure_message="Action must be one of: treat, observe, refer, discharge.",
        ),
        ConfidenceRule(field="confidence", minimum=0.7),
    ],
)

result = contract.validate({
    "action": "prescribe",  # not in allowed set -> HARD_FAIL
    "priority": "high",
    "confidence": 0.52,  # below 0.7 -> SOFT_FAIL
    "rationale": "...",
})

print(result.passed)  # False
for failure in result.failures:
    print(f"[{failure.failure_mode.value}] {failure.rule_name}: {failure.message}")
# [hard_fail] action_boundary: Action must be one of: treat, observe, refer, discharge.
# [soft_fail] confidence_check: Confidence 0.52 is below minimum threshold 0.7.
The problem
Most LLM pipelines validate outputs the same way: parse the JSON, check the schema, move on. That catches structural errors. It misses the errors that actually cause production incidents:
- A model recommends an action that is structurally valid but outside its authorized scope
- Confidence is low but the workflow proceeds as if it weren't
- A soft failure in step 2 silently degrades the reliability of everything downstream
- A validation failure gives you False, and no context for debugging
gateframe makes these failures explicit, structured, and recoverable.
Failure modes
gateframe distinguishes four failure types instead of binary pass/fail.
HARD_FAIL: Stop. The output violates a hard constraint that cannot be auto-recovered.
# Model chose an action outside its authorized scope
BoundaryRule(
    check=AllowedValues("action", {"treat", "observe", "refer"}),
    failure_mode=FailureMode.HARD_FAIL,  # default for BoundaryRule
)
SOFT_FAIL: Flag and continue with degraded confidence. Something is off but not critical enough to halt.
# Model confidence is low, continue but track the degradation
ConfidenceRule(
    field="confidence",
    minimum=0.7,
    failure_mode=FailureMode.SOFT_FAIL,  # default for ConfidenceRule
)
RETRY: Re-prompt with the failure context. The output is likely fixable by trying again.
# Malformed output that might parse correctly on a second attempt
StructuralRule(schema=MyOutput, failure_mode=FailureMode.RETRY)
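The retry loop this mode implies can be sketched without gateframe at all. In the example below, `call_model` and `validate` are hypothetical callables standing in for an LLM call and a contract check; neither is part of gateframe's API, and the format of the feedback appended to the prompt is an assumption.

```python
from typing import Callable, Optional

def retry_with_feedback(
    call_model: Callable[[str], str],
    validate: Callable[[str], Optional[str]],
    prompt: str,
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Re-prompt until `validate` returns None (no error) or attempts run out.

    Returns (accepted_output_or_None, attempts_used).
    """
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        output = call_model(current_prompt)
        error = validate(output)
        if error is None:
            return output, attempt
        # Feed the failure message back so the next attempt can correct it.
        current_prompt = f"{prompt}\n\nPrevious attempt failed: {error}"
    return None, max_attempts


# Stub model (hypothetical): returns malformed output first, then valid JSON.
_responses = iter(["not json", '{"action": "observe"}'])

def fake_model(_prompt: str) -> str:
    return next(_responses)

def must_be_json_object(text: str) -> Optional[str]:
    return None if text.startswith("{") else "Output is not a JSON object."

accepted, attempts = retry_with_feedback(
    fake_model, must_be_json_object, "Triage this case."
)
print(accepted, attempts)
```

Capping the number of attempts keeps a persistently malformed output from looping forever; what happens after the cap (halt, escalate) is left to the caller.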
SILENT_FAIL: The most dangerous kind. The output looks valid but violates a semantic or boundary rule. gateframe makes these visible instead of letting them pass through undetected.
SemanticRule(
    check=lambda output, **ctx: output["severity"] != "low" or output["escalated"] is False,
    failure_mode=FailureMode.SILENT_FAIL,
    failure_message="Low-severity cases should not be auto-escalated.",
)
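One way a caller might act on the four modes is a small dispatch over the most severe failure in a result. The enum, the severity ordering, and the action names below are illustrative assumptions, not gateframe's own types or behavior.

```python
from enum import Enum

class Mode(Enum):  # hypothetical stand-in for gateframe's FailureMode
    HARD_FAIL = "hard_fail"
    SOFT_FAIL = "soft_fail"
    RETRY = "retry"
    SILENT_FAIL = "silent_fail"

# Assumed precedence: halting beats review, review beats retry,
# retry beats flag-and-continue.
_SEVERITY = [Mode.HARD_FAIL, Mode.SILENT_FAIL, Mode.RETRY, Mode.SOFT_FAIL]
_ACTION = {
    Mode.HARD_FAIL: "halt",
    Mode.SILENT_FAIL: "surface_and_review",
    Mode.RETRY: "reprompt",
    Mode.SOFT_FAIL: "continue_degraded",
}

def next_action(failure_modes: list[Mode]) -> str:
    """Pick one action for a validation result from its failure modes."""
    if not failure_modes:
        return "continue"
    worst = min(failure_modes, key=_SEVERITY.index)
    return _ACTION[worst]

print(next_action([Mode.SOFT_FAIL, Mode.HARD_FAIL]))  # halt
print(next_action([Mode.SOFT_FAIL]))                  # continue_degraded
print(next_action([]))                                # continue
```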
Multi-step workflow validation
Validation state carries forward across steps. A soft failure in step 2 degrades the confidence score that step 4 sees.
from gateframe import WorkflowContext, ValidationContract, EscalationRouter
from gateframe.audit.log import AuditLog
ctx = WorkflowContext(workflow_id="incident_response_001", escalation_threshold=0.5)
router = EscalationRouter()
audit = AuditLog()
# Step 1
result1 = contract_step1.validate(output1)
ctx.update(result1)
audit.record(result1, workflow_context=ctx)
# Step 2: ctx carries forward degraded confidence from step 1
result2 = contract_step2.validate(output2)
ctx.update(result2)
audit.record(result2, workflow_context=ctx)
print(ctx.confidence) # degraded from 1.0 by soft failures
print(ctx.threshold_breached) # True if confidence < escalation_threshold
if ctx.threshold_breached:
    escalation = router.route_threshold_breach(ctx)
    print(escalation.route.value)  # "human_review", "abort", etc.
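How confidence carries forward can be pictured with a small stand-alone model. The sketch below assumes each soft failure multiplies the running confidence by a fixed penalty; that rule, the penalty value, and the `SketchContext` class are assumptions made for illustration and are not taken from gateframe's `WorkflowContext`.

```python
from dataclasses import dataclass, field

@dataclass
class SketchContext:  # hypothetical; not gateframe's WorkflowContext
    escalation_threshold: float
    soft_fail_penalty: float = 0.8  # assumed multiplier per soft failure
    confidence: float = 1.0
    history: list[int] = field(default_factory=list)

    def update(self, soft_failures: int) -> None:
        """Record one step's soft-failure count and degrade confidence."""
        self.history.append(soft_failures)
        self.confidence *= self.soft_fail_penalty ** soft_failures

    @property
    def threshold_breached(self) -> bool:
        return self.confidence < self.escalation_threshold


sketch_ctx = SketchContext(escalation_threshold=0.5)
for step_soft_failures in [0, 1, 2, 1]:  # soft failures observed per step
    sketch_ctx.update(step_soft_failures)
    print(round(sketch_ctx.confidence, 4), sketch_ctx.threshold_breached)
```

Under these assumptions no single step breaches the threshold on its own; the breach comes from accumulation across steps, which is the situation the workflow context is meant to expose.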
Provider integrations
gateframe validates outputs from any provider. Integrations are thin wrappers; gateframe does not import any LLM SDK at the core level.
# OpenAI
from gateframe.integrations.openai import OpenAIValidator
validator = OpenAIValidator(contract, parse_json=True)
result = validator.validate(openai_completion)
# Anthropic
from gateframe.integrations.anthropic import AnthropicValidator
validator = AnthropicValidator(contract, parse_json=True)
result = validator.validate(anthropic_message)
# LiteLLM
from gateframe.integrations.litellm import LiteLLMValidator
validator = LiteLLMValidator(contract, parse_json=True)
result = validator.validate(litellm_response)
# LangChain
from gateframe.integrations.langchain import LangChainValidator
validator = LangChainValidator(contract, parse_json=False)
result = validator.validate(chain_output)
Install the integration you need:
pip install "gateframe[openai]"
pip install "gateframe[anthropic]"
pip install "gateframe[litellm]"
pip install "gateframe[langchain]"
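The "thin wrapper" idea can be illustrated without any provider SDK: take the text of a response, optionally parse it as JSON, and hand the result to a validation callable. The function names and return shape below are hypothetical and do not mirror gateframe's integration classes.

```python
import json
from typing import Any, Callable

def validate_response_text(
    text: str,
    validate: Callable[[Any], bool],
    parse_json: bool = True,
) -> tuple[bool, str]:
    """Parse raw model text if requested, then run `validate` on it.

    Returns (passed, detail). A JSON parse error is reported as a failure
    rather than raised, so the caller always gets a structured answer.
    """
    payload: Any = text
    if parse_json:
        try:
            payload = json.loads(text)
        except json.JSONDecodeError as exc:
            return False, f"invalid JSON: {exc.msg}"
    return (True, "ok") if validate(payload) else (False, "validation failed")


ALLOWED = {"treat", "observe", "refer", "discharge"}

def allowed_action(payload: Any) -> bool:
    return isinstance(payload, dict) and payload.get("action") in ALLOWED

print(validate_response_text('{"action": "observe"}', allowed_action))
print(validate_response_text('{"action": "prescribe"}', allowed_action))
print(validate_response_text("observe the patient", allowed_action))
```

Keeping the parsing step separate from the validation callable is what lets the same check run against any provider's output once the text has been extracted.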
Audit trail
Every validation event is logged with structured context. Use the built-in exporters or implement your own.
from gateframe.audit.log import AuditLog
from gateframe.audit.exporters import JsonFileExporter
audit = AuditLog(exporters=[JsonFileExporter("audit.jsonl")])
audit.record(result, workflow_context=ctx)
audit.flush()
Each entry includes: timestamp, contract name, rules applied, rules failed, failure details, workflow ID, and accumulated confidence score.
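As a rough picture of what one line of such a log could contain, the sketch below writes and reads back a JSON Lines record carrying the fields listed above. The key names and values are assumptions for illustration; gateframe's actual entry schema may differ.

```python
import io
import json
from datetime import datetime, timezone

# Hypothetical entry; key names are illustrative, not gateframe's schema.
entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "contract_name": "triage_decision",
    "rules_applied": ["structural", "action_boundary", "confidence_check"],
    "rules_failed": ["action_boundary"],
    "failures": [
        {
            "rule": "action_boundary",
            "mode": "hard_fail",
            "message": "Action must be one of: treat, observe, refer, discharge.",
        }
    ],
    "workflow_id": "incident_response_001",
    "confidence": 1.0,
}

# JSON Lines: one compact JSON object per line, appended as events occur.
buffer = io.StringIO()  # stands in for an open file such as audit.jsonl
buffer.write(json.dumps(entry) + "\n")

# Reading back: parse each non-empty line independently.
records = [json.loads(line) for line in buffer.getvalue().splitlines() if line]
print(records[0]["rules_failed"])
```

Because each line parses on its own, a log in this shape can be filtered by workflow ID or failed rule with a few lines of scripting during post-incident debugging.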
When to use gateframe
Use it when:
- You need to validate LLM output behavior beyond schema checks: decision boundaries, scope enforcement, semantic constraints
- You need structured, recoverable failure records rather than bare exceptions
- You're running multi-step workflows where soft failures in early steps should affect confidence downstream
- You need an audit trail for post-incident debugging
Don't use it when:
- You only need schema extraction from LLM outputs; use Instructor or Pydantic AI
- You need offline model evaluation or benchmarking; use DeepEval or RAGAS
- You need content safety filtering; use a dedicated guardrails tool
Installation
pip install gateframe
For development:
git clone https://github.com/practicalmind-ai/gateframe.git
cd gateframe
pip install -e ".[dev]"
python -m pytest tests/ -v
Examples
triage_workflow: 3-step medical triage pipeline. Demonstrates StructuralRule, BoundaryRule, ConfidenceRule, and WorkflowContext together. Step 2 has confidence below the threshold, which shows how SOFT_FAIL degrades the workflow score without halting it.
rag_output: RAG answer validation with two scenarios. Scenario B demonstrates simultaneous soft failures (low confidence + ungrounded answer) and how they accumulate in the workflow context.
agent_pipeline: 4-step agent workflow with escalation. Demonstrates how multiple soft failures across steps push cumulative confidence below the escalation threshold.
CLI
# Inspect a contract file: lists all contracts and their rules
gateframe inspect contracts.py
# Replay an audit log
gateframe replay audit.jsonl
License
MIT. See LICENSE.