Gateframe

Behavioral validation for LLM outputs in production workflows.

Schema validation ("does this JSON have the right keys?") is a solved problem. Instructor, Pydantic AI, and similar tools handle it well. gateframe solves a different problem: does this output behave correctly given the context it was generated in? Does it stay within the decision boundaries this workflow requires? When it fails, does it fail in a way your system can recover from, or does it fail silently?

from pydantic import BaseModel
from gateframe import (
    ValidationContract,
    StructuralRule,
    BoundaryRule,
    ConfidenceRule,
    AllowedValues,
    FailureMode,
)

class TriageDecision(BaseModel):
    action: str
    priority: str
    confidence: float
    rationale: str

contract = ValidationContract(
    name="triage_decision",
    rules=[
        StructuralRule(schema=TriageDecision),
        BoundaryRule(
            check=AllowedValues("action", {"treat", "observe", "refer", "discharge"}),
            name="action_boundary",
            failure_message="Action must be one of: treat, observe, refer, discharge.",
        ),
        ConfidenceRule(field="confidence", minimum=0.7),
    ],
)

result = contract.validate({
    "action": "prescribe",       # not in allowed set -> HARD_FAIL
    "priority": "high",
    "confidence": 0.52,          # below 0.7 -> SOFT_FAIL
    "rationale": "...",
})

print(result.passed)             # False
for failure in result.failures:
    print(f"[{failure.failure_mode.value}] {failure.rule_name}: {failure.message}")
# [hard_fail] action_boundary: Action must be one of: treat, observe, refer, discharge.
# [soft_fail] confidence_check: Confidence 0.52 is below minimum threshold 0.7.

The problem

Most LLM pipelines validate outputs the same way: parse the JSON, check the schema, move on. That catches structural errors. It misses the errors that actually cause production incidents:

  • A model recommends an action that is structurally valid but outside its authorized scope
  • Confidence is low but the workflow proceeds as if it weren't
  • A soft failure in step 2 silently degrades the reliability of everything downstream
  • A validation failure gives you a bare False and no context for debugging

gateframe makes these failures explicit, structured, and recoverable.
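
To make the first bullet concrete, here is a minimal sketch in plain Python. It does not use gateframe and is not how gateframe is implemented; the helper names `has_required_keys` and `within_boundary` are hypothetical. It shows an output that passes a structural check while falling outside the authorized action set.

```python
# Illustrative only: plain Python, no gateframe imports.
# REQUIRED_KEYS and ALLOWED_ACTIONS mirror the triage example above.
REQUIRED_KEYS = {"action", "priority", "confidence", "rationale"}
ALLOWED_ACTIONS = {"treat", "observe", "refer", "discharge"}


def has_required_keys(output: dict) -> bool:
    """Schema-style check: are all expected keys present?"""
    return REQUIRED_KEYS.issubset(output)


def within_boundary(output: dict) -> bool:
    """Behavioral check: is the chosen action one the workflow authorizes?"""
    return output.get("action") in ALLOWED_ACTIONS


output = {
    "action": "prescribe",
    "priority": "high",
    "confidence": 0.52,
    "rationale": "...",
}

schema_ok = has_required_keys(output)
boundary_ok = within_boundary(output)
print(schema_ok, boundary_ok)  # True False
```

A pipeline that stops at the first check would accept this output; only the second check notices that "prescribe" is out of scope.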


Failure modes

gateframe distinguishes four failure types instead of binary pass/fail.

HARD_FAIL: Stop. The output violates a hard constraint that cannot be auto-recovered.

# Model chose an action outside its authorized scope
BoundaryRule(
    check=AllowedValues("action", {"treat", "observe", "refer"}),
    failure_mode=FailureMode.HARD_FAIL,  # default for BoundaryRule
)

SOFT_FAIL: Flag and continue with degraded confidence. Something is off but not critical enough to halt.

# Model confidence is low, continue but track the degradation
ConfidenceRule(
    field="confidence",
    minimum=0.7,
    failure_mode=FailureMode.SOFT_FAIL,  # default for ConfidenceRule
)

RETRY: Re-prompt with the failure context. The output is likely fixable by trying again.

# Malformed output that might parse correctly on a second attempt
StructuralRule(schema=MyOutput, failure_mode=FailureMode.RETRY)
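
How the re-prompt is wired up is left to the caller. One possible shape is sketched below in plain Python, without gateframe. The `validate`, `call_with_retry`, and `fake_model` functions are hypothetical stand-ins, and the retry limit of 3 is an assumption.

```python
from typing import Callable, Optional


def validate(output: dict) -> Optional[str]:
    """Hypothetical validator: return an error message, or None if valid."""
    if "action" not in output:
        return "Missing required field: action"
    return None


def call_with_retry(generate: Callable[[Optional[str]], dict], max_attempts: int = 3):
    """Call the generator, feeding the previous failure message back in."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        feedback = validate(output)
        if feedback is None:
            return output, attempt
    raise RuntimeError(f"Still invalid after {max_attempts} attempts: {feedback}")


def fake_model(feedback: Optional[str]) -> dict:
    """Stand-in for an LLM call: malformed first, corrected once given feedback."""
    return {"action": "observe"} if feedback else {"priority": "high"}


final_output, attempts = call_with_retry(fake_model)
print(final_output, attempts)  # {'action': 'observe'} 2
```

Passing the failure message into the next call is what distinguishes this from a blind retry: the model sees why the previous attempt was rejected.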

SILENT_FAIL: The most dangerous kind. The output looks valid but violates a semantic or boundary rule. gateframe makes these visible instead of letting them pass through undetected.

SemanticRule(
    check=lambda output, **ctx: output["severity"] != "low" or output["escalated"] is False,
    failure_mode=FailureMode.SILENT_FAIL,
    failure_message="Low-severity cases should not be auto-escalated.",
)
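
The four modes suggest four different caller reactions. The sketch below is one way a caller might dispatch on them. It uses a standalone `Mode` enum rather than gateframe's `FailureMode`; the `"retry"` and `"silent_fail"` string values, the `next_step` helper, and the precedence order are all assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    """Stand-in for the four failure modes described above."""
    HARD_FAIL = "hard_fail"
    SOFT_FAIL = "soft_fail"
    RETRY = "retry"              # assumed value
    SILENT_FAIL = "silent_fail"  # assumed value


def next_step(modes: list) -> str:
    """Pick one reaction for a set of failures (assumed precedence order)."""
    if Mode.HARD_FAIL in modes:
        return "halt"
    if Mode.RETRY in modes:
        return "re-prompt"
    if Mode.SILENT_FAIL in modes:
        return "surface for review"
    if Mode.SOFT_FAIL in modes:
        return "continue with degraded confidence"
    return "continue"


print(next_step([Mode.SOFT_FAIL, Mode.HARD_FAIL]))  # halt
print(next_step([Mode.SOFT_FAIL]))                  # continue with degraded confidence
print(next_step([]))                                # continue
```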

Multi-step workflow validation

Validation state carries forward across steps. A soft failure in step 2 degrades the confidence score that step 4 sees.

from gateframe import WorkflowContext, ValidationContract, EscalationRouter
from gateframe.audit.log import AuditLog

ctx = WorkflowContext(workflow_id="incident_response_001", escalation_threshold=0.5)
router = EscalationRouter()
audit = AuditLog()

# Step 1
result1 = contract_step1.validate(output1)
ctx.update(result1)
audit.record(result1, workflow_context=ctx)

# Step 2: ctx carries forward degraded confidence from step 1
result2 = contract_step2.validate(output2)
ctx.update(result2)
audit.record(result2, workflow_context=ctx)

print(ctx.confidence)           # degraded from 1.0 by soft failures
print(ctx.threshold_breached)   # True if confidence < escalation_threshold

if ctx.threshold_breached:
    escalation = router.route_threshold_breach(ctx)
    print(escalation.route.value)  # "human_review", "abort", etc.
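
The exact formula gateframe uses to degrade confidence is not stated here. As a rough mental model only, the sketch below assumes each soft failure multiplies the running score by a fixed penalty of 0.8. `ConfidenceTracker` and `soft_fail_penalty` are hypothetical names, not part of gateframe.

```python
from dataclasses import dataclass


@dataclass
class ConfidenceTracker:
    """Toy model of cumulative confidence across workflow steps."""
    escalation_threshold: float
    confidence: float = 1.0
    soft_fail_penalty: float = 0.8  # assumed multiplier per soft failure

    def update(self, soft_failures: int) -> None:
        # Each soft failure in a step scales the running score down.
        self.confidence *= self.soft_fail_penalty ** soft_failures

    @property
    def threshold_breached(self) -> bool:
        return self.confidence < self.escalation_threshold


tracker = ConfidenceTracker(escalation_threshold=0.5)
for soft_failures in [0, 1, 2, 1]:  # soft failures observed in steps 1 to 4
    tracker.update(soft_failures)

print(round(tracker.confidence, 4))
print(tracker.threshold_breached)
```

Under this assumed model no single step fails hard, yet four soft failures across the workflow are enough to push the score under a 0.5 threshold.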

Provider integrations

gateframe validates outputs from any provider. Integrations are thin wrappers; gateframe does not import any LLM SDK at the core level.

# OpenAI
from gateframe.integrations.openai import OpenAIValidator
validator = OpenAIValidator(contract, parse_json=True)
result = validator.validate(openai_completion)

# Anthropic
from gateframe.integrations.anthropic import AnthropicValidator
validator = AnthropicValidator(contract, parse_json=True)
result = validator.validate(anthropic_message)

# LiteLLM
from gateframe.integrations.litellm import LiteLLMValidator
validator = LiteLLMValidator(contract, parse_json=True)
result = validator.validate(litellm_response)

# LangChain
from gateframe.integrations.langchain import LangChainValidator
validator = LangChainValidator(contract, parse_json=False)
result = validator.validate(chain_output)
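
What "thin wrapper" means can be sketched without any SDK installed. The code below is not gateframe's integration code: `extract_text` and `validate_completion` are hypothetical, and the response object is a stand-in that only imitates the `choices[0].message.content` shape of an OpenAI chat completion.

```python
import json
from types import SimpleNamespace

ALLOWED_ACTIONS = {"treat", "observe", "refer", "discharge"}


def extract_text(completion) -> str:
    """Pull the assistant text out of a chat-completion-shaped object."""
    return completion.choices[0].message.content


def validate_completion(completion, validate, parse_json: bool = True):
    """Unwrap the provider response, optionally parse JSON, then validate."""
    text = extract_text(completion)
    payload = json.loads(text) if parse_json else text
    return validate(payload)


# Stand-in for a provider response; no SDK import is needed.
fake_completion = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(content='{"action": "observe", "confidence": 0.9}')
        )
    ]
)

result = validate_completion(
    fake_completion,
    validate=lambda payload: payload["action"] in ALLOWED_ACTIONS,
)
print(result)  # True
```

Because the wrapper only reads attributes off the response, the validation logic itself stays independent of any particular provider library.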

Install the integration you need:

pip install "gateframe[openai]"
pip install "gateframe[anthropic]"
pip install "gateframe[litellm]"
pip install "gateframe[langchain]"

Audit trail

Every validation event is logged with structured context. Use the built-in exporters or implement your own.

from gateframe.audit.log import AuditLog
from gateframe.audit.exporters import JsonFileExporter

audit = AuditLog(exporters=[JsonFileExporter("audit.jsonl")])
audit.record(result, workflow_context=ctx)
audit.flush()

Each entry includes: timestamp, contract name, rules applied, rules failed, failure details, workflow ID, and accumulated confidence score.
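
For a sense of what such an entry could look like on disk, here is a sketch that writes one JSON Lines record using only the standard library. The key names (`contract`, `rules_applied`, and so on) and the `make_entry` and `write_jsonl` helpers are assumptions; the actual field names in gateframe's audit output may differ.

```python
import io
import json
from datetime import datetime, timezone


def make_entry(contract_name, rules_applied, failures, workflow_id, confidence):
    """Build one audit record with the fields listed above (names assumed)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "contract": contract_name,
        "rules_applied": rules_applied,
        "rules_failed": [failure["rule"] for failure in failures],
        "failures": failures,
        "workflow_id": workflow_id,
        "confidence": confidence,
    }


def write_jsonl(entries, stream) -> None:
    """Write one JSON object per line, the format implied by audit.jsonl."""
    for entry in entries:
        stream.write(json.dumps(entry) + "\n")


entry = make_entry(
    contract_name="triage_decision",
    rules_applied=["structural", "action_boundary", "confidence_check"],
    failures=[{"rule": "action_boundary", "mode": "hard_fail"}],
    workflow_id="incident_response_001",
    confidence=0.8,
)

buffer = io.StringIO()  # swap for open("audit.jsonl", "a") to write a real file
write_jsonl([entry], buffer)
parsed = json.loads(buffer.getvalue().splitlines()[0])
print(parsed["rules_failed"])  # ['action_boundary']
```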


When to use gateframe

Use it when:

  • You need to validate LLM output behavior beyond schema checks: decision boundaries, scope enforcement, semantic constraints
  • You need structured, recoverable failure records rather than bare exceptions
  • You're running multi-step workflows where soft failures in early steps should affect confidence downstream
  • You need an audit trail for post-incident debugging

Don't use it when:

  • You only need schema extraction from LLM outputs: use Instructor or Pydantic AI
  • You need offline model evaluation or benchmarking: use DeepEval or RAGAS
  • You need content safety filtering: use a dedicated guardrails tool

Installation

pip install gateframe

For development:

git clone https://github.com/practicalmind-ai/gateframe.git
cd gateframe
pip install -e ".[dev]"
python -m pytest tests/ -v

Examples

triage_workflow: 3-step medical triage pipeline. Demonstrates StructuralRule, BoundaryRule, ConfidenceRule, and WorkflowContext together. Step 2 has confidence below the threshold, which shows how SOFT_FAIL degrades the workflow score without halting it.

rag_output: RAG answer validation with two scenarios. Scenario B demonstrates simultaneous soft failures (low confidence + ungrounded answer) and how they accumulate in the workflow context.

agent_pipeline: 4-step agent workflow with escalation. Demonstrates how multiple soft failures across steps push cumulative confidence below the escalation threshold.


CLI

# Inspect a contract file: lists all contracts and their rules
gateframe inspect contracts.py

# Replay an audit log
gateframe replay audit.jsonl

License

MIT; see LICENSE.
