
Semantic specification drift detector for LLM outputs — catches semantic violations Pydantic cannot see.


spec-drift

Your LLM outputs pass Pydantic. That's not enough.

spec-drift detects when your LLM outputs drift semantically from their declared specification, even when structural validation passes. It continuously monitors semantic compliance in production, generates drift reports, and provides a CI gate for semantic regression testing.


The Problem

Every team shipping LLM features uses Pydantic or JSON Schema to validate output structure. These tools are excellent — and completely blind to semantic drift.

Consider what happens in production LLM systems over time:

Silent model updates: Your LLM provider updates the underlying model without announcement. The API contract (field names, types, schema) doesn't change, but the distribution of values inside those fields shifts. Your "sentiment" field starts returning "ambivalent" where it previously returned "neutral." Your "risk_level" classification shifts its decision boundary. Pydantic sees nothing.

Prompt erosion: A prompt is modified through six iterations of "just a small tweak." Each tweak passes regression tests individually. But cumulatively, the semantic profile of outputs drifts. The "reasoning" field that used to average 120 words now averages 30. Your validation still passes.

Input distribution shift: A new user cohort or marketing campaign brings different input patterns. The same model, same prompt, same schema — but outputs drift from the spec because they were calibrated for a different input distribution.

spec-drift catches all of these. Pydantic catches none of them.


The Biblical Foundation

"Moses then said to Aaron, 'This is what the Lord spoke of when he said: Among those who approach me I will be proved holy.'" — Leviticus 10:3

Nadab and Abihu offered "unauthorized fire" — structurally correct (fire), instrumentally correct (censers), personally authorized (priests). But the semantic specification was violated. Every structural check passed. The semantic compliance check failed.

spec-drift applies this principle to LLM outputs: structural validation is necessary but not sufficient. Semantic specification must be declared and continuously monitored.

BibleWorld build — PAT-037, Pivot_Score 8.63


Installation

pip install spec-drift

Requirements: Python 3.10+, Pydantic v2


Quick Start

1. Declare a semantic spec

from pydantic import BaseModel
from spec_drift import spec, SemanticConstraint

@spec(
    category=SemanticConstraint.from_authorized_values(
        ["positive", "negative", "neutral"],
        tolerance=0.02,       # max 2% outputs outside authorized set
        alert_threshold=0.10  # alert if >10% observations violate
    ),
    reasoning=SemanticConstraint.from_length_bounds(
        min_words=30,
        max_words=300,
        alert_threshold=0.15
    ),
    score=SemanticConstraint.from_distribution(
        mean=6.5,
        std=2.0,
        drift_threshold=1.0,  # alert if mean shifts >1 sigma
        alert_threshold=0.20
    )
)
class SentimentAnalysis(BaseModel):
    category: str
    reasoning: str
    score: float

2. Wrap your LLM function

from spec_drift import DriftMonitor
import anthropic

client = anthropic.Anthropic()

monitor = DriftMonitor(
    spec=SentimentAnalysis,
    db_path="./spec_drift.db",
    model_version="claude-3-5-haiku-20241022",
)

@monitor.watch
def analyze_sentiment(text: str) -> SentimentAnalysis:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Analyze the sentiment of the following text and reply with "
                f"JSON containing category, reasoning, and score.\n\n{text}"
            ),
        }]
    )
    return SentimentAnalysis.model_validate_json(response.content[0].text)

3. Check drift

# Terminal drift report (last 7 days)
spec-drift check --spec my_module.SentimentAnalysis --since 7d

# CI gate: fail build if >20% semantic violations
spec-drift ci \
  --spec my_module.SentimentAnalysis \
  --test-batch data/ci_batch.jsonl \
  --threshold 0.20 \
  --exit-code

API Reference

@spec(**constraints)

Attaches semantic constraints to a Pydantic model class.

@spec(field_name=SemanticConstraint.from_authorized_values([...]))
class MyModel(BaseModel):
    field_name: str

SemanticConstraint

SemanticConstraint.from_authorized_values(authorized, tolerance, alert_threshold)

Field values must be drawn from the authorized list (within tolerance).

  • authorized: list of permitted values
  • tolerance: float, maximum fraction of outputs allowed outside the authorized set before the constraint counts a violation
  • alert_threshold: float, fraction of rolling observations that must violate before an alert fires

SemanticConstraint.from_length_bounds(min_words, max_words, alert_threshold)

String field word count must be within [min_words, max_words].

SemanticConstraint.from_distribution(mean, std, drift_threshold, alert_threshold)

Numeric field should follow a distribution near (mean, std). Alerts if observed mean shifts by more than drift_threshold standard deviations.

SemanticConstraint.from_pattern(regex, min_match_rate, alert_threshold)

String field should match the regex pattern at min_match_rate frequency.
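
The four constructors compose on a single model. A minimal sketch (the TicketTriage model, its fields, and the numeric thresholds are illustrative, not part of the library):

from pydantic import BaseModel
from spec_drift import spec, SemanticConstraint

@spec(
    priority=SemanticConstraint.from_authorized_values(
        ["low", "medium", "high"],       # authorized set
        tolerance=0.01,
        alert_threshold=0.10,
    ),
    summary=SemanticConstraint.from_length_bounds(
        min_words=10, max_words=80, alert_threshold=0.15
    ),
    confidence=SemanticConstraint.from_distribution(
        mean=0.7, std=0.15, drift_threshold=1.0, alert_threshold=0.20
    ),
    ticket_id=SemanticConstraint.from_pattern(
        r"^TCK-\d{6}$",                  # expected ID format
        min_match_rate=0.99,
        alert_threshold=0.05,
    ),
)
class TicketTriage(BaseModel):
    priority: str
    summary: str
    confidence: float
    ticket_id: str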

DriftMonitor(spec, db_path, model_version, prompt_hash, alert_callback)

Runtime monitor for semantic specification compliance.
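
A sketch of constructing a monitor with the optional prompt_hash argument (the PROMPT_TEMPLATE name is illustrative; SentimentAnalysis is the model from Quick Start). Hashing the prompt template lets later reports attribute drift to prompt edits versus silent model updates; for alert_callback, see the routing sketch under Severity Levels below.

import hashlib
from spec_drift import DriftMonitor

PROMPT_TEMPLATE = "Analyze the sentiment of the following text: {text}"

monitor = DriftMonitor(
    spec=SentimentAnalysis,
    db_path="./spec_drift.db",
    model_version="claude-3-5-haiku-20241022",
    # Tie observations to the current prompt revision
    prompt_hash=hashlib.sha256(PROMPT_TEMPLATE.encode()).hexdigest(),
)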

.watch (decorator)

Wraps an LLM function to automatically observe its return value.

.observe(output) -> output

Manually observe a Pydantic model instance. Returns the output unchanged.
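
When wrapping the call site with .watch is impractical (for example, outputs arriving from a batch pipeline), .observe records each validated instance directly. A minimal sketch; the outputs.jsonl path is illustrative:

from pathlib import Path

for line in Path("outputs.jsonl").read_text().splitlines():
    result = SentimentAnalysis.model_validate_json(line)
    monitor.observe(result)  # records the observation and returns it unchanged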

.drift_report(since_hours) -> dict

Generate a semantic drift report for the last N hours.

{
    "spec": "SentimentAnalysis",
    "period_hours": 168.0,
    "observations": 4523,
    "violation_rate": 0.0312,
    "severity": "low",
    "field_violation_rates": {
        "category": 0.0089,
        "reasoning": 0.0221,
        "score": 0.0002
    }
}
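
The same report is available programmatically, which is useful for custom dashboards or scheduled checks. A sketch, assuming the severity strings match the lowercase values in the sample report above:

report = monitor.drift_report(since_hours=168)  # last 7 days, same window as --since 7d
if report["severity"] in ("high", "critical"):
    rates = report["field_violation_rates"]
    worst = max(rates, key=rates.get)
    print(f"Semantic drift concentrated in '{worst}': {rates[worst]:.1%} violation rate")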

run_ci_gate(monitor, test_outputs, threshold) -> (passed, report)

Run a CI gate on a batch of test outputs. Returns (passed, report).
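
A sketch of wiring the gate into a pytest suite. data/ci_batch.jsonl is the batch from Quick Start; the assumption here is that test_outputs is a list of already-validated model instances:

from spec_drift import run_ci_gate

def test_semantic_spec_gate():
    with open("data/ci_batch.jsonl") as f:
        test_outputs = [SentimentAnalysis.model_validate_json(line) for line in f]
    passed, report = run_ci_gate(monitor, test_outputs, threshold=0.20)
    assert passed, f"Semantic spec gate failed: {report}"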


CLI Reference

# Initialize spec-drift in a project
spec-drift init

# Calibrate a baseline from golden data
spec-drift calibrate \
  --spec my_module.SentimentAnalysis \
  --input-file data/golden_set.jsonl \
  --output baseline.db

# Drift report (table format, last 7 days)
spec-drift check \
  --spec my_module.SentimentAnalysis \
  --since 7d \
  --format table

# CI gate
spec-drift ci \
  --spec my_module.SentimentAnalysis \
  --test-batch data/ci_batch.jsonl \
  --threshold 0.20 \
  --exit-code

# Compare two model versions
spec-drift compare \
  --spec my_module.SentimentAnalysis \
  --baseline-a baseline_gpt4o.db \
  --baseline-b baseline_claude_haiku.db

# HTML report
spec-drift report \
  --spec my_module.SentimentAnalysis \
  --since 30d \
  --output report.html

GitHub Action

# .github/workflows/llm-spec-check.yml
name: LLM Semantic Spec Check

on: [push, pull_request]

jobs:
  spec-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: spec-drift/action@v1
        with:
          spec: my_module.SentimentAnalysis
          test-batch: data/ci_batch.jsonl
          threshold: '0.20'

Severity Levels

Violation Rate | Severity | Recommended Action
---------------|----------|-----------------------------------
0%             | NONE     | No action needed
< 5%           | LOW      | Monitor, no immediate action
5-15%          | MEDIUM   | Investigate, create issue
15-30%         | HIGH     | Rollback or prompt fix recommended
> 30%          | CRITICAL | Immediate rollback
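
An alert_callback can route these severities to actions automatically. The payload shape passed to the callback is an assumption here (a dict carrying a "severity" key, lowercase as in the sample drift report); a minimal sketch using standard logging:

import logging
from spec_drift import DriftMonitor

log = logging.getLogger("spec_drift.alerts")

def route_alert(alert: dict) -> None:
    severity = alert.get("severity", "none")  # assumed key; adapt to the real payload
    if severity == "critical":
        log.critical("Immediate rollback recommended: %s", alert)
    elif severity == "high":
        log.error("Rollback or prompt fix recommended: %s", alert)
    elif severity == "medium":
        log.warning("Investigate and open an issue: %s", alert)
    else:
        log.info("Within tolerance, monitoring only: %s", alert)

monitor = DriftMonitor(
    spec=SentimentAnalysis,
    db_path="./spec_drift.db",
    model_version="claude-3-5-haiku-20241022",
    alert_callback=route_alert,
)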

Storage

spec-drift uses SQLite by default — zero infrastructure required.

# Local development (default)
monitor = DriftMonitor(spec=MyModel, db_path="./spec_drift.db")

# PostgreSQL for production (coming in v0.2)
monitor = DriftMonitor(
    spec=MyModel,
    db_url="postgresql://user:pass@host/db"
)

# In-memory for testing
monitor = DriftMonitor(spec=MyModel, db_path=":memory:")

Roadmap

v0.1 (this release)

  • Core @spec decorator + SemanticConstraint DSL
  • DriftMonitor with .watch and .observe
  • SQLite observation store
  • run_ci_gate function
  • CLI: check, ci, compare

v0.2

  • PostgreSQL support
  • Multi-field correlation monitoring
  • Automatic model version detection (via LLM API response headers)
  • Slack/PagerDuty alert integrations
  • HTML drift reports

v0.3

  • LLM-judge semantic constraint evaluation (for complex, prose-level constraints)
  • Baseline versioning with SemVer
  • Team dashboard (hosted cloud option)
  • Prometheus/Grafana metrics export

Comparison

Tool       | Structural validation  | Semantic spec monitoring | Production continuous | CI gate | Open source
-----------|------------------------|--------------------------|-----------------------|---------|------------
Pydantic   | YES                    | NO                       | NO                    | NO      | YES
DeepEval   | NO (batch eval)        | YES (point-in-time)      | NO                    | YES     | YES
Evidently  | NO (statistical drift) | NO                       | YES                   | NO      | YES
Langfuse   | NO                     | NO                       | YES (observability)   | NO      | YES
spec-drift | YES (via Pydantic)     | YES                      | YES                   | YES     | YES

Contributing

spec-drift is MIT licensed. Contributions welcome.

git clone https://github.com/bibleworld/spec-drift
cd spec-drift
pip install -e ".[dev]"
pytest tests/

License

MIT — free to use, modify, and distribute.


Built with BibleWorld — Pattern: Leviticus 10:1-3 (The Authorized Fire) "Among those who approach me I will be proved holy." — Leviticus 10:3
