Semantic specification drift detector for LLM outputs — catches semantic violations Pydantic cannot see.
spec-drift
Your LLM outputs pass Pydantic. That's not enough.
spec-drift detects when your LLM outputs drift semantically from their declared specification, even when structural validation passes. It continuously monitors semantic compliance in production, generates drift reports, and provides a CI gate for semantic regression testing.
The Problem
Every team shipping LLM features uses Pydantic or JSON Schema to validate output structure. These tools are excellent — and completely blind to semantic drift.
Consider what happens in production LLM systems over time:
Silent model updates: Your LLM provider silently updates the underlying model. The API contract (field names, types, schema) doesn't change. But the distribution of values inside those fields shifts. Your "sentiment" field starts returning "ambivalent" where it previously returned "neutral." Your "risk_level" classification shifts its decision boundary. Pydantic sees nothing.
Prompt erosion: A prompt is modified through six iterations of "just a small tweak." Each tweak passes regression tests individually. But cumulatively, the semantic profile of outputs drifts. The "reasoning" field that used to average 120 words now averages 30. Your validation still passes.
Input distribution shift: A new user cohort or marketing campaign brings different input patterns. The same model, same prompt, same schema — but outputs drift from the spec because they were calibrated for a different input distribution.
spec-drift catches all of these. Pydantic catches none of them.
The Biblical Foundation
"Moses then said to Aaron, 'This is what the Lord spoke of when he said: Among those who approach me I will be proved holy.'" — Leviticus 10:3
Nadab and Abihu offered "unauthorized fire" — structurally correct (fire), instrumentally correct (censers), personally authorized (priests). But the semantic specification was violated. Every structural check passed. The semantic compliance check failed.
spec-drift applies this principle to LLM outputs: structural validation is necessary but not sufficient. Semantic specification must be declared and continuously monitored.
BibleWorld build — PAT-037, Pivot_Score 8.63
Installation
```bash
pip install spec-drift
```
Requirements: Python 3.10+, Pydantic v2
Quick Start
1. Declare a semantic spec
```python
from pydantic import BaseModel
from spec_drift import spec, SemanticConstraint

@spec(
    category=SemanticConstraint.from_authorized_values(
        ["positive", "negative", "neutral"],
        tolerance=0.02,       # max 2% of outputs outside the authorized set
        alert_threshold=0.10  # alert if >10% of observations violate
    ),
    reasoning=SemanticConstraint.from_length_bounds(
        min_words=30,
        max_words=300,
        alert_threshold=0.15
    ),
    score=SemanticConstraint.from_distribution(
        mean=6.5,
        std=2.0,
        drift_threshold=1.0,  # alert if the mean shifts by more than 1 sigma
        alert_threshold=0.20
    )
)
class SentimentAnalysis(BaseModel):
    category: str
    reasoning: str
    score: float
```
2. Wrap your LLM function
```python
from spec_drift import DriftMonitor
import anthropic

client = anthropic.Anthropic()

monitor = DriftMonitor(
    spec=SentimentAnalysis,
    db_path="./spec_drift.db",
    model_version="claude-3-5-haiku-20241022",
)

@monitor.watch
def analyze_sentiment(text: str) -> SentimentAnalysis:
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": f"Analyze: {text}"}]
    )
    return SentimentAnalysis.model_validate_json(response.content[0].text)
```
3. Check drift
```bash
# Terminal drift report (last 7 days)
spec-drift check --spec my_module.SentimentAnalysis --since 7d

# CI gate: fail the build if >20% of outputs violate the semantic spec
spec-drift ci \
  --spec my_module.SentimentAnalysis \
  --test-batch data/ci_batch.jsonl \
  --threshold 0.20 \
  --exit-code
```
API Reference
@spec(**constraints)
Attaches semantic constraints to a Pydantic model class.
```python
@spec(field_name=SemanticConstraint.from_authorized_values([...]))
class MyModel(BaseModel):
    field_name: str
```
SemanticConstraint
SemanticConstraint.from_authorized_values(authorized, tolerance, alert_threshold)
Field values must be drawn from the authorized list (within tolerance).
- `authorized`: list of permitted values
- `tolerance`: float, max fraction of outputs outside the authorized set before the constraint flags
- `alert_threshold`: float, fraction of rolling observations that must violate before an alert fires
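To make the semantics concrete, here is a minimal sketch of how an authorized-values check can be evaluated over a window of outputs. This is illustrative logic only, not the library's actual implementation:

```python
# Sketch: fraction of observed values outside the authorized set.
# A constraint would flag when this rate exceeds `tolerance`.
def authorized_violation_rate(observations, authorized):
    """Return the fraction of observations not in the authorized set."""
    if not observations:
        return 0.0
    allowed = set(authorized)
    violations = sum(1 for value in observations if value not in allowed)
    return violations / len(observations)

window = ["positive", "neutral", "ambivalent", "negative", "neutral"]
rate = authorized_violation_rate(window, ["positive", "negative", "neutral"])
# rate = 0.2: one of five values ("ambivalent") is unauthorized,
# which would exceed a tolerance of 0.02 and flag the constraint
```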
SemanticConstraint.from_length_bounds(min_words, max_words, alert_threshold)
String field word count must be within [min_words, max_words].
SemanticConstraint.from_distribution(mean, std, drift_threshold, alert_threshold)
Numeric field should follow a distribution near (mean, std). Alerts if observed mean shifts by more than drift_threshold standard deviations.
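The mean-shift check behind this constraint can be sketched as follows (an assumption about the mechanism, using the baseline values from the Quick Start spec):

```python
import statistics

# Sketch: how far the observed mean has drifted from the baseline mean,
# measured in baseline standard deviations.
def mean_drift_sigmas(observed, baseline_mean, baseline_std):
    """Absolute shift of the observed mean, in units of baseline sigma."""
    return abs(statistics.fmean(observed) - baseline_mean) / baseline_std

scores = [9.1, 8.9, 9.3, 8.7, 9.0]  # recent "score" outputs
sigmas = mean_drift_sigmas(scores, baseline_mean=6.5, baseline_std=2.0)
drifted = sigmas > 1.0  # drift_threshold=1.0, as in the Quick Start
# the observed mean (~9.0) sits about 1.25 sigmas above baseline, so drifted
```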
SemanticConstraint.from_pattern(regex, min_match_rate, alert_threshold)
String field should match the regex pattern at min_match_rate frequency.
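A hedged sketch of the match-rate logic (again, not the library's code; the `min_match_rate` value below is a hypothetical example):

```python
import re

# Sketch: fraction of string outputs that fully match a regex pattern.
# A pattern constraint would flag when this falls below `min_match_rate`.
def pattern_match_rate(observations, pattern):
    """Return the fraction of observations fully matching the pattern."""
    if not observations:
        return 1.0
    compiled = re.compile(pattern)
    matches = sum(1 for s in observations if compiled.fullmatch(s))
    return matches / len(observations)

ids = ["ORD-1042", "ORD-77", "order 12", "ORD-9"]
rate = pattern_match_rate(ids, r"ORD-\d+")
violated = rate < 0.9  # hypothetical min_match_rate of 90%
# rate = 0.75: "order 12" does not match, so the constraint is violated
```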
DriftMonitor(spec, db_path, model_version, prompt_hash, alert_callback)
Runtime monitor for semantic specification compliance.
.watch (decorator)
Wraps an LLM function to automatically observe its return value.
.observe(output) -> output
Manually observe a Pydantic model instance. Returns the output unchanged.
.drift_report(since_hours) -> dict
Generate a semantic drift report for the last N hours.
```json
{
  "spec": "SentimentAnalysis",
  "period_hours": 168.0,
  "observations": 4523,
  "violation_rate": 0.0312,
  "severity": "low",
  "field_violation_rates": {
    "category": 0.0089,
    "reasoning": 0.0221,
    "score": 0.0002
  }
}
```
run_ci_gate(monitor, test_outputs, threshold) -> (passed, report)
Run a CI gate on a batch of test outputs. Returns (passed, report).
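The gate's pass/fail logic reduces to comparing a batch violation rate against the threshold. A minimal standalone sketch (field names in the report dict are assumptions, loosely mirroring the drift report above):

```python
# Sketch of a CI gate: fail when the batch violation rate exceeds
# the threshold. Not the library's implementation.
def ci_gate(violation_flags, threshold):
    """violation_flags: one bool per test output (True = spec violated)."""
    n = len(violation_flags)
    rate = sum(violation_flags) / n if n else 0.0
    passed = rate <= threshold
    report = {"observations": n, "violation_rate": rate, "passed": passed}
    return passed, report

passed, report = ci_gate([False] * 45 + [True] * 5, threshold=0.20)
# 5 violations out of 50 outputs = 10%, under the 20% threshold: gate passes
```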
CLI Reference
```bash
# Initialize spec-drift in a project
spec-drift init

# Calibrate a baseline from golden data
spec-drift calibrate \
  --spec my_module.SentimentAnalysis \
  --input-file data/golden_set.jsonl \
  --output baseline.db

# Drift report (table format, last 7 days)
spec-drift check \
  --spec my_module.SentimentAnalysis \
  --since 7d \
  --format table

# CI gate
spec-drift ci \
  --spec my_module.SentimentAnalysis \
  --test-batch data/ci_batch.jsonl \
  --threshold 0.20 \
  --exit-code

# Compare two model versions
spec-drift compare \
  --spec my_module.SentimentAnalysis \
  --baseline-a baseline_gpt4o.db \
  --baseline-b baseline_claude_haiku.db

# HTML report
spec-drift report \
  --spec my_module.SentimentAnalysis \
  --since 30d \
  --output report.html
```
GitHub Action
```yaml
# .github/workflows/llm-spec-check.yml
name: LLM Semantic Spec Check
on: [push, pull_request]

jobs:
  spec-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: spec-drift/action@v1
        with:
          spec: my_module.SentimentAnalysis
          test-batch: data/ci_batch.jsonl
          threshold: '0.20'
```
Severity Levels
| Violation Rate | Severity | Recommended Action |
|---|---|---|
| 0% | NONE | No action needed |
| < 5% | LOW | Monitor, no immediate action |
| 5-15% | MEDIUM | Investigate, create issue |
| 15-30% | HIGH | Rollback or prompt fix recommended |
| > 30% | CRITICAL | Immediate rollback |
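The bands above can be sketched as a simple mapping function. How the library assigns the exact boundary values (e.g. a rate of exactly 5% or 15%) is an assumption here:

```python
# Sketch of the severity mapping from the table above; the treatment of
# exact band boundaries is an assumption, not the library's documented rule.
def severity(violation_rate):
    if violation_rate == 0.0:
        return "NONE"
    if violation_rate < 0.05:
        return "LOW"
    if violation_rate < 0.15:
        return "MEDIUM"
    if violation_rate <= 0.30:
        return "HIGH"
    return "CRITICAL"
```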
Storage
spec-drift uses SQLite by default — zero infrastructure required.
```python
# Local development (default)
monitor = DriftMonitor(spec=MyModel, db_path="./spec_drift.db")

# PostgreSQL for production (coming in v0.2)
monitor = DriftMonitor(
    spec=MyModel,
    db_url="postgresql://user:pass@host/db"
)

# In-memory for testing
monitor = DriftMonitor(spec=MyModel, db_path=":memory:")
```
Roadmap
v0.1 (this release)
- Core `@spec` decorator + `SemanticConstraint` DSL
- `DriftMonitor` with `.watch` and `.observe`
- SQLite observation store
- `run_ci_gate` function
- CLI: `check`, `ci`, `compare`
v0.2
- PostgreSQL support
- Multi-field correlation monitoring
- Automatic model version detection (via LLM API response headers)
- Slack/PagerDuty alert integrations
- HTML drift reports
v0.3
- LLM-judge semantic constraint evaluation (for complex, prose-level constraints)
- Baseline versioning with SemVer
- Team dashboard (hosted cloud option)
- Prometheus/Grafana metrics export
Comparison
| Tool | Structural validation | Semantic spec monitoring | Production continuous | CI gate | Open source |
|---|---|---|---|---|---|
| Pydantic | YES | NO | NO | NO | YES |
| DeepEval | No (batch eval) | YES (point-in-time) | NO | YES | YES |
| Evidently | No (statistical drift) | NO | YES | NO | YES |
| Langfuse | NO | NO | YES (observability) | NO | YES |
| spec-drift | YES (via Pydantic) | YES | YES | YES | YES |
Contributing
spec-drift is MIT licensed. Contributions welcome.
```bash
git clone https://github.com/bibleworld/spec-drift
cd spec-drift
pip install -e ".[dev]"
pytest tests/
```
License
MIT — free to use, modify, and distribute.
Built with BibleWorld — Pattern: Leviticus 10:1-3 (The Authorized Fire) "Among those who approach me I will be proved holy." — Leviticus 10:3