Behavioral regression testing for LLMs. Capture outputs, diff behavior, detect drift — pytest for model upgrades.
modeldiffx
Behavioral regression testing for LLMs. Capture model outputs, diff behavior across versions, detect statistical drift, and fingerprint model personas — like pytest for model upgrades.
When you upgrade gpt-4-0613 → gpt-4-1106-preview, what actually changed? modeldiffx answers that with structured diffs, statistical drift detection, and model fingerprinting — all with zero required dependencies.
Why modeldiffx?
| Problem | modeldiffx Solution |
|---|---|
| Model upgrades silently break production prompts | Structured behavioral diffs with severity classification |
| "It feels different" — no way to quantify | Statistical drift detection (length, refusal, vocabulary, latency) |
| No baseline for model behavior | Snapshot capture & persistence for reproducible comparisons |
| Hard to characterize model personality | Fingerprinting with 8 behavioral dimensions |
| Evaluation suites are scattered/ad-hoc | 25 built-in prompts across 5 categories |
Installation
pip install modeldiffx # zero dependencies
pip install "modeldiffx[cli]" # + click, rich for terminal UI
pip install "modeldiffx[metrics]" # + rouge-score
pip install "modeldiffx[all]" # everything (quotes keep shells like zsh from expanding the brackets)
Quick Start
1. Capture snapshots
from modeldiffx import Prompt, capture
prompts = [
    Prompt(text="What is quantum entanglement?", category="knowledge"),
    Prompt(text="Write a Python fibonacci function", category="code"),
    Prompt(text="Summarize the French Revolution", category="knowledge"),
]
# Your model callable: any function that takes a string and returns a string
def call_model(text: str) -> str:
    return my_api.complete(text)
snapshot = capture(prompts, call_model, model_name="gpt-4-0613")
snapshot.save("snapshots/gpt4_0613.json")
2. Diff two snapshots
from modeldiffx import diff_snapshots, Snapshot
snap_a = Snapshot.load("snapshots/gpt4_0613.json")
snap_b = Snapshot.load("snapshots/gpt4_1106.json")
report = diff_snapshots(snap_a, snap_b)
print(f"Changes: {report.n_changes}/{len(report.entries)}")
print(f"Change rate: {report.change_rate:.1%}")
print(f"Regression score: {report.regression_score:.2f}")
for entry in report.entries:
    if entry.change_type.value != "identical":
        print(f" [{entry.severity.value}] {entry.prompt.text[:50]}… → {entry.change_type.value}")
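The classifier inside modeldiffx is not shown in this README. As a rough illustration of how a pair of responses could be bucketed into change types, here is a standard-library sketch built on `difflib.SequenceMatcher`. The function names, the refusal phrases, and the thresholds are all hypothetical and are not part of the modeldiffx API.

```python
from difflib import SequenceMatcher

# Hypothetical refusal phrases, not modeldiffx's actual list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def is_refusal(text: str) -> bool:
    """Return True if the text contains any of the assumed refusal phrases."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def classify_change(old: str, new: str, content_threshold: float = 0.6) -> str:
    """Bucket a response pair into a coarse change type (illustrative only)."""
    if old == new:
        return "identical"
    if is_refusal(old) != is_refusal(new):
        return "refusal"
    # Same words after case/whitespace normalization: only formatting changed.
    if old.lower().split() == new.lower().split():
        return "format"
    # Word-level similarity in [0, 1]; low similarity suggests different content.
    similarity = SequenceMatcher(None, old.split(), new.split()).ratio()
    if similarity < content_threshold:
        return "content"
    # Similar wording: flag a large change in character length, else call it style.
    longer, shorter = max(len(old), len(new)), min(len(old), len(new))
    if shorter == 0 or longer / shorter > 1.5:
        return "length"
    return "style"
```

The checks run from most to least severe, so a refusal flip is reported even when the two responses also differ in length or wording.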
3. Detect drift
from modeldiffx import Snapshot
from modeldiffx.drift import full_drift_report
snap_a = Snapshot.load("snapshots/gpt4_0613.json")
snap_b = Snapshot.load("snapshots/gpt4_1106.json")
report = full_drift_report(snap_a, snap_b)
if report["length"]["drift_significant"]:
    print(f"⚠ Length drift: {report['length']['drift_sigma']:.1f}σ")
if report["refusal"]["drift_significant"]:
    print(f"⚠ Refusal rate changed: {report['refusal']['delta']:+.2f}")
if report["vocabulary"]["drift_significant"]:
    print(f"⚠ Vocabulary overlap: {report['vocabulary']['jaccard_similarity']:.2f}")
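To make the `jaccard_similarity` and `drift_sigma` fields more concrete, the sketch below computes comparable quantities from two lists of response strings. It is an assumption about how such metrics can be defined, not a copy of what `modeldiffx.drift` does: tokenization by whitespace, word counts as the length measure, and the Welch-style standard error are all choices made for this example.

```python
import math
import statistics


def vocabulary_jaccard(responses_a: list[str], responses_b: list[str]) -> float:
    """Jaccard similarity between the word sets of two response collections."""
    vocab_a = {word for text in responses_a for word in text.lower().split()}
    vocab_b = {word for text in responses_b for word in text.lower().split()}
    union = vocab_a | vocab_b
    if not union:
        return 1.0  # two empty vocabularies are treated as identical
    return len(vocab_a & vocab_b) / len(union)


def length_drift_sigma(responses_a: list[str], responses_b: list[str]) -> float:
    """Difference in mean word count, in units of its standard error.

    Assumes at least two responses per side (sample variance needs two points).
    """
    lengths_a = [len(text.split()) for text in responses_a]
    lengths_b = [len(text.split()) for text in responses_b]
    mean_diff = statistics.mean(lengths_b) - statistics.mean(lengths_a)
    std_err = math.sqrt(
        statistics.variance(lengths_a) / len(lengths_a)
        + statistics.variance(lengths_b) / len(lengths_b)
    )
    if std_err == 0:
        # No spread on either side: any nonzero shift is unbounded in sigma units.
        return 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    return mean_diff / std_err
```

A positive sigma means the second snapshot's responses are longer on average. The cutoff at which drift counts as significant is a separate policy decision that this sketch leaves out.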
4. Fingerprint a model
from modeldiffx import Snapshot
from modeldiffx.fingerprint import fingerprint, compare_fingerprints
snap = Snapshot.load("snapshots/gpt4_0613.json")
fp = fingerprint(snap)
print(f"Verbosity: {fp.dimensions['verbosity']:.2f}")
print(f"Refusal rate: {fp.dimensions['refusal_rate']:.2f}")
print(f"Formality: {fp.dimensions['formality']:.2f}")
print(f"Vocabulary richness: {fp.dimensions['vocabulary_richness']:.2f}")
5. Use built-in test suites
from modeldiffx import capture
from modeldiffx.suite import get_standard_suite, get_suite
# All 25 prompts across 5 categories
prompts = get_standard_suite()
# Or pick specific suites
safety_prompts = get_suite("safety")
code_prompts = get_suite("code")
snapshot = capture(prompts, call_model, model_name="gpt-4-turbo")
CLI
# Compare two snapshot files
modeldiffx diff snapshots/v1.json snapshots/v2.json
# Markdown output
modeldiffx diff snapshots/v1.json snapshots/v2.json --markdown
# Save JSON report
modeldiffx diff snapshots/v1.json snapshots/v2.json -o report.json
# Snapshot info
modeldiffx info snapshots/v1.json
# Drift analysis
modeldiffx drift snapshots/v1.json snapshots/v2.json
# List built-in suites
modeldiffx suites
API Reference
Change Types
| Type | Description | Typical Severity |
|---|---|---|
| CONTENT | Semantically different response | HIGH |
| FORMAT | Same content, different formatting | LOW |
| REFUSAL | One model refuses, the other doesn't | CRITICAL |
| LENGTH | Significant length difference | MEDIUM |
| STYLE | Tone/verbosity shift | MEDIUM |
| ERROR | One model errors | HIGH |
| IDENTICAL | No change detected | n/a |
Regression Score
The regression score is a weighted severity metric (0.0 = no regressions, 1.0 = all critical):
- CRITICAL: weight 1.0 (refusal changes, safety regressions)
- HIGH: weight 0.6 (content changes)
- MEDIUM: weight 0.3 (style/length changes)
- LOW: weight 0.1 (formatting changes)
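The weights above can be combined in several ways. One reading consistent with the stated endpoints (0.0 when nothing regressed, 1.0 when every entry is critical) is the mean weight over all compared prompts. The sketch below assumes that formula, and the lowercase severity labels, including `"none"` for unchanged entries, are hypothetical names chosen for this example.

```python
# Severity weights as listed above; the averaging formula is an assumption.
SEVERITY_WEIGHTS = {
    "critical": 1.0,
    "high": 0.6,
    "medium": 0.3,
    "low": 0.1,
    "none": 0.0,  # unchanged entries contribute nothing
}


def regression_score(severities: list[str]) -> float:
    """Mean severity weight over all compared prompts, in [0.0, 1.0]."""
    if not severities:
        return 0.0
    return sum(SEVERITY_WEIGHTS[s] for s in severities) / len(severities)
```

Under this reading, four prompts with one high, one low, one unchanged, and one medium entry score (0.6 + 0.1 + 0.0 + 0.3) / 4 = 0.25.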
Fingerprint Dimensions
| Dimension | Range | Description |
|---|---|---|
| verbosity | 0–1 | Average response length normalized to 500 words |
| refusal_rate | 0–1 | Fraction of prompts refused |
| error_rate | 0–1 | Fraction of prompts that errored |
| vocabulary_richness | 0–1 | Type-token ratio |
| avg_latency_ms | 0+ | Mean response latency |
| length_consistency | 0–1 | 1 minus coefficient of variation |
| formality | 0–1 | Ratio of formal to casual markers |
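Three of these dimensions follow directly from their descriptions. The sketch below shows one way to compute them from a list of response strings. It assumes whitespace tokenization, clamping to the 0–1 range, and the population standard deviation for the coefficient of variation, so the library's own values may differ.

```python
import statistics


def verbosity(responses: list[str], cap_words: int = 500) -> float:
    """Mean word count scaled so that cap_words or more maps to 1.0."""
    if not responses:
        return 0.0
    mean_words = statistics.mean([len(text.split()) for text in responses])
    return min(mean_words / cap_words, 1.0)


def vocabulary_richness(responses: list[str]) -> float:
    """Type-token ratio: unique words divided by total words."""
    tokens = [word for text in responses for word in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def length_consistency(responses: list[str]) -> float:
    """1 minus the coefficient of variation of word counts, clamped to [0, 1]."""
    lengths = [len(text.split()) for text in responses]
    if len(lengths) < 2:
        return 1.0  # a single response has no variation to measure
    mean_len = statistics.mean(lengths)
    if mean_len == 0:
        return 1.0  # all responses empty: treated as perfectly consistent
    cv = statistics.pstdev(lengths) / mean_len
    return max(0.0, 1.0 - cv)
```

Type-token ratio tends to fall as the total amount of text grows, so comparisons are most meaningful between snapshots captured with the same prompt set.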
Architecture
modeldiffx/
├── _types.py # Core types: Prompt, Response, Snapshot, DiffReport
├── capture.py # Snapshot capture from model callables / files
├── diff.py # Behavioral diffing with similarity scoring
├── drift.py # Statistical drift detection
├── fingerprint.py # Model behavioral fingerprinting
├── suite.py # Built-in evaluation suites (25 prompts)
├── report.py # JSON/text/rich/markdown report formatting
└── cli.py # Click CLI interface
See Also
Part of the stef41 LLM toolkit — open-source tools for every stage of the LLM lifecycle:
| Project | What it does |
|---|---|
| tokonomics | Token counting & cost management for LLM APIs |
| datacrux | Training data quality — dedup, PII, contamination |
| castwright | Synthetic instruction data generation |
| datamix | Dataset mixing & curriculum optimization |
| toksight | Tokenizer analysis & comparison |
| trainpulse | Training health monitoring |
| ckpt | Checkpoint inspection, diffing & merging |
| quantbench | Quantization quality analysis |
| infermark | Inference benchmarking |
| vibesafe | AI-generated code safety scanner |
| injectionguard | Prompt injection detection |
License
Apache 2.0